
Guidance on scaling Etherpad to hundreds/thousands of concurrent editors on a single pad #7403

@jayhuynh

Description


Context

Hi team. It would be great if we could get your help on this. 🙏 🙏 🙏
We run Etherpad as part of a collaborative deliberation platform. We need to support 300 concurrent real-time editors on a single pad, and ultimately scale to ~1,000. We load tested with Playwright (headless Chromium) to simulate real users logging in and editing the same pad simultaneously, and we're struggling to get stable results beyond ~100 concurrent users despite the server being barely utilised.

Our Setup

  • Single Etherpad instance in Docker
  • PostgreSQL 13 via ueberDB
  • Node.js v20, --max-old-space-size=4096, UV_THREADPOOL_SIZE=128
  • Caddy reverse proxy with WebSocket support
  • Server: 12 CPU cores, 62 GB RAM
  • Test clients: 4 VMs (12 CPU, 62 GB RAM each), each running 75 users via Playwright in Docker
  • All 300 users connect to the same pad
  • A synchronisation barrier ensures all users start editing at the same moment
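The start barrier in the last bullet can be sketched in plain Node (a hypothetical helper for illustration, not our actual Playwright harness):

```javascript
// Hypothetical start barrier: every simulated user awaits arrive(),
// and all of those promises resolve only once the last user checks in.
function createBarrier(total) {
  let arrived = 0;
  let release;
  const gate = new Promise((resolve) => { release = resolve; });
  return {
    arrive() {
      arrived += 1;
      if (arrived === total) release();
      return gate;
    },
  };
}

// Example: three simulated editors all start "typing" at the same moment.
async function demo() {
  const barrier = createBarrier(3);
  const started = [];
  await Promise.all([1, 2, 3].map(async (id) => {
    await barrier.arrive();
    started.push(id); // runs only after all three have arrived
  }));
  return started.length;
}
```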

Server Resource Usage — The Puzzling Part

We monitored the Etherpad server during all tests. Even when tests were failing, the server appeared largely idle:

Resource             Idle      Peak (during failures)
CPU load             0.14      7.5 (on 12 cores)
RAM                  2.6 GB    3.5 GB (of 62 GB)
Etherpad CPU         0.3%      ~140%
API response times   5–70 ms
Network latency      ~0.3 ms

The server had plenty of headroom, yet users were failing to establish Etherpad WebSocket connections. No errors in logs except PostgreSQL connection exhaustion (see below).

Results

75 concurrent editors (1 VM) — Stable:

  • 3/3 runs at 100% connection success

100 concurrent editors (2 VMs) — Stable after PgBouncer fix:

  • 5/5 runs passed (~96–100% connection success per VM)

150 concurrent editors (3 VMs) — Unstable:

  • Results ranged from 53% to 99% across runs, with no consistent pattern to which VM's users failed to connect

300 concurrent editors (4 VMs) — Unstable:

  • Could not achieve a stable run at full scale. Etherpad connections would intermittently fail for large portions of users on random VMs, even though the server showed no resource pressure.

Root Cause We Found

Etherpad logs showed:

[ERROR] ueberDB - error: sorry, too many clients already
[ERROR] settings - error: remaining connection slots are reserved for roles with the SUPERUSER attribute

PostgreSQL connections spiked from ~26 to 200+ under load. The default connection string (postgres://...) has no pooling, so concurrent users exhausted the database.
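The failure mode is generic: without a pool, peak open connections track peak concurrency, while a pool caps them. A toy simulation of the effect (plain Node, not database code; numbers illustrative):

```javascript
// Toy model: run `tasks` concurrent operations, each holding a "connection"
// for the duration of its work. A pool of size `max` caps peak usage.
async function simulate(tasks, max) {
  let open = 0;
  let peak = 0;
  const waiters = [];
  const acquire = async () => {
    // Wait until a slot is free, then take it.
    while (open >= max) await new Promise((wake) => waiters.push(wake));
    open += 1;
    peak = Math.max(peak, open);
  };
  const release = () => {
    open -= 1;
    const next = waiters.shift();
    if (next) next(); // hand the freed slot to one waiter
  };
  await Promise.all(Array.from({ length: tasks }, async () => {
    await acquire();
    await new Promise((done) => setTimeout(done, 1)); // pretend to run a query
    release();
  }));
  return peak;
}
```

With 300 concurrent tasks, `simulate(300, 100)` peaks at the pool cap, whereas an effectively unbounded pool peaks at full concurrency, which is what the `max_connections` errors above reflect.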

We deployed PgBouncer (transaction mode, max 100 DB connections), which stabilised 100 concurrent editors. But at 150+ editors on a single pad, results remained inconsistent — despite the server showing no resource pressure.
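For reference, a minimal PgBouncer configuration of the kind described looks roughly like this (host, database name, and sizes are illustrative; the option names are standard PgBouncer settings):

```ini
[databases]
etherpad = host=127.0.0.1 port=5432 dbname=etherpad

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 1000       ; client (application) connections accepted
default_pool_size = 100      ; actual PostgreSQL connections per db/user pair
```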

Questions

  1. What is the recommended architecture for scaling a single pad to hundreds or thousands of concurrent editors? The server is barely utilised, yet connections fail intermittently. Is there an internal bottleneck we're missing?

  2. Is horizontal scaling supported? Can we run multiple Etherpad instances with sticky sessions and shared state (e.g., Redis for socket.io) for a single pad?

  3. Are there recommended ueberDB connection pooling settings for high concurrency? We had to add PgBouncer ourselves to avoid connection exhaustion.

  4. Are there known limits in the WebSocket/pad-join flow that would explain intermittent failures under load when hardware resources are underutilised?
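On question 2, if multiple instances are viable, sticky sessions at the proxy layer could look roughly like this in a Caddyfile (upstream names are hypothetical; `ip_hash` is one of Caddy's built-in `reverse_proxy` load-balancing policies), with socket.io state shared via something like its Redis adapter:

```
example.com {
    reverse_proxy etherpad1:9001 etherpad2:9001 {
        lb_policy ip_hash
    }
}
```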

Happy to share logs or test scripts if helpful.
