Guidance on scaling Etherpad to hundreds/thousands of concurrent editors on a single pad #7403
Context
Hi team, I'd really appreciate your help with this. 🙏
We run Etherpad as part of a collaborative deliberation platform. We need to support 300 concurrent real-time editors on a single pad, and ultimately to scale to ~1,000. We load-tested with Playwright (headless Chromium), simulating real users logging in and editing the same pad simultaneously, and we're struggling to get stable results beyond ~100 concurrent users even though the server is barely utilised.
Our Setup
- Single Etherpad instance in Docker
- PostgreSQL 13 via ueberDB
- Node.js v20 with `--max-old-space-size=4096` and `UV_THREADPOOL_SIZE=128`
- Caddy reverse proxy with WebSocket support
- Server: 12 CPU cores, 62 GB RAM
- Test clients: 4 VMs (12 CPU, 62 GB RAM each), each running 75 users via Playwright in Docker
- All 300 users connect to the same pad
- A synchronisation barrier ensures all users start editing at the same moment
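For context, the barrier works roughly like this (simplified sketch with illustrative names; the real harness drives Playwright pages instead of stub users):

```javascript
// Simplified sketch of our synchronisation barrier (illustrative names;
// the real test launches Playwright pages instead of stub users).
function makeBarrier(n) {
  let releaseAll;
  const released = new Promise((resolve) => { releaseAll = resolve; });
  let arrived = 0;
  return {
    async wait() {
      arrived += 1;
      if (arrived === n) releaseAll(); // last arrival releases everyone
      await released;
    },
  };
}

async function simulateUser(id, barrier, log) {
  // In the real test: launch Chromium, log in, open the shared pad here.
  await barrier.wait(); // block until every user has connected
  log.push(id);         // editing starts simultaneously from this point
}

async function main(nUsers) {
  const barrier = makeBarrier(nUsers);
  const log = [];
  await Promise.all(
    Array.from({ length: nUsers }, (_, i) => simulateUser(i, barrier, log))
  );
  return log.sort((a, b) => a - b);
}

main(5).then((log) => console.log(JSON.stringify(log))); // [0,1,2,3,4]
```

This rules out client-side ramp-up effects: every simulated editor begins typing at the same instant.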
Server Resource Usage — The Puzzling Part
We monitored the Etherpad server during all tests. Even when tests were failing, the server appeared largely idle:
| Resource | Idle | Peak (during failures) |
|---|---|---|
| CPU load | 0.14 | 7.5 (on 12 cores) |
| RAM | 2.6 GB | 3.5 GB (of 62 GB) |
| Etherpad CPU | 0.3% | ~140% |
| API response times | — | 5–70 ms |
| Network latency | — | ~0.3 ms |
The server had plenty of headroom, yet users were failing to establish Etherpad WebSocket connections. No errors in logs except PostgreSQL connection exhaustion (see below).
Results
75 concurrent editors (1 VM) — Stable:
- 3/3 runs at 100% connection success
100 concurrent editors (2 VMs) — Stable after PgBouncer fix:
- 5/5 runs passed (~96-100% per VM)
150 concurrent editors (3 VMs) — Unstable:
- Results ranged from 53% to 99% across runs, with no consistent pattern to which VM's users failed to connect
300 concurrent editors (4 VMs) — Unstable:
- Could not achieve a stable run at full scale. Etherpad connections would intermittently fail for large portions of users on random VMs, even though the server showed no resource pressure.
Root Cause We Found
Etherpad logs showed:
[ERROR] ueberDB - error: sorry, too many clients already
[ERROR] settings - error: remaining connection slots are reserved for roles with the SUPERUSER attribute
PostgreSQL connections spiked from ~26 to 200+ under load. The default connection string (`postgres://...`) has no pooling configured, so concurrent users exhausted the database's available connection slots.
We deployed PgBouncer (transaction mode, max 100 DB connections), which stabilised 100 concurrent editors. But at 150+ editors on a single pad, results remained inconsistent — despite the server showing no resource pressure.
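For reference, our PgBouncer deployment is roughly the following (hosts, names, and credentials here are placeholders; the pool figures match the numbers above):

```ini
[databases]
; name exposed to Etherpad -> real PostgreSQL server (placeholder values)
etherpad = host=127.0.0.1 port=5432 dbname=etherpad

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling releases server connections between transactions
pool_mode = transaction
; client-side connections PgBouncer will accept from Etherpad
max_client_conn = 1000
; actual PostgreSQL connections per database/user pair
default_pool_size = 100
```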
Questions
- What is the recommended architecture for scaling a single pad to hundreds or thousands of concurrent editors? The server is barely utilised, yet connections fail intermittently. Is there an internal bottleneck we're missing?
- Is horizontal scaling supported? Can we run multiple Etherpad instances with sticky sessions and shared state (e.g., Redis for socket.io) for a single pad?
- Are there recommended ueberDB connection pooling settings for high concurrency? We had to add PgBouncer ourselves to avoid connection exhaustion.
- Are there known limits in the WebSocket/pad-join flow that would explain intermittent failures under load when hardware resources are underutilised?
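On the horizontal-scaling question, the pattern we have in mind is the standard socket.io Redis adapter wiring below (generic socket.io code using the `@socket.io/redis-adapter` and `redis` packages, plus sticky sessions at the proxy; we don't know whether Etherpad exposes an equivalent hook, which is partly what we're asking):

```javascript
// Generic socket.io + Redis adapter wiring (not Etherpad code):
// each instance relays its events through Redis pub/sub, so clients
// attached to different instances still see each other's edits.
const { createServer } = require("http");
const { Server } = require("socket.io");
const { createAdapter } = require("@socket.io/redis-adapter");
const { createClient } = require("redis");

const httpServer = createServer();
const io = new Server(httpServer);

const pubClient = createClient({ url: "redis://localhost:6379" });
const subClient = pubClient.duplicate();

Promise.all([pubClient.connect(), subClient.connect()]).then(() => {
  io.adapter(createAdapter(pubClient, subClient));
  httpServer.listen(3000);
});
```

With this, the reverse proxy (Caddy in our case) would need sticky sessions so a given client's handshake and WebSocket traffic always reach the same instance.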
Happy to share logs or test scripts if helpful.