
Guidance on scaling Etherpad to hundreds/thousands of concurrent editors on a single pad #7403

@jayhuynh

Description


Context

Hi team. It would be great if we could get your help on this. 🙏 🙏 🙏
We run Etherpad as part of a collaborative deliberation platform. We need to support 300 concurrent real-time editors on a single pad, and ultimately scale to ~1,000. We load tested with Playwright (headless Chromium) to simulate real users logging in and editing the same pad simultaneously, and we're struggling to get stable results beyond ~100 concurrent users despite the server being barely utilised.

Our Setup

  • Single Etherpad instance in Docker
  • PostgreSQL 13 via ueberDB
  • Node.js v20, --max-old-space-size=4096, UV_THREADPOOL_SIZE=128
  • Caddy reverse proxy with WebSocket support
  • Server: 12 CPU cores, 62 GB RAM
  • Test clients: 4 VMs (12 CPU, 62 GB RAM each), each running 75 users via Playwright in Docker
  • All 300 users connect to the same pad
  • A synchronisation barrier ensures all users start editing at the same moment
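The start barrier in the last bullet can be sketched in plain Node (a hypothetical helper for illustration, not our actual Playwright harness):

```javascript
// Hypothetical start barrier: every simulated user awaits arrive(),
// and all of those promises resolve only once the last user checks in.
function createBarrier(total) {
  let arrived = 0;
  let release;
  const gate = new Promise((resolve) => { release = resolve; });
  return {
    arrive() {
      arrived += 1;
      if (arrived === total) release();
      return gate;
    },
  };
}

// Example: three simulated editors all start "typing" at the same moment.
async function demo() {
  const barrier = createBarrier(3);
  const started = [];
  await Promise.all([1, 2, 3].map(async (id) => {
    await barrier.arrive();
    started.push(id); // runs only after all three have arrived
  }));
  return started.length;
}
```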

Server Resource Usage — The Puzzling Part

We monitored the Etherpad server during all tests. Even when tests were failing, the server appeared largely idle:

Resource             Idle      Peak (during failures)
CPU load             0.14      7.5 (on 12 cores)
RAM                  2.6 GB    3.5 GB (of 62 GB)
Etherpad CPU         0.3%      ~140%
API response times   5–70 ms
Network latency      ~0.3 ms

The server had plenty of headroom, yet users were failing to establish Etherpad WebSocket connections. No errors in logs except PostgreSQL connection exhaustion (see below).

Results

75 concurrent editors (1 VM) — Stable:

  • 3/3 runs at 100% connection success

100 concurrent editors (2 VMs) — Stable after PgBouncer fix:

  • 5/5 runs passed (~96–100% connection success per VM)

150 concurrent editors (3 VMs) — Unstable:

  • Results ranged from 53% to 99% across runs, with no consistent pattern to which VM's users failed to connect

300 concurrent editors (4 VMs) — Unstable:

  • Could not achieve a stable run at full scale. Etherpad connections would intermittently fail for large portions of users on random VMs, even though the server showed no resource pressure.

Root Cause We Found

Etherpad logs showed:

[ERROR] ueberDB - error: sorry, too many clients already
[ERROR] settings - error: remaining connection slots are reserved for roles with the SUPERUSER attribute

PostgreSQL connections spiked from ~26 to 200+ under load. The default connection string (postgres://...) has no pooling, so concurrent users exhausted the database.
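The failure mode is generic: without a pool, peak open connections track peak concurrency, while a pool caps them. A toy simulation of the effect (plain Node, not database code; numbers illustrative):

```javascript
// Toy model: run `tasks` concurrent operations, each holding a "connection"
// for the duration of its work. A pool of size `max` caps peak usage.
async function simulate(tasks, max) {
  let open = 0;
  let peak = 0;
  const waiters = [];
  const acquire = async () => {
    // Wait until a slot is free, then take it.
    while (open >= max) await new Promise((wake) => waiters.push(wake));
    open += 1;
    peak = Math.max(peak, open);
  };
  const release = () => {
    open -= 1;
    const next = waiters.shift();
    if (next) next(); // hand the freed slot to one waiter
  };
  await Promise.all(Array.from({ length: tasks }, async () => {
    await acquire();
    await new Promise((done) => setTimeout(done, 1)); // pretend to run a query
    release();
  }));
  return peak;
}
```

With 300 concurrent tasks, `simulate(300, 100)` peaks at the pool cap, whereas an effectively unbounded pool peaks at full concurrency, which is what the `max_connections` errors above reflect.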

We deployed PgBouncer (transaction mode, max 100 DB connections), which stabilised 100 concurrent editors. But at 150+ editors on a single pad, results remained inconsistent — despite the server showing no resource pressure.
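For reference, a minimal PgBouncer configuration of the kind described looks roughly like this (host, database name, and sizes are illustrative; the option names are standard PgBouncer settings):

```ini
[databases]
etherpad = host=127.0.0.1 port=5432 dbname=etherpad

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 1000       ; client (application) connections accepted
default_pool_size = 100      ; actual PostgreSQL connections per db/user pair
```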

Questions

  1. What is the recommended architecture for scaling a single pad to hundreds or thousands of concurrent editors? The server is barely utilised, yet connections fail intermittently. Is there an internal bottleneck we're missing?

  2. Is horizontal scaling supported? Can we run multiple Etherpad instances with sticky sessions and shared state (e.g., Redis for socket.io) for a single pad?

  3. Are there recommended ueberDB connection pooling settings for high concurrency? We had to add PgBouncer ourselves to avoid connection exhaustion.

  4. Are there known limits in the WebSocket/pad-join flow that would explain intermittent failures under load when hardware resources are underutilised?
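On question 2, if multiple instances are viable, sticky sessions at the proxy layer could look roughly like this in a Caddyfile (upstream names are hypothetical; `ip_hash` is one of Caddy's built-in `reverse_proxy` load-balancing policies), with socket.io state shared via something like its Redis adapter:

```
example.com {
    reverse_proxy etherpad1:9001 etherpad2:9001 {
        lb_policy ip_hash
    }
}
```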

Happy to share logs or test scripts if helpful.
