The original demo proved the relay and local execution path, but it still encoded a two-client story in too many places:
- room payloads hardcoded `architect` and `skeptic`
- default turn counts were fixed
- the operator tooling assumed exactly two client wrappers
- the docs described a specialized protocol instead of the underlying runtime
This guide describes the incremental change that generalized the demo without turning it into a larger protocol framework.
The target was an incremental architectural step:
- keep the coordinator authoritative
- keep clients generic
- distribute work across 1 to 39 connected workers
- keep the algorithm intentionally simple
- preserve a clean upgrade path for future coordinator strategies
The current demo is therefore a generalized workload distributor, not a peer-to-peer collaboration fabric.
The implementation deliberately separates feature changes from configuration and operator changes.
These changed the runtime model:
- added `execution_plan` to room state
- locked the selected worker set at room creation
- derived the default turn budget from the locked participant count
- changed coordinator assignment from fixed roles to round-robin worker selection plus coordinator-assigned turn roles
- added room-local exclusion for abandoned workers
- ignored late results from abandoned turns
These changed how the demo is run:
- added `bin/client-worker`
- updated `bin/client` defaults to a generic worker identity
- changed `bin/hive-clients` from role launchers to generic worker fan-out
- changed `setup/hive create-room` and `live-demo` to discover connected relay targets
- changed `run-room` to use the locked plan by default when `max_assignments` is not specified
Keeping those concerns separate made it possible to test the runtime logic directly while letting the shell tooling stay thin.
The execution plan struct defined in `lib/jido_hive_server/collaboration/execution_plan.ex` is the core of the change.
It currently stores:
- `strategy`
- `participant_count`
- `planned_turn_count`
- `completed_turn_count`
- `round_robin_index`
- `excluded_target_ids`
- `locked_participants`
The current strategy is intentionally minimal:
`planned_turn_count = participant_count * 3`
The stage mapping is fixed:
- pass 1 through all workers: `proposal`
- pass 2 through all workers: `critique`
- pass 3 through all workers: `resolution`
That gives the demo a visible buildup while remaining predictable enough for tests and operator tooling.
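The plan arithmetic above can be sketched directly. This is a minimal Python stand-in (the project is Elixir; the function names are illustrative, though the field names mirror the struct fields listed above):

```python
STAGES = ["proposal", "critique", "resolution"]

def planned_turn_count(participant_count: int) -> int:
    # Current strategy: three full passes through every selected worker.
    return participant_count * 3

def stage_for_turn(turn_index: int, participant_count: int) -> str:
    # Zero-based turn index; each pass covers every worker exactly once.
    return STAGES[turn_index // participant_count]
```

With 10 workers this yields a 30-turn plan: turns 0-9 are `proposal`, 10-19 are `critique`, and 20-29 are `resolution`.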
Workers no longer imply room roles.
Each worker registers as a generic participant with a generic target. The coordinator assigns a turn role at dispatch time:
- `proposer`
- `critic`
- `resolver`
That assignment is embedded in the collaboration envelope and echoed back in the job payload. The worker logs now show both:
- the local client identity
- the assigned role for the current turn
This keeps the worker runtime generic while preserving a structured collaboration shape for the demo.
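A sketch of the dispatch-time role assignment, again in Python. The exact envelope shape is an assumption for illustration; the point is that the role travels in the envelope and is echoed back in the job payload rather than being part of the worker's identity:

```python
ROLE_FOR_STAGE = {"proposal": "proposer", "critique": "critic", "resolution": "resolver"}
STAGES = ["proposal", "critique", "resolution"]

def build_envelope(turn_index: int, participant_count: int,
                   target_id: str, client_id: str) -> dict:
    # Hypothetical envelope: the coordinator embeds the per-turn role;
    # the worker echoes it back unchanged in its job payload.
    stage = STAGES[turn_index // participant_count]
    return {
        "target_id": target_id,        # which worker runs this turn
        "client_id": client_id,        # the worker's local identity (logged as-is)
        "turn_index": turn_index,
        "assigned_role": ROLE_FOR_STAGE[stage],  # logged alongside the identity
    }
```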
The important semantic change is that the room budget is logical, not merely a count of job attempts.
Example:
- 10 workers selected at room creation
- default plan is 30 completed turns
- one worker drops on turn 7
- the room still targets 30 completed turns, but the remaining workers absorb the rest of the schedule
That is why `abandon_turn` does not increment `completed_turn_count`.
The room loop now drives toward a target completed-turn count instead of a raw attempt counter.
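The distinction between completing and abandoning a turn can be made concrete. A hedged Python sketch (the real logic lives in the Elixir execution plan; dict-based state is an illustrative stand-in):

```python
def remaining_turns(plan: dict) -> int:
    # The budget is logical: only completed turns count toward the target.
    return plan["planned_turn_count"] - plan["completed_turn_count"]

def complete_turn(plan: dict) -> dict:
    return {**plan, "completed_turn_count": plan["completed_turn_count"] + 1}

def abandon_turn(plan: dict, target_id: str) -> dict:
    # Excludes the worker room-locally but does NOT advance the completed
    # count, so the remaining workers absorb the rest of the schedule.
    return {**plan,
            "excluded_target_ids": plan["excluded_target_ids"] | {target_id}}
```

In the 10-worker example: after six completed turns and one abandonment, 24 turns remain, exactly as before the drop.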
Two failure classes mattered for the generalized flow.
If a target is already offline when the coordinator is choosing the next worker, the room simply skips it, because `select_next_participant/2` only considers live targets.
If the room has already opened a turn and the worker disappears or never responds in time:
- the turn is marked `abandoned`
- the target is added to `execution_plan.excluded_target_ids`
- the room continues with the remaining workers
That room-local exclusion is important even if the target is still globally visible for a brief moment, because it prevents a single bad worker from being selected repeatedly inside the same room.
Two correctness fixes were required to make the generalized budget safe.
The original polling loop only checked for timeout before sleeping. If the poll interval exceeded the requested timeout, a late completion could arrive and still be accepted.
The fix was to:
- compute the remaining deadline on each poll
- sleep for `min(poll_interval, remaining_deadline)`
- fail immediately when the remaining budget reaches zero
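The fixed loop shape, sketched in Python under the assumption of a simple boolean check function (the actual polling code is Elixir):

```python
import time

def poll_until(check, timeout_s: float, poll_interval_s: float) -> bool:
    # Recompute the remaining deadline on every iteration and never sleep
    # past it, so a completion arriving after the budget is spent is never
    # accepted.
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # fail immediately once the budget reaches zero
        time.sleep(min(poll_interval_s, remaining))
```

The original bug was the ordering: checking the timeout only before the sleep meant a poll interval longer than the remaining budget could overshoot the deadline and still accept a late result on the next check.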
Once a turn is abandoned, any later `job.result` for that job must be ignored.
`ApplyResult` now only mutates room state when the incoming result matches the
currently running turn. If the room has already advanced past that turn, the
result becomes a no-op.
Without that guard, timed-out workers could still increment the completed-turn count later and corrupt the room history.
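The guard itself is small. A Python sketch, with the turn-id field names assumed for illustration (the real `ApplyResult` operates on Elixir room state):

```python
def apply_result(room: dict, result: dict) -> dict:
    # Only mutate room state when the result matches the currently running
    # turn; results for abandoned or already-advanced turns are a no-op.
    if result["turn_id"] != room["current_turn_id"]:
        return room  # stale result: room history stays intact
    plan = room["execution_plan"]
    return {**room,
            "execution_plan": {**plan,
                               "completed_turn_count": plan["completed_turn_count"] + 1}}
```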
The shell tooling now mirrors the runtime model:
- `bin/client-worker --worker-index N` launches one generic worker
- `bin/hive-clients` launches 1, 2, or a custom 1..39 worker set in one terminal
- `setup/hive wait-targets --count N` waits for enough workers
- `setup/hive create-room --participant-count N` locks a subset of the live workers
- `setup/hive live-demo` defaults to all connected workers when no explicit count is provided
This means the control terminal and the server no longer need to know anything
about `architect` or `skeptic`.
Legacy wrappers still exist for compatibility, but they are no longer the primary story.
The coordinator remains authoritative for this slice because it keeps the model simple:
- workers stay stateless beyond the current prompt plus shared history
- the room can serialize collaboration cleanly
- the operator can reason about the exact planned turn budget
- tests stay deterministic
This is a good midpoint between a toy two-client demo and a more open-ended multi-agent protocol.
The current architecture is intentionally prepared for future strategies.
Clean next steps include:
- new `execution_plan.strategy` values
- dynamic stage counts
- specialized worker capability filters
- coordinator policies that choose subsets by capability or prior output quality
- richer room blocking and recovery policies
Those can now be added without rewriting the operator flow or reverting to hardcoded participant identities.