feat: continuous rerand#2007

Open
philsippl wants to merge 80 commits into main from ps/cont-rerand

@philsippl
Contributor

@philsippl philsippl commented Feb 27, 2026

Spec and implementation of continuous rerandomization of the iris share databases.

Spec is provided in docs/specs/rerandomization.md.

Goals:

  • Continuous rerandomization of shares in persistent storage (in-memory representation stays untouched)
  • Rerand performed by a server separate from the matching server (for network performance reasons)
  • Minimal impact on normal server (e.g. matching, startup, ...)
  • Support for GPU and HNSW versions

High level design:

  • Rerand server operating in chunks over DB

  • The rerand servers can be at most 1 chunk (or 1 epoch at the boundary) apart from each other

  • Synchronization with matching server required:

    • at startup: while the matching server is loading the db, we freeze the rerand
    • when applying modifications: the iris share table is always locked while there is a writer
  • Rerand servers first write to "staging" schema and then apply in chunks to reduce lock times

  • Relies on the current system property that modifications are guaranteed to eventually arrive on all parties

    • This PR also makes modification handling a bit more robust by deleting only after storing a modification and recovering potentially lost modifications.

    related: support additional route ampc-common#71
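The "at most 1 chunk (or 1 epoch at the boundary) apart" rule from the design list can be sketched as a small check. This is only an illustration of one reading of that rule, not code from the PR: the `Progress` struct, the `within_one_step` function, and the `chunks_per_epoch` parameter are all hypothetical names, and the sketch assumes every epoch has the same number of chunks.

```rust
/// Hypothetical progress marker for one rerand server: the epoch it is in
/// and the index of the chunk it is working on within that epoch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Progress {
    pub epoch: u64,
    pub chunk: u64,
}

/// Returns true if two servers are at most one step apart. Moving from the
/// last chunk of an epoch to chunk 0 of the next epoch counts as one step.
/// Assumption: every epoch has exactly `chunks_per_epoch` chunks.
pub fn within_one_step(a: Progress, b: Progress, chunks_per_epoch: u64) -> bool {
    // Flatten (epoch, chunk) into a single counter so the epoch boundary
    // needs no special case. u128 avoids overflow in the multiplication.
    let pos = |p: Progress| p.epoch as u128 * chunks_per_epoch as u128 + p.chunk as u128;
    pos(a).abs_diff(pos(b)) <= 1
}
```

With 10 chunks per epoch, a server at (epoch 0, chunk 9) and one at (epoch 1, chunk 0) pass the check, while (epoch 0, chunk 8) and (epoch 1, chunk 0) do not.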

@github-actions github-actions Bot added the chore label Feb 27, 2026
Comment thread iris-mpc-store/src/rerand.rs
Comment thread iris-mpc-store/src/rerand.rs Outdated
@philsippl philsippl changed the title initial implementation feat: continuous rerand Feb 27, 2026
@philsippl philsippl requested a review from dkales February 27, 2026 12:48
@philsippl philsippl marked this pull request as ready for review March 1, 2026 22:05
@philsippl philsippl requested a review from a team as a code owner March 1, 2026 22:05
@philsippl philsippl requested a review from eaypek-tfh March 2, 2026 13:23
Collaborator

@dkales dkales left a comment

A few minor comments, overall strategy seems good. Have not looked at the e2e tests yet.

Comment thread iris-mpc-common/src/helpers/sync.rs Outdated
Comment thread iris-mpc-store/src/rerand.rs Outdated
@@ -138,33 +172,75 @@ pub async fn server_main(config: Config) -> Result<()> {

sync_sqs_queues(&config, &sync_result, &aws_clients).await?;
Collaborator

Does the sync_sqs_queues logic not violate the assumptions on the sqs queues?

Collaborator

to be fair IDK if this ever happens

Contributor Author

@philsippl philsippl Mar 4, 2026

I think it is actually fine even in theory, because the modification sync happens first, so all nodes have already synced on the modifications. After that, we would only delete modifications that no node has received. But this all relies on the order of the code, and hopefully we don't get a queue desync in prod anyway.

Comment thread iris-mpc/src/server/mod.rs
Comment thread iris-mpc-upgrade/src/s3_coordination.rs
Comment thread iris-mpc-upgrade/src/epoch.rs
Comment thread iris-mpc-upgrade/src/epoch.rs Outdated
Comment thread iris-mpc-store/src/rerand.rs
carlomazzaferro added a commit that referenced this pull request May 27, 2026
Move delete_message in BatchProcessor::process_message from before the
per-type processing branch to after it returns Ok. The previous ordering
acked the SQS message before the modification was durably persisted, so
a crash in that window silently dropped the message. After this change,
Err from inner processing propagates without acking, so SQS visibility-
timeout redelivery retries the message.

The post-process DeleteMessage call is best-effort: if it transiently
fails after successful processing, log + counter and continue rather
than propagate. Propagating would tear down receive_batch_stream
(batch.rs:116 — any Err breaks the spawn loop) and discard the
assembled BatchQuery for what is a transient ack failure; the
modification is already persisted and downstream appliers are
idempotent.

Two adjacent hardenings on the same persistence flow:
- delete_message helper: receipt_handle.unwrap() -> ok_or_else with a
  new ReceiveRequestError::FailedToMarkRequestAsDeleted variant
- modifications_sync.rs (peer-sync apply path): panic!/unwrap() on
  unknown modification type, missing s3_url, and JoinHandle Result
  converted to typed Err returns. No happy-path behavior change.

Cherry-picked equivalents of changes in #2007 (continuous-rerand
branch, @philsippl), translated to current main shape; the rest of
#2007 remains parked on ps/cont-rerand. The best-effort delete wrap
addresses a codex review point that #2007 does not.

Linear: https://linear.app/worldcoin/issue/POP-3781
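
The ordering described in this commit message (persist first, ack second, with a best-effort ack) can be sketched as follows. This is a simplified, synchronous illustration and not the code in `BatchProcessor::process_message`: the `Queue` trait, `MockQueue`, `handle`, and the `ack_failures` counter are hypothetical stand-ins, and `delete` only models the SQS DeleteMessage call.

```rust
/// Hypothetical stand-in for the queue; `delete` models the DeleteMessage ack.
pub trait Queue {
    fn delete(&mut self, receipt: &str) -> Result<(), String>;
}

/// In-memory mock used only for illustration.
pub struct MockQueue {
    pub fail_delete: bool,
    pub deleted: Vec<String>,
}

impl Queue for MockQueue {
    fn delete(&mut self, receipt: &str) -> Result<(), String> {
        if self.fail_delete {
            return Err("transient delete failure".to_string());
        }
        self.deleted.push(receipt.to_string());
        Ok(())
    }
}

/// Persist first, ack second.
/// - If `persist` fails, return early without acking, so the queue can
///   redeliver the message after its visibility timeout.
/// - If the ack fails after a successful persist, count it and continue
///   (best-effort), since the modification is already stored.
pub fn handle<Q: Queue>(
    queue: &mut Q,
    receipt: &str,
    persist: impl FnOnce() -> Result<(), String>,
    ack_failures: &mut u64,
) -> Result<(), String> {
    persist()?;
    if queue.delete(receipt).is_err() {
        *ack_failures += 1;
    }
    Ok(())
}
```

Swallowing the ack failure is only safe if reprocessing a redelivered message is harmless, which is the idempotency assumption this commit message states.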
carlomazzaferro added a commit that referenced this pull request May 27, 2026
Move delete_message in BatchProcessor::process_message from before the
per-type processing branch to after it returns Ok. The previous ordering
acked the SQS message before the modification was durably persisted, so
a crash in that window silently dropped the message. After this change,
Err from inner processing propagates without acking, so SQS visibility-
timeout redelivery retries the message.

DeleteMessage itself uses strict `?` propagation. If the ack fails
after processing has succeeded, the Err tears down receive_batch_stream
(batch.rs:116) — intentional, because the modifications table is not
idempotent at the DB boundary (plain INSERT, no request_id key, see
iris-mpc-store/src/lib.rs:491-519). The cascade acts as a circuit
breaker against duplicate-row accumulation while SQS is unhealthy.
Proper modifications-table idempotency is tracked in POP-3897.

Two adjacent hardenings on the same persistence flow:
- delete_message helper: receipt_handle.unwrap() -> ok_or_else with a
  new ReceiveRequestError::FailedToMarkRequestAsDeleted variant
- modifications_sync.rs (peer-sync apply path): panic!/unwrap() on
  unknown modification type, missing s3_url, and JoinHandle Result
  converted to typed Err returns. No happy-path behavior change.

Cherry-picked equivalents of changes in #2007 (continuous-rerand
branch, @philsippl), translated to current main shape; the rest of
#2007 remains parked on ps/cont-rerand.

Linear: https://linear.app/worldcoin/issue/POP-3781
Follow-up: https://linear.app/worldcoin/issue/POP-3897
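
The strict variant described in this commit message, together with the `unwrap()` to `ok_or_else` change, can be sketched as below. Only the variant name `ReceiveRequestError::FailedToMarkRequestAsDeleted` comes from the commit message; its `String` payload, the `Processing` variant, and the functions `receipt_or_err` and `process_then_ack` are assumptions made for illustration, and the real code is async and talks to SQS.

```rust
#[derive(Debug, PartialEq)]
pub enum ReceiveRequestError {
    /// Variant name taken from the commit message; the payload is assumed.
    FailedToMarkRequestAsDeleted(String),
    /// Hypothetical variant standing in for any processing failure.
    Processing(String),
}

/// Replaces `receipt_handle.unwrap()`: a missing handle becomes a typed
/// error instead of a panic.
pub fn receipt_or_err(receipt_handle: Option<&str>) -> Result<&str, ReceiveRequestError> {
    receipt_handle.ok_or_else(|| {
        ReceiveRequestError::FailedToMarkRequestAsDeleted("missing receipt handle".to_string())
    })
}

/// Persist first, then ack with strict `?` propagation: an ack failure after
/// successful processing is returned to the caller rather than swallowed.
pub fn process_then_ack(
    receipt_handle: Option<&str>,
    persist: impl FnOnce() -> Result<(), ReceiveRequestError>,
    delete: impl FnOnce(&str) -> Result<(), ReceiveRequestError>,
) -> Result<(), ReceiveRequestError> {
    // A processing error returns here, before any ack is attempted.
    persist()?;
    let receipt = receipt_or_err(receipt_handle)?;
    delete(receipt)?;
    Ok(())
}
```

Compared with a best-effort ack, this trades availability for safety: while acks are failing the caller stops, which limits duplicate inserts into a table that is not idempotent.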
4 participants