feat: continuous rerand #2007
Open
philsippl wants to merge 80 commits into
dkales (Collaborator) reviewed on Mar 3, 2026 and left a comment:
A few minor comments, overall strategy seems good. Have not looked at the e2e tests yet.
@@ -138,33 +172,75 @@ pub async fn server_main(config: Config) -> Result<()> {
    sync_sqs_queues(&config, &sync_result, &aws_clients).await?;
Collaborator
Does the sync_sqs_queues logic not violate the assumptions on the sqs queues?
Collaborator
to be fair IDK if this ever happens
Contributor
Author
I think it is actually fine even in theory, because the modification sync happens first, so all nodes have already synced on the modifications by then. After that, we would only delete modifications that no node has received. But this all relies on the order of the code, and hopefully we don't get a queue desync in prod anyway.
carlomazzaferro added a commit that referenced this pull request on May 27, 2026

Move delete_message in BatchProcessor::process_message from before the per-type processing branch to after it returns Ok. The previous ordering acked the SQS message before the modification was durably persisted, so a crash in that window silently dropped the message. After this change, Err from inner processing propagates without acking, so SQS visibility-timeout redelivery retries the message.

The post-process DeleteMessage call is best-effort: if it transiently fails after successful processing, log + counter and continue rather than propagate. Propagating would tear down receive_batch_stream (batch.rs:116, where any Err breaks the spawn loop) and discard the assembled BatchQuery for what is a transient ack failure; the modification is already persisted and downstream appliers are idempotent.

Two adjacent hardenings on the same persistence flow:
- delete_message helper: receipt_handle.unwrap() -> ok_or_else with a new ReceiveRequestError::FailedToMarkRequestAsDeleted variant
- modifications_sync.rs (peer-sync apply path): panic!/unwrap() on unknown modification type, missing s3_url, and JoinHandle Result converted to typed Err returns.

No happy-path behavior change.

Cherry-picked equivalents of changes in #2007 (continuous-rerand branch, @philsippl), translated to current main shape; the rest of #2007 remains parked on ps/cont-rerand. The best-effort delete wrap addresses a codex review point that #2007 does not.

Linear: https://linear.app/worldcoin/issue/POP-3781
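The ack-after-persist ordering with a best-effort delete, as described in this commit message, can be sketched roughly as follows. This is a minimal illustration and not the repository's code: `InMemoryQueue`, `ProcessError`, the `persist` closure, and the `ack_failures` counter are all hypothetical stand-ins for the SQS client, the real error type, the per-type processing branch, and the metrics counter.

```rust
/// Hypothetical error type for this sketch (not the repository's enum).
#[derive(Debug, PartialEq)]
pub enum ProcessError {
    Persist(String),
}

/// In-memory stand-in for the SQS queue, used only for illustration.
#[derive(Default)]
pub struct InMemoryQueue {
    pub deleted: Vec<String>,
    pub fail_delete: bool,
}

impl InMemoryQueue {
    pub fn delete_message(&mut self, receipt_handle: &str) -> Result<(), String> {
        if self.fail_delete {
            return Err("transient delete failure".to_string());
        }
        self.deleted.push(receipt_handle.to_string());
        Ok(())
    }
}

/// Returns Ok(true) if the message was processed and acked,
/// Ok(false) if it was processed but the ack failed.
pub fn process_message(
    queue: &mut InMemoryQueue,
    receipt_handle: &str,
    persist: impl FnOnce() -> Result<(), ProcessError>,
    ack_failures: &mut u64,
) -> Result<bool, ProcessError> {
    // 1. Persist first. On Err we return before acking, so the message
    //    stays on the queue and is redelivered after the visibility timeout.
    persist()?;

    // 2. Ack only after successful processing. A failed delete is logged
    //    and counted, but not propagated to the caller.
    match queue.delete_message(receipt_handle) {
        Ok(()) => Ok(true),
        Err(err) => {
            *ack_failures += 1;
            eprintln!("delete_message failed after successful processing: {}", err);
            Ok(false)
        }
    }
}
```

In this shape a failed ack leaves the message on the queue for redelivery, so the approach only holds up if reprocessing the same message is harmless.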
carlomazzaferro added a commit that referenced this pull request on May 27, 2026

Move delete_message in BatchProcessor::process_message from before the per-type processing branch to after it returns Ok. The previous ordering acked the SQS message before the modification was durably persisted, so a crash in that window silently dropped the message. After this change, Err from inner processing propagates without acking, so SQS visibility-timeout redelivery retries the message.

DeleteMessage itself uses strict `?` propagation. If the ack fails after processing has succeeded, the Err tears down receive_batch_stream (batch.rs:116). This is intentional, because the modifications table is not idempotent at the DB boundary (plain INSERT, no request_id key, see iris-mpc-store/src/lib.rs:491-519). The cascade acts as a circuit breaker against duplicate-row accumulation while SQS is unhealthy. Proper modifications-table idempotency is tracked in POP-3897.

Two adjacent hardenings on the same persistence flow:
- delete_message helper: receipt_handle.unwrap() -> ok_or_else with a new ReceiveRequestError::FailedToMarkRequestAsDeleted variant
- modifications_sync.rs (peer-sync apply path): panic!/unwrap() on unknown modification type, missing s3_url, and JoinHandle Result converted to typed Err returns.

No happy-path behavior change.

Cherry-picked equivalents of changes in #2007 (continuous-rerand branch, @philsippl), translated to current main shape; the rest of #2007 remains parked on ps/cont-rerand.

Linear: https://linear.app/worldcoin/issue/POP-3781
Follow-up: https://linear.app/worldcoin/issue/POP-3897
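A rough sketch of the strict variant, including the `unwrap()` to `ok_or_else` change on the receipt handle, might look like this. Only the variant name `ReceiveRequestError::FailedToMarkRequestAsDeleted` comes from the commit message; its `String` payload, the `Processing` variant, the `Message` struct, and the closure-based `persist` / `send_delete` parameters are assumptions made so the example is self-contained.

```rust
#[derive(Debug, PartialEq)]
pub enum ReceiveRequestError {
    /// Hypothetical variant standing in for any inner processing failure.
    Processing(String),
    /// Variant name from the commit message; the payload is an assumption.
    FailedToMarkRequestAsDeleted(String),
}

/// Hypothetical message type; SQS receipt handles are optional on receive.
pub struct Message {
    pub receipt_handle: Option<String>,
}

/// Typed error instead of a panic when the receipt handle is missing.
pub fn delete_message(
    msg: &Message,
    send_delete: impl FnOnce(&str) -> Result<(), String>,
) -> Result<(), ReceiveRequestError> {
    let handle = msg.receipt_handle.as_deref().ok_or_else(|| {
        ReceiveRequestError::FailedToMarkRequestAsDeleted("missing receipt handle".to_string())
    })?;
    send_delete(handle).map_err(ReceiveRequestError::FailedToMarkRequestAsDeleted)
}

/// Strict ordering: persist, then ack, and propagate either failure with `?`.
pub fn process_message_strict(
    msg: &Message,
    persist: impl FnOnce() -> Result<(), ReceiveRequestError>,
    send_delete: impl FnOnce(&str) -> Result<(), String>,
) -> Result<(), ReceiveRequestError> {
    // An Err here returns before the ack, so the message is redelivered.
    persist()?;
    // An Err here also propagates; the caller is expected to stop consuming
    // rather than keep inserting rows that cannot be acked.
    delete_message(msg, send_delete)?;
    Ok(())
}
```

Compared with a best-effort ack, this trades availability for safety: a failing delete stops the consumer instead of letting redelivered messages be inserted again.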
Spec and implementation of continuous rerandomization of the iris share databases.
Spec is provided in docs/specs/rerandomization.md.
Goals:
High level design:
- Rerand server operating in chunks over DB
- The rerand servers can be at most 1 chunk (or 1 epoch at the boundary) apart from each other
- Synchronization with matching server required:
- Rerand servers first write to "staging" schema and then apply in chunks to reduce lock times
- Relies on the current system property that modifications are guaranteed to eventually arrive on all parties
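One way to read the "at most 1 chunk (or 1 epoch at the boundary) apart" constraint is as a bound on linearized progress. The sketch below is an interpretation and not taken from the spec in docs/specs/rerandomization.md: the `Position` type, the fixed `chunks_per_epoch` parameter, and the rule that the last chunk of one epoch is adjacent to the first chunk of the next are all assumptions.

```rust
/// Hypothetical progress marker of one rerand server.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Position {
    pub epoch: u64,
    pub chunk: u64,
}

/// Returns true if two servers are at most one chunk apart, treating the
/// last chunk of epoch e and the first chunk of epoch e + 1 as adjacent.
/// Assumes every epoch has exactly `chunks_per_epoch` chunks and that the
/// linearized index fits in u64.
pub fn within_one_chunk(a: Position, b: Position, chunks_per_epoch: u64) -> bool {
    let index_a = a.epoch * chunks_per_epoch + a.chunk;
    let index_b = b.epoch * chunks_per_epoch + b.chunk;
    let (hi, lo) = if index_a >= index_b {
        (index_a, index_b)
    } else {
        (index_b, index_a)
    };
    hi - lo <= 1
}
```

A server could evaluate a check like this against its peers' reported positions before starting the next chunk, and wait if it would otherwise run ahead.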
related: support additional route ampc-common#71