[FEAT] Add checkpoint for long-running GraphWorkflows#1498

Open
adichaudhary wants to merge 2 commits into kyegomez:master from adichaudhary:feat/graph-checkpoint-dir
Conversation

@adichaudhary
Contributor

@adichaudhary adichaudhary commented Mar 24, 2026

Description

This PR adds checkpoint-based fault tolerance to GraphWorkflow. Previously, a long-running pipeline that crashed or timed out mid-execution had to be restarted from scratch, re-running every agent and paying for every LLM call again. A checkpoint_dir parameter can now be passed at construction time. After each layer completes, its outputs are persisted to disk; on the next run() call with the same task string, any already-completed layers are loaded from the checkpoint files and skipped entirely. A clear_checkpoints(task) method is also provided to clean up after a confirmed successful run.
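
The save/resume behavior described above can be sketched in isolation. This is a minimal standalone illustration, not the code in this PR: the `run_layers` function, the `<task-hash>_layer_<index>.json` file naming, and the JSON serialization are all assumptions made for the example, and layers are modeled as plain callables rather than groups of agents.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, List

# A "layer" here is a plain callable: (task, prior_outputs) -> new outputs.
Layer = Callable[[str, Dict[str, str]], Dict[str, str]]


def _checkpoint_path(checkpoint_dir: Path, task: str, layer_idx: int) -> Path:
    # Hypothetical naming scheme: key each file on a hash of the task string,
    # so a re-run with the same task finds the same files.
    task_key = hashlib.sha256(task.encode("utf-8")).hexdigest()[:16]
    return checkpoint_dir / f"{task_key}_layer_{layer_idx}.json"


def run_layers(layers: List[Layer], task: str, checkpoint_dir: Path) -> Dict[str, str]:
    """Run layers in order, skipping any layer that already has a checkpoint."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    results: Dict[str, str] = {}
    for idx, layer in enumerate(layers):
        path = _checkpoint_path(checkpoint_dir, task, idx)
        if path.exists():
            # Resume: load the saved outputs instead of re-running the layer.
            results.update(json.loads(path.read_text()))
            continue
        outputs = layer(task, dict(results))
        # Write to a temporary file first, then rename, so a crash during the
        # write does not leave a truncated checkpoint behind.
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(outputs))
        tmp.replace(path)
        results.update(outputs)
    return results
```

In this sketch, calling `run_layers` twice with the same task and directory runs each layer only once; the second call is served entirely from the checkpoint files.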

Files Changed

  • swarms/structs/graph_workflow.py — added checkpoint_dir parameter to __init__, checkpoint save/resume logic in the layer execution loop of run(), and the clear_checkpoints() method
  • examples/multi_agent/graphworkflow_examples/graph_workflow_checkpointing.py

Issue

#1484

Dependencies

No extra dependencies required.

Maintainer

@kyegomez

Twitter

@akc__2025

📚 Documentation preview 📚: https://swarms--1451.org.readthedocs.build/en/1451/

Copilot AI review requested due to automatic review settings March 24, 2026 00:08
Contributor

Copilot AI left a comment

Pull request overview

Adds checkpoint-based fault tolerance to GraphWorkflow so long-running, multi-layer executions can resume without re-running completed layers.

Changes:

  • Add checkpoint_dir parameter to GraphWorkflow.__init__ and implement per-layer checkpoint save/resume in run().
  • Add clear_checkpoints(task) helper to remove checkpoint files for a given task.
  • Add an example script demonstrating checkpointing and a simulated crash/re-run flow.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| swarms/structs/graph_workflow.py | Introduces checkpoint_dir, checkpoint save/resume logic in run(), and clear_checkpoints() cleanup method. |
| examples/multi_agent/graphworkflow_examples/graph_workflow_checkpointing.py | New example showcasing checkpoint usage and a simulated crash scenario. |

@github-actions github-actions bot added the tests label Mar 24, 2026