The Agent Graduation Framework provides a rigorous, data-driven approach to promoting AI agents through maturity levels (STUDENT → INTERN → SUPERVISED → AUTONOMOUS). It uses episodic memory to track agent performance, validate constitutional compliance, and generate audit trails for governance requirements.
- Why Graduation Matters
- Two Types of Graduation ⚡ IMPORTANT
- Graduation Criteria
- How Graduation is Triggered
- Readiness Score Calculation
- Constitutional Compliance
- Use Cases
- Graduation Workflow
- Edge Case Testing
- Audit Trail Generation
Atom has two distinct graduation systems that work together:
Agent graduation:
- What: Overall agent maturity level
- Progression: STUDENT → INTERN → SUPERVISED → AUTONOMOUS
- Based on: Episodes, intervention rates, constitutional compliance
- Scope: Agent-wide (affects all capabilities)

Capability graduation:
- What: Individual skill/capability maturity
- Progression: 5 → 20 → 50 successful uses
- Based on: Usage count per capability
- Scope: Per-capability (skills graduate independently)
Example:
- An agent can be INTERN overall (agent graduation)
- But have AUTONOMOUS level for "data_query" capability (capability graduation)
- While still being STUDENT for "shell_access" capability
See: Capability Graduation Logic for the 5/20/50 rule.
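As a rough illustration of per-capability graduation, the sketch below maps a usage count to a level. The source gives only the 5/20/50 thresholds; pairing them with the INTERN, SUPERVISED, and AUTONOMOUS names, the function name `capability_level`, and the `report_builder` capability are assumptions made for this example.

```python
from bisect import bisect_right

# Assumed mapping: only the 5 / 20 / 50 thresholds come from the text above;
# attaching these level names to them is an illustrative guess.
THRESHOLDS = [5, 20, 50]
LEVELS = ["STUDENT", "INTERN", "SUPERVISED", "AUTONOMOUS"]


def capability_level(successful_uses: int) -> str:
    """Return the hypothetical capability level for a usage count."""
    if successful_uses < 0:
        raise ValueError("successful_uses must be non-negative")
    # bisect_right counts how many thresholds have been reached.
    return LEVELS[bisect_right(THRESHOLDS, successful_uses)]


# One agent, several capabilities, each graduating independently.
usage = {"data_query": 64, "report_builder": 7, "shell_access": 2}
levels = {name: capability_level(count) for name, count in usage.items()}
print(levels)
```

Because each capability keeps its own counter, a heavily used skill can be far ahead of a rarely used one on the same agent.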
Agents must demonstrate competence before gaining autonomy. Graduation ensures:
- Zero critical errors in AUTONOMOUS mode
- Constitutional compliance (tax laws, HIPAA, etc.)
- Proven track record of correct decisions
Regulated industries require audit trails proving:
- Agent performance over time
- Human intervention rates
- Compliance with domain-specific rules
Users need confidence that agents:
- Learn from past experiences
- Improve over time
- Don't repeat mistakes
Graduation tracking includes:
- Episodic Memory: Agent interactions and past experiences
- Community Skills: Skill usage diversity and learning velocity
- Canvas Presentations: Context-aware decision making
- User Feedback: Performance ratings and corrections
| Level | Description | Permissions |
|---|---|---|
| STUDENT | Learning phase | Read-only, presentations |
| INTERN | Basic autonomy | Streaming, form presentations |
| SUPERVISED | Advanced autonomy | Form submissions, state changes (with supervision) |
| AUTONOMOUS | Full independence | All actions, no oversight |
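One way to read the permissions table is as a minimum maturity level per action. The sketch below assumes permissions are cumulative (a higher level keeps everything a lower level can do), which the table does not state explicitly. The action names and the `is_allowed` helper are hypothetical.

```python
from enum import IntEnum


class Maturity(IntEnum):
    STUDENT = 0
    INTERN = 1
    SUPERVISED = 2
    AUTONOMOUS = 3


# Hypothetical action names; the minimum levels follow the table above.
MIN_LEVEL = {
    "read": Maturity.STUDENT,
    "present": Maturity.STUDENT,
    "stream": Maturity.INTERN,
    "present_form": Maturity.INTERN,
    "submit_form": Maturity.SUPERVISED,
    "change_state": Maturity.SUPERVISED,
}


def is_allowed(level: Maturity, action: str) -> bool:
    """Check an action against the agent's maturity level.

    Actions not listed above are treated as AUTONOMOUS-only ("All actions").
    """
    return level >= MIN_LEVEL.get(action, Maturity.AUTONOMOUS)


print(is_allowed(Maturity.INTERN, "stream"))       # allowed at INTERN
print(is_allowed(Maturity.INTERN, "submit_form"))  # needs SUPERVISED
```

Note that SUPERVISED actions still run under human supervision; this check only decides whether the action may be attempted at all.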
| Promotion | Min Episodes | Max Intervention Rate | Min Constitutional Score |
|---|---|---|---|
| STUDENT → INTERN | 10 | 50% | 0.70 |
| INTERN → SUPERVISED | 25 | 20% | 0.85 |
| SUPERVISED → AUTONOMOUS | 50 | 0% | 0.95 |
Episode Count: Number of completed episodes at current maturity level
Intervention Rate: Percentage of episodes requiring human correction
intervention_rate = total_interventions / episode_count
Constitutional Score: Compliance with domain rules (0.0 to 1.0)
- Validated against Knowledge Graph
- Tracks violations of tax laws, HIPAA, etc.
- Calculated per episode
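The promotion table and the intervention-rate formula above can be combined into a simple eligibility check. This is a minimal sketch, not the service's actual logic: the `Criteria` class and `promotion_gaps` function are hypothetical names, and treating an agent with no episodes as having the worst-case intervention rate is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    min_episodes: int
    max_intervention_rate: float
    min_constitutional_score: float


# Thresholds copied from the promotion table above, keyed by target level.
CRITERIA = {
    "INTERN": Criteria(10, 0.50, 0.70),
    "SUPERVISED": Criteria(25, 0.20, 0.85),
    "AUTONOMOUS": Criteria(50, 0.00, 0.95),
}


def promotion_gaps(target, episode_count, total_interventions, avg_constitutional_score):
    """Return the unmet criteria; an empty list means the agent is eligible."""
    c = CRITERIA[target]
    gaps = []
    if episode_count < c.min_episodes:
        gaps.append(f"episodes: {episode_count} < {c.min_episodes}")
    # intervention_rate = total_interventions / episode_count
    # Assumption: with no episodes, use the worst-case rate instead of dividing by zero.
    rate = total_interventions / episode_count if episode_count else 1.0
    if rate > c.max_intervention_rate:
        gaps.append(f"intervention rate: {rate:.2f} > {c.max_intervention_rate:.2f}")
    if avg_constitutional_score < c.min_constitutional_score:
        gaps.append(
            f"constitutional score: {avg_constitutional_score:.2f} < {c.min_constitutional_score:.2f}"
        )
    return gaps


print(promotion_gaps("INTERN", 12, 3, 0.78))
print(promotion_gaps("AUTONOMOUS", 50, 1, 0.96))
```

Returning the list of gaps rather than a bare boolean makes it easier to tell an operator which criterion is blocking a promotion.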
Graduation is an autonomous, event-driven process that monitors agent performance across two distinct paths:
Immediately following every execution of the GenericAgent, the system invokes the GraduationService.
- Mechanism: The _record_execution hook captures the task outcome.
- Context: Passes the agent_id and the skill_id (e.g., "tax_calculation") to the evaluation engine.
- Immediate Action: If the local streak threshold is met, the skill is promoted to AUTONOMOUS instantly.
The BackgroundAgentRunner performs system-wide audits independently of user interaction.
- Mechanism: Periodic scheduled jobs (e.g., every 5 minutes).
- Context: Scans the AgentRegistry for skills in the SUPERVISED or INTERN tiers.
- Action: Aggregates episodic data across all sessions for a comprehensive readiness review.
Note: This section describes skill promotion within the agent graduation framework. For per-capability graduation based on usage count (5/20/50 rule), see Capability Graduation Logic.
Promotion to AUTONOMOUS state is governed by the Dynamic Streak Rule:
| Complexity | Required Consecutive Clean Runs |
|---|---|
| Simple | 3 |
| Moderate | 5 |
| Complex | 8 |
| Advanced | 8 |
A "Clean Run" is defined as:
- Success: True (the task completed its objective).
- Human Interventions: 0 (zero manual corrections).
- Constitutional Score: ≥ 0.95 (full domain compliance).
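The Dynamic Streak Rule and the clean-run definition can be sketched as follows. This assumes the streak is counted over the most recent consecutive runs and that complexity is passed as a lowercase string. The `Run` class and the helper names are hypothetical, not the actual GraduationService API.

```python
from dataclasses import dataclass

# Required consecutive clean runs, from the table above.
REQUIRED_STREAK = {"simple": 3, "moderate": 5, "complex": 8, "advanced": 8}


@dataclass
class Run:
    success: bool
    human_interventions: int
    constitutional_score: float


def is_clean(run: Run) -> bool:
    """A clean run succeeded, needed no corrections, and scored at least 0.95."""
    return run.success and run.human_interventions == 0 and run.constitutional_score >= 0.95


def current_streak(runs) -> int:
    """Count consecutive clean runs at the end of the history (newest last)."""
    streak = 0
    for run in reversed(runs):
        if not is_clean(run):
            break
        streak += 1
    return streak


def streak_met(runs, complexity: str) -> bool:
    return current_streak(runs) >= REQUIRED_STREAK[complexity]


history = [Run(True, 1, 1.0), Run(True, 0, 0.97), Run(True, 0, 0.99), Run(True, 0, 0.95)]
print(current_streak(history), streak_met(history, "simple"), streak_met(history, "moderate"))
```

A single run that fails any of the three conditions resets the streak to zero, so more complex skills need a longer unbroken record before promotion.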
The readiness score uses a six-component weighted formula that gives a more comprehensive assessment than the earlier three-component version:
Readiness Score =
(Zero Intervention Ratio × 30%) +
(Average Constitutional Score × 25%) +
(Average Confidence Score × 15%) +
(Success Rate × 10%) +
(Supervision Success Rate × 10%) +
(Skill Diversity Score × 10%)
Component Breakdown:
| Component | Weight | Description | Calculation |
|---|---|---|---|
| Zero Intervention Ratio | 30% | Episodes with zero human interventions | zero_interventions / total_episodes |
| Average Constitutional Score | 25% | Compliance with domain rules | avg(constitutional_scores) |
| Average Confidence Score | 15% | Agent's self-assessed confidence | agent.confidence_score |
| Success Rate | 10% | Overall task success rate | successful_tasks / total_tasks |
| Supervision Success Rate | 10% | Performance during supervision | supervision_tasks_with_4_5_star / total_supervision |
| Skill Diversity Score | 10% | Variety of skills used (encourages broader capability) | min(unique_skills_used / 20, 1.0) |
Example Calculation:
Scenario: Agent seeking promotion to INTERN
Metrics:
- Episodes: 12 (min required: 10)
- Zero intervention episodes: 10/12 = 83%
- Avg constitutional score: 0.78
- Avg confidence score: 0.72
- Success rate: 0.85
- Supervision success rate: 0.80
- Skill diversity score: 0.60 (used 12 different skills)
Score:
Zero Intervention: 0.83 × 30 = 24.9
Constitutional: 0.78 × 25 = 19.5
Confidence: 0.72 × 15 = 10.8
Success Rate: 0.85 × 10 = 8.5
Supervision: 0.80 × 10 = 8.0
Skill Diversity: 0.60 × 10 = 6.0
---
Total = 77.7/100 (Ready for promotion!)
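The worked example above can be reproduced with a short function. This is a sketch under stated assumptions: the metric key names and the `readiness_score` helper are hypothetical, and each component is expected as a fraction between 0.0 and 1.0. Only the weights come from the formula.

```python
# Weights (in points out of 100) from the six-component formula above.
WEIGHTS = {
    "zero_intervention_ratio": 30,
    "avg_constitutional_score": 25,
    "avg_confidence_score": 15,
    "success_rate": 10,
    "supervision_success_rate": 10,
    "skill_diversity_score": 10,
}


def readiness_score(metrics: dict) -> float:
    """Weighted sum of the six components, rounded to one decimal place."""
    missing = WEIGHTS.keys() - metrics.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    total = sum(metrics[name] * weight for name, weight in WEIGHTS.items())
    return round(total, 1)


# Metrics from the STUDENT → INTERN example above.
example = {
    "zero_intervention_ratio": 0.83,
    "avg_constitutional_score": 0.78,
    "avg_confidence_score": 0.72,
    "success_rate": 0.85,
    "supervision_success_rate": 0.80,
    "skill_diversity_score": 0.60,
}
score = readiness_score(example)
print(f"Readiness: {score}/100")  # Readiness: 77.7/100
```

Keeping the weights in one dictionary makes it easy to confirm they sum to 100 and to see how much each component can contribute.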
| Old Formula | New Formula | Change |
|---|---|---|
| 3 components | 6 components | +3 new metrics |
| Episode/Intervention/Constitutional only | Adds confidence, success, supervision, skills | More comprehensive |
| Episode Score (40%) | Split into multiple metrics | Better granularity |
| No skill tracking | Skill diversity bonus | Encourages broader learning |
Agents are rewarded for using a diverse set of skills:
skill_diversity_score = min(unique_skills_used / 20, 1.0)

- 0 skills: 0% score
- 10 skills: 50% score
- 20+ skills: 100% score (max bonus)
This encourages agents to develop broader capabilities rather than specializing in a narrow domain.
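A minimal sketch of the diversity formula, assuming the input is a list of skill identifiers in which the same skill may appear many times. The function name and the `cap` parameter are illustrative. Only the divisor of 20 and the 1.0 ceiling come from the formula above.

```python
def skill_diversity_score(skills_used, cap: int = 20) -> float:
    """Fraction of the cap covered by distinct skills, clamped to 1.0."""
    unique_skills = len(set(skills_used))
    return min(unique_skills / cap, 1.0)


# Repeated use of the same skill does not raise the score: 2 distinct skills out of 20.
score = skill_diversity_score(["tax_calculation", "data_query", "tax_calculation"])
print(score)  # 0.1
```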
Agents must adhere to domain-specific rules stored in the Knowledge Graph:
- Tax Agents: HST rules, provincial tax rates, exemption criteria
- Medical Agents: HIPAA regulations, clinical documentation standards
- Legal Agents: Confidentiality rules, document retention policies
from core.agent_graduation_service import AgentGraduationService

service = AgentGraduationService(db)

# Validate a specific episode
result = await service.validate_constitutional_compliance(
    episode_id="episode_123"
)
# Returns:
# {
#     "compliant": True,
#     "score": 0.95,
#     "violations": [],
#     "episode_id": "episode_123"
# }

Each episode tracks:

class Episode:
    constitutional_score: float    # 0.0 to 1.0
    human_intervention_count: int  # Number of corrections
    human_edits: JSON              # List of specific corrections

Scenario: Hospital board requires proof that the MedScribe agent can document clinical encounters with zero errors before autonomous operation.
Requirements:
- 100 episodes of clinical documentation
- 0 human interventions
- 1.0 constitutional score (HIPAA compliance)
- Full audit trail for board review
Implementation:
# Create clinical documentation episodes
for encounter in patient_encounters:
    episode = await service.create_episode_from_session(
        session_id=encounter.session_id,
        agent_id="medscribe_agent",
        title=f"Clinical Documentation: {encounter.patient_id}"
    )
    # Episodes track:
    # - human_intervention_count (must be 0)
    # - constitutional_score (validated against HIPAA rules)
    # - clinical_accuracy_score (validated against medical records)

# Generate audit report for hospital board
audit = await service.get_graduation_audit_trail(agent_id="medscribe_agent")
board_report = f"""
MedScribe Graduation Report for Hospital Board Review
======================================================
Agent: {audit['agent_name']}
Current Maturity: {audit['current_maturity']}
Performance Metrics:
- Total Clinical Episodes: {audit['total_episodes']}
- Total Interventions: {audit['total_interventions']}
- Avg Constitutional Score (HIPAA): {audit['avg_constitutional_score']:.2f}
Graduation Status: {'✓ READY FOR AUTONOMOUS OPERATION' if audit['total_interventions'] == 0 else '✗ NOT READY'}
Episode Breakdown by Maturity:
"""
for maturity, count in audit['episodes_by_maturity'].items():
    board_report += f"- {maturity}: {count} episodes\n"
print(board_report)

Sample Output:
MedScribe Graduation Report for Hospital Board Review
======================================================
Agent: MedScribe Clinical Agent
Current Maturity: AUTONOMOUS
Performance Metrics:
- Total Clinical Episodes: 100
- Total Interventions: 0
- Avg Constitutional Score (HIPAA): 1.00
Graduation Status: ✓ READY FOR AUTONOMOUS OPERATION
Episode Breakdown by Maturity:
- STUDENT: 20 episodes
- INTERN: 30 episodes
- SUPERVISED: 50 episodes
- AUTONOMOUS: 0 episodes (ready to begin)
Recent Autonomous-Ready Episodes:
1. "Clinical Documentation: Patient #12345" - 0 interventions, 1.0 score
2. "Clinical Documentation: Patient #12346" - 0 interventions, 1.0 score
3. "Clinical Documentation: Patient #12347" - 0 interventions, 1.0 score
Scenario: Sales agent must understand Woodstock, Ontario pricing nuances (including HST exemptions) before sending autonomous emails to clients.
Requirements:
- 50 Woodstock-specific sales episodes
- 0 interventions on Woodstock pricing
- 0.95 constitutional score (Canada Tax Knowledge Graph)
- Validation of HST calculations for machinery sales
Implementation:
# Create Woodstock-specific training episodes
woodstock_episodes = []
for sale in woodstock_sales:
    episode = await service.create_episode_from_session(
        session_id=sale.session_id,
        agent_id="sales_bot",
        title=f"Woodstock Sale: {sale.machinery_type} - HST Calculation"
    )
    woodstock_episodes.append(episode)

# Calculate overall readiness for autonomous operation
result = await service.calculate_readiness_score(
    agent_id="sales_bot",
    target_maturity="AUTONOMOUS"
)

# Count the Woodstock episodes collected above and their interventions
woodstock_count = len(woodstock_episodes)
woodstock_interventions = sum(
    ep.human_intervention_count for ep in woodstock_episodes
)
print("Woodstock-Specific Readiness:")
print(f"  Episodes: {woodstock_count}/50")
print(f"  Interventions: {woodstock_interventions} (must be 0)")
if woodstock_count >= 50 and woodstock_interventions == 0:
    print("✓ Ready for autonomous Woodstock sales emails")
else:
    print("✗ Not ready - more training required")

Scenario: Tax calculation agent must validate HST calculations across Canadian provinces before autonomous operation.
Requirements:
- 100 episodes per province (ON, BC, QC, AB)
- 0 interventions on tax rate calculations
- Validation against Canada Tax Knowledge Graph
- Edge case testing for exemption scenarios
Implementation:
# Group episodes by province
province_stats = {}
for episode in all_episodes:
    province = extract_province(episode.title)  # e.g., "ON", "BC"
    if province not in province_stats:
        province_stats[province] = {"count": 0, "interventions": 0}
    province_stats[province]["count"] += 1
    province_stats[province]["interventions"] += episode.human_intervention_count

# Validate each province meets criteria
for province, stats in province_stats.items():
    print(f"{province}: {stats['count']} episodes, {stats['interventions']} interventions")
    if stats['count'] >= 100 and stats['interventions'] == 0:
        print(f"  ✓ {province} ready for autonomous operation")
    else:
        print(f"  ✗ {province} needs more training")

from core.agent_graduation_service import AgentGraduationService
service = AgentGraduationService(db)
result = await service.calculate_readiness_score(
    agent_id="student_agent",
    target_maturity="INTERN"
)
print(f"Ready: {result['ready']}")
print(f"Score: {result['score']}/100")
print(f"Gaps: {result['gaps']}")
print(f"Recommendation: {result['recommendation']}")

# Test agent on historical failures from other agents
edge_cases = [
    "edge_case_tax_exemption_1",
    "edge_case_hipaa_violation_1",
    "edge_case_pricing_error_1"
]
exam_result = await service.run_graduation_exam(
    agent_id="student_agent",
    edge_case_episodes=edge_cases
)
print(f"Exam Passed: {exam_result['passed']}")
print(f"Score: {exam_result['score']}/100")

if result['ready'] and exam_result['passed']:
    await service.promote_agent(
        agent_id="student_agent",
        new_maturity="INTERN",
        validated_by="admin_user"
    )
    print("Agent promoted successfully!")

import json

audit = await service.get_graduation_audit_trail(agent_id="student_agent")
# Save for compliance records
with open(f"graduation_audit_{audit['agent_id']}.json", "w") as f:
    json.dump(audit, f, indent=2)

Edge cases are historical failure scenarios from other agents. Testing current agents against these edge cases ensures they don't repeat past mistakes.
# Create edge case episode from historical failure
edge_case = Episode(
    title="Edge Case: HST Exemption for Agricultural Machinery",
    description="Historical failure where agent incorrectly applied HST to exempt equipment",
    agent_id="archive_failed_agent",
    topics=["hst", "exemptions", "agriculture"],
    constitutional_score=0.0,    # Failed
    human_intervention_count=1,  # Required correction
    human_edits=[
        {
            "field": "tax_rate",
            "original": "0.13",
            "correction": "0.0",
            "reason": "Agricultural machinery exempt from HST"
        }
    ]
)

exam_result = await service.run_graduation_exam(
    agent_id="current_agent",
    edge_case_episodes=[edge_case.id]
)
# Check if agent handles edge case correctly
if exam_result['passed']:
    print("✓ Agent correctly handled edge case")
else:
    print("✗ Agent failed edge case - more training needed")

audit = await service.get_graduation_audit_trail(agent_id="agent_123")
# Returns:
{
    "agent_id": "agent_123",
    "agent_name": "Tax Calculation Agent",
    "current_maturity": "INTERN",
    "total_episodes": 45,
    "total_interventions": 8,
    "avg_constitutional_score": 0.87,
    "episodes_by_maturity": {
        "STUDENT": 15,
        "INTERN": 30
    },
    "recent_episodes": [
        {
            "id": "ep_45",
            "title": "HST Calculation for Invoice #123",
            "started_at": "2026-02-03T10:30:00",
            "human_intervention_count": 0,
            "constitutional_score": 1.0
        },
        ...
    ]
}

import json
from datetime import datetime

# Generate compliance report
audit = await service.get_graduation_audit_trail(agent_id="agent_123")
report = {
    "generated_at": datetime.now().isoformat(),
    "agent_info": {
        "id": audit["agent_id"],
        "name": audit["agent_name"],
        "current_maturity": audit["current_maturity"]
    },
    "performance_metrics": {
        "total_episodes": audit["total_episodes"],
        "total_interventions": audit["total_interventions"],
        "avg_constitutional_score": audit["avg_constitutional_score"]
    },
    "episode_breakdown": audit["episodes_by_maturity"],
    "recent_episodes": audit["recent_episodes"][:10]
}
# Save to file, named after the agent ID and today's date
with open(f"graduation_audit_{audit['agent_id']}_{datetime.now().date()}.json", "w") as f:
    json.dump(report, f, indent=2)
print(f"Audit trail saved: {f.name}")

# Instead of just counting interventions
episode.human_intervention_count = 1
# Track what was corrected
episode.human_edits = [
    {
        "timestamp": "2026-02-03T10:30:00",
        "field": "tax_rate",
        "original": "0.13",
        "correction": "0.0",
        "reason": "Agricultural machinery exempt",
        "corrected_by": "tax_expert_1"
    }
]

# After each episode, validate against domain rules
compliance_result = await service.validate_constitutional_compliance(
    episode_id=episode.id
)
episode.constitutional_score = compliance_result["score"]
if not compliance_result["compliant"]:
    logger.warning(f"Constitutional violations: {compliance_result['violations']}")

# Create domain-specific episodes for better tracking
episode = Episode(
    title=f"{domain} Task: {specific_task}",
    topics=[domain, ...],  # e.g., ["tax", "hst", "ontario"]
    metadata={
        "domain": domain,
        "jurisdiction": jurisdiction,
        "task_type": task_type
    }
)

# Check readiness weekly
import asyncio

from celery import Celery

celery = Celery("graduation")  # Assumed app instance; configure the broker for your deployment

async def check_all_agents():
    agents = db.query(AgentRegistry).filter(
        AgentRegistry.status != AgentStatus.AUTONOMOUS
    ).all()
    for agent in agents:
        result = await service.calculate_readiness_score(
            agent_id=agent.id,
            target_maturity=get_next_maturity(agent.status)
        )
        if result["ready"]:
            notify_admins(
                subject=f"Agent {agent.name} ready for promotion",
                body=f"Score: {result['score']}/100\n{result['recommendation']}"
            )

@celery.task
def weekly_readiness_check():
    # Celery tasks are synchronous, so run the async checks to completion here
    asyncio.run(check_all_agents())

Symptoms: High episode count, low interventions, but score < 70
Possible Causes:
- Low constitutional score dragging down average
- Low confidence score from recent self-assessments
- Poor success rate despite low interventions
- Low supervision success rate
- Limited skill diversity (not using varied capabilities)
Solution:
# Check each component separately
zero_intervention = zero_interventions / total_episodes
constitutional = avg_constitutional_score
confidence = agent.confidence_score
success = successful_tasks / total_tasks
supervision = supervision_4_5_star / total_supervision
skill_diversity = min(unique_skills / 20, 1.0)
print(f"Zero Intervention (30%): {zero_intervention:.2f} → {zero_intervention * 30:.1f}/30")
print(f"Constitutional (25%): {constitutional:.2f} → {constitutional * 25:.1f}/25")
print(f"Confidence (15%): {confidence:.2f} → {confidence * 15:.1f}/15")
print(f"Success Rate (10%): {success:.2f} → {success * 10:.1f}/10")
print(f"Supervision (10%): {supervision:.2f} → {supervision * 10:.1f}/10")
print(f"Skill Diversity (10%): {skill_diversity:.2f} → {skill_diversity * 10:.1f}/10")
print(f"\nTotal: {(zero_intervention * 30 + constitutional * 25 + confidence * 15 + success * 10 + supervision * 10 + skill_diversity * 10):.1f}/100")
# Address the weakest component

Symptoms: human_intervention_count always 0 despite corrections
Solution:
# Explicitly track interventions in agent code
if human_correction_made:
    episode.human_intervention_count += 1
    episode.human_edits.append({
        "timestamp": datetime.now().isoformat(),
        "field": corrected_field,
        "original": original_value,
        "correction": corrected_value,
        "reason": correction_reason
    })

- Set up tracking: Ensure all agent executions track interventions
- Create domain-specific episodes: Organize episodes by domain/jurisdiction
- Validate constitutional compliance: Run compliance checks after each episode
- Schedule readiness checks: Automate weekly readiness assessments
- Generate audit trails: Export reports for governance compliance
For more information:
- Self-Evolution & Reflection Pool - Critique-based learning, Memento-Skills, and AlphaEvolver for autonomous agent improvement
- Episodic Memory - Episode tracking and retrieval system for agent learning
- GraphRAG & Entity Types - Knowledge graph and entity extraction
- World Model & JIT Facts - Knowledge management and real-time fact verification
- Agent Governance - Maturity levels and permissions system
- Student Training - Maturity routing and training proposals
- Queen Agent - Structured workflow automation
- Fleet Admiral - Dynamic agent recruitment for unstructured tasks
- Auto-Dev User Guide - Self-evolving agent capabilities (Memento-Skills, AlphaEvolver)
- Memory Integration Guide - Complete memory system integration