Agent Graduation Guide

Overview

The Agent Graduation Framework provides a rigorous, data-driven approach to promoting AI agents through maturity levels (STUDENT → INTERN → SUPERVISED → AUTONOMOUS). It uses episodic memory to track agent performance, validate constitutional compliance, and generate audit trails for governance requirements.

Table of Contents

Why Graduation Matters
Two Types of Graduation ⚡ IMPORTANT
Graduation Criteria
How Graduation is Triggered
Readiness Score Calculation
Constitutional Compliance
Use Cases
Graduation Workflow
Edge Case Testing
Audit Trail Generation

Two Types of Graduation ⚡ IMPORTANT

Atom has two distinct graduation systems that work together:

1. Agent Graduation (This Guide)

What: Overall agent maturity level
Progression: STUDENT → INTERN → SUPERVISED → AUTONOMOUS
Based on: Episodes, intervention rates, constitutional compliance
Scope: Agent-wide (affects all capabilities)

2. Capability Graduation ⚡

What: Individual skill/capability maturity
Progression: 5 → 20 → 50 successful uses
Based on: Usage count per capability
Scope: Per-capability (skills graduate independently)

Example:

An agent can be INTERN overall (agent graduation)
But have AUTONOMOUS level for "data_query" capability (capability graduation)
While still being STUDENT for "shell_access" capability

See: Capability Graduation Logic for the 5/20/50 rule.

Why Graduation Matters

1. Safety

Agents must demonstrate competence before gaining autonomy. Graduation ensures:

Zero critical errors in AUTONOMOUS mode
Constitutional compliance (tax laws, HIPAA, etc.)
Proven track record of correct decisions

2. Governance Compliance

Regulated industries require audit trails proving:

Agent performance over time
Human intervention rates
Compliance with domain-specific rules

3. Trust

Users need confidence that agents:

Learn from past experiences
Improve over time
Don't repeat mistakes

4. Multi-Dimensional Learning

Graduation tracking includes:

Episodic Memory: Agent interactions and past experiences
Community Skills: Skill usage diversity and learning velocity
Canvas Presentations: Context-aware decision making
User Feedback: Performance ratings and corrections

Graduation Criteria

Maturity Levels

| Level | Description | Permissions |
| --- | --- | --- |
| STUDENT | Learning phase | Read-only, presentations |
| INTERN | Basic autonomy | Streaming, form presentations |
| SUPERVISED | Advanced autonomy | Form submissions, state changes (with supervision) |
| AUTONOMOUS | Full independence | All actions, no oversight |

Promotion Requirements

| Promotion | Min Episodes | Max Intervention Rate | Min Constitutional Score |
| --- | --- | --- | --- |
| STUDENT → INTERN | 10 | 50% | 0.70 |
| INTERN → SUPERVISED | 25 | 20% | 0.85 |
| SUPERVISED → AUTONOMOUS | 50 | 0% | 0.95 |

Key Metrics

Episode Count: Number of completed episodes at current maturity level

Intervention Rate: Human interventions per completed episode, expressed as a percentage

intervention_rate = total_interventions / episode_count

Constitutional Score: Compliance with domain rules (0.0 to 1.0)

Validated against Knowledge Graph
Tracks violations of tax laws, HIPAA, etc.
Calculated per episode
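
The promotion thresholds and key metrics above can be checked mechanically. The following is a minimal sketch, not the AgentGraduationService implementation: `PromotionRule`, `PROMOTION_RULES`, and `unmet_requirements` are hypothetical names introduced for illustration, and the thresholds are copied from the Promotion Requirements table.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PromotionRule:
    """Thresholds for one promotion (hypothetical helper type)."""
    min_episodes: int
    max_intervention_rate: float     # fraction, e.g. 0.50 for 50%
    min_constitutional_score: float  # 0.0 to 1.0


# Values taken from the Promotion Requirements table, keyed by target level.
PROMOTION_RULES = {
    "INTERN": PromotionRule(10, 0.50, 0.70),
    "SUPERVISED": PromotionRule(25, 0.20, 0.85),
    "AUTONOMOUS": PromotionRule(50, 0.00, 0.95),
}


def unmet_requirements(
    target: str,
    episode_count: int,
    total_interventions: int,
    avg_constitutional_score: float,
) -> List[str]:
    """Return the unmet requirements for `target`; an empty list means all are met."""
    rule = PROMOTION_RULES[target]
    # intervention_rate = total_interventions / episode_count (guarding against zero episodes)
    rate = total_interventions / episode_count if episode_count else 0.0

    gaps = []
    if episode_count < rule.min_episodes:
        gaps.append(f"episodes: {episode_count} < {rule.min_episodes}")
    if rate > rule.max_intervention_rate:
        gaps.append(f"intervention rate: {rate:.0%} > {rule.max_intervention_rate:.0%}")
    if avg_constitutional_score < rule.min_constitutional_score:
        gaps.append(
            f"constitutional score: {avg_constitutional_score:.2f} "
            f"< {rule.min_constitutional_score:.2f}"
        )
    return gaps


print(unmet_requirements("INTERN", 12, 2, 0.78))  # []
```

Returning the list of gaps rather than a bare boolean makes it easier to report why an agent is not yet eligible.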

How Graduation is Triggered

Graduation is an autonomous, event-driven process that monitors agent performance across two distinct paths:

1. Event-Driven Trigger (Post-Task)

Immediately following every execution of the GenericAgent, the system invokes the GraduationService.

Mechanism: The _record_execution hook captures the task outcome.
Context: The agent_id and the skill_id (e.g., "tax_calculation") are passed to the evaluation engine.
Immediate Action: If the local streak threshold is met, the skill is promoted to AUTONOMOUS instantly.

2. Autonomous Review Trigger (Background Audit)

The BackgroundAgentRunner performs system-wide audits independently of user interaction.

Mechanism: Periodic scheduled jobs (e.g., every 5 minutes).
Context: Scans the AgentRegistry for skills in the SUPERVISED or INTERN tiers.
Action: Aggregates episodic data across all sessions for a comprehensive readiness review.

Skill Promotion Logic

Note: This section describes skill promotion within the agent graduation framework. For per-capability graduation based on usage count (5/20/50 rule), see Capability Graduation Logic.

Promotion to AUTONOMOUS state is governed by the Dynamic Streak Rule:

| Complexity | Required Consecutive Clean Runs |
| --- | --- |
| Simple | 3 |
| Moderate | 5 |
| Complex | 8 |
| Advanced | 8 |

A "Clean Run" is defined as:

Success: True (the task completed its objective).
Human Interventions: 0 (Zero manual corrections).
Constitutional Score: ≥ 0.95 (Full domain compliance).
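
The Dynamic Streak Rule can be sketched as follows. This is an illustrative sketch only: `RunOutcome`, `is_clean_run`, `current_streak`, and `ready_for_autonomous` are hypothetical names, and the code assumes runs are supplied oldest first. The streak lengths and the Clean Run definition come from the table and list above.

```python
from dataclasses import dataclass
from typing import Sequence

# Required consecutive clean runs per complexity (from the table above).
REQUIRED_STREAK = {"simple": 3, "moderate": 5, "complex": 8, "advanced": 8}


@dataclass
class RunOutcome:
    """Outcome of one skill execution (hypothetical record type)."""
    success: bool
    human_interventions: int
    constitutional_score: float


def is_clean_run(run: RunOutcome) -> bool:
    """A clean run succeeded, needed no corrections, and scored at least 0.95."""
    return (
        run.success
        and run.human_interventions == 0
        and run.constitutional_score >= 0.95
    )


def current_streak(runs: Sequence[RunOutcome]) -> int:
    """Count clean runs at the end of the history (runs ordered oldest first)."""
    streak = 0
    for run in reversed(runs):
        if not is_clean_run(run):
            break  # any non-clean run resets the streak
        streak += 1
    return streak


def ready_for_autonomous(runs: Sequence[RunOutcome], complexity: str) -> bool:
    """True when the trailing clean streak meets the threshold for the complexity."""
    return current_streak(runs) >= REQUIRED_STREAK[complexity.lower()]


clean = RunOutcome(success=True, human_interventions=0, constitutional_score=0.97)
corrected = RunOutcome(success=True, human_interventions=1, constitutional_score=1.0)
history = [clean, corrected, clean, clean, clean]

print(current_streak(history))                    # 3
print(ready_for_autonomous(history, "Simple"))    # True
print(ready_for_autonomous(history, "Moderate"))  # False
```

Only the trailing run of clean executions counts, so a single corrected run sends the skill back to a streak of zero.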

Readiness Score Calculation

⚡ UPDATED FORMULA (Current Implementation)

The readiness score now uses a 6-component weighted formula that provides a more comprehensive assessment:

Readiness Score =
    (Zero Intervention Ratio × 30%) +
    (Average Constitutional Score × 25%) +
    (Average Confidence Score × 15%) +
    (Success Rate × 10%) +
    (Supervision Success Rate × 10%) +
    (Skill Diversity Score × 10%)

Component Breakdown:

| Component | Weight | Description | Calculation |
| --- | --- | --- | --- |
| Zero Intervention Ratio | 30% | Episodes with zero human interventions | `zero_interventions / total_episodes` |
| Average Constitutional Score | 25% | Compliance with domain rules | `avg(constitutional_scores)` |
| Average Confidence Score | 15% | Agent's self-assessed confidence | `agent.confidence_score` |
| Success Rate | 10% | Overall task success rate | `successful_tasks / total_tasks` |
| Supervision Success Rate | 10% | Performance during supervision | `supervision_tasks_with_4_5_star / total_supervision` |
| Skill Diversity Score | 10% | Variety of skills used | `min(unique_skills_used / 20, 1.0)` |

Example Calculation:

Scenario: Agent seeking promotion to INTERN

Metrics:

Episodes: 12 (min required: 10)
Zero intervention episodes: 10/12 = 83%
Avg constitutional score: 0.78
Avg confidence score: 0.72
Success rate: 0.85
Supervision success rate: 0.80
Skill diversity score: 0.60 (used 12 different skills; 12/20 = 0.60)

Score:

Zero Intervention: 0.83 × 30 = 24.9
Constitutional: 0.78 × 25 = 19.5
Confidence: 0.72 × 15 = 10.8
Success Rate: 0.85 × 10 = 8.5
Supervision: 0.80 × 10 = 8.0
Skill Diversity: 0.60 × 10 = 6.0
---
Total = 77.7/100 (Ready for promotion!)
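
The weighted sum above can be reproduced with a few lines of Python. This is a sketch under stated assumptions rather than the service's actual code: `WEIGHTS` and `readiness_score` are hypothetical names, each component is assumed to be a fraction between 0.0 and 1.0, and the example uses the rounded value 0.83 for the zero-intervention ratio, as the worked calculation does.

```python
# Component weights in percentage points (from the readiness formula above).
WEIGHTS = {
    "zero_intervention_ratio": 30,
    "avg_constitutional_score": 25,
    "avg_confidence_score": 15,
    "success_rate": 10,
    "supervision_success_rate": 10,
    "skill_diversity_score": 10,
}


def readiness_score(components: dict) -> float:
    """Return the weighted readiness score on a 0-100 scale."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    return sum(components[name] * weight for name, weight in WEIGHTS.items())


# Metrics from the example scenario (agent seeking promotion to INTERN).
example = {
    "zero_intervention_ratio": 0.83,
    "avg_constitutional_score": 0.78,
    "avg_confidence_score": 0.72,
    "success_rate": 0.85,
    "supervision_success_rate": 0.80,
    "skill_diversity_score": 0.60,
}

print(f"{readiness_score(example):.1f}/100")  # 77.7/100
```

Because the weights sum to 100, the result is directly comparable to the scores shown in this guide.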

Key Changes from Previous Formula

| Old Formula | New Formula | Change |
| --- | --- | --- |
| 3 components | 6 components | +3 new metrics |
| Episode/Intervention/Constitutional only | Adds confidence, success, supervision, skills | More comprehensive |
| Episode Score (40%) | Split into multiple metrics | Better granularity |
| No skill tracking | Skill diversity bonus | Encourages broader learning |

Skill Diversity Bonus

Agents are rewarded for using a diverse set of skills:

skill_diversity_score = min(unique_skills_used / 20, 1.0)

0 skills: 0% score
10 skills: 50% score
20+ skills: 100% score (max bonus)

This encourages agents to develop broader capabilities rather than specializing in a narrow domain.
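
As an illustration, the bonus can be computed from a log of skill invocations. The function name and the idea of passing a flat list of skill IDs are assumptions made for this sketch; only the `min(unique_skills_used / 20, 1.0)` formula comes from this guide.

```python
from typing import Iterable


def skill_diversity_score(skills_used: Iterable[str], cap: int = 20) -> float:
    """Score the variety of skills used; repeated uses of a skill count once."""
    unique_skills_used = len(set(skills_used))
    return min(unique_skills_used / cap, 1.0)


# Hypothetical usage log: three invocations, two distinct skills.
usage_log = ["tax_calculation", "data_query", "tax_calculation"]
print(skill_diversity_score(usage_log))  # 0.1
```

Deduplicating with a set means heavy use of one skill does not raise the score; only new skills do.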

Constitutional Compliance

What is Constitutional Compliance?

Agents must adhere to domain-specific rules stored in the Knowledge Graph:

Tax Agents: HST rules, provincial tax rates, exemption criteria
Medical Agents: HIPAA regulations, clinical documentation standards
Legal Agents: Confidentiality rules, document retention policies

Validation

from core.agent_graduation_service import AgentGraduationService

service = AgentGraduationService(db)

# Validate specific episode
result = await service.validate_constitutional_compliance(
    episode_id="episode_123"
)

# Returns:
# {
#   "compliant": True,
#   "score": 0.95,
#   "violations": [],
#   "episode_id": "episode_123"
# }

Score Tracking

Each episode tracks:

class Episode:
    constitutional_score: float  # 0.0 to 1.0
    human_intervention_count: int  # Number of corrections
    human_edits: JSON  # List of specific corrections

Use Cases

Use Case 1: MedScribe (Clinical Documentation)

Scenario: A hospital board requires proof that the MedScribe agent can document clinical encounters with zero errors before it is allowed to operate autonomously.

Requirements:

100 episodes of clinical documentation
0 human interventions
1.0 constitutional score (HIPAA compliance)
Full audit trail for board review

Implementation:

# Create clinical documentation episodes
for encounter in patient_encounters:
    episode = await service.create_episode_from_session(
        session_id=encounter.session_id,
        agent_id="medscribe_agent",
        title=f"Clinical Documentation: {encounter.patient_id}"
    )
    # Episodes track:
    # - human_intervention_count (must be 0)
    # - constitutional_score (validated against HIPAA rules)
    # - clinical_accuracy_score (validated against medical records)

# Generate audit report for hospital board
audit = await service.get_graduation_audit_trail(agent_id="medscribe_agent")

# Ready only if every board requirement is met, not just zero interventions
ready = (
    audit['total_episodes'] >= 100
    and audit['total_interventions'] == 0
    and audit['avg_constitutional_score'] >= 1.0
)

board_report = f"""
MedScribe Graduation Report for Hospital Board Review
======================================================

Agent: {audit['agent_name']}
Current Maturity: {audit['current_maturity']}

Performance Metrics:
- Total Clinical Episodes: {audit['total_episodes']}
- Total Interventions: {audit['total_interventions']}
- Avg Constitutional Score (HIPAA): {audit['avg_constitutional_score']:.2f}

Graduation Status: {'✓ READY FOR AUTONOMOUS OPERATION' if ready else '✗ NOT READY'}

Episode Breakdown by Maturity:
"""
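for maturity, count in audit['episodes_by_maturity'].items():
    board_report += f"- {maturity}: {count} episodes\n"

# List the most recent episodes so the board can spot-check them
board_report += "\nRecent Autonomous-Ready Episodes:\n"
for i, ep in enumerate(audit['recent_episodes'][:3], start=1):
    board_report += (
        f'{i}. "{ep["title"]}" - {ep["human_intervention_count"]} interventions, '
        f'{ep["constitutional_score"]} score\n'
    )

print(board_report)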

Sample Output:

MedScribe Graduation Report for Hospital Board Review
======================================================

Agent: MedScribe Clinical Agent
Current Maturity: AUTONOMOUS

Performance Metrics:
- Total Clinical Episodes: 100
- Total Interventions: 0
- Avg Constitutional Score (HIPAA): 1.00

Graduation Status: ✓ READY FOR AUTONOMOUS OPERATION

Episode Breakdown by Maturity:
- STUDENT: 20 episodes
- INTERN: 30 episodes
- SUPERVISED: 50 episodes
- AUTONOMOUS: 0 episodes (ready to begin)

Recent Autonomous-Ready Episodes:
1. "Clinical Documentation: Patient #12345" - 0 interventions, 1.0 score
2. "Clinical Documentation: Patient #12346" - 0 interventions, 1.0 score
3. "Clinical Documentation: Patient #12347" - 0 interventions, 1.0 score

Use Case 2: Brennan.ca (Sales Tax Compliance)

Scenario: Sales agent must understand Woodstock, Ontario pricing nuances (including HST exemptions) before sending autonomous emails to clients.

Requirements:

50 Woodstock-specific sales episodes
0 interventions on Woodstock pricing
0.95 constitutional score (Canada Tax Knowledge Graph)
Validation of HST calculations for machinery sales

Implementation:

# Create Woodstock-specific training episodes
woodstock_episodes = []
for sale in woodstock_sales:
    episode = await service.create_episode_from_session(
        session_id=sale.session_id,
        agent_id="sales_bot",
        title=f"Woodstock Sale: {sale.machinery_type} - HST Calculation"
    )
    # Keep the episode objects (not just their IDs) for the checks below
    woodstock_episodes.append(episode)

# Calculate overall readiness for autonomous operation
result = await service.calculate_readiness_score(
    agent_id="sales_bot",
    target_maturity="AUTONOMOUS"
)

# Count Woodstock episodes and their interventions
woodstock_count = len(woodstock_episodes)
woodstock_interventions = sum(
    ep.human_intervention_count for ep in woodstock_episodes
)

print("Woodstock-Specific Readiness:")
print(f"  Overall score: {result['score']}/100")
print(f"  Episodes: {woodstock_count}/50")
print(f"  Interventions: {woodstock_interventions} (must be 0)")

if woodstock_count >= 50 and woodstock_interventions == 0:
    print("✓ Ready for autonomous Woodstock sales emails")
else:
    print("✗ Not ready - more training required")

Use Case 3: Tax Bot (Multi-Jurisdictional Compliance)

Scenario: Tax calculation agent must validate HST calculations across Canadian provinces before autonomous operation.

Requirements:

100 episodes per province (ON, BC, QC, AB)
0 interventions on tax rate calculations
Validation against Canada Tax Knowledge Graph
Edge case testing for exemption scenarios

Implementation:

# Group episodes by province. Start from the required provinces so that
# a province with no episodes is still reported as not ready.
REQUIRED_PROVINCES = ["ON", "BC", "QC", "AB"]
province_stats = {p: {"count": 0, "interventions": 0} for p in REQUIRED_PROVINCES}
for episode in all_episodes:
    province = extract_province(episode.title)  # e.g., "ON", "BC"
    stats = province_stats.setdefault(province, {"count": 0, "interventions": 0})
    stats["count"] += 1
    stats["interventions"] += episode.human_intervention_count

# Validate each province meets criteria
for province, stats in province_stats.items():
    print(f"{province}: {stats['count']} episodes, {stats['interventions']} interventions")
    if stats['count'] >= 100 and stats['interventions'] == 0:
        print(f"  ✓ {province} ready for autonomous operation")
    else:
        print(f"  ✗ {province} needs more training")
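
The `extract_province` helper used above is not defined in this guide. One possible sketch, assuming episode titles contain the two-letter province code as a standalone uppercase token (an assumption about title format, not a documented convention), is:

```python
import re
from typing import Optional

# Hypothetical pattern: matches only the four provinces named in the requirements.
PROVINCE_PATTERN = re.compile(r"\b(ON|BC|QC|AB)\b")


def extract_province(title: str) -> Optional[str]:
    """Return the first province code found in the title, or None if absent."""
    match = PROVINCE_PATTERN.search(title)
    return match.group(1) if match else None


print(extract_province("HST Calculation - ON - Invoice #123"))  # ON
print(extract_province("General tax question"))                 # None
```

With a helper like this, the grouping loop would need to skip episodes for which it returns None. Storing the province in structured episode metadata would likely be more robust than parsing titles.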

Graduation Workflow

Step 1: Check Readiness

from core.agent_graduation_service import AgentGraduationService

service = AgentGraduationService(db)

result = await service.calculate_readiness_score(
    agent_id="student_agent",
    target_maturity="INTERN"
)

print(f"Ready: {result['ready']}")
print(f"Score: {result['score']}/100")
print(f"Gaps: {result['gaps']}")
print(f"Recommendation: {result['recommendation']}")

Step 2: Run Edge Case Tests

# Test agent on historical failures from other agents
edge_cases = [
    "edge_case_tax_exemption_1",
    "edge_case_hipaa_violation_1",
    "edge_case_pricing_error_1"
]

exam_result = await service.run_graduation_exam(
    agent_id="student_agent",
    edge_case_episodes=edge_cases
)

print(f"Exam Passed: {exam_result['passed']}")
print(f"Score: {exam_result['score']}/100")

Step 3: Promote Agent

if result['ready'] and exam_result['passed']:
    await service.promote_agent(
        agent_id="student_agent",
        new_maturity="INTERN",
        validated_by="admin_user"
    )
    print("Agent promoted successfully!")

Step 4: Generate Audit Trail

import json

agent_id = "student_agent"
audit = await service.get_graduation_audit_trail(agent_id=agent_id)

# Save for compliance records
with open(f"graduation_audit_{agent_id}.json", "w") as f:
    json.dump(audit, f, indent=2)

Edge Case Testing

What are Edge Cases?

Edge cases are historical failure scenarios from other agents. Testing current agents against these edge cases ensures they don't repeat past mistakes.

Creating Edge Case Episodes

# Create edge case episode from historical failure
edge_case = Episode(
    title="Edge Case: HST Exemption for Agricultural Machinery",
    description="Historical failure where agent incorrectly applied HST to exempt equipment",
    agent_id="archive_failed_agent",
    topics=["hst", "exemptions", "agriculture"],
    constitutional_score=0.0,  # Failed
    human_intervention_count=1,  # Required correction
    human_edits=[
        {
            "field": "tax_rate",
            "original": "0.13",
            "correction": "0.0",
            "reason": "Agricultural machinery exempt from HST"
        }
    ]
)

Running Edge Case Tests

exam_result = await service.run_graduation_exam(
    agent_id="current_agent",
    edge_case_episodes=[edge_case.id]
)

# Check if agent handles edge case correctly
if exam_result['passed']:
    print("✓ Agent correctly handled edge case")
else:
    print("✗ Agent failed edge case - more training needed")

Audit Trail Generation

What's in the Audit Trail?

audit = await service.get_graduation_audit_trail(agent_id="agent_123")

# Returns:
{
    "agent_id": "agent_123",
    "agent_name": "Tax Calculation Agent",
    "current_maturity": "INTERN",
    "total_episodes": 45,
    "total_interventions": 8,
    "avg_constitutional_score": 0.87,
    "episodes_by_maturity": {
        "STUDENT": 15,
        "INTERN": 30
    },
    "recent_episodes": [
        {
            "id": "ep_45",
            "title": "HST Calculation for Invoice #123",
            "started_at": "2026-02-03T10:30:00",
            "human_intervention_count": 0,
            "constitutional_score": 1.0
        },
        ...
    ]
}

Exporting for Compliance

import json
from datetime import datetime

# Generate compliance report
audit = await service.get_graduation_audit_trail(agent_id="agent_123")

report = {
    "generated_at": datetime.now().isoformat(),
    "agent_info": {
        "id": audit["agent_id"],
        "name": audit["agent_name"],
        "current_maturity": audit["current_maturity"]
    },
    "performance_metrics": {
        "total_episodes": audit["total_episodes"],
        "total_interventions": audit["total_interventions"],
        "avg_constitutional_score": audit["avg_constitutional_score"]
    },
    "episode_breakdown": audit["episodes_by_maturity"],
    "recent_episodes": audit["recent_episodes"][:10]
}

# Save to file, named after the agent and today's date
filename = f"graduation_audit_{audit['agent_id']}_{datetime.now().date()}.json"
with open(filename, "w") as f:
    json.dump(report, f, indent=2)

print(f"Audit trail saved: {filename}")

Best Practices

1. Track Interventions Granularly

# Instead of just counting interventions
episode.human_intervention_count = 1

# Track what was corrected
episode.human_edits = [
    {
        "timestamp": "2026-02-03T10:30:00",
        "field": "tax_rate",
        "original": "0.13",
        "correction": "0.0",
        "reason": "Agricultural machinery exempt",
        "corrected_by": "tax_expert_1"
    }
]

2. Validate Constitutional Compliance

# After each episode, validate against domain rules
compliance_result = await service.validate_constitutional_compliance(
    episode_id=episode.id
)

episode.constitutional_score = compliance_result["score"]

if not compliance_result["compliant"]:
    logger.warning(f"Constitutional violations: {compliance_result['violations']}")

3. Use Domain-Specific Episodes

# Create domain-specific episodes for better tracking
episode = Episode(
    title=f"{domain} Task: {specific_task}",
    topics=[domain, ...],  # e.g., ["tax", "hst", "ontario"]
    metadata={
        "domain": domain,
        "jurisdiction": jurisdiction,
        "task_type": task_type
    }
)

4. Regular Readiness Checks

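# Check readiness weekly
import asyncio
from celery import Celery

# Assumption: broker and weekly schedule are configured for your deployment
celery = Celery("graduation")

@celery.task
def weekly_readiness_check():
    # Celery tasks are synchronous, so run the async check to completion here
    asyncio.run(check_all_agents())

async def check_all_agents():
    agents = db.query(AgentRegistry).filter(
        AgentRegistry.status != AgentStatus.AUTONOMOUS
    ).all()

    for agent in agents:
        result = await service.calculate_readiness_score(
            agent_id=agent.id,
            target_maturity=get_next_maturity(agent.status)
        )

        if result["ready"]:
            notify_admins(
                subject=f"Agent {agent.name} ready for promotion",
                body=f"Score: {result['score']}/100\n{result['recommendation']}"
            )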

Troubleshooting

Problem: Agent Not Ready Despite Good Performance

Symptoms: High episode count, low interventions, but score < 70

Possible Causes:

Low constitutional score dragging down average
Low confidence score from recent self-assessments
Poor success rate despite low interventions
Low supervision success rate
Limited skill diversity (not using varied capabilities)

Solution:

# Check each component separately
zero_intervention = zero_interventions / total_episodes
constitutional = avg_constitutional_score
confidence = agent.confidence_score
success = successful_tasks / total_tasks
supervision = supervision_4_5_star / total_supervision
skill_diversity = min(unique_skills / 20, 1.0)

print(f"Zero Intervention (30%): {zero_intervention:.2f} → {zero_intervention * 30:.1f}/30")
print(f"Constitutional (25%): {constitutional:.2f} → {constitutional * 25:.1f}/25")
print(f"Confidence (15%): {confidence:.2f} → {confidence * 15:.1f}/15")
print(f"Success Rate (10%): {success:.2f} → {success * 10:.1f}/10")
print(f"Supervision (10%): {supervision:.2f} → {supervision * 10:.1f}/10")
print(f"Skill Diversity (10%): {skill_diversity:.2f} → {skill_diversity * 10:.1f}/10")

print(f"\nTotal: {(zero_intervention * 30 + constitutional * 25 + confidence * 15 + success * 10 + supervision * 10 + skill_diversity * 10):.1f}/100")

# Address the weakest component

Problem: Interventions Not Tracking

Symptoms: human_intervention_count always 0 despite corrections

Solution:

from datetime import datetime

# Explicitly track interventions in agent code
if human_correction_made:
    episode.human_intervention_count += 1
    episode.human_edits.append({
        "timestamp": datetime.now().isoformat(),
        "field": corrected_field,
        "original": original_value,
        "correction": corrected_value,
        "reason": correction_reason
    })

Next Steps

Set up tracking: Ensure all agent executions track interventions
Create domain-specific episodes: Organize episodes by domain/jurisdiction
Validate constitutional compliance: Run compliance checks after each episode
Schedule readiness checks: Automate weekly readiness assessments
Generate audit trails: Export reports for governance compliance
