A real-world OpenEnv environment for AI agents that manage a startup founder's operations inbox.
Most AI agent benchmarks are unrealistic.
InboxOps simulates real operational chaos:
- Investor pressure
- Customer outages
- Legal deadlines
- Inbox overload
This is not a toy problem — it’s decision-making under pressure.
What makes it strong:
- Deterministic + heuristic grading
- Partial credit scoring
- SLA-driven urgency modeling
- Multi-step agent reasoning
- Deployable via Docker + HuggingFace Spaces
| Email | From | Stakes |
|---|---|---|
| email_001 | Tier-1 VC | IC meeting Friday |
| email_002 | Enterprise customer | 🚨 Production outage |
| email_003 | Newsletter | Noise |
| email_004 | BigTech BD | Distribution deal |
| email_005 | Mom | Personal |
| email_007 | Paying customer | Compliance issue |
| email_010 | Enterprise client | Contract renewal |

Other emails in a scenario cover:
- Bugs
- Refunds
- Press deadlines
- Internal conflicts
| Task | Difficulty | Goal |
|---|---|---|
| Email Classification | Easy | Categorize emails |
| Priority Management | Medium | Add urgency + routing |
| Full Ops Triage | Hard | End-to-end decision + reply |
Categories: investor · customer_support · partnership · personal · newsletter · notification · spam · press · internal · operational · customer_feedback · sales
Priorities (response SLA): critical (≤30m) · high (≤2h) · medium (≤8h) · low
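
The priority windows above can be turned into concrete reply deadlines. The following is a minimal sketch, not the environment's own implementation: `SLA_WINDOWS`, `response_deadline`, and `within_sla` are hypothetical names, and treating `low` as having no deadline is an assumption, since the list gives no window for it.

```python
from datetime import datetime, timedelta
from typing import Optional

# Response windows per priority, copied from the list above.
# "low" has no stated window, so it maps to None (an assumption).
SLA_WINDOWS = {
    "critical": timedelta(minutes=30),
    "high": timedelta(hours=2),
    "medium": timedelta(hours=8),
    "low": None,
}


def response_deadline(received_at: datetime, priority: str) -> Optional[datetime]:
    """Return the latest acceptable reply time, or None if there is no deadline."""
    window = SLA_WINDOWS[priority]
    return None if window is None else received_at + window


def within_sla(received_at: datetime, replied_at: datetime, priority: str) -> bool:
    """True if the reply landed on or before the deadline for this priority."""
    deadline = response_deadline(received_at, priority)
    return deadline is None or replied_at <= deadline
```

For example, a `high` email received at 09:00 would have to be answered by 11:00 under these windows.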
```text
total = 0.25 × classification
      + 0.15 × priority
      + 0.20 × routing
      + 0.10 × action
      + 0.20 × draft_quality
      + 0.10 × sla_compliance
      - penalties
```
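
As a rough illustration of the formula, the sketch below computes the weighted total from per-component scores. The weights come from the formula above; the function name `total_score`, the assumption that each component lies in [0, 1], and the clamping of the result to [0, 1] are illustrative choices, not confirmed behavior of `graders.py`.

```python
# Component weights from the scoring formula; they sum to 1.0.
WEIGHTS = {
    "classification": 0.25,
    "priority": 0.15,
    "routing": 0.20,
    "action": 0.10,
    "draft_quality": 0.20,
    "sla_compliance": 0.10,
}


def total_score(components: dict, penalties: float = 0.0) -> float:
    """Weighted sum of component scores minus penalties.

    Missing components count as 0.0. Clamping to [0, 1] is an assumption:
    the formula does not say how penalties interact with the lower bound.
    """
    weighted = sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
    return max(0.0, min(1.0, weighted - penalties))


example = {
    "classification": 1.0,
    "priority": 1.0,
    "routing": 0.5,
    "action": 1.0,
    "draft_quality": 0.5,
    "sla_compliance": 0.0,
}
print(round(total_score(example, penalties=0.05), 2))  # 0.65
```

Here the agent classifies and prioritizes correctly, gets half credit on routing and draft quality, misses the SLA, and takes a 0.05 penalty: 0.25 + 0.15 + 0.10 + 0.10 + 0.10 + 0.00 - 0.05 = 0.65.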
| Score | Meaning |
|---|---|
| 0.0–0.3 | Poor classification |
| 0.3–0.6 | Decent routing |
| 0.6–1.0 | Strong execution |
⚙️ Action Format
```json
{
  "action_type": "classify",
  "email_id": "email_001",
  "category": "investor",
  "priority": "critical",
  "escalation_team": "founder",
  "suggested_action": "reply_immediately",
  "draft_body": "Hi, I’ll send the deck by Thursday...",
  "reply_tone": "professional_warm"
}
```
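
An agent may want to validate an action before submitting it. The sketch below parses the JSON payload into a dataclass and checks the category and priority against the documented values. The `Action` class and `parse_action` function are hypothetical and are not the definitions in `models.py`; which fields are required versus optional is also an assumption.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Allowed values as documented in this README.
CATEGORIES = {
    "investor", "customer_support", "partnership", "personal", "newsletter",
    "notification", "spam", "press", "internal", "operational",
    "customer_feedback", "sales",
}
PRIORITIES = {"critical", "high", "medium", "low"}


@dataclass
class Action:
    # Assumed required fields.
    action_type: str
    email_id: str
    category: str
    priority: str
    # Assumed optional fields (easier tasks may not need them).
    escalation_team: Optional[str] = None
    suggested_action: Optional[str] = None
    draft_body: Optional[str] = None
    reply_tone: Optional[str] = None


def parse_action(raw: str) -> Action:
    """Parse a JSON action; unknown or missing keys raise TypeError."""
    action = Action(**json.loads(raw))
    if action.category not in CATEGORIES:
        raise ValueError(f"unknown category: {action.category!r}")
    if action.priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {action.priority!r}")
    return action
```

Rejecting malformed actions locally keeps a typo in a category name from being scored as a misclassification.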
🏗️ Project Structure
```text
inboxops/
├── models.py
├── env.py
├── graders.py
├── inference.py
├── app.py
├── openenv.yaml
├── Dockerfile
├── requirements.txt
├── README.md
└── data/
```
⚡ Quickstart
1. Clone Repo
```bash
git clone https://github.com/your-org/inboxops
cd inboxops
pip install -r requirements.txt
```
2. Run UI
```bash
python app.py
```
3. Run Agent
```bash
TASK=hard SCENARIO_ID=scenario_001 python inference.py
```
🐳 Docker
```bash
docker build -t inboxops .
docker run -p 7860:7860 inboxops
```
🧪 Python Usage
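```python
from env import InboxOpsEnv

env = InboxOpsEnv()
obs = env.reset()

# Step until the environment reports the episode is finished.
while not obs.done:
    action = ...  # choose an action for the current observation
    obs, reward, done, info = env.step(action)

print(env.episode_summary())
```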
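
The loop above leaves the choice of action open. One simple way to produce actions is a keyword policy, sketched below. Everything here is hypothetical: the `RULES` table, the `keyword_policy` function, the keyword-to-category mapping, and the assumption that an email's subject and body are available as plain strings. It is not the heuristic baseline reported in this README, and it does not use the real observation type.

```python
# (keywords, category, priority, escalation team); the first matching rule wins.
# These rules are illustrative guesses, not ground truth from the dataset.
RULES = [
    (("outage", "incident"), "customer_support", "critical", "engineering"),
    (("contract", "renewal"), "sales", "high", "legal"),
    (("investor", "term sheet"), "investor", "high", "founder"),
    (("unsubscribe", "newsletter"), "newsletter", "low", None),
]


def keyword_policy(email_id: str, subject: str, body: str) -> dict:
    """Build a classify action from simple keyword matches."""
    text = f"{subject} {body}".lower()
    for keywords, category, priority, team in RULES:
        if any(keyword in text for keyword in keywords):
            break
    else:
        # No rule matched: fall back to a neutral default (an assumption).
        category, priority, team = "operational", "medium", None
    return {
        "action_type": "classify",
        "email_id": email_id,
        "category": category,
        "priority": priority,
        "escalation_team": team,
    }
```

A policy like this gives a quick sanity check that an agent loop runs end to end before a model is plugged in.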
📊 Baseline Scores
| Agent | Score | Grade |
|---|---|---|
| Random | 0.08 | F |
| Heuristic | 0.51 | C |
| Claude Sonnet | ~0.74 | B |
📏 SLA Policies
| Situation | Time | Team |
|---|---|---|
| Outage | 15m | engineering |
| Contract | 60m | legal |
| Investor | 240m | founder |
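
These policies can be used to decide which email to handle first. The sketch below orders a queue by the time left before each SLA is breached. The `SlaPolicy` type, the lowercase situation keys, and the helper names are assumptions for illustration; how the environment itself models urgency is not shown in this README.

```python
from typing import List, NamedTuple, Tuple


class SlaPolicy(NamedTuple):
    minutes: int  # allowed response time
    team: str     # team the email is routed to


# Values copied from the SLA policy table above.
SLA_POLICIES = {
    "outage": SlaPolicy(15, "engineering"),
    "contract": SlaPolicy(60, "legal"),
    "investor": SlaPolicy(240, "founder"),
}


def minutes_remaining(situation: str, minutes_waiting: int) -> int:
    """Minutes left before the SLA is breached (negative if already breached)."""
    return SLA_POLICIES[situation].minutes - minutes_waiting


def most_urgent_first(queue: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Sort (situation, minutes_waiting) pairs by least time remaining."""
    return sorted(queue, key=lambda item: minutes_remaining(*item))


queue = [("investor", 200), ("outage", 5), ("contract", 55)]
print(most_urgent_first(queue))
# [('contract', 55), ('outage', 5), ('investor', 200)]
```

Note that the contract email comes first here even though outages have the tightest window: it has 5 minutes left, against 10 for the outage and 40 for the investor email.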
🤝 Contributing
Run the tests:
```bash
pytest tests/
```
To add a scenario:
1. Update `data/inbox_scenarios.json`
2. Add ground truth
3. Update `openenv.yaml`
📜 License
MIT License © InboxOps