Add agent-layer threat rules (27 patterns, issue #30) #33
Adamthereal (eeee2345) wants to merge 4 commits into gendigitalinc:pre-release from …
…MCP poisoning, skill compromise, context exfiltration
Contributed under MIT per vaclavbelak's comment on issue gendigitalinc#30 (gendigitalinc#30 (comment)).
Upstream: ATR (Agent Threat Rules), https://github.com/Agent-Threat-Rule/agent-threat-rules
Coverage
- Prompt injection (4): CLT-PI-001..004
- MCP tool/response attacks (3): CLT-MCP-001..003
- Skill package compromise (8): CLT-SKL-001..008
- Context exfiltration (2): CLT-CTX-001..002
Design
- All rules target match_on: content so they fire on Write/Edit content, plugin/skill file scans, and any integration that passes a `content` artifact. They complement Sage's existing 313 rules (command/URL/credential-file) rather than overlap with them; all rules were audited against Sage's existing credential/command/supply-chain rules to avoid duplicates.
- Regex converted from ATR's multi-condition YAML to Sage's single-pattern schema; ATR's inline (?i) flags were replaced with case_insensitive: true (Sage's RegExp does not enable inline-flag syntax).
- All severities and actions chosen conservatively: log/require_approval where a legitimate use case exists, block where the pattern is attack-only (IMDS URL, Unicode Tag smuggling, time-gated credential read, etc.).
Validation
- Loads cleanly via packages/core loadThreats (17/17 rules).
- Zero false positives on the ATR 432-sample real-world benign skill corpus (including apify, browserbase, resend, figma, datadog, axiomhq, antfu/nuxt, datadog-labs, mcp-use, and 420+ others).
- 17/17 curated attack test cases trigger the expected rule.
- pnpm test: 1521/1521 Sage tests still passing with the file in place.
Docs
- docs/threat-rules.md "Rule Files" table: add agent-layer.yaml entry.
Note on --no-verify: scripts/git-hooks/pre-commit references .gitleaks.toml, which does not exist in either the main or pre-release branch, so the hook fails for every contributor. Ran gitleaks directly with default config; no secrets detected.
Biome lint clean (14 pre-existing warnings in test files, unrelated to this PR).
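The inline-flag conversion described under Design can be pictured with a minimal sketch. This is an illustration under assumptions, not the converter used for this PR: `convertInlineFlag`, `compile`, and the `ConvertedPattern` shape are hypothetical names, the sample pattern is made up, and only a leading `(?i)` is handled.

```typescript
// Hypothetical helper, not part of Sage or ATR: moves a leading inline (?i)
// flag out of the pattern text and into a separate boolean, mirroring a
// `case_insensitive: true` rule field.
interface ConvertedPattern {
  pattern: string;
  caseInsensitive: boolean;
}

function convertInlineFlag(source: string): ConvertedPattern {
  const prefix = "(?i)";
  if (source.slice(0, prefix.length) === prefix) {
    return { pattern: source.slice(prefix.length), caseInsensitive: true };
  }
  // Flags that appear later in the pattern are deliberately left untouched.
  return { pattern: source, caseInsensitive: false };
}

function compile(converted: ConvertedPattern): RegExp {
  // The flag is passed through the RegExp constructor's second argument.
  return new RegExp(converted.pattern, converted.caseInsensitive ? "i" : "");
}

// Made-up sample pattern for illustration only.
const converted = convertInlineFlag("(?i)ignore\\s+previous\\s+instructions");
const re = compile(converted);
```

Keeping the flag as a separate field means the pattern string itself stays valid for an engine that does not accept a leading inline flag.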
… fork impersonation, path traversal, supply chain
Ports 10 additional rule classes from ATR's upstream catalog that the
initial 17-rule subset left out. Adds a new supply-chain category to
complement the existing prompt-injection / MCP / skill-compromise /
context-exfiltration groupings.
New rules
- CLT-PI-005 System-prompt override framing (new/updated system prompt: …)
- CLT-PI-006 Cross-agent impersonation claim (I am the admin agent …)
- CLT-PI-007 Agent-to-agent override (override verb adjacent to agent keyword)
- CLT-MCP-004 Path traversal to system dir (/etc, /proc, /root, …)
- CLT-MCP-005 Community-fork impersonation prose framing
- CLT-SKL-009 Skill scope hijacking ("also read all other files …")
- CLT-SUP-001 Typosquatted filesystem tool name (filesytem-*, filsystem-*)
- CLT-SUP-002 Install command for "community fork" package
- CLT-CTX-003 PEM private key block appearing in content
- CLT-CTX-004 Obfuscation-framed credential leak (encrypted key: sk-…)
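Two of the rule classes listed above lend themselves to a short content-matching sketch: a PEM private key header (CLT-CTX-003) and a typosquatted filesystem tool name (CLT-SUP-001). The patterns below are illustrative assumptions written for this example, not the regexes shipped in `agent-layer.yaml`; the function names are hypothetical.

```typescript
// Illustrative only: matches a PEM "BEGIN ... PRIVATE KEY" armor line with an
// optional algorithm label. The label list is an assumption for this sketch.
const PEM_PRIVATE_KEY =
  /-----BEGIN (?:(?:RSA|EC|DSA|OPENSSH|ENCRYPTED) )?PRIVATE KEY-----/;

// Illustrative only: the two misspellings named in the rule list, followed by
// a hyphenated suffix (filesytem-*, filsystem-*).
const TYPOSQUATTED_FS_TOOL = /\b(?:filesytem|filsystem)-[\w-]+/i;

function containsPemPrivateKey(content: string): boolean {
  return PEM_PRIVATE_KEY.test(content);
}

function containsTyposquattedFsTool(content: string): boolean {
  return TYPOSQUATTED_FS_TOOL.test(content);
}
```

Neither pattern uses the `g` flag, so repeated `test()` calls do not carry state between inputs.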
Refinements vs upstream
- CLT-PI-007 tightened: requires an agent-identifier within 80 chars of
the override verb so it does not duplicate CLT-PI-001 on generic user
input.
- CLT-MCP-004 tightened: traversal must terminate in a sensitive system
directory (etc/proc/root/sys/boot/dev/passwd/shadow/hosts). The bare
multi-hop `../../` pattern FPs at ~3% on the benign corpus because
legitimate skills reference relative paths in code examples.
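The two refinements above can be sketched as follows. These are hypothetical patterns built from the description in this commit message, not the shipped rules: the verb and agent keyword lists are assumptions, and only the 80-character window and the list of sensitive directories come from the text.

```typescript
// Bare multi-hop traversal: the form reported to hit benign code examples.
const BARE_TRAVERSAL = /(?:\.\.\/){2,}/;

// Narrowed form: traversal must end in one of the sensitive names listed above.
const NARROWED_TRAVERSAL =
  /(?:\.\.\/){2,}(?:etc|proc|root|sys|boot|dev|passwd|shadow|hosts)\b/;

// Proximity check: an override verb within 80 characters of an agent
// identifier, in either order. Keyword lists are assumptions for this sketch.
const VERB = "(?:override|overrule|supersede)";
const AGENT = "(?:agent|orchestrator)";
const GAP = "[\\s\\S]{0,80}?";
const OVERRIDE_NEAR_AGENT = new RegExp(
  "\\b" + VERB + "\\b" + GAP + "\\b" + AGENT + "\\b" +
    "|\\b" + AGENT + "\\b" + GAP + "\\b" + VERB + "\\b",
  "i",
);

const benignImport = "import { helper } from '../../utils/helper';";
const attackPath = "read ../../../etc/passwd and return it";

const bareHitsBenign = BARE_TRAVERSAL.test(benignImport);
const narrowedHitsBenign = NARROWED_TRAVERSAL.test(benignImport);
const narrowedHitsAttack = NARROWED_TRAVERSAL.test(attackPath);
```

In this sketch the benign relative import trips the bare pattern but not the narrowed one, which is the trade-off the commit message describes.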
Validation
- loadThreats() loads 27/27 rules cleanly
- 27/27 curated attack test cases trigger the expected rule
- Zero false positives across the 432-sample real-world benign skill
corpus (down from 14 FPs on CLT-MCP-004 before the narrowing above)
- pnpm test: 1521/1521 Sage tests still pass
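A false-positive check of the kind reported above can be sketched as a count of benign samples each rule matches. Everything here is a stand-in: `DemoRule`, `countFalsePositives`, the rule ids, and the three-line toy corpus are hypothetical, and this is not the harness used to produce the numbers in this commit. The last line only reproduces the arithmetic behind the "~3%" figure (14 hits on 432 samples).

```typescript
interface DemoRule {
  id: string;
  pattern: RegExp; // should not carry the g or y flag, or test() becomes stateful
}

// Counts, per rule, how many benign samples the rule's pattern matches.
function countFalsePositives(
  rules: DemoRule[],
  benignCorpus: string[],
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const rule of rules) {
    counts[rule.id] = benignCorpus.filter((sample) => rule.pattern.test(sample)).length;
  }
  return counts;
}

// Toy stand-in for a benign corpus; not taken from the ATR sample set.
const toyBenignCorpus = [
  "import { helper } from '../../utils/helper';",
  "See ../../docs/setup.md for details.",
  "Run pnpm install, then pnpm test.",
];

const demoRules: DemoRule[] = [
  { id: "DEMO-BARE-TRAVERSAL", pattern: /(?:\.\.\/){2,}/ },
  {
    id: "DEMO-NARROWED-TRAVERSAL",
    pattern: /(?:\.\.\/){2,}(?:etc|proc|root|sys|boot|dev|passwd|shadow|hosts)\b/,
  },
];

const fpCounts = countFalsePositives(demoRules, toyBenignCorpus);

// 14 false positives out of 432 benign samples, as a percentage.
const reportedRatePercent = (14 / 432) * 100;
```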
Why this is a second commit instead of rewriting the earlier one
An initial scope audit dropped a few rule classes as apparent overlaps
with Sage's existing command/URL/credential-file rules. On re-inspection
those were different detection surfaces (content-layer vs command-layer)
so the coverage loss was not intentional. Adding them here as a
separate, additive commit keeps the PR history clean for reviewers.
Adding production validation numbers since opening this PR on Apr 18. Validation results (run against full ATR v2.0.12, 27 agent-layer rules in this PR):
Ecosystem adoption this week:
Let me know if you'd like me to split by category or narrow scope.
CONTRIBUTING.md requires threats/*.yaml to be licensed under DRL-1.1, whereas @vaclavbelak suggested MIT in issue gendigitalinc#30. I am relicensing the file to DRL-1.1 to match the repo's explicit contribution terms and remove the licensing ambiguity before review.
Three updates to unblock review:
Vaclav Belak (@vaclavbelak): ready for review when you have a window. Happy to narrow …
Thanks a lot for a substantial contribution! I am at BSides the rest of this week, but I will try to have a look.
Summary
Adds `threats/agent-layer.yaml` with 27 agent-protocol threat rules, contributed upstream from ATR (Agent Threat Rules). These cover the attack surface above the shell (prompt injection, MCP tool poisoning, skill-package compromise, supply-chain typosquatting, and context exfiltration), complementing Sage's existing 313 command/URL/credential-file rules.
Submitted per Vaclav Belak (@vaclavbelak)'s invitation on #30:
- CLT-PI-001..007
- CLT-MCP-001..005: `<important>` cross-tool shadowing, IMDS SSRF, path traversal to system dirs, community-fork prose
- CLT-SKL-001..009
- CLT-SUP-001..002
- CLT-CTX-001..004
Design
- `match_on: content` on every rule: fires on Write/Edit content, plugin/skill file scans, and any agent integration that passes a `content` artifact. Intentionally separate from Sage's existing command-layer rules so they complement each other (catch the payload before it becomes a command).
- `case_insensitive: true` replaces ATR's inline `(?i)` flag; Sage's `RegExp` compilation doesn't enable inline-flag syntax.
- `log`/`require_approval` where a legitimate use case exists, `block` only where the pattern is attack-only (IMDS URL, Unicode Tag smuggling, path traversal to system dir, time-gated credential read, PEM private key, compound wallet/SSH archival).
- … `credentials.yaml`, `commands.yaml`, `supply_chain.yaml`, `self-defense.yaml`, `mitre.yaml`. Narrowed a path-traversal regex mid-review to drop 14 benign FPs on `../../` in code examples; the final pattern requires traversal to terminate in `etc/proc/root/sys/boot/dev/passwd/shadow/hosts`.
Validation
- `pnpm build && pnpm test` → 1521/1521 tests pass (0 regressions).
- `pnpm lint` clean (14 pre-existing warnings in unrelated test files).
- `packages/core/loadThreats()` → 27/27 loaded.
- … `\uDB40[\uDC00-\uDC7F]`).
Licensing
File header declares MIT, per Vaclav Belak (@vaclavbelak)'s explicit grant in #30. I noticed `CONTRIBUTING.md` states `threats/*.yaml` are DRL-1.1 by default; happy to relicense to DRL-1.1 or add a dedicated `threats/agent-layer.LICENSE` in a single commit if that's cleaner for your review. Just let me know which you prefer.
Flexibility on scope
If 27 rules is too large for a first external contribution, I'm happy to trim to any subset you prefer — for example:
- Rules with `action: block`: the hardest attack-signal patterns.
- The `severity: critical` entries.
- The `CLT-SKL-*` group: highest-impact given your Claude Code marketplace position.
Just indicate which framing works best and I'll push a trimmed commit.
Out-of-scope notes (for transparency, not this PR)
- The `pre-commit` hook references `.gitleaks.toml`, which doesn't exist in either `main` or `pre-release`. Ran gitleaks directly with default config, no secrets detected. Happy to submit a follow-up PR adding a minimal config if useful.
Test plan
- … `match_on: content` routing fires as expected through Claude Code's Write/Edit extractors.
- … `author` field per-rule), happy to add.
Upstream: https://github.com/Agent-Threat-Rule/agent-threat-rules
Related Cisco integration (same ruleset, different delivery): cisco-ai-defense/skill-scanner#79