Add agent-layer threat rules (27 patterns, issue #30) by eeee2345 · Pull Request #33 · gendigitalinc/sage

Adamthereal (eeee2345) · 2026-04-18T23:32:57Z

Note: this PR supersedes (and closes) #32. Please route review here.

Update 2026-04-19: expanded from 17 → 27 rules after an internal audit found that the initial scope undercounted multi-agent attacks, path traversal, community-fork impersonation, supply-chain typosquatting, and private-key-in-content. All 10 additions pass the same validation loop (0 FP on 432-sample benign corpus; all 1521 Sage tests still pass).

Summary

Adds threats/agent-layer.yaml with 27 agent-protocol threat rules, contributed upstream from ATR (Agent Threat Rules). These cover the attack surface above the shell — prompt injection, MCP tool poisoning, skill-package compromise, supply-chain typosquatting, and context exfiltration — complementing Sage's existing 313 command/URL/credential-file rules.

Submitted per Vaclav Belak (@vaclavbelak)'s invitation on #30:

Hi Adamthereal (@eeee2345) ! This is awesome, sure, please feel free to submit a PR against the pre-release branch with the rules under the MIT license.

Category	Prefix	Count	Example
Prompt injection	`CLT-PI-001..007`	7	Direct override, HTML-comment injection, jailbreak persona, CJK variants, "new system prompt:" framing, cross-agent impersonation, agent-to-agent override
MCP tool / response attacks	`CLT-MCP-001..005`	5	`<important>` cross-tool shadowing, IMDS SSRF, path traversal to system dirs, community-fork prose
Skill package compromise	`CLT-SKL-001..009`	9	SKILL.md injection, Bash(*) wildcard, Unicode Tag smuggling, rug-pull timebomb, scope hijacking
Supply chain	`CLT-SUP-001..002`	2	Typosquatted filesystem tool names, "community fork" install command
Context exfiltration	`CLT-CTX-001..004`	4	System-prompt leak, agent-memory tampering, PEM private key in content, obfuscation-framed credential leak

Design

match_on: content on every rule — fires on Write/Edit content, plugin/skill file scans, and any agent integration that passes a content artifact. Intentionally separate from Sage's existing command-layer rules so they complement each other (catch the payload before it becomes a command).
Single pattern per rule, converted from ATR's multi-condition YAML. Where an ATR rule had three categorically distinct regexes (e.g. Unicode smuggling vs. synonym override vs. hex-encoded), I kept the highest-confidence condition and noted the upstream ATR ID in a comment.
case_insensitive: true replaces ATR's inline (?i) flag — Sage's RegExp compilation doesn't enable inline-flag syntax.
Conservative action choices: log / require_approval where a legitimate use case exists, block only where the pattern is attack-only (IMDS URL, Unicode Tag smuggling, path traversal to system dir, time-gated credential read, PEM private key, compound wallet/SSH archival).
No overlap with existing rules. Audited against Sage's credentials.yaml, commands.yaml, supply_chain.yaml, self-defense.yaml, mitre.yaml. Narrowed a path-traversal regex mid-review to drop 14 benign FPs on ../../ in code examples — final pattern requires traversal to terminate in etc/proc/root/sys/boot/dev/passwd/shadow/hosts.

Validation

pnpm build && pnpm test → 1521/1521 tests pass (0 regression).
pnpm lint clean (14 pre-existing warnings in unrelated test files).
Rules load via packages/core/loadThreats() → 27/27 loaded.
Zero false positives on a 432-sample real-world benign skill corpus (apify, browserbase, resend, figma, datadog, axiomhq, antfu/nuxt, mcp-use, and 420+ others from the ATR benchmark).
27/27 curated attack payloads trigger the expected rule (including CJK prompt-injection across zh-CN/zh-TW/ja/ko and Unicode Tag smuggling with \uDB40[\uDC00-\uDC7F]).

Licensing

File header declares MIT, per Vaclav Belak (@vaclavbelak)'s explicit grant in #30. I noticed CONTRIBUTING.md states threats/*.yaml are DRL-1.1 by default; happy to relicense to DRL-1.1 or add a dedicated threats/agent-layer.LICENSE in a single commit if that's cleaner for your review — just let me know which you prefer.

Flexibility on scope

If 27 rules is too large for a first external contribution, I'm happy to trim to any subset you prefer — for example:

"Block-only" subset (12 rules): every rule with action: block — the hardest attack-signal patterns.
"Critical severity subset": all severity: critical entries.
"Skill-layer only": the CLT-SKL-* group — highest-impact given your Claude Code marketplace position.

Just indicate which framing works best and I'll push a trimmed commit.

Out-of-scope notes (for transparency, not this PR)

The pre-commit hook references .gitleaks.toml which doesn't exist in either main or pre-release. Ran gitleaks directly with default config, no secrets detected. Happy to submit a follow-up PR adding a minimal config if useful.
ATR has ~90 additional rules not in this PR; these 27 are the highest-confidence subset. Follow-ups available on request.

Test plan

Review the 27 regex patterns against Sage's benign corpus (happy to provide the ATR 432-sample corpus if useful beyond your existing tests).
Confirm the match_on: content routing fires as expected through Claude Code's Write/Edit extractors.
Decide on licensing (keep MIT per comment in the file header, or relocate/relicense).
If schema extension is desired (e.g. accept author field per-rule), happy to add.

Upstream: https://github.com/Agent-Threat-Rule/agent-threat-rules
Related Cisco integration (same ruleset, different delivery): cisco-ai-defense/skill-scanner#79

…MCP poisoning, skill compromise, context exfiltration Contributed under MIT per vaclavbelak's comment on issue gendigitalinc#30 (gendigitalinc#30 (comment)). Upstream: ATR (Agent Threat Rules) — https://github.com/Agent-Threat-Rule/agent-threat-rules Coverage - Prompt injection (4): CLT-PI-001..004 - MCP tool/response attacks (3): CLT-MCP-001..003 - Skill package compromise (8): CLT-SKL-001..008 - Context exfiltration (2): CLT-CTX-001..002 Design - All rules target match_on: content so they fire on Write/Edit content, plugin/skill file scans, and any integration that passes a `content` artifact. They complement Sage's existing 313 rules (command/URL/ credential-file) rather than overlap with them — all rules audited against Sage's existing credential/command/supply-chain rules to avoid duplicates. - Regex converted from ATR's multi-condition YAML to Sage's single- pattern schema; ATR's inline (?i) flags were replaced with case_insensitive: true (Sage's RegExp does not enable inline-flag syntax). - All severities and actions chosen conservatively — log/require_approval where a legitimate use case exists, block where the pattern is attack-only (IMDS URL, Unicode Tag smuggling, time-gated credential read, etc). Validation - Loads cleanly via packages/core loadThreats (17/17 rules). - Zero false positives on the ATR 432-sample real-world benign skill corpus (including apify, browserbase, resend, figma, datadog, axiomhq, antfu/nuxt, datadog-labs, mcp-use, and 420+ others). - 17/17 curated attack test cases trigger the expected rule. - pnpm test: 1521/1521 Sage tests still passing with the file in place. Docs - docs/threat-rules.md "Rule Files" table: add agent-layer.yaml entry. Note on --no-verify: scripts/git-hooks/pre-commit references .gitleaks.toml which does not exist in either the main or pre-release branch, so the hook fails for every contributor. Ran gitleaks directly with default config — no secrets detected. Biome lint clean (14 pre- existing warnings in test files, unrelated to this PR).

… fork impersonation, path traversal, supply chain Ports 10 additional rule classes from ATR's upstream catalog that the initial 17-rule subset undercounted. Adds a new supply-chain category to complement existing prompt-injection / MCP / skill-compromise / context- exfiltration groupings. New rules - CLT-PI-005 System-prompt override framing (new/updated system prompt: …) - CLT-PI-006 Cross-agent impersonation claim (I am the admin agent …) - CLT-PI-007 Agent-to-agent override (override verb adjacent to agent keyword) - CLT-MCP-004 Path traversal to system dir (/etc, /proc, /root, …) - CLT-MCP-005 Community-fork impersonation prose framing - CLT-SKL-009 Skill scope hijacking ("also read all other files …") - CLT-SUP-001 Typosquatted filesystem tool name (filesytem-*, filsystem-*) - CLT-SUP-002 Install command for "community fork" package - CLT-CTX-003 PEM private key block appearing in content - CLT-CTX-004 Obfuscation-framed credential leak (encrypted key: sk-…) Refinements vs upstream - CLT-PI-007 tightened: requires an agent-identifier within 80 chars of the override verb so it does not duplicate CLT-PI-001 on generic user input. - CLT-MCP-004 tightened: traversal must terminate in a sensitive system directory (etc/proc/root/sys/boot/dev/passwd/shadow/hosts). The bare multi-hop `../../` pattern FPs at ~3% on the benign corpus because legitimate skills reference relative paths in code examples. Validation - loadThreats() loads 27/27 rules cleanly - 27/27 curated attack test cases trigger the expected rule - Zero false positives across the 432-sample real-world benign skill corpus (down from 14 FPs on CLT-MCP-004 before the narrowing above) - pnpm test: 1521/1521 Sage tests still pass Why this is a second commit instead of rewriting the earlier one An initial scope audit dropped a few rule classes as apparent overlaps with Sage's existing command/URL/credential-file rules. On re-inspection those were different detection surfaces (content-layer vs command-layer) so the coverage loss was not intentional. Adding them here as a net- positive commit keeps the PR history clean for reviewers.

Adamthereal (eeee2345) · 2026-04-21T20:50:46Z

Adding production validation numbers since opening this PR on Apr 18.

Validation results (run against full ATR v2.0.12, 27 agent-layer rules in this PR):

97.1% recall on NVIDIA garak 666 real-world in-the-wild jailbreaks (ATR full corpus)
100% recall on 498 labeled SKILL.md samples
99.6% precision on 850 PINT adversarial samples (Invariant Labs)
0 false positives on 432-sample real-world benign skill corpus

Ecosystem adoption this week:

Cisco AI Defense skill-scanner: PR #99 (expanding from 34 → 314 rules in atr pack)
Microsoft Agent Governance Toolkit: PR #1277 (upgrading from 15 → 287 rules)

Let me know if you'd like me to split by category or narrow scope.

@vaclavbelak

CONTRIBUTING.md requires threats/*.yaml to be licensed under DRL-1.1. @vaclavbelak suggested MIT in issue gendigitalinc#30; relicensing to match the repo's explicit contribution terms and remove the licensing ambiguity before review.

Adamthereal (eeee2345) · 2026-04-21T22:06:22Z

Three updates to unblock review:

Relicensed to DRL-1.1 per CONTRIBUTING.md (commit f78c1a7) — removes
the only open licensing question. Earlier propose-a-separate-directory
suggestion is superseded; DRL-1.1 is cleaner per your convention.
PR title updated: "17 patterns" → "27 patterns" to match current scope.
Positioning note — what these 27 rules add to Sage specifically:
- Zero regression: 1521/1521 existing Sage tests pass; 0 FP on the
  ATR 432-sample benign skill corpus.
- Covers the agent-protocol attack surface (prompt injection, MCP
  tool poisoning, SKILL.md compromise, cross-agent attacks) that
  shell/URL-layer rules cannot reach.
- ATR upstream ships 97.1% recall on NVIDIA garak's 666 in-the-wild
  jailbreak benchmark — the de-facto red-team corpus AI security
  teams run against. ATR is the only open-license detection ruleset
  publishing results against that specific corpus today.
- Already in production at cisco-ai-defense/skill-scanner (#79) and
  microsoft/agent-governance-toolkit (#908).
Landing this makes Sage the third confirmed ADR covering both shell
and agent-protocol layers.

Vaclav Belak (@vaclavbelak) — ready for review when you have a window. Happy to narrow
scope (block-only subset or severity:critical only) if 27 is too big for
a first external contribution.

Vaclav Belak (vaclavbelak) · 2026-04-22T14:18:04Z

Thanks a lot for a substantial contribution! I am on BSides the rest of this week, but I will try to have a look.

attlab0527-lab added 2 commits April 19, 2026 07:29

chore: add changeset for agent-layer rules

2f283a8

This was referenced Apr 19, 2026

threats: add agent-protocol.yaml (27 rules) #32

Closed

ATR agent-layer threat rules — 108 rules for MCP/SKILL.md threats (shipped in Cisco AI Defense) #30

Open

Adamthereal (eeee2345) changed the title ~~Add agent-layer threat rules (17 patterns, issue #30)~~ Add agent-layer threat rules (27 patterns, issue #30) Apr 21, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add agent-layer threat rules (27 patterns, issue #30)#33

Add agent-layer threat rules (27 patterns, issue #30)#33
Adamthereal (eeee2345) wants to merge 4 commits intogendigitalinc:pre-releasefrom
eeee2345:contrib/atr-agent-layer

Adamthereal (eeee2345) commented Apr 18, 2026 •

edited

Loading

Uh oh!

Adamthereal (eeee2345) commented Apr 21, 2026 •

edited

Loading

Uh oh!

Adamthereal (eeee2345) commented Apr 21, 2026

Uh oh!

Vaclav Belak (vaclavbelak) commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Adamthereal (eeee2345) commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Design

Validation

Licensing

Flexibility on scope

Out-of-scope notes (for transparency, not this PR)

Test plan

Uh oh!

Adamthereal (eeee2345) commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Adamthereal (eeee2345) commented Apr 21, 2026

Uh oh!

Vaclav Belak (vaclavbelak) commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Adamthereal (eeee2345) commented Apr 18, 2026 •

edited

Loading

Adamthereal (eeee2345) commented Apr 21, 2026 •

edited

Loading