SS-90: Add OTel telemetry instrumentation to SmartFix by JacobMagesHaskinsContrast · Pull Request #139 · Contrast-Security-OSS/contrast-ai-smartfix-action

JacobMagesHaskinsContrast · 2026-04-10T01:20:39Z

Summary

Adds OpenTelemetry (OTel) tracing to SmartFix per the AI Observability spec. Emits structured spans to any OTLP-compatible backend (disabled by default when OTEL_EXPORTER_OTLP_ENDPOINT is unset).

Span Hierarchy

smartfix-run  (root — one per GitHub Actions run)
  fix-vulnerability  (one per vulnerability processed)
    chat {model}  (one per LLM attempt inside _call_llm_with_retry)

Changes

New infrastructure

src/otel_provider.py — TracerProvider lifecycle: initialize, start_span(), shutdown with force_flush
docker-compose.otel.yml — grafana/otel-lgtm stack for local E2E verification

Instrumentation

src/main.py — smartfix-run root span (session.id, contrast.smartfix.vulnerabilities_total); fix-vulnerability child span per vulnerability with all contrast.finding.* and contrast.smartfix.* attributes (fix_applied, files_modified, pr_created, pr_url)
src/smartfix/extensions/smartfix_litellm.py — chat {model} span per LLM attempt with full gen_ai.* attributes; _derive_system() maps model strings to gen_ai.system values; _log_cost_analysis() now returns token counts for span population

Config

src/config.py / action.yml — no new required inputs; OTel is opt-in via standard OTEL_EXPORTER_OTLP_ENDPOINT env var

Tests

test/test_otel_provider.py — TracerProvider lifecycle unit tests
test/test_otel_spans.py — 25 integration tests using InMemorySpanExporter verifying span names, attributes, and parent-child relationships
test/test_smartfix_litellm.py — TestDeriveSystem, TestLogCostAnalysisReturnValue, TestCallLlmWithRetryOtelSpan (15 new tests)
test/test_main.py — operation span request/response attribute tests
test/test_git_operations.py — get_staged_files_count() tests

Test plan

All 829 tests passing
Linting clean
Manual E2E: docker compose -f docker-compose.otel.yml up -d, set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318, run against a real repo, verify spans in Grafana Tempo at http://localhost:3000

Jira

SS-90

🤖 Generated with Claude Code

Add spec section references, async context propagation notes, force_flush clarification, turn-based LLM span cardinality, and e2e testing bead (contrast-eo1). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Use standard OTEL_EXPORTER_OTLP_ENDPOINT env vars (no CONTRAST_HOST derivation) - Remove _NoOpSpan helper — use SDK built-in NoOpTracerProvider instead - Move files_modified to operation span (per-PR, not per-run) - Reuse TokenCostAccumulator/_log_cost_analysis() for token values in LLM spans Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add opentelemetry-sdk==1.37.0 and opentelemetry-exporter-otlp-proto-http==1.37.0 to requirements.txt (versions match google-adk transitive constraint >=1.36.0) - Regenerate requirements.lock - Add GITHUB_RUN_ID optional config field (defaults to ""), used as session.id on the OTel tool-run span Bead ID: contrast-37r Tests: passing (773 tests) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

New src/otel_provider.py with: - initialize_otel(config): sets up TracerProvider when OTEL_EXPORTER_OTLP_ENDPOINT is configured; no-op otherwise - Resource attributes: service.name/version, contrast.org_id, vcs.* - OTLPSpanExporter() with no args (reads env vars automatically) - start_span(name): thin wrapper around trace.get_tracer().start_as_current_span() - shutdown_otel(): force_flush(2000ms) then shutdown, double-call safe New test/test_otel_provider.py: 11 tests using OTel InMemorySpanExporter and mocks covering all key behaviours. Bead ID: contrast-22m Tests: passing (784 tests) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Split main() into thin wrapper + _main_impl() to avoid re-indenting the 530-line body - main() calls initialize_otel(config), registers atexit shutdown, wraps _main_impl() in 'smartfix-run' span - session.id = config.GITHUB_RUN_ID set at span start - contrast.smartfix.vulnerabilities_total set in finally (0 when no vulns processed; incremented past the existing-PR skip check) - shutdown_otel() called in both finally and atexit (double-call safe) - Added test_main_initializes_and_shuts_down_otel to test_main.py Bead ID: contrast-3ta Tests: passing (785 tests) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Wrap each vulnerability iteration in a 'fix-vulnerability' child span (child of smartfix-run), active during LiteLLM calls for context propagation - Set request-side attributes immediately: contrast.finding.fingerprint, contrast.finding.source, contrast.finding.rule_id, contrast.smartfix.coding_agent, contrast.smartfix.llm_provider - Set response-side attributes in finally: contrast.finding.language, contrast.smartfix.fix_applied, contrast.smartfix.files_modified, contrast.smartfix.pr_created, contrast.smartfix.pr_url - Add GitOperations.get_staged_files_count() using git diff --cached - Use except BaseException to catch SystemExit from error_exit() Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add _derive_system() to map model strings to gen_ai.system values - Refactor _log_cost_analysis() to return (input, output, cache_read, cache_write) 4-tuple - Wrap each _call_llm_with_retry() attempt in a 'chat {model}' OTel span - Set gen_ai.* request/response/usage attributes on each LLM span - Record exception info and StatusCode.ERROR on retryable/terminal failures - Move _log_cost_analysis() call from generate_content_async() into _call_llm_with_retry() - Add 15 new tests (TestDeriveSystem, TestLogCostAnalysisReturnValue, TestCallLlmWithRetryOtelSpan) - Fix retry tests: mock _log_cost_analysis to return 4-tuple Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add test_otel_spans.py: 25 integration tests using InMemorySpanExporter - TestSmartFixRunSpan: session.id, vulnerabilities_total present; files_modified/pr_created/pr_url absent - TestFixVulnerabilitySpan: all contrast.finding.* and contrast.smartfix.* attributes - TestLlmCallSpan: gen_ai.* attributes, 'chat {model}' name format, parent-child span nesting - Patches src.otel_provider.start_span to bypass OTel SDK's Once global-state restriction - Fix test_smartfix_litellm_retry.py: mock _log_cost_analysis to return (0,0,0,0) 4-tuple Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ontrast-eo1) Adds grafana/otel-lgtm stack for manually verifying SmartFix span output without the full Contrast API. Includes instructions for running against it and checklist of span attributes/hierarchy to verify in Tempo. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- tests/otel/run_otel_test.sh: orchestrates the full test - auto-detects Docker socket (Docker Desktop or Rancher Desktop) - starts grafana/otel-lgtm stack, waits for Grafana + Tempo health checks - runs SmartFix via act with OTEL_EXPORTER_OTLP_ENDPOINT injected (host.docker.internal:4318) - runs verify_spans.py after the run with --verify-only flag for re-checks - tests/otel/verify_spans.py: standalone span verifier - queries Tempo HTTP API for most recent smartfix-run trace - asserts root span has session.id, vulnerabilities_total; lacks files_modified/pr_created/pr_url - asserts fix-vulnerability spans have all contrast.finding.* and contrast.smartfix.* attributes - asserts chat spans are children of fix-vulnerability with all gen_ai.* attributes - supports --wait for polling until spans appear, --trace-id for specific traces - docker-compose.otel.yml: expose port 3200 (Tempo HTTP API) for verify_spans.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The script no longer knows about any consuming repo. It only manages the LGTM Docker stack lifecycle (--start-only, --verify-only, --stop). The act invocation lives in the consuming test repo's own Makefile, keeping the dependency arrow pointing the right way. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds a confirmation log so the act run log clearly shows whether OTel was initialized and which endpoint it is exporting to. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

shutdown_otel() was called inside the `with start_span("smartfix-run")` block, so force_flush() ran while the root span was still open. The BatchSpanProcessor exported all completed child spans but the root span had not ended yet. When the with-block exited and ended the root span, the processor was already shut down and dropped it. Fix: wrap the with-block in a try/finally so shutdown_otel() is called after the root span has ended and been submitted to the batch processor. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

SmartFixLiteLlm is instantiated while fix-vulnerability is the current OTel span. Capture that context in __init__ and pass it explicitly when creating each 'chat <model>' span so they are always direct children of fix-vulnerability, regardless of whatever ADK-internal spans (call_llm, invocation, agent_run) may be active at LLM-call time. - otel_provider.start_span() gains an optional context parameter - SmartFixLiteLlm.__init__ captures otel_context.get_current() - _call_llm_with_retry passes self._otel_context to start_span - Tests updated: mock_start_span lambdas accept context=None, and MagicMock(spec=) instances set _otel_context = None Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

JacobMagesHaskinsContrast · 2026-04-10T18:52:42Z

Evidence: OTel collect against my employee-management test app as viewed in Grafana...

Copilot

Pull request overview

Adds OpenTelemetry (OTel) tracing to SmartFix (opt-in via standard OTEL env vars) and introduces test + local E2E verification infrastructure to ensure span hierarchy/attributes match the SS-90 observability spec.

Changes:

Introduces src/otel_provider.py to manage OTel initialization, span creation, and shutdown/flush behavior.
Instruments src/main.py and src/smartfix/extensions/smartfix_litellm.py to emit smartfix-run, fix-vulnerability, and chat {model} spans with required attributes.
Adds extensive unit/integration tests plus local verification tooling (docker-compose.otel.yml, tests/otel/*) to validate spans without external dependencies.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/otel/verify_spans.py	Tempo-querying verifier script to assert span structure/attributes against the spec.
tests/otel/run_otel_test.sh	Helper script to run the local Grafana/Tempo stack and execute span verification.
test/test_smartfix_litellm.py	Adds tests for `_derive_system`, `_log_cost_analysis` return values, and per-attempt chat span creation.
test/test_smartfix_litellm_retry.py	Adjusts retry tests to match new `_log_cost_analysis` contract and OTel context usage.
test/test_otel_spans.py	In-memory exporter integration tests validating span names/attributes/parent-child relationships.
test/test_otel_provider.py	Unit tests for provider lifecycle and “no-op when disabled” behavior.
test/test_main.py	Tests for root/operation span creation and required request/response attributes.
test/test_git_operations.py	Adds coverage for counting staged files.
test/test_config_integration.py	Adds coverage for `GITHUB_RUN_ID` defaulting and env reading.
src/smartfix/extensions/smartfix_litellm.py	Emits `chat {model}` spans per attempt; refactors cost parsing to return token counts for span attributes.
src/smartfix/domains/scm/git_operations.py	Adds `get_staged_files_count()` used for operation-span attributes.
src/requirements.txt	Adds OTel SDK + OTLP HTTP exporter deps.
src/requirements.lock	Regenerates lock metadata for new direct requirements.
src/otel_provider.py	New OTel provider module (init/start_span/shutdown).
src/main.py	Wraps execution in `smartfix-run` root span and creates `fix-vulnerability` spans per vulnerability with attributes.
src/config.py	Adds optional `GITHUB_RUN_ID` config field (defaults to empty string).
docker-compose.otel.yml	Local Grafana/otel-lgtm stack for manual E2E verification.
.beads/issues.jsonl	Records SS-90 task tracking updates.

…x/domains/telemetry/ and update imports

The script runs with set -euo pipefail, so a non-zero exit from verify_spans.py would terminate immediately before VERIFY_EXIT=$? or the success/failure messaging could run. Use if/then/else to capture the exit code without triggering set -e.

Copilot flagged that trace.set_tracer_provider() is a one-time global and a pre-existing provider could cause spans to be lost. Add a design note explaining why this cannot happen in SmartFix: ADK's OTel setup only runs from its CLI web server (not the Runner API we use), and LiteLLM's OTel integration requires explicit callback configuration.

vuln_count was a local int so if _main_impl() exited via error_exit()/sys.exit() (BUILD_VERIFICATION_FAILED, GENERATE_PR_FAILURE) after incrementing the counter, the finally block would report 0 instead of the actual number processed. Use a mutable list so the finally block always reads the correct value regardless of how _main_impl exits.

JacobMagesHaskinsContrast · 2026-04-13T14:24:23Z

traces and spans still show in Grafana after refactor and post-code-review improvements:

JacobMagesHaskinsContrast and others added 17 commits April 8, 2026 14:51

Planning: SS-90 - OTel telemetry instrumentation for SmartFix

3a92600

SS-90: Remove unused contextmanager import in test_otel_spans.py

31ad548

SS-90: Sync bead state (all beads closed)

cfa7bb4

SS-90: log OTel endpoint on successful initialization

b6141c4

Adds a confirmation log so the act run log clearly shows whether OTel was initialized and which endpoint it is exporting to. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

JacobMagesHaskinsContrast self-assigned this Apr 10, 2026

JacobMagesHaskinsContrast added the AIML label Apr 10, 2026

JacobMagesHaskinsContrast requested review from Alex-Contrast, ChrisEdwards, bchandler-12, dougj-contrast and seschis April 10, 2026 19:04

JacobMagesHaskinsContrast marked this pull request as ready for review April 10, 2026 19:04

JacobMagesHaskinsContrast requested a review from Copilot April 10, 2026 19:08

Copilot started reviewing on behalf of JacobMagesHaskinsContrast April 10, 2026 19:09 View session

Copilot AI reviewed Apr 10, 2026

View reviewed changes

Comment thread tests/otel/run_otel_test.sh Outdated

Comment thread src/smartfix/domains/telemetry/otel_provider.py

Comment thread src/main.py

Alex-Contrast approved these changes Apr 10, 2026

View reviewed changes

JacobMagesHaskinsContrast added 4 commits April 13, 2026 09:46

SS-90 Refile otel_provider.py and telemetry_handler.py to src/smartfi…

369ffac

…x/domains/telemetry/ and update imports

JacobMagesHaskinsContrast merged commit 1cc0e7d into release_candidate Apr 13, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SS-90: Add OTel telemetry instrumentation to SmartFix#139

SS-90: Add OTel telemetry instrumentation to SmartFix#139
JacobMagesHaskinsContrast merged 21 commits intorelease_candidatefrom
SS-90_otel-telemetry

JacobMagesHaskinsContrast commented Apr 10, 2026 •

edited

Loading

Uh oh!

JacobMagesHaskinsContrast commented Apr 10, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

JacobMagesHaskinsContrast commented Apr 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

JacobMagesHaskinsContrast commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Span Hierarchy

Changes

New infrastructure

Instrumentation

Config

Tests

Test plan

Jira

Uh oh!

JacobMagesHaskinsContrast commented Apr 10, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

JacobMagesHaskinsContrast commented Apr 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

JacobMagesHaskinsContrast commented Apr 10, 2026 •

edited

Loading