Skip to content

SS-90: Add OTel telemetry instrumentation to SmartFix#139

Merged
JacobMagesHaskinsContrast merged 21 commits intorelease_candidatefrom
SS-90_otel-telemetry
Apr 13, 2026
Merged

SS-90: Add OTel telemetry instrumentation to SmartFix#139
JacobMagesHaskinsContrast merged 21 commits intorelease_candidatefrom
SS-90_otel-telemetry

Conversation

@JacobMagesHaskinsContrast
Copy link
Copy Markdown
Collaborator

@JacobMagesHaskinsContrast JacobMagesHaskinsContrast commented Apr 10, 2026

Summary

Adds OpenTelemetry (OTel) tracing to SmartFix per the AI Observability spec. Emits structured spans to any OTLP-compatible backend (disabled by default when OTEL_EXPORTER_OTLP_ENDPOINT is unset).

Span Hierarchy

smartfix-run  (root — one per GitHub Actions run)
  fix-vulnerability  (one per vulnerability processed)
    chat {model}  (one per LLM attempt inside _call_llm_with_retry)

Changes

New infrastructure

  • src/otel_provider.py — TracerProvider lifecycle: initialize, start_span(), shutdown with force_flush
  • docker-compose.otel.yml — grafana/otel-lgtm stack for local E2E verification

Instrumentation

  • src/main.pysmartfix-run root span (session.id, contrast.smartfix.vulnerabilities_total); fix-vulnerability child span per vulnerability with all contrast.finding.* and contrast.smartfix.* attributes (fix_applied, files_modified, pr_created, pr_url)
  • src/smartfix/extensions/smartfix_litellm.pychat {model} span per LLM attempt with full gen_ai.* attributes; _derive_system() maps model strings to gen_ai.system values; _log_cost_analysis() now returns token counts for span population

Config

  • src/config.py / action.yml — no new required inputs; OTel is opt-in via standard OTEL_EXPORTER_OTLP_ENDPOINT env var

Tests

  • test/test_otel_provider.py — TracerProvider lifecycle unit tests
  • test/test_otel_spans.py — 25 integration tests using InMemorySpanExporter verifying span names, attributes, and parent-child relationships
  • test/test_smartfix_litellm.pyTestDeriveSystem, TestLogCostAnalysisReturnValue, TestCallLlmWithRetryOtelSpan (15 new tests)
  • test/test_main.py — operation span request/response attribute tests
  • test/test_git_operations.pyget_staged_files_count() tests

Test plan

  • All 829 tests passing
  • Linting clean
  • Manual E2E: docker compose -f docker-compose.otel.yml up -d, set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318, run against a real repo, verify spans in Grafana Tempo at http://localhost:3000

Jira

SS-90

🤖 Generated with Claude Code

JacobMagesHaskinsContrast and others added 17 commits April 8, 2026 14:51
Add spec section references, async context propagation notes, force_flush
clarification, turn-based LLM span cardinality, and e2e testing bead (contrast-eo1).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use standard OTEL_EXPORTER_OTLP_ENDPOINT env vars (no CONTRAST_HOST derivation)
- Remove _NoOpSpan helper — use SDK built-in NoOpTracerProvider instead
- Move files_modified to operation span (per-PR, not per-run)
- Reuse TokenCostAccumulator/_log_cost_analysis() for token values in LLM spans

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add opentelemetry-sdk==1.37.0 and opentelemetry-exporter-otlp-proto-http==1.37.0
  to requirements.txt (versions match google-adk transitive constraint >=1.36.0)
- Regenerate requirements.lock
- Add GITHUB_RUN_ID optional config field (defaults to ""), used as session.id
  on the OTel tool-run span

Bead ID: contrast-37r
Tests: passing (773 tests)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New src/otel_provider.py with:
- initialize_otel(config): sets up TracerProvider when
  OTEL_EXPORTER_OTLP_ENDPOINT is configured; no-op otherwise
- Resource attributes: service.name/version, contrast.org_id, vcs.*
- OTLPSpanExporter() with no args (reads env vars automatically)
- start_span(name): thin wrapper around trace.get_tracer().start_as_current_span()
- shutdown_otel(): force_flush(2000ms) then shutdown, double-call safe

New test/test_otel_provider.py: 11 tests using OTel InMemorySpanExporter
and mocks covering all key behaviours.

Bead ID: contrast-22m
Tests: passing (784 tests)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Split main() into thin wrapper + _main_impl() to avoid re-indenting
  the 530-line body
- main() calls initialize_otel(config), registers atexit shutdown,
  wraps _main_impl() in 'smartfix-run' span
- session.id = config.GITHUB_RUN_ID set at span start
- contrast.smartfix.vulnerabilities_total set in finally (0 when no
  vulns processed; incremented past the existing-PR skip check)
- shutdown_otel() called in both finally and atexit (double-call safe)
- Added test_main_initializes_and_shuts_down_otel to test_main.py

Bead ID: contrast-3ta
Tests: passing (785 tests)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Wrap each vulnerability iteration in a 'fix-vulnerability' child span
  (child of smartfix-run), active during LiteLLM calls for context
  propagation
- Set request-side attributes immediately: contrast.finding.fingerprint,
  contrast.finding.source, contrast.finding.rule_id,
  contrast.smartfix.coding_agent, contrast.smartfix.llm_provider
- Set response-side attributes in finally: contrast.finding.language,
  contrast.smartfix.fix_applied, contrast.smartfix.files_modified,
  contrast.smartfix.pr_created, contrast.smartfix.pr_url
- Add GitOperations.get_staged_files_count() using git diff --cached
- Use except BaseException to catch SystemExit from error_exit()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add _derive_system() to map model strings to gen_ai.system values
- Refactor _log_cost_analysis() to return (input, output, cache_read, cache_write) 4-tuple
- Wrap each _call_llm_with_retry() attempt in a 'chat {model}' OTel span
- Set gen_ai.* request/response/usage attributes on each LLM span
- Record exception info and StatusCode.ERROR on retryable/terminal failures
- Move _log_cost_analysis() call from generate_content_async() into _call_llm_with_retry()
- Add 15 new tests (TestDeriveSystem, TestLogCostAnalysisReturnValue, TestCallLlmWithRetryOtelSpan)
- Fix retry tests: mock _log_cost_analysis to return 4-tuple

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add test_otel_spans.py: 25 integration tests using InMemorySpanExporter
  - TestSmartFixRunSpan: session.id, vulnerabilities_total present; files_modified/pr_created/pr_url absent
  - TestFixVulnerabilitySpan: all contrast.finding.* and contrast.smartfix.* attributes
  - TestLlmCallSpan: gen_ai.* attributes, 'chat {model}' name format, parent-child span nesting
  - Patches src.otel_provider.start_span to bypass OTel SDK's Once global-state restriction
- Fix test_smartfix_litellm_retry.py: mock _log_cost_analysis to return (0,0,0,0) 4-tuple

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ontrast-eo1)

Adds grafana/otel-lgtm stack for manually verifying SmartFix span output
without the full Contrast API. Includes instructions for running against it
and checklist of span attributes/hierarchy to verify in Tempo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tests/otel/run_otel_test.sh: orchestrates the full test
  - auto-detects Docker socket (Docker Desktop or Rancher Desktop)
  - starts grafana/otel-lgtm stack, waits for Grafana + Tempo health checks
  - runs SmartFix via act with OTEL_EXPORTER_OTLP_ENDPOINT injected (host.docker.internal:4318)
  - runs verify_spans.py after the run with --verify-only flag for re-checks
- tests/otel/verify_spans.py: standalone span verifier
  - queries Tempo HTTP API for most recent smartfix-run trace
  - asserts root span has session.id, vulnerabilities_total; lacks files_modified/pr_created/pr_url
  - asserts fix-vulnerability spans have all contrast.finding.* and contrast.smartfix.* attributes
  - asserts chat spans are children of fix-vulnerability with all gen_ai.* attributes
  - supports --wait for polling until spans appear, --trace-id for specific traces
- docker-compose.otel.yml: expose port 3200 (Tempo HTTP API) for verify_spans.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The script no longer knows about any consuming repo. It only manages the
LGTM Docker stack lifecycle (--start-only, --verify-only, --stop). The
act invocation lives in the consuming test repo's own Makefile, keeping
the dependency arrow pointing the right way.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a confirmation log so the act run log clearly shows whether
OTel was initialized and which endpoint it is exporting to.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
shutdown_otel() was called inside the `with start_span("smartfix-run")`
block, so force_flush() ran while the root span was still open. The
BatchSpanProcessor exported all completed child spans but the root span
had not ended yet. When the with-block exited and ended the root span,
the processor was already shut down and dropped it.

Fix: wrap the with-block in a try/finally so shutdown_otel() is called
after the root span has ended and been submitted to the batch processor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SmartFixLiteLlm is instantiated while fix-vulnerability is the current
OTel span. Capture that context in __init__ and pass it explicitly when
creating each 'chat <model>' span so they are always direct children of
fix-vulnerability, regardless of whatever ADK-internal spans (call_llm,
invocation, agent_run) may be active at LLM-call time.

- otel_provider.start_span() gains an optional context parameter
- SmartFixLiteLlm.__init__ captures otel_context.get_current()
- _call_llm_with_retry passes self._otel_context to start_span
- Tests updated: mock_start_span lambdas accept context=None, and
  MagicMock(spec=) instances set _otel_context = None

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@JacobMagesHaskinsContrast
Copy link
Copy Markdown
Collaborator Author

Evidence: OTel collect against my employee-management test app as viewed in Grafana...

Screenshot 2026-04-10 at 2 48 00 PM Screenshot 2026-04-10 at 2 50 07 PM Screenshot 2026-04-10 at 2 51 04 PM

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds OpenTelemetry (OTel) tracing to SmartFix (opt-in via standard OTEL env vars) and introduces test + local E2E verification infrastructure to ensure span hierarchy/attributes match the SS-90 observability spec.

Changes:

  • Introduces src/otel_provider.py to manage OTel initialization, span creation, and shutdown/flush behavior.
  • Instruments src/main.py and src/smartfix/extensions/smartfix_litellm.py to emit smartfix-run, fix-vulnerability, and chat {model} spans with required attributes.
  • Adds extensive unit/integration tests plus local verification tooling (docker-compose.otel.yml, tests/otel/*) to validate spans without external dependencies.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/otel/verify_spans.py Tempo-querying verifier script to assert span structure/attributes against the spec.
tests/otel/run_otel_test.sh Helper script to run the local Grafana/Tempo stack and execute span verification.
test/test_smartfix_litellm.py Adds tests for _derive_system, _log_cost_analysis return values, and per-attempt chat span creation.
test/test_smartfix_litellm_retry.py Adjusts retry tests to match new _log_cost_analysis contract and OTel context usage.
test/test_otel_spans.py In-memory exporter integration tests validating span names/attributes/parent-child relationships.
test/test_otel_provider.py Unit tests for provider lifecycle and “no-op when disabled” behavior.
test/test_main.py Tests for root/operation span creation and required request/response attributes.
test/test_git_operations.py Adds coverage for counting staged files.
test/test_config_integration.py Adds coverage for GITHUB_RUN_ID defaulting and env reading.
src/smartfix/extensions/smartfix_litellm.py Emits chat {model} spans per attempt; refactors cost parsing to return token counts for span attributes.
src/smartfix/domains/scm/git_operations.py Adds get_staged_files_count() used for operation-span attributes.
src/requirements.txt Adds OTel SDK + OTLP HTTP exporter deps.
src/requirements.lock Regenerates lock metadata for new direct requirements.
src/otel_provider.py New OTel provider module (init/start_span/shutdown).
src/main.py Wraps execution in smartfix-run root span and creates fix-vulnerability spans per vulnerability with attributes.
src/config.py Adds optional GITHUB_RUN_ID config field (defaults to empty string).
docker-compose.otel.yml Local Grafana/otel-lgtm stack for manual E2E verification.
.beads/issues.jsonl Records SS-90 task tracking updates.

Comment thread tests/otel/run_otel_test.sh Outdated
Comment thread src/smartfix/domains/telemetry/otel_provider.py
Comment thread src/main.py
The script runs with set -euo pipefail, so a non-zero exit from
verify_spans.py would terminate immediately before VERIFY_EXIT=$?
or the success/failure messaging could run. Use if/then/else to
capture the exit code without triggering set -e.
Copilot flagged that trace.set_tracer_provider() is a one-time global
and a pre-existing provider could cause spans to be lost. Add a design
note explaining why this cannot happen in SmartFix: ADK's OTel setup
only runs from its CLI web server (not the Runner API we use), and
LiteLLM's OTel integration requires explicit callback configuration.
vuln_count was a local int so if _main_impl() exited via
error_exit()/sys.exit() (BUILD_VERIFICATION_FAILED, GENERATE_PR_FAILURE)
after incrementing the counter, the finally block would report 0 instead
of the actual number processed. Use a mutable list so the finally block
always reads the correct value regardless of how _main_impl exits.
@JacobMagesHaskinsContrast
Copy link
Copy Markdown
Collaborator Author

traces and spans still show in Grafana after refactor and post-code-review improvements:
Screenshot 2026-04-13 at 10 23 39 AM

@JacobMagesHaskinsContrast JacobMagesHaskinsContrast merged commit 1cc0e7d into release_candidate Apr 13, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants