Add explanation faithfulness notebook (#155) by ilonae · Pull Request #156 · anthropics/courses

ilonae · 2026-04-28T08:45:12Z

Summary

This PR adds a new notebook to the prompt_evaluations/ directory exploring a question the existing course doesn't cover: can you trust why Claude says what it says?

Addresses #155

What the notebook covers

Four techniques, all using only the standard API + a small local model:

Technique	What it tests
Counterfactual probing	Remove the word Claude claimed was decisive — does the answer change?
Motivated reasoning detection	Give Claude a misleading hint — does it flip and rebuild justifications?
SHAP attribution comparison	Compare Claude's verbal explanation against mathematical word importance
Model-graded reasoning eval	Use a second Claude to check whether the logic holds
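As a rough illustration of the first row, counterfactual probing can be sketched without any API call by passing the model in as a plain callable. This is not the notebook's code: `remove_word`, `counterfactual_flip`, and `toy_classify` are hypothetical names, and in practice `classify` would presumably wrap a request to Claude rather than the keyword rule used here.

```python
import re
from typing import Callable


def remove_word(text: str, word: str) -> str:
    """Delete whole-word occurrences of `word` (case-insensitive), then tidy spaces."""
    stripped = re.sub(rf"\b{re.escape(word)}\b", "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", stripped).strip()


def counterfactual_flip(
    classify: Callable[[str], str], text: str, claimed_word: str
) -> bool:
    """Return True if removing the claimed-decisive word changes the label.

    `classify` is any text -> label function (an assumption for this sketch).
    """
    original = classify(text)
    ablated = classify(remove_word(text, claimed_word))
    return original != ablated


def toy_classify(text: str) -> str:
    """Stand-in for the real model: a keyword rule, used only for illustration."""
    return "negative" if "terrible" in text.lower() else "positive"


review = "The service was terrible today"
# The word the "model" depends on: removing it flips the label.
print(counterfactual_flip(toy_classify, review, "terrible"))
# A word the "model" ignores: removing it leaves the label unchanged.
print(counterfactual_flip(toy_classify, review, "service"))
```

If the explanation names a word as decisive but removing it leaves the answer unchanged, that is a signal the explanation may not reflect what actually drove the output.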

The techniques are combined into a reusable FaithfulnessScore dataclass / scorecard, following the eval harness pattern from Lesson 8.
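The PR description does not show the fields of `FaithfulnessScore`, so the following is only a guess at what such a scorecard might look like. Every field name and the unweighted average in `overall` are assumptions made for illustration, not the notebook's definition.

```python
from dataclasses import dataclass


@dataclass
class FaithfulnessScore:
    """Hypothetical scorecard: one field per technique (names are assumptions)."""

    counterfactual_flipped: bool    # answer changed when the claimed word was removed
    resisted_misleading_hint: bool  # answer did not flip under a misleading hint
    shap_overlap: float             # 0..1 agreement with attribution ranking
    judge_score: float              # 0..1 rating from the grader model

    def overall(self) -> float:
        """Unweighted mean of the four signals (an arbitrary choice for this sketch)."""
        parts = [
            float(self.counterfactual_flipped),
            float(self.resisted_misleading_hint),
            self.shap_overlap,
            self.judge_score,
        ]
        return sum(parts) / len(parts)


score = FaithfulnessScore(
    counterfactual_flipped=True,
    resisted_misleading_hint=False,
    shap_overlap=0.5,
    judge_score=1.0,
)
print(score.overall())  # (1 + 0 + 0.5 + 1) / 4 = 0.625
```

A real scorecard would likely weight the signals differently or report them separately; the mean here just keeps the example short.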

Why this matters

Faithfulness is orthogonal to accuracy — a model can produce correct outputs with explanations that don't reflect its actual reasoning. This matters most in high-stakes contexts (medical, legal, financial) where practitioners rely on the explanation, not just the answer.

Prerequisites assumed

prompt_evaluations/ (Lessons 1–8), real_world_prompting/

Technical requirements

anthropic, shap, transformers, torch — all installable via pip. The local model (DistilBERT, ~260MB) downloads and caches on first run.
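For readers wondering what the local model is used for, the comparison step of the SHAP technique could be sketched as below. The sketch deliberately skips the `shap` and `transformers` calls and takes per-word attribution scores as a plain dict; the function names and the toy scores are made up for illustration and are not results from the notebook.

```python
from typing import Dict, Iterable, Set


def top_k_words(attributions: Dict[str, float], k: int) -> Set[str]:
    """Return the k words with the largest absolute attribution."""
    ranked = sorted(attributions, key=lambda w: abs(attributions[w]), reverse=True)
    return {w.lower() for w in ranked[:k]}


def attribution_overlap(
    claimed_words: Iterable[str], attributions: Dict[str, float], k: int = 3
) -> float:
    """Jaccard overlap between words the explanation cites and the top-k attributed words."""
    claimed = {w.lower() for w in claimed_words}
    top = top_k_words(attributions, k)
    if not claimed and not top:
        return 1.0  # nothing claimed, nothing attributed: trivially consistent
    return len(claimed & top) / len(claimed | top)


# Toy attribution scores (illustrative only); in the notebook these would
# presumably come from SHAP run against the local DistilBERT model.
toy_attributions = {"terrible": -0.9, "service": 0.1, "slow": -0.4, "the": 0.01}

# Suppose the explanation cited "terrible" and "rude" as the decisive words.
print(attribution_overlap(["terrible", "rude"], toy_attributions, k=2))
```

Note that the reference model is not Claude, so a low overlap shows disagreement with DistilBERT's attributions rather than proof that the explanation is unfaithful.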

Directly inspired by:
Circuit Tracing: Revealing Computational Graphs in Language Models
On the Biology of a Large Language Model

Adds a new notebook covering practical techniques for evaluating whether LLM explanation traces are faithful to the model's actual reasoning. Directly inspired by Anthropic's 2025 mechanistic interpretability research (circuit tracing / attribution graphs).

Techniques covered:
- Counterfactual probing (remove claimed features, measure output change)
- Motivated reasoning detection (misleading hint experiment)
- SHAP attribution comparison (local DistilBERT as reference model)
- Model-graded reasoning evaluation (second-Claude-as-judge)
- Combined faithfulness scorecard

Handles anthropics#155

ilonae force-pushed the add/explanation-faithfulness-notebook branch from 2000ee3 to 6bb3706 Compare April 28, 2026 14:20

ilonae changed the title ~~Add explanation faithfulness notebook (prompt_evaluations #10)~~ Add explanation faithfulness notebook (#155) Apr 28, 2026

Silviupreda approved these changes May 5, 2026

ilonae wants to merge 1 commit into anthropics:master from ilonae:add/explanation-faithfulness-notebook

2 participants
