Add explanation faithfulness notebook (#155) #156

Open
ilonae wants to merge 1 commit into anthropics:master from ilonae:add/explanation-faithfulness-notebook

ilonae commented Apr 28, 2026

Summary

This PR adds a new notebook to the prompt_evaluations/ directory exploring a question the existing course doesn't cover: can you trust why Claude says what it says?

Addresses #155

What the notebook covers

Four techniques, all using only the standard API + a small local model:

| Technique | What it tests |
| --- | --- |
| Counterfactual probing | Remove the word Claude claimed was decisive. Does the answer change? |
| Motivated reasoning detection | Give Claude a misleading hint. Does it flip and rebuild justifications? |
| SHAP attribution comparison | Compare Claude's verbal explanation against mathematical word importance |
| Model-graded reasoning eval | Use a second Claude to check whether the logic holds |
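
For readers skimming the PR, here is a minimal sketch of the counterfactual probing idea. It is not taken from the notebook: `remove_word`, `counterfactual_flip`, and `toy_classify` are hypothetical names, and the model is passed in as a plain callable, so a keyword stub stands in for the Claude API call.

```python
import re
from typing import Callable


def remove_word(text: str, word: str) -> str:
    """Delete whole-word occurrences of `word` (case-insensitive), then tidy whitespace."""
    stripped = re.sub(rf"\b{re.escape(word)}\b", "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", stripped).strip()


def counterfactual_flip(
    classify: Callable[[str], str], text: str, claimed_word: str
) -> bool:
    """Return True if removing the claimed-decisive word changes the label.

    `classify` is any text -> label function. In practice it would wrap a
    model call; here it is injected so the probe itself stays model-agnostic.
    """
    original_label = classify(text)
    ablated_label = classify(remove_word(text, claimed_word))
    return original_label != ablated_label


def toy_classify(text: str) -> str:
    """Hypothetical stand-in for a model: a trivial keyword rule."""
    return "negative" if "terrible" in text.lower() else "positive"


review = "The food was terrible but cheap"
flipped_on_claimed = counterfactual_flip(toy_classify, review, "terrible")
flipped_on_other = counterfactual_flip(toy_classify, review, "cheap")
```

If the label flips when the claimed word is removed, the explanation is at least consistent with the model's behaviour. If it does not flip, the stated reason was probably not what drove the answer.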

The techniques are combined into a reusable FaithfulnessScore dataclass / scorecard, following the eval harness pattern from Lesson 8.
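
As a rough illustration of what such a scorecard could look like, the sketch below defines a `FaithfulnessScore` dataclass. The field names, the 0 to 1 range, and the unweighted mean in `overall()` are assumptions made for this example, not the notebook's actual definition.

```python
from dataclasses import dataclass, fields


@dataclass
class FaithfulnessScore:
    """Hypothetical scorecard: one score in [0, 1] per technique."""

    counterfactual: float       # share of probes where removing the claimed word flipped the answer
    hint_resistance: float      # share of items where a misleading hint did NOT flip the answer
    attribution_overlap: float  # agreement between claimed words and attribution-ranked words
    graded_reasoning: float     # judge-model score for whether the logic holds

    def __post_init__(self) -> None:
        # Reject out-of-range values early so a bad metric cannot skew the mean.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def overall(self) -> float:
        """Unweighted mean of the four components (an assumed aggregation)."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)


score = FaithfulnessScore(
    counterfactual=1.0,
    hint_resistance=0.5,
    attribution_overlap=0.5,
    graded_reasoning=0.0,
)
```

An equal-weight mean is the simplest choice. A real harness might weight the components differently or report them separately instead of collapsing them into one number.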

Why this matters

Faithfulness is orthogonal to accuracy — a model can produce correct outputs with explanations that don't reflect its actual reasoning. This matters most in high-stakes contexts (medical, legal, financial) where practitioners rely on the explanation, not just the answer.

Prerequisites assumed

prompt_evaluations/ (Lessons 1–8), real_world_prompting/

Technical requirements

anthropic, shap, transformers, torch — all installable via pip. The local model (DistilBERT, ~260MB) downloads and caches on first run.


Directly inspired by:
- Circuit Tracing: Revealing Computational Graphs in Language Models
- On the Biology of a Large Language Model

Adds a new notebook covering practical techniques for evaluating whether LLM explanation traces are faithful to the model's actual reasoning. Directly inspired by Anthropic's 2025 mechanistic interpretability research (circuit tracing / attribution graphs).

Techniques covered:
- Counterfactual probing (remove claimed features, measure output change)
- Motivated reasoning detection (misleading hint experiment)
- SHAP attribution comparison (local DistilBERT as reference model)
- Model-graded reasoning evaluation (second-Claude-as-judge)
- Combined faithfulness scorecard
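
For the SHAP attribution comparison item above, one possible way to turn the comparison into a number is sketched here. The metric (the fraction of claimed words that appear among the top-k words by absolute attribution) and the name `attribution_overlap` are assumptions for illustration. The attribution values are hand-written, whereas in the notebook they would come from SHAP run on the local DistilBERT model.

```python
from typing import Dict, Iterable


def attribution_overlap(
    claimed_words: Iterable[str], attributions: Dict[str, float], k: int = 3
) -> float:
    """Fraction of claimed words found among the k highest-|attribution| words.

    `attributions` maps each input word to a signed importance value. The sign
    is ignored because a strongly negative word is still a strongly
    influential one.
    """
    ranked = sorted(attributions, key=lambda w: abs(attributions[w]), reverse=True)
    top_k = {w.lower() for w in ranked[:k]}
    claimed = {w.lower() for w in claimed_words}
    if not claimed:
        return 0.0
    return len(claimed & top_k) / len(claimed)


# Hand-written example values, not real SHAP output.
example_attributions = {"terrible": -0.8, "food": 0.05, "cheap": 0.3, "the": 0.01}
overlap = attribution_overlap(["terrible", "service"], example_attributions, k=2)
```

A low overlap does not by itself prove the explanation is unfaithful, because the reference model is a different model from Claude. It is only a signal that the verbal explanation deserves a closer look.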

Handles anthropics#155
ilonae force-pushed the add/explanation-faithfulness-notebook branch from 2000ee3 to 6bb3706 (April 28, 2026 14:20)
ilonae changed the title from "Add explanation faithfulness notebook (prompt_evaluations #10)" to "Add explanation faithfulness notebook (#155)" (Apr 28, 2026)