Supplementary material for the paper "Measuring and Guiding Monosemanticity".
The repository consists of a number of scripts that help reproduce the experiments and corresponding results of the paper.
The environment was created with the help of Poetry. To install the environment:
- Install Poetry: `pip install poetry`
- Run `poetry install` inside this folder
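For convenience, the two steps in sequence (a minimal sketch, assuming a standard Poetry workflow):

```bash
# Install Poetry, then resolve and install the project environment
# from the pyproject.toml in this folder.
pip install poetry
poetry install
```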
- `create_data`: Scripts to create the used datasets from Hugging Face or other sources
- `eval`: Evaluation scripts for the three datasets
- `gemma2_SAE`: Wrapper for the Hugging Face transformer model Gemma2
- `llama3_SAE`: Wrapper for the Hugging Face transformer model Llama3
- `train`: Train script for use on Determined clusters
- `utils`: Utilities used in the folders above
To save training and evaluation time, you can create the datasets beforehand.
The script `act_dataset.py` takes four arguments:
- The shorthand for the dataset, e.g. `SP` for the Shakespeare dataset.
- The hookpoint, e.g. `25` for the 25th block.
- The model name, in this case `llama3` for the Llama3-8B model.
- Where to hook within the specified block: `block` is the residual stream after the 25th block, `mlp` would be the output of the MLP layer.

For example:

`poetry run python ./create_data/act_dataset.py SP 25 llama3 block`

The result will be saved in `./datasets_v2`.
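The same call, annotated argument by argument (the `mlp` variant is only an illustration of the fourth argument, not necessarily a setup used in the paper):

```bash
# dataset=SP (Shakespeare), hookpoint=25 (25th block), model=llama3 (Llama3-8B),
# hook location=block (residual stream after block 25); output goes to ./datasets_v2.
poetry run python ./create_data/act_dataset.py SP 25 llama3 block

# Illustrative variant: take the output of the MLP layer of block 25 instead.
poetry run python ./create_data/act_dataset.py SP 25 llama3 mlp
```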
To train an SAE on one of the created datasets, you first need to create a config, either by looking at one of the sample configs in `./llama3_SAE/SAE_config` or by generating one with `./llama3_SAE/SAE_config/gen_config.py`.
For a Determined cluster, a sample experiment config can be found at `./train/train_SAE_24k_k2048.yaml`. There you have to enter your own config file or use one of the sample ones.
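A possible workflow on a Determined cluster (a sketch; the exact arguments of `gen_config.py` and your cluster setup may differ, and the Determined CLI is assumed to be installed):

```bash
# Generate an SAE config (check gen_config.py for its actual arguments),
# or copy one of the sample configs in ./llama3_SAE/SAE_config instead.
poetry run python ./llama3_SAE/SAE_config/gen_config.py

# Submit the training job using the sample Determined experiment config;
# make sure the SAE config you want to use is referenced inside the YAML.
det experiment create ./train/train_SAE_24k_k2048.yaml .
```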
The files are named according to what they evaluate. RTP, SP, and PII stand for the datasets, named as described in the paper. Steering, FMS, and Features correspond to the conducted experiments.
If the SAEs were trained with the included train script, it suffices to insert the checkpoint path into the evaluation scripts; a loading method for this case is provided (`utils/sae_loading.py`). Otherwise, the saved SAEs need to be modified or another loading method needs to be implemented.
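As a rough sketch of how an evaluation run looks (the file name below is hypothetical, derived from the dataset/experiment naming scheme above; check `./eval` for the actual script names):

```bash
# Hypothetical file name: a steering evaluation on the Shakespeare (SP) dataset.
# Set the SAE checkpoint path inside the script before running.
poetry run python ./eval/SP_Steering.py
```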
The repository of ICV was adapted to accommodate DiffVec and the datasets mentioned in the paper. The repository of Model Arithmetic already included PreAdd but needed modification in order to accommodate all datasets from the paper. Hyperparameters of all four methods can be found in Appendix D.
@inproceedings{harle2025monosemanticity,
title = {Measuring and Guiding Monosemanticity},
author = {Ruben H{\"a}rle and Felix Friedrich and Manuel Brack and Stephan W{\"a}ldchen and Bj{\"o}rn Deiseroth and
Patrick Schramowski and Kristian Kersting},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
note = {Spotlight}
}