Supplementary material for the paper "Measuring and Guiding Monosemanticity".
The repository consists of a number of scripts that help reproduce the experiments and corresponding results of the paper.
The environment was created with the help of Poetry. To install the environment:
- Install Poetry: `pip install poetry`
- Run `poetry install` inside this folder
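For convenience, the two steps in sequence (a minimal sketch, assuming a standard Poetry workflow):

```bash
# Install Poetry, then resolve and install the project environment
# from the pyproject.toml in this folder.
pip install poetry
poetry install
```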
- `create_data`: Scripts to create the used datasets from Hugging Face or other sources
- `eval`: Evaluation scripts for the three datasets
- `gemma2_SAE`: Wrapper for the Hugging Face transformer model Gemma2
- `llama3_SAE`: Wrapper for the Hugging Face transformer model Llama3
- `train`: Train script for use on Determined clusters
- `utils`: Utilities used in the folders above
To save training and evaluation time, you can create the datasets beforehand.
The script `act_dataset.py` takes four arguments:
- The shorthand for the dataset, e.g. `SP` for the Shakespeare dataset.
- The hookpoint, e.g. `25` for the 25th block.
- The model name, in this case `llama3` for the Llama3-8B model.
- Where to hook within the specified block: `block` is the residual stream after the 25th block, `mlp` would be the output of the MLP layer.

For example:

`poetry run python ./create_data/act_dataset.py SP 25 llama3 block`

The result will be saved in `./datasets_v2`.
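The same call, annotated argument by argument (the `mlp` variant is only an illustration of the fourth argument, not necessarily a setup used in the paper):

```bash
# dataset=SP (Shakespeare), hookpoint=25 (25th block), model=llama3 (Llama3-8B),
# hook location=block (residual stream after block 25); output goes to ./datasets_v2.
poetry run python ./create_data/act_dataset.py SP 25 llama3 block

# Illustrative variant: take the output of the MLP layer of block 25 instead.
poetry run python ./create_data/act_dataset.py SP 25 llama3 mlp
```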
To train an SAE on one of the created datasets, you first need to create a config, either by looking at one of the sample configs in `./llama3_SAE/SAE_config` or by generating one with `./llama3_SAE/SAE_config/gen_config.py`.
For a Determined cluster, a sample experiment config can be found at `./train/train_SAE_24k_k2048.yaml`. There you have to enter your own config file or use one of the sample ones.
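A possible workflow on a Determined cluster (a sketch; the exact arguments of `gen_config.py` and your cluster setup may differ, and the Determined CLI is assumed to be installed):

```bash
# Generate an SAE config (check gen_config.py for its actual arguments),
# or copy one of the sample configs in ./llama3_SAE/SAE_config instead.
poetry run python ./llama3_SAE/SAE_config/gen_config.py

# Submit the training job using the sample Determined experiment config;
# make sure the SAE config you want to use is referenced inside the YAML.
det experiment create ./train/train_SAE_24k_k2048.yaml .
```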
The files are named according to what they evaluate. RTP, SP, and PII stand for the datasets, named as described in the paper. Steering, FMS, and Features correspond to the conducted experiments.
If the SAEs were trained with the included train script, it suffices to insert the checkpoint path into the evaluation scripts; a loading method for this case is provided (`utils/sae_loading.py`). Otherwise, the saved SAEs need to be modified or another loading method needs to be implemented.
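As a rough sketch of how an evaluation run looks (the file name below is hypothetical, derived from the dataset/experiment naming scheme above; check `./eval` for the actual script names):

```bash
# Hypothetical file name: a steering evaluation on the Shakespeare (SP) dataset.
# Set the SAE checkpoint path inside the script before running.
poetry run python ./eval/SP_Steering.py
```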
The repository of ICV was adapted to accommodate DiffVec and the datasets mentioned in the paper. The repository of Model Arithmetic already included PreAdd but needed modification in order to accommodate all datasets from the paper. Hyperparameters of all four methods can be found in Appendix D.
@inproceedings{harle2025monosemanticity,
title = {Measuring and Guiding Monosemanticity},
author = {Ruben H{\"a}rle and Felix Friedrich and Manuel Brack and Stephan W{\"a}ldchen and Bj{\"o}rn Deiseroth and
Patrick Schramowski and Kristian Kersting},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
note = {Spotlight}
}