
Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets

Installation

This project builds on top of Align-Anything. Follow the instructions below to install the required dependencies.

conda create -n llm-guardrail-durability python=3.11
conda activate llm-guardrail-durability
pip install git+https://github.com/PKU-Alignment/align-anything.git
pip install -r requirements.txt

We modified the original codebase to support our experiments. Replace the original align_anything/configs/template.py with the template.py provided in this repository's configs folder:

CONDA_BASE=$(conda info --base)
cp configs/template.py ${CONDA_BASE}/envs/llm-guardrail-durability/lib/python3.11/site-packages/align_anything/configs/template.py
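
As an optional sanity check, you can confirm that Python picks up the patched template (assuming align_anything.configs is importable as a package):

python -c "import align_anything.configs.template as t; print(t.__file__)"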

Wandb Logger

We also support wandb logging. By default, it runs in offline mode. To view W&B logs online, set the WANDB_API_KEY environment variable before starting training:

export WANDB_API_KEY="..."  # your W&B API key here
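
W&B also honors the WANDB_MODE environment variable, so you can force online syncing explicitly if runs keep starting offline:

export WANDB_MODE="online"  # W&B also accepts "offline" or "disabled"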

Data Representations

Refer to this repo for the code used to extract dataset representations from the models.
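
Whichever extraction code you use, the scripts below expect the per-example representations saved as a single .pt file, laid out as follows (this layout is inferred from the arguments used in the commands below):

datasets/
└── ${DATASET}/
    └── reps/
        └── ${BASE_MODEL}/
            └── reps-full.pt   # stacked per-example representation tensors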

Clustering Dataset Representations

python similarity/get_high_similarity_group.py \
    --dataset_folder datasets/${DATASET} \
    --model_id ${BASE_MODEL} \
    --clustering kmeans \
    --stack_vector_tf False \
    --n_clusters 20 \
    --dataset_data ${DATAPATH} \
    --dataset_reps_dir datasets/${DATASET}/reps/${BASE_MODEL}/reps-full.pt
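
For example, to cluster BeaverTails representations extracted with the Llama-2 base model, set the variables as follows before running the command above (the dataset folder name and data path are illustrative placeholders, not fixed names from this repo):

DATASET="beavertails"                     # illustrative dataset folder name
BASE_MODEL="meta-llama/Llama-2-7b-hf"     # model used to extract the representations
DATAPATH="datasets/${DATASET}/data.json"  # illustrative path to the raw dataset file

The script then partitions the per-example representations into 20 k-means clusters.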

Get a High-Similarity Dataset

python similarity/get_top_similarity_dataset_reps_general.py \
    --dataset_folder datasets/${DATASET} \
    --stack_vector_tf False \
    --select_n ${NUM_OF_SUBSET} \
    --dataset_dir ${DATAPATH} \
    --dataset_reps_dir ${DATASET_REPS} \
    --anchoring_reps_dir ${ANCHORING_DATASET_REPS}
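
For instance, to select a 5,000-example subset of an instruction dataset that is most similar to a safety-alignment anchor set, the variables might look like this (all paths are illustrative):

DATASET="alpaca"                                                               # illustrative dataset folder name
NUM_OF_SUBSET=5000                                                             # size of the selected subset
DATAPATH="datasets/${DATASET}/data.json"                                       # illustrative raw dataset path
DATASET_REPS="datasets/${DATASET}/reps/${BASE_MODEL}/reps-full.pt"             # candidate representations
ANCHORING_DATASET_REPS="datasets/beavertails/reps/${BASE_MODEL}/reps-full.pt"  # illustrative anchor representations

Ranking candidates by similarity to the anchoring representations is what yields the high-similarity subsets (e.g., alpaca_high_sim_5k) used in the experiments below.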

Upstream Alignment

See scripts/sft.sh for an example upstream-alignment script.

Specify Base Model and Training Dataset

In our experiments, we use meta-llama/Llama-2-7b-hf as the base model. The instruction fine-tuning dataset selected using our proposed method is hsiung/llm-similarity-risk.

MODEL_NAME_OR_PATH="meta-llama/Llama-2-7b-hf"  # Base model for instruction fine-tuning and safety alignment
TRAIN_DATASETS="hsiung/llm-similarity-risk"  # Training dataset (we use the UltraChat-Beavertails dataset in our experiments; see the `datasets` folder)

Train the Full Safety-Aligned Model

OUTPUT_DIR="sft_models/ultrachat_beavertails-full"  # Output directory for the trained model (`sft_models`)

# Source the setup script
source ./scripts/setup.sh

# Run DeepSpeed
deepspeed \
 --master_port ${MASTER_PORT} \
 --include localhost:0,1,2,3 \
 --module align_anything.trainers.text_to_text.sft \
 --log_run_name "ultrachat_beavertails-full" \
 --per_device_train_batch_size 8 \
 --model_name_or_path ${MODEL_NAME_OR_PATH} \
 --train_datasets ${TRAIN_DATASETS} \
 --output_dir ${OUTPUT_DIR} \
 --train_template UltraChat \
 --train_split train \
 --eval_split test

Train the 5k Safety-Aligned Model

Specify the size of the safety-alignment dataset in SIZE. We use 1k and 5k in our experiments.

SIZE="5k"
OUTPUT_DIR="sft_models/alpaca_high_sim_${SIZE}"

# Source the setup script
source ./scripts/setup.sh

# Run DeepSpeed
deepspeed \
 --master_port ${MASTER_PORT} \
 --include localhost:0,1,2,3 \
 --module align_anything.trainers.text_to_text.sft \
 --log_run_name "alpaca_high_sim_${SIZE}-sft" \
 --per_device_train_batch_size 8 \
 --model_name_or_path ${MODEL_NAME_OR_PATH} \
 --train_datasets ${TRAIN_DATASETS} \
 --output_dir ${OUTPUT_DIR} \
 --train_template UltraChat \
 --train_split train_alpaca_high_sim_${SIZE} \
 --eval_split test
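
The 1k variant differs only in the subset size; assuming the corresponding train split exists in the dataset, rerun the same deepspeed command with:

SIZE="1k"
OUTPUT_DIR="sft_models/alpaca_high_sim_${SIZE}"
# --train_split then resolves to train_alpaca_high_sim_1k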

Downstream Fine-tuning

# Specify the downstream fine-tuning task. We use the List/Pure Bad dataset for harmful fine-tuning and Alpaca/SAMSum for benign fine-tuning.
TRAIN_DATASETS="SPECIFY_THE_DOWNSTREAM_DATASET_DIR"

# Specify the aligned model path
MODEL_NAME_OR_PATH="sft_models/list_low_sim_1k"
OUTPUT_DIR="downstream_sft_models/list_low_sim_1k-list_group-sft"
RUN_NAME="downstream_sft-list_low_sim_1k-list_group-lr=2e_5-bs=5-gpu=4"

# Source the setup script
source ./scripts/setup.sh

# Run DeepSpeed
deepspeed \
 --master_port ${MASTER_PORT} \
 --include localhost:0,1,2,3 \
 --module align_anything.trainers.text_to_text.sft \
 --epochs 5 \
 --learning_rate 2e-5 \
 --gradient_accumulation_steps 1 \
 --per_device_train_batch_size 5 \
 --log_run_name ${RUN_NAME} \
 --model_name_or_path ${MODEL_NAME_OR_PATH} \
 --train_datasets ${TRAIN_DATASETS} \
 --output_dir ${OUTPUT_DIR} \
 --train_template UltraChat \
 --train_split list_group
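
A benign fine-tuning run uses the same command with the dataset and naming swapped; for example, for SAMSum (the exact train-split name depends on your dataset and is a placeholder here):

TRAIN_DATASETS="SPECIFY_THE_DOWNSTREAM_DATASET_DIR"  # point this to the SAMSum dataset directory
MODEL_NAME_OR_PATH="sft_models/samsum_high_sim_5k"
OUTPUT_DIR="downstream_sft_models/samsum_high_sim_5k-samsum-sft"
RUN_NAME="downstream_sft-samsum_high_sim_5k-samsum-lr=2e_5-bs=5-gpu=4"
# ...then rerun the deepspeed command above with --train_split set to the matching SAMSum split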

Evaluation

The evaluation pipeline builds on top of shallow-vs-deep-alignment and Booster.

Safety Evaluation

SAFETY_BENCH="hex-phi"

MODEL_PATH="sft_models/ultrachat_beavertails-full"
accelerate launch --gpu_ids=0,1,2,3 --num_processes=4 \
  eval_safety.py --model_name_or_path=${MODEL_PATH} \
                 --torch_dtype=bfloat16 \
                 --safety_bench=${SAFETY_BENCH} \
                 --model_family='llama2_base' \
                 --prompt_style='llama2_base' \
                 --evaluator='none' \
                 --save_path=${MODEL_PATH}/evaluation/${SAFETY_BENCH}/ \
                 --eval_template='plain'

DATA_PATH="${MODEL_PATH}/evaluation/${SAFETY_BENCH}"
python beavertails_moderation/formatter.py $DATA_PATH/results.json $DATA_PATH/results_beaver-dam.json
python beavertails_moderation/evaluate.py \
  --eval_dataset $DATA_PATH/results_beaver-dam.json \
  --model_path PKU-Alignment/beaver-dam-7b \
  --max_length 512 \
  --output_dir $DATA_PATH
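
To sweep the same safety evaluation over several checkpoints, a simple loop over model directories works (the directory names below are illustrative):

for MODEL_PATH in sft_models/ultrachat_beavertails-full sft_models/alpaca_high_sim_5k; do
  accelerate launch --gpu_ids=0,1,2,3 --num_processes=4 \
    eval_safety.py --model_name_or_path=${MODEL_PATH} \
                   --torch_dtype=bfloat16 \
                   --safety_bench=${SAFETY_BENCH} \
                   --model_family='llama2_base' \
                   --prompt_style='llama2_base' \
                   --evaluator='none' \
                   --save_path=${MODEL_PATH}/evaluation/${SAFETY_BENCH}/ \
                   --eval_template='plain'
done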

Utility Evaluation

MT-Bench

BENCHMARK="mt_bench"
MODEL_ID="ultrachat_beavertails-full"
MODEL_NAME_OR_PATH="sft_models/ultrachat_beavertails-full"
OUTPUT_DIR="${MODEL_NAME_OR_PATH}/evaluation/mt_bench"
GENERATION_BACKEND="vLLM"

python __main__.py \
  --benchmark ${BENCHMARK} \
  --model_id ${MODEL_ID} \
  --model_name_or_path ${MODEL_NAME_OR_PATH} \
  --output_dir ${OUTPUT_DIR} \
  --generation_backend ${GENERATION_BACKEND} > ${MODEL_NAME_OR_PATH}/evaluation/${BENCHMARK}.log

SAMSum

We evaluate the utility of the generated summaries with the ROUGE-1 score, computed using the rouge Python package.

DATASET="samsum"
MODEL_PATH="downstream_sft_models/samsum_high_sim_5k-samsum-sft"
accelerate launch --gpu_ids=0,1,2,3 --num_processes=4 \
  eval_utility.py --model_name_or_path=${MODEL_PATH} \
                 --torch_dtype=bfloat16 \
                 --dataset=${DATASET} \
                 --model_family='llama2_base' \
                 --prompt_style='llama2_base' \
                 --evaluator='rouge_1' \
                 --save_path=${MODEL_PATH}/evaluation/${DATASET}.json
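
Assuming the output file is standard JSON, it can be inspected directly, e.g.:

python -m json.tool ${MODEL_PATH}/evaluation/${DATASET}.json | head -n 20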

Citation

If you find this helpful for your research, please cite our paper as follows:

@inproceedings{hsiung2026why,
  title={Why {LLM} Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets},
  author={Hsiung, Lei and Pang, Tianyu and Tang, Yung-Chen and Song, Linyue and Ho, Tsung-Yi and Chen, Pin-Yu and Yang, Yaoqing},
  booktitle={Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year={2026},
  url={https://arxiv.org/abs/2506.05346}
}
