Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets
This project builds on top of Align-Anything. Follow the instructions below to install the required dependencies.
conda create -n llm-guardrail-durability python=3.11
conda activate llm-guardrail-durability
pip install git+https://github.com/PKU-Alignment/align-anything.git
pip install -r requirements.txt

We modified the original codebase to support our experiments. Replace the original align_anything.configs.template with the provided template.py in the configs folder.
CONDA_BASE=$(conda info --base)
cp configs/template.py ${CONDA_BASE}/envs/llm-guardrail-durability/lib/python3.11/site-packages/align_anything/configs/template.py

We also support wandb logging. By default, it runs in offline mode. To view W&B logs online, set the WANDB_API_KEY environment variable before starting training:
export WANDB_API_KEY="..." # your W&B API key here

Refer to this repo for the code used to extract dataset representations from the models.
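For orientation, here is a minimal sketch of one common way to extract such representations, assuming mean-pooled final-layer hidden states; the exact extraction logic lives in the referenced repo, and the model ID and example text are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # placeholder; any HF causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, output_hidden_states=True
).eval()

@torch.no_grad()
def get_representation(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    hidden = model(**inputs).hidden_states[-1]  # (1, seq_len, hidden_dim)
    return hidden.mean(dim=1).squeeze(0)        # mean-pool over tokens

# One vector per example, stacked and saved like the reps-full.pt files used below.
reps = torch.stack([get_representation(t) for t in ["Example instruction."]])
torch.save(reps, "reps-full.pt")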
python similarity/get_high_similarity_group.py \
--dataset_folder datasets/${DATASET} \
--model_id ${BASE_MODEL} \
--clustering kmeans \
--stack_vector_tf False \
--n_clusters 20 \
--dataset_data ${DATAPATH} \
--dataset_reps_dir datasets/${DATASET}/reps/${BASE_MODEL}/reps-full.pt

python similarity/get_top_similarity_dataset_reps_general.py \
--dataset_folder datasets/${DATASET} \
--stack_vector_tf False \
--select_n ${NUM_OF_SUBSET} \
--dataset_dir ${DATAPATH} \
--dataset_reps_dir ${DATASET_REPS} \
--anchoring_reps_dir ${ANCHORING_DATASET_REPS}
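Conceptually, the selection step ranks each candidate example by the similarity of its representation to an anchoring set and keeps the top-n. A minimal sketch under that assumption (file paths are hypothetical; the script above implements the actual logic):

import torch
import torch.nn.functional as F

# Hypothetical paths; in practice these are the --dataset_reps_dir and
# --anchoring_reps_dir arguments of the script above.
dataset_reps = torch.load("dataset-reps-full.pt").float()   # (N, d), one row per candidate
anchor_reps = torch.load("anchoring-reps-full.pt").float()  # (M, d), one row per anchor

# Cosine similarity of every candidate to every anchoring example.
sims = F.normalize(dataset_reps, dim=-1) @ F.normalize(anchor_reps, dim=-1).T  # (N, M)

# Score candidates by mean similarity to the anchor set and keep the top-n.
select_n = 5000  # e.g. the 5k subset used in our experiments
top_idx = sims.mean(dim=1).topk(select_n).indices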
See scripts/sft.sh for an example upstream-alignment script. In our experiments, we use meta-llama/Llama-2-7b-hf as the base model; the instruction fine-tuning dataset selected using our proposed method is available at hsiung/llm-similarity-risk.
MODEL_NAME_OR_PATH="meta-llama/Llama-2-7b-hf" # Base model for instruction fine-tuning and safety alignment
TRAIN_DATASETS="hsiung/llm-similarity-risk" # Training dataset (we use the UltraChat-Beavertails dataset in our experiments; see the `datasets` folder)OUTPUT_DIR="sft_models/ultrachat_beavertails-full" # Output directory for the trained model (`sft_models`)
# Source the setup script
source ./scripts/setup.sh
# Run DeepSpeed
deepspeed \
--master_port ${MASTER_PORT} \
--include localhost:0,1,2,3 \
--module align_anything.trainers.text_to_text.sft \
--log_run_name "ultrachat_beavertails-full" \
--per_device_train_batch_size 8 \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--train_datasets ${TRAIN_DATASETS} \
--output_dir ${OUTPUT_DIR} \
--train_template UltraChat \
--train_split train \
--eval_split test

Specify the size of the safety-alignment dataset in SIZE. We use 1k and 5k in our experiments.
SIZE="5k"
OUTPUT_DIR="sft_models/alpaca_high_sim_${SIZE}"
# Source the setup script
source ./scripts/setup.sh
# Run DeepSpeed
deepspeed \
--master_port ${MASTER_PORT} \
--include localhost:0,1,2,3 \
--module align_anything.trainers.text_to_text.sft \
--log_run_name "ultrachat-full-sft" \
--per_device_train_batch_size 8 \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--train_datasets ${TRAIN_DATASETS} \
--output_dir ${OUTPUT_DIR} \
--train_template UltraChat \
--train_split train_alpaca_high_sim_${SIZE} \
--eval_split test

# Specify the downstream fine-tuning task. We use the List/Pure Bad dataset for harmful fine-tuning and Alpaca/SAMSum for benign fine-tuning.
TRAIN_DATASETS="SPECIFY_THE_DOWNSTREAM_DATASET_DIR"
# Specify the aligned model path
MODEL_NAME_OR_PATH="sft_models/list_low_sim_1k"
OUTPUT_DIR="downstream_sft_models/list_low_sim_1k-list_group-sft"
RUN_NAME="downstream_sft-list_low_sim_1k-list_group-lr=2e_5-bs=5-gpu=4"
# Source the setup script
source ./scripts/setup.sh
# Run DeepSpeed
deepspeed \
--master_port ${MASTER_PORT} \
--include localhost:0,1,2,3 \
--module align_anything.trainers.text_to_text.sft \
--epochs 5 \
--learning_rate 2e-5 \
--gradient_accumulation_steps 1 \
--per_device_train_batch_size 5 \
--log_run_name ${RUN_NAME} \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--train_datasets ${TRAIN_DATASETS} \
--output_dir ${OUTPUT_DIR} \
--train_template UltraChat \
--train_split list_group

The evaluation pipeline builds on top of shallow-vs-deep-alignment and Booster.
SAFETY_BENCH="hex-phi"
MODEL_PATH="sft_models/ultrachat-full"
accelerate launch --gpu_ids=0,1,2,3 --num_processes=4 \
eval_safety.py --model_name_or_path=${MODEL_PATH} \
--torch_dtype=bfloat16 \
--safety_bench=${SAFETY_BENCH} \
--model_family='llama2_base' \
--prompt_style='llama2_base' \
--evaluator='none' \
--save_path=${MODEL_PATH}/evaluation/${SAFETY_BENCH}/ \
--eval_template='plain'
DATA_PATH="evaluation/${SAFETY_BENCH}"
python beavertails_moderation/formatter.py $DATA_PATH/results.json $DATA_PATH/results_beaver-dam.json
python beavertails_moderation/evaluate.py \
--eval_dataset $DATA_PATH/results_beaver-dam.json \
--model_path PKU-Alignment/beaver-dam-7b \
--max_length 512 \
--output_dir $DATA_PATH
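The moderation step writes per-response judgments; a hedged sketch of turning them into a harmfulness rate, assuming a hypothetical output file and a boolean flagged field (the actual schema is defined by beavertails_moderation/evaluate.py):

import json

# Hypothetical file name and schema; check beavertails_moderation/evaluate.py
# for the fields the moderation model actually emits.
with open("moderation_results.json") as f:
    results = json.load(f)

flagged = sum(1 for r in results if r["flagged"])
print(f"Harmfulness rate: {flagged / len(results):.2%}")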
MODEL_ID="ultrachat_beavertails-full"
MODEL_NAME_OR_PATH="sft_models/ultrachat_beavertails-full"
OUTPUT_DIR="${MODEL_NAME_OR_PATH}/evaluation/mt_bench"
GENERATION_BACKEND="vLLM"
python __main__.py \
--benchmark ${BENCHMARK} \
--model_id ${MODEL_ID} \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--output_dir ${OUTPUT_DIR} \
--generation_backend ${GENERATION_BACKEND} > ${MODEL_NAME_OR_PATH}/evaluation/${BENCHMARK}.log

We use the ROUGE-1 score, computed with the rouge package, to evaluate the utility of the generated summaries; a sketch of the scoring follows the command below.
DATASET="samsum"
MODEL_PATH="downstream_sft_models/samsum_high_sim_5k-samsum-sft"
accelerate launch --gpu_ids=0,1,2,3 --num_processes=4 \
eval_utility.py --model_name_or_path=${MODEL_PATH} \
--torch_dtype=bfloat16 \
--dataset=${DATASET} \
--model_family='llama2_base' \
--prompt_style='llama2_base' \
--evaluator='rouge_1' \
--save_path=${MODEL_PATH}/evaluation/${DATASET}.json
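For reference, a minimal sketch of the ROUGE-1 computation behind --evaluator='rouge_1', using the rouge package; the summary pair is a made-up example, and we assume the F1 variant is what gets reported.

from rouge import Rouge

# Made-up hypothesis/reference pair for illustration.
hypothesis = "amanda baked cookies and will bring jerry some tomorrow"
reference = "amanda baked cookies and will bring some to jerry tomorrow"

# get_scores returns precision ("p"), recall ("r"), and F1 ("f") for
# ROUGE-1, ROUGE-2, and ROUGE-L.
scores = Rouge().get_scores(hypothesis, reference, avg=True)
print(scores["rouge-1"]["f"])  # ROUGE-1 F1 (assumed reporting variant)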
If you find this helpful for your research, please cite our paper as follows:

@inproceedings{hsiung2026why,
title={Why {LLM} Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets},
author={Hsiung, Lei and Pang, Tianyu and Tang, Yung-Chen and Song, Linyue and Ho, Tsung-Yi and Chen, Pin-Yu and Yang, Yaoqing},
booktitle={Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year={2026},
url={https://arxiv.org/abs/2506.05346}
}