One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

This repository contains the code and data for the paper "One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation".

Overview

Instruction-tuned embedding models are typically evaluated with a single, fixed prompt per task. This single-point evaluation hides a critical problem: these models are highly sensitive to prompt phrasing. We present an empirical study across 6 embedding models, 11 datasets, and 15 task-specific prompts per dataset (990 total evaluations) showing that:

Prompt deflation & inflation: A score reported for a single prompt can systematically understate or overstate the model's performance across the prompt distribution.
Leaderboard fragility: By selecting prompts favorably, any model in our study can be promoted to rank 1 on a simulated leaderboard.
Prompt hacking: For most models, the reported score lies above the expected performance across prompts. This is analogous to p-hacking: a favorable prompt is selectively reported while the model itself is left unchanged.
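
The leaderboard-fragility finding can be illustrated with a small sketch. It is one possible reading of "selecting prompts favorably", not the paper's exact procedure: the target model is scored with its best prompt while every other model is scored with its worst. The function name `best_case_rank` and the scores in `toy_scores` are hypothetical and are not results from the study.

```python
def best_case_rank(scores: dict[str, list[float]], target: str) -> int:
    """Rank of `target` (1 = best) when it uses its best prompt and
    every other model uses its worst prompt."""
    target_score = max(scores[target])
    rivals_ahead = sum(
        1
        for name, values in scores.items()
        if name != target and min(values) > target_score
    )
    return 1 + rivals_ahead


# Hypothetical per-prompt scores on one task (illustrative numbers only).
toy_scores = {
    "model_a": [0.71, 0.64, 0.69],
    "model_b": [0.66, 0.73, 0.62],
    "model_c": [0.60, 0.68, 0.65],
}

best_ranks = {name: best_case_rank(toy_scores, name) for name in toy_scores}
print(best_ranks)
```

In this toy setup the score ranges of the three models overlap, so each of them can reach rank 1 under a favorable prompt selection.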

We recommend that benchmarks transition from single-prompt evaluation to distribution-based robustness metrics.
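
A minimal sketch of what a distribution-based summary could look like is shown below. The statistics chosen here (mean, standard deviation, range, and the gap between a reported score and the mean) are illustrative assumptions, not the metrics defined in the paper, and `summarize_prompt_scores` is a hypothetical helper.

```python
from statistics import mean, pstdev


def summarize_prompt_scores(scores: list[float], reported: float) -> dict[str, float]:
    """Summarize one model's score distribution over prompts on one task."""
    avg = mean(scores)
    return {
        "mean": avg,
        "std": pstdev(scores),
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
        # Positive gap: the reported score is above the mean (inflation).
        # Negative gap: the reported score is below the mean (deflation).
        "reported_minus_mean": reported - avg,
    }


# Illustrative numbers only: three prompt scores and one reported score.
summary = summarize_prompt_scores([0.60, 0.70, 0.80], reported=0.78)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

Reporting the spread alongside the mean makes it visible when a single published number sits near the top or bottom of what the model achieves across prompts.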

Repository Structure

prompt-hacking/
├── src/
│   ├── prompt_generation/       # Synthetic prompt generation via LLM
│   │   ├── generate_prompts.py  # Main script: generates 15 prompts per task using vLLM
│   │   └── prompt_template.py   # Prompt templates & validation for structured output
│   └── running_experiments/     # MTEB evaluation runners
│       ├── run_models.py        # Multi-GPU experiment runner with prompt injection
│       └── run_models_single_gpu.py  # Single-GPU variant
├── data/
│   ├── prompts/                 # Generated prompts (JSON)
│   │   └── openai-gpt-oss-120b-gen-prompts.json
│   └── cache_data/mteb_cache/   # Cached MTEB evaluation results
│       └── results/             # Per-model result directories
├── pyproject.toml               # Project metadata & dependencies
├── makefile                     # Setup shortcuts
└── uv.lock                      # Locked dependency versions

Models Evaluated

Model	Type
`Qwen/Qwen3-Embedding-0.6B`	Multilingual
`intfloat/multilingual-e5-large-instruct`	Multilingual
`KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5`	Multilingual
`BAAI/bge-small-en-v1.5`	English
`BAAI/bge-base-en-v1.5`	English
`BAAI/bge-large-en-v1.5`	English

Tasks & Datasets

The evaluation covers 4 task types from MTEB/MMTEB:

Task Type	Datasets
Retrieval	MIRACLRetrievalHardNegatives.v2, Touche2020Retrieval.v3, FEVERHardNegatives
Classification	TweetSentimentClassification, ImdbClassification, AmazonCounterfactualClassification
Clustering	MedrxivClusteringP2P.v2, StackExchangeClustering.v2
Semantic Similarity	STS14, STS15, STS22.v2
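
The size of the evaluation grid quoted in the Overview can be checked by enumerating it. The sketch below uses the model and dataset names from the tables above; the variable names and the idea of indexing prompts by position are assumptions made for illustration.

```python
from itertools import product

MODELS = [
    "Qwen/Qwen3-Embedding-0.6B",
    "intfloat/multilingual-e5-large-instruct",
    "KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5",
    "BAAI/bge-small-en-v1.5",
    "BAAI/bge-base-en-v1.5",
    "BAAI/bge-large-en-v1.5",
]

DATASETS = {
    "Retrieval": [
        "MIRACLRetrievalHardNegatives.v2",
        "Touche2020Retrieval.v3",
        "FEVERHardNegatives",
    ],
    "Classification": [
        "TweetSentimentClassification",
        "ImdbClassification",
        "AmazonCounterfactualClassification",
    ],
    "Clustering": ["MedrxivClusteringP2P.v2", "StackExchangeClustering.v2"],
    "Semantic Similarity": ["STS14", "STS15", "STS22.v2"],
}

PROMPTS_PER_DATASET = 15

# Flatten the task-type mapping into a single list of dataset names.
datasets = [name for names in DATASETS.values() for name in names]

# One evaluation per (model, dataset, prompt index) triple.
grid = list(product(MODELS, datasets, range(PROMPTS_PER_DATASET)))
print(len(MODELS), len(datasets), len(grid))  # prints: 6 11 990
```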

Setup

Requires Python ≥ 3.13 and uv.

make install
# or equivalently:
uv sync

Usage

1. Generate Prompts

Generate 15 synthetic task-specific prompts per task using a language model (default: openai/gpt-oss-120b) with vLLM structured outputs:

uv run python src/prompt_generation/generate_prompts.py

This produces a JSON file in data/prompts/ containing the generated prompts along with metadata (task type, dataset description, language).

2. Run Experiments

Evaluate embedding models on all tasks with each generated prompt. The runner loads the model once and iterates over all prompt/task combinations:

# Multi-GPU (recommended)
uv run python src/running_experiments/run_models.py <model_name> \
    --prompts_path data/prompts/openai-gpt-oss-120b-gen-prompts.json \
    --task_description <optional: filter by task type> \
    --enc_batch_size 32

# Single-GPU
uv run python src/running_experiments/run_models_single_gpu.py

Results are cached under data/cache_data/mteb_cache/results/ following the MTEB result format, with custom prompts stored under an experiments/ subdirectory per model revision.
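
To evaluate several models in sequence, the multi-GPU command above can be wrapped in a small launcher. This is a sketch rather than part of the repository: it only uses the flags shown in this README, the `build_command` helper is hypothetical, and the model list is a subset chosen for illustration.

```python
import shlex

RUNNER = "src/running_experiments/run_models.py"
PROMPTS_PATH = "data/prompts/openai-gpt-oss-120b-gen-prompts.json"


def build_command(model_name: str, enc_batch_size: int = 32) -> list[str]:
    """Build the argument list for one run of the multi-GPU runner."""
    return [
        "uv", "run", "python", RUNNER, model_name,
        "--prompts_path", PROMPTS_PATH,
        "--enc_batch_size", str(enc_batch_size),
    ]


MODELS_TO_RUN = ["BAAI/bge-small-en-v1.5", "BAAI/bge-base-en-v1.5"]

for model in MODELS_TO_RUN:
    # Dry run: print the command instead of executing it.
    print(shlex.join(build_command(model)))
    # To launch for real, replace the print with:
    #   subprocess.run(build_command(model), check=True)
```

Printing the commands first makes it easy to confirm paths and arguments before committing GPU time.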

Result Cache Format

data/cache_data/mteb_cache/results/
└── <model_name>/
    └── <revision>/
        ├── <TaskName>.json           # Baseline (default prompt) results
        ├── model_meta.json
        └── experiments/
            └── <prompt_config>/
                └── <TaskName>.json   # Custom prompt results
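
The layout above can be traversed to gather one score per prompt configuration. The following sketch makes two assumptions that should be verified against the actual files: each result JSON has a `scores` object keyed by split, whose entries carry a `main_score` field (the usual MTEB result layout), and the split of interest is `test` (some tasks may use a different split). The helper names are hypothetical.

```python
import json
from pathlib import Path
from statistics import mean


def read_main_score(result_file: Path, split: str = "test") -> float:
    """Average `main_score` over the subsets of one split (assumed layout)."""
    data = json.loads(result_file.read_text(encoding="utf-8"))
    return mean(entry["main_score"] for entry in data["scores"][split])


def collect_scores(revision_dir: Path, task_name: str) -> dict[str, float]:
    """Map 'baseline' and each prompt config name to its score for one task."""
    scores: dict[str, float] = {}
    baseline = revision_dir / f"{task_name}.json"
    if baseline.is_file():
        scores["baseline"] = read_main_score(baseline)
    # Custom-prompt results live under experiments/<prompt_config>/.
    for result_file in sorted(revision_dir.glob(f"experiments/*/{task_name}.json")):
        scores[result_file.parent.name] = read_main_score(result_file)
    return scores
```

Calling `collect_scores(Path("data/cache_data/mteb_cache/results") / "<model_name>" / "<revision>", "STS14")` would then yield a dictionary suitable for computing per-model score distributions.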

Citation

License

See the repository for license details.
