This repository contains scripts for evaluating In-Context Learning (ICL) performance of large language models with long contexts on translation tasks. There are two architectures: a client-server architecture (newer, optimized) and a monolithic architecture (older).
## vllm_server.py

A standalone vLLM server that:
- Loads a language model once and keeps it in GPU memory
- Provides an OpenAI-compatible API endpoint
- Enables prefix caching for efficient repeated prompt processing
- Configurable via command-line arguments
Key Features:
- Port: 8000 (default)
- Prefix caching enabled for faster processing of repeated prompts
- Chunked prefill for handling long contexts
- 95% GPU memory utilization
- Max batch tokens: 131,072
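These defaults can be sketched as an `argparse` configuration (a minimal sketch; the exact flag names in `vllm_server.py` beyond `--model_name`, `--max_ctx`, and `--port` are assumptions inferred from the feature list above):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI mirroring the documented server defaults (sketch)."""
    p = argparse.ArgumentParser(description="Standalone vLLM server (sketch)")
    p.add_argument("--model_name", default="Qwen/Qwen2.5-7B-Instruct-1M")
    p.add_argument("--max_ctx", type=int, default=1_000_000)       # max context length
    p.add_argument("--port", type=int, default=8000)               # OpenAI-compatible endpoint
    p.add_argument("--gpu_memory_utilization", type=float, default=0.95)
    p.add_argument("--max_num_batched_tokens", type=int, default=131_072)
    # Prefix caching and chunked prefill are enabled by default in this sketch
    p.add_argument("--enable_prefix_caching", action="store_true", default=True)
    p.add_argument("--enable_chunked_prefill", action="store_true", default=True)
    return p

args = build_parser().parse_args([])  # parse with no arguments -> documented defaults
```

These values would typically be forwarded to vLLM's engine arguments when constructing the server.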
Usage:
```bash
python vllm_server.py \
    --model_name "Qwen/Qwen2.5-7B-Instruct-1M" \
    --max_ctx 1000000 \
    --port 8000
```

## icl_client.py

Client script that:
- Connects to the vLLM server via OpenAI API
- Loads translation datasets (FLORES+ Javanese↔English)
- Generates prompts with varying amounts of in-context examples
- Sends requests to the server and saves results
Key Features:
- Uses OpenAI client pointing to local vLLM server
- Supports zero-shot and few-shot learning (0 to 500K tokens of examples)
- Batches requests sequentially with tqdm progress tracking
- Saves results to CSV files
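The few-shot packing step can be sketched as a helper that adds in-context examples until a token budget is spent (a hypothetical `build_prompt`; the real packing and token counting in `icl_client.py` may differ — here tokens are approximated by whitespace splitting):

```python
def build_prompt(examples, source, shot_token_budget):
    """Pack (Javanese, English) example pairs until the token budget is
    exhausted, then append the sentence to translate (sketch)."""
    parts, used = [], 0
    for src, tgt in examples:
        line = f"Javanese: {src}\nEnglish: {tgt}\n"
        cost = len(line.split())  # crude token estimate
        if used + cost > shot_token_budget:
            break
        parts.append(line)
        used += cost
    parts.append(f"Javanese: {source}\nEnglish:")
    return "\n".join(parts)

examples = [("Sugeng enjing.", "Good morning."), ("Matur nuwun.", "Thank you.")]
zero_shot = build_prompt(examples, "Piye kabare?", 0)     # budget 0 -> no examples
few_shot = build_prompt(examples, "Piye kabare?", 1024)   # budget fits all examples
```

With a shared example prefix across requests, the server's prefix caching avoids re-processing the in-context examples on every call.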
## run_optimized.sh

Workflow:
1. User manually starts the server (in a separate terminal):
   `python vllm_server.py --model_name "MODEL" --max_ctx CTX --port 8000`
2. The script runs experiments for multiple models/configurations:
   - Loops through the `MODEL_NAMES` and `MAX_CTX` arrays
   - For each model, runs experiments with different shot token counts
   - Calls `icl_client.py` repeatedly with different parameters
   - Server stays alive across all runs
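The experiment loop above can be sketched in Python (command construction only; the `icl_client.py` flag names here are assumptions for illustration):

```python
MODEL_NAMES = ["Qwen/Qwen2.5-7B-Instruct-1M", "internlm/internlm2_5-7b-chat-1m"]
MAX_CTX = [1_000_000, 655_360]            # paired with MODEL_NAMES by index
SHOT_TOKENS = [0, 1024, 2048, 4096, 8192]

def build_commands():
    """Build one icl_client.py invocation per (model, shot_tokens) pair."""
    cmds = []
    for model, ctx in zip(MODEL_NAMES, MAX_CTX):
        for shots in SHOT_TOKENS:
            cmds.append([
                "python", "icl_client.py",
                "--model_name", model,       # flag names are assumptions
                "--max_ctx", str(ctx),
                "--shot_tokens", str(shots),
            ])
    return cmds

commands = build_commands()
# Each command would be executed with subprocess.run(cmd) while the
# already-running server stays alive across all invocations.
```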
Benefits:
- Model loaded once, used many times (efficient)
- Prefix caching accelerates repeated prompts
- User controls when to stop/restart server
Configuration:

```bash
SHOT_TOKENS=(0 1024 2048 4096 8192)   # ICL example sizes
MODEL_NAMES=("Qwen/Qwen2.5-7B-Instruct-1M" "internlm/internlm2_5-7b-chat-1m")
MAX_CTX=(1000000 655360)              # Corresponding max context lengths
```

Usage:

```bash
# Terminal 1: Start server
python vllm_server.py --model_name "Qwen/Qwen2.5-7B-Instruct-1M" --max_ctx 1000000

# Terminal 2: Run experiments
bash run_optimized.sh
```

## run_multi_server.sh

Workflow:
1. For each model in sequence:
   - `cleanup_server()`: kill all GPU processes, clear memory
   - Check GPU memory status with `nvidia-smi`
   - Start the vLLM server in the background (logs to `vllm_${MODEL_IDX}.log`)
   - `wait_for_server()`: poll the endpoint until ready (max 20 min)
2. Run all experiments for that model:
   - Loop through shot token amounts
   - Call `icl_client.py` for each configuration
   - Handle failures gracefully (continue on error)
3. Cleanup and next model:
   - `cleanup_server()`: kill server, clear GPU
   - Wait 30 seconds before loading the next model

Key Functions:
- `wait_for_server()`: polls `http://localhost:8000/v1/models` every 10s (max 120 attempts)
- `cleanup_server()`:
  - Kills GPU processes via `nvidia-smi` PIDs
  - Kills vLLM, Ray, and Python processes
  - Clears port 8000
  - Waits 15s for GPU memory to clear
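The polling logic can be sketched in Python (a minimal sketch using `urllib`; the actual `wait_for_server()` lives in the bash script and would typically use `curl`):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url="http://localhost:8000/v1/models",
                    max_attempts=120, interval=10):
    """Poll the server's model-list endpoint until it responds (sketch)."""
    for _ in range(max_attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after the interval
        time.sleep(interval)
    return False
```

With the defaults this waits up to 120 × 10 s = 20 minutes, matching the script's documented timeout.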
Benefits:
- Fully automated - no manual intervention needed
- Proper cleanup between models prevents OOM errors
- Logs server output for debugging
- Robust error handling and recovery
Usage:

```bash
bash run_multi_server.sh
```

## Monolithic Architecture (Legacy)

How it worked:
1. `run_051225.sh`: simple bash loop
   - Iterates through models, shot token counts, and iterations
   - Calls `icl_051225.py` directly with parameters
   - No separate server process
2. `icl_051225.py`: self-contained script
   - Loads the model directly using vLLM's `LLM` class
   - Loads the model into memory on every script execution
   - Processes all samples in one run
   - Uses `llm.generate()` for batch inference
   - Saves results to CSV

Workflow:
bash script → python script → load model → generate → save → exit → repeat
Problems:
- ❌ Model reloaded every run (slow, inefficient)
- ❌ No prefix caching between runs
- ❌ No API server - direct vLLM usage
- ❌ GPU memory not cleared between runs (potential OOM)
Why it was replaced: The client-server architecture avoids repeated model loading and enables prefix caching, significantly improving throughput for multiple experiments.
## Architecture Comparison

| Feature | Monolithic (Old) | Client-Server (New) |
|---|---|---|
| Model Loading | Every run | Once per model |
| Architecture | Direct vLLM | OpenAI API + vLLM server |
| Prefix Caching | ❌ No | ✅ Yes |
| GPU Management | Manual | Automated (run_multi_server.sh) |
| Flexibility | Limited | High (can run different clients) |
| Efficiency | Low | High |
## Output Format

All scripts save results to CSV files:

```
{output_dir}/{shot_tokens}_shot_icl_javanese_{model}_{max_ctx}_{iteration}.csv
```

Columns:
- `source_lang`: "english" or "javanese"
- `prompt`: full prompt (including ICL examples)
- `label`: ground-truth translation
- `answer`: model-generated translation
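Writing one result row with these columns can be sketched with the standard `csv` module (the filename values below are hypothetical placeholders for the pattern above):

```python
import csv
import os
import tempfile

FIELDS = ["source_lang", "prompt", "label", "answer"]

# Hypothetical placeholder values substituted into the filename pattern
path = os.path.join(tempfile.gettempdir(),
                    "1024_shot_icl_javanese_qwen2.5-7b_1000000_0.csv")

row = {
    "source_lang": "javanese",
    "prompt": "Javanese: Sugeng enjing.\nEnglish:",
    "label": "Good morning.",
    "answer": "Good morning.",
}

with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()   # header row matches the documented columns
    writer.writerow(row)
```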
## Supported Models

- Qwen/Qwen2.5-7B-Instruct-1M (1M context)
- internlm/internlm2_5-7b-chat-1m (655K context)
- aws-prototyping/MegaBeam-Mistral-7B-512k (512K context)
- LargeWorldModel/LWM-Text-Chat-1M (200K context)
- chuxin-llm/Chuxin-1.6B-1M (configurable)