This repository contains scripts for evaluating In-Context Learning (ICL) performance of large language models with long contexts on translation tasks. There are two architectures: a client-server architecture (newer, optimized) and a monolithic architecture (older).
## vllm_server.py

A standalone vLLM server that:
- Loads a language model once and keeps it in GPU memory
- Provides an OpenAI-compatible API endpoint
- Enables prefix caching for efficient repeated prompt processing
- Configurable via command-line arguments
Key Features:
- Port: 8000 (default)
- Prefix caching enabled for faster processing of repeated prompts
- Chunked prefill for handling long contexts
- 95% GPU memory utilization
- Max batch tokens: 131,072
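These defaults can be sketched as an `argparse` configuration (a minimal sketch; the exact flag names in `vllm_server.py` beyond `--model_name`, `--max_ctx`, and `--port` are assumptions inferred from the feature list above):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI mirroring the documented server defaults (sketch)."""
    p = argparse.ArgumentParser(description="Standalone vLLM server (sketch)")
    p.add_argument("--model_name", default="Qwen/Qwen2.5-7B-Instruct-1M")
    p.add_argument("--max_ctx", type=int, default=1_000_000)       # max context length
    p.add_argument("--port", type=int, default=8000)               # OpenAI-compatible endpoint
    p.add_argument("--gpu_memory_utilization", type=float, default=0.95)
    p.add_argument("--max_num_batched_tokens", type=int, default=131_072)
    # Prefix caching and chunked prefill are enabled by default in this sketch
    p.add_argument("--enable_prefix_caching", action="store_true", default=True)
    p.add_argument("--enable_chunked_prefill", action="store_true", default=True)
    return p

args = build_parser().parse_args([])  # parse with no arguments -> documented defaults
```

These values would typically be forwarded to vLLM's engine arguments when constructing the server.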
Usage:
```bash
python vllm_server.py \
    --model_name "Qwen/Qwen2.5-7B-Instruct-1M" \
    --max_ctx 1000000 \
    --port 8000
```

## icl_client.py

Client script that:
- Connects to the vLLM server via OpenAI API
- Loads translation datasets (FLORES+ Javanese↔English)
- Generates prompts with varying amounts of in-context examples
- Sends requests to the server and saves results
Key Features:
- Uses OpenAI client pointing to local vLLM server
- Supports zero-shot and few-shot learning (0 to 500K tokens of examples)
- Batches requests sequentially with tqdm progress tracking
- Saves results to CSV files
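The few-shot packing step can be sketched as a helper that adds in-context examples until a token budget is spent (a hypothetical `build_prompt`; the real packing and token counting in `icl_client.py` may differ — here tokens are approximated by whitespace splitting):

```python
def build_prompt(examples, source, shot_token_budget):
    """Pack (Javanese, English) example pairs until the token budget is
    exhausted, then append the sentence to translate (sketch)."""
    parts, used = [], 0
    for src, tgt in examples:
        line = f"Javanese: {src}\nEnglish: {tgt}\n"
        cost = len(line.split())  # crude token estimate
        if used + cost > shot_token_budget:
            break
        parts.append(line)
        used += cost
    parts.append(f"Javanese: {source}\nEnglish:")
    return "\n".join(parts)

examples = [("Sugeng enjing.", "Good morning."), ("Matur nuwun.", "Thank you.")]
zero_shot = build_prompt(examples, "Piye kabare?", 0)     # budget 0 -> no examples
few_shot = build_prompt(examples, "Piye kabare?", 1024)   # budget fits all examples
```

With a shared example prefix across requests, the server's prefix caching avoids re-processing the in-context examples on every call.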
## run_optimized.sh

Workflow:
1. User manually starts the server (in a separate terminal):
   `python vllm_server.py --model_name "MODEL" --max_ctx CTX --port 8000`
2. The script runs experiments for multiple models/configurations:
   - Loops through the `MODEL_NAMES` and `MAX_CTX` arrays
   - For each model, runs experiments with different shot token counts
   - Calls `icl_client.py` repeatedly with different parameters
   - Server stays alive across all runs
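The experiment loop above can be sketched in Python (command construction only; the `icl_client.py` flag names here are assumptions for illustration):

```python
MODEL_NAMES = ["Qwen/Qwen2.5-7B-Instruct-1M", "internlm/internlm2_5-7b-chat-1m"]
MAX_CTX = [1_000_000, 655_360]            # paired with MODEL_NAMES by index
SHOT_TOKENS = [0, 1024, 2048, 4096, 8192]

def build_commands():
    """Build one icl_client.py invocation per (model, shot_tokens) pair."""
    cmds = []
    for model, ctx in zip(MODEL_NAMES, MAX_CTX):
        for shots in SHOT_TOKENS:
            cmds.append([
                "python", "icl_client.py",
                "--model_name", model,       # flag names are assumptions
                "--max_ctx", str(ctx),
                "--shot_tokens", str(shots),
            ])
    return cmds

commands = build_commands()
# Each command would be executed with subprocess.run(cmd) while the
# already-running server stays alive across all invocations.
```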
Benefits:
- Model loaded once, used many times (efficient)
- Prefix caching accelerates repeated prompts
- User controls when to stop/restart server
Configuration:

```bash
SHOT_TOKENS=(0 1024 2048 4096 8192)   # ICL example sizes
MODEL_NAMES=("Qwen/Qwen2.5-7B-Instruct-1M" "internlm/internlm2_5-7b-chat-1m")
MAX_CTX=(1000000 655360)              # Corresponding max context lengths
```

Usage:

```bash
# Terminal 1: Start server
python vllm_server.py --model_name "Qwen/Qwen2.5-7B-Instruct-1M" --max_ctx 1000000

# Terminal 2: Run experiments
bash run_optimized.sh
```

## run_multi_server.sh

Workflow:
1. For each model in sequence:
   - `cleanup_server()`: kill all GPU processes, clear memory
   - Check GPU memory status with `nvidia-smi`
   - Start the vLLM server in the background (logs to `vllm_${MODEL_IDX}.log`)
   - `wait_for_server()`: poll the endpoint until ready (max 20 min)
2. Run all experiments for that model:
   - Loop through shot token amounts
   - Call `icl_client.py` for each configuration
   - Handle failures gracefully (continue on error)
3. Cleanup and next model:
   - `cleanup_server()`: kill server, clear GPU
   - Wait 30 seconds before loading the next model

Key Functions:
- `wait_for_server()`: polls `http://localhost:8000/v1/models` every 10s (max 120 attempts)
- `cleanup_server()`:
  - Kills GPU processes via `nvidia-smi` PIDs
  - Kills vLLM, Ray, and Python processes
  - Clears port 8000
  - Waits 15s for GPU memory to clear
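The polling logic can be sketched in Python (a minimal sketch using `urllib`; the actual `wait_for_server()` lives in the bash script and would typically use `curl`):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url="http://localhost:8000/v1/models",
                    max_attempts=120, interval=10):
    """Poll the server's model-list endpoint until it responds (sketch)."""
    for _ in range(max_attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after the interval
        time.sleep(interval)
    return False
```

With the defaults this waits up to 120 × 10 s = 20 minutes, matching the script's documented timeout.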
Benefits:
- Fully automated - no manual intervention needed
- Proper cleanup between models prevents OOM errors
- Logs server output for debugging
- Robust error handling and recovery
Usage:

```bash
bash run_multi_server.sh
```

## Monolithic Architecture (Legacy)

How it worked:
1. `run_051225.sh`: simple bash loop
   - Iterates through models, shot token counts, and iterations
   - Calls `icl_051225.py` directly with parameters
   - No separate server process
2. `icl_051225.py`: self-contained script
   - Loads the model directly using vLLM's `LLM` class
   - Loads the model into memory on every script execution
   - Processes all samples in one run
   - Uses `llm.generate()` for batch inference
   - Saves results to CSV

Workflow:
bash script → python script → load model → generate → save → exit → repeat
Problems:
- ❌ Model reloaded every run (slow, inefficient)
- ❌ No prefix caching between runs
- ❌ No API server - direct vLLM usage
- ❌ GPU memory not cleared between runs (potential OOM)
Why it was replaced: The client-server architecture avoids repeated model loading and enables prefix caching, significantly improving throughput for multiple experiments.
## Architecture Comparison

| Feature | Monolithic (Old) | Client-Server (New) |
|---|---|---|
| Model Loading | Every run | Once per model |
| Architecture | Direct vLLM | OpenAI API + vLLM server |
| Prefix Caching | ❌ No | ✅ Yes |
| GPU Management | Manual | Automated (run_multi_server.sh) |
| Flexibility | Limited | High (can run different clients) |
| Efficiency | Low | High |
## Output Format

All scripts save results to CSV files:

```
{output_dir}/{shot_tokens}_shot_icl_javanese_{model}_{max_ctx}_{iteration}.csv
```

Columns:
- `source_lang`: "english" or "javanese"
- `prompt`: full prompt (including ICL examples)
- `label`: ground-truth translation
- `answer`: model-generated translation
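Writing one result row with these columns can be sketched with the standard `csv` module (the filename values below are hypothetical placeholders for the pattern above):

```python
import csv
import os
import tempfile

FIELDS = ["source_lang", "prompt", "label", "answer"]

# Hypothetical placeholder values substituted into the filename pattern
path = os.path.join(tempfile.gettempdir(),
                    "1024_shot_icl_javanese_qwen2.5-7b_1000000_0.csv")

row = {
    "source_lang": "javanese",
    "prompt": "Javanese: Sugeng enjing.\nEnglish:",
    "label": "Good morning.",
    "answer": "Good morning.",
}

with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()   # header row matches the documented columns
    writer.writerow(row)
```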
## Supported Models

- Qwen/Qwen2.5-7B-Instruct-1M (1M context)
- internlm/internlm2_5-7b-chat-1m (655K context)
- aws-prototyping/MegaBeam-Mistral-7B-512k (512K context)
- LargeWorldModel/LWM-Text-Chat-1M (200K context)
- chuxin-llm/Chuxin-1.6B-1M (configurable)