luisfrentzen/lores-icl-long-context-analysis


ICL Long Context Analysis - Documentation

Overview

This repository contains scripts for evaluating the In-Context Learning (ICL) performance of large language models with long contexts on translation tasks. It provides two architectures: a newer, optimized client-server architecture and an older monolithic one.


1. Client-Server Architecture (Current)

Components

vllm_server.py

A standalone vLLM server that:

  • Loads a language model once and keeps it in GPU memory
  • Provides an OpenAI-compatible API endpoint
  • Enables prefix caching for efficient repeated prompt processing
  • Configurable via command-line arguments

Key Features:

  • Port: 8000 (default)
  • Prefix caching enabled for faster processing of repeated prompts
  • Chunked prefill for handling long contexts
  • 95% GPU memory utilization
  • Max batch tokens: 131,072
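
The feature list above suggests a launcher along the following lines. This is a minimal sketch, not the actual script: the flag names are taken from the Usage example below, and the mapping of parsed values onto engine settings (which the real server would presumably forward to vLLM's engine arguments such as `enable_prefix_caching` and `max_num_batched_tokens`) is an assumption.

```python
import argparse

def build_engine_settings(argv=None):
    """Parse launcher flags and map them onto engine settings.

    Flag names mirror the Usage example; the defaults reflect the
    feature list above (port 8000, 95% GPU memory, 131,072 batch tokens).
    """
    parser = argparse.ArgumentParser(
        description="Standalone vLLM server launcher (sketch)")
    parser.add_argument("--model_name", required=True)
    parser.add_argument("--max_ctx", type=int, default=1_000_000)
    parser.add_argument("--port", type=int, default=8000)
    args = parser.parse_args(argv)

    # In the real server these would be handed to vLLM's engine args
    # (assumed here, not confirmed by the repository).
    return {
        "model": args.model_name,
        "max_model_len": args.max_ctx,
        "port": args.port,
        "gpu_memory_utilization": 0.95,
        "enable_prefix_caching": True,
        "enable_chunked_prefill": True,
        "max_num_batched_tokens": 131_072,
    }

settings = build_engine_settings(
    ["--model_name", "Qwen/Qwen2.5-7B-Instruct-1M", "--max_ctx", "1000000"])
print(settings["port"], settings["max_num_batched_tokens"])
```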

Usage:

python vllm_server.py \
  --model_name "Qwen/Qwen2.5-7B-Instruct-1M" \
  --max_ctx 1000000 \
  --port 8000

icl_client.py

Client script that:

  • Connects to the vLLM server via OpenAI API
  • Loads translation datasets (FLORES+ Javanese↔English)
  • Generates prompts with varying amounts of in-context examples
  • Sends requests to the server and saves results

Key Features:

  • Uses OpenAI client pointing to local vLLM server
  • Supports zero-shot and few-shot learning (0 to 500K tokens of examples)
  • Batches requests sequentially with tqdm progress tracking
  • Saves results to CSV files
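
A sketch of how the client might assemble a few-shot translation prompt under a shot-token budget. The helper name, prompt template, and whitespace-based token estimate are all illustrative assumptions; the real client would count tokens with the model's tokenizer and send the result through the OpenAI client pointed at the local server.

```python
def build_icl_prompt(examples, source_text, source_lang, target_lang,
                     shot_token_budget):
    """Prepend as many translation examples as fit in the token budget.

    Token counts are approximated by whitespace splitting; the actual
    script would use the model tokenizer instead.
    """
    header = f"Translate the following text from {source_lang} to {target_lang}.\n\n"
    shots, used = [], 0
    for src, tgt in examples:
        shot = f"{source_lang}: {src}\n{target_lang}: {tgt}\n\n"
        cost = len(shot.split())  # crude token estimate
        if used + cost > shot_token_budget:
            break
        shots.append(shot)
        used += cost
    return header + "".join(shots) + f"{source_lang}: {source_text}\n{target_lang}:"

examples = [("Sugeng enjing", "Good morning"), ("Matur nuwun", "Thank you")]
prompt = build_icl_prompt(examples, "Piye kabare?", "javanese", "english",
                          shot_token_budget=64)
print(prompt)
```

With `shot_token_budget=0` the same helper degenerates to a zero-shot prompt, matching the 0-token setting in the shot sweep.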

run_optimized.sh - Manual Server Management

Workflow:

  1. User manually starts server (in separate terminal):

    python vllm_server.py --model_name "MODEL" --max_ctx CTX --port 8000
  2. Script runs experiments for multiple models/configurations:

    • Loops through MODEL_NAMES and MAX_CTX arrays
    • For each model, runs experiments with different shot token counts
    • Calls icl_client.py repeatedly with different parameters
    • Server stays alive across all runs
  3. Benefits:

    • Model loaded once, used many times (efficient)
    • Prefix caching accelerates repeated prompts
    • User controls when to stop/restart server

Configuration:

SHOT_TOKENS=(0 1024 2048 4096 8192)  # ICL example sizes
MODEL_NAMES=("Qwen/Qwen2.5-7B-Instruct-1M" "internlm/internlm2_5-7b-chat-1m")
MAX_CTX=(1000000 655360)  # Corresponding max context lengths
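
The two arrays are index-paired: MODEL_NAMES[i] runs with MAX_CTX[i], and each pair sweeps all shot-token settings. The loop structure can be sketched in Python (the actual script iterates in bash and invokes icl_client.py for each configuration):

```python
MODEL_NAMES = ["Qwen/Qwen2.5-7B-Instruct-1M", "internlm/internlm2_5-7b-chat-1m"]
MAX_CTX = [1_000_000, 655_360]       # index-paired with MODEL_NAMES
SHOT_TOKENS = [0, 1024, 2048, 4096, 8192]

runs = []
for model, max_ctx in zip(MODEL_NAMES, MAX_CTX):
    for shots in SHOT_TOKENS:
        # Here the bash script would call icl_client.py with these values.
        runs.append((model, max_ctx, shots))

print(len(runs))  # -> 10 (2 models x 5 shot settings)
```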

Usage:

# Terminal 1: Start server
python vllm_server.py --model_name "Qwen/Qwen2.5-7B-Instruct-1M" --max_ctx 1000000

# Terminal 2: Run experiments
bash run_optimized.sh

run_multi_server.sh - Automated Server Management

Workflow:

  1. For each model in sequence:

    • cleanup_server(): Kill all GPU processes, clear memory
    • Check GPU memory status with nvidia-smi
    • Start vLLM server in background (logs to vllm_${MODEL_IDX}.log)
    • wait_for_server(): Poll endpoint until ready (max 20 min)
  2. Run all experiments for that model:

    • Loop through shot token amounts
    • Call icl_client.py for each configuration
    • Handle failures gracefully (continue on error)
  3. Cleanup and next model:

    • cleanup_server(): Kill server, clear GPU
    • Wait 30 seconds before loading next model

Key Functions:

  • wait_for_server(): Polls http://localhost:8000/v1/models every 10s (max 120 attempts)
  • cleanup_server():
    • Kills GPU processes via nvidia-smi PIDs
    • Kills vLLM, Ray, and Python processes
    • Clears port 8000
    • Waits 15s for GPU memory to clear
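
The readiness polling in wait_for_server() amounts to a bounded retry loop; a Python sketch is shown here for illustration (the actual script is bash, and its check would be an HTTP request against http://localhost:8000/v1/models). With the defaults of 120 attempts every 10 s this gives the 20-minute startup window described above.

```python
import time

def wait_for_server(check_ready, max_attempts=120, interval_s=10,
                    sleep=time.sleep):
    """Poll a readiness check until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        if check_ready():
            return attempt
        sleep(interval_s)
    raise TimeoutError(f"server not ready after {max_attempts} attempts")

# Simulate a server that becomes ready on the third poll.
state = {"polls": 0}
def fake_check():
    state["polls"] += 1
    return state["polls"] >= 3

attempts_needed = wait_for_server(fake_check, interval_s=0)
print(attempts_needed)  # -> 3
```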

Benefits:

  • Fully automated - no manual intervention needed
  • Proper cleanup between models prevents OOM errors
  • Logs server output for debugging
  • Robust error handling and recovery

Usage:

bash run_multi_server.sh

2. Monolithic Architecture (Older)

run_051225.sh + icl_051225.py

How it worked:

  1. run_051225.sh: Simple bash loop

    • Iterates through models, shot token counts, and iterations
    • Calls icl_051225.py directly with parameters
    • No separate server process
  2. icl_051225.py: Self-contained script

    • Loads the model directly using vLLM's LLM class
    • Model loaded into memory for each script execution
    • Processes all samples in one run
    • Uses llm.generate() for batch inference
    • Saves results to CSV

Workflow:

bash script → python script → load model → generate → save → exit → repeat

Problems:

  • Model reloaded every run (slow, inefficient)
  • No prefix caching between runs
  • No API server - direct vLLM usage
  • GPU memory not cleared between runs (potential OOM)

Why it was replaced: The client-server architecture avoids repeated model loading and enables prefix caching, significantly improving throughput for multiple experiments.


Key Differences Summary

Feature           Monolithic (Old)   Client-Server (New)
Model Loading     Every run          Once per model
Architecture      Direct vLLM        OpenAI API + vLLM server
Prefix Caching    ❌ No              ✅ Yes
GPU Management    Manual             Automated (run_multi_server.sh)
Flexibility       Limited            High (can run different clients)
Efficiency        Low                High

Output Format

All scripts save results to CSV files:

{output_dir}/{shot_tokens}_shot_icl_javanese_{model}_{max_ctx}_{iteration}.csv

Columns:

  • source_lang: "english" or "javanese"
  • prompt: Full prompt (including ICL examples)
  • label: Ground truth translation
  • answer: Model-generated translation

Models Tested

  • Qwen/Qwen2.5-7B-Instruct-1M (1M context)
  • internlm/internlm2_5-7b-chat-1m (655K context)
  • aws-prototyping/MegaBeam-Mistral-7B-512k (512K context)
  • LargeWorldModel/LWM-Text-Chat-1M (200K context)
  • chuxin-llm/Chuxin-1.6B-1M (configurable)
