Awesome LLM Token Optimization

A curated list of strategies, tools, papers, and resources for reducing LLM token costs and improving efficiency in production.

Building with LLMs is expensive. An agent processing 10 reasoning steps can consume 50K-100K tokens per task. This list collects everything you need to cut costs by 80-99% without sacrificing quality.

Contents

  • Quick Wins
  • Prompt Caching
  • Batch APIs
  • Model Routing
  • Prompt Compression
  • Context Window Management
  • KV Cache Optimization
  • Browser Tool Efficiency
  • Cost Tracking Tools
  • Pricing Comparison
  • Prompt Engineering for Efficiency
  • Academic Papers
  • Community Resources
  • License

Quick Wins

The highest-impact strategies, ordered by savings relative to effort:

| Strategy | Savings | Effort | Link |
| --- | --- | --- | --- |
| Prompt caching | 90% on cached input tokens | Add cache headers | Anthropic |
| Token-efficient tool use | Up to 70% fewer output tokens | Flip a flag | Anthropic |
| Batch API | 50% | Queue non-urgent work | Anthropic |
| Model routing | 60-95% | Route by task complexity | RouteLLM |
| Response caching | 100% on repeats | Add a cache layer | Redis guide |
| Prompt compression | 5-20x smaller prompts | Use LLMLingua | GitHub |

Combined pipeline: Cache prefix (90%) + route to cheapest model (60-95%) + batch non-urgent (50%) + compress prompts (5-20x) + cache responses (100% on repeats) = 95-99% cost reduction vs. naive approach.
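A rough way to sanity-check the combined figure is to multiply the surviving cost fractions. The numbers below are illustrative midpoints of the ranges quoted above, not measurements; the sketch ignores output tokens and treats the discounts as independent multipliers, which overstates real-world savings somewhat.

```python
# Back-of-the-envelope estimate of the combined pipeline's savings.
# All values are assumed illustrative midpoints, not measured figures.
cache_hit_rate = 0.8          # share of input tokens served from the prompt cache
cached_token_price = 0.10     # cached input tokens billed at ~10% of base price
routing_factor = 0.25         # cheaper models handle most calls (60-95% savings -> ~75%)
batch_share = 0.5             # assume half the workload can wait for the Batch API
batch_factor = 0.5            # 50% discount on that batched share
compression_factor = 1 / 5    # conservative end of 5-20x prompt compression

input_cost = (cache_hit_rate * cached_token_price + (1 - cache_hit_rate)) * compression_factor
blended = input_cost * routing_factor * (batch_share * batch_factor + (1 - batch_share))

print(f"Remaining cost: {blended:.1%} of the naive baseline")  # ~1%, i.e. ~99% reduction
```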

Prompt Caching

Reuse previously-processed prompt prefixes to avoid re-computing the same tokens.

Provider Docs

Strategy: Cached Prefix Pattern

Structure prompts so the system prompt + user profile forms the first ~2,000 tokens; every subsequent call then shares this cached prefix. For bulk operations (e.g., scoring 50 items): 1 full-price cache write + 49 cache reads at 10% ≈ 88% savings on the shared prefix tokens.
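A minimal sketch of the cached prefix pattern using the Anthropic Python SDK's `cache_control` field. The model name, prompt text, and item list are placeholders; the ~10% cached-read pricing is per Anthropic's prompt caching docs.

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "You are a strict product reviewer. Follow the rubric below..."  # imagine ~2,000 tokens
USER_PROFILE = "Reviewer profile: prefers terse, numeric verdicts."

# Static prefix marked as cacheable: everything up to the cache_control block is
# written to the prompt cache on the first call and read back on later calls.
system_blocks = [
    {
        "type": "text",
        "text": SYSTEM_PROMPT + "\n\n" + USER_PROFILE,
        "cache_control": {"type": "ephemeral"},
    }
]

def score_item(item: str) -> str:
    # Cached prefix reads are billed at roughly 10% of the base input price.
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name; use whatever you deploy
        max_tokens=256,
        system=system_blocks,
        messages=[{"role": "user", "content": f"Score this item: {item}"}],
    )
    return response.content[0].text

# Bulk run over 50 items: 1 full-price cache write + 49 cheap cache reads ~= 88% prefix savings.
results = [score_item(item) for item in ["item one", "item two"]]
```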

Batch APIs

50% discounts for non-time-critical requests. Combine with caching for 95% savings.
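A minimal sketch of queuing non-urgent work through Anthropic's Message Batches API. The request shape follows the SDK documentation, but treat field names and the model string as assumptions to verify against the current SDK version.

```python
import anthropic

client = anthropic.Anthropic()

# Queue non-urgent requests; batched requests are billed at a 50% discount
# and complete asynchronously (typically within 24 hours).
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"summary-{i}",
            "params": {
                "model": "claude-sonnet-4-5",  # placeholder model name
                "max_tokens": 512,
                "messages": [{"role": "user", "content": f"Summarize document #{i}"}],
            },
        }
        for i in range(100)
    ]
)

# Poll later; results become available once processing_status reaches "ended".
status = client.messages.batches.retrieve(batch.id)
print(status.processing_status)
```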

Model Routing

Route simple tasks to cheaper models. 80% of typical LLM calls don't need the most expensive model.
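The idea in code: a tiny heuristic router that sends short, low-stakes prompts to a cheap model and everything else to a stronger one. This sketch uses LiteLLM's `completion()` call; the length threshold, cue words, and model names are arbitrary placeholders, and the frameworks listed below learn this decision rather than hard-coding it.

```python
from litellm import completion

CHEAP_MODEL = "gpt-4o-mini"   # placeholder cheap model
STRONG_MODEL = "gpt-4o"       # placeholder strong model

def looks_simple(prompt: str) -> bool:
    # Naive complexity heuristic: short prompts with no multi-step or code cues
    # go to the cheap model. Learned routers replace this with a classifier.
    cues = ("step by step", "prove", "refactor", "traceback")
    return len(prompt) < 500 and not any(cue in prompt.lower() for cue in cues)

def route(prompt: str) -> str:
    model = CHEAP_MODEL if looks_simple(prompt) else STRONG_MODEL
    response = completion(model=model, messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content

print(route("What's the capital of France?"))            # routed to the cheap model
print(route("Refactor this module step by step: ..."))   # routed to the strong model
```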

Frameworks

  • RouteLLM - Open-source LLM router from LMSYS. Trains routers on preference data; 2x+ cost reduction.
  • LiteLLM - SDK + proxy for 100+ LLMs with routing and cost tracking. Routing strategies: least-busy, cost-based, latency-based.
  • NotDiamond - Per-query best-model selection.
  • Bifrost - Claims to be 50x faster than LiteLLM; adaptive load balancer, 1000+ models.
  • OpenRouter - Unified API for 300+ models with an auto-router.
  • Martian Router - Patent-pending; cuts costs 20-97% via "Model Mapping".

Curated Lists

Research

Prompt Compression

Reduce prompt size while preserving information quality.

Tools

  • LLMLingua - Up to 20x compression via a coarse-to-fine iterative method. Integrates with LangChain/LlamaIndex (see the sketch after this list).
  • Headroom - Routes JSON, code, and text to specialized compressors.
  • code2prompt - Converts a codebase into an LLM prompt, with token counting.
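A minimal LLMLingua sketch (LLMLingua-2 variant) that compresses a long context before sending it to the target model. The model name and parameter names follow the project's README at the time of writing; verify them against the current release.

```python
from llmlingua import PromptCompressor

# LLMLingua-2 uses a small BERT-style token classifier to decide which tokens to keep.
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

long_context = "Speaker A: ... (imagine a 10,000-token meeting transcript here) ..."

result = compressor.compress_prompt(
    long_context,
    rate=0.33,                      # keep roughly a third of the tokens
    force_tokens=["\n", "?", "$"],  # tokens that must survive compression
)

print(result["compressed_prompt"])
print(result["origin_tokens"], "->", result["compressed_tokens"])
```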

Research

Guides

Lossless Compression Principles

Rule-based lossless distillation can reach 3-4:1 compression without calling any model (an illustrative sketch follows the list below):

  • Strip: prose transitions, hedging, rhetoric, common knowledge
  • Preserve: numbers, entities, decisions, constraints, risks
  • Transform: prose to dense bullets; verbose to semicolon-joined
  • Split: 3,000-5,000 token self-contained sections, loadable independently
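An illustrative, deliberately simplified sketch of the rule-based approach. The hedge-phrase list, the semicolon bullet transform, and the ~4,000-token split size are assumptions for demonstration, not a reference implementation.

```python
import re

# Strip: prose transitions, hedging, rhetorical filler (an assumed, tiny phrase list).
HEDGES = [
    r"\bit(?: is|'s) worth noting that\s*",
    r"\bas (?:we|you) can see,?\s*",
    r"\bbasically,?\s*",
    r"\bessentially,?\s*",
    r"\bin other words,?\s*",
]

def strip_hedging(text: str) -> str:
    for pattern in HEDGES:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip()

def to_dense_bullets(paragraph: str) -> str:
    # Transform: prose sentences into one semicolon-joined bullet.
    # Numbers, entities, and constraints pass through untouched (only filler is removed).
    sentences = [strip_hedging(s) for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
    return "- " + "; ".join(s.rstrip(".") for s in sentences)

def split_sections(bullets: list[str], max_tokens: int = 4000) -> list[list[str]]:
    # Split: self-contained chunks of roughly 3,000-5,000 tokens (~4 chars per token).
    sections, current, size = [], [], 0
    for b in bullets:
        tokens = len(b) // 4
        if current and size + tokens > max_tokens:
            sections.append(current)
            current, size = [], 0
        current.append(b)
        size += tokens
    if current:
        sections.append(current)
    return sections

doc = "It is worth noting that revenue grew 14% to $2.1M. Basically, churn fell to 3%."
print(to_dense_bullets(doc))  # "- revenue grew 14% to $2.1M; churn fell to 3%"
```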

Context Window Management

Key Research

Provider Docs

Chunking & Splitting

KV Cache Optimization

Server-side optimizations for inference efficiency.

Inference Engines

  • vLLM - PagedAttention for high-throughput inference (prefix-caching sketch after this list).
  • SGLang - RadixAttention for automatic KV cache reuse.
  • GPUStack - GPU cluster manager for vLLM/SGLang.
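A minimal vLLM sketch with automatic prefix caching enabled, so a repeated system prompt hits the KV cache instead of being recomputed. The `enable_prefix_caching` flag follows vLLM's docs; the model name and prompts are placeholders.

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching lets PagedAttention reuse KV cache blocks for any
# prompt prefix it has already seen (e.g., a shared system prompt).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

SYSTEM = "You are a terse support triage assistant. Rules: ..."  # shared prefix
tickets = ["Printer shows error 0x31", "Cannot reset my password"]

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate([f"{SYSTEM}\n\nTicket: {t}\nTriage:" for t in tickets], params)

for out in outputs:
    # After the first request, the shared SYSTEM prefix is served from the KV cache.
    print(out.outputs[0].text.strip())
```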

Compression Tools

  • NVIDIA kvpress - KV cache compression made easy.
  • R-KV - Redundancy-aware compression (NeurIPS 2025).
  • llm-compressor - Model compression for deployment with vLLM.
  • NVIDIA Model Optimizer - Quantization, pruning, distillation, speculative decoding.
  • TurboQuant - Google paper (ICLR 2026); 5x KV cache compression.
  • aibrix - Cost-efficient infrastructure for GenAI inference.

Research

Educational

Browser Tool Efficiency

Different browser automation approaches consume vastly different amounts of context.

| Agent | Output per page | Efficiency | Link |
| --- | --- | --- | --- |
| WebFetch | ~1.5 KB (AI-summarized) | 20x better | Docs |
| Playwright MCP | ~10-33 KB (accessibility tree) | Baseline | GitHub |
| Agent Browser | ~28 KB (accessibility tree) | Similar to baseline | GitHub |
| Lightpanda | ~16 KB (raw markdown) | 2x better | GitHub |

For 10-page workflows: WebFetch = ~15KB vs Playwright = ~330KB total context consumed.

Why Accessibility Trees Are Efficient

The accessibility tree strips visual styling and keeps only semantic structure (name, role, state, value), making it typically 10-50x smaller than raw HTML. See: Token cost analysis in browser MCPs.
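To see the size difference yourself, Playwright can dump both the raw HTML and an accessibility snapshot of the same page. This is a rough sketch: `page.accessibility.snapshot()` is Playwright's legacy accessibility API, and the URL is just an example.

```python
import json
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")  # any content-heavy page works

    raw_html = page.content()
    ax_tree = page.accessibility.snapshot()  # name/role/state/value tree, no styling

    # Compare serialized sizes; on real pages the tree is often 10-50x smaller.
    print(f"HTML: {len(raw_html):,} chars")
    print(f"Accessibility tree: {len(json.dumps(ax_tree)):,} chars")
    browser.close()
```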

Further Reading

Cost Tracking Tools

Pricing Comparison

Live Pricing Tools

Provider Pricing Pages

Prompt Engineering for Efficiency

Official Guides

Community

Concise Reasoning Research

Comprehensive Guides

Academic Papers

Prompt Compression

| Paper | Year | Key Result |
| --- | --- | --- |
| Prompt Compression Survey | 2024 | Comprehensive survey of all techniques |
| LLMLingua | 2023 | Up to 20x compression (EMNLP) |
| LLMLingua-2 | 2024 | 3-6x faster via BERT distillation (ACL) |
| LongLLMLingua | 2023 | 4x fewer tokens in long contexts |
| Selective Context | 2023 | 50% reduction via self-information pruning |
| RECOMP | 2023 | 5% token ratio for retrieved docs |
| 500xCompressor | 2024 | 6-480x compression ratios |
| LoPace | 2026 | Lossless; 72.2% savings |
| SCOPE | 2025 | Training-free generative rewriting |
| Dynamic Compressing | 2025 | MDP-based adaptive token removal |
| Empirical Study | 2025 | Benchmarks 6 methods across 13 datasets |

Model Routing & Cascading

| Paper | Year | Key Result |
| --- | --- | --- |
| FrugalGPT | 2023 | Seminal cascade paper; up to 98% cost reduction |
| RouteLLM | 2024 | 2x+ cost reduction without quality loss |
| Hybrid LLM | 2024 | 40% fewer calls to the large model |
| Unified Routing + Cascading | 2024 | +14% over individual strategies |
| Dynamic Routing Survey | 2026 | Comprehensive survey |
| Pay for Hints | 2026 | Small model gets hints, not full answers |

Context & Inference

| Paper | Year | Key Result |
| --- | --- | --- |
| Lost in the Middle | 2023 | Models struggle with mid-context info |
| Context Rot | 2025 | Degradation before context limits |
| RAG vs Long Context | 2025 | Complementary strengths by query type |
| Self-Route Hybrid | 2024 | Adaptive RAG + long context |
| InfiniteICL | 2025 | 90% reduction, 103% performance |
| YaRN Context Extension | 2023 | 10x fewer tokens for context extension |
| SkyLadder | 2025 | 22% training time savings |
| TRIM | 2024 | 19.4% token savings on GPT-4o |

KV Cache & Inference

| Paper | Year | Key Result |
| --- | --- | --- |
| PagedAttention (vLLM) | 2023 | Near-zero KV cache waste |
| RadixAttention (SGLang) | 2023 | Automatic KV cache reuse |
| KV Cache Survey | 2026 | Comprehensive techniques survey |
| VectorQ Semantic Caching | 2025 | Up to 100x latency reduction |
| KV-Compress | 2024 | Variable-head-rate compression |
| vAttention | 2024 | 1.99x throughput over vLLM |
| LazyLLM | 2024 | Dynamic token pruning at prefill |
| SlimInfer | 2025 | 1.88x latency reduction |
| Mirror Speculative Decoding | 2025 | Breaks the serial decoding barrier |
| LongSpec | 2025 | Constant-memory speculative decoding |

Prompt Optimization

| Paper | Year | Key Result |
| --- | --- | --- |
| APE (Automatic Prompt Engineer) | 2022 | LLMs generate optimal prompts |
| Concise Chain-of-Thought | 2024 | 48.7% shorter, negligible quality loss |
| Chain of Draft | 2025 | Only 7.6% of CoT tokens used |
| Semantic Compression | 2023 | Semantic compression with LLMs |

Community Resources

Related Projects

  • LLM Safe Haven - Security toolkit for AI coding agents. npx llm-safe-haven hardens Claude Code, Cursor, and Windsurf in 60 seconds. Companion project: agent retries after security failures waste tokens.

Blogs

Discussions

Podcasts


License

CC BY 4.0

This work is licensed under Creative Commons Attribution 4.0 International.
