A curated list of strategies, tools, papers, and resources for reducing LLM token costs and improving efficiency in production.
Building with LLMs is expensive. An agent processing 10 reasoning steps can consume 50K-100K tokens per task. This list collects everything you need to cut costs by 80-99% without sacrificing quality.
- Quick Wins
- Prompt Caching
- Batch APIs
- Model Routing
- Prompt Compression
- Context Window Management
- KV Cache Optimization
- Browser Tool Efficiency
- Cost Tracking Tools
- Pricing Comparison
- Prompt Engineering for Efficiency
- Comprehensive Guides
- Academic Papers
- Community Resources
The highest-impact strategies ranked by effort-to-savings ratio:
| Strategy | Savings | Effort | Link |
|---|---|---|---|
| Prompt caching | 90% input tokens | Add cache headers | Anthropic |
| Token-efficient tool use | 70% output reduction | Flip a flag | Anthropic |
| Batch API | 50% | Queue non-urgent work | Anthropic |
| Model routing | 60-95% | Route by task complexity | RouteLLM |
| Response caching | 100% on repeats | Add a cache layer | Redis guide |
| Prompt compression | 5-20x | Use LLMLingua | GitHub |
Combined pipeline: Cache prefix (90%) + route to cheapest model (60-95%) + batch non-urgent (50%) + compress prompts (5-20x) + cache responses (100% on repeats) = 95-99% cost reduction vs. naive approach.
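To see how these levers compound, here is a back-of-the-envelope sketch; every percentage is an illustrative assumption for one hypothetical workload, not a guaranteed rate.

```python
# Illustrative only: each optimization multiplies whatever is left of the bill.
cost = 1.00                  # baseline spend for a naive pipeline
cost *= 1 - 0.90 * 0.80      # prompt caching: 90% off the ~80% of input tokens in a shared prefix
cost *= 1 - 0.75             # routing: most calls handled by a model ~4x cheaper on average
cost *= 1 - 0.50             # batch API: 50% discount on non-urgent work
print(f"{cost:.3f} of baseline, i.e. ~{(1 - cost) * 100:.0f}% reduction")  # ~96% here
```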
Reuse previously-processed prompt prefixes to avoid re-computing the same tokens.
- Anthropic Prompt Caching - 90% discount, 5min/1hr TTL, min 1,024 tokens.
- Anthropic Caching Announcement - Blog post explaining economics.
- Anthropic Token-Saving Updates - Cache-aware rate limits, simplified caching.
- Anthropic Extended Thinking + Caching - Thinking blocks get cached in tool-use loops.
- OpenAI Prompt Caching - 50% discount, automatic for 1024+ token prompts.
- OpenAI Prompt Caching Cookbook - Advanced techniques with code.
- Google Gemini Context Caching - Implicit (auto) and explicit caching, 90% discount.
- Google Vertex AI Caching - Enterprise context caching.
- DeepSeek KV Cache - Disk-based, 64-token granularity, 90% savings.
Structure prompts so the system prompt and user profile form the first ~2,000 tokens; every subsequent call then reuses this cached prefix. For bulk operations (e.g., scoring 50 items): 1 call at full price + 49 at ~10% ≈ 88% total savings on input tokens.
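A minimal sketch of this pattern with the Anthropic Python SDK; the model name is a placeholder and `SHARED_PREFIX` stands in for the stable system prompt + user profile.

```python
import anthropic

client = anthropic.Anthropic()

SHARED_PREFIX = "..."  # ~2,000-token system prompt + user profile, identical on every call

def score(item: str):
    # cache_control asks the API to cache everything up to and including this block;
    # later calls that reuse the exact same prefix read it back at the discounted rate.
    return client.messages.create(
        model="claude-sonnet-4-5",  # placeholder: any prompt-caching-capable model
        max_tokens=512,
        system=[{
            "type": "text",
            "text": SHARED_PREFIX,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": f"Score this item against the rubric: {item}"}],
    )
```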
50% discounts for non-time-critical requests; combine with caching for up to 95% savings. A minimal submission sketch follows the links below.
- Anthropic Message Batches - Up to 10,000 requests, 24hr turnaround.
- Anthropic Batches Announcement - Use cases and GA details.
- OpenAI Batch API - 50% discount, 50K requests per file.
- OpenAI Batch API FAQ - Limits and behavior.
- Google Gemini Batch API - 50% discount, combinable with context caching.
- Google Vertex Batch Prediction - Enterprise batch.
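The OpenAI flow, sketched below with placeholder model and prompts: build a JSONL file of requests, upload it, then create a batch with a 24-hour completion window.

```python
import json
from openai import OpenAI

client = OpenAI()

# One JSONL line per request; custom_id lets you match results back to inputs.
with open("batch_input.jsonl", "w") as f:
    for i in range(100):
        f.write(json.dumps({
            "custom_id": f"task-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",  # placeholder: any batch-eligible model
                "messages": [{"role": "user", "content": f"Summarize document {i}."}],
            },
        }) + "\n")

batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results within 24 hours at the batch discount
)
print(batch.id, batch.status)
```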
Route simple tasks to cheaper models; 80% of typical LLM calls don't need the most expensive model. A toy complexity-based router is sketched after this list.
- RouteLLM - Open-source LLM router by LMSYS. Trains routers from preference data; 2x+ cost reduction.
- LiteLLM - SDK + proxy for 100+ LLMs with routing, cost tracking. Strategies: least-busy, cost-based, latency-based.
- NotDiamond - Per-query best-model selection.
- Bifrost - 50x faster than LiteLLM; adaptive load balancer, 1000+ models.
- OpenRouter - Unified API for 300+ models with auto-router.
- Martian Router - Patent-pending; cuts costs 20-97% via "Model Mapping".
- Awesome AI Model Routing - Comprehensive list of routing approaches.
- RouteLLM paper - LMSYS blog on cost-quality tradeoffs.
- Cascade Routing - Combined routing + cascading; +14% over individual strategies.
- Dynamic Routing Survey (2026) - Comprehensive survey.
- IBM LLM Routers - IBM's research on training routers.
- LLM Routing Explained - Intuitive guide.
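For intuition, the toy router below routes by a crude complexity guess; the model names, signals, and length threshold are illustrative assumptions, and the tools above replace this heuristic with learned or configurable policies.

```python
CHEAP_MODEL = "gpt-4o-mini"   # placeholder names, not recommendations
STRONG_MODEL = "gpt-4o"

HARD_SIGNALS = ("prove", "refactor", "multi-step", "edge case", "trade-off")

def pick_model(prompt: str) -> str:
    # Send long or hard-looking prompts to the strong model, everything else to the cheap one.
    looks_hard = len(prompt) > 2000 or any(s in prompt.lower() for s in HARD_SIGNALS)
    return STRONG_MODEL if looks_hard else CHEAP_MODEL
```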
Reduce prompt size while preserving information quality.
- LLMLingua - Up to 20x compression via a coarse-to-fine iterative method; integrates with LangChain/LlamaIndex (usage sketch after this list).
- Headroom - Routes JSON/code/text to specialized compressors.
- code2prompt - Codebase to LLM prompt with token counting.
- LLMLingua paper (EMNLP'23) - Budget controller + token-level iterative compression.
- LLMLingua-2 (ACL'24) - BERT encoder via GPT-4 distillation; 3-6x faster.
- LongLLMLingua - Long context extension; 21.4% boost with 4x fewer tokens.
- Selective Context - Self-information pruning; 50% context reduction.
- RECOMP - Extractive + abstractive compressors; 5% token ratio.
- 500xCompressor - Extreme: contexts down to a single token (6-480x ratios).
- LoPace - Lossless; 72.2% savings with 100% reconstruction.
- SCOPE - Training-free generative rewriting.
- Prompt Compression Survey - Comprehensive survey of all techniques.
- CompactPrompt - Unified prompt + data compression pipeline.
- Efficient Prompting Survey - Survey of efficient prompting methods.
- LLMLingua Research Blog - Microsoft Research deep dive.
- Prompt Compression Tutorial (FreeCodeCamp) - Practical guide with code.
- Prompt Compression Overview (MLM) - 6x to 480x compression ratios.
- Awesome LLM Compression - Curated paper list.
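Usage sketch for LLMLingua-2, adapted from the project README; the model name and keyword arguments are assumptions that may drift, so check the repo for the current API.

```python
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

long_context = open("report.txt").read()
result = compressor.compress_prompt(long_context, rate=0.33, force_tokens=["\n", "?"])
prompt = f"{result['compressed_prompt']}\n\nQuestion: What were the key decisions?"
```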
Rule-based lossless distillation can reach 3-4:1 compression without calling any model (minimal sketch after this list):
- Strip: prose transitions, hedging, rhetoric, common knowledge
- Preserve: numbers, entities, decisions, constraints, risks
- Transform: prose to dense bullets; verbose to semicolon-joined
- Split: 3,000-5,000 token self-contained sections, loadable independently
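A minimal sketch of such a rule-based pass; the filler phrases and keep-keywords are illustrative and would be tuned per domain.

```python
import re

FILLER = re.compile(
    r"\b(as we all know|it is worth noting that|in other words|needless to say|basically)\b,?\s*",
    re.IGNORECASE,
)
KEEP = re.compile(r"\d|\b(must|should not|risk|constraint|deadline|decided|blocked)\b", re.IGNORECASE)

def distill(text: str) -> str:
    text = FILLER.sub("", text)  # strip hedging and transition phrases
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # keep only sentences carrying numbers, decisions, constraints, or risks; emit dense bullets
    return "\n".join(f"- {s.rstrip('.')}" for s in sentences if KEEP.search(s))
```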
- Context Rot (Chroma) - LLMs degrade well before context limits; tested 18 models. GitHub toolkit.
- Lost in the Middle - Seminal 2023 finding: models struggle with info in the middle.
- RAG vs Long Context - RAG wins for dialogue-based queries; long context wins for QA.
- RAG vs Long Context (Elastic) - RAG is 1250x cheaper for many queries.
- Long Context RAG (Databricks) - Degradation after 32K-64K tokens.
- Self-Route Hybrid (2024) - Proposes combining RAG and long context adaptively.
- InfiniteICL - 90% context reduction, 103% of full-context performance.
- Context Extension Survey - All context extension techniques surveyed.
- Anthropic Long Context Tips - Place docs at top, use XML tags.
- Anthropic Context Windows - How context works, server-side compaction.
- Anthropic Context Engineering - Finding the smallest high-signal token set.
- Anthropic Long-Running Agents - Managing context across extended workflows.
- Pinecone Chunking Guide - Fixed-length, semantic, hierarchical (a baseline chunker is sketched after this list).
- Advanced Chunking (Galileo) - Agentic and LLM-based.
- Context Engineering Guide - Curated papers and tools.
- Efficient Context Management (JetBrains) - Observation masking vs summarization.
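As a baseline for the chunking strategies above, a fixed-length chunker with overlap; the sizes are illustrative, and semantic or hierarchical chunking splits on document structure instead.

```python
def chunk(text: str, size: int = 1200, overlap: int = 200) -> list[str]:
    """Naive fixed-length chunking by characters, with overlap so that facts
    straddling a boundary still appear whole in at least one chunk."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```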
Server-side optimizations for inference efficiency.
- vLLM - PagedAttention, high-throughput inference (prefix-caching sketch after this list).
- SGLang - RadixAttention for automatic KV cache reuse.
- GPUStack - GPU cluster manager for vLLM/SGLang.
- NVIDIA kvpress - KV cache compression made easy.
- R-KV - Redundancy-aware compression (NeurIPS 2025).
- llm-compressor - Compression for deployment with vLLM.
- NVIDIA Model Optimizer - Quantization, pruning, distillation, speculative decoding.
- TurboQuant - Google's ICLR 2026; 5x KV cache compression.
- aibrix - Cost-efficient infrastructure for GenAI inference.
- PagedAttention (vLLM) - Foundational; near-zero waste in KV cache memory.
- RadixAttention (SGLang) - Automatic KV cache reuse via radix tree.
- KV Cache Optimization Survey (2026) - Comprehensive survey.
- KV-Compress - Variable-head-rate compression, PagedAttention compatible.
- vAttention - Up to 1.99x decode throughput over vLLM.
- Semantic Prompt Caching (VectorQ) - Up to 100x latency reduction.
- Speculative Sampling - Fast inference via speculative decoding.
- Awesome KV Cache Compression - Must-read paper list.
- mini-sglang - Learn LLM serving internals.
- tiny-llm - Build a tiny vLLM on Apple Silicon.
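A minimal local serving sketch using vLLM's prefix caching; the model name is a placeholder, and SGLang's RadixAttention achieves a similar reuse automatically.

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching lets vLLM reuse KV-cache blocks for the identical prefix
# shared by all of these prompts instead of recomputing it per request.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

shared_prefix = "You are a support-triage assistant. Rules: ...\n"  # long, identical prefix
prompts = [shared_prefix + f"Ticket {i}: cannot log in after password reset." for i in range(8)]

for out in llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=128)):
    print(out.outputs[0].text.strip()[:80])
```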
Different browser automation approaches consume vastly different context.
| Agent | Output Size | Efficiency | Link |
|---|---|---|---|
| WebFetch | ~1.5 KB (AI-summarized) | 20x better | Docs |
| Playwright MCP | ~10-33 KB (accessibility tree) | Baseline | GitHub |
| Agent Browser | ~28 KB (accessibility tree) | Similar | GitHub |
| Lightpanda | ~16 KB (raw markdown) | 2x better | GitHub |
For 10-page workflows: WebFetch = ~15 KB vs. Playwright = up to ~330 KB of total context consumed.
The accessibility tree strips visual styling and retains only semantic structure (name, role, state, value), making it 10-50x smaller than raw HTML; an illustration follows the links below. See: Token cost analysis in browser MCPs.
- WebFetch vs WebSearch analysis - Deep comparison.
- browser-use - Foundation library for AI browser agents.
- Chrome full accessibility tree - DevTools feature.
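For intuition, an illustrative (made-up) example of how much an accessibility-tree node strips away compared to the raw HTML it represents:

```python
# Raw HTML for one control (styling, test hooks, and handler noise):
#   <button class="btn btn-primary px-4 py-2 rounded-lg shadow-sm" data-testid="atc"
#           onclick="addToCart('sku-123')" aria-label="Add to cart">Add to cart</button>
# Accessibility-tree view of the same control: just name, role, state, and value.
node = {"role": "button", "name": "Add to cart", "disabled": False, "value": None}
```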
- Langfuse - Open-source LLM observability + cost tracking. Cost tracking docs.
- Helicone - LLM observability, 300+ models, SOC 2. Cost tracking cookbook.
- LiteLLM - SDK + proxy with spend tracking and budget routing.
- tokencost - USD cost estimates for 400+ LLMs.
- AgentOps - Agent monitoring with LLM cost tracking.
- Helicone AI Gateway - Fastest open-source AI gateway (Rust).
- Anthropic Token Counter - Free pre-flight token counting endpoint.
- tiktoken - OpenAI's fast BPE tokenizer (Python/Rust), 3-6x faster than comparable open-source tokenizers (counting sketch after this list).
- LangSmith Cost Tracking - Automatic recording with dashboards.
- LlamaIndex Cost Analysis - Estimate costs before calls.
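A pre-flight counting sketch with tiktoken, which covers OpenAI tokenizers only; Anthropic's token-counting endpoint linked above plays the same role for Claude.

```python
import tiktoken

try:
    enc = tiktoken.encoding_for_model("gpt-4o")   # needs a tiktoken release that knows the model
except KeyError:
    enc = tiktoken.get_encoding("o200k_base")     # fall back to a named encoding

prompt = "Summarize the incident report below ..."
print(len(enc.encode(prompt)), "input tokens")
```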
- Price Per Token - Daily-updated, 300+ models.
- Artificial Analysis Calculator - Free calculator, 100+ models.
- Artificial Analysis Leaderboard - Quality + price + speed.
- Simon Willison's LLM Prices - Interactive calculator.
- Helicone LLM Cost Comparison - 300+ model calculator.
- CostGoat - 302+ APIs from 10+ providers.
- Langtail - Side-by-side comparison.
- WhatLLM - 256 models, 43+ providers, weekly updates.
- Anthropic | OpenAI | Google Gemini | DeepSeek | Mistral
- Anthropic Prompt Engineering - Master guide.
- Anthropic Claude 4 Best Practices - Model-specific.
- Anthropic Interactive Tutorial - 9-chapter course.
- Anthropic Tool Search - 85% token reduction for large tool libraries.
- OpenAI Prompt Engineering - Strategies and tactics.
- OpenAI Cost Optimization - Input minimization, model selection, caching.
- OpenAI Optimization Cookbook - Collection of notebooks.
- Token-Efficient Tool Use (Anthropic) - 70% output token reduction.
- PromptingGuide: Optimizing - Compression, abstraction, filtering.
- Prompt Bloat Impact (MLOps) - Quality degrades with bloat.
- Concise Chain-of-Thought - 48.7% shorter responses, negligible quality loss.
- Chain of Draft - Only 7.6% of CoT tokens while matching accuracy (prompt contrast sketched after this list).
- Token Complexity - Each task has intrinsic minimum tokens for success.
- Verbosity != Veracity - Demystifying verbosity in LLM outputs.
- Incorporating Token Usage - Token usage as prompting strategy metric.
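In the spirit of Concise Chain-of-Thought and Chain of Draft above, the contrast below shows the kind of instruction change involved; the exact wording is an assumption, not copied from the papers.

```python
VERBOSE_COT = "Think step by step and explain your reasoning in full detail."
CONCISE_COT = (
    "Think step by step, but keep each step to a short draft of at most five words. "
    "Finish with 'Answer:' followed by only the final answer."
)
```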
- 8 Strategies to Cut API Spend 80% (2026)
- Redis Token Optimization - Semantic caching, ~73% cost reduction.
- How I Reduced Token Costs by 90%
- LLM Token Optimization Strategies
- Monitor and Cut LLM Costs 90% (Helicone)
- LLM Caching Strategies (CostLens) - "90% savings most developers don't know about"
- AI Agent Cost Optimization - 60-70% of agent calls suit small models.
- Practical Cost + Latency Reduction
- Vantage LLM Cost Guide - Enterprise monitoring.
- Semantic Highlight for RAG (Zilliz) - 70-80% token reduction.
- Optimizing LLM in Production (HuggingFace) - Quantization, Flash Attention.
- HuggingFace Inference Optimization - Transformers library.
| Paper | Year | Key Result |
|---|---|---|
| Prompt Compression Survey | 2024 | Comprehensive survey of all techniques |
| LLMLingua | 2023 | Up to 20x compression (EMNLP) |
| LLMLingua-2 | 2024 | 3-6x faster via BERT distillation (ACL) |
| LongLLMLingua | 2023 | 4x fewer tokens in long contexts |
| Selective Context | 2023 | 50% reduction via self-information pruning |
| RECOMP | 2023 | 5% token ratio for retrieved docs |
| 500xCompressor | 2024 | 6-480x compression ratios |
| LoPace | 2026 | Lossless; 72.2% savings |
| SCOPE | 2025 | Training-free generative rewriting |
| Dynamic Compressing | 2025 | MDP-based adaptive token removal |
| Empirical Study | 2025 | Benchmarks 6 methods across 13 datasets |
| Paper | Year | Key Result |
|---|---|---|
| FrugalGPT | 2023 | Seminal cascade paper; up to 98% cost reduction |
| RouteLLM | 2024 | 2x+ cost reduction without quality loss |
| Hybrid LLM | 2024 | 40% fewer calls to large model |
| Unified Routing + Cascading | 2024 | +14% over individual strategies |
| Dynamic Routing Survey | 2026 | Comprehensive survey |
| Pay for Hints | 2026 | Small model gets hints, not full answers |
| Paper | Year | Key Result |
|---|---|---|
| Lost in the Middle | 2023 | Models struggle with mid-context info |
| Context Rot | 2025 | Degradation before context limits |
| RAG vs Long Context | 2025 | Complementary strengths by query type |
| Self-Route Hybrid | 2024 | Adaptive RAG + long context |
| InfiniteICL | 2025 | 90% reduction, 103% performance |
| YaRN Context Extension | 2023 | 10x fewer tokens for context extension |
| SkyLadder | 2025 | 22% training time savings |
| TRIM | 2024 | 19.4% token savings on GPT-4o |
| Paper | Year | Key Result |
|---|---|---|
| PagedAttention (vLLM) | 2023 | Near-zero KV cache waste |
| RadixAttention (SGLang) | 2023 | Auto KV cache reuse |
| KV Cache Survey (2026) | 2026 | Comprehensive techniques survey |
| VectorQ Semantic Caching | 2025 | Up to 100x latency reduction |
| KV-Compress | 2024 | Variable-head-rate compression |
| vAttention | 2024 | 1.99x throughput over vLLM |
| LazyLLM | 2024 | Dynamic token pruning at prefill |
| SlimInfer | 2025 | 1.88x latency reduction |
| Mirror Speculative Decoding | 2025 | Breaks serial barrier |
| LongSpec | 2025 | Constant memory speculative decoding |
| Paper | Year | Key Result |
|---|---|---|
| APE (Automatic Prompt Engineer) | 2022 | LLMs generate optimal prompts |
| Concise Chain-of-Thought | 2024 | 48.7% shorter, negligible quality loss |
| Chain of Draft | 2025 | Only 7.6% of CoT tokens used |
| Semantic Compression | 2023 | Semantic compression with LLMs |
- LLM Safe Haven - Security toolkit for AI coding agents.
`npx llm-safe-haven` hardens Claude Code, Cursor, and Windsurf in 60 seconds. Companion project; agent retries after security failures waste tokens.
- Simon Willison: LLM Pricing - Ongoing coverage of cost collapse.
- Simon Willison: LLMs in 2024 - MoE efficiency, cost trends.
- Eugene Yan: LLM Patterns - Caching (50%+ savings), fine-tuning, RAG, guardrails.
- Chip Huyen: AI OSS Analysis - 900 most popular AI tools analyzed.
- goodailist.com - Daily-updated tracker of 15K+ AI repos.
- HN: How Are You Handling LLM API Costs in Production?
- HN: The LLM Agent Cost Curve
- HN: Genosis - LLM Cost Optimization
- Latent Space: Artificial Analysis - The "smiling curve of AI costs".
This work is licensed under Creative Commons Attribution 4.0 International.