A curated list of strategies, tools, papers, and resources for reducing LLM token costs and improving efficiency in production.
Building with LLMs is expensive. An agent processing 10 reasoning steps can consume 50K-100K tokens per task. This list collects everything you need to cut costs by 80-99% without sacrificing quality.
- Quick Wins
- Prompt Caching
- Batch APIs
- Model Routing
- Prompt Compression
- Context Window Management
- KV Cache Optimization
- Browser Tool Efficiency
- Cost Tracking Tools
- Pricing Comparison
- Prompt Engineering for Efficiency
- Comprehensive Guides
- Academic Papers
- Community Resources
The highest-impact strategies ranked by effort-to-savings ratio:
| Strategy | Savings | Effort | Link |
|---|---|---|---|
| Prompt caching | 90% input tokens | Add cache headers | Anthropic |
| Token-efficient tool use | 70% output reduction | Flip a flag | Anthropic |
| Batch API | 50% | Queue non-urgent work | Anthropic |
| Model routing | 60-95% | Route by task complexity | RouteLLM |
| Response caching | 100% on repeats | Add a cache layer | Redis guide |
| Prompt compression | 5-20x | Use LLMLingua | GitHub |
Combined pipeline: Cache prefix (90%) + route to cheapest model (60-95%) + batch non-urgent (50%) + compress prompts (5-20x) + cache responses (100% on repeats) = 95-99% cost reduction vs. naive approach.
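To see how these levers compound, here is a back-of-the-envelope sketch; every percentage is an illustrative assumption for one hypothetical workload, not a guaranteed rate.

```python
# Illustrative only: each optimization multiplies whatever is left of the bill.
cost = 1.00                  # baseline spend for a naive pipeline
cost *= 1 - 0.90 * 0.80      # prompt caching: 90% off the ~80% of input tokens in a shared prefix
cost *= 1 - 0.75             # routing: most calls handled by a model ~4x cheaper on average
cost *= 1 - 0.50             # batch API: 50% discount on non-urgent work
print(f"{cost:.3f} of baseline, i.e. ~{(1 - cost) * 100:.0f}% reduction")  # ~96% here
```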
Reuse previously-processed prompt prefixes to avoid re-computing the same tokens.
- Anthropic Prompt Caching - 90% discount, 5min/1hr TTL, min 1,024 tokens.
- Anthropic Caching Announcement - Blog post explaining economics.
- Anthropic Token-Saving Updates - Cache-aware rate limits, simplified caching.
- Anthropic Extended Thinking + Caching - Thinking blocks get cached in tool-use loops.
- OpenAI Prompt Caching - 50% discount, automatic for 1024+ token prompts.
- OpenAI Prompt Caching Cookbook - Advanced techniques with code.
- Google Gemini Context Caching - Implicit (auto) and explicit caching, 90% discount.
- Google Vertex AI Caching - Enterprise context caching.
- DeepSeek KV Cache - Disk-based, 64-token granularity, 90% savings.
Structure prompts so the system prompt and user profile form the first ~2,000 tokens; every subsequent call then reuses this cached prefix. For bulk operations (e.g., scoring 50 items): 1 call at full price + 49 at ~10% ≈ 88% total savings on input tokens.
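A minimal sketch of this pattern with the Anthropic Python SDK; the model name is a placeholder and `SHARED_PREFIX` stands in for the stable system prompt + user profile.

```python
import anthropic

client = anthropic.Anthropic()

SHARED_PREFIX = "..."  # ~2,000-token system prompt + user profile, identical on every call

def score(item: str):
    # cache_control asks the API to cache everything up to and including this block;
    # later calls that reuse the exact same prefix read it back at the discounted rate.
    return client.messages.create(
        model="claude-sonnet-4-5",  # placeholder: any prompt-caching-capable model
        max_tokens=512,
        system=[{
            "type": "text",
            "text": SHARED_PREFIX,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": f"Score this item against the rubric: {item}"}],
    )
```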
50% discounts for non-time-critical requests; combine with caching for up to 95% savings. A minimal submission sketch follows the links below.
- Anthropic Message Batches - Up to 10,000 requests, 24hr turnaround.
- Anthropic Batches Announcement - Use cases and GA details.
- OpenAI Batch API - 50% discount, 50K requests per file.
- OpenAI Batch API FAQ - Limits and behavior.
- Google Gemini Batch API - 50% discount, combinable with context caching.
- Google Vertex Batch Prediction - Enterprise batch.
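The OpenAI flow, sketched below with placeholder model and prompts: build a JSONL file of requests, upload it, then create a batch with a 24-hour completion window.

```python
import json
from openai import OpenAI

client = OpenAI()

# One JSONL line per request; custom_id lets you match results back to inputs.
with open("batch_input.jsonl", "w") as f:
    for i in range(100):
        f.write(json.dumps({
            "custom_id": f"task-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",  # placeholder: any batch-eligible model
                "messages": [{"role": "user", "content": f"Summarize document {i}."}],
            },
        }) + "\n")

batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results within 24 hours at the batch discount
)
print(batch.id, batch.status)
```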
Route simple tasks to cheaper models; 80% of typical LLM calls don't need the most expensive model. A toy complexity-based router is sketched after this list.
- RouteLLM - Open-source LLM router by LMSYS. Trains routers from preference data; 2x+ cost reduction.
- LiteLLM - SDK + proxy for 100+ LLMs with routing, cost tracking. Strategies: least-busy, cost-based, latency-based.
- NotDiamond - Per-query best-model selection.
- Bifrost - 50x faster than LiteLLM; adaptive load balancer, 1000+ models.
- OpenRouter - Unified API for 300+ models with auto-router.
- Martian Router - Patent-pending; cuts costs 20-97% via "Model Mapping".
- Awesome AI Model Routing - Comprehensive list of routing approaches.
- RouteLLM paper - LMSYS blog on cost-quality tradeoffs.
- Cascade Routing - Combined routing + cascading; +14% over individual strategies.
- Dynamic Routing Survey (2026) - Comprehensive survey.
- IBM LLM Routers - IBM's research on training routers.
- LLM Routing Explained - Intuitive guide.
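For intuition, the toy router below routes by a crude complexity guess; the model names, signals, and length threshold are illustrative assumptions, and the tools above replace this heuristic with learned or configurable policies.

```python
CHEAP_MODEL = "gpt-4o-mini"   # placeholder names, not recommendations
STRONG_MODEL = "gpt-4o"

HARD_SIGNALS = ("prove", "refactor", "multi-step", "edge case", "trade-off")

def pick_model(prompt: str) -> str:
    # Send long or hard-looking prompts to the strong model, everything else to the cheap one.
    looks_hard = len(prompt) > 2000 or any(s in prompt.lower() for s in HARD_SIGNALS)
    return STRONG_MODEL if looks_hard else CHEAP_MODEL
```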
Reduce prompt size while preserving information quality.
- LLMLingua - Up to 20x compression via a coarse-to-fine iterative method; integrates with LangChain/LlamaIndex (usage sketch after this list).
- Headroom - Routes JSON/code/text to specialized compressors.
- code2prompt - Codebase to LLM prompt with token counting.
- LLMLingua paper (EMNLP'23) - Budget controller + token-level iterative compression.
- LLMLingua-2 (ACL'24) - BERT encoder via GPT-4 distillation; 3-6x faster.
- LongLLMLingua - Long context extension; 21.4% boost with 4x fewer tokens.
- Selective Context - Self-information pruning; 50% context reduction.
- RECOMP - Extractive + abstractive compressors; 5% token ratio.
- 500xCompressor - Extreme: contexts down to a single token (6-480x ratios).
- LoPace - Lossless; 72.2% savings with 100% reconstruction.
- SCOPE - Training-free generative rewriting.
- Prompt Compression Survey - Comprehensive survey of all techniques.
- CompactPrompt - Unified prompt + data compression pipeline.
- Efficient Prompting Survey - Survey of efficient prompting methods.
- LLMLingua Research Blog - Microsoft Research deep dive.
- Prompt Compression Tutorial (FreeCodeCamp) - Practical guide with code.
- Prompt Compression Overview (MLM) - 6x to 480x compression ratios.
- Awesome LLM Compression - Curated paper list.
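Usage sketch for LLMLingua-2, adapted from the project README; the model name and keyword arguments are assumptions that may drift, so check the repo for the current API.

```python
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

long_context = open("report.txt").read()
result = compressor.compress_prompt(long_context, rate=0.33, force_tokens=["\n", "?"])
prompt = f"{result['compressed_prompt']}\n\nQuestion: What were the key decisions?"
```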
Rule-based lossless distillation can reach 3-4:1 compression without calling any model (minimal sketch after this list):
- Strip: prose transitions, hedging, rhetoric, common knowledge
- Preserve: numbers, entities, decisions, constraints, risks
- Transform: prose to dense bullets; verbose to semicolon-joined
- Split: 3,000-5,000 token self-contained sections, loadable independently
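A minimal sketch of such a rule-based pass; the filler phrases and keep-keywords are illustrative and would be tuned per domain.

```python
import re

FILLER = re.compile(
    r"\b(as we all know|it is worth noting that|in other words|needless to say|basically)\b,?\s*",
    re.IGNORECASE,
)
KEEP = re.compile(r"\d|\b(must|should not|risk|constraint|deadline|decided|blocked)\b", re.IGNORECASE)

def distill(text: str) -> str:
    text = FILLER.sub("", text)  # strip hedging and transition phrases
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # keep only sentences carrying numbers, decisions, constraints, or risks; emit dense bullets
    return "\n".join(f"- {s.rstrip('.')}" for s in sentences if KEEP.search(s))
```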
- Context Rot (Chroma) - LLMs degrade well before context limits; tested 18 models. GitHub toolkit.
- Lost in the Middle - Seminal 2023 finding: models struggle with info in the middle.
- RAG vs Long Context - RAG wins for dialogue-based queries; long context wins for QA.
- RAG vs Long Context (Elastic) - RAG is 1250x cheaper for many queries.
- Long Context RAG (Databricks) - Degradation after 32K-64K tokens.
- Self-Route Hybrid (2024) - Proposes combining RAG and long context adaptively.
- InfiniteICL - 90% context reduction, 103% of full-context performance.
- Context Extension Survey - All context extension techniques surveyed.
- Anthropic Long Context Tips - Place docs at top, use XML tags.
- Anthropic Context Windows - How context works, server-side compaction.
- Anthropic Context Engineering - Finding the smallest high-signal token set.
- Anthropic Long-Running Agents - Managing context across extended workflows.
- Pinecone Chunking Guide - Fixed-length, semantic, hierarchical (a baseline chunker is sketched after this list).
- Advanced Chunking (Galileo) - Agentic and LLM-based.
- Context Engineering Guide - Curated papers and tools.
- Efficient Context Management (JetBrains) - Observation masking vs summarization.
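As a baseline for the chunking strategies above, a fixed-length chunker with overlap; the sizes are illustrative, and semantic or hierarchical chunking splits on document structure instead.

```python
def chunk(text: str, size: int = 1200, overlap: int = 200) -> list[str]:
    """Naive fixed-length chunking by characters, with overlap so that facts
    straddling a boundary still appear whole in at least one chunk."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```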
Server-side optimizations for inference efficiency.
- vLLM - PagedAttention, high-throughput inference (prefix-caching sketch after this list).
- SGLang - RadixAttention for automatic KV cache reuse.
- GPUStack - GPU cluster manager for vLLM/SGLang.
- NVIDIA kvpress - KV cache compression made easy.
- R-KV - Redundancy-aware compression (NeurIPS 2025).
- llm-compressor - Compression for deployment with vLLM.
- NVIDIA Model Optimizer - Quantization, pruning, distillation, speculative decoding.
- TurboQuant - Google's ICLR 2026; 5x KV cache compression.
- aibrix - Cost-efficient infrastructure for GenAI inference.
- PagedAttention (vLLM) - Foundational; near-zero waste in KV cache memory.
- RadixAttention (SGLang) - Automatic KV cache reuse via radix tree.
- KV Cache Optimization Survey (2026) - Comprehensive survey.
- KV-Compress - Variable-head-rate compression, PagedAttention compatible.
- vAttention - Up to 1.99x decode throughput over vLLM.
- Semantic Prompt Caching (VectorQ) - Up to 100x latency reduction.
- Speculative Sampling - Fast inference via speculative decoding.
- Awesome KV Cache Compression - Must-read paper list.
- mini-sglang - Learn LLM serving internals.
- tiny-llm - Build a tiny vLLM on Apple Silicon.
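A minimal local serving sketch using vLLM's prefix caching; the model name is a placeholder, and SGLang's RadixAttention achieves a similar reuse automatically.

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching lets vLLM reuse KV-cache blocks for the identical prefix
# shared by all of these prompts instead of recomputing it per request.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

shared_prefix = "You are a support-triage assistant. Rules: ...\n"  # long, identical prefix
prompts = [shared_prefix + f"Ticket {i}: cannot log in after password reset." for i in range(8)]

for out in llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=128)):
    print(out.outputs[0].text.strip()[:80])
```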
Different browser automation approaches consume vastly different context.
| Agent | Output Size | Efficiency | Link |
|---|---|---|---|
| WebFetch | ~1.5 KB (AI-summarized) | 20x better | Docs |
| Playwright MCP | ~10-33 KB (accessibility tree) | Baseline | GitHub |
| Agent Browser | ~28 KB (accessibility tree) | Similar | GitHub |
| Lightpanda | ~16 KB (raw markdown) | 2x better | GitHub |
For 10-page workflows: WebFetch = ~15 KB vs. Playwright = up to ~330 KB of total context consumed.
The accessibility tree strips visual styling and retains only semantic structure (name, role, state, value), making it 10-50x smaller than raw HTML; an illustration follows the links below. See: Token cost analysis in browser MCPs.
- WebFetch vs WebSearch analysis - Deep comparison.
- browser-use - Foundation library for AI browser agents.
- Chrome full accessibility tree - DevTools feature.
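For intuition, an illustrative (made-up) example of how much an accessibility-tree node strips away compared to the raw HTML it represents:

```python
# Raw HTML for one control (styling, test hooks, and handler noise):
#   <button class="btn btn-primary px-4 py-2 rounded-lg shadow-sm" data-testid="atc"
#           onclick="addToCart('sku-123')" aria-label="Add to cart">Add to cart</button>
# Accessibility-tree view of the same control: just name, role, state, and value.
node = {"role": "button", "name": "Add to cart", "disabled": False, "value": None}
```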
- Langfuse - Open-source LLM observability + cost tracking. Cost tracking docs.
- Helicone - LLM observability, 300+ models, SOC 2. Cost tracking cookbook.
- LiteLLM - SDK + proxy with spend tracking and budget routing.
- tokencost - USD cost estimates for 400+ LLMs.
- AgentOps - Agent monitoring with LLM cost tracking.
- Helicone AI Gateway - Fastest open-source AI gateway (Rust).
- Anthropic Token Counter - Free pre-flight token counting endpoint.
- tiktoken - OpenAI's fast BPE tokenizer (Python/Rust), 3-6x faster than comparable open-source tokenizers (counting sketch after this list).
- LangSmith Cost Tracking - Automatic recording with dashboards.
- LlamaIndex Cost Analysis - Estimate costs before calls.
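A pre-flight counting sketch with tiktoken, which covers OpenAI tokenizers only; Anthropic's token-counting endpoint linked above plays the same role for Claude.

```python
import tiktoken

try:
    enc = tiktoken.encoding_for_model("gpt-4o")   # needs a tiktoken release that knows the model
except KeyError:
    enc = tiktoken.get_encoding("o200k_base")     # fall back to a named encoding

prompt = "Summarize the incident report below ..."
print(len(enc.encode(prompt)), "input tokens")
```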
- Price Per Token - Daily-updated, 300+ models.
- Artificial Analysis Calculator - Free calculator, 100+ models.
- Artificial Analysis Leaderboard - Quality + price + speed.
- Simon Willison's LLM Prices - Interactive calculator.
- Helicone LLM Cost Comparison - 300+ model calculator.
- CostGoat - 302+ APIs from 10+ providers.
- Langtail - Side-by-side comparison.
- WhatLLM - 256 models, 43+ providers, weekly updates.
- Anthropic | OpenAI | Google Gemini | DeepSeek | Mistral
- Anthropic Prompt Engineering - Master guide.
- Anthropic Claude 4 Best Practices - Model-specific.
- Anthropic Interactive Tutorial - 9-chapter course.
- Anthropic Tool Search - 85% token reduction for large tool libraries.
- OpenAI Prompt Engineering - Strategies and tactics.
- OpenAI Cost Optimization - Input minimization, model selection, caching.
- OpenAI Optimization Cookbook - Collection of notebooks.
- Token-Efficient Tool Use (Anthropic) - 70% output token reduction.
- PromptingGuide: Optimizing - Compression, abstraction, filtering.
- Prompt Bloat Impact (MLOps) - Quality degrades with bloat.
- Concise Chain-of-Thought - 48.7% shorter responses, negligible quality loss.
- Chain of Draft - Only 7.6% of CoT tokens while matching accuracy (prompt contrast sketched after this list).
- Token Complexity - Each task has intrinsic minimum tokens for success.
- Verbosity != Veracity - Demystifying verbosity in LLM outputs.
- Incorporating Token Usage - Token usage as prompting strategy metric.
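In the spirit of Concise Chain-of-Thought and Chain of Draft above, the contrast below shows the kind of instruction change involved; the exact wording is an assumption, not copied from the papers.

```python
VERBOSE_COT = "Think step by step and explain your reasoning in full detail."
CONCISE_COT = (
    "Think step by step, but keep each step to a short draft of at most five words. "
    "Finish with 'Answer:' followed by only the final answer."
)
```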
- 8 Strategies to Cut API Spend 80% (2026)
- Redis Token Optimization - Semantic caching, ~73% cost reduction.
- How I Reduced Token Costs by 90%
- LLM Token Optimization Strategies
- Monitor and Cut LLM Costs 90% (Helicone)
- LLM Caching Strategies (CostLens) - "90% savings most developers don't know about"
- AI Agent Cost Optimization - 60-70% of agent calls suit small models.
- Practical Cost + Latency Reduction
- Vantage LLM Cost Guide - Enterprise monitoring.
- Semantic Highlight for RAG (Zilliz) - 70-80% token reduction.
- Optimizing LLM in Production (HuggingFace) - Quantization, Flash Attention.
- HuggingFace Inference Optimization - Transformers library.
| Paper | Year | Key Result |
|---|---|---|
| Prompt Compression Survey | 2024 | Comprehensive survey of all techniques |
| LLMLingua | 2023 | Up to 20x compression (EMNLP) |
| LLMLingua-2 | 2024 | 3-6x faster via BERT distillation (ACL) |
| LongLLMLingua | 2023 | 4x fewer tokens in long contexts |
| Selective Context | 2023 | 50% reduction via self-information pruning |
| RECOMP | 2023 | 5% token ratio for retrieved docs |
| 500xCompressor | 2024 | 6-480x compression ratios |
| LoPace | 2026 | Lossless; 72.2% savings |
| SCOPE | 2025 | Training-free generative rewriting |
| Dynamic Compressing | 2025 | MDP-based adaptive token removal |
| Empirical Study | 2025 | Benchmarks 6 methods across 13 datasets |
| Paper | Year | Key Result |
|---|---|---|
| FrugalGPT | 2023 | Seminal cascade paper; up to 98% cost reduction |
| RouteLLM | 2024 | 2x+ cost reduction without quality loss |
| Hybrid LLM | 2024 | 40% fewer calls to large model |
| Unified Routing + Cascading | 2024 | +14% over individual strategies |
| Dynamic Routing Survey | 2026 | Comprehensive survey |
| Pay for Hints | 2026 | Small model gets hints, not full answers |
| Paper | Year | Key Result |
|---|---|---|
| Lost in the Middle | 2023 | Models struggle with mid-context info |
| Context Rot | 2025 | Degradation before context limits |
| RAG vs Long Context | 2025 | Complementary strengths by query type |
| Self-Route Hybrid | 2024 | Adaptive RAG + long context |
| InfiniteICL | 2025 | 90% reduction, 103% performance |
| YaRN Context Extension | 2023 | 10x fewer tokens for context extension |
| SkyLadder | 2025 | 22% training time savings |
| TRIM | 2024 | 19.4% token savings on GPT-4o |
| Paper | Year | Key Result |
|---|---|---|
| PagedAttention (vLLM) | 2023 | Near-zero KV cache waste |
| RadixAttention (SGLang) | 2023 | Auto KV cache reuse |
| KV Cache Survey (2026) | 2026 | Comprehensive techniques survey |
| VectorQ Semantic Caching | 2025 | Up to 100x latency reduction |
| KV-Compress | 2024 | Variable-head-rate compression |
| vAttention | 2024 | 1.99x throughput over vLLM |
| LazyLLM | 2024 | Dynamic token pruning at prefill |
| SlimInfer | 2025 | 1.88x latency reduction |
| Mirror Speculative Decoding | 2025 | Breaks serial barrier |
| LongSpec | 2025 | Constant memory speculative decoding |
| Paper | Year | Key Result |
|---|---|---|
| APE (Automatic Prompt Engineer) | 2022 | LLMs generate optimal prompts |
| Concise Chain-of-Thought | 2024 | 48.7% shorter, negligible quality loss |
| Chain of Draft | 2025 | Only 7.6% of CoT tokens used |
| Semantic Compression | 2023 | Semantic compression with LLMs |
- LLM Safe Haven - Security toolkit for AI coding agents.
`npx llm-safe-haven` hardens Claude Code, Cursor, and Windsurf in 60 seconds. Companion project; agent retries after security failures waste tokens.
- Simon Willison: LLM Pricing - Ongoing coverage of cost collapse.
- Simon Willison: LLMs in 2024 - MoE efficiency, cost trends.
- Eugene Yan: LLM Patterns - Caching (50%+ savings), fine-tuning, RAG, guardrails.
- Chip Huyen: AI OSS Analysis - 900 most popular AI tools analyzed.
- goodailist.com - Daily-updated tracker of 15K+ AI repos.
- HN: How Are You Handling LLM API Costs in Production?
- HN: The LLM Agent Cost Curve
- HN: Genosis - LLM Cost Optimization
- Latent Space: Artificial Analysis - The "smiling curve of AI costs".
This work is licensed under Creative Commons Attribution 4.0 International.