A list of awesome papers on compression and acceleration of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs).
Continuously updated. Welcome to star and watch!
- [2026.05] We have open-sourced a comprehensive, continuously-updated taxonomy of 400+ papers covering model compression, inference acceleration, and system co-design for efficient large models.
Model-side Compression
Inference-side Acceleration
Training and Fine-tuning Efficiency
System and Hardware Co-design
Evaluation and Applications

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR 2023 | Link | Link |
| 2025 | OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting | ICLR 2025 | Link | Link |
| 2025 | SpinQuant: LLM Quantization with Learned Rotations | ICLR 2025 | Link | Link |
| 2022 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML 2023 | Link | Link |
| 2023 | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | MLSys 2024 | Link | Link |
| 2024 | QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks | ICML 2024 | Link | Link |
| 2025 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | MLSys 2025 | Link | Link |
| 2024 | QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | NeurIPS 2024 | Link | Link |
| 2024 | Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | MLSys 2024 | Link | Link |
| 2024 | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | ICLR 2024 | Link | Link |
| 2023 | QuIP: 2-Bit Quantization of Large Language Models With Guarantees | NeurIPS 2023 | Link | Link |
| 2022 | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | NeurIPS 2022 | Link | Link |
| 2023 | Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | EMNLP 2023 | Link | Link |
| 2025 | GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration | ICML 2025 | Link | Link |
| 2024 | MagR: Weight Magnitude Reduction for Enhancing Post-Training Quantization | NeurIPS 2024 | Link | Link |
| 2024 | AffineQuant: Affine Transformation Quantization for Large Language Models | ICLR 2024 | Link | Link |
| 2024 | LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | ACL 2024 | Link | Link |
| 2024 | BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation | ACL 2024 | Link | Link |
| 2023 | OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models | AAAI 2024 (Oral) | Link | Link |
| 2024 | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | ICLR 2024 | Link | Link |
| 2022 | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | NeurIPS 2022 | Link | Link |
| 2024 | LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models | ICLR 2024 | Link | Link |
| 2024 | OneBit: Towards Extremely Low-bit Large Language Models | NeurIPS 2024 | Link | Link |
| 2023 | LLM-FP4: 4-bit Floating-Point Quantized Transformers | EMNLP 2023 | Link | Link |
| 2024 | FlatQuant: Flatness Matters for LLM Quantization | ICML 2025 | Link | Link |
| 2024 | SqueezeLLM: Dense-and-Sparse Quantization | ICML 2024 | Link | Link |
| 2023 | RPTQ: Reorder-based Post-training Quantization for Large Language Models | | Link | Link |
| 2024 | QQQ: Quality Quattuor-Bit Quantization for Large Language Models | ICLR | Link | Link |
| 2024 | Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs | | Link | Link |
| 2025 | CBQ: Cross-Block Quantization for Large Language Models | ICLR 2025 | Link | N/A |
| 2025 | MambaQuant: Quantizing the Mamba Family with Variance Aligned Rotation Methods | ICLR 2025 | Link | N/A |
| 2025 | SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators | ICLR 2025 | Link | N/A |
| 2025 | Progressive Mixed-Precision Decoding for Efficient LLM Inference | ICLR 2025 | Link | N/A |
| 2025 | Surprising Effectiveness of Pretraining Ternary Language Models at Scale | ICLR 2025 | Link | N/A |
| 2025 | EfficientQAT: Efficient Quantization-Aware Training for Large Language Models | ACL 2025 | Link | Link |
| 2025 | MiniKV: Pushing the Limits of 2-Bit KV Cache via Compression and System Co-Design for Efficient Long Context Inference | ACL 2025 Findings | Link | N/A |
| 2025 | AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations | ACL 2025 | Link | N/A |
| 2025 | Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition | ACL 2025 Findings | Link | N/A |
| 2025 | LittleBit: Ultra Low-Bit Quantization via Latent Factorization | NeurIPS 2025 | Link | Link |
| 2024 | Extreme Compression of Large Language Models via Additive Quantization | ICML 2024 | Link | Link |
| 2024 | BiLLM: Pushing the Limit of Post-Training Quantization for LLMs | ICML 2024 | Link | Link |
| 2024 | LQER: Low-Rank Quantization Error Reconstruction for LLMs | ICML 2024 | Link | N/A |
| 2024 | Evaluating Quantized Large Language Models | ICML 2024 | Link | Link |
| 2024 | QMoE: Sub-1-Bit Compression of Trillion Parameter Models | MLSys 2024 | Link | Link |
| 2024 | DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs | NeurIPS 2024 | Link | Link |
| 2024 | QBB: Quantization with Binary Bases for LLMs | NeurIPS 2024 | Link | N/A |
| 2024 | Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models | NeurIPS 2024 | Link | N/A |
| 2024 | VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models | EMNLP 2024 | Link | Link |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | Q-VLM: Post-training Quantization for Large Vision Language Models | NeurIPS 2024 | Link | Link |
| 2025 | MBQ: Modality-Balanced Quantization for Large Vision-Language Models | CVPR 2025 | Link | Link |
| 2025 | MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization | ACM MM 2025 | Link | Link |
| 2025 | CASP: Compression of Large Multimodal Models Based on Attention Sparsity | CVPR 2025 | Link | Link |
| 2024 | Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation | ACM MM 2024 | Link | N/A |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | A Simple and Effective Pruning Approach for Large Language Models | ICLR 2024 | Link | Link |
| 2024 | BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation | ICLR 2024 | Link | Link |
| 2024 | COPAL: Continual Pruning in Large Language Generative Models | ICML 2024 | Link | N/A |
| 2024 | Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models | ICML 2024 | Link | Link |
| 2025 | BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation | ICML 2025 | Link | N/A |
| 2025 | SAFE: Finding Sparse and Flat Minima to Improve Pruning | ICML 2025 | Link | Link |
| 2025 | SwiftPrune: Hessian-Free Weight Pruning for Large Language Models | EMNLP 2025 Findings | Link | N/A |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ICML 2023 | Link | Link |
| 2023 | Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs | ICLR 2024 | Link | Link |
| 2023 | The LLM Surgeon | ICLR 2024 | Link | Link |
| 2024 | Fast and Optimal Weight Update for Pruned Large Language Models | TMLR 2024 | Link | Link |
| 2024 | Pruning Foundation Models for High Accuracy without Retraining | EMNLP 2024 Findings | Link | Link |
| 2024 | SparseLLM: Towards Global Pruning for Pre-trained Language Models | NeurIPS 2024 | Link | Link |
| 2024 | ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | Shears: Unstructured Sparsity with Neural Low-rank Adapter Search | NAACL 2024 | Link | Link |
| 2025 | Wanda++: Pruning Large Language Models via Regional Gradients | ACL 2025 Findings | Link | Link |
| 2024 | Two Sparse Matrices are Better than One: Sparsifying Neural Networks with Double Sparse Factorization | ICLR 2025 | Link | Link |
| 2025 | Dynamic Low-Rank Sparse Adaptation for Large Language Models | ICLR 2025 | Link | Link |
| 2024 | Wasserstein Distances, Neuronal Entanglement, and Sparsity | ICLR 2025 | Link | Link |
| 2025 | Targeted Low-rank Refinement: Enhancing Sparse Language Models with Precision | ICML 2025 | Link | N/A |
| 2025 | An Efficient Pruner for Large Language Model with Theoretical Guarantee | ICML 2025 | Link | N/A |
| 2025 | DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration | NeurIPS 2025 | Link | Link |
| 2025 | Multi-Objective One-Shot Pruning for Large Language Models | NeurIPS 2025 | Link | N/A |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity | ICML 2024 | Link | Link |
| 2024 | ALS: Adaptive Layer Sparsity for Large Language Models via Activation Correlation Assessment | NeurIPS 2024 | Link | Link |
| 2024 | Discovering Sparsity Allocation for Layer-wise Pruning of Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | EvoPress: Accurate Dynamic Model Compression via Evolutionary Search | ICML 2025 | Link | Link |
| 2025 | Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective | ICML 2025 | Link | Link |
| 2025 | DLP: Dynamic Layerwise Pruning in Large Language Models | ICML 2025 | Link | Link |
| 2025 | Lua-LLM: Learning Unstructured-Sparsity Allocation for Large Language Models | NeurIPS 2025 | Link | N/A |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition | ICLR 2025 | Link | Link |
| 2025 | Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models | ICML 2025 | Link | Link |
| 2025 | 1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models | EMNLP 2025 | Link | Link |
| 2025 | 3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs | NeurIPS 2025 | Link | Link |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | On the Impact of Calibration Data in Post-training Quantization and Pruning | ACL 2024 | Link | Link |
| 2024 | Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning | EMNLP 2024 | Link | Link |
| 2024 | Beware of Calibration Data for Pruning Large Language Models | ICLR 2025 | Link | Link |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Compressing LLMs: The Truth is Rarely Pure and Never Simple | ICLR 2024 | Link | Link |
| 2025 | Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression | ICML 2025 | Link | Link |
| 2025 | Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs | EMNLP 2025 Findings | Link | N/A |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | WRP: Weight Recover Prune for Structured Sparsity | ACL 2024 | Link | Link |
| 2024 | Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models | ICLR 2024 | Link | Link |
| 2024 | Pruning Large Language Models with Semi-Structural Adaptive Sparse Training | AAAI 2025 | Link | Link |
| 2024 | MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models | NeurIPS 2024 | Link | Link |
| 2025 | ProxSparse: Regularized Learning of Semi-Structured Sparsity Masks for Pretrained LLMs | ICML 2025 | Link | Link |
| 2025 | PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models | NeurIPS 2025 | Link | Link |
| 2025 | TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks | NeurIPS 2025 | Link | Link |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | LLM-Pruner: On the Structural Pruning of Large Language Models | NeurIPS 2023 | Link | Link |
| 2023 | Fluctuation-based Adaptive Structured Pruning for Large Language Models | AAAI 2024 | Link | Link |
| 2023 | Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | ICLR 2024 | Link | Link |
| 2024 | BlockPruner: Fine-grained Pruning for Large Language Models | ACL 2025 Findings | Link | Link |
| 2024 | Structured Optimal Brain Pruning for Large Language Models | EMNLP 2024 | Link | N/A |
| 2024 | Search for Efficient Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | SlimGPT: Layer-wise Structured Pruning for Large Language Models | NeurIPS 2024 | Link | N/A |
| 2024 | Compact Language Models via Pruning and Knowledge Distillation | NeurIPS 2024 | Link | Link |
| 2024 | DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models | NeurIPS 2024 | Link | Link |
| 2025 | Tyr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization | NeurIPS 2025 | Link | Link |
| 2025 | Olica: Efficient Structured Pruning of Large Language Models without Retraining | ICML 2025 | Link | Link |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | Shortened LLaMA: A Simple Depth Pruning for Large Language Models | ICLR 2024 Workshop | Link | Link |
| 2024 | LaCo: Large Language Model Pruning via Layer Collapse | EMNLP 2024 Findings | Link | Link |
| 2024 | ShortGPT: Layers in Large Language Models are More Redundant Than You Expect | ACL 2025 Findings | Link | Link |
| 2024 | Streamlining Redundant Layers to Compress Large Language Models | ICLR 2025 | Link | Link |
| 2024 | SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks | ICML 2024 | Link | Link |
| 2024 | Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging | EMNLP 2024 | Link | N/A |
| 2024 | TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs | ACL 2025 | Link | Link |
| 2025 | A Simple Linear Patch Revives Layer-Pruned Large Language Models | NeurIPS 2025 | Link | Link |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning | ACL 2024 Findings | Link | Link |
| 2024 | SliceGPT: Compress Large Language Models by Deleting Rows and Columns | ICLR 2024 | Link | Link |
| 2024 | APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference | ICML 2024 | Link | Link |
| 2024 | Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations | ACL 2024 Findings | Link | Link |
| 2024 | LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models | ICML 2024 | Link | Link |
| 2024 | Pruning as a Domain-specific LLM Extractor | NAACL 2024 Findings | Link | Link |
| 2024 | Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient | ACL 2025 | Link | Link |
| 2025 | One-for-All Pruning: A Universal Model for Customized Compression of Large Language Models | ACL 2025 Findings | Link | N/A |
| 2024 | RankAdaptor: Hierarchical Rank Allocation for Efficient Fine-Tuning Pruned LLMs via Performance Model | NAACL 2024 Findings | Link | N/A |
| 2024 | Finding Transformer Circuits with Edge Pruning | NeurIPS 2024 | Link | Link |
| 2024 | MoDeGPT: Modular Decomposition for Large Language Model Compression | ICLR 2025 | Link | Link |
| 2024 | The Unreasonable Ineffectiveness of the Deeper Layers | ICLR 2025 | Link | N/A |
| 2024 | PAT: Pruning-Aware Tuning for Large Language Models | AAAI 2025 | Link | Link |
| 2024 | Change Is the Only Constant: Dynamic LLM Slicing based on Layer Redundancy | EMNLP 2024 Findings | Link | Link |
| 2024 | LEMON: Reviving Stronger and Smaller LMs from Larger LMs with Linear Parameter Fusion | ACL 2024 | Link | N/A |
| 2024 | DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization | ACL 2025 | Link | Link |
| 2025 | You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning | ICLR 2025 | Link | Link |
| 2025 | LLaMaFlex: Many-in-one LLMs via Generalized Pruning and Weight Sharing | ICLR 2025 | Link | N/A |
| 2025 | Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing | ICLR 2025 | Link | Link |
| 2025 | Instruction-Following Pruning for Large Language Models | ICML 2025 | Link | N/A |
| 2025 | Let LLM Tell What to Prune and How Much to Prune | ICML 2025 | Link | Link |
| 2025 | Prompt-based Depth Pruning of Large Language Models | ICML 2025 | Link | Link |
| 2025 | IG-Pruning: Input-Guided Block Pruning for Large Language Models | EMNLP 2025 | Link | Link |
| 2025 | PIP: Perturbation-based Iterative Pruning for Large Language Models | EMNLP 2025 Findings | Link | N/A |
| 2025 | ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization | NeurIPS 2025 | Link | Link |
| 2025 | Restoring Pruned Large Language Models via Lost Component Compensation | NeurIPS 2025 | Link | Link |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML 2024 | Link | Link |
| 2023 | ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models | ICLR 2024 | Link | N/A |
| 2024 | CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models | COLM 2024 | Link | Link |
| 2024 | ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models | EMNLP 2024 | Link | Link |
| 2024 | Training-Free Activation Sparsity in Large Language Models | ICLR 2025 | Link | Link |
| 2024 | Sparsing Law: Towards Large Language Models with Greater Activation Sparsity | ICML 2025 | Link | Link |
| 2025 | La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation | ICML 2025 | Link | N/A |
| 2025 | R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference | ICLR 2025 | Link | Link |
| 2024 | Sirius: Contextual Sparsity with Correction for Efficient LLMs | NeurIPS 2024 | Link | Link |
| 2024 | Learn To be Efficient: Build Structured Sparsity in Large Language Models | NeurIPS 2024 | Link | Link |
| 2025 | Weight-Aware Activation Sparsity with Constrained Bayesian Optimization Scheduling for Large Language Models | EMNLP 2025 | Link | Link |
| 2025 | Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity | NeurIPS 2025 | Link | Link |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models | EMNLP 2024 Findings | Link | Link |
| 2024 | Effective Interplay between Sparsity and Quantization: From Theory to Practice | ICLR 2025 | Link | Link |
| 2024 | Compressing large language models by joint sparsification and quantization | ICML 2024 | Link | Link |
| 2024 | SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression | ICML 2025 | Link | Link |
| 2025 | Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs | arXiv 2025 | Link | Link |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2025 | Random Conditioning with Distillation for Data-Efficient Diffusion Model Compression | CVPR 2025 | Link | Link |
| 2025 | LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation | ICLR 2025 | Link | Link |
| 2025 | Pre-training Distillation for Large Language Models: A Design Space Exploration | ACL 2025 | Link | N/A |
| 2025 | TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models | ICLR 2025 | Link | N/A |
| 2025 | Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling | ICLR 2025 | Link | N/A |
| 2025 | Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation | CVPR 2025 | Link | N/A |
| 2025 | Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models | AAAI 2025 | Link | N/A |
| 2025 | Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models | COLING 2025 | Link | N/A |
| 2025 | Lillama: Large Language Models Compression via Low-Rank Feature Distillation | NAACL 2025 | Link | Link |
| 2025 | Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs | TMLR 2025 | Link | Link |
| 2024 | MiniLLM: Knowledge Distillation of Large Language Models | ICLR 2024 | Link | Link |
| 2024 | On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes | ICLR 2024 | Link | N/A |
| 2024 | DistiLLM: Towards Streamlined Distillation for Large Language Models | ICML 2024 | Link | Link |
| 2024 | DDK: Distilling Domain Knowledge for Efficient Large Language Models | NeurIPS 2024 | Link | N/A |
| 2024 | Adversarial Moment-Matching Distillation of Large Language Models | NeurIPS 2024 | Link | N/A |
| 2024 | PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning | EMNLP 2024 Findings | Link | Link |
| 2024 | Dual-Space Knowledge Distillation for Large Language Models | EMNLP 2024 | Link | Link |
| 2024 | ELAD: Explanation-Guided Large Language Models Active Distillation | ACL 2024 Findings | Link | N/A |
| 2024 | Improve Student's Reasoning Generalizability through Cascading Decomposed CoTs Distillation | EMNLP 2024 | Link | N/A |
| 2024 | CLIP-KD: An Empirical Study of CLIP Model Distillation | CVPR 2024 | Link | Link |
| 2024 | Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data | AAAI 2024 | Link | Link |
| 2024 | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | EACL 2024 | Link | Link |
| 2024 | Aligning Large and Small Language Models via Chain-of-Thought Reasoning | EACL 2024 | Link | Link |
| 2024 | Weight-Inherited Distillation for Task-Agnostic BERT Compression | NAACL 2024 | Link | Link |
| 2024 | Knowledge Fusion of Large Language Models | ICLR 2024 | Link | Link |
| 2024 | OpenChat: Advancing Open-source Language Models with Mixed-Quality Data | ICLR 2024 | Link | Link |
| 2024 | Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models | ICML 2024 | Link | Link |
| 2023 | AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression | ACL 2023 | Link | Link |
| 2023 | DiffKD: Diffusion-based Knowledge Distillation for Large Language Models | NeurIPS 2023 | Link | Link |
| 2023 | SCOTT: Self-Consistent Chain-of-Thought Distillation | ACL 2023 | Link | Link |
| 2023 | Distilling Script Knowledge from Large Language Models for Constrained Language Planning | ACL 2023 | Link | Link |
| 2023 | DOT: A Distillation-Oriented Trainer | ICCV 2023 | Link | Link |
| 2023 | Specializing Smaller Language Models towards Multi-Step Reasoning | ICML 2023 | Link | Link |
| 2023 | DISCO: Distilling Counterfactuals with Large Language Models | ACL 2023 | Link | Link |
| 2023 | Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind | NeurIPS 2023 | Link | Link |
| 2023 | PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation | EMNLP 2023 | Link | Link |
| 2023 | Democratizing Reasoning Ability: Tailored Learning from Large Language Model | EMNLP 2023 | Link | Link |
| 2023 | GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model | ACL 2023 | Link | Link |
| 2023 | Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes | ACL 2023 | Link | Link |
| 2023 | Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models | EMNLP 2023 | Link | Link |
| 2023 | f-Divergence Minimization for Sequence-Level Knowledge Distillation | ACL 2023 | Link | Link |
| 2023 | Symbolic Chain-of-Thought Distillation: Small Models Can Also Think Step-by-Step | ACL 2023 | Link | N/A |
| 2023 | Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks | NeurIPS 2023 | Link | Link |
| 2023 | Personalised Distillation: Empowering Open-Sourced LLMs with Adaptive Learning for Code Generation | EMNLP 2023 | Link | Link |
| 2023 | Lion: Adversarial Distillation of Closed-Source Large Language Model | EMNLP 2023 | Link | Link |
| 2023 | InheritSumm: A General, Versatile and Compact Summarizer by Distilling from GPT | EMNLP 2023 | Link | N/A |
| 2023 | Aligning Large Language Models through Synthetic Feedback | EMNLP 2023 | Link | Link |
| 2023 | MCC-KD: Multi-CoT Consistent Knowledge Distillation | EMNLP 2023 Findings | Link | N/A |
| 2023 | Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization | ICLR 2023 | Link | Link |
| 2023 | Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data | EMNLP 2023 | Link | Link |
| 2022 | TinyViT: Fast Pretraining Distillation for Small Vision Transformers | ECCV 2022 | Link | Link |
| 2022 | DIST: Distilling Large Language Models with Small-Scale Data | NeurIPS 2022 | Link | Link |
| 2022 | Decoupled Knowledge Distillation | CVPR 2022 | Link | Link |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2025 | Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models | ICML 2025 | Link | Link |
| 2025 | Dobi-SVD: Differentiable SVD for LLM Compression and Some New Perspectives | ICLR 2025 | Link | Link |
| 2025 | SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs | ICLR 2025 | Link | Link |
| 2025 | Dynamic Low-Rank Sparse Adaptation for Large Language Models | ICLR 2025 | Link | Link |
| 2025 | MoE-SVD: Structured Mixture-of-Experts LLMs Compression via Singular Value Decomposition | ICML 2025 | Link | N/A |
| 2025 | Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition | ACL 2025 Findings | Link | N/A |
| 2025 | Delta Decompression for MoE-based LLMs Compression | PMLR 2025 | Link | N/A |
| 2025 | LittleBit: Ultra Low-Bit Quantization via Latent Factorization | NeurIPS 2025 | Link | N/A |
| 2024 | Compressing Large Language Models using Low Rank and Low Precision Decomposition | NeurIPS 2024 | Link | Link |
| 2024 | Unified Low-rank Compression Framework for Click-through Rate Prediction | KDD 2024 | Link | Link |
| 2024 | SliceGPT: Compress Large Language Models by Deleting Rows and Columns | ICLR 2024 | Link | Link |
| 2024 | Low-Rank Knowledge Decomposition for Medical Foundation Models | CVPR 2024 | Link | Link |
| 2024 | LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking | CVPR 2024 | Link | Link |
| 2024 | Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization | ISCA 2024 | Link | N/A |
| 2024 | LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning | ICLR 2024 | Link | N/A |
| 2024 | LQER: Low-Rank Quantization Error Reconstruction for LLMs | ICML 2024 | Link | N/A |
| 2024 | Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations | ACL 2024 Findings | Link | N/A |
| 2024 | Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization | ACL 2024 Findings | Link | N/A |
| 2024 | Surgical Feature-Space Decomposition of LLMs: Why, When and How? | ACL 2024 | Link | N/A |
| 2024 | DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models | EMNLP 2024 | Link | N/A |
| 2023 | LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation | ICML 2023 | Link | Link |
| 2022 | Compressible-composable NeRF via Rank-residual Decomposition | NeurIPS 2022 | Link | Link |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ICLR 2023 | Link | Link |
| 2023 | Hyena Hierarchy: Towards Larger Convolutional Language Models | ICML 2023 | Link | Link |
| 2023 | RWKV: Reinventing RNNs for the Transformer Era | EMNLP 2023 Findings | Link | Link |
| 2023 | Mamba: Linear-Time Sequence Modeling with Selective State Spaces | arXiv 2023 | Link | Link |
| 2023 | RetNet: Retentive Network: A Successor to Transformer for Large Language Models | arXiv 2023 | Link | Link |
| 2023 | Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture | NeurIPS 2023 | Link | Link |
| 2024 | Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2) | ICML 2024 | Link | Link |
| 2024 | Gated Linear Attention Transformers with Hardware-Efficient Training | ICML 2024 | Link | Link |
| 2024 | Based: Simple Linear Attention Language Models Balance the Recall-Throughput Tradeoff | ICML 2024 | Link | Link |
| 2024 | Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models | arXiv 2024 | Link | N/A |
| 2024 | Jamba: A Hybrid Transformer-Mamba Language Model | arXiv 2024 | Link | N/A |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | EMNLP 2023 | Link | N/A |
| 2023 | HyperAttention: Long-context Attention in Near-Linear Time | NeurIPS 2023 | Link | N/A |
| 2024 | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | ICLR 2024 | Link | Link |
| 2024 | Ring Attention with Blockwise Transformers for Near-Infinite Context | ICLR 2024 | Link | Link |
| 2024 | FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision | NeurIPS 2024 | Link | Link |
| 2024 | ThunderKittens: Simple, Fast, and Adorable AI Kernels | NeurIPS 2024 | Link | Link |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models | ACL 2024 | Link | Link |
| 2024 | Mixtral of Experts | arXiv 2024 | Link | Link |
| 2025 | Ada-K Routing: Boosting the Efficiency of MoE-based LLMs | ICLR 2025 | Link | N/A |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | You Only Cache Once: Decoder-Decoder Architectures for Language Models | arXiv 2024 | Link | Link |
| 2024 | Scalable MatMul-free Language Modeling | arXiv 2024 | Link | Link |

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2025 | HASS: Learning Harmonized Representations for Speculative Sampling | ICLR 2025 | Link | Link |
| 2025 | PEARL: Parallel Speculative Decoding with Adaptive Draft Length | ICLR 2025 | Link | Link |
| 2025 | SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration | ICLR 2025 | Link | Link |
| 2025 | Pre-Training Curriculum for Multi-Token Prediction in Language Models | ACL 2025 | Link | Link |
| 2025 | Faster Speculative Decoding via Effective Draft Decoder with Pruned Candidate Tree | ACL 2025 | Link | Link |
| 2025 | SAM Decoding: Speculative Decoding via Suffix Automaton | ACL 2025 | Link | Link |
| 2025 | DReSD: Dense Retrieval for Speculative Decoding | ACL 2025 Findings | Link | N/A |
| 2025 | EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models | NAACL 2025 | Link | N/A |
| 2025 | Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding | NAACL 2025 Findings | Link | N/A |
| 2025 | SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths | COLM 2025 | Link | Link |
| 2024 | Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting | NeurIPS 2024 | Link | Link |
| 2024 | EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees | EMNLP 2024 | Link | Link |
| 2024 | Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads | ICML 2024 | Link | Link |
| 2024 | EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty | ICML 2024 | Link | Link |
| 2024 | Online Speculative Decoding | ICML 2024 | Link | N/A |
| 2024 | SpecExec: Massively Parallel Speculative Decoding For Interactive LLM Inference on Consumer Devices | NeurIPS 2024 | Link | Link |
| 2024 | Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding | NeurIPS 2024 | Link | Link |
| 2024 | Cascade Speculative Drafting for Even Faster LLM Inference | NeurIPS 2024 | Link | Link |
| 2024 | Accelerating Blockwise Parallel Language Models with Draft Refinement | NeurIPS 2024 | Link | N/A |
| 2024 | Graph-Structured Speculative Decoding | ACL 2024 Findings | Link | N/A |
| 2024 | Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism | ACL 2024 Findings | Link | N/A |
| 2024 | SLiM: Speculative Decoding with Hypothesis Reduction | NAACL 2024 Findings | Link | N/A |
| 2024 | SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification | ASPLOS 2024 | Link | N/A |
| 2024 | REST: Retrieval-Based Speculative Decoding | NAACL 2024 | Link | Link |
| 2024 | Lookahead Decoding: Break the Sequential Dependency of LLM Inference Using Lookahead Decoding | ICML 2024 | Link | Link |
| 2024 | LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding | ACL 2024 | Link | Link |
| 2024 | Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding | EMNLP 2024 | Link | Link |
| 2024 | CLLMs: Consistency Large Language Models | ICML 2024 | Link | Link |
| 2023 | Fast Inference from Transformers via Speculative Decoding | ICML 2023 | Link | N/A |
| 2023 | SpecTr: Fast Speculative Decoding via Optimal Transport | NeurIPS 2023 | Link | N/A |
| 2023 | Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation | EMNLP 2023 Findings | Link | Link |
| 2023 | Speculative Decoding with Big Little Decoder | NeurIPS 2023 | Link | Link |
| Year | Title | Venue | Paper | Code | Category |
|---|---|---|---|---|---|
| 2025 | InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation | | Link | Link | Token Eviction |
| 2025 | R-KV: Redundancy-aware KV Cache Compression for Reasoning Models | | Link | Link | Token Eviction |
| 2025 | SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator | ICML 2025 | Link | Link | Token Eviction |
| 2025 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | ICLR 2025 | Link | N/A | Token Eviction |
| 2025 | Squeezed Attention: Accelerating Long Context Length LLM Inference | ACL 2025 | Link | Link | Token Eviction |
| 2025 | LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models | ICML 2025 | Link | Link | Budget Allocation |
| 2025 | CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences | ICLR 2025 | Link | Link | Budget Allocation |
| 2025 | AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning | ICCV 2025 | Link | Link | Cache Merging |
| 2025 | MiniKV: Pushing the Limits of 2-Bit KV Cache via Compression and System Co-Design for Efficient Long Context Inference | ACL 2025 Findings | Link | N/A | Quantization |
| 2025 | Palu: KV-Cache Compression with Low-Rank Projection | ICLR 2025 | Link | Link | Low Rank Projection |
| 2025 | LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy | ICLR 2025 | Link | N/A | Low Rank Projection |
| 2025 | Preserving Large Activations: The Key to KV Cache Pruning | ICLR 2025 | Link | N/A | Token Eviction |
| 2025 | VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration | ICLR 2025 | Link | N/A | Budget Allocation |
| 2025 | Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning | ICLR 2025 | Link | Link | Token Eviction |
| 2025 | TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization | ACL 2025 Findings | Link | N/A | System/Offloading |
| 2025 | KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation | ACL 2025 Findings | Link | Link | System/Offloading |
| 2025 | KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding | ACL 2025 | Link | Link | Low Rank Projection |
| 2025 | DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs | EMNLP 2025 Findings | Link | N/A | Budget Allocation |
| 2024 | SnapKV: LLM Knows What You are Looking for Before Generation | NeurIPS 2024 | Link | Link | Token Eviction |
| 2024 | InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory | NeurIPS 2024 | Link | Link | Token Eviction |
| 2024 | Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | ICLR 2024 | Link | Link | Token Eviction |
| 2024 | Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | MLSys 2024 | Link | Link | Token Eviction |
| 2024 | Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | ICML 2024 | Link | Link | Token Eviction |
| 2024 | On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference | | Link | Link | Token Eviction |
| 2024 | PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | | Link | Link | Budget Allocation |
| 2024 | MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | NeurIPS 2024 | Link | Link | Cache Merging |
| 2024 | CaM: Cache Merging for Memory-efficient LLMs Inference | ICML 2024 | Link | Link | Cache Merging |
| 2024 | Compressed Context Memory For Online Language Model Interaction | ICLR 2024 | Link | Link | Cache Merging |
| 2024 | Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference | ICML 2024 | Link | Link | Cache Merging |
| 2024 | LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference | EMNLP 2024 Findings | Link | Link | Cache Merging |
| 2024 | CHAI: Clustered Head Attention for Efficient LLM Inference | ICML 2024 | Link | Link | Cache Merging |
| 2024 | D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models | ICLR 2025 | Link | Link | Cache Merging |
| 2024 | IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact | ACL 2024 | Link | Link | Quantization |
| 2024 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | ICML 2024 | Link | Link | Quantization |
| 2024 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | NeurIPS 2024 | Link | Link | Quantization |
| 2024 | SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | COLM 2024 | Link | Link | Quantization |
| 2024 | GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM | NeurIPS 2024 | Link | Link | Quantization |
| 2024 | NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time | ACL 2024 | Link | Link | Token Eviction |
| 2024 | SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation | ACL 2025 | Link | Link | Token Eviction |
| 2024 | AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations | ACL 2025 | Link | N/A | Quantization |
| 2024 | Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference | ICML 2024 | Link | Link | Cache Merging |
| 2024 | Layer-Condensed KV Cache for Efficient Inference of Large Language Models | ACL 2024 | Link | Link | Cache Sharing |
| 2024 | FINCH: Prompt-guided Key-Value Cache Compression for Large Language Models | TACL 2024 / EMNLP 2024 | Link | N/A | Token Eviction |
| 2024 | KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches | EMNLP 2024 Findings | Link | Link | Benchmark |
| 2024 | Eigen Attention: Attention in Low-Rank Space for KV Cache Compression | EMNLP 2024 Findings | Link | Link | Low Rank Projection |
| 2024 | A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression | EMNLP 2024 | Link | Link | Token Eviction |
| 2023 | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | NeurIPS 2023 | Link | Link | Token Eviction |
| 2023 | Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | NeurIPS 2023 | Link | Link | Token Eviction |
| 2023 | Efficient Streaming Language Models with Attention Sinks | ICLR 2024 | Link | Link | Token Eviction |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models | EMNLP 2023 | Link | Link |
| 2023 | Selective Context: Compressing Context to Enhance Inference Efficiency of Large Language Models | EMNLP 2023 | Link | Link |
| 2023 | Learning to Compress Prompts with Gist Tokens | NeurIPS 2023 | Link | Link |
| 2023 | Adapting Language Models to Compress Contexts (AutoCompressors) | EMNLP 2023 | Link | Link |
| 2023 | Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers | NeurIPS 2023 | Link | N/A |
| 2024 | In-context Autoencoder for Context Compression in a Large Language Model (ICAE) | ICLR 2024 | Link | Link |
| 2024 | RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation | ICLR 2024 | Link | N/A |
| 2024 | LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression | ACL 2024 | Link | Link |
| 2024 | LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression | ACL 2024 Findings | Link | Link |
| 2024 | Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon | ICML 2024 | Link | Link |
| 2024 | xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token | NeurIPS 2024 | Link | N/A |
| 2025 | 500xCompressor: Generalized Prompt Compression for Large Language Models | AAAI 2025 | Link | N/A |
| 2024 | Hierarchical and Dynamic Prompt Compression for Efficient Zero-shot API Usage | EACL 2024 Findings | Link | N/A |
| 2024 | Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles | EMNLP 2024 Findings | Link | N/A |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2022 | CALM: Confident Adaptive Language Modeling | NeurIPS 2022 | Link | N/A |
| 2023 | FREE: Fast and Robust Early Exiting Framework for Autoregressive Language Models | EMNLP 2023 | Link | N/A |
| 2023 | SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference | arXiv 2023 | Link | N/A |
| 2023 | Speculative Decoding with Big Little Decoder (BiLD) | NeurIPS 2023 | Link | Link |
| 2024 | ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference | AAAI 2024 | Link | N/A |
| 2024 | EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism | ICML 2024 | Link | Link |
| 2024 | LayerSkip: Enabling Early-Exit Inference and Self-Speculative Decoding | ACL 2024 | Link | Link |
| 2024 | Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting | NeurIPS 2024 | Link | Link |
| 2024 | Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding | ACL 2024 | Link | N/A |
| 2024 | Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning | ICLR 2024 | Link | N/A |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | AdaTape: Foundation Model with Adaptive Computation via Elastic Input Sequence | ICML 2023 | Link | N/A |
| 2023 | CoLT5: Faster Long-Range Transformers with Conditional Computation | EMNLP 2023 | Link | N/A |
| 2024 | Mixture of Depths: Dynamically Allocating Compute in Transformer-Based Language Models | ICML 2024 | Link | N/A |
| 2024 | Think before you speak: Training Language Models With Pause Tokens | ICLR 2024 | Link | N/A |
| 2024 | MatFormer: Nested Transformer for Elastic Inference | ICLR 2024 | Link | Link |
| 2024 | FLEXTRON: Many-in-One Flexible Large Language Model | ICML 2024 | Link | N/A |
| 2024 | LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference | arXiv 2024 | Link | N/A |
| 2024 | PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | SOSP 2024 | Link | Link |
| 2024 | D-LLM: A Token Adaptive Computing Resource Allocation Strategy for Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models | NeurIPS 2024 Workshop | Link | N/A |
| 2024 | RouteLLM: Learning to Route LLMs with Preference Data | arXiv 2024 | Link | Link |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning | ICLR 2023 | Link | Link |
| 2023 | QLoRA: Efficient Finetuning of Quantized LLMs | NeurIPS 2023 | Link | Link |
| 2023 | LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models | EMNLP 2023 | Link | Link |
| 2023 | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | NeurIPS 2023 | Link | Link |
| 2023 | DyLoRA: Parameter-Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation | EACL 2023 | Link | N/A |
| 2024 | LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models | ICLR 2024 | Link | Link |
| 2024 | VeRA: Vector-based Random Matrix Adaptation | ICLR 2024 | Link | N/A |
| 2024 | LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models | ICLR 2024 | Link | Link |
| 2024 | NOLA: Compressing LoRA using Linear Combination of Random Basis | ICLR 2024 | Link | N/A |
| 2024 | DoRA: Weight-Decomposed Low-Rank Adaptation | ICML 2024 | Link | Link |
| 2024 | LoRA+: Efficient Low-Rank Adaptation of Large Models | ICML 2024 | Link | Link |
| 2024 | RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation | ICML 2024 | Link | Link |
| 2024 | PiSSA: Principal Singular Values and Singular Vectors Adaptation of LLMs | NeurIPS 2024 | Link | Link |
| 2024 | LoRAHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition | ACL 2024 | Link | Link |
| 2024 | LoRA Learns Less and Forgets Less | ICML 2024 | Link | N/A |
| 2024 | ReFT: Representation Finetuning for Language Models | NeurIPS 2024 | Link | Link |
| 2024 | S²FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity | NeurIPS 2024 | Link | N/A |
| 2024 | CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning | NeurIPS 2024 | Link | N/A |
| 2024 | HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning | NeurIPS 2024 | Link | Link |
| 2024 | Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models | ICML 2024 | Link | N/A |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | QLoRA: Efficient Finetuning of Quantized LLMs | NeurIPS 2023 | Link | Link |
| 2023 | PEQA: Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization | NeurIPS 2023 | Link | N/A |
| 2024 | QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models | ICLR 2024 | Link | Link |
| 2024 | LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models | ICLR 2024 | Link | Link |
| 2024 | LQ-LoRA: Low-Rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning | ICLR 2024 | Link | Link |
| 2024 | IR-QLoRA: Accurate LoRA-Finetuning Quantization of LLMs via Information Retention | ICML 2024 | Link | Link |
| 2024 | BitDelta: Your Fine-Tune May Only Be Worth One Bit | ICML 2024 | Link | Link |
| 2024 | EfficientQAT: Efficient Quantization-Aware Training for Large Language Models | ACL 2025 | Link | Link |
| 2024 | AQLM: Extreme Compression of Large Language Models via Additive Quantization | ICML 2024 | Link | Link |
| 2024 | The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (BitNet b1.58) | arXiv 2024 | Link | N/A |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions | NeurIPS 2023 | Link | N/A |
| 2024 | GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection | ICML 2024 | Link | Link |
| 2024 | Flora: Low-Rank Adapters Are Secretly Gradient Compressors | ICML 2024 | Link | Link |
| 2024 | ReLoRA: High-Rank Training Through Low-Rank Updates | ICLR 2024 | Link | Link |
| 2024 | Full Parameter Fine-Tuning for Large Language Models with Limited Resources (LOMO) | ACL 2024 | Link | Link |
| 2024 | AdaLomo: Low-memory Optimization with Adaptive Learning Rate | ACL 2024 Findings | Link | Link |
| 2024 | SLTrain: A Sparse Plus Low-Rank Approach for Parameter and Memory Efficient Pretraining | NeurIPS 2024 | Link | N/A |
| 2024 | Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients | arXiv 2024 | Link | N/A |
| 2025 | Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? | ICML 2025 | Link | N/A |
| 2024 | Memory-Efficient LLM Training with Online Subspace Descent | NeurIPS 2024 | Link | N/A |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Reducing Activation Recomputation in Large Transformer Models | MLSys 2023 | Link | Link |
| 2023 | CAME: Confidence-guided Adaptive Memory Efficient Optimization | ACL 2023 | Link | N/A |
| 2023 | Training Transformers with 4-bit Integers | NeurIPS 2023 | Link | Link |
| 2023 | MeZO: Fine-Tuning Language Models with Just Forward Passes | NeurIPS 2023 | Link | Link |
| 2024 | Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | ICLR 2024 | Link | Link |
| 2024 | ZeRO++: Extremely Efficient Collective Communication for Giant Model Training | ICML 2024 | Link | Link |
| 2024 | Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization | ICML 2024 | Link | Link |
| 2024 | Adam-mini: Use Fewer Learning Rates, To Gain More | NeurIPS 2024 | Link | Link |
| 2024 | LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning | NeurIPS 2024 | Link | N/A |
| 2024 | VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections | NeurIPS 2024 | Link | N/A |
| 2024 | ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models | ICML 2024 | Link | N/A |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM) | SOSP 2023 | Link | Link |
| 2023 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | ICML 2023 | Link | Link |
| 2023 | AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving | OSDI 2023 | Link | N/A |
| 2024 | SGLang: Efficient Execution of Structured Language Model Programs | NeurIPS 2024 | Link | Link |
| 2024 | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | OSDI 2024 | Link | N/A |
| 2024 | Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | OSDI 2024 | Link | N/A |
| 2024 | S-LoRA: Serving Thousands of Concurrent LoRA Adapters | MLSys 2024 | Link | Link |
| 2024 | SpecInfer: Accelerating Generative LLM Serving with Tree-based Speculative Inference and Verification | ASPLOS 2024 | Link | Link |
| 2024 | SpotServe: Serving Generative Large Language Models on Preemptible Instances | ASPLOS 2024 | Link | N/A |
| 2024 | Splitwise: Efficient Generative LLM Inference Using Phase Splitting | ISCA 2024 | Link | N/A |
| 2024 | MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving | arXiv 2024 | Link | Link |
| 2024 | ServerlessLLM: Low-Latency Serverless Inference for Large Language Models | OSDI 2024 | Link | Link |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Efficiently Scaling Transformer Inference | MLSys 2023 | Link | N/A |
| 2024 | Llumnix: Dynamic Scheduling for Large Language Model Serving | OSDI 2024 | Link | Link |
| 2024 | Sarathi-Serve: Taming Throughput-Latency Tradeoff in LLM Inference with Chunked Prefills | OSDI 2024 | Link | N/A |
| 2024 | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving | OSDI 2024 | Link | N/A |
| 2024 | Splitwise: Efficient Generative LLM Inference Using Phase Splitting | ISCA 2024 | Link | N/A |
| 2024 | Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction | arXiv 2024 | Link | N/A |
| 2024 | Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving | arXiv 2024 | Link | N/A |
| 2024 | Efficient LLM Scheduling by Learning to Rank | arXiv 2024 | Link | N/A |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks | ASPLOS 2023 | Link | N/A |
| 2023 | HyperAttention: Long-context Attention in Near-Linear Time | ICLR 2024 | Link | N/A |
| 2024 | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | ICLR 2024 | Link | Link |
| 2024 | FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision | NeurIPS 2024 | Link | Link |
| 2024 | ThunderKittens: Simple, Fast, and Adorable AI Kernels | NeurIPS 2024 | Link | Link |
| 2024 | Ring Attention with Blockwise Transformers for Near-Infinite Context | ICLR 2024 | Link | Link |
| 2025 | FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving | MLSys 2025 | Link | Link |
| 2024 | NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention | NeurIPS 2024 | Link | N/A |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs | ASPLOS 2023 | Link | Link |
| 2023 | TensorIR: An Abstraction for Automatic Tensorized Program Optimization | ASPLOS 2023 | Link | Link |
| 2023 | Welder: Scheduling Deep Learning Memory Access via Tile-level Fusion | OSDI 2023 | Link | N/A |
| 2023 | OLLA: Optimizing the Lifetime and Location of Arrays to Reduce Memory Usage of Neural Networks | MLSys 2023 | Link | N/A |
| 2024 | Ladder: Enabling Efficient Low-Bit Quantization and Inference with Compiler Co-Design | OSDI 2024 | Link | N/A |
| 2024 | An LLM Compiler for Parallel Function Calling | ICML 2024 | Link | N/A |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Efficiently Scaling Transformer Inference | MLSys 2023 | Link | N/A |
| 2024 | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | MLSys 2024 | Link | Link |
| 2024 | Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | MLSys 2024 | Link | Link |
| 2024 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | MLSys 2025 | Link | Link |
| 2024 | MLC-LLM: Universal LLM Deployment on Consumer Devices with ML Compilation | MLSys 2024 | Link | Link |
| 2024 | HexGen: Generative Inference of Large Language Model over Heterogeneous Environment | ICML 2024 | Link | Link |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding | EMNLP 2023 | Link | Link |
| 2024 | Lost in the Middle: How Language Models Use Long Contexts | TACL 2024 | Link | N/A |
| 2024 | LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding | ACL 2024 | Link | Link |
| 2024 | L-Eval: Instituting Standardized Evaluation for Long Context Language Models | ACL 2024 | Link | Link |
| 2024 | RULER: What's the Real Context Size of Your Long-Context Language Models? | COLM 2024 | Link | Link |
| 2024 | InfiniteBench: Extending Long Context Evaluation Beyond 100K Tokens | ACL 2024 | Link | Link |
| 2024 | Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of LLMs | ACL 2024 | Link | N/A |
| 2024 | BABILong: Testing the Limits of LLMs with Long Context Reasoning Benchmarks | NeurIPS 2024 | Link | N/A |
| 2024 | M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models | ACL 2024 | Link | N/A |
| 2024 | Ada-LEval: Evaluating Long-context LLMs with Length-adaptable Benchmarks | NAACL 2024 | Link | N/A |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Compressing LLMs: The Truth is Rarely Pure and Never Simple | ICLR 2024 | Link | Link |
| 2024 | ShortGPT: Layers in Large Language Models are More Redundant Than You Expect | ACL 2025 Findings | Link | N/A |
| 2024 | The Unreasonable Ineffectiveness of the Deeper Layers | ICLR 2025 | Link | N/A |
| 2024 | LASER: Layer-Selective Rank Reduction for Improving Reasoning | ICML 2024 | Link | N/A |
| 2025 | Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression | ICML 2025 | Link | Link |
| 2025 | Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs | EMNLP 2025 Findings | Link | N/A |
| 2025 | Can Pruning Improve Reasoning? Revisiting Long-CoT Compression with Capability in Mind for Better Reasoning | arXiv 2025 | Link | N/A |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Safety Alignment Should Be Made More Manageable | NeurIPS 2023 | Link | N/A |
| 2024 | Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To | ICLR 2024 | Link | N/A |
| 2024 | Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | ICLR 2024 | Link | N/A |
| 2024 | Playing It Safe: Defending Against Backdoors with Activation Clustering in Quantized LLMs | AAAI 2024 | Link | N/A |
| 2024 | Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning | NeurIPS 2024 | Link | N/A |
| 2024 | Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | NeurIPS 2024 | Link | N/A |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | Honeybee: Locality-enhanced Projector for Multimodal LLM | CVPR 2024 | Link | Link |
| 2024 | MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices | CVPR 2024 | Link | Link |
| 2024 | TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones | EMNLP 2024 | Link | Link |
| 2024 | LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model | AAAI 2024 | Link | N/A |
| 2024 | FastV: An Image is Worth 1/2 Tokens After Layer 2 | ECCV 2024 | Link | Link |
| 2024 | Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | NeurIPS 2024 | Link | Link |
| 2024 | TokenPacker: Efficient Visual Projector for Multimodal LLM | NeurIPS 2024 | Link | N/A |
| 2024 | Matryoshka Multimodal Models | NeurIPS 2024 | Link | N/A |
| 2025 | LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation | ICLR 2025 | Link | Link |
| 2025 | VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow Guidance | ICCV 2025 | Link | N/A |
| 2024 | MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer | CVPR 2024 | Link | Link |
| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases | ICML 2024 | Link | N/A |
| 2024 | LLM in a flash: Efficient Large Language Model Inference with Limited Memory | ACL 2024 | Link | N/A |
| 2024 | PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | SOSP 2024 | Link | Link |
| 2024 | EdgeMoE: Fast On-Device Inference of Mixture-of-Experts Based Large Language Models | MLSys 2024 | Link | N/A |
| 2024 | LLMCad: Fast and Scalable On-device Large Language Model Inference | MLSys 2024 | Link | N/A |
| 2024 | Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs | ICML 2024 | Link | N/A |
| 2024 | MLC-LLM: Universal LLM Deployment on Consumer Devices with ML Compilation | MLSys 2024 | Link | Link |
| 2024 | MobileQuant: Mobile-friendly Quantization for On-device Language Models | EMNLP 2024 Findings | Link | N/A |
| 2024 | GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment | ACL 2024 Findings | Link | N/A |
| 2024 | PowerInfer-2: Fast Large Language Model Inference on a Smartphone | arXiv 2024 | Link | Link |
We welcome contributions from the community! If you find any missing papers or errors, please feel free to:
- Open an Issue to report errors or suggest papers
- Submit a Pull Request to add new papers
- Star this repository if you find it helpful
