MAC-AutoML/Awesome-Efficient-Large-Models

Awesome-Efficient-Large-Models

Awesome License: MIT Last Commit Papers PRs Welcome

A list of awesome papers on compression and acceleration of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs).

Continuously updated. Welcome to star and watch!

Paper Collection | Contributing


News

  • [2026.05] We have open-sourced a comprehensive, continuously updated taxonomy of 400+ papers covering model compression, inference acceleration, and system co-design for efficient large models.

Taxonomy


Table of Contents

Model-side Compression
Inference-side Acceleration
Training and Fine-tuning Efficiency
System and Hardware Co-design
Evaluation and Applications

Model-side Compression

Quantization

LLM Quantization

Year Title Venue Paper Code
2023 GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers ICLR 2023 Link Link
2025 OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting ICLR 2025 Link Link
2025 SpinQuant: LLM quantization with learned rotations ICLR 2025 Link Link
2022 SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ICML 2023 Link Link
2023 AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration MLSys 2024 Link Link
2024 QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks ICML 2024 Link Link
2025 QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving MLSys 2025 Link Link
2024 QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs NeurIPS 2024 Link Link
2024 Atom: Low-bit Quantization for Efficient and Accurate LLM Serving MLSys 2024 Link Link
2024 OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models ICLR 2024 Link Link
2023 QuIP: 2-Bit Quantization of Large Language Models With Guarantees NeurIPS 2023 Link Link
2022 LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale NeurIPS 2022 Link Link
2023 Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling EMNLP 2023 Link Link
2025 GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration ICML 2025 Link Link
2024 MagR: Weight Magnitude Reduction for Enhancing Post-Training Quantization NeurIPS 2024 Link Link
2024 AffineQuant: Affine Transformation Quantization for Large Language Models ICLR 2024 Link Link
2024 LLM-QAT: Data-Free Quantization Aware Training for Large Language Models ACL 2024 Link Link
2024 BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation ACL 2024 Link Link
2023 OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models AAAI 2024 (Oral) Link Link
2024 SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression ICLR 2024 Link Link
2022 ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers NeurIPS 2022 Link Link
2024 LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models ICLR 2024 Link Link
2024 OneBit: Towards Extremely Low-bit Large Language Models NeurIPS 2024 Link Link
2023 LLM-FP4: 4-bit Floating-Point Quantized Transformers EMNLP 2023 Link Link
2024 FlatQuant: Flatness Matters for LLM Quantization ICML 2025 Link Link
2024 SqueezeLLM: Dense-and-Sparse Quantization ICML 2024 Link Link
2023 RPTQ: Reorder-based Post-training Quantization for Large Language Models Link Link
2024 QQQ: Quality Quattuor-Bit Quantization for Large Language Models ICLR Link Link
2024 Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs Link Link
2025 CBQ: Cross-Block Quantization for Large Language Models ICLR 2025 Link N/A
2025 MambaQuant: Quantizing the Mamba Family with Variance Aligned Rotation Methods ICLR 2025 Link N/A
2025 SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators ICLR 2025 Link N/A
2025 Progressive Mixed-Precision Decoding for Efficient LLM Inference ICLR 2025 Link N/A
2025 Surprising Effectiveness of Pretraining Ternary Language Models at Scale ICLR 2025 Link N/A
2025 EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ACL 2025 Link Link
2025 MiniKV: Pushing the Limits of 2-Bit KV Cache via Compression and System Co-Design for Efficient Long Context Inference ACL 2025 Findings Link N/A
2025 AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations ACL 2025 Link N/A
2025 Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition ACL 2025 Findings Link N/A
2025 LittleBit: Ultra Low-Bit Quantization via Latent Factorization NeurIPS 2025 Link Link
2024 Extreme Compression of Large Language Models via Additive Quantization ICML 2024 Link Link
2024 BiLLM: Pushing the Limit of Post-Training Quantization for LLMs ICML 2024 Link Link
2024 LQER: Low-Rank Quantization Error Reconstruction for LLMs ICML 2024 Link N/A
2024 Evaluating Quantized Large Language Models ICML 2024 Link Link
2024 QMoE: Sub-1-Bit Compression of Trillion Parameter Models MLSys 2024 Link Link
2024 DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs NeurIPS 2024 Link Link
2024 QBB: Quantization with Binary Bases for LLMs NeurIPS 2024 Link N/A
2024 Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models NeurIPS 2024 Link N/A
2024 VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models EMNLP 2024 Link Link

VLM Quantization

Year Title Venue Paper Code
2024 Q-VLM: Post-training Quantization for Large Vision Language Models NeurIPS 2024 Link Link
2025 MBQ: Modality-Balanced Quantization for Large Vision-Language Models CVPR 2025 Link Link
2025 MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization ACM MM 2025 Link Link
2025 CASP: Compression of Large Multimodal Models Based on Attention Sparsity CVPR 2025 Link Link
2024 Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation ACM MM 2024 Link N/A

Pruning / Sparsity

Unstructured Pruning

Pruning without Weight Update

Year Title Venue Paper Code
2023 A Simple and Effective Pruning Approach for Large Language Models ICLR 2024 Link Link
2024 BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation ICLR 2024 Link Link
2024 COPAL: Continual Pruning in Large Language Generative Models ICML 2024 Link N/A
2024 Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models ICML 2024 Link Link
2025 BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation ICML 2025 Link N/A
2025 SAFE: Finding Sparse and Flat Minima to Improve Pruning ICML 2025 Link Link
2025 SwiftPrune: Hessian-Free Weight Pruning for Large Language Models EMNLP 2025 Findings Link N/A

Pruning with Weight Update

Year Title Venue Paper Code
2023 SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot ICML 2023 Link Link
2023 Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs ICLR 2024 Link Link
2023 The LLM Surgeon ICLR 2024 Link Link
2024 Fast and Optimal Weight Update for Pruned Large Language Models TMLR 2024 Link Link
2024 Pruning Foundation Models for High Accuracy without Retraining EMNLP 2024 Findings Link Link
2024 SparseLLM: Towards Global Pruning for Pre-trained Language Models NeurIPS 2024 Link Link
2024 ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models NeurIPS 2024 Link Link
2024 Shears: Unstructured Sparsity with Neural Low-rank Adapter Search NAACL 2024 Link Link
2025 Wanda++: Pruning Large Language Models via Regional Gradients ACL 2025 Findings Link Link
2024 Two Sparse Matrices are Better than One: Sparsifying Neural Networks with Double Sparse Factorization ICLR 2025 Link Link
2025 Dynamic Low-Rank Sparse Adaptation for Large Language Models ICLR 2025 Link Link
2024 Wasserstein Distances, Neuronal Entanglement, and Sparsity ICLR 2025 Link Link
2025 Targeted Low-rank Refinement: Enhancing Sparse Language Models with Precision ICML 2025 Link N/A
2025 An Efficient Pruner for Large Language Model with Theoretical Guarantee ICML 2025 Link N/A
2025 DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration NeurIPS 2025 Link Link
2025 Multi-Objective One-Shot Pruning for Large Language Models NeurIPS 2025 Link N/A

Sparsity Rate Allocation

Year Title Venue Paper Code
2023 Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity ICML 2024 Link Link
2024 ALS: Adaptive Layer Sparsity for Large Language Models via Activation Correlation Assessment NeurIPS 2024 Link Link
2024 Discovering Sparsity Allocation for Layer-wise Pruning of Large Language Models NeurIPS 2024 Link Link
2024 AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models NeurIPS 2024 Link Link
2024 EvoPress: Accurate Dynamic Model Compression via Evolutionary Search ICML 2025 Link Link
2025 Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective ICML 2025 Link Link
2025 DLP: Dynamic Layerwise Pruning in Large Language Models ICML 2025 Link Link
2025 Lua-LLM: Learning Unstructured-Sparsity Allocation for Large Language Models NeurIPS 2025 Link N/A

Sparse plus Low-Rank Compression

Year Title Venue Paper Code
2024 OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition ICLR 2025 Link Link
2025 Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models ICML 2025 Link Link
2025 1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models EMNLP 2025 Link Link
2025 3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs NeurIPS 2025 Link Link

Calibration Dataset

Year Title Venue Paper Code
2024 On the Impact of Calibration Data in Post-training Quantization and Pruning ACL 2024 Link Link
2024 Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning EMNLP 2024 Link Link
2024 Beware of Calibration Data for Pruning Large Language Models ICLR 2025 Link Link

Evaluation of Pruned Model

Year Title Venue Paper Code
2023 Compressing LLMs: The Truth is Rarely Pure and Never Simple ICLR 2024 Link Link
2025 Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression ICML 2025 Link Link
2025 Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs EMNLP 2025 Findings Link N/A

Semi-structured Pruning

Year Title Venue Paper Code
2024 WRP: Weight Recover Prune for Structured Sparsity ACL 2024 Link Link
2024 Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models ICLR 2024 Link Link
2024 Pruning Large Language Models with Semi-Structural Adaptive Sparse Training AAAI 2025 Link Link
2024 MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models NeurIPS 2024 Link Link
2025 ProxSparse: Regularized Learning of Semi-Structured Sparsity Masks for Pretrained LLMs ICML 2025 Link Link
2025 PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models NeurIPS 2025 Link Link
2025 TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks NeurIPS 2025 Link Link

Structured Pruning

Head and Neuron Pruning

Year Title Venue Paper Code
2023 LLM-Pruner: On the Structural Pruning of Large Language Models NeurIPS 2023 Link Link
2023 Fluctuation-based Adaptive Structured Pruning for Large Language Models AAAI 2024 Link Link
2023 Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning ICLR 2024 Link Link
2024 BlockPruner: Fine-grained Pruning for Large Language Models ACL 2025 Findings Link Link
2024 Structured Optimal Brain Pruning for Large Language Models EMNLP 2024 Link N/A
2024 Search for Efficient Large Language Models NeurIPS 2024 Link Link
2024 SlimGPT: Layer-wise Structured Pruning for Large Language Models NeurIPS 2024 Link N/A
2024 Compact Language Models via Pruning and Knowledge Distillation NeurIPS 2024 Link Link
2024 DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models NeurIPS 2024 Link Link
2025 Tyr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization NeurIPS 2025 Link Link
2025 Olica: Efficient Structured Pruning of Large Language Models without Retraining ICML 2025 Link Link

Layer Pruning

Year Title Venue Paper Code
2024 Shortened LLaMA: A Simple Depth Pruning for Large Language Models ICLR 2024 Workshop Link Link
2024 LaCo: Large Language Model Pruning via Layer Collapse EMNLP 2024 Findings Link Link
2024 ShortGPT: Layers in Large Language Models are More Redundant Than You Expect ACL 2025 Findings Link Link
2024 Streamlining Redundant Layers to Compress Large Language Models ICLR 2025 Link Link
2024 SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks ICML 2024 Link Link
2024 Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging EMNLP 2024 Link N/A
2024 TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs ACL 2025 Link Link
2025 A Simple Linear Patch Revives Layer-Pruned Large Language Models NeurIPS 2025 Link Link

Other Topics

Year Title Venue Paper Code
2023 Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning ACL 2024 Findings Link Link
2024 SliceGPT: Compress Large Language Models by Deleting Rows and Columns ICLR 2024 Link Link
2024 APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference ICML 2024 Link Link
2024 Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations ACL 2024 Findings Link Link
2024 LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models ICML 2024 Link Link
2024 Pruning as a Domain-specific LLM Extractor NAACL 2024 Findings Link Link
2024 Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient ACL 2025 Link Link
2025 One-for-All Pruning: A Universal Model for Customized Compression of Large Language Models ACL 2025 Findings Link N/A
2024 RankAdaptor: Hierarchical Rank Allocation for Efficient Fine-Tuning Pruned LLMs via Performance Model NAACL 2024 Findings Link N/A
2024 Finding Transformer Circuits with Edge Pruning NeurIPS 2024 Link Link
2024 MoDeGPT: Modular Decomposition for Large Language Model Compression ICLR 2025 Link Link
2024 The Unreasonable Ineffectiveness of the Deeper Layers ICLR 2025 Link N/A
2024 PAT: Pruning-Aware Tuning for Large Language Models AAAI 2025 Link Link
2024 Change Is the Only Constant: Dynamic LLM Slicing based on Layer Redundancy EMNLP 2024 Findings Link Link
2024 LEMON: Reviving Stronger and Smaller LMs from Larger LMs with Linear Parameter Fusion ACL 2024 Link N/A
2024 DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization ACL 2025 Link Link
2025 You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning ICLR 2025 Link Link
2025 LLaMaFlex: Many-in-one LLMs via Generalized Pruning and Weight Sharing ICLR 2025 Link N/A
2025 Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing ICLR 2025 Link Link
2025 Instruction-Following Pruning for Large Language Models ICML 2025 Link N/A
2025 Let LLM Tell What to Prune and How Much to Prune ICML 2025 Link Link
2025 Prompt-based Depth Pruning of Large Language Models ICML 2025 Link Link
2025 IG-Pruning: Input-Guided Block Pruning for Large Language Models EMNLP 2025 Link Link
2025 PIP: Perturbation-based Iterative Pruning for Large Language Models EMNLP 2025 Findings Link N/A
2025 ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization NeurIPS 2025 Link Link
2025 Restoring Pruned Large Language Models via Lost Component Compensation NeurIPS 2025 Link Link

Activation Sparsity

Year Title Venue Paper Code
2023 Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time ICML 2024 Link Link
2023 ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models ICLR 2024 Link N/A
2024 CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models COLM 2024 Link Link
2024 ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models EMNLP 2024 Link Link
2024 Training-Free Activation Sparsity in Large Language Models ICLR 2025 Link Link
2024 Sparsing Law: Towards Large Language Models with Greater Activation Sparsity ICML 2025 Link Link
2025 La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation ICML 2025 Link N/A
2025 R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference ICLR 2025 Link Link
2024 Sirius: Contextual Sparsity with Correction for Efficient LLMs NeurIPS 2024 Link Link
2024 Learn To be Efficient: Build Structured Sparsity in Large Language Models NeurIPS 2024 Link Link
2025 Weight-Aware Activation Sparsity with Constrained Bayesian Optimization Scheduling for Large Language Models EMNLP 2025 Link Link
2025 Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity NeurIPS 2025 Link Link

Joint Sparsification and Quantization

Year Title Venue Paper Code
2024 SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models EMNLP 2024 Findings Link Link
2024 Effective Interplay between Sparsity and Quantization: From Theory to Practice ICLR 2025 Link Link
2024 Compressing large language models by joint sparsification and quantization ICML 2024 Link Link
2024 SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression ICML 2025 Link Link
2025 Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs arXiv 2025 Link Link

Knowledge Distillation

Year Title Venue Paper Code
2025 Random Conditioning with Distillation for Data-Efficient Diffusion Model Compression CVPR 2025 Link Link
2025 LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation ICLR 2025 Link Link
2025 Pre-training Distillation for Large Language Models: A Design Space Exploration ACL 2025 Link N/A
2025 TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models ICLR 2025 Link N/A
2025 Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling ICLR 2025 Link N/A
2025 Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation CVPR 2025 Link N/A
2025 Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models AAAI 2025 Link N/A
2025 Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models COLING 2025 Link N/A
2025 Lillama: Large Language Models Compression via Low-Rank Feature Distillation NAACL 2025 Link Link
2025 Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs TMLR 2025 Link Link
2024 MiniLLM: Knowledge Distillation of Large Language Models ICLR 2024 Link Link
2024 On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes ICLR 2024 Link N/A
2024 DistiLLM: Towards Streamlined Distillation for Large Language Models ICML 2024 Link Link
2024 DDK: Distilling Domain Knowledge for Efficient Large Language Models NeurIPS 2024 Link N/A
2024 Adversarial Moment-Matching Distillation of Large Language Models NeurIPS 2024 Link N/A
2024 PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning EMNLP 2024 Findings Link Link
2024 Dual-Space Knowledge Distillation for Large Language Models EMNLP 2024 Link Link
2024 ELAD: Explanation-Guided Large Language Models Active Distillation ACL 2024 Findings Link N/A
2024 Improve Student's Reasoning Generalizability through Cascading Decomposed CoTs Distillation EMNLP 2024 Link N/A
2024 CLIP-KD: An Empirical Study of CLIP Model Distillation CVPR 2024 Link Link
2024 Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data AAAI 2024 Link Link
2024 LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions EACL 2024 Link Link
2024 Aligning Large and Small Language Models via Chain-of-Thought Reasoning EACL 2024 Link Link
2024 Weight-Inherited Distillation for Task-Agnostic BERT Compression NAACL 2024 Link Link
2024 Knowledge Fusion of Large Language Models ICLR 2024 Link Link
2024 OPENCHAT: Advancing Open-source Language Models with Mixed-Quality Data ICLR 2024 Link Link
2024 Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models ICML 2024 Link Link
2023 AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression ACL 2023 Link Link
2023 DiffKD: Diffusion-based Knowledge Distillation for Large Language Models NeurIPS 2023 Link Link
2023 SCOTT: Self-Consistent Chain-of-Thought Distillation ACL 2023 Link Link
2023 Distilling Script Knowledge from Large Language Models for Constrained Language Planning ACL 2023 Link Link
2023 DOT: A Distillation-Oriented Trainer ICCV 2023 Link Link
2023 Specializing Smaller Language Models towards Multi-Step Reasoning ICML 2023 Link Link
2023 DISCO: Distilling Counterfactuals with Large Language Models ACL 2023 Link Link
2023 Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind NeurIPS 2023 Link Link
2023 PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation EMNLP 2023 Link Link
2023 Democratizing Reasoning Ability: Tailored Learning from Large Language Model EMNLP 2023 Link Link
2023 GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model ACL 2023 Link Link
2023 Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes ACL 2023 Link Link
2023 Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models EMNLP 2023 Link Link
2023 f-Divergence Minimization for Sequence-Level Knowledge Distillation ACL 2023 Link Link
2023 Symbolic Chain-of-Thought Distillation: Small Models Can Also Think Step-by-Step ACL 2023 Link N/A
2023 Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks NeurIPS 2023 Link Link
2023 Personalised Distillation: Empowering Open-Sourced LLMs with Adaptive Learning for Code Generation EMNLP 2023 Link Link
2023 Lion: Adversarial Distillation of Closed-Source Large Language Model EMNLP 2023 Link Link
2023 InheritSumm: A General, Versatile and Compact Summarizer by Distilling from GPT EMNLP 2023 Link N/A
2023 Aligning Large Language Models through Synthetic Feedback EMNLP 2023 Link Link
2023 MCC-KD: Multi-CoT Consistent Knowledge Distillation EMNLP 2023 Findings Link N/A
2023 Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization ICLR 2023 Link Link
2023 Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data EMNLP 2023 Link Link
2022 TinyViT: Fast Pretraining Distillation for Small Vision Transformers ECCV 2022 Link Link
2022 DIST: Distilling Large Language Models with Small-Scale Data NeurIPS 2022 Link Link
2022 Decoupled Knowledge Distillation CVPR 2022 Link Link

Low-rank Decomposition

Year Title Venue Paper Code
2025 Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models ICML 2025 Link Link
2025 Dobi-SVD: Differentiable SVD for LLM Compression and Some New Perspectives ICLR 2025 Link Link
2025 SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs ICLR 2025 Link Link
2025 Dynamic Low-Rank Sparse Adaptation for Large Language Models ICLR 2025 Link Link
2025 MoE-SVD: Structured Mixture-of-Experts LLMs Compression via Singular Value Decomposition ICML 2025 Link N/A
2025 Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition ACL 2025 Findings Link N/A
2025 Delta Decompression for MoE-based LLMs Compression PMLR 2025 Link N/A
2025 LittleBit: Ultra Low-Bit Quantization via Latent Factorization NeurIPS 2025 Link N/A
2024 Compressing Large Language Models using Low Rank and Low Precision Decomposition NeurIPS 2024 Link Link
2024 Unified Low-rank Compression Framework for Click-through Rate Prediction KDD 2024 Link Link
2024 SliceGPT: Compress Large Language Models by Deleting Rows and Columns ICLR 2024 Link Link
2024 Low-Rank Knowledge Decomposition for Medical Foundation Models CVPR 2024 Link Link
2024 LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking CVPR 2024 Link Link
2024 Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization ISCA 2024 Link N/A
2024 LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning ICLR 2024 Link N/A
2024 LQER: Low-Rank Quantization Error Reconstruction for LLMs ICML 2024 Link N/A
2024 Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations ACL 2024 Findings Link N/A
2024 Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization ACL 2024 Findings Link N/A
2024 Surgical Feature-Space Decomposition of LLMs: Why, When and How? ACL 2024 Link N/A
2024 DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models EMNLP 2024 Link N/A
2023 LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation ICML 2023 Link Link
2022 Compressible-composable NeRF via Rank-residual Decomposition NeurIPS 2022 Link Link

Efficient Architecture

State Space Models and Linear Attention

Year Title Venue Paper Code
2023 Hungry Hungry Hippos: Towards Language Modeling with State Space Models ICLR 2023 Link Link
2023 Hyena Hierarchy: Towards Larger Convolutional Language Models ICML 2023 Link Link
2023 RWKV: Reinventing RNNs for the Transformer Era EMNLP 2023 Findings Link Link
2023 Mamba: Linear-Time Sequence Modeling with Selective State Spaces arXiv 2023 Link Link
2023 RetNet: Retentive Network: A Successor to Transformer for Large Language Models arXiv 2023 Link Link
2023 Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture NeurIPS 2023 Link Link
2024 Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2) ICML 2024 Link Link
2024 Gated Linear Attention Transformers with Hardware-Efficient Training ICML 2024 Link Link
2024 Based: Simple Linear Attention Language Models Balance the Recall-Throughput Tradeoff ICML 2024 Link Link
2024 Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models arXiv 2024 Link N/A
2024 Jamba: A Hybrid Transformer-Mamba Language Model arXiv 2024 Link N/A

Efficient Attention Mechanisms

Year Title Venue Paper Code
2023 GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints EMNLP 2023 Link N/A
2023 HyperAttention: Long-context Attention in Near-Linear Time NeurIPS 2023 Link N/A
2024 FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning ICLR 2024 Link Link
2024 Ring Attention with Blockwise Transformers for Near-Infinite Context ICLR 2024 Link Link
2024 FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision NeurIPS 2024 Link Link
2024 ThunderKittens: Simple, Fast, and Adorable AI Kernels NeurIPS 2024 Link Link

Mixture of Experts Efficiency

Year Title Venue Paper Code
2024 DeepSeek-MoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models ACL 2024 Link Link
2024 Mixtral of Experts arXiv 2024 Link Link
2025 Ada-K Routing: Boosting the Efficiency of MoE-based LLMs ICLR 2025 Link N/A

Other Efficient Architectures

Year Title Venue Paper Code
2024 You Only Cache Once: Decoder-Decoder Architectures for Language Models arXiv 2024 Link Link
2024 Scalable MatMul-free Language Modeling arXiv 2024 Link Link

Inference-side Acceleration

Speculative Decoding

Year Title Venue Paper Code
2025 HASS: Learning Harmonized Representations for Speculative Sampling ICLR 2025 Link Link
2025 PEARL: Parallel Speculative Decoding with Adaptive Draft Length ICLR 2025 Link Link
2025 SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration ICLR 2025 Link Link
2025 Pre-Training Curriculum for Multi-Token Prediction in Language Models ACL 2025 Link Link
2025 Faster Speculative Decoding via Effective Draft Decoder with Pruned Candidate Tree ACL 2025 Link Link
2025 SAM Decoding: Speculative Decoding via Suffix Automaton ACL 2025 Link Link
2025 DReSD: Dense Retrieval for Speculative Decoding ACL 2025 Findings Link N/A
2025 EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models NAACL 2025 Link N/A
2025 Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding NAACL 2025 Findings Link N/A
2025 SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths COLM 2025 Link Link
2024 Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting NeurIPS 2024 Link Link
2024 EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees EMNLP 2024 Link Link
2024 Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads ICML 2024 Link Link
2024 EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty ICML 2024 Link Link
2024 Online Speculative Decoding ICML 2024 Link N/A
2024 SpecExec: Massively Parallel Speculative Decoding For Interactive LLM Inference on Consumer Devices NeurIPS 2024 Link Link
2024 Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding NeurIPS 2024 Link Link
2024 Cascade Speculative Drafting for Even Faster LLM Inference NeurIPS 2024 Link Link
2024 Accelerating Blockwise Parallel Language Models with Draft Refinement NeurIPS 2024 Link N/A
2024 Graph-Structured Speculative Decoding ACL 2024 Findings Link N/A
2024 Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism ACL 2024 Findings Link N/A
2024 SLiM: Speculative Decoding with Hypothesis Reduction NAACL 2024 Findings Link N/A
2024 SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification ASPLOS 2024 Link N/A
2024 REST: Retrieval-Based Speculative Decoding NAACL 2024 Link Link
2024 Lookahead Decoding: Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ICML 2024 Link Link
2024 LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding ACL 2024 Link Link
2024 Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding EMNLP 2024 Link Link
2024 CLLMs: Consistency Large Language Models ICLR 2024 Link Link
2023 Fast Inference from Transformers via Speculative Decoding ICML 2023 Link N/A
2023 SpecTr: Fast Speculative Decoding via Optimal Transport NeurIPS 2023 Link N/A
2023 Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation EMNLP 2023 Findings Link Link
2023 Speculative Decoding with Big Little Decoder NeurIPS 2023 Link Link

KV Cache Optimization

Year Title Venue Paper Code Category
2025 InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation Link Link Token Eviction
2025 R-KV: Redundancy-aware KV Cache Compression for Reasoning Models Link Link Token Eviction
2025 SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator ICML 2025 Link Link Token Eviction
2025 RazorAttention: Efficient KV Cache Compression Through Retrieval Heads ICLR 2025 Link N/A Token Eviction
2025 Squeezed Attention: Accelerating Long Context Length LLM Inference ACL 2025 Link Link Token Eviction
2025 LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models ICML 2025 Link Link Budget Allocation
2025 CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences ICLR 2025 Link Link Budget Allocation
2025 AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning ICCV 2025 Link Link Cache Merging
2025 MiniKV: Pushing the Limits of 2-Bit KV Cache via Compression and System Co-Design for Efficient Long Context Inference ACL 2025 Findings Link N/A Quantization
2025 Palu: KV-Cache Compression with Low-Rank Projection ICLR 2025 Link Link Low Rank Projection
2025 LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy ICLR 2025 Link N/A Low Rank Projection
2025 Preserving Large Activations: The Key to KV Cache Pruning ICLR 2025 Link N/A Token Eviction
2025 VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration ICLR 2025 Link N/A Budget Allocation
2025 Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning ICLR 2025 Link Link Token Eviction
2025 TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization ACL 2025 Findings Link N/A System/Offloading
2025 KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation ACL 2025 Findings Link Link System/Offloading
2025 KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding ACL 2025 Link Link Low Rank Projection
2025 DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs EMNLP 2025 Findings Link N/A Budget Allocation
2024 SnapKV: LLM Knows What You are Looking for Before Generation NeurIPS 2024 Link Link Token Eviction
2024 InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory NeurIPS 2024 Link Link Token Eviction
2024 Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs ICLR 2024 Link Link Token Eviction
2024 Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference MLSys 2024 Link Link Token Eviction
2024 Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ICML 2024 Link Link Token Eviction
2024 On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference Link Link Token Eviction
2024 PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling Link Link Budget Allocation
2024 MiniCache: KV Cache Compression in Depth Dimension for Large Language Models NeurIPS 2024 Link Link Cache Merging
2024 CaM: Cache Merging for Memory-efficient LLMs Inference ICML 2024 Link Link Cache Merging
2024 Compressed Context Memory For Online Language Model Interaction ICLR 2024 Link Link Cache Merging
2024 Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference ICML 2024 Link Link Cache Merging
2024 LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference EMNLP 2024 Findings Link Link Cache Merging
2024 CHAI: Clustered Head Attention for Efficient LLM Inference ICML 2024 Link Link Cache Merging
2024 D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models ICLR 2025 Link Link Cache Merging
2024 IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact ACL 2024 Link Link Quantization
2024 KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ICML 2024 Link Link Quantization
2024 KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization NeurIPS 2024 Link Link Quantization
2024 SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models COLM 2024 Link Link Quantization
2024 GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM NeurIPS 2024 Link Link Quantization
2024 NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time ACL 2024 Link Link Token Eviction
2024 SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation ACL 2025 Link Link Token Eviction
2024 AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations ACL 2025 Link N/A Quantization
2024 Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference ICML 2024 Link Link Cache Merging
2024 Layer-Condensed KV Cache for Efficient Inference of Large Language Models ACL 2024 Link Link Cache Sharing
2024 FINCH: Prompt-guided Key-Value Cache Compression for Large Language Models TACL 2024 / EMNLP 2024 Link N/A Token Eviction
2024 KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches EMNLP 2024 Findings Link Link Benchmark
2024 Eigen Attention: Attention in Low-Rank Space for KV Cache Compression EMNLP 2024 Findings Link Link Low Rank Projection
2024 A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression EMNLP 2024 Link Link Token Eviction
2023 H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models NeurIPS 2023 Link Link Token Eviction
2023 Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time NeurIPS 2023 Link Link Token Eviction
2023 Efficient Streaming Language Models with Attention Sinks ICLR 2024 Link Link Token Eviction

Prompt / Context Compression

Year Title Venue Paper Code
2023 LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models EMNLP 2023 Link Link
2023 Selective Context: Compressing Context to Enhance Inference Efficiency of Large Language Models EMNLP 2023 Link Link
2023 Learning to Compress Prompts with Gist Tokens NeurIPS 2023 Link Link
2023 Adapting Language Models to Compress Contexts (AutoCompressors) EMNLP 2023 Link Link
2023 Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers NeurIPS 2023 Link N/A
2024 In-context Autoencoder for Context Compression in a Large Language Model (ICAE) ICLR 2024 Link Link
2024 RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation ICLR 2024 Link N/A
2024 LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression ACL 2024 Link Link
2024 LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression ACL 2024 Findings Link Link
2024 Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon ICML 2024 Link Link
2024 xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token NeurIPS 2024 Link N/A
2025 500xCompressor: Generalized Prompt Compression for Large Language Models AAAI 2025 Link N/A
2024 Hierarchical and Dynamic Prompt Compression for Efficient Zero-shot API Usage EACL 2024 Findings Link N/A
2024 Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles EMNLP 2024 Findings Link N/A

Early Exit

Year Title Venue Paper Code
2022 CALM: Confident Adaptive Language Modeling NeurIPS 2022 Link N/A
2023 FREE: Fast and Robust Early Exiting Framework for Autoregressive Language Models EMNLP 2023 Link N/A
2023 SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference arXiv 2023 Link N/A
2023 Speculative Decoding with Big Little Decoder (BiLD) NeurIPS 2023 Link Link
2024 ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference AAAI 2024 Link N/A
2024 EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism ICML 2024 Link Link
2024 LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding ACL 2024 Link Link
2024 Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting NeurIPS 2024 Link Link
2024 Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding ACL 2024 Link N/A
2024 Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning ICLR 2024 Link N/A

Adaptive Computation

Year Title Venue Paper Code
2023 AdaTape: Foundation Model with Adaptive Computation via Elastic Input Sequence ICML 2023 Link N/A
2023 CoLT5: Faster Long-Range Transformers with Conditional Computation EMNLP 2023 Link N/A
2024 Mixture of Depths: Dynamically Allocating Compute in Transformer-Based Language Models ICML 2024 Link N/A
2024 Think before you speak: Training Language Models With Pause Tokens ICLR 2024 Link N/A
2024 MatFormer: Nested Transformer for Elastic Inference ICLR 2024 Link Link
2024 FLEXTRON: Many-in-One Flexible Large Language Model ICML 2024 Link N/A
2024 LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference arXiv 2024 Link N/A
2024 PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU MLSys 2024 Link Link
2024 D-LLM: A Token Adaptive Computing Resource Allocation Strategy for Large Language Models NeurIPS 2024 Link Link
2024 Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models NeurIPS 2024 Workshop Link N/A
2024 RouteLLM: Learning to Route LLMs with Preference Data arXiv 2024 Link Link

Training and Fine-tuning Efficiency

PEFT

Year Title Venue Paper Code
2023 AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning ICLR 2023 Link Link
2023 QLoRA: Efficient Finetuning of Quantized LLMs NeurIPS 2023 Link Link
2023 LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models EMNLP 2023 Link Link
2023 LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model NeurIPS 2023 Link Link
2023 DyLoRA: Parameter-Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation EACL 2023 Link N/A
2024 LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models ICLR 2024 Link Link
2024 VeRA: Vector-based Random Matrix Adaptation ICLR 2024 Link N/A
2024 LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models ICLR 2024 Link Link
2024 NOLA: Compressing LoRA using Linear Combination of Random Basis ICLR 2024 Link N/A
2024 DoRA: Weight-Decomposed Low-Rank Adaptation ICML 2024 Link Link
2024 LoRA+: Efficient Low-Rank Adaptation of Large Models ICML 2024 Link Link
2024 RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation ICML 2024 Link Link
2024 PiSSA: Principal Singular Values and Singular Vectors Adaptation of LLMs NeurIPS 2024 Link Link
2024 LoRAHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition ACL 2024 Link Link
2024 LoRA Learns Less and Forgets Less ICML 2024 Link N/A
2024 ReFT: Representation Finetuning for Language Models NeurIPS 2024 Link Link
2024 S²FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity NeurIPS 2024 Link N/A
2024 CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning NeurIPS 2024 Link N/A
2024 HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning NeurIPS 2024 Link Link
2024 Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models ICML 2024 Link N/A

Quantized Fine-tuning

Year Title Venue Paper Code
2023 QLoRA: Efficient Finetuning of Quantized LLMs NeurIPS 2023 Link Link
2023 PEQA: Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization NeurIPS 2023 Link N/A
2024 QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models ICLR 2024 Link Link
2024 LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models ICLR 2024 Link Link
2024 LQ-LoRA: Low-Rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning ICLR 2024 Link Link
2024 IR-QLoRA: Accurate LoRA-Finetuning Quantization of LLMs via Information Retention ICML 2024 Link Link
2024 BitDelta: Your Fine-Tune May Only Be Worth One Bit ICML 2024 Link Link
2024 EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ACL 2025 Link Link
2024 AQLM: Extreme Compression of Large Language Models via Additive Quantization ICML 2024 Link Link
2024 The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (BitNet b1.58) arXiv 2024 Link N/A

Low-rank Gradient Training

Year Title Venue Paper Code
2023 Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions NeurIPS 2023 Link N/A
2024 GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection ICML 2024 Link Link
2024 Flora: Low-Rank Adapters Are Secretly Gradient Compressors ICML 2024 Link Link
2024 ReLoRA: High-Rank Training Through Low-Rank Updates ICLR 2024 Link Link
2024 Full Parameter Fine-Tuning for Large Language Models with Limited Resources (LOMO) ACL 2024 Link Link
2024 AdaLomo: Low-memory Optimization with Adaptive Learning Rate ACL 2024 Findings Link Link
2024 SLTrain: A Sparse Plus Low-Rank Approach for Parameter and Memory Efficient Pretraining NeurIPS 2024 Link N/A
2024 Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients arXiv 2024 Link N/A
2025 Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? ICML 2025 Link N/A
2024 Memory-Efficient LLM Training with Online Subspace Descent NeurIPS 2024 Link N/A

Memory-efficient Training

Year Title Venue Paper Code
2023 Reducing Activation Recomputation in Large Transformer Models MLSys 2023 Link Link
2023 CAME: Confidence-guided Adaptive Memory Efficient Optimization ACL 2023 Link N/A
2023 Training Transformers with 4-bit Integers NeurIPS 2023 Link Link
2023 MeZO: Fine-Tuning Language Models with Just Forward Passes NeurIPS 2023 Link Link
2024 Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training ICLR 2024 Link Link
2024 ZeRO++: Extremely Efficient Collective Communication for Giant Model Training ICML 2024 Link Link
2024 Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization ICML 2024 Link Link
2024 Adam-mini: Use Fewer Learning Rates, To Gain More NeurIPS 2024 Link Link
2024 LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning NeurIPS 2024 Link N/A
2024 VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections NeurIPS 2024 Link N/A
2024 ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models ICML 2024 Link N/A

System and Hardware Co-design

Serving Systems

Year Title Venue Paper Code
2023 Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM) SOSP 2023 Link Link
2023 FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU ICML 2023 Link Link
2023 AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving OSDI 2023 Link N/A
2024 SGLang: Efficient Execution of Structured Language Model Programs NeurIPS 2024 Link Link
2024 DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving OSDI 2024 Link N/A
2024 Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve OSDI 2024 Link N/A
2024 S-LoRA: Serving Thousands of Concurrent LoRA Adapters MLSys 2024 Link Link
2024 SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification ASPLOS 2024 Link Link
2024 SpotServe: Serving Generative Large Language Models on Preemptible Instances ASPLOS 2024 Link N/A
2024 Splitwise: Efficient Generative LLM Inference Using Phase Splitting ISCA 2024 Link N/A
2024 MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving arXiv 2024 Link Link
2024 ServerlessLLM: Low-Latency Serverless Inference for Large Language Models OSDI 2024 Link Link

Batching and Scheduling

Year Title Venue Paper Code
2023 Efficiently Scaling Transformer Inference MLSys 2023 Link N/A
2024 Llumnix: Dynamic Scheduling for Large Language Model Serving OSDI 2024 Link Link
2024 Sarathi-Serve: Taming Throughput-Latency Tradeoff in LLM Inference with Chunked Prefills OSDI 2024 Link N/A
2024 DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving OSDI 2024 Link N/A
2024 Splitwise: Efficient Generative LLM Inference Using Phase Splitting ISCA 2024 Link N/A
2024 Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction arXiv 2024 Link N/A
2024 Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving arXiv 2024 Link N/A
2024 Efficient LLM Scheduling by Learning to Rank arXiv 2024 Link N/A

Kernel Optimization

Year Title Venue Paper Code
2023 FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks ASPLOS 2023 Link N/A
2023 HyperAttention: Long-context Attention in Near-Linear Time NeurIPS 2023 Link N/A
2024 FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning ICLR 2024 Link Link
2024 FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision NeurIPS 2024 Link Link
2024 ThunderKittens: Simple, Fast, and Adorable AI Kernels NeurIPS 2024 Link Link
2024 Ring Attention with Blockwise Transformers for Near-Infinite Context ICLR 2024 Link Link
2025 FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving MLSys 2025 Link Link
2024 NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention NeurIPS 2024 Link N/A

Compiler Optimization

Year Title Venue Paper Code
2023 Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs ASPLOS 2023 Link Link
2023 TensorIR: An Abstraction for Automatic Tensorized Program Optimization ASPLOS 2023 Link Link
2023 Welder: Scheduling Deep Learning Memory Access via Tile-level Fusion OSDI 2023 Link N/A
2023 OLLA: Optimizing the Lifetime and Location of Arrays to Reduce Memory Usage of Neural Networks MLSys 2023 Link N/A
2024 Ladder: Enabling Efficient Low-Bit Quantization and Inference with Compiler Co-Design OSDI 2024 Link N/A
2024 An LLM Compiler for Parallel Function Calling ICML 2024 Link N/A

Hardware-aware Deployment

Year Title Venue Paper Code
2023 Efficiently Scaling Transformer Inference MLSys 2023 Link N/A
2024 AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration MLSys 2024 Link Link
2024 Atom: Low-bit Quantization for Efficient and Accurate LLM Serving MLSys 2024 Link Link
2024 QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving MLSys 2025 Link Link
2024 MLC-LLM: Universal LLM Deployment on Consumer Devices with ML Compilation MLSys 2024 Link Link
2024 HexGen: Generative Inference of Large Language Model over Heterogeneous Environment ICML 2024 Link Link

Evaluation and Applications

Long-context Evaluation

Year Title Venue Paper Code
2023 ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding EMNLP 2023 Link Link
2024 Lost in the Middle: How Language Models Use Long Contexts TACL 2024 Link N/A
2024 LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding ACL 2024 Link Link
2024 L-Eval: Instituting Standardized Evaluation for Long Context Language Models ACL 2024 Link Link
2024 RULER: What's the Real Context Size of Your Long-Context Language Models? NAACL 2024 Link Link
2024 InfiniteBench: Extending Long Context Evaluation Beyond 100K Tokens ACL 2024 Link Link
2024 Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of LLMs ACL 2024 Link N/A
2024 BABILong: Testing the Limits of LLMs with Long Context Reasoning Benchmarks NeurIPS 2024 Link N/A
2024 M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models ACL 2024 Link N/A
2024 Ada-LEval: Evaluating Long-context LLMs with Length-adaptable Benchmarks NAACL 2024 Link N/A

Reasoning Robustness

Year Title Venue Paper Code
2023 Compressing LLMs: The Truth is Rarely Pure and Never Simple ICLR 2024 Link Link
2024 ShortGPT: Layers in Large Language Models are More Redundant Than You Expect ACL 2025 Findings Link N/A
2024 The Unreasonable Ineffectiveness of the Deeper Layers ICLR 2025 Link N/A
2024 LASER: Layer-Selective Rank Reduction for Improving Reasoning ICML 2024 Link N/A
2025 Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression ICML 2025 Link Link
2025 Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs EMNLP 2025 Findings Link N/A
2025 Can Pruning Improve Reasoning? Revisiting Long-CoT Compression with Capability in Mind for Better Reasoning arXiv 2025 Link N/A

Safety under Compression

Year Title Venue Paper Code
2023 Safety Alignment Should Be Made More Manageable NeurIPS 2023 Link N/A
2024 Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To ICLR 2024 Link N/A
2024 Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation ICLR 2024 Link N/A
2024 Playing It Safe: Defending Against Backdoors with Activation Clustering in Quantized LLMs AAAI 2024 Link N/A
2024 Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning NeurIPS 2024 Link N/A
2024 Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models NeurIPS 2024 Link N/A

Multimodal LLMs

Year Title Venue Paper Code
2024 Honeybee: Locality-enhanced Projector for Multimodal LLM CVPR 2024 Link Link
2024 MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices CVPR 2024 Link Link
2024 TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones EMNLP 2024 Link Link
2024 LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model AAAI 2024 Link N/A
2024 FastV: An Image is Worth 1/2 Tokens After Layer 2 ECCV 2024 Link Link
2024 Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models NeurIPS 2024 Link Link
2024 TokenPacker: Efficient Visual Projector for Multimodal LLM NeurIPS 2024 Link N/A
2024 Matryoshka Multimodal Models NeurIPS 2024 Link N/A
2025 LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation ICLR 2025 Link Link
2025 VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow Guidance ICCV 2025 Link N/A
2024 MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer CVPR 2024 Link Link

Edge Deployment

Year Title Venue Paper Code
2024 MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases ICML 2024 Link N/A
2024 LLM in a flash: Efficient Large Language Model Inference with Limited Memory ICML 2024 Link N/A
2024 PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU MLSys 2024 Link Link
2024 EdgeMoE: Fast On-Device Inference of Mixture-of-Experts Based Large Language Models MLSys 2024 Link N/A
2024 LLMCad: Fast and Scalable On-device Large Language Model Inference MLSys 2024 Link N/A
2024 Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ICML 2024 Link N/A
2024 MLC-LLM: Universal LLM Deployment on Consumer Devices with ML Compilation MLSys 2024 Link Link
2024 MobileQuant: Mobile-friendly Quantization for On-device Language Models EMNLP 2024 Findings Link N/A
2024 GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment ACL 2024 Findings Link N/A
2024 PowerInfer-2: Fast Large Language Model Inference on a Smartphone arXiv 2024 Link Link

Contributing

We welcome contributions from the community! If you find any missing papers or errors, please feel free to:

  • Open an Issue to report errors or suggest papers
  • Submit a Pull Request to add new papers
  • Star this repository if you find it helpful
