Awesome-Efficient-Large-Models

A list of awesome papers on compression and acceleration of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs).

Continuously updated. Stars and watches are welcome!

Paper Collection | Contributing

News

[2026.05] We have open-sourced a comprehensive, continuously updated taxonomy of 400+ papers covering model compression, inference acceleration, and system co-design for efficient large models.

Taxonomy

Quantization

LLM Quantization

Year	Title	Venue	Paper	Code
2023	GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers	ICLR 2023	Link	Link
2025	OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting	ICLR 2025	Link	Link
2025	SpinQuant: LLM Quantization with Learned Rotations	ICLR 2025	Link	Link
2022	SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models	ICML 2023	Link	Link
2023	AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration	MLSys 2024	Link	Link
2024	QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks	ICML 2024	Link	Link
2025	QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving	MLSys 2025	Link	Link
2024	QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs	NeurIPS 2024	Link	Link
2024	Atom: Low-bit Quantization for Efficient and Accurate LLM Serving	MLSys 2024	Link	Link
2024	OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models	ICLR 2024	Link	Link
2023	QuIP: 2-Bit Quantization of Large Language Models With Guarantees	NeurIPS 2023	Link	Link
2022	LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale	NeurIPS 2022	Link	Link
2023	Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling	EMNLP 2023	Link	Link
2025	GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration	ICML 2025	Link	Link
2024	MagR: Weight Magnitude Reduction for Enhancing Post-Training Quantization	NeurIPS 2024	Link	Link
2024	AffineQuant: Affine Transformation Quantization for Large Language Models	ICLR 2024	Link	Link
2024	LLM-QAT: Data-Free Quantization Aware Training for Large Language Models	ACL 2024	Link	Link
2024	BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation	ACL 2024	Link	Link
2023	OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models	AAAI 2024 (Oral)	Link	Link
2024	SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression	ICLR 2024	Link	Link
2022	ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers	NeurIPS 2022	Link	Link
2024	LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models	ICLR 2024	Link	Link
2024	OneBit: Towards Extremely Low-bit Large Language Models	NeurIPS 2024	Link	Link
2023	LLM-FP4: 4-bit Floating-Point Quantized Transformers	EMNLP 2023	Link	Link
2024	FlatQuant: Flatness Matters for LLM Quantization	ICML 2025	Link	Link
2024	SqueezeLLM: Dense-and-Sparse Quantization	ICML 2024	Link	Link
2023	RPTQ: Reorder-based Post-training Quantization for Large Language Models		Link	Link
2024	QQQ: Quality Quattuor-Bit Quantization for Large Language Models	ICLR	Link	Link
2024	Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs		Link	Link
2025	CBQ: Cross-Block Quantization for Large Language Models	ICLR 2025	Link	N/A
2025	MambaQuant: Quantizing the Mamba Family with Variance Aligned Rotation Methods	ICLR 2025	Link	N/A
2025	SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators	ICLR 2025	Link	N/A
2025	Progressive Mixed-Precision Decoding for Efficient LLM Inference	ICLR 2025	Link	N/A
2025	Surprising Effectiveness of Pretraining Ternary Language Models at Scale	ICLR 2025	Link	N/A
2025	EfficientQAT: Efficient Quantization-Aware Training for Large Language Models	ACL 2025	Link	Link
2025	MiniKV: Pushing the Limits of 2-Bit KV Cache via Compression and System Co-Design for Efficient Long Context Inference	ACL 2025 Findings	Link	N/A
2025	AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations	ACL 2025	Link	N/A
2025	Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition	ACL 2025 Findings	Link	N/A
2025	LittleBit: Ultra Low-Bit Quantization via Latent Factorization	NeurIPS 2025	Link	Link
2024	Extreme Compression of Large Language Models via Additive Quantization	ICML 2024	Link	Link
2024	BiLLM: Pushing the Limit of Post-Training Quantization for LLMs	ICML 2024	Link	Link
2024	LQER: Low-Rank Quantization Error Reconstruction for LLMs	ICML 2024	Link	N/A
2024	Evaluating Quantized Large Language Models	ICML 2024	Link	Link
2024	QMoE: Sub-1-Bit Compression of Trillion Parameter Models	MLSys 2024	Link	Link
2024	DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs	NeurIPS 2024	Link	Link
2024	QBB: Quantization with Binary Bases for LLMs	NeurIPS 2024	Link	N/A
2024	Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models	NeurIPS 2024	Link	N/A
2024	VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models	EMNLP 2024	Link	Link

VLM Quantization

Year	Title	Venue	Paper	Code
2024	Q-VLM: Post-training Quantization for Large Vision Language Models	NeurIPS 2024	Link	Link
2025	MBQ: Modality-Balanced Quantization for Large Vision-Language Models	CVPR 2025	Link	Link
2025	MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization	ACM MM 2025	Link	Link
2025	CASP: Compression of Large Multimodal Models Based on Attention Sparsity	CVPR 2025	Link	Link
2024	Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation	ACM MM 2024	Link	N/A

Pruning / Sparsity

Unstructured Pruning

Pruning without Weight Update

Year	Title	Venue	Paper	Code
2023	A Simple and Effective Pruning Approach for Large Language Models	ICLR 2024	Link	Link
2024	BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation	ICLR 2024	Link	Link
2024	COPAL: Continual Pruning in Large Language Generative Models	ICML 2024	Link	N/A
2024	Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models	ICML 2024	Link	Link
2025	BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation	ICML 2025	Link	N/A
2025	SAFE: Finding Sparse and Flat Minima to Improve Pruning	ICML 2025	Link	Link
2025	SwiftPrune: Hessian-Free Weight Pruning for Large Language Models	EMNLP 2025 Findings	Link	N/A

Pruning with Weight Update

Year	Title	Venue	Paper	Code
2023	SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot	ICML 2023	Link	Link
2023	Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs	ICLR 2024	Link	Link
2023	The LLM Surgeon	ICLR 2024	Link	Link
2024	Fast and Optimal Weight Update for Pruned Large Language Models	TMLR 2024	Link	Link
2024	Pruning Foundation Models for High Accuracy without Retraining	EMNLP 2024 Findings	Link	Link
2024	SparseLLM: Towards Global Pruning for Pre-trained Language Models	NeurIPS 2024	Link	Link
2024	ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models	NeurIPS 2024	Link	Link
2024	Shears: Unstructured Sparsity with Neural Low-rank Adapter Search	NAACL 2024	Link	Link
2025	Wanda++: Pruning Large Language Models via Regional Gradients	ACL 2025 Findings	Link	Link
2024	Two Sparse Matrices are Better than One: Sparsifying Neural Networks with Double Sparse Factorization	ICLR 2025	Link	Link
2025	Dynamic Low-Rank Sparse Adaptation for Large Language Models	ICLR 2025	Link	Link
2024	Wasserstein Distances, Neuronal Entanglement, and Sparsity	ICLR 2025	Link	Link
2025	Targeted Low-rank Refinement: Enhancing Sparse Language Models with Precision	ICML 2025	Link	N/A
2025	An Efficient Pruner for Large Language Model with Theoretical Guarantee	ICML 2025	Link	N/A
2025	DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration	NeurIPS 2025	Link	Link
2025	Multi-Objective One-Shot Pruning for Large Language Models	NeurIPS 2025	Link	N/A

Sparsity Rate Allocation

Year	Title	Venue	Paper	Code
2023	Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity	ICML 2024	Link	Link
2024	ALS: Adaptive Layer Sparsity for Large Language Models via Activation Correlation Assessment	NeurIPS 2024	Link	Link
2024	Discovering Sparsity Allocation for Layer-wise Pruning of Large Language Models	NeurIPS 2024	Link	Link
2024	AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models	NeurIPS 2024	Link	Link
2024	EvoPress: Accurate Dynamic Model Compression via Evolutionary Search	ICML 2025	Link	Link
2025	Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective	ICML 2025	Link	Link
2025	DLP: Dynamic Layerwise Pruning in Large Language Models	ICML 2025	Link	Link
2025	Lua-LLM: Learning Unstructured-Sparsity Allocation for Large Language Models	NeurIPS 2025	Link	N/A

Sparse plus Low-Rank Compression

Year	Title	Venue	Paper	Code
2024	OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition	ICLR 2025	Link	Link
2025	Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models	ICML 2025	Link	Link
2025	1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models	EMNLP 2025	Link	Link
2025	3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs	NeurIPS 2025	Link	Link

Calibration Dataset

Year	Title	Venue	Paper	Code
2024	On the Impact of Calibration Data in Post-training Quantization and Pruning	ACL 2024	Link	Link
2024	Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning	EMNLP 2024	Link	Link
2024	Beware of Calibration Data for Pruning Large Language Models	ICLR 2025	Link	Link

Evaluation of Pruned Model

Year	Title	Venue	Paper	Code
2023	Compressing LLMs: The Truth is Rarely Pure and Never Simple	ICLR 2024	Link	Link
2025	Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression	ICML 2025	Link	Link
2025	Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs	EMNLP 2025 Findings	Link	N/A

Semi-structured Pruning

Year	Title	Venue	Paper	Code
2024	WRP: Weight Recover Prune for Structured Sparsity	ACL 2024	Link	Link
2024	Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models	ICLR 2024	Link	Link
2024	Pruning Large Language Models with Semi-Structural Adaptive Sparse Training	AAAI 2025	Link	Link
2024	MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models	NeurIPS 2024	Link	Link
2025	ProxSparse: Regularized Learning of Semi-Structured Sparsity Masks for Pretrained LLMs	ICML 2025	Link	Link
2025	PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models	NeurIPS 2025	Link	Link
2025	TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks	NeurIPS 2025	Link	Link

Structured Pruning

Head and Neuron Pruning

Year	Title	Venue	Paper	Code
2023	LLM-Pruner: On the Structural Pruning of Large Language Models	NeurIPS 2023	Link	Link
2023	Fluctuation-based Adaptive Structured Pruning for Large Language Models	AAAI 2024	Link	Link
2023	Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning	ICLR 2024	Link	Link
2024	BlockPruner: Fine-grained Pruning for Large Language Models	ACL 2025 Findings	Link	Link
2024	Structured Optimal Brain Pruning for Large Language Models	EMNLP 2024	Link	N/A
2024	Search for Efficient Large Language Models	NeurIPS 2024	Link	Link
2024	SlimGPT: Layer-wise Structured Pruning for Large Language Models	NeurIPS 2024	Link	N/A
2024	Compact Language Models via Pruning and Knowledge Distillation	NeurIPS 2024	Link	Link
2024	DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models	NeurIPS 2024	Link	Link
2025	Tyr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization	NeurIPS 2025	Link	Link
2025	Olica: Efficient Structured Pruning of Large Language Models without Retraining	ICML 2025	Link	Link

Layer Pruning

Year	Title	Venue	Paper	Code
2024	Shortened LLaMA: A Simple Depth Pruning for Large Language Models	ICLR 2024 Workshop	Link	Link
2024	LaCo: Large Language Model Pruning via Layer Collapse	EMNLP 2024 Findings	Link	Link
2024	ShortGPT: Layers in Large Language Models are More Redundant Than You Expect	ACL 2025 Findings	Link	Link
2024	Streamlining Redundant Layers to Compress Large Language Models	ICLR 2025	Link	Link
2024	SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks	ICML 2024	Link	Link
2024	Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging	EMNLP 2024	Link	N/A
2024	TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs	ACL 2025	Link	Link
2025	A Simple Linear Patch Revives Layer-Pruned Large Language Models	NeurIPS 2025	Link	Link

Activation Sparsity

Year	Title	Venue	Paper	Code
2023	Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time	ICML 2024	Link	Link
2023	ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models	ICLR 2024	Link	N/A
2024	CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models	COLM 2024	Link	Link
2024	ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models	EMNLP 2024	Link	Link
2024	Training-Free Activation Sparsity in Large Language Models	ICLR 2025	Link	Link
2024	Sparsing Law: Towards Large Language Models with Greater Activation Sparsity	ICML 2025	Link	Link
2025	La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation	ICML 2025	Link	N/A
2025	R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference	ICLR 2025	Link	Link
2024	Sirius: Contextual Sparsity with Correction for Efficient LLMs	NeurIPS 2024	Link	Link
2024	Learn To be Efficient: Build Structured Sparsity in Large Language Models	NeurIPS 2024	Link	Link
2025	Weight-Aware Activation Sparsity with Constrained Bayesian Optimization Scheduling for Large Language Models	EMNLP 2025	Link	Link
2025	Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity	NeurIPS 2025	Link	Link

Joint Sparsification and Quantization

Year	Title	Venue	Paper	Code
2024	SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models	EMNLP 2024 Findings	Link	Link
2024	Effective Interplay between Sparsity and Quantization: From Theory to Practice	ICLR 2025	Link	Link
2024	Compressing Large Language Models by Joint Sparsification and Quantization	ICML 2024	Link	Link
2024	SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression	ICML 2025	Link	Link
2025	Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs	arXiv 2025	Link	Link

Knowledge Distillation

Year	Title	Venue	Paper	Code
2025	Random Conditioning with Distillation for Data-Efficient Diffusion Model Compression	CVPR 2025	Link	Link
2025	LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation	ICLR 2025	Link	Link
2025	Pre-training Distillation for Large Language Models: A Design Space Exploration	ACL 2025	Link	N/A
2025	TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models	ICLR 2025	Link	N/A
2025	Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling	ICLR 2025	Link	N/A
2025	Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation	CVPR 2025	Link	N/A
2025	Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models	AAAI 2025	Link	N/A
2025	Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models	COLING 2025	Link	N/A
2025	Lillama: Large Language Models Compression via Low-Rank Feature Distillation	NAACL 2025	Link	Link
2025	Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs	TMLR 2025	Link	Link
2024	MiniLLM: Knowledge Distillation of Large Language Models	ICLR 2024	Link	Link
2024	On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes	ICLR 2024	Link	N/A
2024	DistiLLM: Towards Streamlined Distillation for Large Language Models	ICML 2024	Link	Link
2024	DDK: Distilling Domain Knowledge for Efficient Large Language Models	NeurIPS 2024	Link	N/A
2024	Adversarial Moment-Matching Distillation of Large Language Models	NeurIPS 2024	Link	N/A
2024	PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning	EMNLP 2024 Findings	Link	Link
2024	Dual-Space Knowledge Distillation for Large Language Models	EMNLP 2024	Link	Link
2024	ELAD: Explanation-Guided Large Language Models Active Distillation	ACL 2024 Findings	Link	N/A
2024	Improve Student's Reasoning Generalizability through Cascading Decomposed CoTs Distillation	EMNLP 2024	Link	N/A
2024	CLIP-KD: An Empirical Study of CLIP Model Distillation	CVPR 2024	Link	Link
2024	Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data	AAAI 2024	Link	Link
2024	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	EACL 2024	Link	Link
2024	Aligning Large and Small Language Models via Chain-of-Thought Reasoning	EACL 2024	Link	Link
2024	Weight-Inherited Distillation for Task-Agnostic BERT Compression	NAACL 2024	Link	Link
2024	Knowledge Fusion of Large Language Models	ICLR 2024	Link	Link
2024	OpenChat: Advancing Open-source Language Models with Mixed-Quality Data	ICLR 2024	Link	Link
2024	Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models	ICML 2024	Link	Link
2023	AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression	ACL 2023	Link	Link
2023	DiffKD: Diffusion-based Knowledge Distillation for Large Language Models	NeurIPS 2023	Link	Link
2023	SCOTT: Self-Consistent Chain-of-Thought Distillation	ACL 2023	Link	Link
2023	Distilling Script Knowledge from Large Language Models for Constrained Language Planning	ACL 2023	Link	Link
2023	DOT: A Distillation-Oriented Trainer	ICCV 2023	Link	Link
2023	Specializing Smaller Language Models towards Multi-Step Reasoning	ICML 2023	Link	Link
2023	DISCO: Distilling Counterfactuals with Large Language Models	ACL 2023	Link	Link
2023	Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind	NeurIPS 2023	Link	Link
2023	PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation	EMNLP 2023	Link	Link
2023	Democratizing Reasoning Ability: Tailored Learning from Large Language Model	EMNLP 2023	Link	Link
2023	GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model	ACL 2023	Link	Link
2023	Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes	ACL 2023	Link	Link
2023	Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models	EMNLP 2023	Link	Link
2023	f-Divergence Minimization for Sequence-Level Knowledge Distillation	ACL 2023	Link	Link
2023	Symbolic Chain-of-Thought Distillation: Small Models Can Also Think Step-by-Step	ACL 2023	Link	N/A
2023	Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks	NeurIPS 2023	Link	Link
2023	Personalised Distillation: Empowering Open-Sourced LLMs with Adaptive Learning for Code Generation	EMNLP 2023	Link	Link
2023	Lion: Adversarial Distillation of Closed-Source Large Language Model	EMNLP 2023	Link	Link
2023	InheritSumm: A General, Versatile and Compact Summarizer by Distilling from GPT	EMNLP 2023	Link	N/A
2023	Aligning Large Language Models through Synthetic Feedback	EMNLP 2023	Link	Link
2023	MCC-KD: Multi-CoT Consistent Knowledge Distillation	EMNLP 2023 Findings	Link	N/A
2023	Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization	ICLR 2023	Link	Link
2023	Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data	EMNLP 2023	Link	Link
2022	TinyViT: Fast Pretraining Distillation for Small Vision Transformers	ECCV 2022	Link	Link
2022	DIST: Distilling Large Language Models with Small-Scale Data	NeurIPS 2022	Link	Link
2022	Decoupled Knowledge Distillation	CVPR 2022	Link	Link

Low-rank Decomposition

Year	Title	Venue	Paper	Code
2025	Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models	ICML 2025	Link	Link
2025	Dobi-SVD: Differentiable SVD for LLM Compression and Some New Perspectives	ICLR 2025	Link	Link
2025	SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs	ICLR 2025	Link	Link
2025	Dynamic Low-Rank Sparse Adaptation for Large Language Models	ICLR 2025	Link	Link
2025	MoE-SVD: Structured Mixture-of-Experts LLMs Compression via Singular Value Decomposition	ICML 2025	Link	N/A
2025	Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition	ACL 2025 Findings	Link	N/A
2025	Delta Decompression for MoE-based LLMs Compression	PMLR 2025	Link	N/A
2025	LittleBit: Ultra Low-Bit Quantization via Latent Factorization	NeurIPS 2025	Link	N/A
2024	Compressing Large Language Models using Low Rank and Low Precision Decomposition	NeurIPS 2024	Link	Link
2024	Unified Low-rank Compression Framework for Click-through Rate Prediction	KDD 2024	Link	Link
2024	SliceGPT: Compress Large Language Models by Deleting Rows and Columns	ICLR 2024	Link	Link
2024	Low-Rank Knowledge Decomposition for Medical Foundation Models	CVPR 2024	Link	Link
2024	LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking	CVPR 2024	Link	Link
2024	Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization	ISCA 2024	Link	N/A
2024	LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning	ICLR 2024	Link	N/A
2024	LQER: Low-Rank Quantization Error Reconstruction for LLMs	ICML 2024	Link	N/A
2024	Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations	ACL 2024 Findings	Link	N/A
2024	Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization	ACL 2024 Findings	Link	N/A
2024	Surgical Feature-Space Decomposition of LLMs: Why, When and How?	ACL 2024	Link	N/A
2024	DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models	EMNLP 2024	Link	N/A
2023	LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation	ICML 2023	Link	Link
2022	Compressible-composable NeRF via Rank-residual Decomposition	NeurIPS 2022	Link	Link

Efficient Architecture

State Space Models and Linear Attention

Year	Title	Venue	Paper	Code
2023	Hungry Hungry Hippos: Towards Language Modeling with State Space Models	ICLR 2023	Link	Link
2023	Hyena Hierarchy: Towards Larger Convolutional Language Models	ICML 2023	Link	Link
2023	RWKV: Reinventing RNNs for the Transformer Era	EMNLP 2023 Findings	Link	Link
2023	Mamba: Linear-Time Sequence Modeling with Selective State Spaces	arXiv 2023	Link	Link
2023	RetNet: Retentive Network: A Successor to Transformer for Large Language Models	arXiv 2023	Link	Link
2023	Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture	NeurIPS 2023	Link	Link
2024	Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2)	ICML 2024	Link	Link
2024	Gated Linear Attention Transformers with Hardware-Efficient Training	ICML 2024	Link	Link
2024	Based: Simple Linear Attention Language Models Balance the Recall-Throughput Tradeoff	ICML 2024	Link	Link
2024	Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models	arXiv 2024	Link	N/A
2024	Jamba: A Hybrid Transformer-Mamba Language Model	arXiv 2024	Link	N/A

Efficient Attention Mechanisms

Year	Title	Venue	Paper	Code
2023	GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints	EMNLP 2023	Link	N/A
2023	HyperAttention: Long-context Attention in Near-Linear Time	NeurIPS 2023	Link	N/A
2024	FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning	ICLR 2024	Link	Link
2024	Ring Attention with Blockwise Transformers for Near-Infinite Context	ICLR 2024	Link	Link
2024	FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision	NeurIPS 2024	Link	Link
2024	ThunderKittens: Simple, Fast, and Adorable AI Kernels	NeurIPS 2024	Link	Link

Mixture of Experts Efficiency

Year	Title	Venue	Paper	Code
2024	DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models	ACL 2024	Link	Link
2024	Mixtral of Experts	arXiv 2024	Link	Link
2025	Ada-K Routing: Boosting the Efficiency of MoE-based LLMs	ICLR 2025	Link	N/A

Other Efficient Architectures

Year	Title	Venue	Paper	Code
2024	You Only Cache Once: Decoder-Decoder Architectures for Language Models	arXiv 2024	Link	Link
2024	Scalable MatMul-free Language Modeling	arXiv 2024	Link	Link

Speculative Decoding

Year	Title	Venue	Paper	Code
2025	HASS: Learning Harmonized Representations for Speculative Sampling	ICLR 2025	Link	Link
2025	PEARL: Parallel Speculative Decoding with Adaptive Draft Length	ICLR 2025	Link	Link
2025	SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration	ICLR 2025	Link	Link
2025	Pre-Training Curriculum for Multi-Token Prediction in Language Models	ACL 2025	Link	Link
2025	Faster Speculative Decoding via Effective Draft Decoder with Pruned Candidate Tree	ACL 2025	Link	Link
2025	SAM Decoding: Speculative Decoding via Suffix Automaton	ACL 2025	Link	Link
2025	DReSD: Dense Retrieval for Speculative Decoding	ACL 2025 Findings	Link	N/A
2025	EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models	NAACL 2025	Link	N/A
2025	Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding	NAACL 2025 Findings	Link	N/A
2025	SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths	COLM 2025	Link	Link
2024	Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting	NeurIPS 2024	Link	Link
2024	EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees	EMNLP 2024	Link	Link
2024	Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads	ICML 2024	Link	Link
2024	EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty	ICML 2024	Link	Link
2024	Online Speculative Decoding	ICML 2024	Link	N/A
2024	SpecExec: Massively Parallel Speculative Decoding For Interactive LLM Inference on Consumer Devices	NeurIPS 2024	Link	Link
2024	Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding	NeurIPS 2024	Link	Link
2024	Cascade Speculative Drafting for Even Faster LLM Inference	NeurIPS 2024	Link	Link
2024	Accelerating Blockwise Parallel Language Models with Draft Refinement	NeurIPS 2024	Link	N/A
2024	Graph-Structured Speculative Decoding	ACL 2024 Findings	Link	N/A
2024	Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism	ACL 2024 Findings	Link	N/A
2024	SLiM: Speculative Decoding with Hypothesis Reduction	NAACL 2024 Findings	Link	N/A
2024	SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification	ASPLOS 2024	Link	N/A
2024	REST: Retrieval-Based Speculative Decoding	NAACL 2024	Link	Link
2024	Lookahead Decoding: Break the Sequential Dependency of LLM Inference Using Lookahead Decoding	ICML 2024	Link	Link
2024	LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding	ACL 2024	Link	Link
2024	Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding	EMNLP 2024	Link	Link
2024	CLLMs: Consistency Large Language Models	ICLR 2024	Link	Link
2023	Fast Inference from Transformers via Speculative Decoding	ICML 2023	Link	N/A
2023	SpecTr: Fast Speculative Decoding via Optimal Transport	NeurIPS 2023	Link	N/A
2023	Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation	EMNLP 2023 Findings	Link	Link
2023	Speculative Decoding with Big Little Decoder	NeurIPS 2023	Link	Link

KV Cache Optimization

Year	Title	Venue	Paper	Code	Category
2025	InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation		Link	Link	Token Eviction
2025	R-KV: Redundancy-aware KV Cache Compression for Reasoning Models		Link	Link	Token Eviction
2025	SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator	ICML 2025	Link	Link	Token Eviction
2025	RazorAttention: Efficient KV Cache Compression Through Retrieval Heads	ICLR 2025	Link	N/A	Token Eviction
2025	Squeezed Attention: Accelerating Long Context Length LLM Inference	ACL 2025	Link	Link	Token Eviction
2025	LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models	ICML 2025	Link	Link	Budget Allocation
2025	CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences	ICLR 2025	Link	Link	Budget Allocation
2025	AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning	ICCV 2025	Link	Link	Cache Merging
2025	MiniKV: Pushing the Limits of 2-Bit KV Cache via Compression and System Co-Design for Efficient Long Context Inference	ACL 2025 Findings	Link	N/A	Quantization
2025	Palu: KV-Cache Compression with Low-Rank Projection	ICLR 2025	Link	Link	Low Rank Projection
2025	LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy	ICLR 2025	Link	N/A	Low Rank Projection
2025	Preserving Large Activations: The Key to KV Cache Pruning	ICLR 2025	Link	N/A	Token Eviction
2025	VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration	ICLR 2025	Link	N/A	Budget Allocation
2025	Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning	ICLR 2025	Link	Link	Token Eviction
2025	TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization	ACL 2025 Findings	Link	N/A	System/Offloading
2025	KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation	ACL 2025 Findings	Link	Link	System/Offloading
2025	KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding	ACL 2025	Link	Link	Low Rank Projection
2025	DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs	EMNLP 2025 Findings	Link	N/A	Budget Allocation
2024	SnapKV: LLM Knows What You are Looking for Before Generation	NeurIPS 2024	Link	Link	Token Eviction
2024	InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory	NeurIPS 2024	Link	Link	Token Eviction
2024	Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs	ICLR 2024	Link	Link	Token Eviction
2024	Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference	MLSys 2024	Link	Link	Token Eviction
2024	Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference	ICML 2024	Link	Link	Token Eviction
2024	On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference		Link	Link	Token Eviction
2024	PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling		Link	Link	Budget Allocation
2024	MiniCache: KV Cache Compression in Depth Dimension for Large Language Models	NeurIPS 2024	Link	Link	Cache Merging
2024	CaM: Cache Merging for Memory-efficient LLMs Inference	ICML 2024	Link	Link	Cache Merging
2024	Compressed Context Memory For Online Language Model Interaction	ICLR 2024	Link	Link	Cache Merging
2024	Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference	ICML 2024	Link	Link	Cache Merging
2024	LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference	EMNLP 2024 Findings	Link	Link	Cache Merging
2024	CHAI: Clustered Head Attention for Efficient LLM Inference	ICML 2024	Link	Link	Cache Merging
2024	D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models	ICLR 2025	Link	Link	Cache Merging
2024	IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact	ACL 2024	Link	Link	Quantization
2024	KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache	ICML 2024	Link	Link	Quantization
2024	KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization	NeurIPS 2024	Link	Link	Quantization
2024	SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models	COLM 2024	Link	Link	Quantization
2024	GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM	NeurIPS 2024	Link	Link	Quantization
2024	NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time	ACL 2024	Link	Link	Token Eviction
2024	SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation	ACL 2025	Link	Link	Token Eviction
2024	AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations	ACL 2025	Link	N/A	Quantization
2024	Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference	ICML 2024	Link	Link	Cache Merging
2024	Layer-Condensed KV Cache for Efficient Inference of Large Language Models	ACL 2024	Link	Link	Cache Sharing
2024	FINCH: Prompt-guided Key-Value Cache Compression for Large Language Models	TACL 2024 / EMNLP 2024	Link	N/A	Token Eviction
2024	KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches	EMNLP 2024 Findings	Link	Link	Benchmark
2024	Eigen Attention: Attention in Low-Rank Space for KV Cache Compression	EMNLP 2024 Findings	Link	Link	Low Rank Projection
2024	A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression	EMNLP 2024	Link	Link	Token Eviction
2023	H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models	NeurIPS 2023	Link	Link	Token Eviction
2023	Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time	NeurIPS 2023	Link	Link	Token Eviction
2023	Efficient Streaming Language Models with Attention Sinks	ICLR 2024	Link	Link	Token Eviction

Prompt / Context Compression

Year	Title	Venue	Paper	Code
2023	LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models	EMNLP 2023	Link	Link
2023	Selective Context: Compressing Context to Enhance Inference Efficiency of Large Language Models	EMNLP 2023	Link	Link
2023	Learning to Compress Prompts with Gist Tokens	NeurIPS 2023	Link	Link
2023	Adapting Language Models to Compress Contexts (AutoCompressors)	EMNLP 2023	Link	Link
2023	Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers	NeurIPS 2023	Link	N/A
2024	In-context Autoencoder for Context Compression in a Large Language Model (ICAE)	ICLR 2024	Link	Link
2024	RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation	ICLR 2024	Link	N/A
2024	LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression	ACL 2024	Link	Link
2024	LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression	ACL 2024 Findings	Link	Link
2024	Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon	ICML 2024	Link	Link
2024	xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token	NeurIPS 2024	Link	N/A
2025	500xCompressor: Generalized Prompt Compression for Large Language Models	AAAI 2025	Link	N/A
2024	Hierarchical and Dynamic Prompt Compression for Efficient Zero-shot API Usage	EACL 2024 Findings	Link	N/A
2024	Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles	EMNLP 2024 Findings	Link	N/A

Early Exit

Year	Title	Venue	Paper	Code
2022	CALM: Confident Adaptive Language Modeling	NeurIPS 2022	Link	N/A
2023	FREE: Fast and Robust Early Exiting Framework for Autoregressive Language Models	EMNLP 2023	Link	N/A
2023	SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference	arXiv 2023	Link	N/A
2023	Speculative Decoding with Big Little Decoder (BiLD)	NeurIPS 2023	Link	Link
2024	ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference	AAAI 2024	Link	N/A
2024	EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism	ICML 2024	Link	Link
2024	LayerSkip: Enabling Early-Exit Inference and Self-Speculative Decoding	ACL 2024	Link	Link
2024	Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting	NeurIPS 2024	Link	Link
2024	Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding	ACL 2024	Link	N/A
2024	Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning	ICLR 2024	Link	N/A

Adaptive Computation

Year	Title	Venue	Paper	Code
2023	AdaTape: Foundation Model with Adaptive Computation via Elastic Input Sequence	ICML 2023	Link	N/A
2023	CoLT5: Faster Long-Range Transformers with Conditional Computation	EMNLP 2023	Link	N/A
2024	Mixture of Depths: Dynamically Allocating Compute in Transformer-Based Language Models	ICML 2024	Link	N/A
2024	Think before you speak: Training Language Models With Pause Tokens	ICLR 2024	Link	N/A
2024	MatFormer: Nested Transformer for Elastic Inference	ICLR 2024	Link	Link
2024	FLEXTRON: Many-in-One Flexible Large Language Model	ICML 2024	Link	N/A
2024	LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference	arXiv 2024	Link	N/A
2024	PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU	MLSys 2024	Link	Link
2024	D-LLM: A Token Adaptive Computing Resource Allocation Strategy for Large Language Models	NeurIPS 2024	Link	Link
2024	Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models	NeurIPS 2024 Workshop	Link	N/A
2024	RouteLLM: Learning to Route LLMs with Preference Data	arXiv 2024	Link	Link

PEFT

Year	Title	Venue	Paper	Code
2023	AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning	ICLR 2023	Link	Link
2023	QLoRA: Efficient Finetuning of Quantized LLMs	NeurIPS 2023	Link	Link
2023	LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models	EMNLP 2023	Link	Link
2023	LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model	NeurIPS 2023	Link	Link
2023	DyLoRA: Parameter-Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation	EACL 2023	Link	N/A
2024	LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models	ICLR 2024	Link	Link
2024	VeRA: Vector-based Random Matrix Adaptation	ICLR 2024	Link	N/A
2024	LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models	ICLR 2024	Link	Link
2024	NOLA: Compressing LoRA using Linear Combination of Random Basis	ICLR 2024	Link	N/A
2024	DoRA: Weight-Decomposed Low-Rank Adaptation	ICML 2024	Link	Link
2024	LoRA+: Efficient Low-Rank Adaptation of Large Models	ICML 2024	Link	Link
2024	RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation	ICML 2024	Link	Link
2024	PiSSA: Principal Singular Values and Singular Vectors Adaptation of LLMs	NeurIPS 2024	Link	Link
2024	LoRAHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition	ACL 2024	Link	Link
2024	LoRA Learns Less and Forgets Less	ICML 2024	Link	N/A
2024	ReFT: Representation Finetuning for Language Models	NeurIPS 2024	Link	Link
2024	S²FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity	NeurIPS 2024	Link	N/A
2024	CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning	NeurIPS 2024	Link	N/A
2024	HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning	NeurIPS 2024	Link	Link
2024	Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models	ICML 2024	Link	N/A

Quantized Fine-tuning

Year	Title	Venue	Paper	Code
2023	QLoRA: Efficient Finetuning of Quantized LLMs	NeurIPS 2023	Link	Link
2023	PEQA: Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization	NeurIPS 2023	Link	N/A
2024	QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models	ICLR 2024	Link	Link
2024	LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models	ICLR 2024	Link	Link
2024	LQ-LoRA: Low-Rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning	ICLR 2024	Link	Link
2024	IR-QLoRA: Accurate LoRA-Finetuning Quantization of LLMs via Information Retention	ICML 2024	Link	Link
2024	BitDelta: Your Fine-Tune May Only Be Worth One Bit	ICML 2024	Link	Link
2024	EfficientQAT: Efficient Quantization-Aware Training for Large Language Models	ACL 2025	Link	Link
2024	AQLM: Extreme Compression of Large Language Models via Additive Quantization	ICML 2024	Link	Link
2024	The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (BitNet b1.58)	arXiv 2024	Link	N/A

Low-rank Gradient Training

Year	Title	Venue	Paper	Code
2023	Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions	NeurIPS 2023	Link	N/A
2024	GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection	ICML 2024	Link	Link
2024	Flora: Low-Rank Adapters Are Secretly Gradient Compressors	ICML 2024	Link	Link
2024	ReLoRA: High-Rank Training Through Low-Rank Updates	ICLR 2024	Link	Link
2024	Full Parameter Fine-Tuning for Large Language Models with Limited Resources (LOMO)	ACL 2024	Link	Link
2024	AdaLomo: Low-memory Optimization with Adaptive Learning Rate	ACL 2024 Findings	Link	Link
2024	SLTrain: A Sparse Plus Low-Rank Approach for Parameter and Memory Efficient Pretraining	NeurIPS 2024	Link	N/A
2024	Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients	arXiv 2024	Link	N/A
2025	Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?	ICML 2025	Link	N/A
2024	Memory-Efficient LLM Training with Online Subspace Descent	NeurIPS 2024	Link	N/A

Memory-efficient Training

Year	Title	Venue	Paper	Code
2023	Reducing Activation Recomputation in Large Transformer Models	MLSys 2023	Link	Link
2023	CAME: Confidence-guided Adaptive Memory Efficient Optimization	ACL 2023	Link	N/A
2023	Training Transformers with 4-bit Integers	NeurIPS 2023	Link	Link
2023	MeZO: Fine-Tuning Language Models with Just Forward Passes	NeurIPS 2023	Link	Link
2024	Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training	ICLR 2024	Link	Link
2024	ZeRO++: Extremely Efficient Collective Communication for Giant Model Training	ICML 2024	Link	Link
2024	Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization	ICML 2024	Link	Link
2024	Adam-mini: Use Fewer Learning Rates, To Gain More	NeurIPS 2024	Link	Link
2024	LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning	NeurIPS 2024	Link	N/A
2024	VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections	NeurIPS 2024	Link	N/A
2024	ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models	ICML 2024	Link	N/A

Serving Systems

Year	Title	Venue	Paper	Code
2023	Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM)	SOSP 2023	Link	Link
2023	FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU	ICML 2023	Link	Link
2023	AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving	OSDI 2023	Link	N/A
2024	SGLang: Efficient Execution of Structured Language Model Programs	NeurIPS 2024	Link	Link
2024	DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving	OSDI 2024	Link	N/A
2024	Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve	OSDI 2024	Link	N/A
2024	S-LoRA: Serving Thousands of Concurrent LoRA Adapters	MLSys 2024	Link	Link
2024	SpecInfer: Accelerating Generative LLM Serving with Tree-based Speculative Inference and Verification	ASPLOS 2024	Link	Link
2024	SpotServe: Serving Generative Large Language Models on Preemptible Instances	ASPLOS 2024	Link	N/A
2024	Splitwise: Efficient Generative LLM Inference Using Phase Splitting	ISCA 2024	Link	N/A
2024	MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving	arXiv 2024	Link	Link
2024	ServerlessLLM: Low-Latency Serverless Inference for Large Language Models	OSDI 2024	Link	Link

Batching and Scheduling

Year	Title	Venue	Paper	Code
2023	Efficiently Scaling Transformer Inference	MLSys 2023	Link	N/A
2024	Llumnix: Dynamic Scheduling for Large Language Model Serving	OSDI 2024	Link	Link
2024	Sarathi-Serve: Taming Throughput-Latency Tradeoff in LLM Inference with Chunked Prefills	OSDI 2024	Link	N/A
2024	DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving	OSDI 2024	Link	N/A
2024	Splitwise: Efficient Generative LLM Inference Using Phase Splitting	ISCA 2024	Link	N/A
2024	Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction	arXiv 2024	Link	N/A
2024	Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving	arXiv 2024	Link	N/A
2024	Efficient LLM Scheduling by Learning to Rank	arXiv 2024	Link	N/A

Kernel Optimization

Year	Title	Venue	Paper	Code
2023	FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks	ASPLOS 2023	Link	N/A
2023	HyperAttention: Long-context Attention in Near-Linear Time	NeurIPS 2023	Link	N/A
2024	FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning	ICLR 2024	Link	Link
2024	FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision	NeurIPS 2024	Link	Link
2024	ThunderKittens: Simple, Fast, and Adorable AI Kernels	NeurIPS 2024	Link	Link
2024	Ring Attention with Blockwise Transformers for Near-Infinite Context	ICLR 2024	Link	Link
2025	FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving	MLSys 2025	Link	Link
2024	NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention	NeurIPS 2024	Link	N/A

Compiler Optimization

Year	Title	Venue	Paper	Code
2023	Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs	ASPLOS 2023	Link	Link
2023	TensorIR: An Abstraction for Automatic Tensorized Program Optimization	ASPLOS 2023	Link	Link
2023	Welder: Scheduling Deep Learning Memory Access via Tile-level Fusion	OSDI 2023	Link	N/A
2023	OLLA: Optimizing the Lifetime and Location of Arrays to Reduce Memory Usage of Neural Networks	MLSys 2023	Link	N/A
2024	Ladder: Enabling Efficient Low-Bit Quantization and Inference with Compiler Co-Design	OSDI 2024	Link	N/A
2024	An LLM Compiler for Parallel Function Calling	ICML 2024	Link	N/A

Hardware-aware Deployment

Year	Title	Venue	Paper	Code
2023	Efficiently Scaling Transformer Inference	MLSys 2023	Link	N/A
2024	AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration	MLSys 2024	Link	Link
2024	Atom: Low-bit Quantization for Efficient and Accurate LLM Serving	MLSys 2024	Link	Link
2024	QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving	MLSys 2025	Link	Link
2024	MLC-LLM: Universal LLM Deployment on Consumer Devices with ML Compilation	MLSys 2024	Link	Link
2024	HexGen: Generative Inference of Large Language Model over Heterogeneous Environment	ICML 2024	Link	Link

Long-context Evaluation

Year	Title	Venue	Paper	Code
2023	ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding	EMNLP 2023	Link	Link
2024	Lost in the Middle: How Language Models Use Long Contexts	TACL 2024	Link	N/A
2024	LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding	ACL 2024	Link	Link
2024	L-Eval: Instituting Standardized Evaluation for Long Context Language Models	ACL 2024	Link	Link
2024	RULER: What's the Real Context Size of Your Long-Context Language Models?	NAACL 2024	Link	Link
2024	InfiniteBench: Extending Long Context Evaluation Beyond 100K Tokens	ACL 2024	Link	Link
2024	Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of LLMs	ACL 2024	Link	N/A
2024	BABILong: Testing the Limits of LLMs with Long Context Reasoning Benchmarks	NeurIPS 2024	Link	N/A
2024	M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models	ACL 2024	Link	N/A
2024	Ada-LEval: Evaluating Long-context LLMs with Length-adaptable Benchmarks	NAACL 2024	Link	N/A

Reasoning Robustness

Year	Title	Venue	Paper	Code
2023	Compressing LLMs: The Truth is Rarely Pure and Never Simple	ICLR 2024	Link	Link
2024	ShortGPT: Layers in Large Language Models are More Redundant Than You Expect	ACL 2025 Findings	Link	N/A
2024	The Unreasonable Ineffectiveness of the Deeper Layers	ICLR 2025	Link	N/A
2024	LASER: Layer-Selective Rank Reduction for Improving Reasoning	ICML 2024	Link	N/A
2025	Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression	ICML 2025	Link	Link
2025	Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs	EMNLP 2025 Findings	Link	N/A
2025	Can Pruning Improve Reasoning? Revisiting Long-CoT Compression with Capability in Mind for Better Reasoning	arXiv 2025	Link	N/A

Safety under Compression

Year	Title	Venue	Paper	Code
2023	Safety Alignment Should Be Made More Manageable	NeurIPS 2023	Link	N/A
2024	Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To	ICLR 2024	Link	N/A
2024	Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation	ICLR 2024	Link	N/A
2024	Playing It Safe: Defending Against Backdoors with Activation Clustering in Quantized LLMs	AAAI 2024	Link	N/A
2024	Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning	NeurIPS 2024	Link	N/A
2024	Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models	NeurIPS 2024	Link	N/A

Multimodal LLMs

Year	Title	Venue	Paper	Code
2024	Honeybee: Locality-enhanced Projector for Multimodal LLM	CVPR 2024	Link	Link
2024	MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices	CVPR 2024	Link	Link
2024	TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones	EMNLP 2024	Link	Link
2024	LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model	AAAI 2024	Link	N/A
2024	FastV: An Image is Worth 1/2 Tokens After Layer 2	ECCV 2024	Link	Link
2024	Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models	NeurIPS 2024	Link	Link
2024	TokenPacker: Efficient Visual Projector for Multimodal LLM	NeurIPS 2024	Link	N/A
2024	Matryoshka Multimodal Models	NeurIPS 2024	Link	N/A
2025	LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation	ICLR 2025	Link	Link
2025	VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow Guidance	ICCV 2025	Link	N/A
2024	MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer	CVPR 2024	Link	Link

Edge Deployment

Year	Title	Venue	Paper	Code
2024	MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases	ICML 2024	Link	N/A
2024	LLM in a flash: Efficient Large Language Model Inference with Limited Memory	ICML 2024	Link	N/A
2024	PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU	MLSys 2024	Link	Link
2024	EdgeMoE: Fast On-Device Inference of Mixture-of-Experts Based Large Language Models	MLSys 2024	Link	N/A
2024	LLMCad: Fast and Scalable On-device Large Language Model Inference	MLSys 2024	Link	N/A
2024	Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs	ICML 2024	Link	N/A
2024	MLC-LLM: Universal LLM Deployment on Consumer Devices with ML Compilation	MLSys 2024	Link	Link
2024	MobileQuant: Mobile-friendly Quantization for On-device Language Models	EMNLP 2024 Findings	Link	N/A
2024	GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment	ACL 2024 Findings	Link	N/A
2024	PowerInfer-2: Fast Large Language Model Inference on a Smartphone	arXiv 2024	Link	Link

Contributing

We welcome contributions from the community! If you find any missing papers or errors, please feel free to:

Open an Issue to report errors or suggest papers
Submit a Pull Request to add new papers
Star this repository if you find it helpful

Year	Title	Venue	Paper	Code
2023	Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning	ACL 2024 Findings	Link	Link
2024	SliceGPT: Compress Large Language Models by Deleting Rows and Columns	ICLR 2024	Link	Link
2024	APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference	ICML 2024	Link	Link
2024	Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations	ACL 2024 Findings	Link	Link
2024	LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models	ICML 2024	Link	Link
2024	Pruning as a Domain-specific LLM Extractor	NAACL 2024 Findings	Link	Link
2024	Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient	ACL 2025	Link	Link
2025	One-for-All Pruning: A Universal Model for Customized Compression of Large Language Models	ACL 2025 Findings	Link	N/A
2024	RankAdaptor: Hierarchical Rank Allocation for Efficient Fine-Tuning Pruned LLMs via Performance Model	NAACL 2024 Findings	Link	N/A
2024	Finding Transformer Circuits with Edge Pruning	NeurIPS 2024	Link	Link
2024	MoDeGPT: Modular Decomposition for Large Language Model Compression	ICLR 2025	Link	Link
2024	The Unreasonable Ineffectiveness of the Deeper Layers	ICLR 2025	Link	N/A
2024	PAT: Pruning-Aware Tuning for Large Language Models	AAAI 2025	Link	Link
2024	Change Is the Only Constant: Dynamic LLM Slicing based on Layer Redundancy	EMNLP 2024 Findings	Link	Link
2024	LEMON: Reviving Stronger and Smaller LMs from Larger LMs with Linear Parameter Fusion	ACL 2024	Link	N/A
2024	DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization	ACL 2025	Link	Link
2025	You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning	ICLR 2025	Link	Link
2025	LLaMaFlex: Many-in-one LLMs via Generalized Pruning and Weight Sharing	ICLR 2025	Link	N/A
2025	Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing	ICLR 2025	Link	Link
2025	Instruction-Following Pruning for Large Language Models	ICML 2025	Link	N/A
2025	Let LLM Tell What to Prune and How Much to Prune	ICML 2025	Link	Link
2025	Prompt-based Depth Pruning of Large Language Models	ICML 2025	Link	Link
2025	IG-Pruning: Input-Guided Block Pruning for Large Language Models	EMNLP 2025	Link	Link
2025	PIP: Perturbation-based Iterative Pruning for Large Language Models	EMNLP 2025 Findings	Link	N/A
2025	ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization	NeurIPS 2025	Link	Link
2025	Restoring Pruned Large Language Models via Lost Component Compensation	NeurIPS 2025	Link	Link
