Research & Survey

Contents

Large Language Model: Landscape

  1. The best NLP papers from 2015 to now
  2. In 2023: As abilities emerge only at scale, we must unlearn outdated intuitions, scale Transformers via massive distributed matrix multiplications, and discover the inductive bias needed to push ~10,000× beyond GPT-4. 🗣️ / 📺 / ✍️ [6 Oct 2023]

Large Language Model Comparison

The Big LLM Architecture Comparison (in 2025)

  • The Big LLM Architecture Comparison✍️:💡 [19 Jul 2025]

  • LLM Architecture Gallery✍️: Visual guide to modern LLM architectures and design tradeoffs. [26 Mar 2026]

    | Model | Parameters | Attention Type | MoE | Norm | Positional Encoding | Notable Features |
    | --- | --- | --- | --- | --- | --- | --- |
    | DeepSeek V3 / R1 | 671B | Multi-Head Latent Attention (MLA) | Yes, 256 experts (37B active) | Pre-normalization | RoPE | KV compression via MLA, shared expert, high inference efficiency |
    | OLMo 2 | 32B | Multi-Head Attention (MHA) | No | Post-normalization + QK norm (RMSNorm) | RoPE | RMSNorm scaling after attention & FF, training stability |
    | Gemma 3 / 3n | 27B / 4B | Sliding Window + Grouped-Query Attention | No | Pre + Post RMSNorm | RoPE | Sliding window attention, Gemma 3n: Per-Layer Embedding (PLE), MatFormer slices |
    | Mistral Small 3.1 | 24B | Grouped-Query Attention | No | Pre-normalization | RoPE | Optimized for low latency, simpler than Gemma 3 |
    | Llama 4 Maverick | 400B | Grouped-Query Attention | Yes, fewer & larger experts | Pre-normalization | RoPE | Alternating MoE & dense layers, 17B active parameters |
    | Qwen3 (Dense) | 0.6–32B | Grouped-Query Attention | No | Pre-normalization | RoPE | Deep architecture, small memory footprint |
    | Qwen3 (MoE) | 30B–235B | Grouped-Query Attention | Yes, no shared expert | Pre-normalization | RoPE | Sparse MoE, optimized for large-scale inference |
    | SmolLM3 | 3B | Grouped-Query Attention | No | Pre-normalization | NoPE (No Positional Embedding) | Good small-scale performance, improved length generalization |
    | Kimi K2 | 1T | MLA | Yes, more experts than DeepSeek | Pre-normalization | RoPE | Muon optimizer, very high modeling performance, open-weight |
    | gpt-oss | 20B / 120B | Grouped-Query + Sliding Window | Yes, few large experts | Pre-normalization | RoPE | Wider architecture, attention sinks, bias units |
    | Grok 2.5 | 70B | Grouped-Query Attention | Yes | Pre-normalization | RoPE | Standard large-scale architecture |
    | GLM-4.5 | 130B | Grouped-Query Attention | Yes | Pre-normalization | RoPE | Standard architecture with high performance |
    | Qwen3-Next | - | Grouped-Query Attention | Yes | Pre-normalization | RoPE | Expert size & number tuned, Gated DeltaNet + Gated Attention Hybrid, Multi-Token Prediction |
  • Beyond Standard LLMs✍️:💡Linear Attention Hybrids, Text Diffusion, Code World Models, and Small Recursive Transformers [04 Nov 2025]

    | Architecture Type | Key Models | Attention Mechanism | Main Advantage | Main Limitation | Use Case |
    | --- | --- | --- | --- | --- | --- |
    | Standard Transformer | GPT-5, DeepSeek V3/R1, Llama 4, Qwen3, Gemini 2.5, MiniMax-M2 | Quadratic O(n²) scaled-dot-product | Proven, SOTA performance, mature tooling | Expensive training & inference, quadratic complexity | General-purpose LLM tasks |
    | Linear Attention Hybrids | Qwen3-Next, Kimi Linear, MiniMax-M1, DeepSeek V3.2 | Gated DeltaNet + Full Attention (3:1 ratio) | 75% KV cache reduction, 6× decoding throughput, linear O(n) | Trades accuracy for efficiency, added complexity | Long-context tasks, resource-constrained environments |
    | Text Diffusion | LLaDA, Gemini Diffusion | Bidirectional (no causal mask) | Parallel token generation, faster responses | Can't stream, tricky tool-calling, quality degradation with fewer steps | Fast inference, on-device LLMs |
    | Code World Models | CWM (32B) | Standard sliding-window attention | Simulates code execution, improves reasoning | Limited to code domain, added latency from execution traces | Code generation, debugging, test-time scaling |
    | Small Recursive Transformers | TRM (7M), HRM (28M) | Standard attention with recursive refinement | Very small (7M params), strong puzzle solving, <$500 training cost | Special-purpose, limited to structured tasks (Sudoku, ARC, Maze) | Domain-specific reasoning, tool-calling modules |
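
The "75% KV cache reduction" in the Linear Attention Hybrids row follows from the 3:1 layer ratio: if three of every four layers use a linear-attention block with a fixed-size state, only one layer in four still stores a per-token KV cache. A back-of-the-envelope sketch in Python; the layer count, head sizes, and context length are illustrative assumptions (not taken from any listed model), and the small constant-size state of the linear layers is ignored.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of a standard KV cache: keys and values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def hybrid_kv_cache_bytes(n_layers, linear_per_full, **shape):
    """KV cache when only 1 of every (linear_per_full + 1) layers uses full attention."""
    full_layers = n_layers // (linear_per_full + 1)
    return kv_cache_bytes(full_layers, **shape)


# Hypothetical model shape, chosen only for illustration.
shape = dict(n_kv_heads=8, head_dim=128, seq_len=32_768)
full = kv_cache_bytes(48, **shape)
hybrid = hybrid_kv_cache_bytes(48, linear_per_full=3, **shape)
reduction = 1 - hybrid / full

print(f"full: {full / 2**30:.2f} GiB, hybrid: {hybrid / 2**30:.2f} GiB, "
      f"reduction: {reduction:.0%}")
```

The reduction depends only on the layer ratio, so the same 75% falls out for any sequence length or head configuration under these assumptions.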

GPT-2 vs gpt-oss

| Feature | GPT-2 | gpt-oss |
| --- | --- | --- |
| Release & Size | 2019, up to 1.5B params | 2025, 20B & 120B params (MoE) |
| Architecture | Dense transformer decoder | Mixture-of-Experts (MoE) decoder |
| Activation & Dropout | GELU activation, uses dropout | Swish-based SwiGLU, no dropout |
| Parameter Efficiency | All params active per token | Sparse activation of experts |
| Deployment & License | MIT license | Open-weight local runs, Apache 2.0 |
| Reasoning & Tools | Basic generation | Built-in chain-of-thought & tool use |
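
The "Parameter Efficiency" row contrasts dense execution with sparse expert activation. A minimal top-k routing sketch in plain Python may make the difference concrete. This is not gpt-oss's actual implementation: the toy experts, router scores, and `top_k=2` are hypothetical, and real models use learned router weights and batched tensor operations.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, experts, top_k=2):
    """Run only the top_k highest-scoring experts and mix their outputs."""
    # Indices of the top_k experts by router score.
    chosen = sorted(range(len(experts)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Renormalize gate weights over the chosen experts only.
    gates = softmax([router_logits[i] for i in chosen])
    output = sum(g * experts[i](x) for g, i in zip(gates, chosen))
    return output, chosen


# Four toy "experts"; scalar functions stand in for feed-forward blocks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
y, used = moe_forward(3.0, router_logits=[0.1, 2.0, 2.0, -1.0], experts=experts)
print(used, y)
```

Only two of the four experts are evaluated for this input, which is the sense in which a MoE model has far fewer active parameters per token than total parameters.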

Evolutionary Tree of Large Language Models

A Taxonomy of Natural Language Processing

  • An overview of different fields of study and recent developments in NLP. 🗄️ / ✍️ [24 Sep 2023] Exploring the Landscape of Natural Language Processing Research ref📑 [20 Jul 2023]
  • NLP taxonomy

Distribution of the number of papers by most popular fields of study from 2002 to 2022

Large Language Model Collection

  • Ai2 (Allen Institute for AI)
    • Founded by Paul Allen, the co-founder of Microsoft, in 2014.
    • DR Tulu: 8B. Deep Research (DR) model trained for long-form DR tasks. [Nov 2025]
    • OLMo📑:💡Truly open language model and framework to build, study, and advance LMs, along with the training data, training and evaluation code, intermediate model checkpoints, and training logs. git [Feb 2024]
    • OLMo 2 [26 Nov 2024] github stars github stars
    • OLMo 3✍️: Fully open models including the entire flow. [20 Nov 2025]
    • OLMoE: fully-open LLM leverages sparse Mixture-of-Experts [Sep 2024]
    • TÜLU 3📑:💡Pushing Frontiers in Open Language Model Post-Training git / demo:✍️ [22 Nov 2024] github stars
  • Alibaba
  • Amazon
  • Anthropic
    • Claude 3✍️: Opus, the largest version of the new LLM, outperforms rivals GPT-4 and Google’s Gemini 1.0 Ultra. Three variants: Opus, Sonnet, and Haiku. [Mar 2024]
    • Claude 3.7 Sonnet and Claude Code✍️: the first hybrid reasoning model. ✍️ [25 Feb 2025]
    • Claude 4✍️: Claude Opus 4 (72.5% on SWE-bench), Claude Sonnet 4 (72.7% on SWE-bench). Extended Thinking Mode (Beta). Parallel Tool Use & Memory. Claude Code SDK. AI agents: code execution, MCP connector, Files API, and 1-hour prompt caching. [23 May 2025]
    • Claude 4.5✍️: Major upgrades in autonomous coding, tool use, context handling, memory, and long-horizon reasoning; supports over 30 hours of continuous operation. [30 Sep 2025]
    • Claude Opus 4.5✍️: SWE-bench Verified (80.9%). $5/$25 per million tokens [25 Nov 2025]
    • anthropic/cookbook
  • Apple
    • OpenELM: Apple released a Transformer-based language model. Four sizes of the model: 270M, 450M, 1.1B, and 3B parameters. [April 2024]
    • Apple Intelligence Foundation Language Models: 1. A 3B on-device model used for language tasks like summarization and Writing Tools. 2. A large Server model used for language tasks too complex to do on-device. [10 Jun 2024]
  • Baidu
  • Chatbot Arena🤗
  • Cohere
  • Databricks
    • DBRX: MoE, open, general-purpose LLM created by Databricks. [27 Mar 2024]
  • DeepSeek
    • Founded in 2023, DeepSeek is a Chinese company dedicated to AGI.
    • DeepSeek-V3: Mixture-of-Experts (MoE) with 671B. [26 Dec 2024]
    • DeepSeek-V3 Technical Report📑: 671B MoE model with MLA and auxiliary-loss-free load balancing. [Dec 2024]
    • DeepSeek-R1:💡an open source reasoning model. Group Relative Policy Optimization (GRPO). Base -> RL -> SFT -> RL -> SFT -> RL [20 Jan 2025] ref📑: A Review of DeepSeek Models' Key Innovative Techniques [14 Mar 2025]
    • Janus: Multimodal understanding and visual generation. [28 Jan 2025]
    • DeepSeek-V3🤗: 671B. Top-tier performance in coding and reasoning tasks [25 Mar 2025]
    • DeepSeek-Prover-V2: Mathematical reasoning [30 Apr 2025]
    • DeepSeek-v3.1🤗: Think/Non‑Think hybrid reasoning. 128K and MoE. Agent abilities. [19 Aug 2025]
    • DeepSeek-V3.2📑: DeepSeek Sparse Attention (DSA) cuts complexity from O(L²) to O(Lk). [12 Dec 2025]
    • DeepSeek-V3.2-Exp [Sep 2025] github stars
    • DeepSeek-OCR: Converts long text into an image, compresses it into visual tokens, and sends those to the LLM, cutting cost and expanding context capacity. [Oct 2025] github stars
    • DeepSeekMath-V2: a Self-Verifiable Mathematical Reasoning model [27 Nov 2025] github stars
    • mHC (Manifold-Constrained Hyper-Connections)📑 [31 Dec 2025] Controlled layer updates for stable deep models. next state = current state + constrained update
      (vs. residuals: F(x) + x -> Hyper-Connections: unconstrained -> mHC: constrained)
    • Engram (Conditional Memory Module) Adds a native memory lookup alongside neural computation, letting frequent patterns be retrieved in constant time. output = compute(x) + memory lookup(x)
      (vs. attention: recomputing patterns every time -> Engram)
    • A list of models: git
  • EleutherAI
    • Founded in July 2020. United States tech. GPT-Neo, GPT-J, GPT-NeoX, and The Pile dataset.
    • Pythia📑: How do large language models (LLMs) develop and evolve over the course of training and change as models scale? A suite of decoder-only autoregressive language models ranging from 70M to 12B parameters git [Apr 2023] github stars
  • Google
  • Groq
    • Founded in 2016. American company building low-latency AI inference hardware.
    • Llama-3-Groq-Tool-Use: a model optimized for function calling [Jul 2024]
  • Hugging Face
  • IBM
    • Granite Guardian: a collection of models designed to detect risks in prompts and responses [10 Dec 2024]
  • Jamba: AI21's SSM-Transformer Model. Mamba + Transformer + MoE [28 Mar 2024]
  • KoAlpaca: Alpaca for Korean [Mar 2023] github stars
  • Llama variants emerged in 2023
    • Falcon LLM Apache 2.0 license [Mar 2023]
    • Alpaca: Fine-tuned from the LLaMA 7B model [Mar 2023]
    • vicuna: 90% ChatGPT Quality [Mar 2023]
    • dolly: Databricks [Mar 2023]
    • Cerebras-GPT: 7 GPT models ranging from 111M to 13B parameters. [Mar 2023]
    • Koala: Focus on dialogue data gathered from the web. [Apr 2023]
    • StableVicuna First Open Source RLHF LLM Chatbot [Apr 2023]
    • Upstage's 70B Language Model Outperforms GPT-3.5: ✍️ [1 Aug 2023]
  • LLM Collection: promptingguide.ai
  • Meta
    • Most OSS LLMs have been built on Llama / ✍️ / git github stars github stars
    • Llama 2🤗: 1) 40% more data than Llama. 2) 7B, 13B, and 70B. 3) Trained on over 1 million human annotations. 4) Double the context length of Llama 1: 4K. 5) Grouped Query Attention, KV Cache, and Rotary Positional Embedding were introduced in Llama 2 [18 Jul 2023] demo🤗
    • Llama 3: 1) 7X more data than Llama 2. 2) 8B, 70B, and 400B. 3) 8K context length [18 Apr 2024]
    • MEGALODON: Long Sequence Model. Unlimited context length. Outperforms Llama 2 model. [Apr 2024] github stars
    • Llama 3.1: 405B, context length extended to 128K, support added across eight languages. The first OSS model to outperform GPT-4o. [23 Jul 2024]
    • Llama 3.2: Multimodal. Include text-only models (1B, 3B) and text-image models (11B, 90B), with quantized versions of 1B and 3B [Sep 2024]
    • NotebookLlama: An Open Source version of NotebookLM [28 Oct 2024]
    • Llama 3.3: a text-only 70B instruction-tuned model. Llama 3.3 70B approaches the performance of Llama 3.1 405B. [6 Dec 2024]
    • Llama 4: Mixture of Experts (MoE). Llama 4 Scout (17B active / 109B total, 10M context, single GPU), Llama 4 Maverick (17B active / 400B total, 1M context) git: Model Card [5 Apr 2025]
  • ModernBERT📑: ModernBERT can handle sequences up to 8,192 tokens and utilizes sparse attention mechanisms to efficiently manage longer context lengths. [18 Dec 2024]
  • Microsoft
    • MAI-1✍️: MAI-Voice-1, MAI-1-preview. Microsoft in-house models. [28 Aug 2025]
    • phi-series: cost-effective small language models (SLMs) ✍️ git: Cookbook
    • Phi-1📑: Despite being small in size, phi-1 attained 50.6% on HumanEval and 55.5% on MBPP. Textbooks Are All You Need. ✍️ [20 Jun 2023]
    • Phi-1.5📑: Textbooks Are All You Need II. Phi 1.5 is trained solely on synthetic data. Despite having a mere 1 billion parameters compared to Llama 7B's much larger model size, Phi 1.5 often performs better in benchmark tests. [11 Sep 2023]
    • phi-2: open source, and 50% better at mathematical reasoning. 🤗 [Dec 2023]
    • phi-3-vision (multimodal), phi-3-small, phi-3 (7b), Phi Silica (designed for NPUs in Copilot+ PCs)
    • Phi-3📑: Phi-3-mini, with 3.8 billion parameters, supports 4K and 128K context, instruction tuning, and hardware optimization. [22 Apr 2024] ✍️
    • phi-3.5-MoE-instruct: 🤗 [Aug 2024]
    • Phi-4📑: Specializing in Complex Reasoning ✍️ [12 Dec 2024]
    • Phi-4-multimodal / mini🤗 5.6B. Integrates speech, vision, and text processing into a single, unified architecture. [26 Feb 2025]
    • Phi-4-reasoning✍️: Phi-4-reasoning, Phi-4-reasoning-plus, Phi-4-mini-reasoning [30 Apr 2025]
    • Phi-4-mini-flash-reasoning✍️: 3.8B, 64K context, Single GPU, Decoder-Hybrid-Decoder architecture [9 Jul 2025]
  • MiniMaxAI
    • Founded in Dec 2021. Shanghai, China.
    • MiniMax-M2: Coding and Agent tasks, 230B (10B Active), MoE, a new high ahead of DeepSeek-V3.2 and Kimi K2 github stars
  • Mistral
    • Founded in April 2023. French tech.
    • Model overview ✍️
    • NeMo: 12B model with 128k context length that outperforms Llama 3 8B [18 Jul 2024]
    • Mistral OCR: Precise text recognition with up to 99% accuracy. Multimodal. Browser based [6 Mar 2025]
    • Mistral Large 3✍️: Flagship multimodal model for reasoning, coding, and enterprise assistants. [Mar 2025]
  • Moonshot AI
    • Moonshot AI is a Beijing-based Chinese AI company founded in March 2023
    • Kimi-K2: 1T parameter MoE model. MuonClip Optimizer. Agentic Intelligence. [11 Jul 2025]
    • Kimi K2 Thinking✍️: The first open-source model to beat GPT-5 on agent benchmarks. [7 Nov 2025]
    • Kimi-K2.5: Open-source multimodal agentic model by Moonshot AI. [Jan 2026] github stars
  • NVIDIA
    • Nemotron-4 340B: Synthetic Data Generation for Training Large Language Models [14 Jun 2024]
  • ollama: ollama-supported models
  • Open-Sora: Democratizing Efficient Video Production for All [Mar 2024] github stars
  • OpenAI
    • gpt-oss:💡gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI. [Jun 2025] github stars
  • Qualcomm
  • Tencent
    • Founded in 1998, Tencent is a Chinese company dedicated to various technology sectors, including social media, gaming, and AI development.
    • Hunyuan-Large: An open-source MoE model with open weights. [4 Nov 2024] git github stars
    • Hunyuan-T1: Reasoning model [21 Mar 2025]
    • A list of models: git
  • The LLM Index: A list of large language models (LLMs)
  • The mother of all spreadsheets for anyone into LLMs [17 Dec 2024]
  • The Open Source AI Definition [28 Oct 2024]
  • xAI
    • xAI is an American AI company founded by Elon Musk in March 2023
    • Grok: 314B parameter Mixture-of-Experts (MoE) model. Released under the Apache 2.0 license. Training code not included. Developed with JAX git [17 Mar 2024] github stars
    • Grok-2 and Grok-2 mini [13 Aug 2024]
    • Grok-2.5: Grok 2.5 Goes Open Source [24 Aug 2025]
    • Grok-3: 200,000 GPUs to train. Grok 3 beats GPT-4o on AIME, GPQA. Grok 3 Reasoning and Grok 3 mini Reasoning. [17 Feb 2025]
    • Grok-4: Humanity’s Last Exam, Grok 4 Heavy scored 44.4% [9 Jul 2025]
    • Grok 4.1✍️ [17 Nov 2025]
  • Xiaomi
    • Founded in 2010, Xiaomi is a Chinese company known for its innovative consumer electronics and smart home products.
    • Mimo: 7B. Advanced reasoning for code and math [30 Apr 2025]
  • Z.ai
    • Formerly Zhipu, a Beijing-based Chinese AI company founded in March 2019
    • GLM-4.5: An open-source large language model designed for intelligent agents
    • GLM-4.6✍️: Advanced Agentic, Reasoning and Coding Capabilities [30 Sep 2025]
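
The DeepSeek-R1 entry above names Group Relative Policy Optimization (GRPO). Its core idea is to score each sampled answer relative to the other answers drawn for the same prompt, rather than training a separate value model. The sketch below covers only that advantage computation; the reward values are made up, the population standard deviation is an assumption (implementations differ), and the clipped policy objective and KL penalty are omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four sampled answers to one prompt (hypothetical: 1 = correct, 0 = wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Answers that beat the group average get a positive advantage and are reinforced; below-average answers get a negative one.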

LLM for Domain Specific

  • AI for Scaling Legal Reform: Mapping and Redacting Racial Covenants in Santa Clara County📑: a fine-tuned open LLM to detect racial covenants in 24 million housing documents, cutting 86,500 hours of manual work. [12 Feb 2025]
  • AlphaChip: Reinforcement learning-based model for designing physical chip layouts. [26 Sep 2024]
  • AlphaFold3: Open source implementation of AlphaFold3 [Nov 2023] / OpenFold: PyTorch reproduction of AlphaFold 2 [Sep 2021] github stars github stars
  • AlphaGenome: DeepMind’s advanced AI model, launched in June 2025, is designed to analyze the regulatory “dark matter” of the genome—specifically, the 98% of DNA that does not code for proteins but instead regulates when and how genes are expressed. [June 2025]
  • BioGPT📑: Generative Pre-trained Transformer for Biomedical Text Generation and Mining git [19 Oct 2022] github stars
  • BloombergGPT📑: A Large Language Model for Finance [30 Mar 2023]
  • Chai-1: a multi-modal foundation model for molecular structure prediction [Sep 2024] github stars
  • Code Llama📑: Built on top of Llama 2, free for research and commercial use. ✍️ / git [24 Aug 2023] github stars
  • DeepSeek-Coder-V2: Open-source Mixture-of-Experts (MoE) code language model [17 Jun 2024] github stars
  • Devin AI: Devin is an AI software engineer developed by Cognition AI [12 Mar 2024]
  • EarthGPT📑: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain [30 Jan 2024]
  • ESM3: A frontier language model for biology: Simulating 500 million years of evolution git / ✍️ [31 Dec 2024] github stars
  • FrugalGPT📑: LLM with budget constraints, requests are cascaded from low-cost to high-cost LLMs. git [9 May 2023] github stars
  • Galactica📑: A Large Language Model for Science [16 Nov 2022]
  • Gemma series
  • Huggingface StarCoder: A State-of-the-Art LLM for Code🤗: 🤗 [May 2023]
  • MechGPT📑: Language Modeling Strategies for Mechanics and Materials git [16 Oct 2023] github stars
  • MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers [27 Nov 2023]
  • OpenCoder: 1.5B and 8B base and open-source Code LLM, supporting both English and Chinese. [Oct 2024] github stars
  • Prithvi WxC📑: In collaboration with NASA, IBM is releasing an open-source foundation model for Weather and Climate ✍️ [20 Sep 2024]
  • Qwen2-Math: math-specific LLM / Qwen2-Audio: large-scale audio-language model [Aug 2024] / Qwen 2.5-Coder [18 Sep 2024] github stars github stars github stars
  • Qwen3-Coder: Qwen3-Coder is the code version of Qwen3, the large language model series developed by Qwen team, Alibaba Cloud. [Jul 2025] github stars
  • GLM-5🤗: Model card for Z.ai's latest GLM family release.
  • SaulLM-7B📑: A pioneering Large Language Model for Law [6 Mar 2024]
  • TimeGPT: The First Foundation Model for Time Series Forecasting git [Mar 2023] github stars
  • Video LLMs for Temporal Reasoning in Long Videos📑: TemporalVLM, a video LLM excelling in temporal reasoning and fine-grained understanding of long videos, using time-aware features and validated on datasets like TimeIT and IndustryASM for superior performance. [4 Dec 2024]
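
The FrugalGPT entry above describes cascading requests from low-cost to high-cost LLMs. A minimal sketch of that control flow follows. The model tuples, the confidence scorer, and the `accept_threshold` value are all hypothetical stand-ins, not the paper's code, which learns its scoring function and thresholds.

```python
def cascade(query, models, accept_threshold=0.8):
    """Try models from cheapest to most expensive; stop at the first confident answer."""
    total_cost = 0.0
    result = None
    for name, cost, generate, score in models:
        answer = generate(query)
        total_cost += cost
        result = (name, answer, total_cost)
        # Accept as soon as the scorer judges the answer reliable enough.
        if score(query, answer) >= accept_threshold:
            break
    return result


# Toy stand-ins: (name, cost per call, generate function, confidence function).
models = [
    ("small", 0.1, lambda q: "unsure", lambda q, a: 0.3),
    ("large", 1.0, lambda q: "42", lambda q, a: 0.95),
]
name, answer, cost = cascade("What is 6 * 7?", models)
print(name, answer, cost)
```

Easy queries stop at the cheap model, so the expensive model is only paid for when the cheaper answer is judged unreliable.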

MLLM (multimodal large language model)

  • Apple
    • 4M-21📑: An Any-to-Any Vision Model for Tens of Tasks and Modalities. [13 Jun 2024]
  • Awesome Multimodal Large Language Models: Latest Papers and Datasets on Multimodal Large Language Models, and Their Evaluation. [Jun 2023] github stars
  • Benchmarking Multimodal LLMs.
    • LLaVA-1.5 achieves SoTA on a broad range of 11 tasks incl. SEED-Bench.
    • SEED-Bench📑: Benchmarking Multimodal LLMs git [30 Jul 2023] github stars
  • BLIP-2📑 [30 Jan 2023]: Salesforce Research, Querying Transformer (Q-Former) / git / 🤗 / 📺 / BLIP📑: git [28 Jan 2022] github stars
    • Q-Former (Querying Transformer): A transformer model that consists of two submodules that share the same self-attention layers: an image transformer that interacts with a frozen image encoder for visual feature extraction, and a text transformer that can function as both a text encoder and a text decoder.
    • Q-Former is a lightweight transformer which employs a set of learnable query vectors to extract visual features from the frozen image encoder. It acts as an information bottleneck between the frozen image encoder and the frozen LLM.
  • CLIP📑: CLIP (Contrastive Language-Image Pretraining), Trained on a large number of internet text-image pairs and can be applied to a wide range of tasks with zero-shot learning. git [26 Feb 2021] github stars
  • Drag Your GAN📑: Interactive Point-based Manipulation on the Generative Image Manifold git [18 May 2023] github stars
  • GroundingDINO📑: DINO with Grounded Pre-Training for Open-Set Object Detection git [9 Mar 2023] github stars
  • Hugging Face
  • LLaVa📑: Large Language-and-Vision Assistant git [17 Apr 2023]
    • Simple linear layer to connect image features into the word embedding space. A trainable projection matrix W is applied to the visual features Zv, transforming them into visual embedding tokens Hv. These tokens are then concatenated with the language embedding sequence Hq to form a single sequence. Note that Hv and Hq are concatenated, not multiplied or added; both have the same dimensionality.
  • LLaVA-CoT📑: (FKA. LLaVA-o1) Let Vision Language Models Reason Step-by-Step. git [15 Nov 2024]
  • Meta (aka. Facebook)
    • facebookresearch/ImageBind📑: ImageBind One Embedding Space to Bind Them All git [9 May 2023] github stars
    • facebookresearch/segment-anything(SAM)📑: The repository provides code for running inference with the SegmentAnything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model. git [5 Apr 2023] github stars
    • facebookresearch/SeamlessM4T📑: SeamlessM4T is the first all-in-one multilingual multimodal AI translation and transcription model. This single model can perform speech-to-text, speech-to-speech, text-to-speech, and text-to-text translations for up to 100 languages depending on the task. ✍️ [22 Aug 2023]
    • Chameleon📑: Early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. The unified approach uses fully token-based representations for both image and textual modalities. no vision-encoder. [16 May 2024]
    • Models and libraries
  • Microsoft
    • Language Is Not All You Need: Aligning Perception with Language Models Kosmos-1📑: [27 Feb 2023]
    • Kosmos-2📑: Grounding Multimodal Large Language Models to the World [26 Jun 2023]
    • Kosmos-2.5📑: A Multimodal Literate Model [20 Sep 2023]
    • BEiT-3📑: Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks [22 Aug 2022]
    • TaskMatrix.AI📑: TaskMatrix connects ChatGPT and a series of Visual Foundation Models to enable sending and receiving images during chatting. [29 Mar 2023]
    • Florence-2📑: Advancing a unified representation for various vision tasks, replacing the need for specialized models such as CLIP for classification, GroundingDINO for object detection, and SAM for segmentation. 🤗 [10 Nov 2023]
    • LLM2CLIP: Directly integrating LLMs into CLIP causes catastrophic performance drops. We propose LLM2CLIP, a caption contrastive fine-tuning method that leverages LLMs to enhance CLIP. [7 Nov 2024]
    • Florence-VL📑: A multimodal large language model (MLLM) that integrates Florence-2. [5 Dec 2024]
    • Magma: A Foundation Model for Multimodal AI Agents [18 Feb 2025]
  • MiniCPM-o: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone [15 Jan 2025]
  • MiniCPM-V: MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone [Jan 2024] github stars
  • MiniGPT-4 & MiniGPT-v2📑: Enhancing Vision-language Understanding with Advanced Large Language Models git [20 Apr 2023]
  • mini-omni2: ✍️: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities. [15 Oct 2024] github stars
  • Molmo and PixMo📑: Open Weights and Open Data for State-of-the-Art Multimodal Models ✍️ [25 Sep 2024]
  • moondream: an OSS tiny vision language model. Built using SigLIP, Phi-1.5, LLaVA dataset. [Dec 2023] github stars
  • Multimodal Foundation Models: From Specialists to General-Purpose Assistants📑: A comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities. Specific-Purpose 1. Visual understanding tasks 2. Visual generation tasks General-Purpose 3. General-purpose interface. [18 Sep 2023]
  • Optimizing Memory Usage for Training LLMs and Vision Transformers: When applying 10 techniques to a vision transformer, we reduced the memory consumption 20x on a single GPU. ✍️ / git [2 Jul 2023] github stars
  • openai/shap-e📑 Generate 3D objects conditioned on text or images [3 May 2023] git github stars
  • TaskMatrix, aka. VisualChatGPT📑: Microsoft TaskMatrix git; GroundingDINO + SAM📑 / git [8 Mar 2023] github stars github stars
  • Ultravox: A fast multimodal LLM for real-time voice [May 2024]
  • Understanding Multimodal LLMs✍️:💡Two main approaches to building multimodal LLMs: 1. Unified Embedding Decoder Architecture approach; 2. Cross-modality Attention Architecture approach. [3 Nov 2024]
  • Video-ChatGPT📑: a video conversation model capable of generating meaningful conversation about videos. / git [8 Jun 2023] github stars
  • Vision capability to a LLM ✍️: The model has three sub-models: A model to obtain image embeddings -> A text model to obtain text embeddings -> A model to learn the relationships between them [22 Aug 2023]
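
The LLaVA entry above describes projecting visual features Zv with a matrix W into tokens Hv and concatenating them with the language embeddings Hq. A toy sketch with plain Python lists shows the shapes involved. The sizes and values are illustrative assumptions; the real model uses a trained projection and tensor libraries.

```python
def project(visual_features, W):
    """Map each visual feature vector into the word-embedding space: Hv = Zv @ W."""
    return [[sum(z * w for z, w in zip(row, col)) for col in zip(*W)]
            for row in visual_features]


# Toy sizes: 2 image patches with 3-dim features, projected to a 2-dim embedding space.
Zv = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # trainable in the real model
Hq = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 text-token embeddings

Hv = project(Zv, W)
# Concatenation along the sequence axis, not element-wise addition.
sequence = Hv + Hq
print(len(sequence), sequence[0])
```

The combined sequence has one row per visual token plus one per text token, all with the same embedding width, which is what lets the LLM consume them as a single input.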

Prompt Engineering and Visual Prompts

Prompt Engineering

  1. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications📑: a summary detailing the prompting methodology, its applications.🏆Taxonomy of prompt engineering techniques in LLMs. [5 Feb 2024]
  2. Chain of Draft: Thinking Faster by Writing Less📑: Chain-of-Draft prompting condenses the reasoning process into minimal, abstract representations. Prompt: “Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most.” [25 Feb 2025]
  3. Chain of Thought (CoT)📑:💡Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. ReAct and Self-Consistency also inherit the CoT concept. [28 Jan 2022]
  4. Chain-of-Verification reduces Hallucination in LLMs📑: A four-step process that consists of generating a baseline response, planning verification questions, executing verification questions, and generating a final verified response based on the verification results. [20 Sep 2023]
  5. ChatGPT: “user”, “assistant”, and “system” messages.
    To be specific, the ChatGPT API allows for differentiation between “user”, “assistant”, and “system” messages.
    1. Always obey “system” messages.
    2. Put all end-user input in “user” messages.
    3. Treat “assistant” messages as previous chat responses from the assistant.
    • Presumably, the model is trained to treat the user messages as human messages, system messages as some system level configuration, and assistant messages as previous chat responses from the assistant. ✍️ [2 Mar 2023]
  6. Does Prompt Formatting Have Any Impact on LLM Performance?📑: GPT-3.5-turbo's performance in code translation varies by 40% depending on the prompt template, while GPT-4 is more robust. [15 Nov 2024]
  7. Few-shot: Open AI: Language Models are Few-Shot Learners📑: [28 May 2020]
  8. FireAct📑: Toward Language Agent Fine-tuning. 1. This work takes an initial step to show multiple advantages of fine-tuning LMs for agentic uses. 2. During fine-tuning, the successful trajectories are converted into the ReAct format to fine-tune a smaller LM. 3. This work is an initial step toward language agent fine-tuning, and is constrained to a single type of task (QA) and a single tool (Google search). / git [9 Oct 2023]
  9. Graph of Thoughts (GoT)📑: Solving Elaborate Problems with Large Language Models git [18 Aug 2023] github stars
  10. Is the new norm for NLP papers "prompt engineering" papers?: "how can we make LLM 1 do this without training?" Is this the new norm? The CL section of arXiv is overwhelming with papers like "how come LLaMA can't understand numbers?" [2 Aug 2024]
  11. Large Language Models as Optimizers📑:💡Adding “Take a deep breath and work on this problem step-by-step” to the prompt improves accuracy. Optimization by PROmpting (OPRO) [7 Sep 2023]
  12. Language Models as Compilers📑: With extensive experiments on seven algorithmic reasoning tasks, Think-and-Execute is effective. It enhances large language models’ reasoning by using task-level logic and pseudocode, outperforming instance-specific methods. [20 Mar 2023]
  13. Many-Shot In-Context Learning📑: Transitioning from few-shot to many-shot In-Context Learning (ICL) can lead to significant performance gains across a wide variety of generative and discriminative tasks [17 Apr 2024]
  14. NLEP (Natural Language Embedded Programs) for Hybrid Language Symbolic Reasoning📑: Use code as a scaffold for reasoning. NLEP achieves over 90% accuracy when prompting GPT-4. [19 Sep 2023]
  15. OpenAI Harmony Response Format: system > developer > user > assistant > tool. git [5 Aug 2025]
  16. OpenAI Prompt Migration Guide:💡OpenAI Cookbook. By leveraging GPT‑4.1, refine your prompts to ensure that each instruction is clear, specific, and closely matches your intended outcomes. [26 Jun 2025]
  17. Plan-and-Solve Prompting📑: Develop a plan, and then execute each step in that plan. [6 May 2023]
  18. Power of Prompting
    • GPT-4 with Medprompt📑: GPT-4, using a method called Medprompt that combines several prompting strategies, has surpassed MedPaLM 2 on the MedQA dataset without the need for fine-tuning. ✍️ [28 Nov 2023]
    • promptbase: Scripts demonstrating the Medprompt methodology [Dec 2023] github stars
  19. Prompt Concept Keywords: Question-Answering | Role-play: Act as a [ROLE] perform [TASK] in [FORMAT] | Reasoning | Prompt-Chain
  20. Prompt Engineering for OpenAI’s O1 and O3-mini Reasoning Models✍️: 1) Keep Prompts Clear and Minimal, 2) Avoid Unnecessary Few-Shot Examples, 3) Control Length and Detail via Instructions, 4) Specify Output, Role or Tone [05 Feb 2025]
  21. Prompt Engineering overview 🗣️ [10 Jul 2023]
  22. Prompt Principle for Instructions📑:💡26 prompt principles: e.g., 1) No need to be polite with LLM so there .. 16) Assign a role.. 17) Use Delimiters.. [26 Dec 2023]
  23. Promptist
    • Promptist📑: Microsoft's researchers trained an additional language model (LM) that optimizes text prompts for text-to-image generation. [19 Dec 2022]
    • For example, instead of simply passing "Cats dancing in a space club" as a prompt, an engineered prompt might be "Cats dancing in a space club, digital painting, artstation, concept art, soft light, hdri, smooth, sharp focus, illustration, fantasy."
  24. RankPrompt📑: Self-ranking method. Direct Scoring independently assigns scores to each candidate, whereas RankPrompt ranks candidates through a systematic, step-by-step comparative evaluation. [19 Mar 2024]
  25. ReAct📑 (Reasoning and Acting): Combines reasoning and acting, grounding with external sources. ✍️ [6 Oct 2022]
  26. Re-Reading Improves Reasoning in Large Language Models📑: RE2 (Re-Reading), which involves re-reading the question as input to enhance the LLM's understanding of the problem. Prompt: “Read the question again.” [12 Sep 2023]
  27. Recursively Criticizes and Improves (RCI)📑: [30 Mar 2023]
    • Critique: Review your previous answer and find problems with your answer.
    • Improve: Based on the problems you found, improve your answer.
  28. Reflexion📑: Language Agents with Verbal Reinforcement Learning. 1. Reflexion uses verbal reinforcement to help agents learn from prior failings. 2. Reflexion converts binary or scalar feedback from the environment into verbal feedback in the form of a textual summary, which is then added as additional context for the LLM agent in the next episode. 3. It is lightweight and doesn’t require finetuning the LLM. [20 Mar 2023] / git github stars
  29. Retrieval Augmented Generation (RAG)📑: To address knowledge-intensive tasks, RAG combines an information retrieval component with a text generator model. [22 May 2020]
  30. Self-Consistency (CoT-SC)📑: The three steps in the self-consistency method: 1) prompt the language model using CoT prompting, 2) sample a diverse set of reasoning paths from the language model, and 3) marginalize out reasoning paths to aggregate final answers and choose the most consistent answer. [21 Mar 2022]
  31. Self-Refine📑: Enables an agent to reflect on its own output [30 Mar 2023]
  32. Skeleton Of Thought📑: Skeleton-of-Thought (SoT) reduces generation latency by first creating an answer's skeleton, then filling each skeleton point in parallel via API calls or batched decoding. [28 Jul 2023]
  33. Tree of Thought (ToT)📑: Self-evaluate the progress intermediate thoughts make towards solving a problem [17 May 2023] git / Agora: Tree of Thoughts (ToT) git github stars github stars
  34. Verbalized Sampling📑: "Generate 5 jokes about coffee and their corresponding probabilities". In creative writing, VS increases diversity by 1.6-2.1x over direct prompting. [1 Oct 2025]
  35. Zero-shot, one-shot and few-shot ref📑 [28 May 2020]
  36. Zero-shot: Large Language Models are Zero-Shot Reasoners📑: Let’s think step by step. [24 May 2022]
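The three Self-Consistency (CoT-SC) steps listed above can be sketched as a majority vote over sampled answers. This is a minimal sketch, not the paper's implementation: `sample_answer` is a hypothetical callable standing in for one CoT completion sampled from an LLM and reduced to its final answer, and the replayed answer list is made-up illustration data.

```python
from collections import Counter
from typing import Callable


def self_consistency(sample_answer: Callable[[], str], n_samples: int = 5) -> str:
    """Sample several reasoning paths and return the most frequent final answer."""
    # Steps 1-2: each call stands for one CoT completion sampled with
    # temperature > 0, already reduced to its final answer string.
    answers = [sample_answer() for _ in range(n_samples)]
    # Step 3: aggregate by majority vote over the final answers.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical stand-in for an LLM: replays a fixed list of final answers.
_fake_answers = iter(["18", "26", "18", "18", "17"])
result = self_consistency(lambda: next(_fake_answers), n_samples=5)
print(result)  # prints 18
```

The vote ignores the reasoning text itself; only the final answers are compared, which is why diverse reasoning paths that agree on the answer reinforce each other.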

Adversarial Prompting

  • Prompt Injection: Ignore the above directions and ...
  • Prompt Leaking: Ignore the above instructions ... followed by a copy of the full prompt with exemplars:
  • Jailbreaking: Bypassing a safety policy so that the model follows unethical instructions when the request is contextualized in a clever way. ✍️
  • Random Search (RS): git: 1. Feed the modified prompt (original + suffix) to the model. 2. Compute the log probability of a target token (e.g., “Sure”). 3. Accept the suffix if the log probability increases.
  • DAN (Do Anything Now): ✍️
  • JailbreakBench: git / ✍️

Prompt Tuner / Optimizer

  1. Automatic Prompt Engineer (APE)📑: Automatically optimizing prompts. APE has discovered zero-shot Chain-of-Thought (CoT) prompts superior to the human-designed prompt “Let’s think step by step” (Kojima et al., 2022). The prompt “To get the correct answer, let’s think step-by-step.” triggers a chain of thought. Two approaches to generate high-quality candidates: forward mode and reverse mode generation. [3 Nov 2022] git / ✍️ [Mar 2024]
  2. Claude Prompt Engineer: Simply input a description of your task and some test cases, and the system will generate, test, and rank a multitude of prompts to find the ones that perform the best. [4 Jul 2023] / Anthropic Helper metaprompt ✍️ / Claude Sonnet 3.5 for Coding github stars
  3. Cohere’s new Prompt Tuner: Automatically improve your prompts [31 Jul 2024]
  4. Large Language Models as Optimizers📑: Optimization by PROmpting (OPRO). Showcases OPRO on linear regression and traveling salesman problems. git [7 Sep 2023]

Prompt Guide & Leaked prompts

Visual Prompting & Visual Grounding

Finetuning

LLM Pre-training and Post-training Paradigms

  • How to continue pretraining an LLM on new data: Continued pretraining can be as effective as retraining on combined datasets. [13 Mar 2024]
  • Three training methods were compared:
    • Regular pretraining: A model is initialized with random weights and pretrained on dataset D1.
    • Continued pretraining: The pretrained model from regular pretraining is further pretrained on dataset D2.
    • Retraining on combined dataset: A model is initialized with random weights and trained on the combined datasets D1 and D2.
  • Key strategies for successful continued pretraining include:
    • Re-warming: Increasing the learning rate at the start of continued pre-training.
    • Re-decaying: Gradually reducing the learning rate afterwards.
    • Data Mixing: Adding a small portion (e.g., 5%) of the original pretraining data (D1) to the new dataset (D2) to prevent catastrophic forgetting.
  • LIMA: Less Is More for Alignment📑: A 65B LLaMA model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, either equivalent or strictly preferred to GPT-4 in 43% of cases. [18 May 2023]
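The re-warming, re-decaying, and data-mixing strategies for continued pretraining listed above can be sketched as follows. This is an illustrative sketch under assumptions: the linear warmup and cosine decay shapes, the function names, and the reading of "5%" as the share of D1 examples in the final mix are not taken from the source.

```python
import math
import random


def rewarm_redecay_lr(step, total_steps, warmup_steps, max_lr, min_lr):
    """Learning rate for continued pretraining (assumed linear warmup + cosine decay)."""
    if step < warmup_steps:
        # Re-warming: ramp the learning rate back up from min_lr to max_lr.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Re-decaying: cosine decay from max_lr back down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


def mix_datasets(d1, d2, replay_fraction=0.05, seed=0):
    """Return D2 plus a replay sample of D1 making up ~replay_fraction of the mix."""
    rng = random.Random(seed)
    n_replay = round(len(d2) * replay_fraction / (1 - replay_fraction))
    n_replay = min(n_replay, len(d1))
    mixed = list(d2) + rng.sample(list(d1), n_replay)
    rng.shuffle(mixed)
    return mixed


# Toy usage: integers stand in for training examples.
d1 = list(range(1000))          # original pretraining data (D1)
d2 = list(range(1000, 1950))    # new data (D2), 950 examples
mixed = mix_datasets(d1, d2)
schedule = [rewarm_redecay_lr(s, 100, 10, 3e-4, 3e-5) for s in range(101)]
```

In this toy run the mix holds 1,000 examples, 50 of them replayed from D1, and the schedule peaks at step 10 before decaying back to the minimum rate.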

PEFT: Parameter-Efficient Fine-Tuning (📺) [24 Apr 2023]

  • PEFT🤗: Parameter-Efficient Fine-Tuning. PEFT is an approach to fine tuning only a few parameters. [10 Feb 2023]
  • Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning📑: [28 Mar 2023]
  • PEFT Category: Pseudo Code ✍️ [22 Sep 2023]
    • Adapters: Adapters - Additional Layers. Inference can be slower.
      def transformer_with_adapter(x):
        residual = x
        x = SelfAttention(x)
        x = FFN(x) # adapter
        x = LN(x + residual)
        residual = x
        x = FFN(x) # transformer FFN
        x = FFN(x) # adapter
        x = LN(x + residual)
        return x
    • Soft Prompts: Prompt-Tuning - Learnable text prompts. Not always desired results.
      def soft_prompted_model(input_ids):
        x = Embed(input_ids)
        soft_prompt_embedding = SoftPromptEmbed(task_based_soft_prompt)
        x = concat([soft_prompt_embedding, x], dim=seq)
        return model(x)
    • Selective: BitFit - Update only the bias parameters. fast but limited.
      params = (p for n,p in model.named_parameters() if "bias" in n)
      optimizer = Optimizer(params)
    • Reparametrization: LoRA - Low-rank decomposition. Efficient, Complex to implement.
      def lora_linear(x):
        h = x @ W # regular linear (W is frozen)
        h += scale * (x @ W_A @ W_B) # scaled low-rank update
        return h
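A runnable version of the LoRA pseudo code might look like the sketch below. It uses plain Python lists so it needs no dependencies. The class name `LoRALinear`, the helper `lora_param_count`, the zero initialization of `W_B`, and the `alpha / rank` scaling are assumptions based on common LoRA practice, not something the list above specifies.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update W_A @ W_B (illustrative)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_in, d_out = len(weight), len(weight[0])
        self.W = weight  # frozen pre-trained weight, shape (d_in, d_out)
        # Down-projection, small random init (assumption).
        self.W_A = [[rng.gauss(0.0, 0.01) for _ in range(rank)] for _ in range(d_in)]
        # Up-projection, zero init so training starts from the base model (assumption).
        self.W_B = [[0.0] * d_out for _ in range(rank)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matmul(x, self.W)                        # regular linear
        update = matmul(matmul(x, self.W_A), self.W_B)  # low-rank update
        return [[b + self.scale * u for b, u in zip(rb, ru)]
                for rb, ru in zip(base, update)]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in W_A and W_B."""
    return rank * (d_in + d_out)


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # d_in=3, d_out=2
layer = LoRALinear(W, rank=1, alpha=2.0)
x = [[1.0, 0.0, -1.0]]
y_before = layer.forward(x)   # equals x @ W because W_B starts at zero
layer.W_B = [[0.5, -0.5]]     # pretend a training step updated W_B
y_after = layer.forward(x)
```

For a hypothetical 4096 x 4096 layer with rank 8, `lora_param_count` gives 65,536 trainable parameters versus 16,777,216 in the full matrix.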

LoRA: Low-Rank Adaptation

  • 5 Techniques of LoRA ✍️: LoRA, LoRA-FA, VeRA, Delta-LoRA, LoRA+ [May 2024]
  • DoRA📑: Weight-Decomposed Low-Rank Adaptation. Decomposes pre-trained weight into two components, magnitude and direction, for fine-tuning. [14 Feb 2024]
  • Fine-tuning a GPT - LoRA: Comprehensive guide for LoRA 🗄️ [20 Jun 2023]
  • LoRA: Low-Rank Adaptation of Large Language Models📑: LoRA is a PEFT technique that represents the weight updates with two smaller matrices (called update matrices) through low-rank decomposition. git [17 Jun 2021]
  • LoRA learns less and forgets less📑: Compared to full training, LoRA has less learning but better retention of original knowledge. [15 May 2024]
  • LoRA+📑: Improves LoRA’s performance and fine-tuning speed by setting different learning rates for the LoRA adapter matrices. [19 Feb 2024]
  • LoTR📑: Tensor decomposition for gradient update. [2 Feb 2024]
  • LoRA Family ✍️ [11 Mar 2024]
    • LoRA introduces low-rank matrices A and B that are trained, while the pre-trained weight matrix W is frozen.
    • LoRA+ suggests having a much higher learning rate for B than for A.
    • VeRA does not train A and B, but initializes them randomly and trains new vectors d and b on top.
    • LoRA-FA only trains matrix B.
    • LoRA-drop uses the output of B*A to determine which layers are worth training at all.
    • AdaLoRA adapts the ranks of A and B in different layers dynamically, allowing a higher rank in those layers where more contribution to the model’s performance is expected.
    • DoRA splits the weight into two components, magnitude and direction, and allows them to be trained more independently.
    • Delta-LoRA changes the weights of W by the gradient of A*B.
  • Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation)✍️ [19 Nov 2023]: Best practical guide of LoRA.
    • QLoRA saves 33% memory but increases runtime by 39%, useful if GPU memory is a constraint.
    • Optimizer choice for LLM finetuning isn’t crucial. Adam optimizer’s memory-intensity doesn’t significantly impact LLM’s peak memory.
    • Apply LoRA across all layers for maximum performance.
    • Adjusting the LoRA rank is essential.
    • Multi-epoch training on static datasets may lead to overfitting and deteriorate results.
  • QLoRA: Efficient Finetuning of Quantized LLMs📑: Backpropagates gradients through a frozen, 4-bit quantized pre-trained language model into Low Rank Adapters (LoRA). git [23 May 2023]
  • The Expressive Power of Low-Rank Adaptation📑: Theoretically analyzes the expressive power of LoRA. [26 Oct 2023]
  • Training language models to follow instructions with human feedback📑: [4 Mar 2022]

RLHF (Reinforcement Learning from Human Feedback) & SFT (Supervised Fine-Tuning)

  • A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More📑 [23 Jul 2024]
  • Absolute Zero: Reinforced Self-play Reasoning with Zero Data📑: Autonomous AI systems capable of self-improvement without human-curated data, using interpreter feedback for code generation and math problem solving. [6 May 2025]
  • Direct Preference Optimization (DPO)📑: 1. RLHF can be complex because it requires fitting a reward model and performing significant hyperparameter tuning. On the other hand, DPO directly solves a classification problem on human preference data in just one stage of policy training. DPO is more stable, efficient, and computationally lighter than RLHF. 2. Your Language Model Is Secretly a Reward Model [29 May 2023]
  • Direct Preference Optimization (DPO) uses two models: a trained model (or policy model) and a reference model (copy of trained model). The goal is to have the trained model output higher probabilities for preferred answers and lower probabilities for rejected answers compared to the reference model. ✍️: RLHF vs DPO [Jan 2, 2024] / ✍️ [1 Jul 2023]
  • InstructGPT: Training language models to follow instructions with human feedback📑: is a model trained by OpenAI to follow instructions using human feedback. [4 Mar 2022]


    🗣️
  • Libraries: TRL🤗: from the Supervised Fine-tuning step (SFT), Reward Modeling step (RM) to the Proximal Policy Optimization (PPO) step, trlX, Argilla github stars

    • The three steps in the process: 1. pre-training on large web-scale data, 2. supervised fine-tuning on instruction data (instruction tuning), and 3. RLHF. ✍️
  • RLHF is a machine learning technique that trains a "reward model" directly from human feedback and uses the model as a reward function to optimize an agent's policy using reinforcement learning.
  • OpenAI Spinning Up in Deep RL!: An educational resource to help anyone learn deep reinforcement learning. git [Nov 2018] github stars
  • ORPO (odds ratio preference optimization)📑: Monolithic Preference Optimization without Reference Model. New method that combines supervised fine-tuning and preference alignment into one process git [12 Mar 2024] Fine-tune Llama 3 with ORPO✍️ [Apr 2024]
  • Preference optimization techniques: ✍️ [13 Aug 2024]
    • RLHF (Reinforcement Learning from Human Feedback): Optimizes reward policy via objective function.
    • DPO (Direct preference optimization): removes the need for a reward model. > Minimizes loss; no reward policy.
    • IPO (Identity Preference Optimization) : A change in the objective, which is simpler and less prone to overfitting.
    • KTO (Kahneman-Tversky Optimization) : Scales more data by replacing the pairs of accepted and rejected generations with a binary label.
    • ORPO (Odds Ratio Preference Optimization) : Combines instruction tuning and preference optimization into one training process, which is cheaper and faster.
    • TPO (Thought Preference Optimization): This method generates thoughts before the final response, which are then evaluated by a Judge model for preference using Direct Preference Optimization (DPO). [14 Oct 2024]
  • Reinforcement Learning from AI Feedback (RLAIF)📑: Uses preference labels generated by an AI model in place of human annotators. TLDR: CoT (Chain-of-Thought, Improved), Few-shot (Not improved). Only explores the task of summarization. After training on a few thousand examples, performance is close to training on the full dataset. RLAIF vs RLHF: In many cases, the two policies produced similar summaries. [1 Sep 2023]
  • Reinforcement Learning from Human Feedback (RLHF)📑 is a process of pretraining and retraining a language model using human feedback to develop a scoring algorithm that can be reapplied at scale for future training and refinement. As the algorithm is refined to match the human-provided grading, direct human feedback is no longer needed, and the language model continues learning and improving using algorithmic grading alone. [18 Sep 2019] 🤗 [9 Dec 2022]
    • Proximal Policy Optimization (PPO) is a reinforcement learning method using first-order optimization. It modifies the objective function to penalize large policy changes, specifically those that move the probability ratio away from 1. Aiming for TRPO (Trust Region Policy Optimization)-level performance without its complexity which requires second-order optimization.
  • Reinforcement Learning with Verifiable Rewards✍️: Practical RLVR Tutorial [Oct 24 2025]
  • SFT vs RL📑: SFT Memorizes, RL Generalizes. RL enhances generalization across text and vision, while SFT tends to memorize and overfit. git [28 Jan 2025]
  • Supervised Fine-Tuning (SFT): Fine-tuning a pre-trained model on a specific task or domain using labeled data. This can cause more significant shifts in the model’s behavior compared to RLHF.
  • Supervised Reinforcement Learning (SRL)📑: The Problem: SFT imitates human actions token by token, leading to overfitting; RLVR gives rewards only when successful, with no signal when all attempts fail. This Approach: Each action during RL generates a short reasoning trace and receives a similarity reward at every step. [29 Oct 2025]
  • Train your own R1 reasoning model with Unsloth (GRPO)✍️: Unsloth x vLLM > 20x more throughput, 50% VRAM savings. [6 Feb 2025]
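Two of the objectives described in the list above, the DPO loss (policy versus reference model on preferred and rejected answers) and the PPO clipped objective (penalizing probability ratios that move away from 1), can be written down for a single example as follows. This is a per-sample sketch under assumptions: the function names, `beta=0.1`, and `clip_eps=0.2` are illustrative defaults, and real trainers work on batched tensors of sequence log-probabilities.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more (or less) likely each answer is under the policy than the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) computed as a numerically stable softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


def ppo_clipped_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """PPO clipped surrogate objective for one action (to be maximized)."""
    ratio = math.exp(new_logp - old_logp)  # probability ratio new / old policy
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to push the ratio far from 1.
    return min(ratio * advantage, clipped_ratio * advantage)


# Policy identical to the reference: the loss is log(2), i.e. no preference learned yet.
loss_start = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy raises the chosen answer and lowers the rejected one: the loss drops.
loss_better = dpo_loss(-1.0, -3.0, -2.0, -2.0)
# Ratio exp(0.5) exceeds 1 + clip_eps, so the gain is capped at 1.2 * advantage.
capped = ppo_clipped_objective(new_logp=-0.5, old_logp=-1.0, advantage=1.0)
```

DPO needs only log-probabilities from the two models, which is why it avoids a separate reward model, while PPO needs advantages that in RLHF are derived from a reward model.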

Quantization Techniques

  • bitsandbytes: 8-bit optimizers git [Oct 2021] github stars
  • The Era of 1-bit LLMs📑: All Large Language Models are in 1.58 Bits. BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. [27 Feb 2024]
  • Quantization-aware training (QAT): The model is further trained with quantization in mind after being initially trained in floating-point precision.
  • Post-training quantization (PTQ): The model is quantized after it has been trained without further optimization during the quantization process.
    Method Pros Cons
    Post-training quantization Easy to use, no need to retrain the model May result in accuracy loss
    Quantization-aware training Can achieve higher accuracy than post-training quantization Requires retraining the model, can be more complex to implement
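To make the post-training quantization row concrete, here is a minimal sketch of symmetric absmax int8 quantization applied to already-trained weights. The absmax scheme and the function names are assumptions chosen for illustration; libraries such as bitsandbytes use more elaborate block-wise variants.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.013, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

No retraining happens here, which is the "easy to use" side of the trade-off; the rounding error in `max_error` is the source of the possible accuracy loss.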

Pruning and Sparsification

  • Pruning: The process of removing some of the neurons or layers from a neural network. This can be done by identifying and eliminating neurons or layers that have little or no impact on the network's output.
  • Sparsification: A technique used to reduce the size of large language models by removing redundant parameters.
  • Wanda Pruning📑: A Simple and Effective Pruning Approach for Large Language Models [20 Jun 2023] ✍️
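A small sketch of unstructured pruning follows. With `input_norms=None` it is plain magnitude pruning; with per-input activation norms it scores each weight by magnitude times activation norm and compares within each output row, which is the scoring idea described in the Wanda paper. The function name, the toy matrix, and the norm values are illustrative assumptions.

```python
def prune_rows(weights, sparsity, input_norms=None):
    """Zero out the lowest-scoring fraction `sparsity` of weights in each row.

    weights: list of rows, one row per output neuron.
    input_norms: optional per-input activation norms (Wanda-style scoring).
    """
    pruned = []
    for row in weights:
        norms = input_norms if input_norms is not None else [1.0] * len(row)
        scores = [abs(w) * n for w, n in zip(row, norms)]
        k = int(len(row) * sparsity)  # number of weights to remove in this row
        drop = set(sorted(range(len(row)), key=lambda j: scores[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


W = [[0.1, -2.0, 0.5, 0.05]]
by_magnitude = prune_rows(W, sparsity=0.5)
# Hypothetical activation norms: the last input is large, so its small weight matters.
by_activation = prune_rows(W, sparsity=0.5, input_norms=[10.0, 1.0, 0.1, 100.0])
```

The two calls keep different weights: magnitude pruning drops 0.1 and 0.05, while the activation-aware score keeps 0.05 because its input is large and drops 0.5 instead.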

Knowledge Distillation: Reducing Model Size with Textbooks

  • Distilled Supervised Fine-Tuning (dSFT)
    • Zephyr 7B📑: Zephyr-7B-β is the second model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). 🤗 [25 Oct 2023]
    • Mistral 7B📑: Outperforms Llama 2 13B on all benchmarks. Uses Grouped-query attention (GQA) for faster inference. Uses Sliding Window Attention (SWA) to handle longer sequences at smaller cost. ✍️ [10 Oct 2023]
  • Textbooks Are All You Need📑: phi-1 [20 Jun 2023]
  • Orca 2📑: Orca learns from rich signals from GPT 4 including explanation traces; step-by-step thought processes; and other complex instructions, guided by teacher assistance from ChatGPT. ✍️ [18 Nov 2023]

Memory Optimization

  • CPU vs GPU vs TPU: The threads are grouped into thread blocks. Each of the thread blocks has access to a fast shared memory (SRAM). All the thread blocks can also share a large global memory. High-bandwidth memories (HBM). HBM Bandwidth: 1.5-2.0TB/s vs SRAM Bandwidth: 19TB/s ~ 10x HBM [27 May 2024]
  • Flash Attention📑: [27 May 2022]
    • In a GPU, a thread is the smallest execution unit, and a group of threads forms a block.
    • A block executes the same kernel (function, to simplify), with threads sharing fast SRAM memory.
    • All blocks can access the shared global HBM memory.
    • First, the query (Q) and key (K) product is computed in threads and returned to HBM. Then, it's redistributed for softmax and returned to HBM.
    • Flash attention reduces these movements by caching results in SRAM.
    • Tiling splits attention computation into memory-efficient blocks, while recomputation saves memory by recalculating intermediates during backprop. 📺
    • FlashAttention-2📑: [17 Jul 2023]: A method that reorders the attention computation and leverages classical techniques (tiling, recomputation). Instead of storing each intermediate result, it uses kernel fusion and runs every operation in a single kernel in order to avoid memory read/write overhead. git -> Compared to a standard attention implementation in PyTorch, FlashAttention-2 can be up to 9x faster
    • FlashAttention-3📑 [11 Jul 2024]
  • PagedAttention📑 : vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention, 24x Faster LLM Inference 🗄️. ✍️: vllm [12 Sep 2023]
    • PagedAttention for a prompt “the cat is sleeping in the kitchen and the dog is”. Key-Value pairs of tensors for attention computation are stored in virtual contiguous blocks mapped to non-contiguous blocks in the GPU memory.
    • Transformers cache key-value tensors of context tokens in GPU memory to facilitate fast generation of the next token. However, these caches occupy significant GPU memory. The unpredictable nature of cache size, due to the variability in the length of each request, exacerbates the issue, resulting in significant memory fragmentation in the absence of a suitable memory management mechanism.
    • To alleviate this issue, PagedAttention was proposed to store the KV cache in non-contiguous memory spaces. It partitions the KV cache of each sequence into multiple blocks, with each block containing the keys and values for a fixed number of tokens.
  • TokenAttention: An attention mechanism that manages key and value caching at the token level. git [Jul 2023]
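The tiling idea behind Flash Attention described above (processing keys and values block by block instead of materializing the full score vector) can be illustrated on the CPU with an online softmax. This is a sketch of the arithmetic only, for a single query and in pure Python; it does not model SRAM/HBM traffic or GPU kernels, and the function names and toy inputs are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, K, V):
    """Reference: softmax(q.K^T / sqrt(d)) @ V with all scores held at once."""
    s = [dot(q, k) / math.sqrt(len(q)) for k in K]
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [sum(w * v[j] for w, v in zip(e, V)) / z for j in range(len(V[0]))]


def tiled_attention(q, K, V, block_size=2):
    """Same result, computed one key/value block at a time (online softmax)."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(V[0])  # unnormalized weighted sum of values
    for start in range(0, len(K), block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        s = [dot(q, k) / math.sqrt(len(q)) for k in k_blk]
        new_max = max(running_max, max(s))
        # Rescale earlier partial results to the new maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for score, v in zip(s, v_blk):
            p = math.exp(score - new_max)
            running_sum += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = naive_attention(q, K, V)
tiled = tiled_attention(q, K, V, block_size=2)
```

Only one block of scores exists at any time, plus a running maximum, a running sum, and the accumulator, which is what lets the real kernel keep its working set in fast on-chip memory.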

Other techniques and LLM patterns

  • Better & Faster Large Language Models via Multi-token Prediction📑: Suggests that training language models to predict multiple future tokens at once results in higher sample efficiency. [30 Apr 2024]
  • Differential Transformer📑: Amplifies attention to the relevant context while minimizing noise using two separate softmax attention mechanisms. [7 Oct 2024]
  • KAN or MLP: A Fairer Comparison📑: Across machine learning, computer vision, audio processing, and natural language processing tasks, MLP generally outperforms KAN; symbolic formula representation is the exception. [23 Jul 2024]
  • Kolmogorov-Arnold Networks (KANs)📑: KANs use activation functions on connections instead of nodes like Multi-Layer Perceptrons (MLPs) do. Each weight in KANs is replaced by a learnable 1D spline function. KANs’ nodes simply sum incoming signals without applying any non-linearities. git [30 Apr 2024] / ✍️: A Beginner-friendly Introduction to Kolmogorov Arnold Networks (KAN) [19 May 2024] github stars
  • Large Concept Models📑: Focuses on high-level sentence (concept) representations rather than tokens, using SONAR as the sentence embedding space. [11 Dec 2024]
  • Large Language Diffusion Models📑: LLaDA's core is a mask predictor, which uses controlled noise to help models learn to predict missing information from context. ✍️ [14 Feb 2025]
  • Large Transformer Model Inference Optimization: Besides the increasing size of SoTA models, there are two main factors contributing to the inference challenge ... [10 Jan 2023]
  • Lamini Memory Tuning: Mixture of Millions of Memory Experts (MoME). 95% LLM Accuracy, 10x Fewer Hallucinations. ✍️ [Jun 2024] github stars
  • Less is More: Recursive Reasoning with Tiny Networks📑: Tiny neural networks can perform complex recursive reasoning efficiently, achieving strong results with minimal model size. [6 Oct 2025] git github stars
  • LLM patterns: 🏆From data to user, from defensive to offensive 🗄️
  • Mamba: Linear-Time Sequence Modeling with Selective State Spaces📑 [1 Dec 2023] git: 1. Structured State Space (S4) - Class of sequence models, encompassing traits from RNNs, CNNs, and classical state space models. 2. Hardware-aware (Optimized for GPU) 3. Integrating selective SSMs and eliminating attention and MLP blocks ✍️ / A Visual Guide to Mamba and State Space Models ✍️ [19 FEB 2024] github stars
  • Mamba-2📑: 2-8X faster [31 May 2024]
  • Mixture-of-Depths📑: Not all tokens should require the same effort to compute. The idea is to make token passage through a block optional. Each block selects the top-k tokens for processing, and the rest skip it. ✍️ [2 Apr 2024]
  • Mixture of experts models: Mixtral 8x7B: Sparse mixture of experts models (SMoE) magnet [Dec 2023]
  • Model Compression for Large Language Models ref📑 [15 Aug 2023]
  • Model merging✍️: A technique that combines two or more large language models (LLMs) into a single model, using methods such as SLERP, TIES, DARE, and passthrough. [Jan 2024] git: mergekit
    Method Pros Cons
    SLERP Preserves geometric properties, popular method Can only merge two models, may decrease magnitude
    TIES Can merge multiple models, eliminates redundant parameters Requires a base model, may discard useful parameters
    DARE Reduces overfitting, keeps expectations unchanged May introduce noise, may not work well with large differences
  • Nested Learning: A new ML paradigm for continual learning✍️: A self-modifying architecture. Nested Learning (HOPE) views a model and its training as multiple nested, multi-level optimization problems, each with its own “context flow,” pairing deep optimizers + continuum memory systems for continual, human-like learning. [7 Nov 2025]
  • RouteLLM: a framework for serving and evaluating LLM routers. [Jun 2024] github stars
  • Sakana.ai: Evolutionary Optimization of Model Merging Recipes.📑: A Method to Combine 500,000 OSS Models. git [19 Mar 2024] github stars
  • Scaling Synthetic Data Creation with 1,000,000,000 Personas📑 A persona-driven data synthesis methodology using Text-to-Persona and Persona-to-Persona. [28 Jun 2024]
  • Simplifying Transformer Blocks📑: Simplified Transformer. Removes several block components, including skip connections, projection/value matrices, sequential sub-blocks and normalisation layers, without loss of training speed. [3 Nov 2023]
  • Text-to-LoRA (T2L): Converts text prompts into LoRA models, enabling lightweight fine-tuning of AI models for custom tasks. github stars [01 May 2025]
  • Titans + MIRAS: Titans + MIRAS let models update themselves while running by using a human-like surprise metric that skips familiar info and stores only pattern-breaking moments into long-term memory. persistent (fixed knowledge), contextual (on-the-fly), and core-attention (short-term) layers. ✍️ [4 Dec 2025]
  • What We’ve Learned From A Year of Building with LLMs:💡A practical guide to building successful LLM products, covering the tactical, operational, and strategic. [8 June 2024]
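As an illustration of the SLERP method from the model merging table above, the sketch below interpolates two flattened weight vectors along the arc between them. It is a minimal sketch under assumptions: the function name, the linear-interpolation fallback for nearly parallel vectors, and the toy inputs are illustrative, and tools such as mergekit apply their own per-tensor logic.

```python
import math


def slerp(w0, w1, t, eps=1e-6):
    """Spherical linear interpolation between two weight vectors, 0 <= t <= 1."""
    norm0 = math.sqrt(sum(a * a for a in w0))
    norm1 = math.sqrt(sum(b * b for b in w1))
    if norm0 == 0.0 or norm1 == 0.0:
        # No direction to interpolate along: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(w0, w1)]
    cos_omega = sum(a * b for a, b in zip(w0, w1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)                # angle between the two vectors
    if omega < eps:
        # Nearly parallel vectors: SLERP degenerates to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(w0, w1)]
    c0 = math.sin((1 - t) * omega) / math.sin(omega)
    c1 = math.sin(t * omega) / math.sin(omega)
    return [c0 * a + c1 * b for a, b in zip(w0, w1)]


model_a = [1.0, 0.0]  # toy stand-ins for flattened model weights
model_b = [0.0, 1.0]
merged = slerp(model_a, model_b, t=0.5)
merged_norm = math.sqrt(sum(v * v for v in merged))
```

For these two unit vectors the midpoint stays on the unit circle, whereas plain averaging would give `[0.5, 0.5]` with a norm of about 0.71. The function takes exactly two inputs, matching the "can only merge two models" limitation in the table.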

Large Language Model: Challenges and Solutions

AGI Discussion and Social Impact

OpenAI Roadmap

  • AMA (ask me anything) with OpenAI on Reddit🗣️ [1 Nov 2024]
  • Humanloop Interview 2023🗣️ : 🗄️ [29 May 2023]
  • Model Spec: Desired behavior for the models in the OpenAI API and ChatGPT ✍️ [8 May 2024] ✍️: takeaway
  • o3/o4-mini/GPT-5🗣️: we are going to release o3 and o4-mini after all, probably in a couple of weeks, and then do GPT-5 in a few months. [4 Apr 2025]
  • OpenAI’s CEO Says the Age of Giant AI Models Is Already Over ✍️ [17 Apr 2023]
  • Q* (pronounced as Q-Star): The model, called Q* was able to solve basic maths problems it had not seen before, according to the tech news site the Information. ✍️ [23 Nov 2023]
  • Reflections on OpenAI🗣️: OpenAI culture. Bottoms-up decision-making. Progress is iterative, not driven by a rigid roadmap. Direction changes quickly based on new information. Slack is the primary communication tool. [16 Jul 2025]
  • Sam Altman reveals in an interview with Bill Gates (2 days ago) what's coming up in GPT-4.5 (or GPT-5): Potential integration with other modes of information beyond text, better logic and analysis capabilities, and consistency in performance over the next two years. ✍️ [12 Jan 2024]

OpenAI Models

  • GPT 1: Decoder-only model. 117 million parameters. [Jun 2018] git github stars
  • GPT 2: Increased model size and parameters. 1.5 billion. [14 Feb 2019] git github stars
  • GPT 3: Introduced few-shot learning. 175B. [11 Jun 2020] git github stars
  • GPT 3.5: 3 variants each with 1.3B, 6B, and 175B parameters. [15 Mar 2022] Estimate the embedding size of OpenAI's gpt-3.5-turbo to be about 4,096
  • ChatGPT: GPT-3 fine-tuned with RLHF. 20B or 175B. unverified ✍️ [30 Nov 2022]
  • GPT 4: Mixture of Experts (MoE). 8 models with 220 billion parameters each, for a total of about 1.76 trillion parameters. unverified ✍️ [14 Mar 2023]
  • GPT-4V(ision) system card: ✍️ [25 Sep 2023] / ✍️
  • GPT-4: The Dawn of LMMs📑: Preliminary Explorations with GPT-4V(ision) [29 Sep 2023]
    • GPT-4 details leaked: GPT-4 is a language model with approximately 1.8 trillion parameters across 120 layers, 10x larger than GPT-3. It uses a Mixture of Experts (MoE) model with 16 experts, each having about 111 billion parameters. Utilizing MoE allows for more efficient use of resources during inference, needing only about 280 billion parameters and 560 TFLOPs, compared to the 1.8 trillion parameters and 3,700 TFLOPs required for a purely dense model.
    • The model is trained on approximately 13 trillion tokens from various sources, including internet data, books, and research papers. To reduce training costs, OpenAI employs tensor and pipeline parallelism, and a large batch size of 60 million. The estimated training cost for GPT-4 is around $63 million. ✍️ [Jul 2023]
  • GPT-4o✍️: o stands for Omni. 50% cheaper. 2x faster. Multimodal input and output capabilities (text, audio, vision). supports 50 languages. [13 May 2024] / GPT-4o mini✍️: 15 cents per million input tokens, 60 cents per million output tokens, MMLU of 82%, and fast. [18 Jul 2024]
  • A new series of reasoning models✍️: The complex reasoning-specialized model, OpenAI o1 series, excels in math, coding, and science, outperforming GPT-4o on key benchmarks. [12 Sep 2024] / git: Awesome LLM Strawberry (OpenAI o1) github stars
  • A Comparative Study on Reasoning Patterns of OpenAI's o1 Model📑: 6 types of o1 reasoning patterns (i.e., Systematic Analysis (SA), Method Reuse (MR), Divide and Conquer (DC), Self-Refinement (SR), Context Identification (CI), and Emphasizing Constraints (EC)). the most commonly used reasoning patterns in o1 are DC and SR [17 Oct 2024]
  • o3-mini system card✍️: The first model to reach Medium risk on Model Autonomy. [31 Jan 2025]
  • OpenAI o1 system card✍️ [5 Dec 2024]
  • o3 preview✍️: 12 Days of OpenAI [20 Dec 2024]
  • o3/o4-mini✍️ [16 Apr 2025]
  • GPT-4.5✍️: greater “EQ”. better unsupervised learning (world model accuracy and intuition). scalable training from smaller models. ✍️ [27 Feb 2025]
  • GPT-4o: 4o image generation✍️: create photorealistic output, replacing DALL·E 3 [25 Mar 2025]
  • GPT-4.1 family of models✍️: GPT‑4.1, GPT‑4.1 mini, and GPT‑4.1 nano can process up to 1 million tokens of context. enhanced coding abilities, improved instruction following. [14 Apr 2025]
  • gpt-image-1✍️: Image generation model API with designing and editing [23 Apr 2025]
  • gpt-oss: gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI. [Jun 2025] github stars
  • GPT-5✍️: Real-time router orchestrating multiple models. GPT‑5 is the new default in ChatGPT, replacing GPT‑4o, OpenAI o3, OpenAI o4-mini, GPT‑4.1, and GPT‑4.5. [7 Aug 2025]
  • GPT 5.1✍️: GPT-5.1 Auto, GPT-5.1 Instant, and GPT-5.1 Thinking. Better instruction-following, More customization for tone and style. [12 Nov 2025]
  • GPT-5.1 Codex Max✍️: agentic coding model for long-running, detailed work. [19 Nov 2025]
  • GPT 5.2✍️: 70.9% GDPval (knowledge work vs professionals), major gains over GPT-5.1 on SWE-Bench, GPQA Diamond, AIME 2025, ARC-AGI reasoning, and advanced coding/vision tasks. [11 Dec 2025]
  • GPT-5.4✍️: Thinking, coding, and native computer-use in a single model. [Mar 2026]

OpenAI Products

  • Agents SDK & Responses API✍️: Responses API (Chat Completions + Assistants API), Built-in tools (web search, file search, computer use), Agents SDK for multi-agent workflows, agent workflow observability tools [11 Mar 2025] git
  • Building ChatGPT Atlas✍️: OpenAI's approach to building Atlas. OWL: OpenAI’s Web Layer. Mojo Protocol. [Oct 2025]
  • ChatGPT agent✍️: Web-browsing, File-editing, Terminal, Email, Spreadsheet, Calendar, API-calling, Automation, Task-chaining, Reasoning. [17 Jul 2025]
  • ChatGPT can now see, hear, and speak✍️: It has recently been updated to support multimodal capabilities, including voice and image. [25 Sep 2023] Whisper / CLIP github stars github stars
  • ChatGPT Function calling [Jun 2023] > Azure OpenAI supports function calling. ✍️
  • ChatGPT Memory✍️: Remembering things you discuss across all chats saves you from having to repeat information and makes future conversations more helpful. [Apr 2024]
  • ChatGPT Plugin✍️ [23 Mar 2023]
  • CriticGPT✍️: a version of GPT-4 fine-tuned to critique code generated by ChatGPT [27 Jun 2024]
  • Codex 5.3✍️: OpenAI Codex with enhanced coding and agentic reasoning. [5 Feb 2026]
  • Custom instructions✍️: In a nutshell, the Custom Instructions feature is a cross-session memory that allows ChatGPT to retain key instructions across chat sessions. [20 Jul 2023]
  • DALL·E 3✍️ : In September 2023, OpenAI announced their latest image model, DALL-E 3 git [Sep 2023] github stars
  • deep research✍️: An agent that uses reasoning to synthesize large amounts of online information and complete multi-step research tasks [2 Feb 2025]
  • GPT-3.5 Turbo Fine-tuning✍️ Fine-tuning for GPT-3.5 Turbo is now available, with fine-tuning for GPT-4 coming this fall. [22 Aug 2023]
  • Introducing the GPT Store✍️: Roll out the GPT Store to ChatGPT Plus, Team and Enterprise users GPTs [10 Jan 2024]
  • New embedding models✍️: text-embedding-3-small (embedding sizes: 512, 1536); text-embedding-3-large (embedding sizes: 256, 1024, 3072) [25 Jan 2024]
  • Open AI Enterprise: Removes GPT-4 usage caps, and performs up to two times faster ✍️ [28 Aug 2023]
  • OpenAI DevDay 2023✍️: GPT-4 Turbo with 128K context, Assistants API (Code interpreter, Retrieval, and function calling), GPTs (Custom versions of ChatGPT: ✍️), Copyright Shield, Parallel Function Calling, JSON Mode, Reproducible outputs [6 Nov 2023]
  • OpenAI DevDay 2024✍️: Real-time API (speech-to-speech), Vision Fine-Tuning, Prompt Caching, and Distillation (fine-tuning a small language model using a large language model). ✍️ [1 Oct 2024]
  • OpenAI DevDay 2025✍️: ChatGPT Apps + SDK, AgentKit, GPT-5 Pro, Sora 2 video API, upgraded Codex ✍️ [6 Oct 2025]
  • OpenAI Frontier✍️: OpenAI’s largest, most capable model tier. [Feb 2026]
  • Operator✍️: GUI Agent. Operates embedded virtual environments. Specialized model (Computer-Using Agent). [23 Jan 2025]
  • Prism✍️: AI-native workspace for scientists to write and collaborate on research. [27 Jan 2026]
  • SearchGPT✍️: AI search [25 Jul 2024] > ChatGPT Search✍️ [31 Oct 2024]
  • Sora✍️ Text-to-video model. Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt. [15 Feb 2024]
  • Structured Outputs in the API✍️: a new feature designed to ensure model-generated outputs will exactly match JSON Schemas provided by developers. [6 Aug 2024]

Anthropic AI Products

  • Agent Skills: A way to package instructions, scripts, and resources into “skills” that Claude agents can dynamically load. [16 Oct 2025]
  • Anthropic CLI (Claude Code): The official command-line interface that lives in your project directory, enabling natural-language code generation, refactoring, and Git automation. [24 Feb 2025]
  • Bringing Code Review to Claude Code✍️: Multi-agent PR review dispatches parallel agents and verifies bugs before posting findings. [9 Mar 2026]
  • Put Claude to work on your computer✍️: Dispatch carries tasks across phone and desktop while Claude operates your computer. [23 Mar 2026]
  • Anthropic killed Tool calling📺: Programmatic Tool Calling / Dynamic Filtering — what changed in Anthropic’s API. [Feb 2026]
  • Claude Agent SDK: A toolkit for building multi-step, tool-using agents using the Claude API. [29 Sep 2025]
  • Claude Opus 4.6✍️: Advanced reasoning and coding flagship model. [5 Feb 2026]
  • Claude Sonnet 4.6✍️: Balanced performance and speed model. [17 Feb 2026]
  • Constitutional AI (CAI): Anthropic’s training framework that uses a “constitution” (a set of written principles) together with AI-generated feedback to align models toward harmlessness. [15 Dec 2022]
  • Cowork: AI agent that accesses local files to automate multi-step desktop tasks like organizing, reporting, and data extraction. [Jan 2026]
  • Claude Code Security✍️: Claude Code on the web for scanning codebases and suggesting security patches. [Feb 2026]
  • Detecting and preventing distillation attacks✍️: 16M+ fraudulent exchanges scraped from Claude; Anthropic’s detection and prevention. [Feb 2026]
  • Frontier AI Safety Research: Foundational research into AI risks, alignment, and interpretability.
  • Model Context Protocol (MCP): An open standard for connecting AI assistants to external systems (data, tools, etc.) securely and scalably. [25 Nov 2024]
  • Programmatic Tool Calling: Enables Claude to write orchestration code (e.g., Python) to call multiple tools in a sequence, improving efficiency. [24 Nov 2025]
  • Tool Use & Agent Orchestration: Advanced tool‑use framework for Claude agents, allowing dynamic API discovery and execution in complex tasks. [24 Nov 2025]

Google AI Products

  • AlphaMissense: A machine learning tool that classifies the effects of 71 million 'missense' mutations in the human genome to help pinpoint disease causes. [2025]
  • CodeMender: An autonomous AI agent leveraging Gemini Deep Think models to automatically find, debug, and fix complex software security vulnerabilities. [Oct 2025]
  • Firebase Studio: A web-based IDE that uses Gemini to assist in building, refactoring, and troubleshooting full-stack web and mobile applications. [7 May 2025]
  • Gemini CLI: An open-source terminal interface for "vibecoding" that brings Gemini 3 Pro capabilities directly to the command line for script generation and automation. [25 Jun 2025]
  • Gemini Code Assist: An enterprise-grade AI assistant for IDEs (VS Code, IntelliJ) that offers context-aware code completion, generation, and chat using Gemini models. [20 May 2025]
  • Gemini Code Assist for GitHub: A specialized agent that acts as a code reviewer on Pull Requests, identifying bugs, style issues, and suggesting fixes automatically. [20 May 2025]
  • Google AI for Developers: A suite of research tools including AI-powered documentation search and code explanation to accelerate learning and implementation. [Jul 2024]
  • Google Antigravity: An "agent-first" IDE platform announced with Gemini 3 that gives autonomous agents direct control over editors, terminals, and browsers to build and verify software. [18 Nov 2025]
  • Introducing "vibe design" with Stitch✍️: AI-native design canvas for turning prompts and images into UI drafts. [18 Mar 2026]
  • Jules: An autonomous coding agent that integrates with GitHub to plan, execute, and verify multi-step coding tasks like bug fixing and dependency management. [20 May 2025]
  • NotebookLM: An AI-powered research and thinking partner that synthesizes complex information and automates online research using the Deep Research agent feature. [13 Nov 2025]
  • SIMA 2: (Scalable Instructable Multiworld Agent) A research agent that explores and learns to play across a variety of 3D video game environments, aimed at general-purpose robotics. [13 Nov 2025]
  • Vertex AI Codey: A family of foundation models (Code-Bison, Code-Gecko) optimized for code generation and completion, accessible via API. [29 Jun 2023]

Context constraints

  • Context Rot: How Increasing Input Tokens Impacts LLM Performance [14 Jul 2025]
  • Doc-to-LoRA: Learning to Instantly Internalize Contexts📑: Generates LoRA adapters from long context to cut repeated context cost. [Feb 2026]
  • DroPE✍️: Extends LLM context by dropping positional embeddings and brief recalibration, improving long-context performance without retraining. Sakana AI. [13 Dec 2025]
  • Giraffe📑: Adventures in Expanding Context Lengths in LLMs. A new truncation strategy for modifying the basis for the position encoding. ✍️ [2 Jan 2024]
  • Introducing 100K Context Windows✍️: hundreds of pages, Around 75,000 words; [11 May 2023] demo Anthropic Claude
  • Leave No Context Behind📑: Efficient Infinite Context Transformers with Infini-attention. Infini-attention incorporates a compressive memory into the vanilla attention mechanism and integrates both local and global attention. [10 Apr 2024]
  • LLM Maybe LongLM📑: Self-Extend LLM Context Window Without Tuning. With only four lines of code modification, the proposed method can effortlessly extend existing LLMs' context window without any fine-tuning. [2 Jan 2024]
  • Lost in the Middle: How Language Models Use Long Contexts📑:💡[6 Jul 2023]
    • Best performance when relevant information is at the beginning or end of the context
    • Too many retrieved documents will harm performance
    • Performance decreases as context length increases
  • “Needle in a Haystack” Analysis [21 Nov 2023]: Context Window Benchmarks; Claude 2.1 (200K Context Window) vs GPT-4; Long context prompting for Claude 2.1✍️ adding just one sentence, “Here is the most relevant sentence in the context:”, to the prompt resulted in near complete fidelity throughout Claude 2.1’s 200K context window. [6 Dec 2023] github stars
  • Ring Attention📑: 1. Ring Attention leverages blockwise computation of self-attention to distribute long sequences across multiple devices while overlapping the communication of key-value blocks with the computation of blockwise attention. 2. It reduces the memory requirements of Transformers, enabling training on sequences more than 500 times longer than prior memory-efficient state-of-the-art methods, including sequences that exceed 100 million in length, without making approximations to attention. 3. It is proposed as an enhancement to the Blockwise Parallel Transformer (BPT) framework. git [3 Oct 2023] github stars
  • Rotary Positional Embedding (RoPE)📑:💡/ ✍️ / 🗄️ [20 Apr 2021]
    • How is this different from the sinusoidal embeddings used in "Attention is All You Need"?
    • Sinusoidal embeddings apply to each coordinate individually, while rotary embeddings mix pairs of coordinates
    • Sinusoidal embeddings add a cos or sin term, while rotary embeddings use a multiplicative factor.
    • Rotary embeddings are applied to the queries (Q) and keys (K) inside attention, not to V or the input embeddings.
    • ALiBi📑: Attention with Linear Biases. ALiBi applies a bias directly to the attention scores. [27 Aug 2021]
    • NoPE: Transformer Language Models without Positional Encodings Still Learn Positional Information📑: No position embedding. [30 Mar 2022]
  • Sparse Attention: Generating Long Sequences with Sparse Transformer📑:💡Sparse attention computes scores for a subset of pairs, selected via a fixed or learned sparsity pattern, reducing calculation costs. Strided attention: image, audio / Fixed attention:text ✍️ / git [23 Apr 2019] github stars
  • Structured Prompting: Scaling In-Context Learning to 1,000 Examples📑: [13 Dec 2022]
    • Microsoft's Structured Prompting allows thousands of examples, by first concatenating examples into groups, then inputting each group into the LM. The hidden key and value vectors of the LM's attention modules are cached. Finally, when the user's unaltered input prompt is passed to the LM, the cached attention vectors are injected into the hidden layers of the LM.
    • This approach wouldn't work with OpenAI's closed models, because it needs access to the [keys] and [values] in the transformer internals, which they do not expose. You could implement it yourself on open-source models. ✍️ [07 Feb 2023]
  • Zig-Zag Ring Attention✍️: Long-context attention pattern for more memory-efficient distributed inference and training. [18 Mar 2026]
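
The rotary embedding notes above (pairs of coordinates are mixed, the factor is multiplicative, and it is applied to Q and K) can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `rope`, the interleaved even/odd pairing, and the `base=10000.0` default are assumptions chosen for the example, and real models may pair coordinates differently.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each (even, odd) coordinate pair of x by a position-dependent angle.

    x has shape [n_seq, d] with d even. Name and pairing layout are
    illustrative assumptions, not taken from any specific library.
    """
    n_seq, d = x.shape
    assert d % 2 == 0, "head dimension must be even"
    pos = np.arange(n_seq)[:, None]               # [n_seq, 1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)  # [d/2], one frequency per pair
    angles = pos * inv_freq                       # [n_seq, d/2]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # 2D rotation of every coordinate pair (multiplicative, not additive)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
# Use the same query and key vector at every position to expose the
# relative-position property of the attention scores.
q = np.tile(rng.normal(size=16), (8, 1))
k = np.tile(rng.normal(size=16), (8, 1))
q_rot, k_rot = rope(q), rope(k)  # applied to Q and K only, V is untouched
scores = q_rot @ k_rot.T         # [n_seq, n_seq]
```

Because each pair is only rotated, vector norms are preserved, and with identical content vectors the score between positions m and n depends only on the offset m - n (for example, `scores[2, 0]` matches `scores[5, 3]`).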

Numbers LLM

Trustworthy, Safe and Secure LLM

  • 20 AI Governance Papers📑 [Jan 2025]
  • A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models📑: A comprehensive survey of over thirty-two techniques developed to mitigate hallucination in LLMs [2 Jan 2024]
  • AI models collapse when trained on recursively generated data: Model Collapse. We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. [24 Jul 2024]
  • Alignment Faking✍️: LLMs may pretend to align with training objectives during monitored interactions but revert to original behaviors when unmonitored. [18 Dec 2024] | demo: ✍️ | Alignment Science Blog
  • An Approach to Technical AGI Safety and Security📑: Google DeepMind. We focus on technical solutions to misuse and misalignment, two of four key AI risks (the others being mistakes and structural risks). To prevent misuse, we limit access to dangerous capabilities through detection and security. For misalignment, we use two defenses: model-level alignment via training and oversight, and system-level controls like monitoring and access restrictions. ✍️ [2 Apr 2025]
  • Anthropic Many-shot jailbreaking✍️: simple long-context attack, Bypassing safety guardrails by bombarding them with unsafe or harmful questions and answers. [3 Apr 2024]
  • Extracting Concepts from GPT-4✍️: Sparse Autoencoders identify key features, enhancing the interpretability of language models like GPT-4. They extract 16 million interpretable features by training the autoencoder on GPT-4's internal activations. [6 Jun 2024]
  • FactTune📑: A procedure that enhances the factuality of LLMs without the need for human feedback. It fine-tunes a separate LLM using methods such as DPO and RLAIF, guided by preferences generated by FActScore. [14 Nov 2023] FActScore works by breaking down a generation into a series of atomic facts and then computing the percentage of these atomic facts supported by a reliable knowledge source. github stars
  • Frontier Safety Framework: Google DeepMind, Frontier Safety Framework, a set of protocols designed to identify and mitigate potential harms from future AI systems. [17 May 2024]
  • Google SAIF✍️: Secure AI Framework for managing AI security risks. [05 Nov 2025]
  • Guardrails Hub: Guardrails for common LLM validation use cases
  • Hallucination Index: w.r.t. RAG, Testing LLMs with short (≤5k), medium (5k–25k), and long (40k–100k) contexts to evaluate improved RAG performance [Nov 2023]
  • Hallucination Leaderboard: Evaluate how often an LLM introduces hallucinations when summarizing a document. [Nov 2023]
  • Hallucinations📑: A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions [9 Nov 2023]
  • Large Language Models Reflect the Ideology of their Creators📑: When prompted in Chinese, all LLMs favor pro-Chinese figures; Western LLMs similarly align more with Western values, even in English prompts. [24 Oct 2024]
  • LlamaFirewall: Scans and filters AI inputs to block prompt injections and malicious content. [29 Apr 2025]
  • LLMs Will Always Hallucinate, and We Need to Live With This📑:💡LLMs cannot completely eliminate hallucinations through architectural improvements, dataset enhancements, or fact-checking mechanisms due to fundamental mathematical and logical limitations. [9 Sep 2024]
  • Machine unlearning: Techniques to remove specific data from trained machine learning models.
  • Mapping the Mind of a Large Language Model: Anthropic. A technique called "dictionary learning" can help understand model behavior by identifying which features respond to a particular input, thus providing insight into the model's "reasoning." ✍️ [21 May 2024]
  • NeMo Guardrails: Building Trustworthy, Safe and Secure LLM Conversational Systems [Apr 2023] github stars
  • NIST AI Risk Management Framework: NIST released the first complete version of the NIST AI RMF Playbook on March 30, 2023
  • OpenAI Weak-to-strong generalization📑:💡In the superalignment problem, humans must supervise models that are much smarter than them. The paper discusses supervising a GPT-4 or 3.5-level model using a GPT-2-level model. It finds that while strong models supervised by weak models can outperform the weak models, they still don’t perform as well as when supervised by ground truth. git [14 Dec 2023] github stars
  • Political biases of LLMs📑: From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models. [15 May 2023]
  • Red Teaming: The term red teaming has historically described systematic adversarial attacks for testing security vulnerabilities. LLM red teamers should be a mix of people with diverse social and professional backgrounds, demographic groups, and interdisciplinary expertise that fits the deployment context of your AI system. ✍️
  • The Foundation Model Transparency Index📑: A comprehensive assessment of the transparency of foundation model developers ✍️ [19 Oct 2023]
  • The Instruction Hierarchy📑: Training LLMs to Prioritize Privileged Instructions. OpenAI highlights the need for instruction privileges in LLMs to prevent attacks and proposes training models to conditionally follow lower-level instructions based on their alignment with higher-level instructions. [19 Apr 2024]
  • Tracing the thoughts of a large language model✍️:💡Claude 3.5 Haiku 1. Universal Thought Processing (Multiple Languages): Shared concepts exist across languages and are then translated into the respective language. 2. Advance Planning (Composing Poetry): Despite generating text word by word, it anticipates rhyming words in advance. 3. Fabricated Reasoning (Math): Produces plausible-sounding arguments even when given an incorrect hint. [27 Mar 2025]
  • Trustworthy LLMs📑: Comprehensive overview for assessing LLM trustworthiness; Reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. [10 Aug 2023]
  • Vibe Hacking✍️: Anthropic reports vibe-hacking attempts. [14 Nov 2025]

Large Language Model Is: Abilities

Reasoning

Survey and Reference

Survey on Large Language Models

Additional Topics: A Survey of LLMs

LLM Research (Ranked by cite count >=150)

  • LLM Papers (≥150 citations)📑: High-citation CS papers (≥150 citations) across 35 LLM topic areas — reasoning, RAG, agents, PEFT, RLHF, scaling laws, multimodal, and more — fetched from Semantic Scholar and ranked by citation count.

Business use cases

Build an LLMs from scratch: picoGPT and lit-gpt

  • An unnecessarily tiny implementation of GPT-2 in NumPy. picoGPT: Transformer Decoder [Jan 2023] github stars
q = x @ w_q # [n_seq, n_embd] @ [n_embd, n_embd] -> [n_seq, n_embd]
k = x @ w_k # [n_seq, n_embd] @ [n_embd, n_embd] -> [n_seq, n_embd]
v = x @ w_v # [n_seq, n_embd] @ [n_embd, n_embd] -> [n_seq, n_embd]

# In picoGPT, w_q, w_k and w_v are combined into a single matrix w_fc,
# so one matmul yields q, k and v concatenated along the last axis
x = x @ w_fc # [n_seq, n_embd] @ [n_embd, 3*n_embd] -> [n_seq, 3*n_embd]
# q, k and v are then recovered by splitting x into three [n_seq, n_embd] chunks
  • 4 LLM Text Generation Strategies: Greedy strategy, Multinomial sampling strategy, Beam search, Contrastive search [27 Sep 2025]
  • Andrej Karpathy📺: Reproduce the GPT-2 (124M) from scratch. [June 2024] / SebastianRaschka📺: Developing an LLM: Building, Training, Finetuning [June 2024]
  • Beam Search [1977] in Transformers is an inference algorithm that maintains the beam_size most probable sequences until the end token appears or the maximum sequence length is reached. If beam_size (k) is 1, it is greedy search; as k grows to cover every candidate sequence, it becomes exhaustive search. 🤗 [Mar 2022]
  • Build a Large Language Model (From Scratch):🏆Implementing a ChatGPT-like LLM from scratch, step by step github stars
  • Einsum is All you Need: Einstein Summation [5 Feb 2018]
  • lit-gpt: Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed. git [Mar 2023] github stars
  • llama3-from-scratch: Implementing Llama3 from scratch [May 2024] github stars
  • llm.c: LLM training in simple, raw C/CUDA [Apr 2024] github stars | Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20 git
  • nanochat: a full-stack implementation of an LLM [Oct 2025] github stars
  • nanoGPT:💡Andrej Karpathy [Dec 2022] | nanoMoE [Dec 2024] github stars github stars
  • nanoVLM: 🤗 The simplest, fastest repository for training/finetuning small-sized VLMs. [May 2025]
  • pix2code: Generating Code from a Graphical User Interface Screenshot. Trained dataset as a pair of screenshots and simplified intermediate script for HTML, utilizing image embedding for CNN and text embedding for LSTM, encoder and decoder model. Early adoption of image-to-code. [May 2017] github stars
  • Screenshot to code: Turning Design Mockups Into Code With Deep Learning [Oct 2017] ✍️ github stars
  • Spreadsheets-are-all-you-need: Spreadsheets-are-all-you-need implements the forward pass of GPT2 entirely in Excel using standard spreadsheet functions. [Sep 2023] github stars
  • Transformer Explainer: an open-source interactive tool to learn about the inner workings of a Transformer model (GPT-2) git [8 Aug 2024]
  • Umar Jamil github:💡LLM Model explanation / building a model from scratch 📺
  • You could have designed state of the art positional encoding: Binary Position Encoding, Sinusoidal positional encoding, Absolute vs Relative Position Encoding, Rotary Positional encoding [17 Nov 2024]
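
The beam search entry above can be illustrated with a small pure-Python sketch. Everything here is hypothetical and for illustration only: `beam_search`, `toy_next_log_probs`, and the `TABLE` of next-token probabilities stand in for a real model, and the sketch omits length normalization and other refinements found in production decoders.

```python
import math

# Hypothetical toy "model": next-token probabilities keyed by the prefix so far.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}

def toy_next_log_probs(seq):
    """Return {token: log-probability}; unknown prefixes always emit <eos>."""
    probs = TABLE.get(seq, {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def beam_search(next_log_probs, beam_size, max_len, eos="<eos>"):
    """Keep the beam_size highest-scoring prefixes at every step."""
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live prefix by every possible next token.
        candidates = [
            (seq + (tok,), score + lp)
            for seq, score in beams
            for tok, lp in next_log_probs(seq).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # Sequences ending in the end token leave the beam.
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # include prefixes cut off by max_len
    return max(finished, key=lambda c: c[1])

greedy = beam_search(toy_next_log_probs, beam_size=1, max_len=3)
beam = beam_search(toy_next_log_probs, beam_size=2, max_len=3)
```

With `beam_size=1` the search commits to the locally best first token `a` and ends with total probability 0.30, while `beam_size=2` keeps `b` alive and finds the sequence `b x <eos>` with probability 0.36.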

Classification of Attention

visual attention

LLM Materials in Japanese

LLM Materials in Korean

Learning and Supplementary Materials