A production-grade, resource-constrained Small Language Model (SLM) Agent featuring deterministic guardrails, Reciprocal Rank Fusion (RRF), and extreme VRAM optimization.
TinyAgent-SLM is engineered to democratize advanced AI capabilities by deploying a robust Retrieval-Augmented Generation (RAG) pipeline on extreme edge devices.
The core achievement of this project is successfully running a full 3B-parameter LLM (Llama-3.2-3B-Instruct), alongside a dense embedding model (BAAI/bge-small) and a cross-encoder reranker, entirely locally on a single 4GB VRAM GPU (RTX 3050) without Out-Of-Memory (OOM) crashes.
This repository is the final deployment phase (Phase 3) of my broader initiative to build full-stack, resource-efficient AI agents from the ground up. To see the complete lifecycle of how this model was adapted and aligned before edge deployment, please explore the related repositories:
- 🔗 Phase 1: TinyAgent-Llama3-FineTuning
- Domain Adaptation: Fine-tuned the base Llama-3 model using QLoRA to inject custom domain knowledge and improve intent recognition accuracy on consumer hardware.
- 🔗 Phase 2: Llama3-DPO-Alignment
- Preference Alignment: Applied Direct Preference Optimization (DPO) to constrain the fine-tuned model's behavior, ensuring safe, harmless, and human-aligned responses.
- 🔗 Phase 3: TinyAgent-SLM (This Repository)
- Edge Grounding: The final hardware-constrained deployment. Integrates the aligned SLM with a local VectorDB, RRF retrieval, and deterministic guardrails to prevent hallucinations in real-world use.
Bypassed standard high-level abstractions to work directly with Hugging Face's core `AutoModelForCausalLM` API and enforce strict memory boundaries:
- 4-Bit Double Quantization: Implemented `bitsandbytes` NF4 quantization to drastically reduce the model footprint.
- Strict VRAM Leashing: Hard-coded `max_memory={0: "2048MB"}` to prevent the greedy `accelerate` library from crashing the system.
- CPU Offloading: Enabled `llm_int8_enable_fp32_cpu_offload=True` to seamlessly spill excess model shards into system RAM (pagefile) during the initial loading phase.
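The loading configuration described above can be sketched roughly as follows. The model name comes from this README; the exact `"cpu"` offload budget is an illustrative assumption, not the repository's actual value:

```python
import torch
from transformers import BitsAndBytesConfig

# NF4 4-bit quantization with double quantization (the quantization
# constants are themselves quantized) to shrink the 3B model footprint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
    # Allow FP32 modules that exceed the VRAM cap to spill into system RAM.
    llm_int8_enable_fp32_cpu_offload=True,
)

# Cap GPU 0 at 2048 MB so accelerate never over-allocates the 4GB card;
# the "cpu" entry is the offload budget in system RAM (illustrative value).
max_memory = {0: "2048MB", "cpu": "16GB"}

# Usage (downloads the weights; needs the GPU described above):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-3.2-3B-Instruct",
#     quantization_config=bnb_config,
#     device_map="auto",
#     max_memory=max_memory,
# )
```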
- Query Expansion: The SLM autonomously rewrites the user's initial prompt into multiple perspectives to maximize retrieval recall.
- Reciprocal Rank Fusion (RRF): Merges results from multiple query searches to bubble up the most universally relevant context.
- Cross-Encoder Reranking: Utilizes `ms-marco-MiniLM-L-6-v2` for fine-grained, one-to-one relevance scoring, filtering out noise.
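The RRF merge step above can be sketched as follows. The constant `k=60` follows the original RRF formulation; the function and document names are illustrative, not the repository's actual API:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of doc IDs into one fused ranking.

    Each document scores 1 / (k + rank) per list it appears in, so
    documents ranked well across many expanded queries bubble up.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Results for three expanded variants of the same user query:
fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
])
print(fused[0])  # "doc_b": two top-1 hits and one rank-2 beat doc_a
```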
The agent evaluates the retrieved local context. If it deems the context "Irrelevant", it dynamically extracts the core entity and falls back to a Wikipedia API search, preventing dead-ends.
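The fallback decision above can be sketched like this. The entity-extraction heuristic here is a crude stand-in (the actual agent asks the SLM to extract the entity and to judge relevance); the function names are illustrative:

```python
import re

def extract_core_entity(query):
    """Crude heuristic: take the longest capitalized phrase as the entity.
    (Stand-in for the SLM-based extraction the agent actually performs.)"""
    phrases = re.findall(r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*", query)
    return max(phrases, key=len) if phrases else query

def get_context(query, local_context, judge_relevant):
    """Use the local context if the judge accepts it, else hit Wikipedia."""
    if local_context and judge_relevant(query, local_context):
        return local_context
    # Fallback path: needs network and the `wikipedia` pip package
    # (installed in the setup step below), so imported lazily here.
    import wikipedia
    titles = wikipedia.search(extract_core_entity(query), results=1)
    return wikipedia.summary(titles[0], sentences=3) if titles else ""
```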
Implements a strict post-generation verification layer (`guardrails.py`) that forces the LLM's output to align logically with the retrieved context, specifically auditing numerical data and factual claims to mitigate hallucinations.
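The numerical-audit idea can be illustrated with a minimal sketch (a simplified stand-in, not the actual code in `guardrails.py`): flag any number stated in the generated answer that never appears in the retrieved context.

```python
import re

def audit_numbers(answer, context):
    """Return numbers stated in the answer that the context doesn't support.

    A deterministic post-generation check: every numeric claim in the
    answer must literally appear in the retrieved context, otherwise it
    is flagged as a potential hallucination.
    """
    nums = lambda text: set(re.findall(r"\d+(?:\.\d+)?", text))
    return sorted(nums(answer) - nums(context))

context = "The Eiffel Tower is 330 metres tall and was completed in 1889."
assert audit_numbers("It is 330 m tall, finished in 1889.", context) == []
assert audit_numbers("It is 324 m tall.", context) == ["324"]
```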
```
User Query ──> [Query Expansion] ──> [Vector DB Search (Chroma)]
                                              │
                                        [RRF Merging]
                                              │
                                   [Cross-Encoder Reranking]
                                              │
[Wikipedia Fallback] <──(Irrelevant)── [Context Reflection]
                                              │
                                          (Relevant)
                                              │
          [Llama-3.2-3B Generation] ──> [Deterministic Guardrails] ──> Final Output
```
Prerequisites:
- Anaconda / Miniconda
- Windows OS (tested)
- GPU with at least 4GB VRAM
```shell
# Clone the repository
git clone https://github.com/YunqiWang1/TinyAgent-SLM.git
cd TinyAgent-SLM

# Create and activate the environment
conda create -n ai_start python=3.10
conda activate ai_start

# Install dependencies
conda install -c conda-forge greenlet
pip install gradio wikipedia langchain-community langchain-text-splitters sentence-transformers chromadb transformers bitsandbytes accelerate

# Launch the app
python app.py
```
This repository contains the archived codebase for my research project completed in 2025. The code has been recently cleaned, documented, and migrated to this public repository for portfolio demonstration purposes.
