TinyAgent-SLM: MicroAgent-Reflection-Guard


A production-grade, resource-constrained Small Language Model (SLM) Agent featuring deterministic guardrails, Reciprocal Rank Fusion (RRF), and extreme VRAM optimization.

TinyAgent Demo

Overview

TinyAgent-SLM is engineered to democratize advanced AI capabilities by deploying a robust Retrieval-Augmented Generation (RAG) pipeline on extreme edge devices.

The core achievement of this project is successfully running a full 3B-parameter LLM (Llama-3.2-3B-Instruct), alongside a dense embedding model (BAAI/bge-small) and a cross-encoder reranker, entirely locally on a single 4GB VRAM GPU (RTX 3050) without Out-Of-Memory (OOM) crashes.


The TinyAgent Ecosystem: An End-to-End SLM Journey

This repository is the final deployment phase (Phase 3) of my broader initiative to build full-stack, resource-efficient AI agents from the ground up. To see the complete lifecycle of how this model was adapted and aligned before edge deployment, please explore the related repositories:

  • 🔗 Phase 1: TinyAgent-Llama3-FineTuning
    • Domain Adaptation: Fine-tuned the base Llama-3 model using QLoRA to inject custom domain knowledge and improve intent recognition accuracy on consumer hardware.
  • 🔗 Phase 2: Llama3-DPO-Alignment
    • Preference Alignment: Applied Direct Preference Optimization (DPO) to constrain the fine-tuned model's behavior, ensuring safe, harmless, and human-aligned responses.
  • 🔗 Phase 3: TinyAgent-SLM (This Repository)
    • Edge Grounding: The final hardware-constrained deployment. Integrates the aligned SLM with a local VectorDB, RRF retrieval, and deterministic guardrails to prevent hallucinations in real-world use.

Core Features & Engineering Highlights

1. Extreme VRAM Optimization (4GB VRAM Limit)

Bypassed high-level pipeline abstractions and loaded the model directly through Hugging Face's AutoModelForCausalLM API in order to enforce strict memory boundaries:

  • 4-Bit Double Quantization: Implemented bitsandbytes NF4 quantization with double quantization to drastically reduce the model footprint.
  • Strict VRAM Leashing: Hard-coded max_memory={0: "2048MB"} so that Accelerate's automatic device placement cannot over-allocate GPU memory and crash the system.
  • CPU Offloading: Enabled llm_int8_enable_fp32_cpu_offload=True to seamlessly spill excess model shards into system RAM (backed by the pagefile if RAM runs short) during the initial loading phase.
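The three measures above come together in a single loading configuration. The sketch below is illustrative, not the repository's exact code; it assumes transformers, bitsandbytes, accelerate, and torch are installed, and the CPU memory cap of "16GB" is a placeholder value:

```python
# Illustrative config for loading Llama-3.2-3B under a 4GB VRAM budget.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NF4 4-bit quantization
    bnb_4bit_use_double_quant=True,        # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_enable_fp32_cpu_offload=True, # spill overflow shards to system RAM
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    max_memory={0: "2048MB", "cpu": "16GB"},  # hard leash on GPU 0's allocation
)
```

Capping GPU 0 well below the physical 4GB leaves headroom for the embedding model, the reranker, and the KV cache during generation.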

2. Robust RAG Pipeline (Query Expansion, Rank Fusion & Filtering)

  • Query Expansion: The SLM autonomously rewrites the user's initial prompt into multiple perspectives to maximize retrieval recall.
  • Reciprocal Rank Fusion (RRF): Merges results from multiple query searches to bubble up the most universally relevant context.
  • Cross-Encoder Reranking: Utilizes ms-marco-MiniLM-L-6-v2 for fine-grained, one-to-one relevance scoring, filtering out noise.
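The RRF step in this pipeline is independent of any particular vector store: each expanded query produces a ranked list, and every list votes 1/(k + rank) for its documents. A minimal sketch (k=60 is the constant from the original RRF paper; document IDs are illustrative):

```python
# Reciprocal Rank Fusion: merge several ranked lists into one ranking.
def rrf_merge(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs; higher fused score = more relevant."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three expanded queries return overlapping result lists:
fused = rrf_merge([
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
])
print(fused[0])  # doc_b: it sits at or near the top of every list
```

Because RRF only uses ranks, not raw similarity scores, it fuses results from searches whose score scales are not comparable.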

3. Reflection & Dynamic Fallback

The agent evaluates the retrieved local context. If it deems the context "Irrelevant", it dynamically extracts the core entity and falls back to a Wikipedia API search, preventing dead-ends.
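The routing decision can be reduced to a small function. This is a hedged sketch, not the repository's exact logic: the 0.3 threshold and the shape of the reranker output are assumptions:

```python
# Reflection step: keep local context if the best reranker score clears a
# cutoff, otherwise signal a Wikipedia fallback for the extracted entity.
def reflect(reranked, threshold=0.3):
    """reranked: list of (text, score) pairs, best first."""
    if reranked and reranked[0][1] >= threshold:
        return "local", [text for text, _ in reranked]
    return "wikipedia_fallback", []

route, ctx = reflect([("Paris is the capital of France.", 0.82)])
print(route)  # local
route, ctx = reflect([("Unrelated snippet.", 0.05)])
print(route)  # wikipedia_fallback
```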

4. Deterministic Guardrails

Implements a strict post-generation verification layer (guardrails.py). It forces the LLM's output to logically align with the retrieved context, specifically auditing numerical data and factual claims to mitigate hallucinations.
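One piece of such a deterministic check, auditing numbers, can be done with plain string matching rather than another LLM call. The sketch below is illustrative of the idea, not the actual contents of guardrails.py:

```python
import re

# Flag every number in the model's answer that never appears in the
# retrieved context; such numbers are candidate hallucinations.
def audit_numbers(answer, context):
    ctx_numbers = set(re.findall(r"\d+(?:\.\d+)?", context))
    out_numbers = re.findall(r"\d+(?:\.\d+)?", answer)
    return [n for n in out_numbers if n not in ctx_numbers]

bad = audit_numbers(
    "The GPU has 6GB of VRAM and costs 249 dollars.",
    "The RTX 3050 ships with 4GB of VRAM at 249 dollars.",
)
print(bad)  # ['6'] -- the VRAM figure is unsupported by the context
```

Because the check is deterministic, it cannot itself hallucinate; a non-empty result can trigger regeneration or a refusal.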

System Architecture

User Query ──> [Query Expansion] ──> [Vector DB Search (Chroma)] 
                                              │
                                     [RRF Merging]
                                              │
                                     [Cross-Encoder Reranking]
                                              │
[Wikipedia Fallback] <──(Irrelevant)── [Context Reflection]
                                              │
                                          (Relevant)
                                              │
[Llama-3.2-3B Generation] ──> [Deterministic Guardrails] ──> Final Output

Installation & Usage

Prerequisites

  • Anaconda / Miniconda
  • Windows OS (tested)
  • GPU with at least 4GB VRAM

Setup

1. Clone the repository

git clone https://github.com/YunqiWang1/TinyAgent-SLM.git

cd TinyAgent-SLM

2. Create conda environment

conda create -n ai_start python=3.10

conda activate ai_start

3. Install dependencies

conda install -c conda-forge greenlet

pip install gradio wikipedia langchain-community langchain-text-splitters sentence-transformers chromadb transformers bitsandbytes accelerate

Run the App

python app.py

Archive Notice:

This repository contains the archived codebase for my research project completed in 2025. The code has been recently cleaned, documented, and migrated to this public repository for portfolio demonstration purposes.
