A production-grade, resource-constrained Small Language Model (SLM) Agent featuring deterministic guardrails, Reciprocal Rank Fusion (RRF), and extreme VRAM optimization.
TinyAgent-SLM is engineered to democratize advanced AI capabilities by deploying a robust Retrieval-Augmented Generation (RAG) pipeline on extreme edge devices.
The core achievement of this project is successfully running a full 3B-parameter LLM (Llama-3.2-3B-Instruct), alongside a dense embedding model (BAAI/bge-small) and a cross-encoder reranker, entirely locally on a single 4GB VRAM GPU (RTX 3050) without Out-Of-Memory (OOM) crashes.
This repository is the final deployment phase (Phase 3) of my broader initiative to build full-stack, resource-efficient AI agents from the ground up. To see the complete lifecycle of how this model was adapted and aligned before edge deployment, please explore the related repositories:
- 🔗 Phase 1: TinyAgent-Llama3-FineTuning
- Domain Adaptation: Fine-tuned the base Llama-3 model using QLoRA to inject custom domain knowledge and improve intent recognition accuracy on consumer hardware.
- 🔗 Phase 2: Llama3-DPO-Alignment
- Preference Alignment: Applied Direct Preference Optimization (DPO) to constrain the fine-tuned model's behavior, ensuring safe, harmless, and human-aligned responses.
- 🔗 Phase 3: TinyAgent-SLM (This Repository)
- Edge Grounding: The final hardware-constrained deployment. Integrates the aligned SLM with a local VectorDB, RRF retrieval, and deterministic guardrails to prevent hallucinations in real-world use.
Bypassed standard high-level abstractions to work directly with Hugging Face's core `AutoModelForCausalLM` API and enforce strict memory boundaries:
- 4-Bit Double Quantization: Implemented `bitsandbytes` NF4 quantization to drastically reduce the model footprint.
- Strict VRAM Leashing: Hard-coded `max_memory={0: "2048MB"}` to prevent the greedy `accelerate` library from crashing the system.
- CPU Offloading: Enabled `llm_int8_enable_fp32_cpu_offload=True` to seamlessly spill excess model shards into system RAM (pagefile) during the initial loading phase.
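The loading configuration described above can be sketched roughly as follows. The model name comes from this README; the exact `"cpu"` offload budget is an illustrative assumption, not the repository's actual value:

```python
import torch
from transformers import BitsAndBytesConfig

# NF4 4-bit quantization with double quantization (the quantization
# constants are themselves quantized) to shrink the 3B model footprint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
    # Allow FP32 modules that exceed the VRAM cap to spill into system RAM.
    llm_int8_enable_fp32_cpu_offload=True,
)

# Cap GPU 0 at 2048 MB so accelerate never over-allocates the 4GB card;
# the "cpu" entry is the offload budget in system RAM (illustrative value).
max_memory = {0: "2048MB", "cpu": "16GB"}

# Usage (downloads the weights; needs the GPU described above):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-3.2-3B-Instruct",
#     quantization_config=bnb_config,
#     device_map="auto",
#     max_memory=max_memory,
# )
```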
- Query Expansion: The SLM autonomously rewrites the user's initial prompt into multiple perspectives to maximize retrieval recall.
- Reciprocal Rank Fusion (RRF): Merges results from multiple query searches to bubble up the most universally relevant context.
- Cross-Encoder Reranking: Utilizes `ms-marco-MiniLM-L-6-v2` for fine-grained, one-to-one relevance scoring, filtering out noise.
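The RRF merge step above can be sketched as follows. The constant `k=60` follows the original RRF formulation; the function and document names are illustrative, not the repository's actual API:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of doc IDs into one fused ranking.

    Each document scores 1 / (k + rank) per list it appears in, so
    documents ranked well across many expanded queries bubble up.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Results for three expanded variants of the same user query:
fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
])
print(fused[0])  # "doc_b": two top-1 hits and one rank-2 beat doc_a
```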
The agent evaluates the retrieved local context. If it deems the context "Irrelevant", it dynamically extracts the core entity and falls back to a Wikipedia API search, preventing dead-ends.
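The fallback decision above can be sketched like this. The entity-extraction heuristic here is a crude stand-in (the actual agent asks the SLM to extract the entity and to judge relevance); the function names are illustrative:

```python
import re

def extract_core_entity(query):
    """Crude heuristic: take the longest capitalized phrase as the entity.
    (Stand-in for the SLM-based extraction the agent actually performs.)"""
    phrases = re.findall(r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*", query)
    return max(phrases, key=len) if phrases else query

def get_context(query, local_context, judge_relevant):
    """Use the local context if the judge accepts it, else hit Wikipedia."""
    if local_context and judge_relevant(query, local_context):
        return local_context
    # Fallback path: needs network and the `wikipedia` pip package
    # (installed in the setup step below), so imported lazily here.
    import wikipedia
    titles = wikipedia.search(extract_core_entity(query), results=1)
    return wikipedia.summary(titles[0], sentences=3) if titles else ""
```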
Implements a strict post-generation verification layer (`guardrails.py`) that forces the LLM's output to align logically with the retrieved context, specifically auditing numerical data and factual claims to mitigate hallucinations.
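The numerical-audit idea can be illustrated with a minimal sketch (a simplified stand-in, not the actual code in `guardrails.py`): flag any number stated in the generated answer that never appears in the retrieved context.

```python
import re

def audit_numbers(answer, context):
    """Return numbers stated in the answer that the context doesn't support.

    A deterministic post-generation check: every numeric claim in the
    answer must literally appear in the retrieved context, otherwise it
    is flagged as a potential hallucination.
    """
    nums = lambda text: set(re.findall(r"\d+(?:\.\d+)?", text))
    return sorted(nums(answer) - nums(context))

context = "The Eiffel Tower is 330 metres tall and was completed in 1889."
assert audit_numbers("It is 330 m tall, finished in 1889.", context) == []
assert audit_numbers("It is 324 m tall.", context) == ["324"]
```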
```
User Query ──> [Query Expansion] ──> [Vector DB Search (Chroma)]
                                              │
                                        [RRF Merging]
                                              │
                                   [Cross-Encoder Reranking]
                                              │
[Wikipedia Fallback] <──(Irrelevant)── [Context Reflection]
                                              │
                                          (Relevant)
                                              │
          [Llama-3.2-3B Generation] ──> [Deterministic Guardrails] ──> Final Output
```
Prerequisites:
- Anaconda / Miniconda
- Windows OS (tested)
- GPU with at least 4GB VRAM
```shell
# Clone the repository
git clone https://github.com/YunqiWang1/TinyAgent-SLM.git
cd TinyAgent-SLM

# Create and activate the environment
conda create -n ai_start python=3.10
conda activate ai_start

# Install dependencies
conda install -c conda-forge greenlet
pip install gradio wikipedia langchain-community langchain-text-splitters sentence-transformers chromadb transformers bitsandbytes accelerate

# Launch the app
python app.py
```
This repository contains the archived codebase for my research project completed in 2025. The code has been recently cleaned, documented, and migrated to this public repository for portfolio demonstration purposes.
