Minions is a framework developed by Stanford's Hazy Research lab that enables efficient collaboration between small, local models running on your device and large, powerful models running in the cloud. By combining Minions with Groq's fast inference and cost-effectiveness, you get reduced costs, minimal latency, and high-quality results similar to using large models exclusively. We will be exploring Minions in general, and then investigating how it is specifically well-suited for multi-hop question-answering, a challenging type of question where frontier models still struggle.
The framework offers two main protocols:
The Minion Protocol is a small, on-device or local model chatting with a cloud-based model to reach a solution without sending long context to the cloud, while the MinionS protocol is a cloud model decomposing the task into smaller subtasks that are executed in parallel by the local model before being aggregated by the local model. While Minion offers slightly lower performance, it is more cost-effective than MinionS. One should consider the complexity of the task when deciding between Minion and MinionS based on the trade-off between cost and performance.
Multi-hop reasoning refers to answering questions that require connecting multiple pieces of information across different sources or parts of a document. Traditionally, multi-hop QA has been approached through:
-
Retrieval-Augmented Generation (RAG): Systems retrieve multiple relevant passages and then generate answers, but often struggle with complex reasoning chains.
-
Pipeline Approaches: Breaking questions into sub-questions, retrieving information for each, then combining results - but these systems are complex to build and maintain.
-
Single Large Model Calls: Using frontier models with large context windows to process all information at once - effective but extremely expensive and often inefficient.
Consider these examples that require multiple reasoning steps:
- "How did the 2008 housing crisis affect average retirement savings by 2010?"
- "Compare NVIDIA and Apple's stock performance during the AI boom of 2023"
- "How did Britain's and France's economic recovery differ in the Great Depression?"
Let's break down how one would approach the question about Britain and France's economic recovery:
-
First Hop: Identify when the Great Depression occurred (approximate timeframe)
-
Second Hop: Gather information about Britain's economic recovery during this period
-
Third Hop: Gather information about France's economic recovery during this period
-
Fourth Hop: Compare the two countries' economic recovery
-
Final Synthesis: Draw conclusions about how each country recovered from and was affected by the Great Depression
This multi-step process requires gathering different pieces of information and connecting them in a logical sequence - exactly what makes multi-hop reasoning challenging.
Single-call approaches to multi-hop reasoning face several challenges:
- High tendency for models to hallucinate when connections aren't explicit
- High token costs when processing large documents
- Difficulty maintaining focus across multiple steps
The MinionS architecture of decomposing a query into tasks and synthesizing a final response is well-suited for multi-hop question-answering, as it helps form connections, identify separate pieces of information, and compare them effectively.
Let us consider the multi-hop question in the example file we have provided: "How did Britain's and France's economic recovery differ in the Great Depression?"
Here is how MinionS would go ahead and answer this question:
1. Initial Task Decomposition: The remote model hosted on Groq typically will break the task into two parallel tasks, such as:
Task 1: "Extract information about Britain's economic recovery during the Great Depression."
Task 2: "Extract information about France's economic recovery during the Great Depression."2. Parallel Local Processing: The local model will these tasks against the context document, returning structured outputs:
{
"explanation": "Britain devalued its currency early and experienced less severe impacts...",
"citation": "Britain, Argentina and Brazil, all of which devalued their currencies early and returned to normal patterns of growth relatively rapidly...",
"answer": "Britain's early currency devaluation helped it recover more quickly..."
}3. Expert Aggregation: Finally, the remote model on Groq synthesizes the parallel outputs from the local models into a final answer. This final answer effectively answers the multi-hop question, combining information from several parts of the context.
By leveraging the Minions framework with Groq's fast inference capabilities, developers can build applications that handle complex multi-hop reasoning tasks efficiently and cost-effectively. The MinionS protocol in particular offers a powerful alternative to traditional approaches, maintaining near-frontier model quality while significantly reducing costs and running on local devices.
- Clone the repository:
git clone https://github.com/HazyResearch/minions.git
cd minions- Create a virtual environment (optional):
python3 -m venv .venv
source .venv/bin/activate- Install the package and dependencies:
pip install -e .- Install Ollama and pull the Llama 3.2 model:
# Install Ollama from https://ollama.ai
ollama pull llama3.2- Get your Groq API key from Groq Cloud
export GROQ_API_KEY=YOUR_GROQ_API_KEYGo ahead and run the example code:
python groq_minions.py