A local AI assistant for Apple Silicon Macs. Type a question, and ask automatically routes it to the right model -- coding, vision, speech-to-text, or general chat -- all running offline on your hardware via MLX.
git clone https://github.com/kislayverma/ask.git
cd ask
./setup.sh

The setup script will:
- Install system dependencies (Python 3, ffmpeg) via Homebrew if needed
- Detect your hardware and select the best models for your RAM
- Create a virtual environment and install everything
- Download the classifier model (~2GB)
- Add the `ask` command to your shell
# Ask anything -- auto-routed to the right model
ask "What is a closure in JavaScript?"
# Force a specific model
ask --model code "Write a Python binary search function"
# Analyze an image
ask --image screenshot.png "What error is shown here?"
# Transcribe audio
ask --audio meeting.wav
# Interactive mode (model stays warm between queries)
ask -i

In interactive mode (ask -i), you can:
ask> What is quantum entanglement? # auto-routed
ask> /image photo.png What's in this? # vision
ask> /audio recording.wav # transcribe
ask> quit # exit
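The `/image` and `/audio` prefixes above follow a simple convention: an optional file path, then an optional prompt. A minimal sketch of how such a REPL line could be parsed (the function name and tuple shape are illustrative, not ask's internals):

```python
def parse_repl(line: str):
    """Split a REPL line into (mode, attachment_path, prompt)."""
    line = line.strip()
    if line in ("quit", "exit"):
        return ("quit", None, None)
    for prefix, mode in (("/image ", "vision"), ("/audio ", "audio")):
        if line.startswith(prefix):
            rest = line[len(prefix):].split(maxsplit=1)
            path = rest[0] if rest else None
            prompt = rest[1] if len(rest) > 1 else None
            return (mode, path, prompt)
    return ("auto", None, line)  # plain text falls through to the router
```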
Your query
|
v
[Router] -- Phi-3.5-mini classifies your query
|
+---> GENERAL --> Qwen3 8B
+---> CODE --> Qwen2.5-Coder 7B
+---> VISION --> Qwen2.5-VL 7B
+---> AUDIO --> Qwen3-ASR 0.6B
The router uses Phi-3.5-mini (3.8B, 4-bit) as a fast classifier. It reads your query and decides which specialist model to invoke. Only one model is loaded at a time to fit comfortably in memory.
Explicit flags (--image, --audio, --model) bypass the classifier entirely.
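The routing step boils down to mapping the classifier's raw text output onto one of the four categories, with a safe default when the output is unexpected. A minimal sketch, using the label set from the diagram above (the parsing details and the GENERAL fallback are assumptions, not ask's actual code):

```python
ROUTES = {"GENERAL", "CODE", "VISION", "AUDIO"}

def parse_route(classifier_output: str) -> str:
    """Take the first word the classifier emits; fall back to GENERAL."""
    words = classifier_output.strip().upper().split()
    if not words:
        return "GENERAL"
    label = words[0].strip(".,:")
    return label if label in ROUTES else "GENERAL"
```

A fallback matters here because a small 4-bit classifier will occasionally emit extra words or an unknown label, and a wrong-but-working answer from the general model beats a crash.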
The setup script auto-detects your RAM and picks the best models:
| RAM | Tier | General | Coding | Vision | Audio |
|---|---|---|---|---|---|
| 8GB | 8gb | Phi-3.5-mini 3.8B | Phi-3.5-mini 3.8B | Qwen2.5-VL 7B | Qwen3-ASR 0.6B |
| 16GB | 16gb | Qwen3 8B | Qwen2.5-Coder 7B | Qwen2.5-VL 7B | Qwen3-ASR 0.6B |
| 32GB+ | 32gb | Qwen3 8B | Qwen2.5-Coder 7B | Qwen2.5-VL 7B | Qwen3-ASR 1.7B |
Models download on first use (~4-6GB each). Only the classifier (~2GB) downloads during setup.
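The RAM-to-tier mapping in the table above can be sketched as a simple threshold function (a hypothetical helper, not setup.sh's actual code; on macOS the installed RAM would come from something like `sysctl hw.memsize`):

```python
def pick_tier(ram_gb: int) -> str:
    """Map unified memory size to a model-profile tier, per the table above."""
    if ram_gb >= 32:
        return "32gb"
    if ram_gb >= 16:
        return "16gb"
    return "8gb"  # minimum supported: 8GB unified memory
```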
- macOS with Apple Silicon (M1/M2/M3/M4)
- 8GB+ unified memory
- ~20GB free disk space (for all models)
You can swap any role to a different model using the ask models command:
# See what's currently configured
ask models list
# Swap the general model to Llama 3.2
ask models add general mlx-community/Llama-3.2-3B-Instruct-4bit --type mlx_lm
# Pre-download so first use is instant
ask models download general
# Reset back to the default for your hardware tier
ask models remove general

The --type flag tells ask which framework to use:
| Type | Use for |
|---|---|
| mlx_lm | Text-only LLMs (general chat, coding) |
| mlx_vlm | Vision-language models (image understanding) |
| mlx_qwen3_asr | Qwen3 speech-to-text models |
Any model from the mlx-community HuggingFace organization works. Use 4-bit quantized models (names ending in -4bit) to stay within memory limits.
Config lives at ~/.ask/config.yaml. You can also edit it directly, or re-run ./setup.sh to regenerate it from scratch.
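If you edit the file by hand, a config shaped like the commands above might look roughly like this (the key names are guesses, not the real schema; check your generated file, and re-run ./setup.sh if you break it):

```yaml
# Illustrative ~/.ask/config.yaml -- key names are assumptions
tier: 16gb
models:
  general:
    repo: mlx-community/Llama-3.2-3B-Instruct-4bit  # from the swap example above
    type: mlx_lm
```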
ask/
├── setup.sh # One-command setup
├── pyproject.toml # Package definition
└── src/ask/
├── cli.py # CLI entry point (one-shot + REPL)
├── router.py # LLM-based query classifier
├── models.py # Model loader/cache manager
├── model_manager.py # Model add/remove/download commands
├── config.py # Config file handling
├── hardware.py # Hardware detection
└── model_profiles/ # Default model selections per RAM tier
├── tier_8gb.yaml
├── tier_16gb.yaml
└── tier_32gb.yaml
MIT