Mixpeek breaks every video, image, and audio file into structured features
your agents can search, reason over, and trust.
Mixpeek is multimodal infrastructure for AI agents. Upload video, images, audio, and documents — Mixpeek automatically extracts features (faces, objects, transcripts, embeddings, structured metadata) and indexes them into searchable collections. Your agent queries a single endpoint and gets structured results back.
Index → Upload files to buckets. Mixpeek runs feature extraction automatically — faces, objects, transcripts, embeddings, and structured metadata all get indexed.
Search → Build retrieval pipelines. Semantic search, face search, object search, transcript search — chain them into multi-stage retrievers exposed as a single endpoint.
Integrate → Wire Mixpeek into your agent as a LangChain tool, an MCP server, or a direct REST call.
```bash
pip install mixpeek
```

```python
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

# Upload a video
mx.buckets.upload(bucket_id="my-bucket", file_path="video.mp4")

# Search across all extracted features
results = mx.retrievers.execute(
    retriever_id="my-retriever",
    inputs={"query_text": "person wearing a red jacket"},
    limit=10,
)
```

Also available as:
- JavaScript SDK: `npm install mixpeek`
- MCP Server: Connect Claude, Cursor, or any MCP-compatible agent
- REST API: `POST https://api.mixpeek.com/v1/retrievers/{id}/execute`
- CLI: `mixpeek --version` (included in the Python SDK)
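
For agents that call the REST endpoint directly, the request might be assembled roughly as follows. This is a minimal sketch, not the documented contract: only the URL pattern comes from the list above. The JSON body shape (mirroring the SDK's `inputs` and `limit` arguments), the bearer-token `Authorization` header, and the helper name `build_execute_request` are assumptions, so check the API reference before relying on them.

```python
import json
import urllib.request

API_BASE = "https://api.mixpeek.com/v1"  # base of the endpoint listed above


def build_execute_request(retriever_id, query_text, api_key, limit=10):
    """Build (but do not send) a POST request for the retriever execute endpoint."""
    # Assumption: the JSON body mirrors the SDK arguments (inputs, limit).
    body = json.dumps({"inputs": {"query_text": query_text}, "limit": limit})
    return urllib.request.Request(
        url=f"{API_BASE}/retrievers/{retriever_id}/execute",
        data=body.encode("utf-8"),
        headers={
            # Assumption: bearer-token auth; the real scheme may differ.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_execute_request(
    "my-retriever", "person wearing a red jacket", "YOUR_API_KEY"
)
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` would send it. Building the request separately from sending it keeps the sketch easy to inspect without network access.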
| File Type | Features |
|---|---|
| Video | Face embeddings (ArcFace), scene descriptions (Gemini), visual embeddings (Vertex AI), transcripts (Whisper), keyframes |
| Images | Visual embeddings (SigLIP / Vertex AI), face embeddings (ArcFace), OCR, descriptions, structured extraction |
| Audio | Transcripts (Whisper), transcript embeddings (E5-Large), multimodal audio embeddings |
| Documents | Text chunks, text embeddings (E5-Large), OCR for scanned PDFs, structured extraction |
Each extracted feature becomes an independently searchable document. A single video can produce hundreds of documents — one per face, one per transcript segment, one per scene.
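
To make the one-file-to-many-documents idea concrete, here is a small sketch that groups feature documents by type. The records and field names (`source_file`, `feature_type`, `start_s`) are hypothetical and chosen only for illustration; they are not Mixpeek's actual document schema.

```python
from collections import defaultdict

# Hypothetical feature documents derived from a single video.
# Field names are illustrative, not the real schema.
sample_documents = [
    {"source_file": "video.mp4", "feature_type": "face", "start_s": 3.0},
    {"source_file": "video.mp4", "feature_type": "transcript_segment", "start_s": 0.0},
    {"source_file": "video.mp4", "feature_type": "transcript_segment", "start_s": 12.5},
    {"source_file": "video.mp4", "feature_type": "scene", "start_s": 0.0},
]


def group_by_feature(documents):
    """Group feature documents by their (assumed) feature_type field."""
    grouped = defaultdict(list)
    for doc in documents:
        grouped[doc["feature_type"]].append(doc)
    return dict(grouped)


grouped = group_by_feature(sample_documents)
for feature_type, docs in grouped.items():
    print(feature_type, len(docs))
```

Because every face, transcript segment, and scene is its own record, a query can match one specific moment in a file rather than the file as a whole.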
- Video understanding — Search surveillance footage by face, scene, or spoken word
- Content moderation — Detect brand logos, faces, and unsafe content across media libraries
- Document intelligence — Extract structured data from scanned PDFs, invoices, and forms
- Media asset management — Find the exact frame across millions of hours of video
- E-commerce — Visual similarity search, product matching, catalog enrichment