Mixpeek

Give your agents eyes and ears.

Mixpeek breaks every video, image, and audio file into structured features
your agents can search, reason over, and trust.

Docs · Get Started · Quickstart · Blog

What is Mixpeek?

Mixpeek is multimodal infrastructure for AI agents. Upload video, images, audio, and documents — Mixpeek automatically extracts features (faces, objects, transcripts, embeddings, structured metadata) and indexes them into searchable collections. Your agent queries a single endpoint and gets structured results back.

Index → Upload files to buckets. Mixpeek runs feature extraction automatically — faces, objects, transcripts, embeddings, and structured metadata all get indexed.

Search → Build retrieval pipelines. Semantic search, face search, object search, transcript search — chain them into multi-stage retrievers exposed as a single endpoint.

Integrate → Wire Mixpeek into your agent as a LangChain tool, an MCP server, or a direct REST call.
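As a rough sketch of the integration step, the retriever call shown in the Quickstart can be wrapped in a plain one-argument function, which is the shape most agent frameworks accept as a tool. The names `make_search_tool` and `search_media` are hypothetical, and `client` is only assumed to expose `retrievers.execute(...)` with the keyword arguments used in the Quickstart. The result shape is not specified here, so the wrapper returns it unchanged.

```python
from typing import Any, Callable


def make_search_tool(client: Any, retriever_id: str, limit: int = 10) -> Callable[[str], Any]:
    """Return a one-argument search function bound to a single retriever.

    Assumption: `client.retrievers.execute` accepts `retriever_id`,
    `inputs`, and `limit`, as in the Quickstart snippet.
    """

    def search_media(query: str) -> Any:
        """Search indexed video, image, audio, and document features by text."""
        return client.retrievers.execute(
            retriever_id=retriever_id,
            inputs={"query_text": query},
            limit=limit,
        )

    return search_media
```

The docstring on `search_media` matters in practice: frameworks that build tool descriptions from function metadata will show it to the model.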

Quickstart

pip install mixpeek

from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

# Upload a video
mx.buckets.upload(bucket_id="my-bucket", file_path="video.mp4")

# Search across all extracted features
results = mx.retrievers.execute(
    retriever_id="my-retriever",
    inputs={"query_text": "person wearing a red jacket"},
    limit=10,
)

Also available as:

JavaScript SDK: npm install mixpeek
MCP Server: Connect Claude, Cursor, or any MCP-compatible agent
REST API: POST https://api.mixpeek.com/v1/retrievers/{id}/execute
CLI: mixpeek --version (included in the Python SDK)
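
For the REST route, a minimal sketch using only the Python standard library might look like the following. The endpoint path comes from the list above. The bearer-token header and the JSON body (mirroring the SDK's `inputs` and `limit` arguments) are assumptions, so check the API docs before relying on them. `build_execute_request` is a hypothetical helper that builds the request without sending it.

```python
import json
import urllib.request

API_BASE = "https://api.mixpeek.com/v1"


def build_execute_request(
    retriever_id: str, query_text: str, api_key: str, limit: int = 10
) -> urllib.request.Request:
    """Build (but do not send) a POST to /retrievers/{id}/execute."""
    # Assumption: the JSON body mirrors the SDK keyword arguments.
    body = json.dumps({"inputs": {"query_text": query_text}, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/retrievers/{retriever_id}/execute",
        data=body,
        method="POST",
        headers={
            # Assumption: bearer-token auth; the real scheme may differ.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; keeping construction separate makes the request easy to inspect or log first.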

What Gets Extracted

File Type	Features
Video	Face embeddings (ArcFace), scene descriptions (Gemini), visual embeddings (Vertex AI), transcripts (Whisper), keyframes
Images	Visual embeddings (SigLIP / Vertex AI), face embeddings (ArcFace), OCR, descriptions, structured extraction
Audio	Transcripts (Whisper), transcript embeddings (E5-Large), multimodal audio embeddings
Documents	Text chunks, text embeddings (E5-Large), OCR for scanned PDFs, structured extraction

Each extracted feature becomes an independently searchable document. A single video can produce hundreds of documents — one per face, one per transcript segment, one per scene.
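
To make the fan-out concrete, here is an illustrative sketch of how one file could map to many documents. `FeatureDocument` and `fan_out` are hypothetical names for explanation only and do not reflect Mixpeek's actual schema or SDK.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FeatureDocument:
    """One searchable unit derived from a source file (hypothetical shape)."""

    source_file: str
    feature_type: str  # e.g. "face", "transcript_segment", "scene"
    payload: dict[str, Any] = field(default_factory=dict)


def fan_out(source_file: str, features: dict[str, list[dict[str, Any]]]) -> list[FeatureDocument]:
    """Flatten per-type feature lists into one document per extracted feature."""
    return [
        FeatureDocument(source_file, feature_type, item)
        for feature_type, items in features.items()
        for item in items
    ]


docs = fan_out(
    "video.mp4",
    {
        "face": [{"start_s": 1.0}, {"start_s": 7.5}],
        "transcript_segment": [{"text": "hello"}],
        "scene": [{"description": "street at night"}],
    },
)
print(len(docs))  # 4: two faces, one transcript segment, one scene
```

Because every document keeps a reference to its source file, a hit on a single face or transcript segment can still be traced back to the video it came from.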

Use Cases

Video understanding — Search surveillance footage by face, scene, or spoken word
Content moderation — Detect brand logos, faces, and unsafe content across media libraries
Document intelligence — Extract structured data from scanned PDFs, invoices, and forms
Media asset management — Find the exact frame across millions of hours of video
E-commerce — Visual similarity search, product matching, catalog enrichment
