Not Yet Another Neural Network Benchmarking Tool.
A high-performance LLM inference benchmarking tool designed for Kubernetes-scale deployments.
nyann-bench was vibe-coded via agentic engineering in support of vLLM's GB200 NVL72 WideEP bring-up, to address a series of challenges we ran into at scale.
- To sustain a high number of concurrent requests, a benchmarking tool must support both scale-out across clients and a high per-client request rate.
- Observability becomes more important at scale. Client-side benchmarking metrics make it easy to see what all benchmarking pods are doing at a glance.
- Streaming evals helped us detect and debug numerical issues that would gradually degrade the accuracy of NVFP4 models over the lifetime of the server — rare events that would only happen at scale.
- Tools like `vllm bench`, `guidellm`, or `lm-eval` that have heavy dependencies like PyTorch are too slow to update or deploy. `nyann-bench` is only 5 MB compressed.
At high concurrency, nyann-bench sustains up to 10x more requests per second than Python-based alternatives. Go's goroutine model and tuned HTTP transport eliminate the client as the bottleneck, so you're measuring the server, not your benchmark harness.
| Concurrency | nyann-bench | guidellm | vllm bench |
|---|---|---|---|
| 1 | 28 req/s | 28 req/s | 28 req/s |
| 64 | 1,616 req/s | 1,341 req/s | 1,386 req/s |
| 256 | 7,221 req/s | 1,352 req/s | 2,083 req/s |
| 1024 | 15,065 req/s | 1,207 req/s | 2,120 req/s |
| 4096 | 17,889 req/s | 1,306 req/s | 1,799 req/s |
Measured against the built-in mock server on a Linux x86_64 machine, 30s per data point. See `bench_compare/` for methodology and reproduction steps.
The container image is ~5 MB (single static binary on scratch) — no Python runtime, no pip dependencies, no conda environment. It deploys as a Kubernetes indexed Job for horizontal scale-out across multiple pods. Pod-level network tuning (expanded ephemeral port range, TCP_TW_REUSE) is built into the Job template.
Run GSM8K (or other evals) under load to see accuracy in real time via Prometheus. Watch your inference server's GSM8K score slowly fall as its KV cache gets poisoned with NaNs.
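At its core, a streaming eval is just a running accuracy counter updated as each graded response arrives, which is what makes gradual degradation visible mid-run. An illustrative sketch, not nyann-bench's actual eval plumbing:

```go
package main

import "fmt"

// RunningAccuracy tracks streaming eval accuracy so a slow degradation
// (e.g. a KV cache poisoned with NaNs) shows up while the benchmark is
// still running. A sketch; not nyann-bench's actual implementation.
type RunningAccuracy struct {
	correct, total int
}

// Record folds one graded answer into the running score.
func (r *RunningAccuracy) Record(ok bool) {
	if ok {
		r.correct++
	}
	r.total++
}

// Value returns the accuracy so far (0 before any samples).
func (r *RunningAccuracy) Value() float64 {
	if r.total == 0 {
		return 0
	}
	return float64(r.correct) / float64(r.total)
}

func main() {
	var acc RunningAccuracy
	for _, ok := range []bool{true, true, false, true} {
		acc.Record(ok)
	}
	fmt.Printf("accuracy %.2f\n", acc.Value()) // accuracy 0.75
}
```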
Two-sided observability out of the box:
- Client-side metrics — each pod exposes a `/metrics` endpoint with histograms for TTFT, ITL, E2E latency, and token counts, ready for Prometheus scraping.
- Server-side correlation — per-stage timestamps make it easy to query your server's Prometheus for the exact window of each benchmark phase (see `just query-prometheus`).
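To make the client-side metrics concrete, here is a minimal, dependency-free sketch of a Prometheus-style cumulative histogram and its text exposition. The metric name and bucket bounds are illustrative; nyann-bench's real `/metrics` endpoint may be implemented quite differently:

```go
package main

import (
	"fmt"
	"sort"
)

// Histogram is a minimal Prometheus-style cumulative histogram, the kind
// of client-side metric each pod could expose on /metrics. Illustrative
// only; not nyann-bench's actual metrics code.
type Histogram struct {
	bounds []float64 // bucket upper bounds in seconds, sorted ascending
	counts []uint64  // counts[i] = observations <= bounds[i] (cumulative)
	sum    float64
	total  uint64
}

func NewHistogram(bounds []float64) *Histogram {
	sort.Float64s(bounds)
	return &Histogram{bounds: bounds, counts: make([]uint64, len(bounds))}
}

// Observe records one latency sample into every bucket it fits in;
// Prometheus histogram buckets are cumulative by definition.
func (h *Histogram) Observe(v float64) {
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i]++
		}
	}
	h.sum += v
	h.total++
}

// Expose renders the histogram in Prometheus text exposition format.
func (h *Histogram) Expose(name string) string {
	out := ""
	for i, b := range h.bounds {
		out += fmt.Sprintf("%s_bucket{le=%q} %d\n", name, fmt.Sprint(b), h.counts[i])
	}
	out += fmt.Sprintf("%s_bucket{le=\"+Inf\"} %d\n", name, h.total)
	out += fmt.Sprintf("%s_sum %g\n%s_count %d\n", name, h.sum, name, h.total)
	return out
}

func main() {
	ttft := NewHistogram([]float64{0.05, 0.1, 0.5, 1})
	ttft.Observe(0.07)
	ttft.Observe(0.3)
	fmt.Print(ttft.Expose("nyann_ttft_seconds"))
}
```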
Define benchmark scenarios using a Pythonic Starlark DSL:
```python
chat = workload("faker", isl=256, osl=512)
long = workload("corpus", corpus_path="/data/sharegpt.txt", isl=2048, osl=512)

scenario(
    stages = [
        stage("30s", concurrency=16, warmup=True),
        stage("5m", concurrency=128, workload=chat),
        stage("5m", concurrency=64, workload=long),
    ],
)
```

Use variables, loops, and conditionals — it's a real language, not YAML:
```python
scenario(
    stages = [stage("2m", concurrency=c) for c in range(64, 513, 64)],
    workload = workload("synthetic", isl=512, osl=1024),
)
```

Each goroutine stream can run multi-turn conversations, carrying real model responses forward into subsequent turns. This exercises server-side KV cache reuse (prefix caching) and produces realistic conversation-shaped traffic.
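The multi-turn mechanism boils down to a loop that feeds each model reply back into the message history, so later turns share a growing prefix with earlier ones. In this sketch, `send` is a placeholder for the actual HTTP call to the inference server:

```go
package main

import "fmt"

// Message mirrors the OpenAI chat message shape.
type Message struct {
	Role    string
	Content string
}

// runConversation sketches how one benchmark stream carries real model
// replies forward into later turns, so the server sees a growing shared
// prefix and can exercise KV-cache (prefix) reuse. `send` stands in for
// the HTTP request; this is not nyann-bench's actual API.
func runConversation(prompts []string, send func([]Message) string) []Message {
	history := []Message{}
	for _, p := range prompts {
		history = append(history, Message{Role: "user", Content: p})
		reply := send(history) // full history, including prior model replies
		history = append(history, Message{Role: "assistant", Content: reply})
	}
	return history
}

func main() {
	echo := func(msgs []Message) string {
		return fmt.Sprintf("reply to %d msgs", len(msgs))
	}
	h := runConversation([]string{"hi", "tell me more"}, echo)
	fmt.Println(len(h)) // 4: two user turns plus two assistant replies
}
```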
A configurable warmup phase brings the server to steady state before measurement begins, and ramp-up staggers stream starts to avoid synchronized request patterns that would otherwise create artificial load spikes.
```shell
# Build
go build -o nyann-bench ./cmd/nyann-bench/

# Start the mock server (for testing)
./nyann-bench mock-server

# Run a quick benchmark
./nyann-bench generate --target http://localhost:8000/v1 --config '{"load":{"concurrency":16,"duration":"30s"}}'
```

Or with a Starlark config file:
```shell
./nyann-bench generate --target http://localhost:8000/v1 --config scenario.star
```

| Command | Description |
|---|---|
| `generate` | Run a load generation benchmark against an LLM endpoint |
| `analyze` | Analyze benchmark results from JSONL recordings |
| `mock-server` | Start a mock OpenAI-compatible server for testing |
| `corpus` | Convert text sources (ShareGPT, files, directories) into a corpus file |
| Type | Description |
|---|---|
| `synthetic` | Random word padding with deterministic ISL/OSL control |
| `faker` | Diverse, realistic generated text (names, locations, phrases) |
| `corpus` | Sliding window over real text files (ShareGPT, custom corpora) |
| `gsm8k` | Grade School Math 8K with few-shot prompting and streaming eval |
All workload types support configurable ISL (input sequence length), OSL (output sequence length), multi-turn conversations, and per-turn ISL overrides via `subsequent_isl`.
| Mode | Description |
|---|---|
| `concurrent` | Fixed number of goroutine streams, each sending requests back-to-back |
| `constant` | Fixed request rate (req/s) with deterministic inter-arrival times |
| `poisson` | Fixed request rate with exponential inter-arrival times (realistic traffic) |
Each worker produces:
- `requests_N.jsonl` — one line per completed request with TTFT, per-token ITL array, token counts, latency, eval results, and finish reason.
- `timestamps_N.json` — start/end times for each stage, for Prometheus range queries.
Merging across workers: `cat requests_*.jsonl`.
```shell
just deploy my-benchmark http://vllm-server:8000/v1 config.star 8
```

This creates a ConfigMap with your config and launches an indexed Job with 8 parallel pods. Each pod auto-detects its worker ID from `JOB_COMPLETION_INDEX`.
```shell
go install github.com/neuralmagic/nyann-bench/cmd/nyann-bench@latest
```

Or pull the container:

```shell
docker pull ghcr.io/neuralmagic/nyann-bench:latest
```

```shell
go test ./... -count=1   # all tests run against the mock server
just test                # same, via Justfile
just smoke-test          # end-to-end: mock server + load generator
```