Not Yet Another Neural Network Benchmarking Tool.
A high-performance LLM inference benchmarking tool designed for Kubernetes-scale deployments.
nyann-bench was vibe-coded via agentic engineering in support of vLLM's GB200 NVL72 WideEP bring-up, to address a series of challenges we ran into at scale.
- To sustain a high number of concurrent requests, a benchmarking tool must support both scale-out across clients and a high per-client request rate.
- Observability becomes more important at scale. Client-side benchmarking metrics make it easy to see what all benchmarking pods are doing at a glance.
- Streaming evals helped us detect and debug numerical issues that would gradually degrade the accuracy of NVFP4 models over the lifetime of the server — rare events that would only happen at scale.
- Tools like `vllm bench`, `guidellm`, or `lm-eval` that have heavy dependencies like PyTorch are too slow to update or deploy. `nyann-bench` is only 5 MB compressed.
At high concurrency, nyann-bench sustains up to 10x more requests per second than Python-based alternatives. Go's goroutine model and tuned HTTP transport eliminate the client as the bottleneck, so you're measuring the server, not your benchmark harness.
| Concurrency | nyann-bench | guidellm | vllm bench |
|---|---|---|---|
| 1 | 28 req/s | 28 req/s | 28 req/s |
| 64 | 1,616 req/s | 1,341 req/s | 1,386 req/s |
| 256 | 7,221 req/s | 1,352 req/s | 2,083 req/s |
| 1024 | 15,065 req/s | 1,207 req/s | 2,120 req/s |
| 4096 | 17,889 req/s | 1,306 req/s | 1,799 req/s |
Measured against the built-in mock server on a Linux x86_64 machine, 30s per data point. See `bench_compare/` for methodology and reproduction steps.
The container image is ~5 MB (single static binary on scratch) — no Python runtime, no pip dependencies, no conda environment. It deploys as a Kubernetes indexed Job for horizontal scale-out across multiple pods. Pod-level network tuning (expanded ephemeral port range, TCP_TW_REUSE) is built into the Job template.
Run GSM8K (or other evals) under load to see accuracy in real time via Prometheus. Watch your inference server's GSM8K score slowly fall as its KV cache gets poisoned with NaNs.
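At its core, a streaming eval is just a running accuracy counter updated as each graded response arrives, which is what makes gradual degradation visible mid-run. An illustrative sketch, not nyann-bench's actual eval plumbing:

```go
package main

import "fmt"

// RunningAccuracy tracks streaming eval accuracy so a slow degradation
// (e.g. a KV cache poisoned with NaNs) shows up while the benchmark is
// still running. A sketch; not nyann-bench's actual implementation.
type RunningAccuracy struct {
	correct, total int
}

// Record folds one graded answer into the running score.
func (r *RunningAccuracy) Record(ok bool) {
	if ok {
		r.correct++
	}
	r.total++
}

// Value returns the accuracy so far (0 before any samples).
func (r *RunningAccuracy) Value() float64 {
	if r.total == 0 {
		return 0
	}
	return float64(r.correct) / float64(r.total)
}

func main() {
	var acc RunningAccuracy
	for _, ok := range []bool{true, true, false, true} {
		acc.Record(ok)
	}
	fmt.Printf("accuracy %.2f\n", acc.Value()) // accuracy 0.75
}
```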
Two-sided observability out of the box:
- Client-side metrics — each pod exposes a `/metrics` endpoint with histograms for TTFT, ITL, E2E latency, and token counts, ready for Prometheus scraping.
- Server-side correlation — per-stage timestamps make it easy to query your server's Prometheus for the exact window of each benchmark phase (see `just query-prometheus`).
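To make the client-side metrics concrete, here is a minimal, dependency-free sketch of a Prometheus-style cumulative histogram and its text exposition. The metric name and bucket bounds are illustrative; nyann-bench's real `/metrics` endpoint may be implemented quite differently:

```go
package main

import (
	"fmt"
	"sort"
)

// Histogram is a minimal Prometheus-style cumulative histogram, the kind
// of client-side metric each pod could expose on /metrics. Illustrative
// only; not nyann-bench's actual metrics code.
type Histogram struct {
	bounds []float64 // bucket upper bounds in seconds, sorted ascending
	counts []uint64  // counts[i] = observations <= bounds[i] (cumulative)
	sum    float64
	total  uint64
}

func NewHistogram(bounds []float64) *Histogram {
	sort.Float64s(bounds)
	return &Histogram{bounds: bounds, counts: make([]uint64, len(bounds))}
}

// Observe records one latency sample into every bucket it fits in;
// Prometheus histogram buckets are cumulative by definition.
func (h *Histogram) Observe(v float64) {
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i]++
		}
	}
	h.sum += v
	h.total++
}

// Expose renders the histogram in Prometheus text exposition format.
func (h *Histogram) Expose(name string) string {
	out := ""
	for i, b := range h.bounds {
		out += fmt.Sprintf("%s_bucket{le=%q} %d\n", name, fmt.Sprint(b), h.counts[i])
	}
	out += fmt.Sprintf("%s_bucket{le=\"+Inf\"} %d\n", name, h.total)
	out += fmt.Sprintf("%s_sum %g\n%s_count %d\n", name, h.sum, name, h.total)
	return out
}

func main() {
	ttft := NewHistogram([]float64{0.05, 0.1, 0.5, 1})
	ttft.Observe(0.07)
	ttft.Observe(0.3)
	fmt.Print(ttft.Expose("nyann_ttft_seconds"))
}
```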
Define benchmark scenarios using a Pythonic Starlark DSL:
```python
chat = workload("faker", isl=256, osl=512)
long = workload("corpus", corpus_path="/data/sharegpt.txt", isl=2048, osl=512)

scenario(
    stages = [
        stage("30s", concurrency=16, warmup=True),
        stage("5m", concurrency=128, workload=chat),
        stage("5m", concurrency=64, workload=long),
    ],
)
```

Use variables, loops, and conditionals — it's a real language, not YAML:
```python
scenario(
    stages = [stage("2m", concurrency=c) for c in range(64, 513, 64)],
    workload = workload("synthetic", isl=512, osl=1024),
)
```

Each goroutine stream can run multi-turn conversations, carrying real model responses forward into subsequent turns. This exercises server-side KV cache reuse (prefix caching) and produces realistic conversation-shaped traffic.
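The multi-turn mechanism boils down to a loop that feeds each model reply back into the message history, so later turns share a growing prefix with earlier ones. In this sketch, `send` is a placeholder for the actual HTTP call to the inference server:

```go
package main

import "fmt"

// Message mirrors the OpenAI chat message shape.
type Message struct {
	Role    string
	Content string
}

// runConversation sketches how one benchmark stream carries real model
// replies forward into later turns, so the server sees a growing shared
// prefix and can exercise KV-cache (prefix) reuse. `send` stands in for
// the HTTP request; this is not nyann-bench's actual API.
func runConversation(prompts []string, send func([]Message) string) []Message {
	history := []Message{}
	for _, p := range prompts {
		history = append(history, Message{Role: "user", Content: p})
		reply := send(history) // full history, including prior model replies
		history = append(history, Message{Role: "assistant", Content: reply})
	}
	return history
}

func main() {
	echo := func(msgs []Message) string {
		return fmt.Sprintf("reply to %d msgs", len(msgs))
	}
	h := runConversation([]string{"hi", "tell me more"}, echo)
	fmt.Println(len(h)) // 4: two user turns plus two assistant replies
}
```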
A configurable warmup phase brings the server to steady state before measurement begins, and ramp-up staggers stream starts to avoid synchronized request patterns that would otherwise create artificial load spikes.
```shell
# Build
go build -o nyann-bench ./cmd/nyann-bench/

# Start the mock server (for testing)
./nyann-bench mock-server

# Run a quick benchmark
./nyann-bench generate --target http://localhost:8000/v1 --config '{"load":{"concurrency":16,"duration":"30s"}}'
```

Or with a Starlark config file:
```shell
./nyann-bench generate --target http://localhost:8000/v1 --config scenario.star
```

| Command | Description |
|---|---|
| `generate` | Run a load generation benchmark against an LLM endpoint |
| `analyze` | Analyze benchmark results from JSONL recordings |
| `mock-server` | Start a mock OpenAI-compatible server for testing |
| `corpus` | Convert text sources (ShareGPT, files, directories) into a corpus file |
| Type | Description |
|---|---|
| `synthetic` | Random word padding with deterministic ISL/OSL control |
| `faker` | Diverse, realistic generated text (names, locations, phrases) |
| `corpus` | Sliding window over real text files (ShareGPT, custom corpora) |
| `gsm8k` | Grade School Math 8K with few-shot prompting and streaming eval |
All workload types support configurable ISL (input sequence length), OSL (output sequence length), multi-turn conversations, and per-turn ISL overrides via `subsequent_isl`.
| Mode | Description |
|---|---|
| `concurrent` | Fixed number of goroutine streams, each sending requests back-to-back |
| `constant` | Fixed request rate (req/s) with deterministic inter-arrival times |
| `poisson` | Fixed request rate with exponential inter-arrival times (realistic traffic) |
Each worker produces:
- `requests_N.jsonl` — one line per completed request with TTFT, per-token ITL array, token counts, latency, eval results, and finish reason.
- `timestamps_N.json` — start/end times for each stage, for Prometheus range queries.
Merging across workers: `cat requests_*.jsonl`.
```shell
just deploy my-benchmark http://vllm-server:8000/v1 config.star 8
```

This creates a ConfigMap with your config and launches an indexed Job with 8 parallel pods. Each pod auto-detects its worker ID from `JOB_COMPLETION_INDEX`.
```shell
go install github.com/neuralmagic/nyann-bench/cmd/nyann-bench@latest
```

Or pull the container:

```shell
docker pull ghcr.io/neuralmagic/nyann-bench:latest
```

```shell
go test ./... -count=1   # all tests run against the mock server
just test                # same, via Justfile
just smoke-test          # end-to-end: mock server + load generator
```