Releases: llm-d/llm-d-inference-sim

v0.8.2

31 Mar 09:31
a311a6c

⚠️ Migration Notes (for users upgrading from versions prior to v0.8.0)
Please see v0.8.0 Release Notes

What's Changed

  • Send response context if there are no tokens by @irar2 in #412

Full Changelog: v0.8.1...v0.8.2

v0.8.1

29 Mar 06:32
5cce552

⚠️ Migration Notes (for users upgrading from versions prior to v0.8.0)

Please see v0.8.0 Release Notes

What's Changed

  • Refactor configuration: support pod name, namespace, and dev mode as configuration parameters by @Mrudhulraj in #402
  • Fix bug in dp server start by @irar2 in #410

Full Changelog: v0.8.0...v0.8.1

v0.8.0

26 Mar 07:03
eedfce4

⚠️ Important Changes

Please read before upgrading.

What’s new:

New dependency: the tokenizer is now a standalone application that should run as a sidecar process.

For details, see README.md

Deprecated command line parameters:

  • tokenizers-cache-dir
  • zmq-max-connect-attempts

New Features

  • New endpoint /v1/embeddings (see the sketch after this list)
  • gRPC support (see the documentation for details)
  • /chat/completions works with --enable-kvcache
  • Added support for --mm-encoder-only
  • Support a --no- prefix for boolean vLLM config parameters
    • no-enable-sleep-mode
    • no-mm-encoder-only
    • no-enforce-eager
    • no-enable-prefix-caching
  • Fake metrics: support functions for gauges
  • Dataset structure updated; the dataset tool is updated accordingly
  • All requests are tokenized using the model defined in the configuration. Important: to avoid the time and network overhead of HuggingFace tokenization, use a "fake" or non-existent model name (e.g., --model fake-model).
  • KV events extended to include tokens
  • New metrics
    • vllm:prefix_cache_hits
    • vllm:prefix_cache_queries
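
As a rough illustration of the new /v1/embeddings endpoint, the snippet below is a minimal sketch that exercises it with a plain HTTP client. The host, port, and model name are placeholder assumptions, not values taken from these notes.

```python
# Minimal sketch: call the simulator's new /v1/embeddings endpoint.
# localhost:8000 and "my-model" are placeholder assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "my-model", "input": "hello world"},
    timeout=10,
)
resp.raise_for_status()
# OpenAI-style response shape: {"data": [{"embedding": [...], ...}], ...}
print(resp.json()["data"][0]["embedding"][:5])
```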

What's Changed

  • Introduce Tokenizer interface by @mayabar in #314
  • fix hf models url by @mayabar in #316
  • Set default value of --tokenizers-cache-dir to hf_cache by @mayabar in #317
  • Tokenize all requests by @irar2 in #318
  • Use real tokenization in echo mode by @irar2 in #319
  • Echo Dataset by @irar2 in #322
  • fix python error on hf tokenizer initialization by @mayabar in #321
  • Return tokenized response in GetTokens by @irar2 in #323
  • Use Tokenized in response by @irar2 in #324
  • Handle gRPC requests by @irar2 in #326
  • Metrics tpot channel size fix and new tests for errors by @irar2 in #328
  • Dataset tool by @mayabar in #325
  • Generation request and response types by @irar2 in #330
  • update documentation by @mayabar in #329
  • 🌱 Standardize governance workflows, tooling, and Dependabot by @clubanderson in #333
  • 🌱 Remove legacy typo and link checker workflows by @clubanderson in #340
  • docs(example): Fix indentation for POD_IP valueFrom field by @tarilabs in #348
  • Update example of running the simulator in the documentation by @mayabar in #351
  • 🌱 Remove orphaned .lychee.toml by @clubanderson in #352
  • Refactor: separate token generation from response sending by @irar2 in #353
  • Add tokens to kv events by @mayabar in #354
  • Fix /chat/completions response in echo mode by @mayabar in #362
  • Fix PR #362 by @mayabar in #365
  • Add vllm:prefix_cache_hits and vllm:prefix_cache_queries counters by @InfraWhisperer in #358
  • Add /v1/embeddings endpoint by @sbekkerm in #364
  • Response builder by @irar2 in #372
  • Read configuration in main by @irar2 in #373
  • Separate simulator creation and start. Communication layer by @irar2 in #375
  • 🌱 Remove per-repo gh-aw typo/link/upstream workflows by @clubanderson in #381
  • Ignore data-parallel-size if data-parallel-rank is set by @irar2 in #376
  • feat(http): add pod/namespace/request-id response headers to /embeddings by @sbekkerm in #374
  • Separate communication (HTTP and gRPC) from the simulator code by @irar2 in #382
  • Support functions for generating fake gauge metrics by @irar2 in #389
  • Bug fix: fake metrics init by @irar2 in #391
  • Refactoring: store channels along their names in a struct by @irar2 in #390
  • Use kv cache 0.6.0 - tokenizer is stand alone + remove all python dependencies by @mayabar in #386
  • fixes in makefile by @mayabar in #395
  • Chat completion with kvcache by @mayabar in #396
  • Support mm-encoder-only mode by @irar2 in #398
  • Update readme by @irar2 in #401
  • Add --no option for vLLM boolean command line parameters by @irar2 in #400
  • Remove CGO dependency by migrating to pure-Go ZMQ+change in ci_pr_checks by @mayabar in #406

Full Changelog: v0.7.0...v0.8.0

v0.7.1

27 Jan 12:37
3140e66

Full Changelog: v0.7.0...v0.7.1

v0.7.0

25 Jan 07:04
44887c2

New Features

  • Sleep mode
  • Support for the vLLM --data-parallel-rank command line argument
  • All latency configuration parameters changed to duration format
  • Support for a cache-threshold finish-reason header that causes the cache_threshold finish reason to be returned
  • ZeroMQ listener
  • Support for the X-Request-Id header in responses and logs (see the sketch after this list)
  • New metrics
    • max_num_generation_tokens
    • cache_config_info
    • inter_token_latency_seconds
    • generation_tokens_total
    • prompt_tokens_total
  • Renamed metric
    • gpu_cache_usage_perc renamed to kv_cache_usage_perc
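
A minimal sketch of the X-Request-Id behavior, under the assumption that the simulator echoes the header back on the response; the host, port, and model name are placeholders, not values taken from these notes.

```python
# Minimal sketch: supply an X-Request-Id and check whether the simulator
# echoes it back (localhost:8000 and "my-model" are placeholder assumptions).
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    headers={"X-Request-Id": "my-trace-id-123"},
    json={"model": "my-model", "prompt": "hi", "max_tokens": 5},
    timeout=10,
)
print(resp.headers.get("X-Request-Id"))  # expected: my-trace-id-123
```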

Migrating from releases prior to v0.7.0

  • Use the env-setup.sh script to set the PYTHONPATH environment variable when running the simulator locally
  • Use duration format instead of milliseconds for all latency configuration parameters (e.g., a value of 2000 becomes 2s)
  • Rename metric gpu_cache_usage_perc to kv_cache_usage_perc
  • Bump kv-cache-manager to v0.4.0

Full Changelog: v0.6.1...v0.7.0

v0.6.1

30 Oct 14:47
658e3e5

What's Changed

  • feat: Log probabilities support by @ruivieira in #221
  • Add synchronization of freeing a worker after stream request processing by @mayabar in #244

Full Changelog: v0.6.0...v0.6.1

v0.6.0

29 Oct 11:06
9a57299

What's Changed

  • New requests queue by @irar2 in #214
  • Make writing to channels non-blocking by @irar2 in #225
  • Change packages' dependencies by @irar2 in #229
  • Added port header to response by @irar2 in #232
  • Test fix: number of running requests can be one request less when scheduling requests by @irar2 in #231
  • fix occasional ttft and tpot metrics test failures by @mayabar in #233
  • Configure the tool_choice option to use a specific tool by @MondayCha in #234 (see the sketch after this list)
  • Additional latency related metrics by @mayabar in #237
  • Changed random from static to a field in the simulator by @irar2 in #238
  • Made workers' requests channel non-blocking by @irar2 in #239
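
For the tool_choice change in #234, a minimal sketch of pinning a request to a named tool via the OpenAI-compatible /v1/chat/completions endpoint; the host, port, model name, and tool definition are placeholder assumptions.

```python
# Minimal sketch: force a specific tool via tool_choice.
# localhost:8000, "my-model", and get_weather are placeholder assumptions.
import requests

payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
            },
        },
    }],
    # Standard OpenAI format for selecting one specific tool:
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
}
resp = requests.post(
    "http://localhost:8000/v1/chat/completions", json=payload, timeout=10
)
print(resp.json()["choices"][0]["message"].get("tool_calls"))
```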

Full Changelog: v0.5.2...v0.6.0

v0.5.2

22 Oct 07:48
1c3d559

What's Changed

  • Use custom dataset as response source by @pancak3 in #200
  • Add vllm:time_per_output_token_seconds and vllm:time_to_first_token_seconds metrics by @mayabar in #217 (see the sketch after this list)
  • Use openai-go v3.6.1 in the tests by @irar2 in #223
  • feat(metrics): add request prompt, generation, max_tokens and success metrics by @googs1025 in #202
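
A minimal sketch of reading the new latency histograms, assuming the simulator exposes a standard Prometheus /metrics endpoint on localhost:8000 (an assumption, not stated in these notes).

```python
# Minimal sketch: scrape /metrics and print the new latency histogram lines.
# A Prometheus-style /metrics endpoint on localhost:8000 is assumed.
import requests

text = requests.get("http://localhost:8000/metrics", timeout=10).text
for line in text.splitlines():
    if line.startswith(("vllm:time_per_output_token_seconds",
                        "vllm:time_to_first_token_seconds")):
        print(line)
```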

Full Changelog: v0.5.1...v0.5.2

v0.5.1

18 Sep 15:08
b8eb7a4

New Features

  • The llm-d-inference-sim server can run in TLS mode, with the certificate and key either supplied by the user or generated automatically. A sketch of connecting to a TLS-enabled server follows.
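
A minimal sketch of querying a TLS-enabled simulator, assuming it serves the OpenAI-compatible /v1/models endpoint on localhost:8000 (both assumptions). Verification is skipped here only because the auto-generated certificate is self-signed; with a user-supplied certificate, point verify at its CA bundle instead.

```python
# Minimal sketch: query a TLS-enabled simulator (localhost:8000 assumed).
# verify=False fits the auto-generated self-signed certificate case only;
# for a user-supplied certificate, use verify="/path/to/ca.pem".
import requests

resp = requests.get("https://localhost:8000/v1/models", verify=False, timeout=10)
print(resp.status_code, resp.json())
```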

Full Changelog: v0.5.0...v0.5.1

v0.5.0

16 Sep 06:54
9c541b9

New Features

  • Processing time is affected by server load
  • Change the TTFT parameter to be based on the number of request tokens
  • KV cache affects prefill time
  • Support failure injection
  • Implement kv-cache usage and waiting loras Prometheus metrics
  • Randomize response length when max_tokens is defined in the request
  • Support DP (data parallel)
  • Support the /tokenize endpoint (see the sketch after this list)
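
For the new /tokenize endpoint, a minimal sketch modeled on vLLM's tokenize API; the host, port, model name, and exact request shape are assumptions rather than details from these notes.

```python
# Minimal sketch: call the new /tokenize endpoint.
# localhost:8000, "my-model", and the request shape are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/tokenize",
    json={"model": "my-model", "prompt": "Hello, world!"},
    timeout=10,
)
print(resp.json())  # expected shape: token count and/or token list
```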

What's Changed

  • Fix server interrupt by @npolshakova in #161
  • Show final config in simulator default logger at Info level by @pancak3 in #154
  • Cast bounds type in tests to func def: latency, interToken, and timeToFirst (to int) by @pancak3 in #163
  • Remove unnecessary deferral of server close by @pancak3 in #162
  • Fix: rand generator is not set in a test suite, which results in accessing a nil pointer at runtime when only that test suite is run by @pancak3 in #166
  • Use channels for metrics updates, added metrics tests by @irar2 in #171
  • Remove rerun on comment action by @irar2 in #174
  • Add failure injection mode to simulator by @smarunich in #131
  • Add waiting loras list to loraInfo metrics by @mayabar in #175
  • feat: generate response length based on a histogram when max_tokens is defined in the request by @mayabar in #169
  • Extend response length bucket calculation to allow buckets that are not necessarily equally sized by @mayabar in #176
  • Use dynamic ports in zmq tests by @pancak3 in #170
  • Change time-to-first-token parameter to be based on number of request tokens #137 by @pancak3 in #165
  • Bugfix: was accessing number of tokens from nil var; getting it from req instead by @pancak3 in #177
  • feat: add helm charts for Kubernetes deployment by @Blackoutta in #182
  • chore: Make the image smaller by @shmuelk in #183
  • Take cached prompt tokens into account in prefill time calculation by @irar2 in #184
  • Add ignore eos in request by @pancak3 in #187
  • Support DP by @irar2 in #188
  • Change RandomNorm from float types to int by @pancak3 in #190
  • KV cache usage metric by @irar2 in #192
  • Adjust request "processing time" to current load by @pancak3 in #189
  • Updates for the new release of kv-cache-manager by @irar2 in #194
  • DP bug fix: wait after starting rank 0 sim by @irar2 in #193
  • Support /tokenize endpoint by @irar2 in #198
  • add Service to expose vLLM deployment and update doc by @googs1025 in #201
  • Split simulator.go into several files by @irar2 in #199

Full Changelog: v0.4.0...v0.5.0