Releases: llm-d/llm-d-inference-sim
v0.8.2
Please see v0.8.0 Release Notes
What's Changed
Full Changelog: v0.8.1...v0.8.2
v0.8.1
⚠️ Migration Notes (for users upgrading from versions prior to v0.8.0)
Please see v0.8.0 Release Notes
What's Changed
- Refactor configuration: To support pod name, namespace and dev mode in configuration parameters by @Mrudhulraj in #402
- Fix bug in dp server start by @irar2 in #410
New Contributors
- @Mrudhulraj made their first contribution in #402
Full Changelog: v0.8.0...v0.8.1
v0.8.0
⚠️ Important Changes
Please read before upgrading.
What’s new:
New dependency: the tokenizer is now a standalone application that runs as a sidecar process.
For details, see README.md
Deprecated command line parameters:
- tokenizers-cache-dir
- zmq-max-connect-attempts
New Features
- New endpoint /v1/embeddings
- gRPC support (details)
- /chat/completions works with --enable-kvcache
- Added support for --mm-encoder-only
- Support --no- prefix for boolean vLLM config parameters:
  - no-enable-sleep-mode
  - no-mm-encoder-only
  - no-enforce-eager
  - no-enable-prefix-caching
- Fake metrics support functions for gauges
- Dataset structure updated, dataset tool is updated accordingly
- All requests are tokenized using the model defined in the configuration. Important: to avoid the time and network overhead of HuggingFace tokenization, use a "fake" or non-existent model name (e.g., --model fake-model).
- Extend kv events - add tokens
- New metrics
- vllm:prefix_cache_hits
- vllm:prefix_cache_queries
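As a quick illustration of the features above: a sketch of a local run that avoids HuggingFace downloads via a fake model name and uses the new --no- boolean form, then checks the new prefix-cache counters on the metrics endpoint. The binary path and port are assumptions for illustration, not part of this release note.

```
# Assumed binary path and port; --model fake-model skips real HuggingFace
# tokenization, --no-enable-sleep-mode uses the new boolean negation form.
./bin/llm-d-inference-sim --model fake-model --port 8000 --no-enable-sleep-mode

# In another shell, look for the new counters among the exposed metrics:
curl -s http://localhost:8000/metrics | grep 'vllm:prefix_cache'
```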
What's Changed
- Introduce Tokenizer interface by @mayabar in #314
- fix hf models url by @mayabar in #316
- Set default value of --tokenizers-cache-dir to hf_cache by @mayabar in #317
- Tokenize all requests by @irar2 in #318
- Use real tokenization in echo mode by @irar2 in #319
- Echo Dataset by @irar2 in #322
- fix python error on hf tokenizer initialization by @mayabar in #321
- Return tokenized response in GetTokens by @irar2 in #323
- Use Tokenized in response by @irar2 in #324
- Handle gRPC requests by @irar2 in #326
- Metrics tpot channel size fix and new tests for errors by @irar2 in #328
- Dataset tool by @mayabar in #325
- Generation request and response types by @irar2 in #330
- update documentation by @mayabar in #329
- 🌱 Standardize governance workflows, tooling, and Dependabot by @clubanderson in #333
- 🌱 Remove legacy typo and link checker workflows by @clubanderson in #340
- docs(example): Fix indentation for POD_IP valueFrom field by @tarilabs in #348
- Update example of running the simulator in the documentation by @mayabar in #351
- 🌱 Remove orphaned .lychee.toml by @clubanderson in #352
- Refactor: separate token generation from response sending by @irar2 in #353
- Add tokens to kv events by @mayabar in #354
- Fix /chat/completion response in echo mode by @mayabar in #362
- Fix PR #362 by @mayabar in #365
- Add vllm:prefix_cache_hits and vllm:prefix_cache_queries counters by @InfraWhisperer in #358
- Add /v1/embeddings endpoint by @sbekkerm in #364
- Response builder by @irar2 in #372
- Read configuration in main by @irar2 in #373
- Separate simulator creation and start. Communication layer by @irar2 in #375
- 🌱 Remove per-repo gh-aw typo/link/upstream workflows by @clubanderson in #381
- Ignore data-parallel-size if data-parallel-rank is set by @irar2 in #376
- feat(http): add pod/namespace/request-id response headers to /embeddings by @sbekkerm in #374
- Separate communication (HTTP and gRPC) from the simulator code by @irar2 in #382
- Support functions for generating fake gauge metrics by @irar2 in #389
- Bug fix: fake metrics init by @irar2 in #391
- Refactoring: store channels along their names in a struct by @irar2 in #390
- Use kv cache 0.6.0 - tokenizer is stand alone + remove all python dependencies by @mayabar in #386
- fixes in makefile by @mayabar in #395
- Chat completion with kvcache by @mayabar in #396
- Support mm-encoder-only mode by @irar2 in #398
- Update readme by @irar2 in #401
- Add --no option for vLLM boolean command line parameters by @irar2 in #400
- Remove CGO dependency by migrating to pure-Go ZMQ+change in ci_pr_checks by @mayabar in #406
New Contributors
- @tarilabs made their first contribution in #348
- @InfraWhisperer made their first contribution in #358
- @sbekkerm made their first contribution in #364
Full Changelog: v0.7.0...v0.8.0
v0.7.1
v0.7.0
New Features
- Sleep mode
- Support for the vLLM --data-parallel-rank command line argument
- Change all latency configuration to Duration format
- Support cache threshold finish reason header to return cache_threshold finish reason
- ZeroMQ listener
- Support for X-Request-Id header in responses and logs
- New metrics
- max_num_generation_tokens
- cache_config_info
- inter_token_latency_seconds
- generation_tokens_total
- prompt_tokens_total
- Renamed metric
- gpu_cache_usage_perc renamed to kv_cache_usage_perc
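The X-Request-Id support means a caller-supplied request ID is echoed back in the response headers and appears in the logs, which helps correlate client traces with simulator output. A hypothetical check against a locally running simulator (port and model name are assumptions):

```
# Send a request with an explicit X-Request-Id and inspect the response headers.
curl -si http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Request-Id: my-trace-123' \
  -d '{"model": "my-model", "prompt": "hi", "max_tokens": 5}' | grep -i 'x-request-id'
```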
What's Changed
- Add log verbosity levels by @mayabar in #241
- Update Readme file by @mayabar in #242
- Fix dataset test by @irar2 in #246
- Deterministic uuid by @irar2 in #245
- add metrics tests for latency metrics with remote prefill by @mayabar in #248
- Support vllm:max_num_generation_tokens metrics by @mayabar in #250
- Sleep mode by @irar2 in #252
- Support cache_config_info metric by @irar2 in #256
- Update logic of response generation in case of dataset usage by @mayabar in #257
- update PD support to be compatible with nixlv2 by @mayabar in #258
- docs: add P/D disaggregation example in manifests/disaggregation by @googs1025 in #253
- feat(metrics): add inter_token_latency_seconds by @googs1025 in #265
- Bump kv-cache-manager to v0.4.0-rc2 by @pierDipi in #261
- feat: add support for X-Request-Id header in responses and logs by @rudeigerc in #269
- Fix kvevents by @mayabar in #273
- Fix inter token latency test by @irar2 in #274
- Check parent field in /models response by @irar2 in #276
- fix: Use "cmpl-" prefix for /completions response IDs (#270) by @RohanDSkaria in #275
- Fix Makefile by @irar2 in #278
- Added support for the vLLM --data-parallel-rank command line argument by @shmuelk in #279
- Updated readme by @irar2 in #281
- feat(metrics): add generation_tokens_total and prompt_tokens_total metrics by @googs1025 in #268
- feat: change all latency configuration to Duration format by @setsunakute in #288
- Latency calculator interface by @irar2 in #286
- Refactor requests handling by @irar2 in #289
- Add a 'script' to set the PYTHONPATH env var by @shmuelk in #290
- Code reorganization by @irar2 in #291
- Request processor by @irar2 in #293
- Response context interface by @irar2 in #294
- Process requests in requestContext by @irar2 in #295
- Cleanup, renaming, code reorganization by @irar2 in #297
- feat: support cache threshold finish reason header to return cache_threshold finish reason by @kyanokashi in #296
- Readme update for running locally by @irar2 in #305
- Initial gRPC support by @irar2 in #306
- Rename metric vllm:gpu_cache_usage_perc to vllm:kv_cache_usage_perc by @irar2 in #307
- feat: cache hit threshold handling by @kyanokashi in #301
- Add zmq listener by @mayabar in #302
- Support streaming and latencies in gRPC by @irar2 in #315
New Contributors
- @pierDipi made their first contribution in #261
- @rudeigerc made their first contribution in #269
- @RohanDSkaria made their first contribution in #275
- @setsunakute made their first contribution in #288
- @kyanokashi made their first contribution in #296
Migrating from releases prior to v0.7.0
- Use the env-setup.sh script to set the PYTHONPATH env var when running the simulator locally
- Use duration format instead of millis for all latency configuration parameters
- Rename metric gpu_cache_usage_perc to kv_cache_usage_perc
- Bump kv-cache-manager to v0.4.0
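Concretely, the duration migration means rewriting plain-millisecond latency values as duration strings (e.g. 2000 becomes 2s). A minimal shell sketch of the conversion, assuming Go-style duration syntax where whole seconds are written as "Ns" and everything else stays in "Nms":

```shell
# Convert a millisecond latency value to a Go-style duration string:
# whole seconds become "Ns", anything else stays "Nms".
ms=2000
if [ $((ms % 1000)) -eq 0 ]; then
  dur="$((ms / 1000))s"
else
  dur="${ms}ms"
fi
echo "$dur"   # 2s
```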
Full Changelog: v0.6.1...v0.7.0
v0.6.1
What's Changed
- feat: Log probabilities support by @ruivieira in #221
- Add synchronization of freeing worker after stream request processing by @mayabar in #244
New Contributors
- @ruivieira made their first contribution in #221
Full Changelog: v0.6.0...v0.6.1
v0.6.0
What's Changed
- New requests queue by @irar2 in #214
- Make writing to channels non-blocking by @irar2 in #225
- Change packages' dependencies by @irar2 in #229
- Added port header to response by @irar2 in #232
- Test fix: number of running requests can be one request less when scheduling requests by @irar2 in #231
- fix occasional ttft and tpot metrics test failures by @mayabar in #233
- Configure the tool_choice option to use a specific tool by @MondayCha in #234
- Additional latency related metrics by @mayabar in #237
- Changed random from static to a field in the simulator by @irar2 in #238
- Made workers' requests channel non-blocking by @irar2 in #239
New Contributors
- @MondayCha made their first contribution in #234
Full Changelog: v0.5.2...v0.6.0
v0.5.2
What's Changed
- Use custom dataset as response source by @pancak3 in #200
- Add vllm:time_per_output_token_seconds and vllm:time_to_first_token_seconds metrics by @mayabar in #217
- Use openai-go v3.6.1 in the tests by @irar2 in #223
- feat(metrics): add request prompt, generation, max_tokens and success metrics by @googs1025 in #202
Full Changelog: v0.5.1...v0.5.2
v0.5.1
New Features
- The llm-d-inference-sim server can be run in TLS mode with the certificate and key supplied by the user or automatically generated.
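A sketch of a TLS run with a user-supplied certificate and key. The flag names shown (--ssl-certfile/--ssl-keyfile, borrowed from vLLM's convention) and the port are assumptions; check the README for the actual parameters.

```
# Run the simulator in TLS mode (flag names are an assumption; see README):
./bin/llm-d-inference-sim --model my-model --port 8443 \
  --ssl-certfile server.crt --ssl-keyfile server.key
```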
What's Changed
- Add golangci-lint version check by @npolshakova in #160
- feat(server): enables TLS mode by @bartoszmajsak in #205
- fix(make): properly resolves package manager for ZMQ installation by @bartoszmajsak in #204
- feat(make): simplifies local tooling installation by @bartoszmajsak in #203
New Contributors
- @bartoszmajsak made their first contribution in #205
Full Changelog: v0.5.0...v0.5.1
v0.5.0
New features
- Processing time is affected by server load
- Change TTFT parameter to be based on number of request tokens
- KV cache affects prefill time
- Support failure injection
- Implement kv-cache usage and waiting loras Prometheus metrics
- Randomize response length when max-tokens is defined in the request
- Support DP (data parallel)
- Support /tokenize endpoint
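The /tokenize endpoint mirrors vLLM's: POST a model and prompt, and the token IDs come back in the response body. A sketch against a locally running simulator (port and model name are assumptions):

```
curl -s http://localhost:8000/tokenize \
  -H 'Content-Type: application/json' \
  -d '{"model": "my-model", "prompt": "Hello, world"}'
```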
What's Changed
- Fix server interrupt by @npolshakova in #161
- Show final config in simulator default logger at Info level by @pancak3 in #154
- Cast bounds type in tests to func def: latency, interToken, and timeToFirst (to int) by @pancak3 in #163
- Remove unnecessary deferral of server close by @pancak3 in #162
- Fix: Rand generator is not set in a test suite, which results in accessing a nil pointer at runtime when running only that test suite by @pancak3 in #166
- Use channels for metrics updates, added metrics tests by @irar2 in #171
- Remove rerun on comment action by @irar2 in #174
- Add failure injection mode to simulator by @smarunich in #131
- Add waiting loras list to loraInfo metrics by @mayabar in #175
- feat: generate response length based on a histogram when max_tokens is defined in the request by @mayabar in #169
- Extend response length buckets calculation to allow buckets that are not necessarily equally sized by @mayabar in #176
- Use dynamic ports in zmq tests by @pancak3 in #170
- Change time-to-first-token parameter to be based on number of request tokens #137 by @pancak3 in #165
- Bugfix: was accessing number of tokens from nil var; getting it from req instead by @pancak3 in #177
- feat: add helm charts for Kubernetes deployment by @Blackoutta in #182
- chore: Make the image smaller by @shmuelk in #183
- Take cached prompt tokens into account in prefill time calculation by @irar2 in #184
- Add ignore eos in request by @pancak3 in #187
- Support DP by @irar2 in #188
- Change RandomNorm from float types to int by @pancak3 in #190
- KV cache usage metric by @irar2 in #192
- Adjust request "processing time" to current load by @pancak3 in #189
- Updates for the new release of kv-cache-manager by @irar2 in #194
- DP bug fix: wait after starting rank 0 sim by @irar2 in #193
- Support /tokenize endpoint by @irar2 in #198
- add Service to expose vLLM deployment and update doc by @googs1025 in #201
- Split simulator.go into several files by @irar2 in #199
New Contributors
- @smarunich made their first contribution in #131
- @Blackoutta made their first contribution in #182
- @googs1025 made their first contribution in #201
Full Changelog: v0.4.0...v0.5.0