Releases: llm-d/llm-d-inference-sim

v0.8.2

31 Mar 09:31
a311a6c

⚠️ Migration Notes (for users upgrading from versions prior to v0.8.0)
Please see v0.8.0 Release Notes

What's Changed

  • Send response context if there are no tokens by @irar2 in #412

Full Changelog: v0.8.1...v0.8.2

v0.8.1

29 Mar 06:32
5cce552

⚠️ Migration Notes (for users upgrading from versions prior to v0.8.0)

Please see v0.8.0 Release Notes

What's Changed

  • Refactor configuration: support pod name, namespace, and dev mode as configuration parameters by @Mrudhulraj in #402
  • Fix bug in dp server start by @irar2 in #410

Full Changelog: v0.8.0...v0.8.1

v0.8.0

26 Mar 07:03
eedfce4

⚠️ Important Changes

Please read before upgrading.

What’s new:

New dependency: the tokenizer is now a standalone application that should run as a sidecar process.

For details, see README.md

Deprecated command line parameters:

  • tokenizers-cache-dir
  • zmq-max-connect-attempts

New Features

  • New endpoint /v1/embeddings (see the sketch after this list)
  • gRPC support (see the documentation for details)
  • /chat/completions works with --enable-kvcache
  • Added support for --mm-encoder-only
  • Support a --no- prefix for boolean vLLM config parameters
    • no-enable-sleep-mode
    • no-mm-encoder-only
    • no-enforce-eager
    • no-enable-prefix-caching
  • Fake metrics: support functions for gauges
  • Dataset structure updated; the dataset tool is updated accordingly
  • All requests are tokenized using the model defined in the configuration. Important: to avoid the time and network overhead of HuggingFace tokenization, use a "fake" or non-existent model name (e.g., --model fake-model).
  • KV events extended to include tokens
  • New metrics
    • vllm:prefix_cache_hits
    • vllm:prefix_cache_queries
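
As a rough illustration of the new /v1/embeddings endpoint, the snippet below is a minimal sketch that exercises it with a plain HTTP client. The host, port, and model name are placeholder assumptions, not values taken from these notes.

```python
# Minimal sketch: call the simulator's new /v1/embeddings endpoint.
# localhost:8000 and "my-model" are placeholder assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "my-model", "input": "hello world"},
    timeout=10,
)
resp.raise_for_status()
# OpenAI-style response shape: {"data": [{"embedding": [...], ...}], ...}
print(resp.json()["data"][0]["embedding"][:5])
```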

What's Changed

  • Introduce Tokenizer interface by @mayabar in #314
  • fix hf models url by @mayabar in #316
  • Set default value of --tokenizers-cache-dir to hf_cache by @mayabar in #317
  • Tokenize all requests by @irar2 in #318
  • Use real tokenization in echo mode by @irar2 in #319
  • Echo Dataset by @irar2 in #322
  • fix python error on hf tokenizer initialization by @mayabar in #321
  • Return tokenized response in GetTokens by @irar2 in #323
  • Use Tokenized in response by @irar2 in #324
  • Handle gRPC requests by @irar2 in #326
  • Metrics tpot channel size fix and new tests for errors by @irar2 in #328
  • Dataset tool by @mayabar in #325
  • Generation request and response types by @irar2 in #330
  • update documentation by @mayabar in #329
  • 🌱 Standardize governance workflows, tooling, and Dependabot by @clubanderson in #333
  • 🌱 Remove legacy typo and link checker workflows by @clubanderson in #340
  • docs(example): Fix indentation for POD_IP valueFrom field by @tarilabs in #348
  • Update example of running the simulator in the documentation by @mayabar in #351
  • 🌱 Remove orphaned .lychee.toml by @clubanderson in #352
  • Refactor: separate token generation from response sending by @irar2 in #353
  • Add tokens to kv events by @mayabar in #354
  • Fix /chat/completions response in echo mode by @mayabar in #362
  • Fix PR #362 by @mayabar in #365
  • Add vllm:prefix_cache_hits and vllm:prefix_cache_queries counters by @InfraWhisperer in #358
  • Add /v1/embeddings endpoint by @sbekkerm in #364
  • Response builder by @irar2 in #372
  • Read configuration in main by @irar2 in #373
  • Separate simulator creation and start. Communication layer by @irar2 in #375
  • 🌱 Remove per-repo gh-aw typo/link/upstream workflows by @clubanderson in #381
  • Ignore data-parallel-size if data-parallel-rank is set by @irar2 in #376
  • feat(http): add pod/namespace/request-id response headers to /embeddings by @sbekkerm in #374
  • Separate communication (HTTP and gRPC) from the simulator code by @irar2 in #382
  • Support functions for generating fake gauge metrics by @irar2 in #389
  • Bug fix: fake metrics init by @irar2 in #391
  • Refactoring: store channels along their names in a struct by @irar2 in #390
  • Use kv cache 0.6.0 - tokenizer is stand alone + remove all python dependencies by @mayabar in #386
  • fixes in makefile by @mayabar in #395
  • Chat completion with kvcache by @mayabar in #396
  • Support mm-encoder-only mode by @irar2 in #398
  • Update readme by @irar2 in #401
  • Add --no option for vLLM boolean command line parameters by @irar2 in #400
  • Remove CGO dependency by migrating to pure-Go ZMQ+change in ci_pr_checks by @mayabar in #406

Full Changelog: v0.7.0...v0.8.0

v0.7.1

27 Jan 12:37
3140e66

Full Changelog: v0.7.0...v0.7.1

v0.7.0

25 Jan 07:04
44887c2

New Features

  • Sleep mode
  • Support for the vLLM --data-parallel-rank command line argument
  • All latency configuration parameters changed to duration format
  • Support for a cache-threshold finish-reason header that causes the cache_threshold finish reason to be returned
  • ZeroMQ listener
  • Support for the X-Request-Id header in responses and logs (see the sketch after this list)
  • New metrics
    • max_num_generation_tokens
    • cache_config_info
    • inter_token_latency_seconds
    • generation_tokens_total
    • prompt_tokens_total
  • Renamed metric
    • gpu_cache_usage_perc renamed to kv_cache_usage_perc
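
A minimal sketch of the X-Request-Id behavior, under the assumption that the simulator echoes the header back on the response; the host, port, and model name are placeholders, not values taken from these notes.

```python
# Minimal sketch: supply an X-Request-Id and check whether the simulator
# echoes it back (localhost:8000 and "my-model" are placeholder assumptions).
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    headers={"X-Request-Id": "my-trace-id-123"},
    json={"model": "my-model", "prompt": "hi", "max_tokens": 5},
    timeout=10,
)
print(resp.headers.get("X-Request-Id"))  # expected: my-trace-id-123
```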

Migrating from releases prior to v0.7.0

  • Use the env-setup.sh script to set the PYTHONPATH environment variable when running the simulator locally
  • Use duration format instead of milliseconds for all latency configuration parameters (e.g., a value of 2000 becomes 2s)
  • Rename metric gpu_cache_usage_perc to kv_cache_usage_perc
  • Bump kv-cache-manager to v0.4.0

Full Changelog: v0.6.1...v0.7.0

v0.6.1

30 Oct 14:47
658e3e5

What's Changed

  • feat: Log probabilities support by @ruivieira in #221
  • Add synchronization of freeing a worker after stream request processing by @mayabar in #244

Full Changelog: v0.6.0...v0.6.1

v0.6.0

29 Oct 11:06
9a57299

What's Changed

  • New requests queue by @irar2 in #214
  • Make writing to channels non-blocking by @irar2 in #225
  • Change packages' dependencies by @irar2 in #229
  • Added port header to response by @irar2 in #232
  • Test fix: number of running requests can be one request less when scheduling requests by @irar2 in #231
  • fix occasional ttft and tpot metrics test failures by @mayabar in #233
  • Configure the tool_choice option to use a specific tool by @MondayCha in #234 (see the sketch after this list)
  • Additional latency related metrics by @mayabar in #237
  • Changed random from static to a field in the simulator by @irar2 in #238
  • Made workers' requests channel non-blocking by @irar2 in #239
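
For the tool_choice change in #234, a minimal sketch of pinning a request to a named tool via the OpenAI-compatible /v1/chat/completions endpoint; the host, port, model name, and tool definition are placeholder assumptions.

```python
# Minimal sketch: force a specific tool via tool_choice.
# localhost:8000, "my-model", and get_weather are placeholder assumptions.
import requests

payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
            },
        },
    }],
    # Standard OpenAI format for selecting one specific tool:
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
}
resp = requests.post(
    "http://localhost:8000/v1/chat/completions", json=payload, timeout=10
)
print(resp.json()["choices"][0]["message"].get("tool_calls"))
```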

Full Changelog: v0.5.2...v0.6.0

v0.5.2

22 Oct 07:48
1c3d559

What's Changed

  • Use custom dataset as response source by @pancak3 in #200
  • Add vllm:time_per_output_token_seconds and vllm:time_to_first_token_seconds metrics by @mayabar in #217 (see the sketch after this list)
  • Use openai-go v3.6.1 in the tests by @irar2 in #223
  • feat(metrics): add request prompt, generation, max_tokens and success metrics by @googs1025 in #202
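
A minimal sketch of reading the new latency histograms, assuming the simulator exposes a standard Prometheus /metrics endpoint on localhost:8000 (an assumption, not stated in these notes).

```python
# Minimal sketch: scrape /metrics and print the new latency histogram lines.
# A Prometheus-style /metrics endpoint on localhost:8000 is assumed.
import requests

text = requests.get("http://localhost:8000/metrics", timeout=10).text
for line in text.splitlines():
    if line.startswith(("vllm:time_per_output_token_seconds",
                        "vllm:time_to_first_token_seconds")):
        print(line)
```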

Full Changelog: v0.5.1...v0.5.2

v0.5.1

18 Sep 15:08
b8eb7a4

New Features

  • The llm-d-inference-sim server can run in TLS mode, with the certificate and key either supplied by the user or generated automatically. A sketch of connecting to a TLS-enabled server follows.
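
A minimal sketch of querying a TLS-enabled simulator, assuming it serves the OpenAI-compatible /v1/models endpoint on localhost:8000 (both assumptions). Verification is skipped here only because the auto-generated certificate is self-signed; with a user-supplied certificate, point verify at its CA bundle instead.

```python
# Minimal sketch: query a TLS-enabled simulator (localhost:8000 assumed).
# verify=False fits the auto-generated self-signed certificate case only;
# for a user-supplied certificate, use verify="/path/to/ca.pem".
import requests

resp = requests.get("https://localhost:8000/v1/models", verify=False, timeout=10)
print(resp.status_code, resp.json())
```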

Full Changelog: v0.5.0...v0.5.1

v0.5.0

16 Sep 06:54
9c541b9

New Features

  • Processing time is affected by server load
  • Change the TTFT parameter to be based on the number of request tokens
  • KV cache affects prefill time
  • Support failure injection
  • Implement kv-cache usage and waiting loras Prometheus metrics
  • Randomize response length when max_tokens is defined in the request
  • Support DP (data parallel)
  • Support the /tokenize endpoint (see the sketch after this list)
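
For the new /tokenize endpoint, a minimal sketch modeled on vLLM's tokenize API; the host, port, model name, and exact request shape are assumptions rather than details from these notes.

```python
# Minimal sketch: call the new /tokenize endpoint.
# localhost:8000, "my-model", and the request shape are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/tokenize",
    json={"model": "my-model", "prompt": "Hello, world!"},
    timeout=10,
)
print(resp.json())  # expected shape: token count and/or token list
```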

What's Changed

  • Fix server interrupt by @npolshakova in #161
  • Show final config in simulator default logger at Info level by @pancak3 in #154
  • Cast bounds type in tests to func def: latency, interToken, and timeToFirst (to int) by @pancak3 in #163
  • Remove unnecessary deferral of server close by @pancak3 in #162
  • Fix: rand generator is not set in a test suite, which results in accessing a nil pointer at runtime when only that test suite is run by @pancak3 in #166
  • Use channels for metrics updates, added metrics tests by @irar2 in #171
  • Remove rerun on comment action by @irar2 in #174
  • Add failure injection mode to simulator by @smarunich in #131
  • Add waiting loras list to loraInfo metrics by @mayabar in #175
  • feat: generate response length based on a histogram when max_tokens is defined in the request by @mayabar in #169
  • Extend response length bucket calculation to allow buckets that are not necessarily equally sized by @mayabar in #176
  • Use dynamic ports in zmq tests by @pancak3 in #170
  • Change time-to-first-token parameter to be based on number of request tokens #137 by @pancak3 in #165
  • Bugfix: was accessing number of tokens from nil var; getting it from req instead by @pancak3 in #177
  • feat: add helm charts for Kubernetes deployment by @Blackoutta in #182
  • chore: Make the image smaller by @shmuelk in #183
  • Take cached prompt tokens into account in prefill time calculation by @irar2 in #184
  • Add ignore eos in request by @pancak3 in #187
  • Support DP by @irar2 in #188
  • Change RandomNorm from float types to int by @pancak3 in #190
  • KV cache usage metric by @irar2 in #192
  • Adjust request "processing time" to current load by @pancak3 in #189
  • Updates for the new release of kv-cache-manager by @irar2 in #194
  • DP bug fix: wait after starting rank 0 sim by @irar2 in #193
  • Support /tokenize endpoint by @irar2 in #198
  • add Service to expose vLLM deployment and update doc by @googs1025 in #201
  • Split simulator.go into several files by @irar2 in #199

Full Changelog: v0.4.0...v0.5.0