Parallelize tokenizer and metadata loading to improve engine initialization latency.#1568
Closed
copybara-service[bot] wants to merge 1165 commits into main from
Conversation
LiteRT-LM-PiperOrigin-RevId: 857335054
LiteRT-LM-PiperOrigin-RevId: 857350704
LiteRT-LM-PiperOrigin-RevId: 857357241
LiteRT-LM-PiperOrigin-RevId: 857423226
Follow the set_* pattern to make it scalable, instead of adding all values to a single C function. LiteRT-LM-PiperOrigin-RevId: 857689534
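The set_* pattern referenced above (one small setter per option, rather than one C function that takes every value) can be sketched as follows. This is a hedged illustration only: the class and option names below are hypothetical, not the real LiteRT-LM C API.

```python
class EngineSettings:
    """Illustrative sketch of the set_* pattern (hypothetical names).

    Each option gets its own setter, so new options can be added
    without changing any existing function signature.
    """

    def __init__(self):
        self.num_threads = None
        self.cache_dir = None

    def set_num_threads(self, n: int) -> "EngineSettings":
        self.num_threads = n
        return self  # return self to allow chaining

    def set_cache_dir(self, path: str) -> "EngineSettings":
        self.cache_dir = path
        return self


# Adding a third option later means adding one more setter,
# not widening a single monolithic create function.
settings = EngineSettings().set_num_threads(4).set_cache_dir("/tmp/litert")
```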
LiteRT-LM-PiperOrigin-RevId: 858607999
LiteRT-LM-PiperOrigin-RevId: 858656280
LiteRT-LM-PiperOrigin-RevId: 858718908
LiteRT-LM-PiperOrigin-RevId: 858742888
LiteRT-LM-PiperOrigin-RevId: 858748258
LiteRT-LM-PiperOrigin-RevId: 858802033
LiteRT-LM-PiperOrigin-RevId: 858833664
LiteRT-LM-PiperOrigin-RevId: 859090887
Jetson doesn't have std::logf(). Use std::log(static_cast<float>()) instead. LiteRT-LM-PiperOrigin-RevId: 859148505
- Transpose mask in input handling for fixed gpu kernel size - Fix exported symbols from libLiteRt.so on linux_arm64 LiteRT-LM-PiperOrigin-RevId: 859164020
LiteRT-LM-PiperOrigin-RevId: 859247355
Currently, tools can be implemented with Kotlin functions only. This is easy to use but does not give developers full control over the tool spec. Users who need fine-grained control over the tool description and execution can now use the new `OpenApiTool`. For examples of how to implement it, see the `README.md`. LiteRT-LM-PiperOrigin-RevId: 859271963
- Last sampler shlibs have some regressions LiteRT-LM-PiperOrigin-RevId: 859350407
LiteRT-LM-PiperOrigin-RevId: 859642028
When the context length is big, it increases init time noticeably, e.g. 8s when context length = 32k for gemma3-1b. LiteRT-LM-PiperOrigin-RevId: 859666933
LiteRT-LM-PiperOrigin-RevId: 859770885
LiteRT-LM-PiperOrigin-RevId: 859859182
LiteRT-LM-PiperOrigin-RevId: 859875851
LiteRT-LM-PiperOrigin-RevId: 859999259
LiteRT-LM-PiperOrigin-RevId: 860120569
The given option is true by default and is used so that the mmap'ed memory for shared weights is swapped out to reduce memory footprint. When memory is swapped out, all the temporary changes made by magic numbers are reverted. So, when magic numbers are used, the given flag must be disabled. LiteRT-LM-PiperOrigin-RevId: 860150634
LiteRT-LM-PiperOrigin-RevId: 860170788
LiteRT-LM-PiperOrigin-RevId: 860216677
LiteRT-LM-PiperOrigin-RevId: 860278850
We will remove the `--hk_token` flag. The environment variable is the only way to set the token. LiteRT-LM-PiperOrigin-RevId: 860319701
LiteRT-LM-PiperOrigin-RevId: 878014141
LiteRT-LM-PiperOrigin-RevId: 878052506
LiteRT-LM-PiperOrigin-RevId: 878095579
LiteRT-LM-PiperOrigin-RevId: 878143704
LiteRT-LM-PiperOrigin-RevId: 878182837
LiteRT-LM-PiperOrigin-RevId: 878195871
Support end of vision tflite model in executors LiteRT-LM-PiperOrigin-RevId: 878252087
Integrate model downloading into macOS, Windows, and Linux CI workflows using a Gemma 3 1B IT model from Hugging Face. Introduce a cross-platform Pytest framework to execute litert_lm_main with a single prompt, serving as an E2E smoke test to verify basic inference functionality and expected output. LiteRT-LM-PiperOrigin-RevId: 878581563
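The E2E smoke test described above can be sketched roughly as below. The binary name `litert_lm_main` comes from the commit message, but the flag names and helper functions here are assumptions for illustration; the real invocation lives in the repo's CI config.

```python
import subprocess


def build_cmd(binary: str, model_path: str, prompt: str) -> list:
    # Flag names are assumed for illustration; check the actual
    # litert_lm_main flags before relying on these.
    return [binary, f"--model_path={model_path}", f"--input_prompt={prompt}"]


def looks_like_inference_output(stdout: str) -> bool:
    # A smoke test only verifies that some non-empty response came back,
    # not that the model answered correctly.
    return len(stdout.strip()) > 0


def run_smoke_test(binary: str, model_path: str, prompt: str = "What is 2+2?"):
    result = subprocess.run(
        build_cmd(binary, model_path, prompt),
        capture_output=True,
        text=True,
        timeout=600,
    )
    assert result.returncode == 0, result.stderr
    assert looks_like_inference_output(result.stdout)
```

Keeping `build_cmd` and the output check as pure helpers makes the same test file runnable unchanged on macOS, Windows, and Linux runners, with only the binary path varying per platform.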
LiteRT-LM-PiperOrigin-RevId: 878609747
LiteRT-LM-PiperOrigin-RevId: 878717839
LiteRT-LM-PiperOrigin-RevId: 878816159
LiteRT-LM-PiperOrigin-RevId: 878885676
LiteRT-LM-PiperOrigin-RevId: 879067681
LiteRT-LM-PiperOrigin-RevId: 879146688
LiteRT-LM-PiperOrigin-RevId: 879202010
LiteRT-LM-PiperOrigin-RevId: 879221034
LiteRT-LM-PiperOrigin-RevId: 879268560
LiteRT-LM-PiperOrigin-RevId: 879452695
…y issue on specific GPUs LiteRT-LM-PiperOrigin-RevId: 879665447
This change refactors the `Backend` API in `Config.kt` from an enum
to a sealed class to support backend-specific configurations directly
within the backend definition.
**Key changes include:**
* **Backend Sealed Class:** `Backend` is now a sealed class with three
variants: `CPU(val numThreads: Int)`, `NPU()`,
and `GPU()`.
* **JNI Configuration:** Updated `LiteRtLmJni.nativeCreateEngine` and
`litertlm.cc` to accept and process the `num_threads` parameter, which
is mapped to `number_of_threads` in the C++ `CpuConfig`.
* **Tests & Examples Migration:** Updated all test cases (e.g.,
`SessionTest`, `DeviceTest`, `BaseDeviceTest`) and example scripts
(`Main.kt`, `ToolMain.kt`, `BenchmarkMain.kt`) to instantiate the new
`Backend.CPU()` and `Backend.NPU(...)` data classes.
LiteRT-LM-PiperOrigin-RevId: 879674985
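The point of the enum-to-sealed-class move is that each backend variant can carry its own configuration. A rough Python analogue of that design, using frozen dataclasses in place of Kotlin's sealed class (the variant names follow the commit; the dispatch function is illustrative):

```python
from dataclasses import dataclass
from typing import Union


# Each variant can carry backend-specific fields, unlike a plain enum.
@dataclass(frozen=True)
class CPU:
    num_threads: int = 4  # default is illustrative


@dataclass(frozen=True)
class NPU:
    pass


@dataclass(frozen=True)
class GPU:
    pass


Backend = Union[CPU, NPU, GPU]


def to_native_config(backend: Backend) -> dict:
    # Dispatch on the variant, mirroring Kotlin's `when (backend)`;
    # CPU's num_threads maps to the C++ CpuConfig's number_of_threads.
    if isinstance(backend, CPU):
        return {"backend": "cpu", "number_of_threads": backend.num_threads}
    if isinstance(backend, NPU):
        return {"backend": "npu"}
    return {"backend": "gpu"}
```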
LiteRT-LM-PiperOrigin-RevId: 879706615
LiteRT-LM-PiperOrigin-RevId: 879824143
This change refactors the LiteRT LM Python API to provide a more idiomatic and
simplified interface for users. Key changes include:
- Abstract Base Classes (ABC): Introduced AbstractEngine and AbstractConversation
in a new interfaces.py module. The C++ implementation classes are registered as
virtual subclasses, allowing for proper inheritance checks (e.g.,
isinstance(engine, AbstractEngine)) across the C++/Python boundary.
- Simplified Engine & Conversation Lifecycle:
- Users can now directly instantiate litert_lm.Engine with configuration
parameters, which internally handles ModelAssets and EngineSettings.
- Added engine.create_conversation() to eliminate the need for manual
ConversationConfig and Conversation.create calls.
- send_message and send_message_async now support both str and dict inputs. String inputs are automatically wrapped into a user-role message.
- Moved the Backend enum to Python (interfaces.py).
- Internal Refactoring: Updated C++ bindings to support the new factory methods and
simplified logging initialization using a mapping-based approach.
- Enhanced Testing: Updated existing tests to the new API and added new test cases
for ABC inheritance, simplified conversation creation, and string input support.
LiteRT-LM-PiperOrigin-RevId: 879841369
This change moves the NPU native library directory configuration from the global `ExperimentalFlags` to the `Backend.NPU` class. This enables each NPU backend (main, vision, and audio) to specify its own native library directory, which is essential for multi-modal models that may utilize different NPU backends or separate library instances.
**Key changes:**
- Updated `Backend.NPU` in `Config.kt` to a data class with a `nativeLibraryDir` property.
- Deprecated `ExperimentalFlags.npuLibrariesDir` while maintaining it as a fallback for backward compatibility.
- Updated the JNI layer (`nativeCreateEngine` in `litertlm.cc` and `LiteRtLmJni.kt`) to support separate native library directories for the main, vision, and audio executors.
- Adjusted `Engine.kt` initialization logic to resolve the native library directory per backend, prioritizing the value in `Backend.NPU` over the global experimental flag.
- Updated `BaseDeviceTest.kt` to utilize the new configuration pattern.
LiteRT-LM-PiperOrigin-RevId: 879851457
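The resolution order above (per-backend value first, deprecated global flag as fallback) reduces to a small pure function. The sketch below uses names adapted from the Kotlin API in the commit; the function name itself is hypothetical.

```python
from typing import Optional


def resolve_native_library_dir(
    backend_dir: Optional[str],
    deprecated_global_dir: Optional[str],
) -> Optional[str]:
    # Prefer the directory set on Backend.NPU itself; fall back to the
    # deprecated ExperimentalFlags value for backward compatibility.
    if backend_dir is not None:
        return backend_dir
    return deprecated_global_dir
```

Because each executor (main, vision, audio) calls this with its own `backend_dir`, multi-modal models can load separate NPU library instances while old callers that only set the global flag keep working.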
LiteRT-LM-PiperOrigin-RevId: 879871562
…zation latency. LiteRT-LM-PiperOrigin-RevId: 879423683