feat(wasm): add tools/wasm/ Emscripten entrypoint for browser-resident inference #15
Open
wordingone wants to merge 498 commits into
* Add MCP Connection diagnostics and CORS hint to web-ui
* tidy up test
* webui: Refactor and improve MCP diagnostic logging

---------

Co-authored-by: evalstate <[email protected]>
* webui: add setting for first-line chat titles

  Add an opt-in setting (`titleGenerationUseFirstLine`) to use the first non-empty line of a prompt as the generated conversation title. Previously, the complete multi-line prompt was being used, which created long titles for complex queries. Coupled with "Ask for confirmation before changing conversation title", the dialog would overflow.

* Update tools/server/webui/src/lib/utils/text.ts

  Co-authored-by: Aleksander Grygier <[email protected]>

* Update tools/server/webui/src/lib/utils/text.ts

  Co-authored-by: Aleksander Grygier <[email protected]>

* webui: Run build to update the bundle

  As requested in: ggml-org#21797 (review)

* webui: Fix missing import for NEWLINE_SEPARATOR

---------

Co-authored-by: Aleksander Grygier <[email protected]>
* CUDA: Limit DeviceSegmentedSort to immediate mode

  DeviceSegmentedSort is currently not capturable in a cuda graph. Hence, we have to go for the slower DeviceSegmentedRadixSort in that case.

  Perf numbers on RTX Pro 6000 Blackwell Max-Q:

  DeviceSegmentedRadixSort in graph mode (i.e. CUDA Graphs)
  ARGSORT(type=f32,ne=[2048,512,1,1],order=1): 12291 runs - 105.94 us/run - 8192 kB/run - 73.75 GB/s
  ARGSORT(type=f32,ne=[4096,512,1,1],order=1): 10245 runs - 115.08 us/run - 16384 kB/run - 135.77 GB/s
  ARGSORT(type=f32,ne=[8192,512,1,1],order=1): 5125 runs - 221.22 us/run - 32768 kB/run - 141.26 GB/s
  ARGSORT(type=f32,ne=[16384,512,1,1],order=1): 2565 runs - 430.98 us/run - 65536 kB/run - 145.02 GB/s
  ARGSORT(type=f32,ne=[32768,512,1,1],order=1): 1028 runs - 1185.83 us/run - 131072 kB/run - 105.41 GB/s
  ARGSORT(type=f32,ne=[65536,512,1,1],order=1): 387 runs - 2748.62 us/run - 262144 kB/run - 90.95 GB/s

  DeviceSegmentedSort in immediate mode
  ARGSORT(type=f32,ne=[2048,512,1,1],order=1): 16388 runs - 71.17 us/run - 8192 kB/run - 109.78 GB/s
  ARGSORT(type=f32,ne=[4096,512,1,1],order=1): 12294 runs - 81.38 us/run - 16384 kB/run - 192.00 GB/s
  ARGSORT(type=f32,ne=[8192,512,1,1],order=1): 5125 runs - 240.81 us/run - 32768 kB/run - 129.77 GB/s
  ARGSORT(type=f32,ne=[16384,512,1,1],order=1): 2565 runs - 406.60 us/run - 65536 kB/run - 153.71 GB/s
  ARGSORT(type=f32,ne=[32768,512,1,1],order=1): 1285 runs - 873.23 us/run - 131072 kB/run - 143.15 GB/s
  ARGSORT(type=f32,ne=[65536,512,1,1],order=1): 516 runs - 2288.46 us/run - 262144 kB/run - 109.24 GB/s

* Add test case for dispatch to DeviceSegmentedRadixSort

  We currently lack a way to force graph mode in CUDA, patch callback to invoke ggml_backend_compare_graph_backend twice to enforce each test to run in graph mode
) Signed-off-by: Adrien Gallouët <[email protected]>
…20797)

* use integer dot product for quantized KV flash attention
* small improvements
* fix SHMEM_STAGING indexing
* add missing KV type quants
* fixes
* add supported quants to FA tests
* readd fast paths for <8bit quants
* fix mmq gate and shmem checks
* docs: listing qwen3-asr and qwen3-omni as supported
* nits
* server: support OAI /v1/audio/transcriptions API
* address autoreview comments
* correct default response_format value
This adds nvfp4 support for get_rows, dequant, and mul_mat(_id). For mul_mat, it does not add support for the dp4/q8_1 path, it's all via fp16/fp32.
…ml-org#21870)

* common: skip reasoning budget sampler when no budget is requested

  After I added thinking_start_tag / thinking_end_tag for gemma4 in ggml-org#21697, the reasoning budget sampler gets unconditionally created even when no budget is configured (the default -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3 which also set these tags. The budget gets converted to INT_MAX, so the sampler never actually forces any tokens but still runs per-token checks (start tag matching in IDLE state, token-to-piece conversion + UTF-8 checks in COUNTING state).

  More importantly, the mere existence of the sampler (non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU every token. This could explain the 30% speed regression reported in ggml-org#21784 (98 t/s to 70 t/s on Vulkan).

  So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before.

* common: preserve rbudget when grammar is lazy

  Following up on the review feedback on ggml-org#21870: keep the reasoning budget sampler when grammar_lazy is true, so the thinking-block grammar suppression from ggml-org#20970 still works when tools are in use. This way, we only skip the sampler when both no budget is set AND grammar is not lazy.
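The creation condition described in this commit can be sketched as a small predicate. This is only an illustration under stated assumptions: `sketch_sampling_params` and `sketch_needs_rbudget` are hypothetical names, the struct is a made-up subset of the real sampling parameters, and requiring both thinking tags to be non-empty is an assumption drawn from the commit text rather than from the actual code.

```cpp
#include <string>

// Hypothetical subset of the sampling parameters discussed in the commit message.
struct sketch_sampling_params {
    std::string thinking_start_tag;      // empty when the chat template defines no thinking block
    std::string thinking_end_tag;
    int  reasoning_budget_tokens = -1;   // -1 means "no budget requested"
    bool grammar_lazy            = false;
};

// Returns true when the reasoning budget sampler should be created.
// Assumption: the sampler only makes sense when both thinking tags are set.
static bool sketch_needs_rbudget(const sketch_sampling_params & p) {
    const bool has_tags   = !p.thinking_start_tag.empty() && !p.thinking_end_tag.empty();
    const bool has_budget = p.reasoning_budget_tokens >= 0;
    // Skip the sampler only when no budget is set AND the grammar is not lazy.
    return has_tags && (has_budget || p.grammar_lazy);
}
```

With the default budget of -1 and a non-lazy grammar the predicate is false, which is the case where backend sampling would stay enabled.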
…gml-org#21644)

* Update register tiling matmul to use f32 accumulation
* fix profiling code
* Fix register tiling matmul for chrome, i'm blaming dawn
* Update batch tuning value for iOS
* compile fix
* Fix use of new load function
* cmake: fix CMP0194 warning on Windows with MSVC

  Set CMP0194 policy to NEW before project() call in ggml/CMakeLists.txt to suppress the "MSVC is not an assembler for language ASM" warning introduced in CMake 4.1.

  The ggml project enables ASM globally for Metal (macOS) and KleidiAI (ARM) backends. On Windows/MSVC, no assembler sources are used, but CMake 4.1+ warns because cl.exe is not a valid ASM compiler.

  This follows the same pattern used in ggml-vulkan (CMP0114, CMP0147).

  Closes ggml-org#20311

* cmake: apply cisc's formatting suggestion

---------

Co-authored-by: texasich <[email protected]>
Signed-off-by: Adrien Gallouët <[email protected]>
* ci : re-enable mac workflows
* vulkan : fix compile warning
…device supports it (ggml-org#21572)

* vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it
* use FetchContent to get SPIRV-Headers
* Fetch spirv-headers unconditionally
* remove fetchcontent, rely on installed headers
* fix ubuntu job
* Update docs/build.md
* mtmd: add mtmd_image_tokens_get_decoder_pos() API
* consistent naming
* fix build
* ggml: correct placement of ggml-ext.h
* ggml : remove ggml-ext.h

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* hexagon: add async HMX worker

  Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX matmul with HVX dequant/DMA stages in the pipeline path, replacing the previous synchronous HMX calls that blocked the main thread.

* hexagon: cost-based VTCM chunk search for out-stationary matmul

* hexagon: fix futex race in hmx_worker_drain

  Store the boolean in a local variable to avoid loading the atomic twice.

* hex-mm: hmx optimize scatter/transpose and use HMX intrinsics

* hex-vmem: drop vmem limit a touch under 3GB on v73

* hexagon: add fwd declaration of htp_context

* hex-hmx: replace hmx-worker with hmx-queue that mimics dma-queue interface

  Simplifies the overall implementation and reduces thread wakeup roundtrips.

* hex-mm: add debug log to hmx work func called from hmx-queue

* Update hmx-queue.h

  Co-authored-by: Max Krasnyansky <[email protected]>

---------

Co-authored-by: Kim-Chyan Gan <[email protected]>
Co-authored-by: Max Krasnyansky <[email protected]>
Co-authored-by: Max Krasnyansky <[email protected]>
… Qwen 3.6 NextN

- Added detailed descriptions of AtomicChat `UDT` quantization process in NEXTN.md, including tensor-type file overrides and build entrypoints.
- Updated README.md to include optional UDT quant information and links to relevant documentation.
- Modified bench-matrix script to support combined GGUF benchmarking and added filtering options for benchmark modes.
- Improved summary output in the benchmarking script to include optional markdown headings and better formatting.
- Introduced a new environment variable `QWEN_UDT_ABLATION_AUTO` to control filtering for benchmark modes based on model versions.
- Refactored the `bench-qwen-udt-matrix-local.sh` script to improve clarity and structure, ensuring proper handling of model types and filtering.
- Updated `bench-qwen-udt-quality.sh` to support an optional second pass on chat-style text files, with a default sample chat calibration file included.
- Improved error handling in `get-wikitext-2.sh` for downloading and unzipping files.
- Added a new sample chat calibration file to enhance benchmarking capabilities.
…Qwen 3.6 NextN enhancements

- Revised NEXTN.md to highlight the new AtomicChat UDT collection, detailing the combined `_MTP.gguf` quants and their benefits for NextN processing.
- Updated README.md to reflect changes in recommended sources for Qwen 3.6 models, emphasizing the AtomicChat UDT collection and its features.
- Enhanced quantization scripts to support improved file handling and added compatibility for new tensor types.
- Introduced a new script for running perplexity benchmarks on UDT quant models, generating detailed performance logs.
- Improved error handling and user feedback in various scripts to streamline the quantization and benchmarking processes.
…pp-turboquant

- Updated NEXTN.md to document the integration of `--mmproj` with speculative decoding types `mtp`, `nextn`, and `eagle3`, allowing coexistence on a single slot.
- Revised README.md to reflect the new multimodal capabilities and their implications for text and image processing.
- Added functions in `common/speculative.cpp` and `common/speculative.h` to check compatibility of speculative types with multimodal settings.
- Enhanced server context handling to manage multimodal prompts and ensure correct behavior during speculative decoding.
- Introduced a new script for running Gemma 4 with multimodal projector support, detailing expected behavior for text and image turns.
- Updated documentation in `docs/speculative.md` to clarify per-turn behavior and future roadmap for draft acceleration on vision turns.
Enhance multimodal support and speculative decoding in atomic-llama-c…
…t inference
Exposes libllama.a (with MTP + gemma4-assistant support) to the browser via
four EMSCRIPTEN_KEEPALIVE exports: wasm_llama_init, wasm_llama_health,
wasm_llama_chat_completion, wasm_llama_free_str.
Response shape: {choices:[...], _mtp_enabled:bool, _spec_accept_rate:null}
MTP threading requires SharedArrayBuffer (COOP/COEP) and global -pthread build.
Build: emcmake cmake + emmake make wasm-llama -> .html/.js/.wasm (2.6M)
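
The `char*`-returning exports paired with `wasm_llama_free_str` suggest a caller-frees ownership contract. A minimal sketch of that pattern follows, assuming that is indeed the contract. The names `sketch_health`, `sketch_free_str`, `dup_to_heap` and `g_mtp_loaded`, as well as the `"ok"` status value, are hypothetical stand-ins and not the code in `wasm_llama.cpp`.

```cpp
#include <cstdlib>
#include <cstring>
#include <string>

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
#else
#define EMSCRIPTEN_KEEPALIVE  // no-op so the sketch also compiles natively
#endif

// Hypothetical state flag; a real entrypoint would set this after loading the drafter.
static bool g_mtp_loaded = false;

// Copy a std::string into a malloc'd C string that the caller must free.
static char * dup_to_heap(const std::string & s) {
    char * out = static_cast<char *>(std::malloc(s.size() + 1));
    if (out == nullptr) {
        return nullptr;
    }
    std::memcpy(out, s.c_str(), s.size() + 1);  // includes the trailing NUL
    return out;
}

extern "C" {

// Returns a heap-allocated JSON string; ownership passes to the caller.
EMSCRIPTEN_KEEPALIVE char * sketch_health() {
    const std::string json = std::string("{\"status\":\"ok\",\"mtp_loaded\":") +
                             (g_mtp_loaded ? "true" : "false") + "}";
    return dup_to_heap(json);
}

// Releases a string previously returned by one of the exports.
EMSCRIPTEN_KEEPALIVE void sketch_free_str(char * ptr) {
    std::free(ptr);
}

}  // extern "C"
```

On the JavaScript side, a caller would presumably read the returned pointer with Emscripten's `UTF8ToString` and then hand the same pointer back to the free function, so the string is released on the wasm heap rather than leaked.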
Summary

Adds `tools/wasm/`, containing a C++ entrypoint that links `libllama.a` and exposes browser-callable functions via `EMSCRIPTEN_KEEPALIVE`.

Build: `emcmake cmake . && emmake make wasm-llama`

Artifacts: `wasm-llama.html`, `wasm-llama.js`, `wasm-llama.wasm` (2.6 MiB)

Exported functions

- `wasm_llama_init(target_path, drafter_path) -> int`
- `wasm_llama_health() -> char*`: `{status, mtp_loaded}` JSON
- `wasm_llama_chat_completion(request_json) -> char*`: `{choices, _mtp_enabled, _spec_accept_rate, _latency_ms, _tps}`
- `wasm_llama_free_str(ptr)`

Response shape

`{"choices":[{"message":{"role":"assistant","content":"..."}}],"_mtp_enabled":true,"_spec_accept_rate":null,"_latency_ms":1234,"_tps":4.2}`

- `_mtp_enabled` reflects whether `llama_model_load_mtp_from_file` succeeded.
- `_spec_accept_rate` is `null` until pthreads (SharedArrayBuffer + COOP/COEP) are wired globally; this is tracked as follow-up work.

Changes

- `tools/wasm/wasm_llama.cpp`: entrypoint implementation (~220 lines, no external deps beyond `llama.h`)
- `tools/wasm/CMakeLists.txt`: Emscripten-specific build configuration
- `tools/CMakeLists.txt`: wire `add_subdirectory(wasm)` in the `if (EMSCRIPTEN)` block (was previously empty)

Threading note

pthreads require all objects to be compiled with `-matomics -mbulk-memory` (i.e., `-pthread` at cmake-configure time). This PR compiles single-threaded, matching the current `tools/` build posture. Full pthread support (enabling MTP's `mtp_worker_loop`) is a separate cmake-level change.
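
For reference, the response shape listed in this description could be assembled without a JSON library roughly as follows. This is a sketch under stated assumptions, not the code in `wasm_llama.cpp`: `sketch_result`, `json_escape` and `build_response` are hypothetical names, and `_spec_accept_rate` is hard-coded to `null` to mirror the single-threaded build.

```cpp
#include <cstdio>
#include <string>

// Escape the characters that must be escaped inside a JSON string literal.
static std::string json_escape(const std::string & in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    // Remaining control characters use the \u00XX form.
                    char buf[8];
                    std::snprintf(buf, sizeof(buf), "\\u%04x", c);
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Hypothetical container for one completed generation.
struct sketch_result {
    std::string content;
    bool        mtp_enabled;
    long        latency_ms;
    double      tps;
};

// Serialize a result into the response shape from the PR description.
static std::string build_response(const sketch_result & r) {
    char tps_buf[32];
    std::snprintf(tps_buf, sizeof(tps_buf), "%.1f", r.tps);

    std::string out = "{\"choices\":[{\"message\":{\"role\":\"assistant\",\"content\":\"";
    out += json_escape(r.content);
    out += "\"}}],\"_mtp_enabled\":";
    out += (r.mtp_enabled ? "true" : "false");
    out += ",\"_spec_accept_rate\":null";  // stays null until pthreads are wired in
    out += ",\"_latency_ms\":" + std::to_string(r.latency_ms);
    out += ",\"_tps\":";
    out += tps_buf;
    out += "}";
    return out;
}
```

Keeping the serializer free of external dependencies is in line with the stated goal of no dependencies beyond `llama.h`; a real implementation might equally use a JSON library already vendored in the tree.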