qvac-fabric-llm.cpp

AI inference and training engine for desktop and mobile platforms.

qvac-fabric-llm.cpp is a specialized fork of llama.cpp optimized for embedded systems, mobile devices, and enterprise deployment scenarios. It extends the excellent foundation of llama.cpp with additional capabilities focused on low-bit KV-cache quantization, mobile GPU optimization, and flexible integration patterns.

Key Features

The following capabilities are developed and maintained as part of qvac-fabric-llm.cpp. Features marked as exclusive are not available in upstream llama.cpp.

TurboQuant KV Cache Quantization (exclusive)

TurboQuant adds low-bit KV-cache quantization formats for long-context inference while preserving token-generation quality close to higher-bit caches. It supports:

TBQ3_0 / TBQ4_0: TurboQuant 3-bit and 4-bit formats with QJL Stage 2 correction.
PQ3_0 / PQ4_0: PolarQuant 3-bit and 4-bit Stage 1-only formats.
64-wide variants: tbq3_0_64, tbq4_0_64, pq3_0_64, and pq4_0_64 for models with smaller attention head dimensions.
Backend support: CPU quantization/dequantization and Vulkan inference kernels, including attention paths and mixed K/V cache configurations. CUDA and Metal do not include TurboQuant kernels in this release.
Test coverage: performance coverage via test-kv-cache-quantization-perf.sh, perplexity coverage via test-kv-cache-quantization-perp.sh, and eval coverage via test-kv-cache-ruler.py, test-kv-cache-niah.py, test-kv-cache-longbench.py, test-kv-cache-zeroscrolls.py, and test-kv-cache-leval.py.

Qwen3.5-4B Q8_0 benchmark highlights:

Context	Cache config	BPW	RTX 5090 pp	RTX 5090 tg	Strix Halo pp	Strix Halo tg
2k	`f16/f16`	16.00	14,538 t/s (baseline)	236.02 t/s (baseline)	1,750 t/s (baseline)	42.30 t/s (baseline)
2k	`pq4_0/pq4_0`	4.25	0.96x	0.96x	0.94x	0.99x
2k	`pq3_0/pq3_0`	3.25	0.96x	0.96x	0.95x	0.99x
2k	`tbq4_0/pq4_0`	4.75	0.71x	0.94x	0.60x	0.97x
2k	`tbq3_0/pq3_0`	3.75	0.71x	0.94x	0.60x	0.98x
8k	`f16/f16`	16.00	13,981 t/s (baseline)	236.67 t/s (baseline)	1,585 t/s (baseline)	42.46 t/s (baseline)
8k	`pq4_0/pq4_0`	4.25	0.89x	0.96x	0.82x	0.99x
8k	`pq3_0/pq3_0`	3.25	0.89x	0.96x	0.85x	0.99x
8k	`tbq4_0/pq4_0`	4.75	0.42x	0.94x	0.32x	0.97x
8k	`tbq3_0/pq3_0`	3.75	0.43x	0.94x	0.32x	0.97x

Quality checks on Qwen3.5-4B Q8_0 show tbq4_0/pq4_0 at -0.03% perplexity delta versus f16/f16, with 94.8% RULER main score and 37.04 LongBench average versus 37.52 for f16/f16. See the full TurboQuant benchmark report for all measured models, contexts, and quality results.

LoRA Fine-Tuning (exclusive)

qvac-fabric-llm.cpp provides native LoRA (Low-Rank Adaptation) fine-tuning across CPU, Vulkan, and Metal backends. The training pipeline runs directly on consumer hardware, including mobile phones and integrated GPUs.

Multi-backend GPU training (NVIDIA, AMD, Intel, Apple, Mali, Adreno)
FP32, FP16, Q8, and Q4 training paths
Supervised instruction-tuning via assistant-only masked loss (SFT)
Checkpoint saving and resumable training
Learning rate schedulers (constant, cosine, linear) with warmup support
LoRA adapter merging into base models via llama-export-lora
Verified compatibility with Qwen3 and Gemma3 model architectures

For usage details and CLI reference, see the Finetuning Guide.

BitNet Inference and Fine-Tuning (exclusive)

Native support for BitNet ternary quantized models via the TQ2_0 data type, enabling efficient inference and LoRA fine-tuning of models such as bitnet_b1_58-xl on resource-constrained devices.

The official microsoft/BitNet inference framework provides optimized CPU kernels and GPU support limited to CUDA. qvac-fabric-llm.cpp extends BitNet to all major GPU backends -- Vulkan, Metal, and CPU -- bringing cross-platform GPU-accelerated BitNet inference and on-device fine-tuning to hardware not covered by the upstream framework, including Apple Silicon, mobile GPUs (Adreno, Mali), and AMD/Intel discrete GPUs. Compatible with models such as bitnet_b1_58-3B.

Backends: Vulkan, Metal and CPU TQ2_0 quantization support
Training: LoRA fine-tuning of BitNet models on Vulkan, Metal, and CPU backends
Conversion: HuggingFace-to-GGUF conversion for BitNet model architectures
Cooperative matrix (coopmat) support for Vulkan devices that expose the extension

Mobile GPU Optimization (exclusive)

Enhanced GPU support with targeted optimizations for Qualcomm Adreno GPUs.

Vulkan on Adreno 800+: quantized inference (Q4_0, Q8) and LoRA fine-tuning for Gemma3, Qwen3, and BitNet (TQ2_0) model architectures.
OpenCL on Adreno: inference support for all other model architectures.
Adreno-specific Vulkan shader variants for improved throughput.
Vulkan Memory Allocator (VMA) integration for efficient GPU memory management.

Quick Start

Building from Source

git clone https://github.com/tetherto/qvac-fabric-llm.cpp.git
cd qvac-fabric-llm.cpp

# Standard build
cmake -B build
cmake --build build --config Release

# With Vulkan support (Android, Windows, Linux)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# With Metal support (macOS, iOS)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release

For more detailed build instructions, see docs/build.md.

Running a Model

# Run a model with Vulkan GPU acceleration, 4096 context length, and a prompt
./build/bin/llama-cli -m model.gguf -ngl 99 -c 4096 -p "Explain quantum computing in simple terms"

Supported Platforms

Platform	Backend	Status
Linux (x86_64, ARM64)	CPU, Vulkan, CUDA	✅ Full support
macOS (Intel, Apple Silicon)	CPU, Metal	✅ Full support
Windows (x86_64)	CPU, Vulkan, CUDA	✅ Full support
Android (ARM64)	CPU, Vulkan, OpenCL	✅ Full support
iOS	CPU, Metal	✅ Full support

Relationship with llama.cpp

qvac-fabric-llm.cpp is a maintained fork of llama.cpp. The project regularly synchronizes with upstream releases to incorporate improvements, bug fixes, and new model support, while extending the engine with capabilities not present in the upstream project.

Current upstream baseline: llama.cpp b7248

Exclusive Features

The following features are developed in qvac-fabric-llm.cpp and are not available in upstream llama.cpp:

Feature	Description
TurboQuant KV cache quantization	TBQ3_0/TBQ4_0 with QJL correction plus PQ3_0/PQ4_0 Stage 1 formats; CPU quantization/dequantization and Vulkan inference kernels for low-bit KV-cache inference
LoRA fine-tuning	On-device training across CPU, Vulkan, and Metal with SFT, checkpointing, and LR scheduling
BitNet inference and training	TQ2_0 quantization on Vulkan, Metal, and CPU for inference and LoRA fine-tuning; extends microsoft/BitNet beyond its CUDA-only GPU support
Mobile GPU optimization	Adreno 800+ quantized inference (Q4_0, Q8), Adreno-specific Vulkan shader variants, VMA integration

Upstream Compatibility

All standard llama.cpp functionality, models, and APIs remain fully compatible.

Any GGUF model supported by llama.cpp is supported by qvac-fabric-llm.cpp
Existing llama.cpp documentation applies to all non-exclusive features

Contributing

We welcome contributions! Please see our development workflow:

Fork the repository
Create a feature branch from master
Submit a pull request

License

MIT License - see LICENSE for details.

qvac-fabric-llm.cpp is built on llama.cpp by Georgi Gerganov and contributors.

For additional documentation, refer to the llama.cpp documentation.

Dependencies

yhirose/cpp-httplib - Single-header HTTP server, used by llama-server - MIT license
stb-image - Single-header image format decoder, used by multimodal subsystem - Public domain
nlohmann/json - Single-header JSON library, used by various tools/examples - MIT License
minja - Minimal Jinja parser in C++, used by various tools/examples - MIT License
linenoise.cpp - C++ library that provides readline-like line editing capabilities, used by llama-run - BSD 2-Clause License
curl - Client-side URL transfer library, used by various tools/examples - CURL License
miniaudio.h - Single-header audio format decoder, used by multimodal subsystem - Public domain

Name		Name	Last commit message	Last commit date
Latest commit History 9,214 Commits
.devops		.devops
.gemini		.gemini
.github		.github
benches		benches
ci		ci
cmake		cmake
common		common
common_test		common_test
docs		docs
examples		examples
ggml		ggml
gguf-py		gguf-py
grammars		grammars
include		include
licenses		licenses
media		media
models		models
pocs		pocs
requirements		requirements
scripts		scripts
src		src
tests		tests
tools		tools
vendor		vendor
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.dockerignore		.dockerignore
.ecrc		.ecrc
.editorconfig		.editorconfig
.flake8		.flake8
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
AGENTS.md		AGENTS.md
AUTHORS		AUTHORS
CLAUDE.md		CLAUDE.md
CMakeLists.txt		CMakeLists.txt
CMakePresets.json		CMakePresets.json
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SECURITY.md		SECURITY.md
build-xcframework.sh		build-xcframework.sh
convert_hf_to_gguf.py		convert_hf_to_gguf.py
convert_hf_to_gguf_update.py		convert_hf_to_gguf_update.py
convert_llama_ggml_to_gguf.py		convert_llama_ggml_to_gguf.py
convert_lora_to_gguf.py		convert_lora_to_gguf.py
flake.lock		flake.lock
flake.nix		flake.nix
merging_strategy.md		merging_strategy.md
mypy.ini		mypy.ini
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
pyrightconfig.json		pyrightconfig.json
requirements.txt		requirements.txt
ty.toml		ty.toml
vulkan_profiling_analyzer.py		vulkan_profiling_analyzer.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

qvac-fabric-llm.cpp

Key Features

TurboQuant KV Cache Quantization (exclusive)

LoRA Fine-Tuning (exclusive)

BitNet Inference and Fine-Tuning (exclusive)

Mobile GPU Optimization (exclusive)

Quick Start

Building from Source

Running a Model

Supported Platforms

Relationship with llama.cpp

Exclusive Features

Upstream Compatibility

Contributing

License

Dependencies

About

Uh oh!

Releases 5

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

qvac-fabric-llm.cpp

Key Features

TurboQuant KV Cache Quantization (exclusive)

LoRA Fine-Tuning (exclusive)

BitNet Inference and Fine-Tuning (exclusive)

Mobile GPU Optimization (exclusive)

Quick Start

Building from Source

Running a Model

Supported Platforms

Relationship with llama.cpp

Exclusive Features

Upstream Compatibility

Contributing

License

Dependencies

About

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages