43 lines (34 loc) · 1.54 KB

todo

beta (jit)

goalpost: jit-compiled gpt2 matching pytorch performance

perf:

close rune grad performance gap (within <2x of pytorch)
close nx performance gaps (within <2x of numpy)

tolk:

integrate tolk as rune jit transformation
kernel fusion and optimization
cpu, cuda, metal backends

v1 (production)

goalpost: end-to-end train -> deploy as unikernel or static binary

training:

gradient accumulation
mixed precision (fp16/bf16 forward, fp32 master weights, loss scaling)
gradient checkpointing (rune.checkpoint, recompute activations in backward)
flash attention (tolk kernel and/or kaun.fn primitive)
parallel data loading (ocaml 5 domains, background prefetch)
layer completions: transposed conv, group norm, full conv2d stride/dilation/padding
onnx import (onnx -> tolk ir adapter, cover resnet/bert/gpt2/llama/vit/whisper ops)

deployment:

aot compilation: cpu (c via clang, musl static linking) and gpu (cuda/metal/opencl)
mimir: kv cache, continuous batching, pagedattention
mimir: http server (rest api, /health, /metrics, sigterm, structured logging)
post-training quantization (int8/int4, tolk quantized kernels)
mirageos unikernel deployment (raven-mirage package)
- no blas dep (tolk aot generates all compute)
- weight loading via network (mirage-http)
- verify ocaml 5 effects on mirageos runtime
- http server on mirageos network stack

docs/website:

landing page rewrite with benchmarks
deployment guide (aot, static binary, docker, mirageos, gpu)
end-to-end examples (serving, onnx+deploy workflow)