zcuda/
├── build.zig  # Build configuration (library, tests, examples)
├── build.zig.zon  # Package manifest
├── src/  # Source code
│   ├── cuda.zig  # Root module: re-exports all public types
│   ├── types.zig  # Shared types (Dim3, LaunchConfig, DevicePtr)
│   ├── driver/  # CUDA Driver API (always enabled)
│   │   ├── sys.zig  # Raw FFI (@cImport cuda.h)
│   │   ├── result.zig  # Error wrapping (CUresult → DriverError)
│   │   ├── safe.zig  # CudaContext, CudaStream, CudaSlice, CudaEvent, CudaGraph
│   │   └── driver.zig  # Module entry point
│   ├── nvrtc/  # NVRTC: runtime compilation (always enabled)
│   │   ├── sys.zig  # Raw FFI (@cImport nvrtc.h)
│   │   ├── result.zig  # Error wrapping
│   │   ├── safe.zig  # compilePtx, compileCubin, CompileOptions
│   │   └── nvrtc.zig  # Module entry point
│   ├── cublas/  # cuBLAS: BLAS L1/L2/L3 (-Dcublas=true)
│   │   ├── sys.zig  # Raw FFI
│   │   ├── result.zig  # Error wrapping
│   │   ├── safe.zig  # CublasContext, GEMM/AXPY/TRSM etc.
│   │   └── cublas.zig  # Module entry point
│   ├── cublaslt/  # cuBLAS LT: lightweight GEMM (-Dcublaslt=true)
│   │   ├── sys.zig, result.zig, safe.zig, cublaslt.zig
│   │   └── ...
│   ├── curand/  # cuRAND: GPU random numbers (-Dcurand=true)
│   │   ├── sys.zig, result.zig, safe.zig, curand.zig
│   │   └── ...
│   ├── cudnn/  # cuDNN: deep learning (-Dcudnn=true)
│   │   ├── sys.zig, result.zig, safe.zig, cudnn.zig
│   │   └── ...
│   ├── cusolver/  # cuSOLVER: direct solvers (-Dcusolver=true)
│   │   ├── sys.zig, result.zig, safe.zig, cusolver.zig
│   │   └── ...
│   ├── cusparse/  # cuSPARSE: sparse matrices (-Dcusparse=true)
│   │   ├── sys.zig, result.zig, safe.zig, cusparse.zig
│   │   └── ...
│   ├── cufft/  # cuFFT: FFT (-Dcufft=true)
│   │   ├── sys.zig, result.zig, safe.zig, cufft.zig
│   │   └── ...
│   ├── nvtx/  # NVTX: profiling annotations (-Dnvtx=true)
│   │   ├── sys.zig, safe.zig, nvtx.zig
│   │   └── ...
│   ├── runtime/  # CUDA Runtime API (internal)
│   │   ├── sys.zig, result.zig, safe.zig, runtime.zig
│   │   └── ...
│   └── kernel/  # GPU Kernel DSL: device-side intrinsics & types
│       ├── device.zig  # Module entry point (re-exports all sub-modules)
│       ├── intrinsics.zig  # 98 inline fns: threadIdx, atomics, warp, math, cache hints
│       ├── tensor_core.zig  # 56 inline fns: WMMA/MMA/wgmma/tcgen05/TMA/cluster
│       ├── shared_mem.zig  # SharedArray (addrspace(3)), dynamicShared, cooperative utils
│       ├── arch.zig  # SM version guards (requireSM, SmVersion enum)
│       ├── types.zig  # DeviceSlice(T), DevicePtr(T), GridStrideIterator
│       ├── shared_types.zig  # Host-device shared: Vec2/3/4, Int2/3, Matrix3x3/4x4, LaunchConfig
│       ├── bridge_gen.zig  # Type-safe kernel bridge generator (Fn enum, load, getFunction)
│       └── debug.zig  # assertf, ErrorFlag, printf, checkNaN, CycleTimer, __trap
├── test/  # Tests
│   ├── helpers.zig  # Shared test helpers (initCuda, readPtxFile)
│   ├── unit/  # Unit tests (12 files + 10 kernel unit tests)
│   │   ├── driver_test.zig  # Context, stream, memory, events, graphs
│   │   ├── nvrtc_test.zig  # PTX/CUBIN compilation
│   │   ├── cublas_test.zig  # BLAS L1/L2/L3 operations
│   │   ├── cublaslt_test.zig  # Lightweight GEMM
│   │   ├── curand_test.zig  # Random number generation
│   │   ├── cudnn_test.zig  # Conv, activation, pooling, softmax
│   │   ├── cusolver_test.zig  # LU, SVD, Cholesky, eigensolve
│   │   ├── cusparse_test.zig  # SpMV, SpMM, SpGEMM
│   │   ├── cufft_test.zig  # FFT plans and execution
│   │   ├── nvtx_test.zig  # Profiling annotations
│   │   ├── runtime_test.zig  # CUDA runtime API
│   │   ├── types_test.zig  # Shared type tests
│   │   └── kernel/  # Kernel DSL unit tests (host-side, no GPU required)
│   │       ├── kernel_arch_test.zig  # SM version guards
│   │       ├── kernel_debug_test.zig  # ErrorFlag, CycleTimer declarations
│   │       ├── kernel_device_test.zig  # Device kernel compilation & launch
│   │       ├── kernel_device_types_test.zig  # DeviceSlice, DevicePtr, GridStrideIterator
│   │       ├── kernel_grid_stride_test.zig  # GridStrideIterator logic
│   │       ├── kernel_intrinsics_host_test.zig  # Intrinsic type/signature validation
│   │       ├── kernel_shared_mem_test.zig  # SharedArray comptime API
│   │       ├── kernel_shared_types_test.zig  # Vec2/3/4, Matrix, LaunchConfig
│   │       ├── kernel_tensor_core_host_test.zig  # Fragment types, SM guards
│   │       └── kernel_types_test.zig  # Device type layout tests
│   └── integration/  # Integration tests (10 library + 8 kernel = 18 files)
│       ├── gemm_roundtrip_test.zig  # cuBLAS GEMM round-trip
│       ├── jit_kernel_test.zig  # NVRTC compile + launch
│       ├── lu_solve_test.zig  # cuSOLVER LU solve pipeline
│       ├── svd_reconstruct_test.zig  # SVD reconstruction
│       ├── fft_roundtrip_test.zig  # FFT forward + inverse
│       ├── curand_fft_test.zig  # cuRAND → cuFFT pipeline
│       ├── conv_pipeline_test.zig  # cuDNN conv pipeline
│       ├── conv_relu_test.zig  # cuDNN conv + activation
│       ├── sparse_pipeline_test.zig  # cuSPARSE pipeline
│       ├── syrk_geam_test.zig  # cuBLAS SYRK + GEAM
│       └── kernel/  # Kernel DSL integration tests (GPU required)
│           ├── kernel_device_test.zig  # Basic kernel launch correctness
│           ├── kernel_event_timing_test.zig  # Event timing + multi-stream
│           ├── kernel_intrinsics_gpu_test.zig  # Math/atomic intrinsics on real GPU
│           ├── kernel_memory_lifecycle_test.zig  # Alloc/free/copy lifecycle
│           ├── kernel_pipeline_test.zig  # Tiled matmul, softmax, dot product
│           ├── kernel_reduction_test.zig  # Warp reduce, histogram, matmul
│           ├── kernel_shared_mem_gpu_test.zig  # Shared mem reduce/transpose
│           └── kernel_softmax_test.zig  # Online softmax correctness
├── examples/  # Runnable examples
│   ├── README.md  # Categorized example index (with links to per-category READMEs)
│   ├── basics/  # 16 examples: contexts, streams, events, memory, kernels
│   │   ├── README.md  # Category index with API key snippets
│   │   ├── vector_add.zig, streams.zig, device_info.zig, event_timing.zig
│   │   ├── struct_kernel.zig, kernel_attributes.zig, constant_memory.zig
│   │   ├── peer_to_peer.zig, alloc_patterns.zig, async_memcpy.zig
│   │   ├── pinned_memory.zig, unified_memory.zig, context_lifecycle.zig
│   │   └── dtod_copy_chain.zig, memset_patterns.zig, multi_device_query.zig
│   ├── kernel/  # 80 GPU kernel examples (11 categories, compiled to PTX)
│   │   ├── README.md  # Per-category kernel example index
│   │   ├── 0_Basic/  # 8 kernels: SAXPY, ReLU, dot product, grid stride
│   │   ├── 1_Reduction/  # 5 kernels: warp shuffle, prefix scan, multi-block
│   │   ├── 2_Matrix/  # 6 kernels: naive matmul, tiled matmul, transpose
│   │   ├── 3_Atomics/  # 5 kernels: atomic ops, histogram, warp-aggregated
│   │   ├── 4_SharedMemory/  # 3 kernels: static/dynamic smem, 1D stencil
│   │   ├── 5_Warp/  # 5 kernels: ballot, broadcast, match, scan
│   │   ├── 6_MathAndTypes/  # 9 kernels: FP16, complex, fast math, type conversion
│   │   ├── 7_Debug/  # 2 kernels: error checking, GPU printf
│   │   ├── 8_TensorCore/  # 11 kernels: WMMA (f16/bf16/int8/tf32), MMA PTX, FP8
│   │   ├── 9_Advanced/  # 8 kernels: async copy, cooperative groups, softmax
│   │   └── 10_Integration/  # 24 kernels: end-to-end pipelines and benchmarks
│   ├── cublas/  # 19 examples: BLAS L1/L2/L3, batched, mixed-precision
│   │   ├── README.md  # Category index with row-major note and API key snippets
│   │   ├── gemm.zig, axpy.zig, dot.zig, scal.zig, nrm2_asum.zig
│   │   ├── gemv.zig, symv_syr.zig, trmv_trsv.zig, trsm.zig
│   │   ├── gemm_batched.zig, gemm_ex.zig, geam.zig, dgmm.zig
│   │   ├── swap_copy.zig, rot.zig, amax_amin.zig, symm.zig, syrk.zig
│   │   └── cosine_similarity.zig
│   ├── cublaslt/  # 1 example: lightweight GEMM with heuristics
│   │   ├── README.md  # Category index
│   │   └── lt_sgemm.zig
│   ├── cudnn/  # 3 examples: convolution, activation, pooling
│   │   ├── README.md  # Category index
│   │   ├── conv2d.zig, activation.zig, pooling_softmax.zig
│   │   └── ...
│   ├── cufft/  # 4 examples: 1D/2D/3D FFT
│   │   ├── README.md  # Category index with transform type table
│   │   ├── fft_1d_c2c.zig, fft_1d_r2c.zig, fft_2d.zig, fft_3d.zig
│   │   └── ...
│   ├── curand/  # 3 examples: RNG, distributions, Monte Carlo
│   │   ├── README.md  # Category index with generator type table
│   │   ├── generators.zig, distributions.zig, monte_carlo_pi.zig
│   │   └── ...
│   ├── cusolver/  # 5 examples: LU, SVD, Cholesky, QR, eigensolve
│   │   ├── README.md  # Category index with devInfo note
│   │   ├── getrf.zig, gesvd.zig, potrf.zig, geqrf.zig, syevd.zig
│   │   └── ...
│   ├── cusparse/  # 4 examples: CSR/COO SpMV, SpMM, SpGEMM
│   │   ├── README.md  # Category index with sparse format table
│   │   ├── spmv_csr.zig, spmv_coo.zig, spmm_csr.zig, spgemm.zig
│   │   └── ...
│   ├── nvrtc/  # 2 examples: JIT compilation
│   │   ├── README.md  # Category index with CompileOptions table
│   │   ├── jit_compile.zig, template_kernel.zig
│   │   └── ...
│   └── nvtx/  # 1 example: Nsight profiling
│       ├── README.md  # Category index with Nsight usage
│       └── profiling.zig
├── docs/  # Documentation
│   ├── README.md  # Documentation index
│   ├── API.md  # Complete API reference (binding layer + Kernel DSL overview)
│   ├── kernel/
│   │   ├── API.md  # Kernel DSL full API reference (intrinsics, smem, tensor cores)
│   │   └── MIGRATION.md  # CUDA C++ → Zig migration guide
│   ├── driver/README.md  # Driver module docs
│   ├── nvrtc/README.md  # NVRTC module docs
│   ├── cublas/README.md  # cuBLAS module docs
│   ├── cublaslt/README.md  # cuBLAS LT module docs
│   ├── curand/README.md  # cuRAND module docs
│   ├── cudnn/README.md  # cuDNN module docs
│   ├── cusolver/README.md  # cuSOLVER module docs
│   ├── cusparse/README.md  # cuSPARSE module docs
│   ├── cufft/README.md  # cuFFT module docs
│   └── nvtx/README.md  # NVTX module docs
Driver API: core CUDA types CudaContext, CudaStream, CudaSlice(T), CudaView(T), CudaViewMut(T), CudaModule, CudaFunction, CudaEvent, and CudaGraph. Covers device management, memory allocation, host ↔ device transfers, kernel launch, stream synchronization, event timing, graph capture, and unified memory.
NVRTC: runtime compilation through compilePtx, compileCubin, compilePtxWithOptions, and compileCubinWithOptions. CompileOptions sets the target architecture, optimization, register limits, and arbitrary flags.
cuBLAS: CublasContext wraps a cuBLAS handle. Level 1 (AXPY, SCAL, DOT, NRM2, AMAX, AMIN, SWAP, COPY, ROT, ROTG), Level 2 (GEMV, SYMV, TRMV, TRSV, SYR), Level 3 (SGEMM, DGEMM, strided batched, pointer-array batched, GemmEx, SYMM, TRSM, TRMM, SYRK, GEAM, DGMM, grouped batched GEMM). Single and double precision throughout.
cuBLAS LT: CublasLtContext provides lightweight GEMM with fine-grained algorithm selection via getHeuristics, layout descriptors, and matmul/matmulWithAlgo. Supports mixed precision with f16/bf16/f32/f64 data types and TF32 compute.
cuRAND: CurandContext with 8 generator types (XORWOW, MRG32k3a, MTGP32, MT19937, Philox, Sobol, etc.). Distributions: uniform, normal, log-normal, Poisson. Single and double precision.
cuDNN: CudnnContext for deep learning primitives: 2D and N-dimensional convolution (forward, backward data, backward filter), activation, pooling, softmax (with backward), batch normalization, dropout, and element-wise tensor operations (opTensor, addTensor, scaleTensor, reduceTensor). Multiple convolution algorithms are available (implicit GEMM, Winograd, FFT, etc.).
cuSOLVER: CusolverDnContext for LU factorization and SVD. CusolverDnExt adds Cholesky (potrf/potrs), QR (geqrf/orgqr), eigenvalue decomposition (syevd), and Jacobi SVD (gesvdj) with configurable tolerance and max sweeps. Single and double precision.
cuSPARSE: CusparseContext for CSR and COO sparse matrix creation. SpMV (sparse × dense vector), SpMM (sparse × dense matrix), and SpGEMM (sparse × sparse) with work estimation / compute / copy phases. Algorithm selection for deterministic vs non-deterministic compute.
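To make the CSR layout and the SpMV operation concrete, here is a minimal CPU reference in plain Zig. It does not call cuSPARSE or zcuda: the `Csr` struct and `spmvReference` are hypothetical names introduced for illustration, and a check like this is only one possible way to validate GPU results against a host computation.

```zig
const std = @import("std");

/// Hypothetical host-side CSR container (not a zcuda type).
const Csr = struct {
    row_offsets: []const usize, // length = rows + 1
    col_indices: []const usize, // length = nnz
    values: []const f32, // length = nnz
};

/// Computes y = A * x on the CPU; y.len is taken as the row count.
fn spmvReference(a: Csr, x: []const f32, y: []f32) void {
    var row: usize = 0;
    while (row < y.len) : (row += 1) {
        var sum: f32 = 0;
        // Non-zeros of this row live in [row_offsets[row], row_offsets[row + 1]).
        var k = a.row_offsets[row];
        while (k < a.row_offsets[row + 1]) : (k += 1) {
            sum += a.values[k] * x[a.col_indices[k]];
        }
        y[row] = sum;
    }
}

test "CSR SpMV reference" {
    // A = [[1, 0, 2],
    //      [0, 3, 0]]
    const a = Csr{
        .row_offsets = &[_]usize{ 0, 2, 3 },
        .col_indices = &[_]usize{ 0, 2, 1 },
        .values = &[_]f32{ 1, 2, 3 },
    };
    const x = [_]f32{ 1, 2, 3 };
    var y = [_]f32{ 0, 0 };
    spmvReference(a, &x, &y);
    try std.testing.expectEqual(@as(f32, 7), y[0]); // 1*1 + 2*3
    try std.testing.expectEqual(@as(f32, 6), y[1]); // 3*2
}
```

The three arrays are the same ones a CSR descriptor typically takes on the device side, so the host copy can double as test input.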
cuFFT: CufftPlan for 1D/2D/3D and batched FFT plans. Six execution modes cover C2C, R2C, and C2R in float and double (execC2C, execZ2Z, execR2C, execC2R, execD2Z, execZ2D).
NVTX: rangePush/rangePop for named range markers, mark for point markers, ScopedRange for RAII-style ranges, Domain for per-module profiling isolation.
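Zig has no destructors, so an RAII-style range usually means pairing an opening call with `defer`. The sketch below models that pattern with stand-in functions that only count nesting depth. The `init`/`deinit` method names and the stubbed `rangePush`/`rangePop` bodies are assumptions for illustration, not the actual zcuda NVTX signatures.

```zig
const std = @import("std");

// Stand-ins for NVTX push/pop; here they only track nesting depth.
var depth: u32 = 0;

fn rangePush(name: []const u8) void {
    _ = name;
    depth += 1;
}

fn rangePop() void {
    depth -= 1;
}

/// Hypothetical scoped wrapper: init opens the range, deinit closes it.
const ScopedRange = struct {
    fn init(name: []const u8) ScopedRange {
        rangePush(name);
        return ScopedRange{};
    }

    fn deinit(self: ScopedRange) void {
        _ = self;
        rangePop();
    }
};

fn work() u32 {
    const range = ScopedRange.init("work");
    defer range.deinit(); // runs on every exit path from this scope
    return depth; // the range is still open while the result is computed
}

test "range closes when the scope exits" {
    try std.testing.expectEqual(@as(u32, 1), work());
    try std.testing.expectEqual(@as(u32, 0), depth);
}
```

Because `defer` also runs on early returns and error paths, the push and pop stay balanced without manual bookkeeping.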
Shared types: Dim3, LaunchConfig (with forNumElems auto-configuration), DevicePtr(T), and cuBLAS types (Operation, FillMode, DiagType, SideMode).
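Auto-configuring a launch from an element count usually comes down to a ceiling division of the element count by the block size. The following sketch shows that arithmetic with stand-in types. The field names, the `configForElems` helper, and the block size of 256 used in the test are assumptions, not the layout or defaults of the library's own Dim3 and LaunchConfig.

```zig
const std = @import("std");

// Stand-in types; field names are assumptions, not zcuda's definitions.
const Dim3 = struct { x: u32, y: u32 = 1, z: u32 = 1 };

const LaunchConfig = struct {
    grid_dim: Dim3,
    block_dim: Dim3,
    shared_mem_bytes: u32 = 0,
};

/// Picks a 1D launch shape that covers n elements with a fixed block size.
fn configForElems(n: u32, threads_per_block: u32) LaunchConfig {
    // Ceiling division without overflow: add one block for a partial tail.
    const full = n / threads_per_block;
    const blocks = if (n % threads_per_block == 0) full else full + 1;
    return LaunchConfig{
        .grid_dim = Dim3{ .x = blocks },
        .block_dim = Dim3{ .x = threads_per_block },
    };
}

test "1000 elements with 256 threads per block need 4 blocks" {
    const cfg = configForElems(1000, 256);
    try std.testing.expectEqual(@as(u32, 4), cfg.grid_dim.x);
    try std.testing.expectEqual(@as(u32, 256), cfg.block_dim.x);
}
```

With this shape the last block has idle threads (4 × 256 = 1024 > 1000), so the kernel still needs a bounds check on its global index.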
Kernel DSL: device-side module for writing GPU kernels in pure Zig, compiled to PTX via the NVPTX backend. Contains 175 inline functions across:
- intrinsics.zig (98 fns): threadIdx, blockIdx, __syncthreads, atomics (atomicAdd through atomicDec), warp shuffle/vote/match/reduce, fast math, bit ops, cache hints, type conversions, __nanosleep, __byte_perm
- tensor_core.zig (56 fns): WMMA (sm_70+), MMA PTX (sm_80+), FP8 MMA (sm_89+), wgmma/TMA/cluster (sm_90+), tcgen05 (sm_100+)
- shared_mem.zig: SharedArray(T, N) via addrspace(.shared), dynamicShared(T), clearShared, loadToShared, storeFromShared, reduceSum
- arch.zig: SmVersion enum (sm_52 to sm_100+), requireSM comptime guard, atLeast, codename
- types.zig: DeviceSlice(T) (get/set/len), DevicePtr(T) (load/store/atomicAdd), GridStrideIterator, globalThreadIdx, gridStride
- shared_types.zig: Vec2/3/4, Int2/3, Matrix3x3/4x4, LaunchConfig (init1D/2D, forElementCount)
- debug.zig: assertf, assertInBounds, safeGet, ErrorFlag (5 error codes + setError/checkNaN), printf, CycleTimer, __trap, __brkpt
- bridge_gen.zig: init(Config) → comptime Fn enum, load, loadFromPtx, getFunction, getFunctionByName
→ Full API reference: docs/kernel/API.md
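The grid-stride pattern behind names like GridStrideIterator and gridStride can be modeled on the host: each thread starts at its global index and advances by the total thread count until it runs past the array. The sketch below is a plain-Zig model of that logic. The `GridStride` struct and its fields are hypothetical and are not the DSL's device-side implementation.

```zig
const std = @import("std");

/// Host-side model of the indices one thread visits in a grid-stride loop.
const GridStride = struct {
    index: usize, // next element this thread will visit
    stride: usize, // total number of threads in the grid
    len: usize, // number of elements in the array

    fn init(global_thread_idx: usize, total_threads: usize, len: usize) GridStride {
        return GridStride{
            .index = global_thread_idx,
            .stride = total_threads,
            .len = len,
        };
    }

    /// Returns the next index for this thread, or null once past the end.
    fn next(self: *GridStride) ?usize {
        if (self.index >= self.len) return null;
        const current = self.index;
        self.index += self.stride;
        return current;
    }
};

test "thread 1 of 4 visits 1, 5, 9 in a 10-element array" {
    var it = GridStride.init(1, 4, 10);
    try std.testing.expectEqual(@as(usize, 1), it.next().?);
    try std.testing.expectEqual(@as(usize, 5), it.next().?);
    try std.testing.expectEqual(@as(usize, 9), it.next().?);
    try std.testing.expect(it.next() == null);
}
```

Since every thread covers multiple elements, the same kernel stays correct for any grid size, which is why the pattern pairs well with a fixed launch configuration.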
zig build # Build library (driver + nvrtc + cublas + curand)
zig build test # All tests (unit + integration, 235 total)
zig build test-unit # Unit tests only
zig build test-integration # Integration tests only
zig build run-<cat>-<name> # Run a host example (e.g. run-basics-vector_add)
zig build example-integration -Dgpu-arch=sm_86 -Dcublas=true -Dcufft=true # Build all integration examples
zig build compile-kernels # Compile all GPU kernels to PTX (default sm_80)
zig build compile-kernels -Dgpu-arch=sm_80 # Target Ampere
zig build compile-kernels -Dgpu-arch=sm_90 # Target Hopper
zig build example-kernel-<cat>-<name> -Dgpu-arch=sm_86 # Build one kernel example