Skip to content

Latest commit

Β 

History

History
251 lines (221 loc) Β· 17.6 KB

File metadata and controls

251 lines (221 loc) Β· 17.6 KB

zCUDA Project Structure

Directory Layout

zcuda/
β”œβ”€β”€ build.zig                  # Build configuration (library, tests, examples)
β”œβ”€β”€ build.zig.zon              # Package manifest
β”œβ”€β”€ src/                       # Source code
β”‚   β”œβ”€β”€ cuda.zig               # Root module β€” re-exports all public types
β”‚   β”œβ”€β”€ types.zig              # Shared types (Dim3, LaunchConfig, DevicePtr)
β”‚   β”œβ”€β”€ driver/                # CUDA Driver API (always enabled)
β”‚   β”‚   β”œβ”€β”€ sys.zig            # Raw FFI (@cImport cuda.h)
β”‚   β”‚   β”œβ”€β”€ result.zig         # Error wrapping (CUresult β†’ DriverError)
β”‚   β”‚   β”œβ”€β”€ safe.zig           # CudaContext, CudaStream, CudaSlice, CudaEvent, CudaGraph
β”‚   β”‚   └── driver.zig         # Module entry point
β”‚   β”œβ”€β”€ nvrtc/                 # NVRTC β€” runtime compilation (always enabled)
β”‚   β”‚   β”œβ”€β”€ sys.zig            # Raw FFI (@cImport nvrtc.h)
β”‚   β”‚   β”œβ”€β”€ result.zig         # Error wrapping
β”‚   β”‚   β”œβ”€β”€ safe.zig           # compilePtx, compileCubin, CompileOptions
β”‚   β”‚   └── nvrtc.zig          # Module entry point
β”‚   β”œβ”€β”€ cublas/                # cuBLAS β€” BLAS L1/L2/L3 (-Dcublas=true)
β”‚   β”‚   β”œβ”€β”€ sys.zig            # Raw FFI
β”‚   β”‚   β”œβ”€β”€ result.zig         # Error wrapping
β”‚   β”‚   β”œβ”€β”€ safe.zig           # CublasContext, GEMM/AXPY/TRSM etc.
β”‚   β”‚   └── cublas.zig         # Module entry point
β”‚   β”œβ”€β”€ cublaslt/              # cuBLAS LT β€” lightweight GEMM (-Dcublaslt=true)
β”‚   β”‚   β”œβ”€β”€ sys.zig, result.zig, safe.zig, cublaslt.zig
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ curand/                # cuRAND β€” GPU random numbers (-Dcurand=true)
β”‚   β”‚   β”œβ”€β”€ sys.zig, result.zig, safe.zig, curand.zig
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ cudnn/                 # cuDNN β€” deep learning (-Dcudnn=true)
β”‚   β”‚   β”œβ”€β”€ sys.zig, result.zig, safe.zig, cudnn.zig
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ cusolver/              # cuSOLVER β€” direct solvers (-Dcusolver=true)
β”‚   β”‚   β”œβ”€β”€ sys.zig, result.zig, safe.zig, cusolver.zig
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ cusparse/              # cuSPARSE β€” sparse matrices (-Dcusparse=true)
β”‚   β”‚   β”œβ”€β”€ sys.zig, result.zig, safe.zig, cusparse.zig
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ cufft/                 # cuFFT β€” FFT (-Dcufft=true)
β”‚   β”‚   β”œβ”€β”€ sys.zig, result.zig, safe.zig, cufft.zig
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ nvtx/                  # NVTX β€” profiling annotations (-Dnvtx=true)
β”‚   β”‚   β”œβ”€β”€ sys.zig, safe.zig, nvtx.zig
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ runtime/               # CUDA Runtime API (internal)
β”‚       β”œβ”€β”€ sys.zig, result.zig, safe.zig, runtime.zig
β”‚       └── ...
β”‚   └── kernel/                # GPU Kernel DSL β€” device-side intrinsics & types
β”‚       β”œβ”€β”€ device.zig         # Module entry point (re-exports all sub-modules)
β”‚       β”œβ”€β”€ intrinsics.zig     # 98 inline fns: threadIdx, atomics, warp, math, cache hints
β”‚       β”œβ”€β”€ tensor_core.zig    # 56 inline fns: WMMA/MMA/wgmma/tcgen05/TMA/cluster
β”‚       β”œβ”€β”€ shared_mem.zig     # SharedArray (addrspace(3)), dynamicShared, cooperative utils
β”‚       β”œβ”€β”€ arch.zig           # SM version guards (requireSM, SmVersion enum)
β”‚       β”œβ”€β”€ types.zig          # DeviceSlice(T), DevicePtr(T), GridStrideIterator
β”‚       β”œβ”€β”€ shared_types.zig   # Host-device shared: Vec2/3/4, Int2/3, Matrix3x3/4x4, LaunchConfig
β”‚       β”œβ”€β”€ bridge_gen.zig     # Type-safe kernel bridge generator (Fn enum, load, getFunction)
β”‚       └── debug.zig          # assertf, ErrorFlag, printf, checkNaN, CycleTimer, __trap
β”œβ”€β”€ test/                      # Tests
β”‚   β”œβ”€β”€ helpers.zig            # Shared test helpers (initCuda, readPtxFile)
β”‚   β”œβ”€β”€ unit/                  # Unit tests (12 files + 10 kernel unit tests)
β”‚   β”‚   β”œβ”€β”€ driver_test.zig    # Context, stream, memory, events, graphs
β”‚   β”‚   β”œβ”€β”€ nvrtc_test.zig     # PTX/CUBIN compilation
β”‚   β”‚   β”œβ”€β”€ cublas_test.zig    # BLAS L1/L2/L3 operations
β”‚   β”‚   β”œβ”€β”€ cublaslt_test.zig  # Lightweight GEMM
β”‚   β”‚   β”œβ”€β”€ curand_test.zig    # Random number generation
β”‚   β”‚   β”œβ”€β”€ cudnn_test.zig     # Conv, activation, pooling, softmax
β”‚   β”‚   β”œβ”€β”€ cusolver_test.zig  # LU, SVD, Cholesky, eigensolve
β”‚   β”‚   β”œβ”€β”€ cusparse_test.zig  # SpMV, SpMM, SpGEMM
β”‚   β”‚   β”œβ”€β”€ cufft_test.zig     # FFT plans and execution
β”‚   β”‚   β”œβ”€β”€ nvtx_test.zig      # Profiling annotations
β”‚   β”‚   β”œβ”€β”€ runtime_test.zig   # CUDA runtime API
β”‚   β”‚   β”œβ”€β”€ types_test.zig     # Shared type tests
β”‚   β”‚   └── kernel/            # Kernel DSL unit tests (host-side, no GPU required)
β”‚   β”‚       β”œβ”€β”€ kernel_arch_test.zig           # SM version guards
β”‚   β”‚       β”œβ”€β”€ kernel_debug_test.zig          # ErrorFlag, CycleTimer declarations
β”‚   β”‚       β”œβ”€β”€ kernel_device_test.zig         # Device kernel compilation & launch
β”‚   β”‚       β”œβ”€β”€ kernel_device_types_test.zig   # DeviceSlice, DevicePtr, GridStrideIterator
β”‚   β”‚       β”œβ”€β”€ kernel_grid_stride_test.zig    # GridStrideIterator logic
β”‚   β”‚       β”œβ”€β”€ kernel_intrinsics_host_test.zig # Intrinsic type/signature validation
β”‚   β”‚       β”œβ”€β”€ kernel_shared_mem_test.zig     # SharedArray comptime API
β”‚   β”‚       β”œβ”€β”€ kernel_shared_types_test.zig   # Vec2/3/4, Matrix, LaunchConfig
β”‚   β”‚       β”œβ”€β”€ kernel_tensor_core_host_test.zig # Fragment types, SM guards
β”‚   β”‚       └── kernel_types_test.zig          # Device type layout tests
β”‚   └── integration/           # Integration tests (10 library + 7 kernel = 17 files)
β”‚       β”œβ”€β”€ gemm_roundtrip_test.zig    # cuBLAS GEMM round-trip
β”‚       β”œβ”€β”€ jit_kernel_test.zig        # NVRTC compile + launch
β”‚       β”œβ”€β”€ lu_solve_test.zig          # cuSOLVER LU solve pipeline
β”‚       β”œβ”€β”€ svd_reconstruct_test.zig   # SVD reconstruction
β”‚       β”œβ”€β”€ fft_roundtrip_test.zig     # FFT forward + inverse
β”‚       β”œβ”€β”€ curand_fft_test.zig        # cuRAND β†’ cuFFT pipeline
β”‚       β”œβ”€β”€ conv_pipeline_test.zig     # cuDNN conv pipeline
β”‚       β”œβ”€β”€ conv_relu_test.zig         # cuDNN conv + activation
β”‚       β”œβ”€β”€ sparse_pipeline_test.zig   # cuSPARSE pipeline
β”‚       β”œβ”€β”€ syrk_geam_test.zig         # cuBLAS SYRK + GEAM
β”‚       └── kernel/                    # Kernel DSL integration tests (GPU required)
β”‚           β”œβ”€β”€ kernel_device_test.zig         # Basic kernel launch correctness
β”‚           β”œβ”€β”€ kernel_event_timing_test.zig   # Event timing + multi-stream
β”‚           β”œβ”€β”€ kernel_intrinsics_gpu_test.zig # Math/atomic intrinsics on real GPU
β”‚           β”œβ”€β”€ kernel_memory_lifecycle_test.zig # Alloc/free/copy lifecycle
β”‚           β”œβ”€β”€ kernel_pipeline_test.zig       # Tiled matmul, softmax, dot product
β”‚           β”œβ”€β”€ kernel_reduction_test.zig      # Warp reduce, histogram, matmul
β”‚           β”œβ”€β”€ kernel_shared_mem_gpu_test.zig # Shared mem reduce/transpose
β”‚           └── kernel_softmax_test.zig        # Online softmax correctness
β”œβ”€β”€ examples/                  # Runnable examples
β”‚   β”œβ”€β”€ README.md              # Categorized example index (with links to per-category READMEs)
β”‚   β”œβ”€β”€ basics/                # 16 examples β€” contexts, streams, events, memory, kernels
β”‚   β”‚   β”œβ”€β”€ README.md          # Category index with API key snippets
β”‚   β”‚   β”œβ”€β”€ vector_add.zig, streams.zig, device_info.zig, event_timing.zig
β”‚   β”‚   β”œβ”€β”€ struct_kernel.zig, kernel_attributes.zig, constant_memory.zig
β”‚   β”‚   β”œβ”€β”€ peer_to_peer.zig, alloc_patterns.zig, async_memcpy.zig
β”‚   β”‚   β”œβ”€β”€ pinned_memory.zig, unified_memory.zig, context_lifecycle.zig
β”‚   β”‚   └── dtod_copy_chain.zig, memset_patterns.zig, multi_device_query.zig
β”‚   β”œβ”€β”€ kernel/                # 80 GPU kernel examples (11 categories, compiled to PTX)
β”‚   β”‚   β”œβ”€β”€ README.md          # Per-category kernel example index
β”‚   β”‚   β”œβ”€β”€ 0_Basic/           # 8 kernels β€” SAXPY, ReLU, dot product, grid stride
β”‚   β”‚   β”œβ”€β”€ 1_Reduction/       # 5 kernels β€” warp shuffle, prefix scan, multi-block
β”‚   β”‚   β”œβ”€β”€ 2_Matrix/          # 6 kernels β€” naive matmul, tiled matmul, transpose
β”‚   β”‚   β”œβ”€β”€ 3_Atomics/         # 5 kernels β€” atomic ops, histogram, warp-aggregated
β”‚   β”‚   β”œβ”€β”€ 4_SharedMemory/    # 3 kernels β€” static/dynamic smem, 1D stencil
β”‚   β”‚   β”œβ”€β”€ 5_Warp/            # 5 kernels β€” ballot, broadcast, match, scan
β”‚   β”‚   β”œβ”€β”€ 6_MathAndTypes/    # 9 kernels β€” FP16, complex, fast math, type conversion
β”‚   β”‚   β”œβ”€β”€ 7_Debug/           # 2 kernels β€” error checking, GPU printf
β”‚   β”‚   β”œβ”€β”€ 8_TensorCore/      # 11 kernels β€” WMMA (f16/bf16/int8/tf32), MMA PTX, FP8
β”‚   β”‚   β”œβ”€β”€ 9_Advanced/        # 8 kernels β€” async copy, cooperative groups, softmax
β”‚   β”‚   └── 10_Integration/    # 24 kernels β€” end-to-end pipelines and benchmarks
β”‚   β”œβ”€β”€ cublas/                # 19 examples β€” BLAS L1/L2/L3, batched, mixed-precision
β”‚   β”‚   β”œβ”€β”€ README.md          # Category index with row-major note and API key snippets
β”‚   β”‚   β”œβ”€β”€ gemm.zig, axpy.zig, dot.zig, scal.zig, nrm2_asum.zig
β”‚   β”‚   β”œβ”€β”€ gemv.zig, symv_syr.zig, trmv_trsv.zig, trsm.zig
β”‚   β”‚   β”œβ”€β”€ gemm_batched.zig, gemm_ex.zig, geam.zig, dgmm.zig
β”‚   β”‚   β”œβ”€β”€ swap_copy.zig, rot.zig, amax_amin.zig, symm.zig, syrk.zig
β”‚   β”‚   └── cosine_similarity.zig
β”‚   β”œβ”€β”€ cublaslt/              # 1 example β€” lightweight GEMM with heuristics
β”‚   β”‚   β”œβ”€β”€ README.md          # Category index
β”‚   β”‚   └── lt_sgemm.zig
β”‚   β”œβ”€β”€ cudnn/                 # 3 examples β€” convolution, activation, pooling
β”‚   β”‚   β”œβ”€β”€ README.md          # Category index
β”‚   β”‚   β”œβ”€β”€ conv2d.zig, activation.zig, pooling_softmax.zig
β”‚   β”‚   └── ...\nβ”‚   β”œβ”€β”€ cufft/                 # 4 examples β€” 1D/2D/3D FFT
β”‚   β”‚   β”œβ”€β”€ README.md          # Category index with transform type table
β”‚   β”‚   β”œβ”€β”€ fft_1d_c2c.zig, fft_1d_r2c.zig, fft_2d.zig, fft_3d.zig
β”‚   β”‚   └── ...\nβ”‚   β”œβ”€β”€ curand/                # 3 examples β€” RNG, distributions, Monte Carlo
β”‚   β”‚   β”œβ”€β”€ README.md          # Category index with generator type table
β”‚   β”‚   β”œβ”€β”€ generators.zig, distributions.zig, monte_carlo_pi.zig
β”‚   β”‚   └── ...\nβ”‚   β”œβ”€β”€ cusolver/              # 5 examples β€” LU, SVD, Cholesky, QR, eigensolve
β”‚   β”‚   β”œβ”€β”€ README.md          # Category index with devInfo note
β”‚   β”‚   β”œβ”€β”€ getrf.zig, gesvd.zig, potrf.zig, geqrf.zig, syevd.zig
β”‚   β”‚   └── ...\nβ”‚   β”œβ”€β”€ cusparse/              # 4 examples β€” CSR/COO SpMV, SpMM, SpGEMM
β”‚   β”‚   β”œβ”€β”€ README.md          # Category index with sparse format table
β”‚   β”‚   β”œβ”€β”€ spmv_csr.zig, spmv_coo.zig, spmm_csr.zig, spgemm.zig
β”‚   β”‚   └── ...\nβ”‚   β”œβ”€β”€ nvrtc/                 # 2 examples β€” JIT compilation
β”‚   β”‚   β”œβ”€β”€ README.md          # Category index with CompileOptions table
β”‚   β”‚   β”œβ”€β”€ jit_compile.zig, template_kernel.zig
β”‚   β”‚   └── ...\nβ”‚   └── nvtx/                  # 1 example β€” Nsight profiling
β”‚       β”œβ”€β”€ README.md          # Category index with Nsight usage
β”‚       └── profiling.zig
β”œβ”€β”€ docs/                      # Documentation
β”‚   β”œβ”€β”€ README.md              # Documentation index
β”‚   β”œβ”€β”€ API.md                 # Complete API reference (binding layer + Kernel DSL overview)
β”‚   β”œβ”€β”€ kernel/
β”‚   β”‚   β”œβ”€β”€ API.md             # Kernel DSL full API reference (intrinsics, smem, tensor cores)
β”‚   β”‚   └── MIGRATION.md       # CUDA C++ β†’ Zig migration guide
β”‚   β”œβ”€β”€ driver/README.md       # Driver module docs
β”‚   β”œβ”€β”€ nvrtc/README.md        # NVRTC module docs
β”‚   β”œβ”€β”€ cublas/README.md       # cuBLAS module docs
β”‚   β”œβ”€β”€ cublaslt/README.md     # cuBLAS LT module docs
β”‚   β”œβ”€β”€ curand/README.md       # cuRAND module docs
β”‚   β”œβ”€β”€ cudnn/README.md        # cuDNN module docs
β”‚   β”œβ”€β”€ cusolver/README.md     # cuSOLVER module docs
β”‚   β”œβ”€β”€ cusparse/README.md     # cuSPARSE module docs
β”‚   β”œβ”€β”€ cufft/README.md        # cuFFT module docs
β”‚   └── nvtx/README.md         # NVTX module docs

Module Overview

Driver (src/driver/ β€” 4 files)

Core CUDA types: CudaContext, CudaStream, CudaSlice(T), CudaView(T), CudaViewMut(T), CudaModule, CudaFunction, CudaEvent, CudaGraph. Device management, memory allocation, host ↔ device transfers, kernel launch, stream synchronization, event timing, graph capture, and unified memory.

NVRTC (src/nvrtc/ β€” 4 files)

Runtime compilation: compilePtx, compileCubin, compilePtxWithOptions, compileCubinWithOptions. CompileOptions for target architecture, optimization, register limits, and arbitrary flags.

cuBLAS (src/cublas/ β€” 4 files)

CublasContext wrapping cuBLAS handle. Level 1 (AXPY, SCAL, DOT, NRM2, AMAX, AMIN, SWAP, COPY, ROT, ROTG), Level 2 (GEMV, SYMV, TRMV, TRSV, SYR), Level 3 (SGEMM, DGEMM, strided batched, pointer-array batched, GemmEx, SYMM, TRSM, TRMM, SYRK, GEAM, DGMM, grouped batched GEMM). Single and double precision throughout.

cuBLAS LT (src/cublaslt/ β€” 4 files)

CublasLtContext for lightweight GEMM with fine-grained algorithm selection via getHeuristics, layout descriptors, and matmul/matmulWithAlgo. Supports mixed-precision with f16/bf16/f32/f64 data types and TF32 compute.

cuRAND (src/curand/ β€” 4 files)

CurandContext with 8 generator types (XORWOW, MRG32k3a, MTGP32, MT19937, Philox, Sobol, etc.). Distributions: uniform, normal, log-normal, Poisson. Single and double precision.

cuDNN (src/cudnn/ β€” 4 files)

CudnnContext for deep learning primitives. 2D and N-dimensional convolution (forward, backward data, backward filter), activation, pooling, softmax (with backward), batch normalization, dropout, element-wise tensor operations (opTensor, addTensor, scaleTensor, reduceTensor). Multiple algorithms (implicit GEMM, Winograd, FFT, etc.).

cuSOLVER (src/cusolver/ β€” 4 files)

CusolverDnContext for LU factorization and SVD. CusolverDnExt extends with Cholesky (potrf/potrs), QR (geqrf/orgqr), eigenvalue decomposition (syevd), and Jacobi SVD (gesvdj) with configurable tolerance and max sweeps. Single and double precision.

cuSPARSE (src/cusparse/ β€” 4 files)

CusparseContext for CSR and COO sparse matrix creation. SpMV (sparse Γ— dense vector), SpMM (sparse Γ— dense matrix), SpGEMM (sparse Γ— sparse) with work estimation / compute / copy phases. Algorithm selection for deterministic vs non-deterministic compute.

cuFFT (src/cufft/ β€” 4 files)

CufftPlan for 1D/2D/3D and batched FFT plans. Six execution modes: C2C, R2C, C2R for float and double (execC2C, execZ2Z, execR2C, execC2R, execD2Z, execZ2D).

NVTX (src/nvtx/ β€” 3 files)

rangePush/rangePop for named range markers, mark for point markers, ScopedRange for RAII-style ranges, Domain for per-module profiling isolation.

Shared Types (src/types.zig)

Dim3, LaunchConfig (with forNumElems auto-configuration), DevicePtr(T), and cuBLAS types (Operation, FillMode, DiagType, SideMode).

Device / Kernel DSL (src/kernel/ β€” 9 files)

Device-side module for writing GPU kernels in pure Zig, compiled to PTX via the NVPTX backend. Contains 175 inline functions across:

  • intrinsics.zig (98 fns): threadIdx, blockIdx, __syncthreads, atomics (atomicAdd–atomicDec), warp shuffle/vote/match/reduce, fast math, bit ops, cache hints, type conversions, __nanosleep, __byte_perm
  • tensor_core.zig (56 fns): WMMA (sm_70+), MMA PTX (sm_80+), FP8 MMA (sm_89+), wgmma/TMA/cluster (sm_90+), tcgen05 (sm_100+)
  • shared_mem.zig: SharedArray(T, N) via addrspace(.shared), dynamicShared(T), clearShared, loadToShared, storeFromShared, reduceSum
  • arch.zig: SmVersion enum (sm_52–sm_100+), requireSM comptime guard, atLeast, codename
  • types.zig: DeviceSlice(T) (get/set/len), DevicePtr(T) (load/store/atomicAdd), GridStrideIterator, globalThreadIdx, gridStride
  • shared_types.zig: Vec2/3/4, Int2/3, Matrix3x3/4x4, LaunchConfig (init1D/2D, forElementCount)
  • debug.zig: assertf, assertInBounds, safeGet, ErrorFlag (5 error codes + setError/checkNaN), printf, CycleTimer, __trap, __brkpt
  • bridge_gen.zig: init(Config) β€” comptime Fn enum, load, loadFromPtx, getFunction, getFunctionByName

β†’ Full API reference: docs/kernel/API.md

Build Targets

zig build                  # Build library (driver + nvrtc + cublas + curand)
zig build test             # All tests (unit + integration, 235 total)
zig build test-unit        # Unit tests only
zig build test-integration # Integration tests only
zig build run-<cat>-<name> # Run a host example (e.g. run-basics-vector_add)
zig build example-integration -Dgpu-arch=sm_86 -Dcublas=true -Dcufft=true  # Build all integration examples
zig build compile-kernels  # Compile all GPU kernels to PTX (default sm_80)
zig build compile-kernels -Dgpu-arch=sm_80  # Target Ampere
zig build compile-kernels -Dgpu-arch=sm_90  # Target Hopper
zig build example-kernel-<cat>-<name> -Dgpu-arch=sm_86  # Build one kernel example