⚠️ Historical audit report: mark findings that lack current code evidence with <!-- TODO: add source file evidence -->. Remove outdated findings.

Audit Report — GPU Module

Last Audit: 2026-04-19 | Auditor: Copilot | Status: ✅ Pass (Beta)

Summary

| Metric | Result |
| --- | --- |
| Build System Registration | ✅ Verified |
| Source Files | 30 (.cpp in src/gpu/) |
| Test Coverage | ✅ All Phase 1–4 items complete; GPU paths tested with hardware skip |
| Open TODOs | 30 files contain TODOs (WASM runtime injection, cuVS production wiring) |
| Open Stubs | 2 (cuVS/RAFT production wiring pending; WASM runtime injection pending) |
| Security Issues | None |

Build System

  • All GPU source files registered in cmake/CMakeLists.txt and cmake/ModularBuild.cmake.
  • CUDA paths guarded by THEMIS_ENABLE_CUDA.
  • ROCm/HIP paths guarded by THEMIS_ENABLE_HIP.
  • cuVS/RAFT integration guarded by THEMIS_ENABLE_CUVS.
  • NCCL/RCCL guarded by THEMIS_ENABLE_NCCL.
  • WASM kernel sandbox guarded by THEMIS_ENABLE_WASM.
  • Profiling integration guarded by THEMIS_ENABLE_NVTX and THEMIS_ENABLE_ROCTX.
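The guards listed above are compile-time switches. As a minimal sketch of how such guards are commonly consumed in C++, the helper below reports which backends were compiled in. Only the `THEMIS_ENABLE_*` macro names come from this report; the function `compiledBackends` and the backend labels are hypothetical, and the sketch assumes the build defines each macro only when the feature is enabled.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: lists the GPU backends compiled into this binary.
// Only the THEMIS_ENABLE_* macro names come from the report; the function
// name and the string labels are illustrative assumptions.
inline std::vector<std::string> compiledBackends() {
    std::vector<std::string> backends;
#ifdef THEMIS_ENABLE_CUDA
    backends.push_back("cuda");
#endif
#ifdef THEMIS_ENABLE_HIP
    backends.push_back("hip");
#endif
    // The CPU path is always compiled so callers have a fallback target.
    backends.push_back("cpu");
    return backends;
}
```

In a build with neither flag defined, the returned list contains only the CPU entry, which mirrors the CPU-fallback behavior described elsewhere in this report.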

Source Files Audited

| File | Purpose |
| --- | --- |
| admin_api.cpp | JSON admin API with tenant breakdown and dry-run |
| alerts.cpp | Threshold-based alert manager with callbacks |
| audit_log.cpp | Ring-buffer structured audit event log |
| cluster_coordinator.cpp | Multi-node GPU cluster coordination |
| cluster_topology.cpp | NVLink/InfiniBand topology awareness |
| config.cpp | GPU config validation with dry-run simulation |
| device_discovery.cpp | CUDA/ROCm device enumeration with CPU sentinel |
| feature_flags.cpp | Runtime GPU feature enable/disable |
| gpu_memory_manager_edition.cpp | Edition-aware VRAM with tenant quotas |
| gpu_module.cpp | Integration facade: policy → CB → alloc → launch |
| graph_cache.cpp | CUDA graph capture (LRU 32-entry, QueryShape key) |
| kernel_validator.cpp | FNV-1a checksum kernel whitelist |
| launcher.cpp | Named async GPU streams with CPU fallback budget |
| load_balancer.cpp | Multi-GPU: ROUND_ROBIN, LEAST_LOADED, FIRST_HEALTHY |
| memory_pool.cpp | Slab allocator with defragmentation |
| metrics.cpp | Prometheus-compatible counter/gauge metrics |
| mig_manager.cpp | NVIDIA MIG (Multi-Instance GPU) partition management |
| p2p_transfer.cpp | Peer-to-peer GPU memory transfer management |
| policy.cpp | GPU resource policy enforcement |
| profiler.cpp | NVTX/rocTX profiler marker integration |
| query_accelerator.cpp | GPU-accelerated scan/sort/aggregate/join/ANN |
| rocm_backend.cpp | AMD ROCm/HIP backend implementation |
| safe_fail.cpp | Safe-fail handler for GPU errors with CPU fallback |
| stream_manager.cpp | CUDA/HIP stream lifecycle management |
| tensor_buffer.cpp | GPU tensor buffer with pinned memory support |
| time_slice_scheduler.cpp | Time-slice scheduling for multi-tenant GPU isolation |
| training_loop.cpp | GPU training loop for on-device model fine-tuning |
| unified_memory.cpp | CUDA unified memory management for CPU-GPU transfers |
| vulkan_backend.cpp | Vulkan compute backend for cross-platform GPU support |
| wasm_kernel_sandbox.cpp | WASM sandbox for untrusted GPU kernel isolation |
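The three strategy names listed for load_balancer.cpp can be illustrated with a short sketch. Only the names ROUND_ROBIN, LEAST_LOADED, and FIRST_HEALTHY come from this report; the types `DeviceState` and `LoadBalancerSketch`, the `activeJobs` load measure, and the rule that unhealthy devices are always skipped are assumptions made for illustration. The sketch needs C++17 for `std::optional`.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

enum class Strategy { ROUND_ROBIN, LEAST_LOADED, FIRST_HEALTHY };

// Hypothetical per-device view; the real module's fields are not shown in the report.
struct DeviceState {
    bool healthy;
    std::size_t activeJobs;
};

class LoadBalancerSketch {
public:
    explicit LoadBalancerSketch(Strategy s) : strategy_(s) {}

    // Returns the index of the chosen device, or nullopt if none is healthy.
    std::optional<std::size_t> pick(const std::vector<DeviceState>& devices) {
        const std::size_t n = devices.size();
        if (n == 0) return std::nullopt;
        switch (strategy_) {
        case Strategy::ROUND_ROBIN:
            // Visit at most n slots starting at the cursor, skipping unhealthy devices.
            for (std::size_t step = 0; step < n; ++step) {
                const std::size_t idx = (cursor_ + step) % n;
                if (devices[idx].healthy) {
                    cursor_ = (idx + 1) % n;
                    return idx;
                }
            }
            return std::nullopt;
        case Strategy::LEAST_LOADED: {
            // Pick the healthy device with the fewest active jobs; ties go to the lowest index.
            std::optional<std::size_t> best;
            for (std::size_t i = 0; i < n; ++i) {
                if (!devices[i].healthy) continue;
                if (!best || devices[i].activeJobs < devices[*best].activeJobs) best = i;
            }
            return best;
        }
        case Strategy::FIRST_HEALTHY:
            for (std::size_t i = 0; i < n; ++i) {
                if (devices[i].healthy) return i;
            }
            return std::nullopt;
        }
        return std::nullopt;
    }

private:
    Strategy strategy_;
    std::size_t cursor_ = 0;  // next slot to try for ROUND_ROBIN
};
```

Returning an empty optional when no device is healthy leaves the fallback decision to the caller, which fits the CPU-fallback design described for the launcher and safe-fail handler.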

Test Coverage

  • tests/test_gpu_graph_cache.cpp — CUDA graph capture, LRU eviction, QueryShape keying
  • tests/test_gpu_query_accelerator.cpp — scan, sort, aggregate, hashJoin, annSearch
  • tests/test_gpu_stream_manager.cpp — stream creation, CPU fallback
  • tests/test_gpu_memory_manager.cpp — tenant quota enforcement, fragmentation tracking
  • tests/test_gpu_kernel_validator.cpp — FNV-1a whitelist, unknown kernel rejection
  • GPU hardware tests: skipped gracefully when no GPU is present in CI
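The LRU eviction behavior that the graph cache test exercises can be sketched on the host without any CUDA dependency. The default capacity of 32 comes from this report; the `ShapeKey` string, the integer `GraphHandle`, and the class name are hypothetical stand-ins, since the real QueryShape key and graph handle types are not shown here. The sketch needs C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins for the real QueryShape key and captured graph handle.
using ShapeKey = std::string;
using GraphHandle = int;

class LruGraphCacheSketch {
public:
    explicit LruGraphCacheSketch(std::size_t capacity = 32) : capacity_(capacity) {
        assert(capacity_ > 0);
    }

    // Looks up a graph and marks it as most recently used on a hit.
    std::optional<GraphHandle> get(const ShapeKey& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return std::nullopt;
        order_.splice(order_.begin(), order_, it->second);
        return it->second->second;
    }

    // Inserts or updates an entry, evicting the least recently used one when full.
    void put(const ShapeKey& key, GraphHandle graph) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = graph;
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (order_.size() == capacity_) {
            index_.erase(order_.back().first);  // back of the list = least recently used
            order_.pop_back();
        }
        order_.emplace_front(key, graph);
        index_[key] = order_.begin();
    }

    std::size_t size() const { return order_.size(); }

private:
    using Entry = std::pair<ShapeKey, GraphHandle>;
    std::size_t capacity_;
    std::list<Entry> order_;  // front = most recently used
    std::unordered_map<ShapeKey, std::list<Entry>::iterator> index_;
};
```

A test for eviction can use a small capacity such as 2: insert two keys, read the first, insert a third, and check that the second key is the one that disappears.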

Findings

Resolved

  • Kernel whitelist enforcement: KernelValidator::validate() is called on every kernel launch path; unknown kernels are rejected before memory allocation.
  • Cross-tenant VRAM isolation: GPUMemoryManagerEdition enforces per-tenant quotas.
  • Circuit-breaker safe-fail: GPU errors trigger automatic CPU fallback; the circuit breaker state is exposed in the admin API.
  • Audit log for GPU operations: the ring buffer captures all allocation, deallocation, fallback, and admin events.
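The whitelist finding can be illustrated with a minimal sketch of a checksum check that runs before any allocation. The report names FNV-1a but not the hash width, so the 64-bit variant is an assumption here, as are the names `fnv1a64`, `KernelWhitelistSketch`, and `launchIfAllowed`; the actual KernelValidator interface is not shown in this report.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>

// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
inline std::uint64_t fnv1a64(const std::string& bytes) {
    std::uint64_t hash = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
    for (unsigned char b : bytes) {
        hash ^= b;
        hash *= 0x100000001b3ULL;  // FNV 64-bit prime
    }
    return hash;
}

// Hypothetical whitelist keyed by the checksum of the kernel binary.
class KernelWhitelistSketch {
public:
    void allow(const std::string& kernelBinary) { allowed_.insert(fnv1a64(kernelBinary)); }

    bool validate(const std::string& kernelBinary) const {
        return allowed_.count(fnv1a64(kernelBinary)) != 0;
    }

private:
    std::unordered_set<std::uint64_t> allowed_;
};

// Hypothetical launch path: validation happens first, so a rejected kernel
// never reaches the allocation step. `allocated` stands in for device allocation.
inline bool launchIfAllowed(const KernelWhitelistSketch& whitelist,
                            const std::string& kernelBinary,
                            bool& allocated) {
    if (!whitelist.validate(kernelBinary)) return false;
    allocated = true;
    return true;
}
```

Placing the check ahead of allocation means an unknown kernel costs one hash and one set lookup and leaves no device state to clean up.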

Open

  • cuVS/RAFT production wiring: ANN search currently falls back to CPU brute force; the cuVS/RAFT production integration is still a stub (Issue #2381).
  • WASM kernel sandbox runtime: the WASM isolation infrastructure is complete but requires a concrete WasmRuntime to be injected (Issue #1572, same as base module).
  • GPU memory zeroing overhead: zeroing on deallocation is enabled for security; operators can disable it in performance mode, trading that protection for lower overhead.
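The zeroing tradeoff in the last item can be sketched with host memory standing in for VRAM. The `Block` alias, the `releaseBlock` function, and the `zeroOnFree` flag are hypothetical; the report does not show how performance mode is actually configured, and a real implementation would presumably issue a device-side memset instead of `std::fill`.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Host-memory stand-in for a VRAM block (assumption for illustration only).
using Block = std::vector<std::uint8_t>;

// Hypothetical release hook. With zeroOnFree set, the contents are scrubbed
// before the block would go back to the pool, so the next owner cannot read
// stale data. Skipping the scrub saves one full pass over the block.
inline void releaseBlock(Block& block, bool zeroOnFree) {
    if (zeroOnFree) {
        std::fill(block.begin(), block.end(), std::uint8_t{0});
    }
    // Returning the block to the allocator is omitted in this sketch.
}
```

The cost of the scrub grows linearly with block size, which is why it shows up as an operator-level tradeoff and not as a fixed overhead.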

Compliance

  • Per-tenant VRAM quotas support multi-tenant SaaS isolation requirements.
  • Audit log for GPU operations supports SOC 2 resource access tracking.
  • Dynamic time-slicing for multi-tenant isolation (Issue #1795) prevents noisy-neighbor DoS.
  • WASM sandbox (when fully wired) will meet third-party code execution security requirements for marketplace plugins.
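The per-tenant quota requirement in the first compliance item can be sketched as a small accounting class. This is an illustration under stated assumptions, not the GPUMemoryManagerEdition interface: the class name `TenantQuotaSketch`, its methods, and the rule that a tenant without a configured quota gets nothing are all hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

class TenantQuotaSketch {
public:
    void setQuota(const std::string& tenant, std::size_t bytes) { quota_[tenant] = bytes; }

    // Reserves bytes for the tenant. On failure the tenant's usage is unchanged.
    bool tryReserve(const std::string& tenant, std::size_t bytes) {
        auto q = quota_.find(tenant);
        if (q == quota_.end()) return false;  // no quota configured: deny by default
        std::size_t& used = used_[tenant];
        // Compare against the remaining headroom to avoid overflow in used + bytes.
        if (used > q->second || bytes > q->second - used) return false;
        used += bytes;
        return true;
    }

    // Returns bytes to the tenant's budget, clamping at zero.
    void release(const std::string& tenant, std::size_t bytes) {
        auto u = used_.find(tenant);
        if (u == used_.end()) return;
        u->second -= (bytes < u->second ? bytes : u->second);
    }

    std::size_t used(const std::string& tenant) const {
        auto u = used_.find(tenant);
        return u == used_.end() ? 0 : u->second;
    }

private:
    std::unordered_map<std::string, std::size_t> quota_;
    std::unordered_map<std::string, std::size_t> used_;
};
```

Because each tenant is checked only against its own budget, one tenant exhausting its quota does not change what another tenant can reserve.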