
Add NVIDIA pytest validation lane#4359

Open
dfredriksenTT wants to merge 4 commits into aknezevic/nsmith/hf-bringup2 from dfredriksen/nvidia-pytest-validation

Conversation


@dfredriksenTT dfredriksenTT commented Apr 22, 2026

Summary

This PR adds a manifest-driven NVIDIA validation lane to tt-xla so a selected set of PyTorch model tests can be run as CPU-vs-CUDA comparisons using the existing evaluator stack. The new path stays inside the current tests/runner and tests/infra structure rather than introducing a separate harness.

At a high level, the change does three things. It adds a new pytest entrypoint for NVIDIA validation, adds a CUDA tester that compares CPU golden outputs to CUDA outputs, and extends the device connector and runner layers so the existing workload abstractions can execute on CUDA without forcing full TT/XLA initialization during collection.

```mermaid
flowchart LR
    A[Manifest row\ntest_case_id] --> B[test_models_nvidia.py]
    B --> C[Loader discovery\nexisting tt-xla model loaders]
    C --> D[DynamicTorchCudaModelTester]
    D --> E[CPU golden run]
    D --> F[CUDA run]
    E --> G[Existing comparison evaluators]
    F --> G
    G --> H[validated pass / fail]
```

What Changed

The new entrypoint is tests/runner/test_models_nvidia.py. It accepts --nvidia-cohort-json, resolves each test_case_id against the branch-local loader registry, and records results through the same report-property path the rest of the repository already uses.
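The collection flow described above can be sketched roughly as follows. This is a hypothetical illustration of how a manifest-driven pytest lane parametrizes tests from a cohort JSON file, not the PR's actual code; the field names and the `load_cohort` helper are assumptions, and only the `--nvidia-cohort-json` option name comes from the PR.

```python
import json

def load_cohort(path):
    """Read the cohort JSON and return the list of test_case_id strings.

    Execution is keyed on test_case_id, so rows without one are skipped.
    (The real file's schema may differ; a flat list of row objects is
    assumed here for illustration.)
    """
    with open(path) as f:
        rows = json.load(f)
    return [row["test_case_id"] for row in rows if "test_case_id" in row]

def pytest_generate_tests(metafunc):
    # Parametrize any test that requests a test_case_id fixture with one
    # cohort row each, so pytest IDs mirror the manifest entries.
    if "test_case_id" in metafunc.fixturenames:
        path = metafunc.config.getoption("--nvidia-cohort-json")
        metafunc.parametrize("test_case_id", load_cohort(path))
```

With a hook like this, `pytest --collect-only` enumerates one test per manifest row without running any model.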

The CUDA execution path is implemented in tests/runner/testers/torch/dynamic_torch_cuda_model_tester.py. This is intentionally small: it reuses the existing Torch tester and comparison machinery, but swaps the TT target path for a CUDA target path.
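The core idea of the tester can be sketched in a few lines. This is a minimal stand-in, not `DynamicTorchCudaModelTester` itself: the real tester reuses tt-xla's Torch tester and comparison evaluators, so every name below is illustrative and the tolerance check stands in for the repository's evaluator stack.

```python
def run_comparison(model_fn, inputs, device_fn, atol=1e-3):
    """Compare a CPU golden run against a device run of the same model.

    model_fn  -- callable producing the CPU golden outputs (a flat list
                 of floats is assumed for this sketch)
    device_fn -- callable producing the device outputs (CUDA in the lane)
    atol      -- absolute tolerance standing in for the real evaluators
    """
    golden = model_fn(inputs)      # CPU golden run
    candidate = device_fn(inputs)  # device run
    return all(abs(g - c) <= atol for g, c in zip(golden, candidate))
```

The design point is that only the target path changes: the golden path, inputs, and pass/fail criterion are shared with the TT lane.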

Supporting changes in tests/infra make CUDA execution fit the existing abstractions. The connector layer now knows about DeviceType.CUDA, the runner layer can execute workloads on CUDA, and several imports were made lazy so collection for the NVIDIA lane does not immediately bootstrap TT/XLA-only dependencies.
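The connector-level change can be pictured with a small sketch, assuming an enum-style device type and a dispatching runner; the `DeviceType` member name is taken from the PR text, but everything else (the `run_workload` signature, the `tt_runtime` module) is a placeholder for illustration.

```python
from enum import Enum, auto

class DeviceType(Enum):
    CPU = auto()
    TT = auto()
    CUDA = auto()  # the new member the connector layer learns about

def run_workload(workload, device: DeviceType):
    """Dispatch a workload callable to a device target.

    The TT branch imports its runtime lazily, inside the branch, so that
    collecting the NVIDIA lane on a CUDA-only host never touches
    TT/XLA-only dependencies.
    """
    if device is DeviceType.TT:
        import importlib
        tt_runtime = importlib.import_module("tt_runtime")  # hypothetical module
        return tt_runtime.run(workload)
    if device is DeviceType.CUDA:
        return workload(device="cuda")
    return workload(device="cpu")
```

Moving the import inside the branch is the whole trick: collection only evaluates module-level code, so backend-specific imports deferred to call time cannot break it.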

The manifest contract in test_models_nvidia.py was also tightened. Execution is keyed by test_case_id; display metadata is optional. That matches how the lane really works and avoids treating model_id as a required execution contract when it is only descriptive metadata.
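The tightened contract amounts to a simple validation rule, sketched here with assumed field names (`test_case_id` and `model_id` appear in the PR text; the helper itself is hypothetical):

```python
def validate_manifest_row(row):
    """Normalize one manifest row under the tightened contract.

    test_case_id keys execution and is required; model_id is display
    metadata only and may be absent.
    """
    if "test_case_id" not in row:
        raise ValueError("manifest row missing required test_case_id")
    return {
        "test_case_id": row["test_case_id"],
        "model_id": row.get("model_id"),  # optional, descriptive only
    }
```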

Validation

Local validation:

  • python3 -m py_compile passed on the new and changed NVIDIA lane files
  • pre-commit run --all-files passed on the branch after formatting and import cleanup

Host-backed validation on the AWS A10G machine:

  • pytest --collect-only -q tests/runner/test_models_nvidia.py --nvidia-cohort-json /tmp/results-main-nvidia-cohort.json
    • result: 72 tests collected in 9.16s
  • pytest -q "tests/runner/test_models_nvidia.py::test_models_torch_nvidia[bart/question_answering/pytorch-bart-large-finetuned-squadv1]" --nvidia-cohort-json /tmp/results-main-nvidia-cohort.json
    • result: 1 passed

Earlier bounded proof on the same host also passed for:

  • squeezebert/pytorch-Mnli
  • bert_tiny_finetuned_mnli/sequence_classification/pytorch-bert-tiny-finetuned-mnli

Current Limits

This lane is real and usable, but it is not the end of the NVIDIA bringup work.

Collection still surfaces loader-discovery warnings for optional dependencies that are not installed on the validation host. That noise is real, but it does not invalidate the proof cases above. The results_main.yaml TT-pass source cohort is also only partially runnable in the current host environment: the current split is 72 runnable / 28 blocked, so the blocked portion should be treated as follow-on loader or dependency adaptation work rather than silently assumed to be covered by this PR.
