
Add NVIDIA pytest validation lane#4359

Open
dfredriksenTT wants to merge 4 commits into aknezevic/nsmith/hf-bringup2 from dfredriksen/nvidia-pytest-validation

Conversation


@dfredriksenTT dfredriksenTT commented Apr 22, 2026

Summary

This PR adds a manifest-driven NVIDIA validation lane to tt-xla so a selected set of PyTorch model tests can be run as CPU-vs-CUDA comparisons using the existing evaluator stack. The new path stays inside the current tests/runner and tests/infra structure rather than introducing a separate harness.

At a high level, the change does three things. It adds a new pytest entrypoint for NVIDIA validation, adds a CUDA tester that compares CPU golden outputs to CUDA outputs, and extends the device connector and runner layers so the existing workload abstractions can execute on CUDA without forcing full TT/XLA initialization during collection.

```mermaid
flowchart LR
    A[Manifest row\ntest_case_id] --> B[test_models_nvidia.py]
    B --> C[Loader discovery\nexisting tt-xla model loaders]
    C --> D[DynamicTorchCudaModelTester]
    D --> E[CPU golden run]
    D --> F[CUDA run]
    E --> G[Existing comparison evaluators]
    F --> G
    G --> H[validated pass / fail]
```

What Changed

The new entrypoint is tests/runner/test_models_nvidia.py. It accepts --nvidia-cohort-json, resolves each test_case_id against the branch-local loader registry, and records results through the same report-property path the rest of the repository already uses.
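The collection flow described above can be sketched roughly as follows. This is a hypothetical illustration of how a manifest-driven pytest lane parametrizes tests from a cohort JSON file, not the PR's actual code; the field names and the `load_cohort` helper are assumptions, and only the `--nvidia-cohort-json` option name comes from the PR.

```python
import json

def load_cohort(path):
    """Read the cohort JSON and return the list of test_case_id strings.

    Execution is keyed on test_case_id, so rows without one are skipped.
    (The real file's schema may differ; a flat list of row objects is
    assumed here for illustration.)
    """
    with open(path) as f:
        rows = json.load(f)
    return [row["test_case_id"] for row in rows if "test_case_id" in row]

def pytest_generate_tests(metafunc):
    # Parametrize any test that requests a test_case_id fixture with one
    # cohort row each, so pytest IDs mirror the manifest entries.
    if "test_case_id" in metafunc.fixturenames:
        path = metafunc.config.getoption("--nvidia-cohort-json")
        metafunc.parametrize("test_case_id", load_cohort(path))
```

With a hook like this, `pytest --collect-only` enumerates one test per manifest row without running any model.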

The CUDA execution path is implemented in tests/runner/testers/torch/dynamic_torch_cuda_model_tester.py. This is intentionally small: it reuses the existing Torch tester and comparison machinery, but swaps the TT target path for a CUDA target path.
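The core idea of the tester can be sketched in a few lines. This is a minimal stand-in, not `DynamicTorchCudaModelTester` itself: the real tester reuses tt-xla's Torch tester and comparison evaluators, so every name below is illustrative and the tolerance check stands in for the repository's evaluator stack.

```python
def run_comparison(model_fn, inputs, device_fn, atol=1e-3):
    """Compare a CPU golden run against a device run of the same model.

    model_fn  -- callable producing the CPU golden outputs (a flat list
                 of floats is assumed for this sketch)
    device_fn -- callable producing the device outputs (CUDA in the lane)
    atol      -- absolute tolerance standing in for the real evaluators
    """
    golden = model_fn(inputs)      # CPU golden run
    candidate = device_fn(inputs)  # device run
    return all(abs(g - c) <= atol for g, c in zip(golden, candidate))
```

The design point is that only the target path changes: the golden path, inputs, and pass/fail criterion are shared with the TT lane.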

Supporting changes in tests/infra make CUDA execution fit the existing abstractions. The connector layer now knows about DeviceType.CUDA, the runner layer can execute workloads on CUDA, and several imports were made lazy so collection for the NVIDIA lane does not immediately bootstrap TT/XLA-only dependencies.
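The connector-level change can be pictured with a small sketch, assuming an enum-style device type and a dispatching runner; the `DeviceType` member name is taken from the PR text, but everything else (the `run_workload` signature, the `tt_runtime` module) is a placeholder for illustration.

```python
from enum import Enum, auto

class DeviceType(Enum):
    CPU = auto()
    TT = auto()
    CUDA = auto()  # the new member the connector layer learns about

def run_workload(workload, device: DeviceType):
    """Dispatch a workload callable to a device target.

    The TT branch imports its runtime lazily, inside the branch, so that
    collecting the NVIDIA lane on a CUDA-only host never touches
    TT/XLA-only dependencies.
    """
    if device is DeviceType.TT:
        import importlib
        tt_runtime = importlib.import_module("tt_runtime")  # hypothetical module
        return tt_runtime.run(workload)
    if device is DeviceType.CUDA:
        return workload(device="cuda")
    return workload(device="cpu")
```

Moving the import inside the branch is the whole trick: collection only evaluates module-level code, so backend-specific imports deferred to call time cannot break it.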

The manifest contract in test_models_nvidia.py was also tightened. Execution is keyed by test_case_id; display metadata is optional. That matches how the lane really works and avoids treating model_id as a required execution contract when it is only descriptive metadata.
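The tightened contract amounts to a simple validation rule, sketched here with assumed field names (`test_case_id` and `model_id` appear in the PR text; the helper itself is hypothetical):

```python
def validate_manifest_row(row):
    """Normalize one manifest row under the tightened contract.

    test_case_id keys execution and is required; model_id is display
    metadata only and may be absent.
    """
    if "test_case_id" not in row:
        raise ValueError("manifest row missing required test_case_id")
    return {
        "test_case_id": row["test_case_id"],
        "model_id": row.get("model_id"),  # optional, descriptive only
    }
```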

Validation

Local validation:

  • python3 -m py_compile passed on the new and changed NVIDIA lane files
  • pre-commit run --all-files passed on the branch after formatting and import cleanup

Host-backed validation on the AWS A10G machine:

  • pytest --collect-only -q tests/runner/test_models_nvidia.py --nvidia-cohort-json /tmp/results-main-nvidia-cohort.json
    • result: 72 tests collected in 9.16s
  • pytest -q "tests/runner/test_models_nvidia.py::test_models_torch_nvidia[bart/question_answering/pytorch-bart-large-finetuned-squadv1]" --nvidia-cohort-json /tmp/results-main-nvidia-cohort.json
    • result: 1 passed

Earlier bounded proof on the same host also passed for:

  • squeezebert/pytorch-Mnli
  • bert_tiny_finetuned_mnli/sequence_classification/pytorch-bert-tiny-finetuned-mnli

Current Limits

This lane is real and usable, but it is not the end of the NVIDIA bringup work.

Collection still surfaces loader-discovery warnings for optional dependencies that are not installed on the validation host. That noise is real, but it does not invalidate the proof cases above. The results_main.yaml TT-pass source cohort is also only partially runnable in the current host environment: the current split is 72 runnable / 28 blocked, so the blocked portion should be treated as follow-on loader or dependency adaptation work rather than silently assumed to be covered by this PR.
