
[TENT][Sunrise] Add sunrise_link transport, platform support, and UT …#1915

Open
HomeDish wants to merge 4 commits into kvcache-ai:main from HomeDish:tent_sunrise

Conversation

@HomeDish

Description

This PR integrates SunriseLink as a new transport backend in TENT Transfer Engine and wires Sunrise platform support end-to-end, including transport loading, platform probing/allocation, benchmark integration, and build system updates.

Key changes:

  • Add Sunrise transport implementation:
    • tent/src/transport/sunrise_link/*
    • tent/include/tent/transport/sunrise_link/*
  • Add Sunrise platform implementation:
    • tent/src/platform/sunrise/*
    • tent/include/tent/platform/sunrise.h
  • Wire Sunrise into runtime and transport/platform loader:
    • tent/src/runtime/transport_loader.cpp
    • tent/src/runtime/platform.cpp
    • tent/src/runtime/transfer_engine_impl.cpp
    • tent/include/tent/common/types.h
  • Update build integration (CMakeLists.txt) for Sunrise components.
  • Extend benchmark support for sunrise_link:
    • update transfer_engine_bench.cpp
    • add Sunrise benchmark report markdown.
  • Add/refresh Sunrise transport documentation:
    • docs/source/zh_archive/sunrise_link_transport.md

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • New feature
  • Bug fix
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

  • Built transfer_engine_bench successfully with Sunrise enabled.
  • Verified functionality and performance with transfer_engine_bench using the sunrise_link protocol.
  • Completed full GPU matrix validation (read + write, all non-self pairs) in transfer_engine_bench.cpp.
  • Added a unit test in tent/tests/sunrise_link_transport_test.cpp; it passes.
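
To make the "all non-self pairs" matrix concrete, here is a minimal sketch of how such a pair list could be enumerated. The helper name `nonSelfPairs` is hypothetical and does not come from transfer_engine_bench.cpp; the sketch also assumes the pairs are ordered, so (0, 1) and (1, 0) count as separate cases.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper (not part of the PR): list every ordered
// (src, dst) device pair where src != dst.
std::vector<std::pair<int, int>> nonSelfPairs(int deviceCount) {
    assert(deviceCount >= 0);
    std::vector<std::pair<int, int>> pairs;
    for (int src = 0; src < deviceCount; ++src) {
        for (int dst = 0; dst < deviceCount; ++dst) {
            if (src != dst) pairs.emplace_back(src, dst);  // skip self-transfers
        }
    }
    return pairs;
}
```

Under that assumption, N devices yield N × (N − 1) pairs, each of which would be exercised once for read and once for write.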

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the Sunrise Link Transport, a GPU backend for the Mooncake TENT framework leveraging the Tang Runtime. The changes encompass the transport and platform implementations, build system updates, documentation, and tests. Feedback identifies several issues, including data races in memory registration, device context risks with thread-local streams, and the need for safer memory management using std::vector instead of malloc. Suggestions also include optimizing remote segment invalidation and ensuring asynchronous transfer logic respects configuration settings.
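
As an illustration of the data-race concern in memory registration, the following is a minimal sketch of a registration table whose map is guarded by a mutex. `RegistrationTable` and its methods are hypothetical names chosen for this example; they are not the PR's actual classes, and the real transport may synchronize differently.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for a transport's registered-memory table.
// Every access to the map takes the same mutex, so concurrent
// register/unregister calls cannot race on the container.
class RegistrationTable {
public:
    // Returns false if this base address is already registered.
    bool add(const void* base, std::size_t length) {
        assert(base != nullptr && length > 0);
        std::lock_guard<std::mutex> lock(mutex_);
        return regions_.emplace(base, length).second;
    }

    // Returns false if the base address was not registered.
    bool remove(const void* base) {
        std::lock_guard<std::mutex> lock(mutex_);
        return regions_.erase(base) == 1;
    }

    // Returns the registered length, or 0 if the address is unknown.
    std::size_t lengthOf(const void* base) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = regions_.find(base);
        return it == regions_.end() ? 0 : it->second;
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<const void*, std::size_t> regions_;
};
```

Keeping the lock inside the class means callers cannot forget it; a read-mostly workload might prefer a shared (reader/writer) lock instead.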

Comment thread mooncake-transfer-engine/tent/src/platform/sunrise/sunrise_allocator.cpp Outdated
Comment thread mooncake-transfer-engine/tent/src/platform/sunrise/sunrise_probe.cpp Outdated
Comment thread mooncake-transfer-engine/tent/src/platform/sunrise/sunrise_probe.cpp Outdated
Comment thread mooncake-transfer-engine/tent/src/platform/sunrise/sunrise_probe.cpp Outdated
liujialai added 3 commits April 17, 2026 11:28
…coverage

Integrate Sunrise platform/transport wiring across TENT runtime and examples, add SunriseLink end-to-end unit tests, and fix RDMA error logging pointer formatting to avoid crash during registration failure paths.

Made-with: Cursor
Bump pre-commit hook revisions to current releases so local checks and CI use newer lint/format toolchains consistently.

Made-with: Cursor
Address review feedback in SunriseLink transport/platform paths (stream/device context, registration map synchronization, safer probe/allocator handling, and cache-refresh strategy), and remove the obsolete transfer_engine_sunrise_bench CMake target now that its source no longer exists.
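
The "safer probe/allocator handling" item corresponds to the review suggestion to prefer std::vector over malloc. A minimal sketch of that pattern follows; `fakeQueryDeviceIds` is a made-up stand-in for a C-style runtime query and is not a Tang or PTML function, and `probeDeviceIds` is likewise hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Made-up C-style query used only for illustration: writes up to
// `capacity` ids into `out` and returns how many it wrote.
std::size_t fakeQueryDeviceIds(int* out, std::size_t capacity) {
    const std::size_t available = 3;  // pretend three devices are present
    const std::size_t n = capacity < available ? capacity : available;
    for (std::size_t i = 0; i < n; ++i) out[i] = static_cast<int>(i);
    return n;
}

// The vector owns the scratch buffer, so it is released on every
// return path (including exceptions) without a matching free().
std::vector<int> probeDeviceIds(std::size_t maxDevices) {
    assert(maxDevices > 0);
    std::vector<int> ids(maxDevices);
    const std::size_t n = fakeQueryDeviceIds(ids.data(), ids.size());
    ids.resize(n);  // keep only the entries the query actually wrote
    return ids;
}
```

Compared with a malloc/free pair, this removes the leak on early-return error paths, which is the usual motivation for this kind of review comment.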
@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...-transfer-engine/example/transfer_engine_bench.cpp | 0.00% | 2 Missing ⚠️ |


@HomeDish HomeDish marked this pull request as ready for review April 20, 2026 02:28
@HomeDish HomeDish marked this pull request as draft April 20, 2026 02:34
@HomeDish HomeDish marked this pull request as ready for review April 20, 2026 03:09
@stmatengss
Collaborator

@HomeDish Could you give more information about run_rise transport?

@HomeDish
Author

> @HomeDish Could you give more information about run_rise transport?

@stmatengss Thanks for the question. Here is a quick summary:

  • sunrise_link is a new TENT transport backend for Sunrise GPUs.
  • Modeled on the nvlink transport, it uses the Tang runtime and PTML topology to perform GPU peer-copy / IPC-handle-based data movement between segments.
  • It is integrated into TENT transport loading/runtime and can be selected in the benchmark via: --backend=tent --protocol=sunrise_link
  • For non-Sunrise environments, existing transports (RDMA/TCP/NVLink/etc.) are unchanged.

What was added in this PR:

  1. New transport implementation under tent/src/transport/sunrise_link (+ headers).
  2. Sunrise platform probing/allocation support under tent/src/platform/sunrise.
  3. Runtime wiring so TENT can load and resolve sunrise_link.
  4. Unit test in tent/tests/sunrise_link_transport_test.cpp.
  5. Bench integration and docs update.

Validation status:

  • The sunrise_link unit test runs successfully in our Sunrise environment (with expected skips on unsupported GPU pairs).
  • The functional bench path (transfer_engine_bench, backend=tent, protocol=sunrise_link) is validated.

If helpful, I can also share:

  • expected runtime dependencies (Tang/PTML libs),
  • a short architecture diagram of the sunrise_link data flow.
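
For readers unfamiliar with what "load and resolve sunrise_link" could involve, here is a minimal sketch of mapping a protocol string to a transport kind. This is an assumption-laden illustration, not TENT's loader: `TransportKind` and `resolveProtocol` are hypothetical, and only the string "sunrise_link" is taken from this PR; the other protocol strings are guesses.

```cpp
#include <cassert>
#include <string>

// Hypothetical enum for illustration; TENT's real type may differ.
enum class TransportKind { Unknown, Rdma, Tcp, NvLink, SunriseLink };

// Map a --protocol value to a transport kind. Only "sunrise_link"
// comes from the PR; "rdma", "tcp", and "nvlink" are assumed spellings.
TransportKind resolveProtocol(const std::string& name) {
    assert(!name.empty());
    if (name == "rdma") return TransportKind::Rdma;
    if (name == "tcp") return TransportKind::Tcp;
    if (name == "nvlink") return TransportKind::NvLink;
    if (name == "sunrise_link") return TransportKind::SunriseLink;
    return TransportKind::Unknown;  // caller decides how to report this
}
```

Because the new name is an additional branch, lookups for the existing protocols behave as before, which matches the statement that other transports are unchanged.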

Collaborator

@alogfans alogfans left a comment

LGTM


4 participants