
varlendataset for thd e2e and benchmark #4832

Open
xiaoyao0115 wants to merge 1 commit into NVIDIA:dev from xiaoyao0115:varlen-dataset


Conversation

Contributor

@xiaoyao0115 xiaoyao0115 commented May 16, 2026

What does this PR do ?

Add VarlenDataset for variable-length training over HF / local jsonl / parquet data

1. What this PR does

Adds a new dataset class VarlenDataset (and its MockVarlenDataset sibling) for variable-length training, gated by a new top-level flag --use-varlen-dataset. Designed for the packed-sequence (THD) path, both static (--sequence-packing-scheduler dp_balanced) and dynamic (--dynamic-context-parallel) variants.

Why a separate dataset class instead of extending --sft?

  • Supports running the THD path end-to-end directly on Hugging Face datasets.
  • Each __getitem__ returns one tokenized sample in unpacked form (tokens, labels, loss_mask, position_ids, original_seq_len, padded_seq_len).
  • The upstream packing scheduler (dp_balanced or default_dynamic_cp) sees variable-length samples and packs them across the DP×CP grid up to --max-seqlen-per-dp-cp-rank.
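A per-sample return value might look like the following. This is a hypothetical sketch (not the PR's code); the field names come from the list above, and the padding/label logic is illustrative:

```python
import numpy as np

def make_varlen_sample(token_ids, pad_to):
    """Illustrative sketch of one unpacked VarlenDataset sample (not the PR's code)."""
    seq_len = len(token_ids)
    tokens = np.zeros(pad_to, dtype=np.int64)
    tokens[:seq_len] = token_ids
    labels = np.roll(tokens, -1)           # next-token targets (illustrative shift)
    loss_mask = np.zeros(pad_to, dtype=np.float32)
    loss_mask[: seq_len - 1] = 1.0         # no loss on padding or the final shifted slot
    return {
        "tokens": tokens,
        "labels": labels,
        "loss_mask": loss_mask,
        "position_ids": np.arange(pad_to, dtype=np.int64),
        "original_seq_len": seq_len,
        "padded_seq_len": pad_to,
    }
```

The scheduler only needs `original_seq_len`/`padded_seq_len` metadata to pack samples across ranks.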

This is what BasePackingScheduler.get_required_sample_keys() already expects, and what the existing comment in data_schedule_utils.py flags as the "ideal" dataset shape. SFTDataset triggers a (wasteful) unpack → repack round-trip via _unpack_batch; VarlenDataset skips it.

Three additional framework-level fixes are rolled in so the new path works cleanly without breaking --sft:

  1. _unpack_batch short-circuits when the sample already has padded_seq_len (no cu_seqlens-based slicing needed). --sft path unchanged.
  2. data_samplers.py uses identity collate_fn for all packing schedulers, not just --dynamic-context-parallel. The previous gate excluded dp_balanced users.
  3. pretrain_gpt.py:get_batch widens the is_packed_sequence check from args.sft to args.sft or args.use_varlen_dataset.
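The short-circuit in fix 1 amounts to an early return when a sample is already unpacked. A minimal sketch, assuming the real `_unpack_batch` otherwise slices on `cu_seqlens` boundaries (field names illustrative):

```python
def unpack_batch(sample):
    """Sketch of the _unpack_batch short-circuit (illustrative, not the PR's code)."""
    if "padded_seq_len" in sample:
        # VarlenDataset already yields one unpacked sample; no cu_seqlens slicing needed.
        return [sample]
    # --sft path: split the packed sample at cu_seqlens boundaries.
    cu = sample["cu_seqlens"]
    return [
        {"tokens": sample["tokens"][start:end]}
        for start, end in zip(cu[:-1], cu[1:])
    ]
```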

Three validate-args asserts guard the new flag:

  • --use-varlen-dataset is mutually exclusive with --sft (both select the packed-sequence dataset family).
  • --use-varlen-dataset with --mock-data is allowed (routes to MockVarlenDataset, configured via --varlen-mock-dataset-config-json).
  • --use-varlen-dataset auto-picks a packing scheduler when none is given: dp_balanced by default, or default_dynamic_cp when --dynamic-context-parallel is set. --varlen-bshd-validation opts out of the packing path entirely.

Files touched

megatron/training/datasets/varlen_dataset.py   (new, ~340 lines)
megatron/training/arguments.py                  +51   new args group + validate asserts
megatron/training/datasets/data_samplers.py     +6    identity collate for all scheduler paths
megatron/core/datasets/data_schedule_utils.py   +23   _unpack_batch short-circuit
megatron/core/datasets/gpt_dataset.py           +5    varlen_mock_dataset_config_json field
pretrain_gpt.py                                 +13   dataset_type dispatch + is_packed_sequence

Total: 5 modified + 1 new, ~96 line diff plus the new file.

2. How to use it

--use-varlen-dataset reuses the existing --data-path argument. Three input sources, all auto-detected:

# HuggingFace Hub repo id (auto-downloaded by `datasets.load_dataset`)
--use-varlen-dataset --data-path Yukang/LongAlpaca-12k
--use-varlen-dataset --data-path HuggingFaceH4/no_robots
--use-varlen-dataset --data-path databricks/databricks-dolly-15k

# Local parquet
--use-varlen-dataset --data-path /path/to/dataset.parquet

# Local jsonl
--use-varlen-dataset --data-path /path/to/dataset.jsonl

A sequence packing scheduler is auto-selected: dp_balanced (static) by default, or default_dynamic_cp when --dynamic-context-parallel is passed. To override either default, pass --sequence-packing-scheduler explicitly.

Supported dataset schemas (auto-detected from column names)

Each jsonl line / parquet row / HF Hub row must match one of:

A. Alpaca / Dolly style — at least one of instruction | prompt | query | question, plus one of output | response | completion | answer. Optional supplementary context: input | context.

{"instruction": "Summarize this paper.", "input": "Paper text...", "output": "..."}
{"instruction": "Who wrote 1984?", "context": "1984 was written...", "response": "Orwell"}
{"prompt": "Q?", "response": "A."}
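A plausible conversion from an Alpaca-style row into a two-turn conversation, folding the optional input/context into the user turn (hypothetical sketch; function name and exact formatting are assumptions):

```python
def alpaca_to_messages(row):
    """Sketch: map Alpaca/Dolly field synonyms into a user/assistant turn pair."""
    prompt = (row.get("instruction") or row.get("prompt")
              or row.get("query") or row.get("question"))
    extra = row.get("input") or row.get("context")
    answer = (row.get("output") or row.get("response")
              or row.get("completion") or row.get("answer"))
    # Fold the supplementary context into the user turn, if present.
    user = f"{prompt}\n\n{extra}" if extra else prompt
    return [
        {"role": "user", "content": user},
        {"role": "assistant", "content": answer},
    ]
```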

B. ShareGPT style — a conversations column with {"from": ..., "value": ...} entries.

{"conversations": [
    {"from": "human", "value": "Hi"},
    {"from": "gpt",   "value": "Hello"}
]}

from is mapped to chat-template roles via a small dict (human/user → user; gpt/assistant/model/chatgpt/bing/bard → assistant; tool/function/observation → tool); unknown speakers fall back to user.
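That speaker-to-role mapping could be sketched as a small dict with a user fallback (hypothetical names, not the PR's code):

```python
# Hypothetical sketch of the ShareGPT "from" -> chat-template role mapping.
SHAREGPT_ROLE_MAP = {
    "human": "user", "user": "user",
    "gpt": "assistant", "assistant": "assistant", "model": "assistant",
    "chatgpt": "assistant", "bing": "assistant", "bard": "assistant",
    "tool": "tool", "function": "tool", "observation": "tool",
}

def map_role(speaker):
    # Unknown speakers fall back to "user".
    return SHAREGPT_ROLE_MAP.get(speaker.lower(), "user")
```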

C. OpenAI messages style — a messages column with {"role": ..., "content": ...} entries.

{"messages": [
    {"role": "system",    "content": "Be terse."},
    {"role": "user",      "content": "Hi"},
    {"role": "assistant", "content": "Hello"}
]}

Detection priority: messages > conversations > alpaca-synonyms. Unrecognized columns raise a clear ValueError.
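The detection priority could be implemented roughly as follows (a sketch under the column synonyms listed above; the real logic is in varlen_dataset.py):

```python
PROMPT_KEYS = {"instruction", "prompt", "query", "question"}
RESPONSE_KEYS = {"output", "response", "completion", "answer"}

def detect_schema(columns):
    """Sketch of schema auto-detection: messages > conversations > alpaca synonyms."""
    cols = set(columns)
    if "messages" in cols:
        return "openai-messages"
    if "conversations" in cols:
        return "sharegpt"
    if cols & PROMPT_KEYS and cols & RESPONSE_KEYS:
        return "alpaca"
    raise ValueError(f"Unrecognized dataset columns: {sorted(cols)}")
```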

Known compatible HuggingFace datasets

The schemas above cover most public SFT corpora. Examples that work out of the box (no preprocessing, just --data-path owner/repo):

| HF repo id | Schema | Approx size | Notes |
|---|---|---|---|
| Yukang/LongAlpaca-12k | alpaca | 12 k rows / 500 MB | Long-context SFT, many samples > 16k tokens |
| tatsu-lab/alpaca | alpaca | 52 k rows | The canonical Stanford Alpaca dataset |
| vicgalle/alpaca-gpt4 | alpaca | 52 k rows | GPT-4 regenerated Alpaca |
| databricks/databricks-dolly-15k | alpaca (instruction + context + response) | 15 k rows | Dolly fields auto-handled via synonyms |
| HuggingFaceH4/no_robots | openai-messages | 10 k rows / 22 MB parquet | Multi-turn chat |
| Open-Orca/OpenOrca | sharegpt-style (column conversations) | ~3 M rows | Large; expect long load |
| Open-Orca/SlimOrca | sharegpt | ~500 k rows | Filtered subset of OpenOrca |
| lmsys/lmsys-chat-1m | openai-messages | 1 M rows | Multi-turn user/assistant |
| cognitivecomputations/SystemChat-2.0 | sharegpt | ~7 k rows | System-prompt-led conversations |
| nvidia/HelpSteer2 | alpaca-like (prompt + response) | ~10 k rows | Picked up via the prompt/response synonyms |

Datasets explicitly not supported (would raise on schema detect):

  • OpenAssistant/oasst1 — tree-structured conversation graph
  • Anthropic/hh-rlhf — preference pairs (chosen / rejected), not a single conversation per row
  • Multi-modal SFT corpora (content stored as a list of image / text parts)

Mock mode (for benchmarking)

--use-varlen-dataset --mock-data
--use-varlen-dataset --mock-data --varlen-mock-dataset-config-json \
  '{"mode":"distribution","type":"lognormal","min_seq_len":1024,"max_seq_len":8192,"mean_seq_len":4096,"lognormal_sigma":1.2}'

Three mock modes (mirroring --sft-mock-dataset-config-json):

  • distribution (lognormal seq-length sampling)
  • file (per-line lengths from a CSV)
  • verification (real tokens from an IndexedDataset, with lognormal sampled lengths)
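The distribution mode's lognormal length sampling could look like this (a sketch under the config keys shown above; the clamping and choice of mu are assumptions, not the PR's exact math):

```python
import math
import random

def sample_seq_len(min_seq_len, max_seq_len, mean_seq_len, lognormal_sigma, rng=random):
    """Sketch of lognormal sequence-length sampling for the mock dataset (illustrative)."""
    # Place the lognormal median near the requested mean length (assumed heuristic).
    mu = math.log(mean_seq_len)
    length = int(rng.lognormvariate(mu, lognormal_sigma))
    # Clamp into the configured [min, max] range.
    return max(min_seq_len, min(max_seq_len, length))
```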

BSHD reference mode (for THD numerical verification)

--varlen-bshd-validation bypasses the packed-sequence path entirely: each sample is right-padded to --seq-length, no cu_seqlens, no packing scheduler. Used to obtain a BSHD reference run from the same data and same tokenization that the THD path consumes, so the two can be compared for correctness. Incompatible with --dynamic-context-parallel and --sequence-packing-scheduler.

# Side-by-side run for THD correctness validation:
--use-varlen-dataset --data-path my_data.jsonl                                              # THD (with scheduler)
--use-varlen-dataset --data-path my_data.jsonl --varlen-bshd-validation                     # BSHD reference

Tokenizer requirement

Same as --sft: needs a tokenizer with tokenize_conversation support. Pass --tokenizer-type SFTTokenizer --sft-tokenizer-prompt-format {default | nemotron-h-aligned | nemotron-nano-v2 | identity} along with --tokenizer-model <hf-tokenizer-dir>.

Limitations (these raise errors rather than silently mishandling data)

  • Tree-structured (e.g. OpenAssistant oasst1) or chosen/rejected preference datasets are not supported.
  • Multi-modal samples (content as a list of image/text parts) are not supported.
  • HF Hub repos: only split="train" is loaded. Export to a local jsonl/parquet first if your dataset's primary split is named differently.

Issue tracking

For PRs from open-source community contributors:

  • New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
  • Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or tag @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch: the proposed review process is under active discussion.

MRs are mergeable after one approval by either [email protected] or [email protected].

@xiaoyao0115 xiaoyao0115 self-assigned this May 16, 2026
@xiaoyao0115 xiaoyao0115 requested review from a team as code owners May 16, 2026 21:59

copy-pr-bot Bot commented May 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
