
varlendataset for thd e2e and benchmark #4832

Open
xiaoyao0115 wants to merge 1 commit into NVIDIA:dev from xiaoyao0115:varlen-dataset


Conversation

Contributor

@xiaoyao0115 xiaoyao0115 commented May 16, 2026

What does this PR do ?

Add VarlenDataset for variable-length training over HF / local jsonl / parquet data

1. What this PR does

Adds a new dataset class VarlenDataset (and its MockVarlenDataset sibling) for variable-length training, gated by a new top-level flag --use-varlen-dataset. Designed for the packed-sequence (THD) path, both static (--sequence-packing-scheduler dp_balanced) and dynamic (--dynamic-context-parallel) variants.

Why a separate dataset class instead of extending --sft?

  • Supports running the THD path end-to-end directly on Hugging Face datasets.
  • Each __getitem__ returns one tokenized sample in unpacked form (tokens, labels, loss_mask, position_ids, original_seq_len, padded_seq_len).
  • The upstream packing scheduler (dp_balanced or default_dynamic_cp) sees variable-length samples and packs them across the DP×CP grid up to --max-seqlen-per-dp-cp-rank.
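A per-sample return value might look like the following. This is a hypothetical sketch (not the PR's code); the field names come from the list above, and the padding/label logic is illustrative:

```python
import numpy as np

def make_varlen_sample(token_ids, pad_to):
    """Illustrative sketch of one unpacked VarlenDataset sample (not the PR's code)."""
    seq_len = len(token_ids)
    tokens = np.zeros(pad_to, dtype=np.int64)
    tokens[:seq_len] = token_ids
    labels = np.roll(tokens, -1)           # next-token targets (illustrative shift)
    loss_mask = np.zeros(pad_to, dtype=np.float32)
    loss_mask[: seq_len - 1] = 1.0         # no loss on padding or the final shifted slot
    return {
        "tokens": tokens,
        "labels": labels,
        "loss_mask": loss_mask,
        "position_ids": np.arange(pad_to, dtype=np.int64),
        "original_seq_len": seq_len,
        "padded_seq_len": pad_to,
    }
```

The scheduler only needs `original_seq_len`/`padded_seq_len` metadata to pack samples across ranks.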

This is what BasePackingScheduler.get_required_sample_keys() already expects, and what the existing comment in data_schedule_utils.py flags as the "ideal" dataset shape. SFTDataset triggers a (wasteful) unpack → repack round-trip via _unpack_batch; VarlenDataset skips it.

Three additional framework-level fixes are rolled in so the new path works cleanly without breaking --sft:

  1. _unpack_batch short-circuits when the sample already has padded_seq_len (no cu_seqlens-based slicing needed). --sft path unchanged.
  2. data_samplers.py uses identity collate_fn for all packing schedulers, not just --dynamic-context-parallel. The previous gate excluded dp_balanced users.
  3. pretrain_gpt.py:get_batch widens the is_packed_sequence check from args.sft to args.sft or args.use_varlen_dataset.
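The short-circuit in fix 1 amounts to an early return when a sample is already unpacked. A minimal sketch, assuming the real `_unpack_batch` otherwise slices on `cu_seqlens` boundaries (field names illustrative):

```python
def unpack_batch(sample):
    """Sketch of the _unpack_batch short-circuit (illustrative, not the PR's code)."""
    if "padded_seq_len" in sample:
        # VarlenDataset already yields one unpacked sample; no cu_seqlens slicing needed.
        return [sample]
    # --sft path: split the packed sample at cu_seqlens boundaries.
    cu = sample["cu_seqlens"]
    return [
        {"tokens": sample["tokens"][start:end]}
        for start, end in zip(cu[:-1], cu[1:])
    ]
```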

Three validate-args asserts guard the new flag:

  • --use-varlen-dataset is mutually exclusive with --sft (both select the packed-sequence dataset family).
  • --use-varlen-dataset with --mock-data is allowed (routes to MockVarlenDataset, configured via --varlen-mock-dataset-config-json).
  • --use-varlen-dataset auto-picks a packing scheduler when none is given: dp_balanced by default, or default_dynamic_cp when --dynamic-context-parallel is set. --varlen-bshd-validation opts out of the packing path entirely.

Files touched

megatron/training/datasets/varlen_dataset.py   (new, ~340 lines)
megatron/training/arguments.py                  +51   new args group + validate asserts
megatron/training/datasets/data_samplers.py     +6    identity collate for all scheduler paths
megatron/core/datasets/data_schedule_utils.py   +23   _unpack_batch short-circuit
megatron/core/datasets/gpt_dataset.py           +5    varlen_mock_dataset_config_json field
pretrain_gpt.py                                 +13   dataset_type dispatch + is_packed_sequence

Total: 5 modified + 1 new, ~96 line diff plus the new file.

2. How to use it

--use-varlen-dataset reuses the existing --data-path argument. Three input sources, all auto-detected:

# HuggingFace Hub repo id (auto-downloaded by `datasets.load_dataset`)
--use-varlen-dataset --data-path Yukang/LongAlpaca-12k
--use-varlen-dataset --data-path HuggingFaceH4/no_robots
--use-varlen-dataset --data-path databricks/databricks-dolly-15k

# Local parquet
--use-varlen-dataset --data-path /path/to/dataset.parquet

# Local jsonl
--use-varlen-dataset --data-path /path/to/dataset.jsonl

A sequence packing scheduler is auto-selected: dp_balanced (static) by default, or default_dynamic_cp when --dynamic-context-parallel is passed. To override either default, pass --sequence-packing-scheduler explicitly.

Supported dataset schemas (auto-detected from column names)

Each jsonl line / parquet row / HF Hub row must match one of:

A. Alpaca / Dolly style — at least one of instruction | prompt | query | question, plus one of output | response | completion | answer. Optional supplementary context: input | context.

{"instruction": "Summarize this paper.", "input": "Paper text...", "output": "..."}
{"instruction": "Who wrote 1984?", "context": "1984 was written...", "response": "Orwell"}
{"prompt": "Q?", "response": "A."}
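A plausible conversion from an Alpaca-style row into a two-turn conversation, folding the optional input/context into the user turn (hypothetical sketch; function name and exact formatting are assumptions):

```python
def alpaca_to_messages(row):
    """Sketch: map Alpaca/Dolly field synonyms into a user/assistant turn pair."""
    prompt = (row.get("instruction") or row.get("prompt")
              or row.get("query") or row.get("question"))
    extra = row.get("input") or row.get("context")
    answer = (row.get("output") or row.get("response")
              or row.get("completion") or row.get("answer"))
    # Fold the supplementary context into the user turn, if present.
    user = f"{prompt}\n\n{extra}" if extra else prompt
    return [
        {"role": "user", "content": user},
        {"role": "assistant", "content": answer},
    ]
```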

B. ShareGPT style — a conversations column with {"from": ..., "value": ...} entries.

{"conversations": [
    {"from": "human", "value": "Hi"},
    {"from": "gpt",   "value": "Hello"}
]}

from is mapped to chat-template roles via a small dict (human/user → user; gpt/assistant/model/chatgpt/bing/bard → assistant; tool/function/observation → tool); unknown speakers fall back to user.
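That speaker-to-role mapping could be sketched as a small dict with a user fallback (hypothetical names, not the PR's code):

```python
# Hypothetical sketch of the ShareGPT "from" -> chat-template role mapping.
SHAREGPT_ROLE_MAP = {
    "human": "user", "user": "user",
    "gpt": "assistant", "assistant": "assistant", "model": "assistant",
    "chatgpt": "assistant", "bing": "assistant", "bard": "assistant",
    "tool": "tool", "function": "tool", "observation": "tool",
}

def map_role(speaker):
    # Unknown speakers fall back to "user".
    return SHAREGPT_ROLE_MAP.get(speaker.lower(), "user")
```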

C. OpenAI messages style — a messages column with {"role": ..., "content": ...} entries.

{"messages": [
    {"role": "system",    "content": "Be terse."},
    {"role": "user",      "content": "Hi"},
    {"role": "assistant", "content": "Hello"}
]}

Detection priority: messages > conversations > alpaca-synonyms. Unrecognized columns raise a clear ValueError.
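The detection priority could be implemented roughly as follows (a sketch under the column synonyms listed above; the real logic is in varlen_dataset.py):

```python
PROMPT_KEYS = {"instruction", "prompt", "query", "question"}
RESPONSE_KEYS = {"output", "response", "completion", "answer"}

def detect_schema(columns):
    """Sketch of schema auto-detection: messages > conversations > alpaca synonyms."""
    cols = set(columns)
    if "messages" in cols:
        return "openai-messages"
    if "conversations" in cols:
        return "sharegpt"
    if cols & PROMPT_KEYS and cols & RESPONSE_KEYS:
        return "alpaca"
    raise ValueError(f"Unrecognized dataset columns: {sorted(cols)}")
```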

Known compatible HuggingFace datasets

The schemas above cover most public SFT corpora. Examples that work out of the box (no preprocessing, just --data-path owner/repo):

| HF repo id | Schema | Approx size | Notes |
|---|---|---|---|
| Yukang/LongAlpaca-12k | alpaca | 12 k rows / 500 MB | Long-context SFT, many samples > 16k tokens |
| tatsu-lab/alpaca | alpaca | 52 k rows | The canonical Stanford Alpaca dataset |
| vicgalle/alpaca-gpt4 | alpaca | 52 k rows | GPT-4 regenerated Alpaca |
| databricks/databricks-dolly-15k | alpaca (instruction + context + response) | 15 k rows | Dolly fields auto-handled via synonyms |
| HuggingFaceH4/no_robots | openai-messages | 10 k rows / 22 MB parquet | Multi-turn chat |
| Open-Orca/OpenOrca | sharegpt-style (column conversations) | ~3 M rows | Large; expect long load |
| Open-Orca/SlimOrca | sharegpt | ~500 k rows | Filtered subset of OpenOrca |
| lmsys/lmsys-chat-1m | openai-messages | 1 M rows | Multi-turn user/assistant |
| cognitivecomputations/SystemChat-2.0 | sharegpt | ~7 k rows | System-prompt-led conversations |
| nvidia/HelpSteer2 | alpaca-like (prompt + response) | ~10 k rows | Picked up via the prompt/response synonyms |

Datasets explicitly not supported (would raise on schema detect):

  • OpenAssistant/oasst1 — tree-structured conversation graph
  • Anthropic/hh-rlhf — preference pairs (chosen / rejected), not a single conversation per row
  • Multi-modal SFT corpora (content stored as a list of image / text parts)

Mock mode (for benchmarking)

--use-varlen-dataset --mock-data
--use-varlen-dataset --mock-data --varlen-mock-dataset-config-json \
  '{"mode":"distribution","type":"lognormal","min_seq_len":1024,"max_seq_len":8192,"mean_seq_len":4096,"lognormal_sigma":1.2}'

Three mock modes (mirroring --sft-mock-dataset-config-json):

  • distribution (lognormal seq-length sampling)
  • file (per-line lengths from a CSV)
  • verification (real tokens from an IndexedDataset, with lognormal sampled lengths)
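The distribution mode's lognormal length sampling could look like this (a sketch under the config keys shown above; the clamping and choice of mu are assumptions, not the PR's exact math):

```python
import math
import random

def sample_seq_len(min_seq_len, max_seq_len, mean_seq_len, lognormal_sigma, rng=random):
    """Sketch of lognormal sequence-length sampling for the mock dataset (illustrative)."""
    # Place the lognormal median near the requested mean length (assumed heuristic).
    mu = math.log(mean_seq_len)
    length = int(rng.lognormvariate(mu, lognormal_sigma))
    # Clamp into the configured [min, max] range.
    return max(min_seq_len, min(max_seq_len, length))
```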

BSHD reference mode (for THD numerical verification)

--varlen-bshd-validation bypasses the packed-sequence path entirely: each sample is right-padded to --seq-length, no cu_seqlens, no packing scheduler. Used to obtain a BSHD reference run from the same data and same tokenization that the THD path consumes, so the two can be compared for correctness. Incompatible with --dynamic-context-parallel and --sequence-packing-scheduler.

# Side-by-side run for THD correctness validation:
--use-varlen-dataset --data-path my_data.jsonl                                              # THD (with scheduler)
--use-varlen-dataset --data-path my_data.jsonl --varlen-bshd-validation                     # BSHD reference

Tokenizer requirement

Same as --sft: needs a tokenizer with tokenize_conversation support. Pass --tokenizer-type SFTTokenizer --sft-tokenizer-prompt-format {default | nemotron-h-aligned | nemotron-nano-v2 | identity} along with --tokenizer-model <hf-tokenizer-dir>.

Limitations (these raise errors rather than silently mishandling data)

  • Tree-structured (e.g. OpenAssistant oasst1) or chosen/rejected preference datasets are not supported.
  • Multi-modal samples (content as a list of image/text parts) are not supported.
  • HF Hub repos: only split="train" is loaded. Export to a local jsonl/parquet first if your dataset's primary split is named differently.

Issue tracking

For PRs from open-source community contributors:

  • New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
  • Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or tag @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch: the proposed review process is under active discussion.

MRs are mergeable after one approval by either [email protected] or [email protected].

@xiaoyao0115 xiaoyao0115 self-assigned this May 16, 2026
@xiaoyao0115 xiaoyao0115 requested review from a team as code owners May 16, 2026 21:59

copy-pr-bot Bot commented May 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
