# VarlenDataset for THD e2e and benchmark (#4832)
xiaoyao0115 wants to merge 1 commit into
Signed-off-by: tailaim <[email protected]>
df8fad0 to 93d806a
## What does this PR do?
Add `VarlenDataset` for variable-length training over HF / local jsonl / parquet data.

### 1. What this PR does
Adds a new dataset class `VarlenDataset` (and its `MockVarlenDataset` sibling) for variable-length training, gated by a new top-level flag `--use-varlen-dataset`. It is designed for the packed-sequence (THD) path, in both its static (`--sequence-packing-scheduler dp_balanced`) and dynamic (`--dynamic-context-parallel`) variants.

**Why a separate dataset class instead of extending `--sft`?**

- `__getitem__` returns one tokenized sample in unpacked form (`tokens`, `labels`, `loss_mask`, `position_ids`, `original_seq_len`, `padded_seq_len`).
- The packing scheduler (`dp_balanced` or `default_dynamic_cp`) sees variable-length samples and packs them across the DP×CP grid up to `--max-seqlen-per-dp-cp-rank`.
- This is what `BasePackingScheduler.get_required_sample_keys()` already expects, and what the existing comment in `data_schedule_utils.py` flags as the "ideal" dataset shape.
- `SFTDataset` triggers a (wasteful) unpack → repack round-trip via `_unpack_batch`; `VarlenDataset` skips it.

Three additional framework-level fixes are rolled in to make the new path work cleanly without breaking `--sft`:

- `_unpack_batch` short-circuits when the sample already has `padded_seq_len` (no cu_seqlens-based slicing needed). The `--sft` path is unchanged.
- `data_samplers.py` uses the identity `collate_fn` for all packing schedulers, not just `--dynamic-context-parallel`. The previous gate excluded `dp_balanced` users.
- `pretrain_gpt.py:get_batch` widens the `is_packed_sequence` check from `args.sft` to `args.sft or args.use_varlen_dataset`.

Three validate-args asserts guard the new flag:

- `--use-varlen-dataset` ⊥ `--sft` (both select the packed-sequence dataset family).
- `--use-varlen-dataset` + `--mock-data` is allowed (it routes to `MockVarlenDataset`, configured via `--varlen-mock-dataset-config-json`).
- `--use-varlen-dataset` auto-picks a packing scheduler when none is given: `dp_balanced` by default, or `default_dynamic_cp` when `--dynamic-context-parallel` is set.
- `--varlen-bshd-validation` opts out of the packing path entirely.

**Files touched**

Total: 5 modified + 1 new, ~96-line diff plus the new file.
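As a rough illustration of the per-sample contract above, here is a minimal sketch of the unpacked sample dict that `VarlenDataset.__getitem__` is described as returning. The field names come from this PR; the function name, `pad_id`, and the round-up-to-a-multiple padding policy are hypothetical choices for illustration, not the actual implementation:

```python
def make_varlen_sample(token_ids, pad_id=0, pad_to_multiple=64):
    """Hypothetical sketch: build one unpacked variable-length sample."""
    original_seq_len = len(token_ids)
    # Round the padded length up to a multiple (illustrative policy only).
    padded_seq_len = -(-original_seq_len // pad_to_multiple) * pad_to_multiple
    pad = padded_seq_len - original_seq_len
    return {
        "tokens": token_ids + [pad_id] * pad,
        "labels": token_ids[1:] + [pad_id] * (pad + 1),  # shifted next-token targets
        "loss_mask": [1] * original_seq_len + [0] * pad,  # no loss on padding
        "position_ids": list(range(padded_seq_len)),
        "original_seq_len": original_seq_len,
        "padded_seq_len": padded_seq_len,
    }

sample = make_varlen_sample(list(range(1, 11)), pad_to_multiple=8)
```

A packing scheduler such as `dp_balanced` can then concatenate several such samples into one THD batch up to `--max-seqlen-per-dp-cp-rank`, presumably deriving `cu_seqlens` from the per-sample lengths.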
### 2. How to use it
`--use-varlen-dataset` reuses the existing `--data-path` argument. Three input sources (HF Hub repo id, local jsonl, local parquet), all auto-detected.

A sequence packing scheduler is auto-selected: `dp_balanced` (static) by default, or `default_dynamic_cp` when `--dynamic-context-parallel` is passed. To override either default, pass `--sequence-packing-scheduler` explicitly.

**Supported dataset schemas (auto-detected from column names)**
Each jsonl line / parquet row / HF Hub row must match one of:
**A. Alpaca / Dolly style** — at least one of `instruction | prompt | query | question`, plus one of `output | response | completion | answer`. Optional supplementary context: `input | context`.

```json
{"instruction": "Summarize this paper.", "input": "Paper text...", "output": "..."}
{"instruction": "Who wrote 1984?", "context": "1984 was written...", "response": "Orwell"}
{"prompt": "Q?", "response": "A."}
```

**B. ShareGPT style** — a `conversations` column with `{"from": ..., "value": ...}` entries.

```json
{"conversations": [{"from": "human", "value": "Hi"}, {"from": "gpt", "value": "Hello"}]}
```

`from` is mapped to chat-template roles via a small dict (`human`/`user` → `user`; `gpt`/`assistant`/`model`/`chatgpt`/`bing`/`bard` → `assistant`; `tool`/`function`/`observation` → `tool`); unknown speakers fall back to `user`.

**C. OpenAI messages style** — a `messages` column with `{"role": ..., "content": ...}` entries.

```json
{"messages": [{"role": "system", "content": "Be terse."}, {"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}
```

Detection priority: `messages` > `conversations` > alpaca synonyms. Unrecognized columns raise a clear `ValueError`.

**Known compatible HuggingFace datasets**
The schemas above cover most public SFT corpora. Examples that work out of the box (no preprocessing, just `--data-path owner/repo`):

- `Yukang/LongAlpaca-12k`
- `tatsu-lab/alpaca`
- `vicgalle/alpaca-gpt4`
- `databricks/databricks-dolly-15k`
- `HuggingFaceH4/no_robots`
- `Open-Orca/OpenOrca` (`conversations`)
- `Open-Orca/SlimOrca`
- `lmsys/lmsys-chat-1m`
- `cognitivecomputations/SystemChat-2.0`
- `nvidia/HelpSteer2` (`prompt` + `response`, matched via the prompt/response synonyms)
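The column-name detection and ShareGPT role mapping described above can be sketched in a few lines of plain Python. The function and constant names here are hypothetical, not the PR's actual identifiers; the priority order and role table follow the description above:

```python
ALPACA_PROMPT_KEYS = {"instruction", "prompt", "query", "question"}
ALPACA_ANSWER_KEYS = {"output", "response", "completion", "answer"}

def detect_schema(columns):
    """Pick a schema from column names: messages > conversations > alpaca."""
    cols = set(columns)
    if "messages" in cols:
        return "messages"        # OpenAI style
    if "conversations" in cols:
        return "sharegpt"
    if cols & ALPACA_PROMPT_KEYS and cols & ALPACA_ANSWER_KEYS:
        return "alpaca"
    raise ValueError(f"Unrecognized dataset columns: {sorted(cols)}")

# ShareGPT "from" values -> chat-template roles; unknown speakers -> "user".
ROLE_MAP = {
    "human": "user", "user": "user",
    "gpt": "assistant", "assistant": "assistant", "model": "assistant",
    "chatgpt": "assistant", "bing": "assistant", "bard": "assistant",
    "tool": "tool", "function": "tool", "observation": "tool",
}

def map_role(speaker):
    return ROLE_MAP.get(speaker, "user")
```

A preference-pair dataset like `Anthropic/hh-rlhf` (columns `chosen`/`rejected`) would fall through all three branches and raise, matching the "raise rather than silently mishandle" behavior below.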
OpenAssistant/oasst1— tree-structured conversation graphAnthropic/hh-rlhf— preference pairs (chosen / rejected), not a single conversation per rowMock mode (for benchmarking)
```
--use-varlen-dataset --mock-data

--use-varlen-dataset --mock-data --varlen-mock-dataset-config-json \
  '{"mode":"distribution","type":"lognormal","min_seq_len":1024,"max_seq_len":8192,"mean_seq_len":4096,"lognormal_sigma":1.2}'
```

Three mock modes (mirroring `--sft-mock-dataset-config-json`):

- `distribution` — lognormal seq-length sampling
- `file` — per-line lengths from a CSV
- `verification` — real tokens from an IndexedDataset, with lognormal-sampled lengths

**BSHD reference mode (for THD numerical verification)**
`--varlen-bshd-validation` bypasses the packed-sequence path entirely: each sample is right-padded to `--seq-length`; no `cu_seqlens`, no packing scheduler. It is used to obtain a BSHD reference run from the same data and the same tokenization that the THD path consumes, so the two can be compared for correctness. Incompatible with `--dynamic-context-parallel` and `--sequence-packing-scheduler`.

**Tokenizer requirement**
Same as `--sft`: needs a tokenizer with `tokenize_conversation` support. Pass `--tokenizer-type SFTTokenizer --sft-tokenizer-prompt-format {default | nemotron-h-aligned | nemotron-nano-v2 | identity}` along with `--tokenizer-model <hf-tokenizer-dir>`.

**Limitations (raise rather than silently mishandle)**
Only `split="train"` is loaded. Export to a local jsonl/parquet first if your dataset's primary split is named differently.

## Issue tracking
For PRs from open-source community contributors:
Linked issue:
## Contribution process
Pre-checks
Code review
Feel free to message or tag @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
`.github/CODEOWNERS`.

Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change `megatron/core`, once all expert reviewers have approved, the `Final Review` label is applied automatically and final reviewers are assigned.

For PRs outside `megatron/core`, this step is skipped.

Step 3: Approved
Once all required reviewers have approved, the `Approved` label is applied automatically.

Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either `[email protected]` or `[email protected]`.