All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Added API reference and user guide for the `olmo_core.generate` module and interactive chat interface.
- Added support for in-loop perplexity evals with context parallelism (CP) and tensor parallelism (TP).
- Added documentation for verifying chat template settings before running evals after SFT.
- Added `olmo_core.data.composable` module.
- Added `PeriNormTransformerBlock`.
- Added exponential learning rate scheduler to `olmo_core.optim.scheduler`.
- Added internal Olmo3 32B midtraining and long context configs.
- MoE: Added `TrainModuleConfigABC`.
- Added a `MetricSaverCallback` which just saves metrics at specific intervals to JSON files in the `save_folder`.
- Added `fixed_steps` option to `Checkpointer` and `Evaluator` callbacks for configuring checkpoints/evals at specific step numbers.
- Added support for supervised finetuning.
- Added support for gated attention.
- Added support for no-global-rope ("GNoPE").
- Added support for MXFP8 Linear layers via torchao.
- Added support for tracking total flops.
- Added support for Gemma 3 models.
- Added support for Qwen3 models.
- Added support for Muon and Dion optimizers.
- Added support for Ulysses-style context parallelism.
- Added Beaker URL to Wandb logging.
- Added 60M, 14M, and 1M model sizes.
- `SpeedMonitorCallback` will log the Chinchilla multiple number of tokens during training with a `TransformerTrainModule`.
- Added support for flash-attn 4 (CUTE implementation).
- Added `Callback.pre_log_metrics()` method.
- Added `SequenceMixer` base class that both attention and recurrent layers inherit from.
- Added `GatedDeltaNet` layer implementation.
- Added `InitMethod.fan_in` for per-layer fan-in initialization, where each weight matrix uses `std = 1/√d_in`.
- Added `StabilityMonitorCallback` for detecting training instability via spike detection in loss and gradient norm.
- Added `gate` and `activation` parameters to `TransformerConfig.gemma3_like()`.
- Added another ladder script, with `--train-single` flag.
- Added `CuTeRMSNorm`, a CuTe-based RMSNorm implementation from the QuACK library.
- Added `lazy` option to `DownstreamEvaluatorCallbackConfig` for lazily loading each task, which can decrease startup time.
- `TrainingProgress` (from `Trainer.training_progress`) now includes `current_tokens`, `bps`, `tps`, and `mfu` fields.
- `BeakerCallback` will include throughput metrics in the workload description.
- Added `olmo_core.io.deterministic_glob_directory` function.
- Added the option to cache the results of certain IO operations on remote files, like `get_file_size()` and `deterministic_glob_directory()`, by setting the env var `OLMO_CORE_FS_CACHE_DIR` to a local directory.
- Added `eval_on_finish` option to `EvaluatorCallback`.
- Added the option to use a process pool instead of a thread pool when writing checkpoints.
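The fan-in rule mentioned above (`std = 1/√d_in`) can be illustrated with a self-contained sketch; the helper names here are hypothetical, and this is not the actual `InitMethod.fan_in` code:

```python
import math

import numpy as np


def fan_in_std(d_in: int) -> float:
    """Per-layer fan-in std: each weight matrix uses std = 1 / sqrt(d_in)."""
    return 1.0 / math.sqrt(d_in)


def fan_in_init(d_out: int, d_in: int, rng: np.random.Generator) -> np.ndarray:
    """Sample a (d_out, d_in) weight matrix from N(0, fan_in_std(d_in)^2)."""
    return rng.normal(0.0, fan_in_std(d_in), size=(d_out, d_in))
```

Scaling the init std with the input dimension keeps pre-activation variance roughly constant across layers of different widths.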
- Added `max_document_length` and `long_doc_strategy` options to `NumpyDocumentSource` in the composable data API.
- Mark ephemeral checkpoints with the `ephemeral` flag in their metadata.
- Added `ephemeral: Optional[bool]` flag to `Checkpointer.find_checkpoints()` for filtering.
- Added support for block-pattern based initialization of hybrid transformers. `TransformerConfig.block` now accepts a dict of named `TransformerBlockConfig`s, paired with a `block_pattern` list that controls per-layer block selection.
- Added optional `vocab_size` field to `DataCollator` for validating that token IDs are in `[0, vocab_size)` before the batch reaches the model. Wired through automatically in both `NumpyDataLoaderConfig` and `ComposableDataLoaderConfig`.
- Added Olmo-hybrid official training configs and conversion script.
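The block-pattern idea described above can be sketched generically. This is an illustrative stand-in, not OLMo-core's actual config code; in particular, cycling the pattern over the layers is an assumption:

```python
from typing import Dict, List


def expand_block_pattern(
    blocks: Dict[str, str], block_pattern: List[str], n_layers: int
) -> List[str]:
    """Map each layer index to a named block config by cycling the pattern.

    ``blocks`` plays the role of the dict of named block configs, and
    ``block_pattern`` controls per-layer block selection.
    """
    for name in block_pattern:
        if name not in blocks:
            raise KeyError(f"unknown block config: {name!r}")
    return [blocks[block_pattern[i % len(block_pattern)]] for i in range(n_layers)]
```

For example, a pattern of one attention block followed by three recurrent blocks, cycled over eight layers, yields a 1:3 hybrid stack.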
- Added new in-loop eval tasks: Generative QA BPB tasks, expanded MT-MBPP languages, and Science/Medical RC tasks.
- Added paged KV cache support to `FlashAttention4Backend` for inference on Blackwell (SM >= 10.0) GPUs.
- Added Code Fresh per-language perplexity evals.
- Fixed `Transformer.get_rope_buffers()` crashing on non-RoPE attention mixers like `GatedDeltaNet`.
- Fixed A100 peak flops spec in `SpeedMonitorCallback` being 2x too low, which inflated MFU by 2x.
- Fixed `AttentionConfig.num_params()` overcounting QK norm parameters when using GQA/MQA with `use_head_qk_norm=False`.
- Fixed the peak learning rate in `src/scripts/train/OLMo3/OLMo3-32B-midtraining.py`.
- Fixed type annotation issue in `NumpyInterleavedFSLDataset` where `_num_interleaving_exempt_instances` and `_num_interleavable_instances` were missing `Optional[int]` type hints, causing mypy errors.
- Fixed bug in `GPUMonitorCallback` where it was using a Wandb reserved keyword, preventing data from being visualized in the Wandb dashboard.
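The MFU inflation noted above follows directly from the definition MFU = achieved FLOP/s ÷ peak FLOP/s, so a peak spec that is 2x too low doubles the reported MFU. A minimal sketch (312 TFLOP/s is the A100's dense bf16 peak):

```python
def mfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs utilization: fraction of the hardware's peak throughput achieved."""
    return achieved_flops_per_s / peak_flops_per_s


A100_BF16_PEAK = 312e12  # dense bf16 peak, FLOP/s

# A peak spec that's 2x too low inflates the reported MFU by exactly 2x.
achieved = 140e12
assert abs(mfu(achieved, A100_BF16_PEAK / 2) - 2 * mfu(achieved, A100_BF16_PEAK)) < 1e-12
```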
- Fixed the `ConsoleLoggerCallback` filtering to support the new `gpu_memory` prefix for `GPUMonitorCallback`.
- Avoid torch dynamo recompiles when intra-document masking is enabled by marking `cu_doc_lens` and `max_doc_len` dynamic.
- Fixed flops tracking for `ParallelMLP` and SWA layers.
- Fixed overflow when too many global flops are computed.
- Fixed a typo in the ladder lm evaluator.
- Made some functions involved in data loading preprocessing more robust to race conditions.
- Fixed a bug where `GAPMonitorCallback` would raise an error if a local tensor shard had 0 elements.
- Fixed a bug where final metrics might not get logged.
- Fixed failing `test_build_world_mesh_cpu` for PyTorch 2.10.
- Fixed failing `convert_checkpoint_to_hf_test` by reducing the total disk space required.
- Ensure all metrics have been logged and bookkeeping ops complete before writing a checkpoint.
- Fixed `self == InitMethod.*` comparisons in `Attention`, `FusedAttention`, and `GatedDeltaNet` init that should have been `init_method == InitMethod.*`, which caused depth-scaled output projection init to never apply.
- Minor improvements to make checkpointing more robust.
- Updated SFT documentation with alternative tokenization approach and tips for new base models.
- Renamed `olmo_core.distributed.utils.scatter_object()` to `broadcast_object()` for correctness.
- Updated stable torch version to 2.9.1 and updated versions of underlying libraries in Beaker images.
- `olmo_core.io.join_path()` now accepts an arbitrary number of components to join.
- All `olmo_core.nn` module configs now inherit from a common base class, `ModuleConfig`.
- Big changes to the `olmo_core.model_ladder` API.
- Added ngram instance filter to `olmo3_ladder`.
- Upgraded to beaker-py v2.
- Now check `dist.is_initialized()` before calling `dist.init_process_group()` in `init_distributed()`.
v2.4.0 - 2025-11-20
- Added option to skip ranges of steps in the trainer.
- Send a Slack notification when a Beaker job appears to be stuck.
- Added `ignore_fingerprint_mismatch` parameter to `NumpyDataLoaderConfig` to allow resuming training from a checkpoint with a different dataset mix.
- Added helpful error messages when OLMo-mix-0625 files are not found, directing users to use OLMo-mix-0925 and the fingerprint override flag.
- Added `olmo_core.generate.chat` module to allow interacting with OLMo Core models without conversion to other formats.
- Added `GAPMonitorCallback` for monitoring gradients, activations, and parameters (GAP).
- Added official Olmo 3 7B and 32B pretraining scripts and data mix.
- Added official Olmo 3 7B and 32B midtraining scripts and data mix.
- Added official Olmo 3 7B and 32B long-context scripts and data mix.
- Added a `NoOpOptimizer` that does nothing, uses no memory, and can be used for debugging.
- Added official config for Olmo 3 32B.
- Added Olmo 3 model card and checkpoint manifests.
- Set missing `NCCL_NVLSTREE_MAX_CHUNKSIZE` env var that is now needed for running jobs on the Augusta cluster.
- Fixed bug with `RemoteFileSystemReader` that caused excess memory usage.
- No longer overrides `random`'s RNG seed when building `SourceMixtureDatasetConfig`.
- Fixed handling of URLs in `olmo_core.nn.hf.checkpoint.save_hf_model` and in `examples/huggingface`.
- Fixed potential NaN loss that can occur when using instance masking.
- Stability improvements developed while training Olmo3 32B.
- Removed unused field in `YaRNRoPEScalingConfig`.
v2.3.0 - 2025-10-17
- Fixed parsing username+password git remote URLs in the `launch.beaker` module.
- Fixed bug with default setup steps in `launch.beaker.BeakerLaunchConfig` when a branch can't be resolved.
- Cluster names in Beaker have changed.
- Fixed mixture rounding error with `SourceMixtureDataset`, which was previously causing samples to be repeated at the end of training.
- Don't DDOS Beaker from big jobs.
- A configuration error is now raised if you pass in a URL for the trainer or dataset's working directory. Previously the URL would just get mangled into a local path, leading to unexpected behavior.
- Fixed an issue where the `ConsoleLoggerCallback` would attempt to log before the first step.
- Only call `teardown_distributed_environment()` when training ends cleanly to avoid a hang for the duration of the distributed backend's timeout when there's an error from one rank.
- Fixed tensor parallelism issue with torch 2.8.
- More fixes for Beaker cluster names.
- `Callback.post_train()` will still be called even if the run is canceled before the dry-run batch.
- `GarbageCollectorCallback` will restore `gc` settings even when `Trainer.fit()` exits on an error.
- Made `move_to_device` blocking for MPS device to fix possible incorrect transfer of data from CPU to MPS.
- Fixed bug where `glob_directory()` would fail to match certain glob patterns.
- Added one more type of error to retry on when the Google Storage API throws it.
- Perform a garbage collection after checkpointing to avoid running out of CPU memory.
- Fixed an avoidable overflow error when using `NumpyPackedFSLDataset`.
- Fixed issue with `NumpyFSLDatasetMixture` + `SourceMixtureDataset` where not all instances would have the same sequence length.
- Attention backend will no longer default to flash in non-CUDA environments.
- The `dir` option to `Trainer.maybe_load_checkpoint()` is now optional and defaults to the `save_folder`.
- Set `fused_linear_cross_entropy_loss` `accum_dtype` to fp32 in `LMHead`.
- Increased `NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS` from 10 minutes to 30 minutes.
- `SlackNotifierCallback` will now notify on checkpoint-saved and post-epoch events.
- `BeakerLaunchConfig.launch()` will now send Slack notifications by default when `follow=True` if the env var `SLACK_WEBHOOK_URL` is set.
- `src/examples/llama/` has been renamed to `src/examples/llm/`.
- Refactored eval task groups into `task_groups.py`.
- The `use_flash` argument to the `Attention` classes is deprecated. Use `backend="flash_2"` instead.
- Refactored `NumpyDatasetConfig` by splitting it into a separate config per underlying dataset class.
- Refactored `internal/experiment` module to facilitate modifying datasets or supplying a fully custom `ExperimentConfig`.
- Simplified `SourceMixtureDatasetConfig` by removing redundant `sequence_length` and `dtype` fields.
- The `model_id` argument to `convert_state_from_hf` is deprecated. Conversion information is deduced from the model type.
- Refactored the example conversion scripts to/from HF, including decreasing false failures in validation.
- Small refactor to `source_mixture.py` to make it easier to define data mixes in YAML.
- Reorganized/cleaned up internal training scripts.
- Added CLI script `src/scripts/unshard.py` for converting distributed checkpoints to regular PyTorch or safetensors format.
- Added a custom block that does LayerNorm scaling.
- Added `OLMo-mix-0625-150Bsample` data mix.
- Added alias support to `DataMix` enum.
- Added the `HalfCos` learning rate scheduler.
- Added `CONTRIBUTING.md` guidelines.
- Added a lightweight, gantry-like Beaker launch CLI: `python -m olmo_core.launch.beaker`.
- Added Beaker images with torch 2.8. There is `olmo-core-tch280cu128-2025-09-18` and `olmo-core-tch280cu129-2025-09-18` for CUDA 12.8 and 12.9, respectively.
- Added TransformerEngine to Docker images and a TransformerEngine attention backend.
- Added `Callback.close()` method, which is always called when exiting `Trainer.fit()`.
- Added flash-attention 3 to Docker images, and added the `flash_3` attention backend.
- Added support for sliding window attention to the Torch attention backend. Performance is not optimized, so other backends should be preferred.
- Added `RoPEScalingConfig.to_hf_config()` for each RoPE scaling method to support automatic conversion to HuggingFace format.
- Added guide to dataset mixing in `docs/source/guides/data_mixing.rst`.
- Added support for converting FlexOlmo models (with both dropless and default MoEs) between OLMo Core and HF formats.
- Added `olmo3_7B` model config.
- Added additional internal configuration tools.
- Added a new named data mix that we used for the 32B run.
- Added the ability for `GenerationModule` to load multiple checkpoints at once and average them.
- Added internal OLMo3 7B midtraining config.
- Added internal OLMo3 7B midtraining and long-context configs.
- Added ability to convert OLMo3 models to/from HF format with support for RoPE scaling configs.
- Added the `WSDS` (Warmup-Stable-Decay-Simplified) learning rate scheduler.
- Added a script that can pull out a single training batch from a training job.
v2.2.0 - 2025-08-26
- Added option to set the LR scheduler based on tokens instead of steps (e.g. `--train_module.scheduler.units=tokens`).
- Added a "packed" numpy FSL variant that packs documents into sequences using the best-fit-decreasing bin packing algorithm, following the work from Fewer Truncates Improve Language Modeling.
- Added module `olmo_core.testing`.
- Added an "interleaved" numpy FSL variant that interleaves several documents into sequences, following the work from LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models.
- Added sliding window attention as a feature.
- Added `BatchSizeSchedulerCallback` for setting a batch size schedule over the course of a training run.
- Added optional `TrainModule` method, `.pre_train()`, which runs right after `Callback.pre_train()`.
- The `BeakerCallback` will save the config and Python requirements to the results dataset.
- Added `from_file` method to `Config` class.
- Added in-loop evals for OLMES basic skills eval.
- Added in-loop fast MCQA for in-loop evals and translated MBPP tasks.
- Added in-loop few-shot HumanEval BPB.
- Added `fast` and `full` in-loop recommendations, where `fast` is a roughly 2-3x faster subset of `full`.
- Added support for converting to HF models in lower precisions.
- Added support for headwise QK norm.
- Added BOS token in in-loop evals when specified by the tokenizer (`ai2-olmo-eval==0.8.4`).
- Added support for BOS token matching EOS token for intra-document masking in FSL numpy datasets.
- Added option to allow profiler to record on multiple ranks.
- Added support for accessing Google on non-Google clusters via auth with service account keys.
- Added an example script for launching an SFT job.
- Added support for revisions in `convert_checkpoint_from_hf.py` and the `load_hf_model` method of `olmo_core.nn.hf.checkpoint`.
- `foreach` support in `SkipStepAdamW`.
- Added `budget` mode for activation checkpointing configuration.
- Added `io.remove_file()` and `io.glob_directory` functions.
- Added ABF, PI, and YaRN rope scaling strategies.
- Added a script to compare two WandB runs.
- Added `namespace` option to `nn.buffer_cache.BufferCache`.
- Added the option to configure `head_stride` for context parallelism with ring-flash-attn.
- Added the option to group multiple npy source files together for packing with the packed FSL dataset by setting `source_group_size` to an integer greater than 1.
- Added `load_optim_state: Optional[bool]` option to `Trainer.load_checkpoint()`.
- Added `GenerationModule` for OLMo-core native autoregressive generation with support for KV caching.
- Added optional hostname constraints for Beaker experiments on Google clusters.
- Output of `LMHead` when `labels` is passed as input is now a 4-tuple instead of a 3-tuple, with `(logits, loss, ce_loss, z_loss)`, where `loss` is the combined loss (`ce_loss + z_loss`).
- The `ConfigSaver` callback will automatically set the config to save for other callbacks (`WandBCallback`, `CometCallback`, and `BeakerCallback` as of now).
- Fixed bug causing slow evals in BPB/RC in-loop evals due to fast MC.
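The best-fit-decreasing packing referenced above can be sketched in its classic generic form; this is an illustration of the algorithm, not the actual `NumpyPackedFSLDataset` implementation:

```python
from typing import List


def best_fit_decreasing(doc_lens: List[int], seq_len: int) -> List[List[int]]:
    """Pack document lengths into bins of capacity seq_len using
    best-fit-decreasing: sort docs longest-first, then place each doc into
    the bin with the least remaining room that still fits it, opening a new
    bin otherwise."""
    bins: List[List[int]] = []
    free: List[int] = []  # remaining capacity per bin
    for length in sorted(doc_lens, reverse=True):
        best = min(
            (i for i in range(len(bins)) if free[i] >= length),
            key=lambda i: free[i],
            default=None,
        )
        if best is None:
            bins.append([length])
            free.append(seq_len - length)
        else:
            bins[best].append(length)
            free[best] -= length
    return bins
```

Compared to naive concatenate-and-truncate, this keeps documents whole, which is the point of the "fewer truncations" approach the entry cites.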
- Changed default precision of converted HF models in `src/examples/huggingface/convert_checkpoint_to_hf.py` to bfloat16.
- Changed default cluster to `saturn` in `src/examples/llama/train_launch.py`.
- Made some Beaker secrets optional for internal experiments.
- Changed `SlidingWindowAttentionConfig` to improve clarity.
- Changed the default Beaker budget.
- Modified `TokenizerConfig.from_hf()` to fall back to `tokenizer_config.json` if `config.json` is not found.
- Fixed loading checkpoints with missing keys from transformer train modules using torch 2.7.
- Made MoE load balancing loss more robust.
- Fixed a bug with `ReorderedNormTransformerBlock` when using fine-grained FSDP wrapping and activation checkpointing together.
- Fixed an issue preventing tensor parallelism from working with `LMHead` when using the "fused_linear" loss implementation.
- Fixed a bug with `LMHead` when using the "fused_linear" loss implementation where the `ce_loss` output included the `z_loss` added to it.
- Fixed training on single GPU when using a `SkipStepOptimizer`.
- Fixed the initialization of the `CosWithWarmupAndLinearDecay` learning rate scheduler.
- Ensured eval tasks are sorted to maintain the same order across ranks (the cookbook was configuring these in an unsorted way).
- W&B callback uses working directory instead of save folder for local cache.
- Reset speed monitor callback after changing batch size.
- Fixed parallelism compatibility between CP + TP and CP + PP, and added a test to catch regressions.
- Ensure sharded parameters are initialized differently on separate ranks.
- Fixed fingerprinting for FSL datasets.
- Fixed bug where `step` state in `SkipStepAdamW` was not incremented, biasing the optimizer steps. Added option to restore the bug for backwards compatibility.
- Removed `sklearn` from upstream dependency `ai2-olmo-eval`.
- Made removing ephemeral checkpoints more robust.
- Made running bookkeeping operations more robust.
- Ensure RoPE modules with different settings use a unique sub-cache for their buffers.
- Fixed bug with context parallelism where every transformer block would use the same RoPE buffers even if their RoPE was configured differently.
- Fixed MFU computation to work with FSDP, corrected some device specs.
- Optimization: avoid redundant calls to `model.train()` in `TransformerTrainModule`.
- `NumpyDatasetConfig.expand_glob` now works with remote directories.
- Fixed `Attention` block sharding when TP and head-wise QK norm are both applied.
- Added RoPE scaling configs to `rope` module's exports.
v2.1.0 - 2025-04-14
- Added 50B Dolmino 11/24 mix.
- Added support for auxiliary-loss-free MoE load-balancing, similar to DeepSeek-v3. You can activate this by setting `bias_gamma` to a non-zero float in your `MoERouter` config.
- Added support for sequence-level MoE load balancing loss.
- Compatibility with B200s.
- Added support for `warmup_fraction` as an alternative to `warmup_steps` in all schedulers, allowing warmup to be specified as a fraction of total training steps.
- A better config for the 1B model, ported from the old OLMo trainer.
- Added `auto_resume` option to `CometCallback` for resuming an existing run.
- (BETA) Added methods `load_hf_model` and `save_hf_model` for saving supported OLMo Core models to HF transformers format. Also added lower-level methods for converting state between the formats.
- Added the ability to run the evaluator callback on `.pre_train()` by setting `eval_on_startup=True`, and to cancel the run after the first time evals run by setting `cancel_after_first_eval=True`.
- Added support for label mask files with numpy FSL datasets.
- Added a `git` configuration to `BeakerLaunchConfig`.
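Warmup specified as a fraction, as in the `warmup_fraction` option above, just resolves to an absolute step count before scheduling begins. A hypothetical sketch (not the actual scheduler code):

```python
from typing import Optional


def resolve_warmup_steps(
    total_steps: int,
    warmup_fraction: Optional[float] = None,
    warmup_steps: Optional[int] = None,
) -> int:
    """Resolve warmup to an absolute step count, given either directly or
    as a fraction of the total number of training steps."""
    if (warmup_fraction is None) == (warmup_steps is None):
        raise ValueError("specify exactly one of warmup_fraction or warmup_steps")
    if warmup_steps is not None:
        return warmup_steps
    return round(total_steps * warmup_fraction)
```

Expressing warmup as a fraction keeps configs portable when the total step count changes between runs.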
- `TransformerTrainModuleConfig` can now be used to build a `TransformerPipelineTrainModule` by adding a `pp_config` spec. This makes the `TransformerPipelineTrainModuleConfig` redundant, but it will be kept around for backwards compatibility until the next major release.
- Several state dict methods in `TrainModule` now take an `optim` option, which can disable the use of optimizer state.
- Updated `Float8Config` for latest version of `torchao`.
- Undo a fix applied to `olmo_core.data.numpy_dataset.NumpyFSLDatasetMixture` that was generating a mismatch between the shape of instances in the dataset and the shape of instances in the data loader.
- Made the 1B and 7B scripts more similar to each other.
- Changed underlying logic and top-level arguments of `convert_checkpoint_from_hf.py` and `convert_checkpoint_to_hf.py`.
- Beaker experiments launched with the `BeakerLaunchConfig` will now log with ANSI colors enabled.
- Fixed calculation of total steps based on epochs at the end of a training job.
- Fixed a bug where the trainer might try to save a duplicate final checkpoint if the run that already completed was restarted.
- When submitting a Beaker job from a branch that's tracking a GitHub fork, OLMo-core now instructs Beaker to pull from the fork instead of from the main repo.
- Made Beaker image resolution more robust.
- Having `t_max` overrides in the default model configs is confusing and error-prone, so we removed them.
- Beaker launcher will only clone a single branch at runtime when possible, which can be much faster.
v2.0.1 - 2025-03-18
- Added information about the official 32B training run.
- Added information about the official 32B anneal training run.
- Added automatic support for LL128 when running on Augusta.
- Added information about 32B training logs.
- The official config for the 32B had unrealistic batch size settings.
- Ignore `group_overrides` for frozen parameters instead of throwing an error.
- Bump `ai2-olmo-eval==0.7.1`, which makes the in-loop evaluation consistent with OLMES by removing a bias.
- Removed the "fused" cross-entropy loss variant. It had a bug and consistently under-performed the native PyTorch version when compiled. See Post Incident Report: bug with fused CE loss for more information.
v2.0.0 - 2025-03-12
This major release introduces a few breaking changes. We've provided more information here: OLMo-core v2 design and upgrade guide.
- Added `TrainModule` abstraction with `TransformerTrainModule` implementation, which encapsulates both a model and optimizer.
- Added `namespace` argument to `Trainer.record_metric()`.
- Added support for context parallelism.
- Added support for expert parallelism with MoE models.
- Added in-loop evals for Minerva, GSM, HumanEval, MBPP (`ai2-olmo-eval==0.7.0`).
- Added `CosWithWarmupAndLinearDecay` learning rate scheduler.
- Added `WSD` learning rate scheduler.
- Added `RunDuration` in `model_ladder` to configure training durations in terms of Chinchilla multipliers.
- The `Trainer` now takes a `TrainModule` instead of a model and optimizer, and several configuration options have been moved to `TransformerTrainModule`, including `rank_microbatch_size`, `fused_loss`, `compile_loss`, `z_loss_multiplier`, and `autocast_precision`.
- Several `TransformerModelConfig` options have been moved to `TransformerTrainModule` / `TransformerTrainModuleConfig`, including `dp_config`, `tp_config`, `float8_config`, and `compile`.
- Removed the following callbacks: `MoEHandlerCallback`, `SchedulerCallback`, `MatrixNormalizerCallback`, `GradClipperCallback`, and `Float8HandlerCallback`. The functionality from all of those callbacks has been moved to the `TransformerTrainModule` class.
- Removed the callback methods `.pre_eval_batch()` and `.post_eval_batch()`.
- Fixed the model ladder code when training on MPS or CPU devices.
v1.9.0 - 2025-03-10
- Ensure certain optimizer param group fields are not overridden by the values in a checkpoint.
- Added `instance_filter_config` field to `NumpyDatasetConfig`.
- Added conversion script for OLMo 2 checkpoints to Huggingface format.
- Added `BeakerCallback`.
- Added logging for in-loop eval throughput.
- Fixed issue where non-zero ranks would report partially-reduced values for training metrics.
v1.8.0 - 2025-01-29
- Added support for tensor parallelism. See the `TransformerConfig` class for usage.
- Added more downstream tasks from the model ladder.
- Added `io.copy_dir()` function.
- Added new LR schedulers: `LinearWithWarmup`, `InvSqrtWithWarmup`, `ConstantWithWarmup`, `SequentialScheduler`.
- Added option to pre-download checkpoint files from remote storage before trying to load a checkpoint.
- Added a callback for sending Slack notifications.
- Made the MPS device work on Apple Silicon.
- Added `SkipStepAdamW` optimizer.
- The trainer can load model-only checkpoints now.
- Added the option to throttle checkpoint uploads to one rank from each node at a time.
- Added support for logging rich `Table` objects as text in source mixture datasets.
- Added `unshard_strategy` parameter to `unshard_checkpoint()` function in `olmo_core.distributed.checkpoint`.
- Added function `load_keys()` to `olmo_core.distributed.checkpoint`.
- Added support for low precision optim state in `SkipStepAdamW`.
- Changed storage of shared shard state in sharded checkpoints from smallest shard to lowest rank (normally 0).
- Changed how the trainer handles loading a checkpoint when `load_path` is provided. Now `load_path` is only used if no checkpoint is found in the `save_folder`.
- Added missing `weights_only=False` argument to fix loading train checkpoints with newer versions of PyTorch.
- Fixed bug where GCS upload does not retry on transient failures.
- Fixed bug where source mixture datasets were truncating source files instead of randomly sampling.
- Fixed bug in source mixture datasets where sampling from small npy files raised an mmap exception due to 0 instances in the sampled index.
v1.7.0 - 2024-11-27
- Added `key_mapping` argument to `olmo_core.distributed.checkpoint.load_model_and_optim_state()` for loading checkpoints with different key names.
- Added `load_key_mapping` field to the trainer, same idea as the new `key_mapping` argument above.
- Added an implementation of nGPT called `NormalizedTransformer`.
- Added an example showing how to convert a HuggingFace Llama 3.2 checkpoint into the right format for OLMo-core.
- Added an API for scaling RoPE embeddings.
- Added a `ModelLadder` API.
- The `w_out` and `norm` top-level children of the `Transformer` model are now wrapped together in an `lm_head` module. Training scripts will have backwards compatibility with older checkpoints due to the `load_key_mapping` explained above.
- (Optimization) Mark model input sizes as dynamic for `torch.compile()` to avoid recompiles during evals or variable-sequence / batch size training. This doesn't seem to hurt throughput.
- Made HTTPS and GCS IO functions more robust.
- Fixed a bug where we were always getting dolma2-tokenized validation data when generating a config with `DataMix.v3_small_ppl_validation`.
v1.6.3 - 2024-11-15
- Added `olmo_core.distributed.checkpoint.get_checkpoint_metadata()` function.
- (BETA) Added flag to compile the optimizer step. So far only tested with AdamW. May not work with other optimizers.
- Old ephemeral checkpoints won't be removed until after the latest ephemeral checkpoint is saved successfully.
- Made GCS uploads more robust.
- Fixed single-node training on Google Augusta cluster.
- `numpy.random.dirichlet()` does not always sum to exactly 1.0, so allow for a small tolerance when validating domain weights.
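The tolerance mentioned above guards against floating-point drift in sampled weights; a minimal illustration of such a check:

```python
import numpy as np


def validate_domain_weights(weights: np.ndarray, tol: float = 1e-6) -> None:
    """Check that domain weights form a distribution, within a small tolerance,
    since sampled weights (e.g. from numpy.random.dirichlet) may not sum to
    exactly 1.0 in floating point."""
    total = float(weights.sum())
    if abs(total - 1.0) > tol:
        raise ValueError(f"domain weights sum to {total}, expected ~1.0")


weights = np.random.default_rng(0).dirichlet(np.ones(16))
validate_domain_weights(weights)  # passes despite tiny floating-point error
```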
v1.6.2 - 2024-11-08
- Added option to disable `GarbageCollectorCallback`, not that you'd want to do this usually, but I needed to run an experiment to show how important that callback is.
- Fixed a bug where some default callbacks could be added twice if given a different name by the user.
- Fixed a bug where some `Trainer` bookkeeping tasks may not complete before `.fit()` returns.
v1.6.1 - 2024-11-06
- Added `retries` field to `BeakerLaunchConfig`.
- Allow running on Augusta cluster with existing train scripts.
- Added `olmo_core.utils.logging_configured()` function to check if logging has been configured.
- Fixed a potential distributed deadlock bug when training without a separate CPU-only bookkeeping backend.
- Removed some unnecessary host-device syncs in `olmo_core.distributed.utils`.
- Added `Trainer(Config).async_bookkeeping` field to toggle async bookkeeping.
v1.6.0 - 2024-11-01
- Added option to compile the trainer's loss function (`Trainer.compile_loss`).
- Added `SourceMixtureDataset` for composing a training mixture based on ratios of source datasets.
- Added `NumpyFSLDatasetMixture` for constructing a `NumpyDatasetBase` from a `SourceMixtureDataset`. Note this is only supported for FSL datasets.
- Added tests for `SourceMixture*` and `NumpyFSLDatasetMixture`.
- Added `DownstreamEvaluatorCallbackConfig` class for running in-loop downstream eval via OLMo-in-loop-evals.
- Moved some types into `olmo_core.data.types` to avoid some circular dependencies.
- Made GCS client more robust by automatically retrying timeout errors for most operations.
v1.5.0 - 2024-10-23
- Added Google Cloud support for `list_directory()` and `clear_directory()`.
- Added `CometCallback` for logging training runs to Comet.ml.
- Added `DataMixBase` class, to allow extending to new data mix groups.
- Added support for MoE-based models.
- Added method `DataLoaderBase.get_mock_batch()`.
- Trainer now starts with a dry-run of a fake batch created by `DataLoaderBase.get_mock_batch()`.
- Added `Callback.pre_backward()`, `.pre_eval_batch()`, and `.post_eval_batch()` methods.
- Added `Trainer.model_forward()`, `.get_losses()`, and `.eval_batch()` methods.
- Added a new `TransformerActivationCheckpointingMode`, "selected_ops" (requires torch 2.5 or newer).
- `BeakerLaunchConfig.setup_steps` should now include steps to clone your repo (which it will by default). This change allows support for private repos.
- `prepare_cli_environment()` now calls `add_cached_path_clients()`.
- Removed an unnecessary host-device sync.
v1.4.0 - 2024-10-02
- Updated default layer norm epsilon for OLMo models from `1e-5` to `1e-6` to match the latest model.
- Renamed `FSLDataLoader` to `NumpyFSLDataLoader`.
- Renamed `VSLDataLoader` to `NumpyVSLDataLoader`.
- The trainer now takes a `data_loader: DataLoaderBase` instead of a `dataset: NumpyDatasetBase`.
v1.3.2 - 2024-09-27
- Added `Config.validate()`, `Config.replace()`, and `Config.apply()` methods.
- Trainer now records sequence length as a metric.
- Ensure additional cached-path clients are added in the process pool workers from some dataset preparation methods.
- Fixed `label_mask` tensor created by `NumpyPaddedFSLDataset`.
- Removed redundant warning messages about CUDA alloc retries.
- Fixed non-deterministic deadlock bug with async checkpointing.
v1.3.1 - 2024-09-26
- Fixed the name given to evaluator metrics logged.
v1.3.0 - 2024-09-26
- Added `torchao` to the Docker/Beaker images.
- Added support for `torchao` `float8` training via the `Float8HandlerCallback`.
- Added `Callback.post_attach()` method.
v1.2.0 - 2024-09-25
- Added support for wildcards in `OptimGroupOverride.params`.
- Added `NumpyPaddedFSLDataset` variant.
- Added `Evaluator` class and `EvaluatorCallback` for in-loop evals.
- Added `v3-small-ppl-validation` data mix.
- Fixed bug with data loader when using threading.
v1.1.0 - 2024-09-18
- Added support for changing train sequence length when loading a checkpoint.
- Added support for sequence length warm-up during training via the callback `SequenceLengthSchedulerCallback`.
- Added support for variable sequence length (VSL) datasets and VSL curriculums as introduced in "Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum".
- Added `Lion` and `SkipStepLion` optimizers.
- Added `init_seed` argument to `Transformer` and `TransformerConfig`.
- Renamed `MemMapDataset` to `NumpyFSLDataset`.
- Batch size is now specified in tokens, not instances.
v1.0.6 - 2024-09-05
- Added "selected_modules" transformer activation checkpointing mode.
- Added `OLMo-1B.py` official training script.
- Added `OLMo-13B.py` official training script.
- Added `Trainer.get_metric()`, `.get_loss()`, and `.get_zloss()` methods.
- Added `io.copy_file()` function.
- Added `ProfilerCallback` for profiling/tracing the training loop with PyTorch `profiler` module.
- Added an "L2 norm" metric reduce type.
- Made reducing metrics more numerically stable with large world sizes.
v1.0.5 - 2024-09-03
- Fixed bug with checkpointer callback searching for existing ephemeral checkpoints when the checkpoint folder doesn't exist.
- Checkpointer callback won't collect existing ephemeral checkpoints that were saved after the checkpoint that was loaded from.
v1.0.4 - 2024-09-01
- Added `Trainer.save_checkpoint()` and `Trainer.save_checkpoint_async()` methods.
- Added `Callback.post_checkpoint_saved()` and `Callback.post_checkpoint_loaded()` methods.
- Added `ConfigSaverCallback`.
- Added `MemMapDataset.fingerprint` property.
- The `work_dir` argument to `TrainerConfig` now defaults to `save_folder` if `save_folder` is a local path, otherwise a temporary directory with the same name as the basename of the `save_folder`.
- The `seed` argument to `prepare_training_environment()` is now optional.
- Fixed setting the right env vars for single node training on Jupiter.
v1.0.3 - 2024-08-30
- Added `Trainer.hard_stop` field.
- The trainer now catches `SIGTERM` and marks the run as canceled.
- Added `CheckpointerCallback.remove` strategy for configuring which old checkpoints found in the save folder are removed.
- Added `ReorderedNormTransformerBlock` implementation.
- Added `WandBCallback.notes` field.
- Fixed bug with how command arguments were expanded by `BeakerLaunchConfig`.
v1.0.2 - 2024-08-29
- Added support for unsharding model state into `safetensors` format with `olmo_core.distributed.checkpoint.unshard_checkpoint(..., use_safetensors=True)`.
- Added `data.TokenizerConfig` config class and `data.TokenizerName` enumeration.
- Added data mixes with `data.DataMix` API.
- Added `block_idx` attribute to the `TransformerBlock` class.
- Added `init_method` option to `Transformer` for controlling how the weights are initialized.
- Fixed `list_directory` for remote folders.
- Callbacks now have to have a name assigned.
v1.0.1 - 2024-08-26
- Fixed a bug with resetting the initial LR in optimizers after loading a checkpoint.
v1.0.0 - 2024-08-26
- Ported, refactored, and optimized the modeling and training from the OLMo repo while fixing several bugs. Introduces a new highly efficient yet customizable trainer and a standard API for launching jobs directly to Beaker from a Python script.
v0.1.0 - 2024-06-11
- Initial release.