Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

Added

  • Local type handler registries.
  • Expose merge_trees publicly. This function can be used to merge multiple trees into a single tree using a recursive strategy.
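
To illustrate what a recursive tree merge can look like, here is a minimal sketch over nested dicts. It is not the merge_trees implementation: the helper name `merge_nested` and the rule that later trees win on conflicting leaves are assumptions made for this example.

```python
from typing import Any, Dict, Mapping


def merge_nested(*trees: Mapping[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: recursively merge nested dicts into one new dict.

    Assumption for this sketch: when two trees hold a leaf at the same key,
    the value from the later tree wins.
    """
    merged: Dict[str, Any] = {}
    for tree in trees:
        for key, value in tree.items():
            if isinstance(value, Mapping) and isinstance(merged.get(key), dict):
                # Both sides are subtrees: merge them recursively.
                merged[key] = merge_nested(merged[key], value)
            elif isinstance(value, Mapping):
                # New subtree: copy it so the inputs are never mutated.
                merged[key] = merge_nested(value)
            else:
                # Leaf: the later tree overrides any earlier value.
                merged[key] = value
    return merged


params = {"layer": {"w": 1, "b": 2}}
extra = {"layer": {"b": 20}, "step": 3}
result = merge_nested(params, extra)
```

In this sketch the inputs are left untouched, and `result` contains the union of both trees with `extra` taking precedence at `layer/b`.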

Changed

  • The PyPI orbax package is deprecated in favor of domain-specific namespace packages, namely orbax-checkpoint and orbax-export. Imports are unchanged, and still of the form import orbax.checkpoint or import orbax.export.
  • Finer scoped jax.monitoring calls on the save path.
  • CheckpointManager.metadata() now accepts a step parameter. If provided, it will return StepMetadata, and will otherwise return RootMetadata.
  • CheckpointManager.restore() will now attempt to initialize checkpoint handlers using StepMetadata.item_handlers and the global HandlerTypeRegistry if no args are provided.
  • CompositeCheckpointHandler.metadata() now returns StepMetadata.
  • Doubled the default timeout in AsyncOptions from 600 seconds (10 minutes) to 1200 seconds (20 minutes). timeout_secs is now a mandatory parameter in AsyncCheckpointer, with a default value of 1200 seconds (20 minutes).

Fixed

  • Fixed get_device_memory issue on TPU 7x devices where the device kind string was consistently reported without a space, causing a ValueError.
  • Fixed a hang in AsyncCheckpointer when a timeout occurs during save. The remaining time is now calculated and applied to commit operations and synchronization barriers, so that all async operations time out instead of hanging when preceding operations consume most of the timeout budget.
  • Fixed a hang in ocp.save_pytree_async when a timeout occurs during save. The fix is similar to the AsyncCheckpointer fix above.
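
The idea of applying the remaining time to each later operation can be sketched as a single deadline shared across steps. This is only an illustration of the technique, not Orbax code: `TimeoutBudget`, `run_steps`, and the injectable `clock` argument are hypothetical names introduced here.

```python
import time
from typing import Callable, Iterable


class TimeoutBudget:
    """Hypothetical helper: one deadline shared by several sequential steps."""

    def __init__(self, timeout_secs: float,
                 clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._deadline = clock() + timeout_secs

    def remaining(self) -> float:
        """Seconds left before the shared deadline, never negative."""
        return max(0.0, self._deadline - self._clock())


def run_steps(steps: Iterable[Callable[[float], None]], timeout_secs: float,
              clock: Callable[[], float] = time.monotonic) -> None:
    """Run each step with only the time left over from the previous steps."""
    budget = TimeoutBudget(timeout_secs, clock)
    for step in steps:
        remaining = budget.remaining()
        if remaining <= 0:
            # Fail fast instead of waiting on a step with no time left.
            raise TimeoutError("timeout budget exhausted before all steps ran")
        step(remaining)
```

Each step receives the time left rather than the full timeout, so a slow write shrinks the window available to the commit and barrier steps instead of letting them wait indefinitely.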

[0.1.7] - 2022-03-29

Added

  • Support for OCDBT driver in Tensorstore.

[0.1.6] - 2022-03-22

Fixed

  • Small bug fixes.

[0.1.5] - 2022-03-17

Added

  • Use a more precise timestamp when generating temporary directory names to permit more than one concurrent checkpointing attempt per second.
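
As a rough illustration of why a finer timestamp helps, the sketch below builds a temporary directory name from a microsecond timestamp. The function name `tmp_dir_name` and the `.tmp-` suffix format are assumptions for this example and may not match the names Orbax actually generates.

```python
import time
from typing import Optional


def tmp_dir_name(final_name: str, now_ns: Optional[int] = None) -> str:
    """Hypothetical helper: append a microsecond timestamp to a directory name.

    With whole-second precision, two attempts started within the same second
    would produce the same name; microseconds make a collision far less likely.
    """
    if now_ns is None:
        now_ns = time.time_ns()  # nanoseconds since the epoch
    micros = now_ns // 1_000
    return f"{final_name}.tmp-{micros}"


# Two attempts 0.5 ms apart get distinct temporary names.
first = tmp_dir_name("step_5", now_ns=1_000_000_000)
second = tmp_dir_name("step_5", now_ns=1_000_500_000)
```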

[0.1.4] - 2022-03-15

Added

  • Support for generic transformation function in PyTreeCheckpointHandler.
  • Support n-digit checkpoint step format.
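
An n-digit step format typically means zero-padding the step number to a fixed width, so that directory names sort lexicographically in step order. The sketch below shows the idea; `step_dir_name` and its `digits` and `prefix` parameters are hypothetical and are not the library's option names.

```python
from typing import Optional


def step_dir_name(step: int, digits: Optional[int] = None,
                  prefix: Optional[str] = None) -> str:
    """Hypothetical helper: format a step as a zero-padded directory name."""
    if step < 0:
        raise ValueError("step must be non-negative")
    # Pad to a fixed width only when a digit count is requested.
    step_str = f"{step:0{digits}d}" if digits else str(step)
    return f"{prefix}_{step_str}" if prefix else step_str


padded = [step_dir_name(s, digits=4) for s in (10, 9, 100)]
unpadded = [step_dir_name(s) for s in (10, 9, 100)]
```

Sorting `padded` as strings yields the steps in numeric order, while sorting `unpadded` places "10" and "100" before "9".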

Fixed

  • Eliminate Flax dependency to fix circular dependency problem.

[0.1.3] - 2022-03-03

Added

  • sharding option on ArrayRestoreArgs

[0.1.2] - 2022-02-17

Added

  • Add "standard user recipe" to documentation.
  • Add unit tests using mock to simulate preemption.
  • Logging to increase transparency around why checkpoints are kept vs. deleted.
  • Expand on uses of restore_args in colab.
  • Expose utils_test.
  • Add msgpack_utils to move toward eliminating Flax dependency.
  • CheckpointManager starts a background thread to finalize checkpoints so that checkpoints are finalized as soon as possible in async case.

Changed

  • Remove CheckpointManager update API.
  • Remove support for deprecated GDA.
  • Add tmp suffix on step directory creation in CheckpointManager.save.

Fixed

  • Preemption when using keep_time_interval caused the most recent steps before preemption to be kept, despite not falling on the keep time interval.

[0.1.1] - 2022-01-30

Added

  • A util function that constructs restore_args from a target PyTree.
  • CheckpointManager delete API, which allows deleting an existing step.
  • Made dev dependencies optional to minimize import overhead.

Changed

  • Refactored higher-level utils in checkpoint_utils, which provides user-convenience functions.
  • Guard option to create top-level directory behind create option.
  • Remove support for Python 3.7.

[0.1.0] - 2022-01-04

Added

  • Check for metric file in addition to item directory in CheckpointManager.
  • Additional logs to indicate save/restore completion.
  • Support for None leaves in PyTree save/restore.
  • ArrayCheckpointHandler for individual arrays/scalars.
  • read: bool option on all_steps to force read from storage location instead of using cached steps.
  • Simplified "Getting Started" section in the docs.
  • CheckpointManager creates the top level directory if it does not yet exist.
  • Write msgpack bytes asynchronously.

Changed

  • Removed some unused test_utils methods for filtering empty nodes.
  • Update docs on PyTreeCheckpointHandler.
  • Removed unneeded AbstractCheckpointManager.

Fixed

  • Usage of bytes_limiter to prevent too many bytes from being read during a single restore call.
  • Temp checkpoint cleanup when using a step prefix (e.g. 'checkpoint_0').

[0.0.23] - 2022-12-08

Added

  • Option to customize metadata file name for Tensorstore.

Fixed

  • Restore failure on GCS due to misidentification of checkpoint as "not finalized".

[0.0.22] - 2022-12-05

Added

  • Added CHANGELOG.md for version updates (additions and changes), ingested by auto-publish functionality.

[0.0.21] - 2022-12-05

Changed

  • Fix mistaken usages of placeholder "AGGREGATED" where "NOT-AGGREGATED" would be more appropriate. Ensure backwards compatibility is maintained.