
Add blake3 hashing#538

Merged
mihaimaruseac merged 1 commit into sigstore:main from eqtylab:blake3
Oct 9, 2025

Conversation

@makew0rld
Contributor

Closes #530

Summary

BLAKE3 is an excellent cryptographic hash algorithm, both in terms of features and performance. Adding it as an option for model hashing will greatly reduce hashing time for large files.

Our existing internal tooling already tracks files and blobs using BLAKE3, so supporting it in model manifests would make them interoperable with our tooling without requiring expensive rehashing.

See the commit message for some explanation of the design decisions.
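For context, here is a minimal sketch of the serial chunked file hashing that BLAKE3 would accelerate. It uses the stdlib `hashlib.sha256` as a stand-in, since `blake3` is a third-party package; the `hash_file` name and chunk size are illustrative, not this library's actual API (the `blake3` package exposes the same `update()`/`hexdigest()` interface, so it would slot in the same way).

```python
import hashlib


def hash_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file serially in fixed-size chunks.

    This is the classic single-threaded pattern; swapping the hasher
    for blake3.blake3() is what makes large files much faster, since
    BLAKE3 can parallelize internally without changing the digest.
    """
    h = hashlib.sha256()  # stand-in for blake3.blake3()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```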

Checklist
  • All commits are signed-off, using DCO
  • All new code has docstrings and type annotations
  • All new code is covered by tests. Aim for at least 90% coverage. CI is configured to highlight lines not covered by tests.
  • Public facing changes are paired with documentation changes
  • Release note has been added to CHANGELOG.md if needed

@makew0rld makew0rld requested review from a team as code owners October 9, 2025 20:44
@makew0rld makew0rld force-pushed the blake3 branch 2 times, most recently from a78d172 to 70ace84 Compare October 9, 2025 20:49
Member

@mihaimaruseac mihaimaruseac left a comment


Overall this looks good, but I'm mostly concerned about taking on a dependency which was not updated in 7 years.

Comment thread src/model_signing/_hashing/io.py Outdated
Comment on lines +333 to +336
For BLAKE3 this is equivalent to not sharding. Sharding is bypassed
because BLAKE3 already operates in parallel. This means the chunk_size
and shard_size args are ignored.

Member


A little bit concerned about this, given that sharding is also introduced to allow verifying only a portion of the file, rather than the integrity of the entire file. But that's an optimization, so might not matter much

Contributor Author


That's interesting, I hadn't thought of that. BLAKE3 actually supports this as well (look up "blake3 bao"), but I think adding support for that is out of scope for this PR.
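To illustrate the partial-verification point being discussed: a hedged sketch of how per-shard digests let a verifier check one byte range without rehashing the whole file. The shard size, `shard_digests`, and `verify_shard` names are hypothetical, and SHA-256 stands in for the shard hash; BLAKE3's Bao scheme achieves the same property through its internal tree structure rather than external sharding.

```python
import hashlib

SHARD_SIZE = 1 << 20  # 1 MiB shards (illustrative)


def shard_digests(path: str) -> list[str]:
    """Record a digest per fixed-size shard of the file."""
    digests = []
    with open(path, "rb") as f:
        while shard := f.read(SHARD_SIZE):
            digests.append(hashlib.sha256(shard).hexdigest())
    return digests


def verify_shard(path: str, index: int, expected: str) -> bool:
    """Re-hash only shard `index` instead of the entire file."""
    with open(path, "rb") as f:
        f.seek(index * SHARD_SIZE)
        shard = f.read(SHARD_SIZE)
    return hashlib.sha256(shard).hexdigest() == expected
```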

Member


Yeah, let's merge as it is and if we actually need this support we can add it.

Comment thread benchmarks/serialize.py Outdated
Because BLAKE3 natively supports parallelism without changing the final
hash, sharding is bypassed. This is much more useful than getting
different file hashes depending on which hashing method you used.

The BLAKE3 hashing is done by memory-mapping the file, and defaults to
the max number of workers, which is the number of logical CPU cores.
This is a good default and the most performant setup; it is also what
the standard BLAKE3 CLI tool (b3sum) does. The parallelism is
implemented in Rust, so it is true parallelism rather than the thread
concurrency used for the other hashing algorithms, and the speed-up
should be quite large. It will likely be slower on HDDs than having
no parallelism, though. I think this is the right default, but the
HDD concern is documented.

Resolves: sigstore#530

Signed-off-by: makeworld <makeworld@protonmail.com>
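As an illustration of the memory-mapping approach the commit message describes, here is a sketch using the stdlib `mmap` module, with `hashlib.sha256` again standing in for BLAKE3; the function name is illustrative, not the library's API. blake3-py's mmap-based hashing works along these lines, but additionally splits the mapping across Rust worker threads while still producing the standard single BLAKE3 digest.

```python
import hashlib
import mmap


def hash_file_mmap(path: str) -> str:
    """Hash a file through a memory map instead of a read() loop.

    The OS pages the file in on demand, so no chunking logic is
    needed and the whole mapping is fed to the hasher in one call.
    (Sequential page faults are why this can be slow on HDDs.)
    """
    h = hashlib.sha256()  # stand-in for blake3.blake3()
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            h.update(mm)
    return h.hexdigest()
```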
@makew0rld
Contributor Author

@mihaimaruseac thanks for the quick review!

> taking on a dependency which was not updated in 7 years

I'm not sure what you mean, maybe you're looking at a different dependency? The blake3 package (PyPI, repo) had its latest release last week.

@mihaimaruseac
Member

Oh, I accidentally was looking at blake-256, my bad.


@mihaimaruseac mihaimaruseac enabled auto-merge (squash) October 9, 2025 21:13
@mihaimaruseac mihaimaruseac merged commit 1f9a11d into sigstore:main Oct 9, 2025
51 checks passed
@makew0rld
Contributor Author

Thanks for the quick merge! If you're able to cut a new release for this soon that would be awesome.

@mihaimaruseac
Member

Working on a release as we speak!

