Add blake3 hashing#538
Conversation
a78d172 to
70ace84
Compare
mihaimaruseac
left a comment
There was a problem hiding this comment.
Overall this looks good, but I'm mostly concerned on taking on a dependency which was not updated in 7 years.
| For BLAKE3 this is equivalent to not sharding. Sharding is bypassed | ||
| because BLAKE3 already operates in parallel. This means the chunk_size | ||
| and shard_size args are ignored. | ||
|
|
There was a problem hiding this comment.
A little bit concerned about this, given that sharding is also introduced to allow verifying only a portion of the file, rather than the integrity of the entire file. But that's an optimization, so might not matter much
There was a problem hiding this comment.
That's interesting, I hadn't thought of that. BLAKE3 actually supports this as well (look up "blake3 bao"), but I think adding support for that is out of scope for this PR.
There was a problem hiding this comment.
Yeah, let's merge as it is and if we actually need this support we can add it.
Because BLAKE3 natively supports parallelism without changing the final hash, sharding is bypassed. This is much more useful than getting different file hashes depending on which hashing method you used. The BLAKE3 hashing is done by memory mapping the file, and defaults to the max number of workers which is the number of logical CPU cores. This is a good default and the most performant setup. It is also what the standard BLAKE3 CLI tool (b3sum) does. It is implemented in Rust and so will be true parallelism rather than the thread concurrency implemented for other hashing algorithms, so the speed up should be quite large. But it will likely be slower on HDDs than having no parallelism. I think this is the right default, but the HDD concern is documented. Resolves: sigstore#530 Signed-off-by: makeworld <makeworld@protonmail.com>
|
@mihaimaruseac thanks for the quick review!
I'm not sure what you mean, maybe you're looking at a different dependency? The |
|
Oh, I accidentally was looking at |
| For BLAKE3 this is equivalent to not sharding. Sharding is bypassed | ||
| because BLAKE3 already operates in parallel. This means the chunk_size | ||
| and shard_size args are ignored. | ||
|
|
There was a problem hiding this comment.
Yeah, let's merge as it is and if we actually need this support we can add it.
|
Thanks for the quick merge! If you're able to cut a new release for this soon that would be awesome. |
|
Working on a release as we speak! |
Closes #530
Summary
BLAKE3 is an excellent cryptographic hash algorithm, both in terms of features and performance. Adding it as an option for model hashing will greatly speed up the hash time for large files.
Our existing internal tooling already tracks files and blobs using BLAKE3, and supporting it for model manifests would make them interoperable with our tooling without expensive rehashing being required.
See the commit message for some explanation of the design decisions.
Checklist