Currently, the stats cache is stored in the same location as the input data.
One file holds the cached results (ending in ".stats.csv"), and when the --stats-jsonl option is chosen, a ".jsonl" file is also created - both with the same file stem as the input file. The ".jsonl" file contains the stats options and the cache metadata used to generate the ".stats.csv" file.
The current caching mechanism only checks the filename and filetime metadata, which is serviceable and works IRL: changing the input file necessarily makes it newer than the cache files, automatically invalidating the cache. However, it introduces some "noise" to the file system.
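The filetime check described above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the actual qsv implementation; the helper name `cache_is_fresh` is hypothetical:

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// Sketch of the filetime-based check: the cache is considered valid if the
/// cache file exists and is at least as new as the input file.
/// (Hypothetical helper, not the actual qsv code.)
fn cache_is_fresh(input: &Path, cache: &Path) -> io::Result<bool> {
    if !cache.exists() {
        return Ok(false); // no cache file at all -> recompute stats
    }
    let input_mtime: SystemTime = fs::metadata(input)?.modified()?;
    let cache_mtime: SystemTime = fs::metadata(cache)?.modified()?;
    // Editing the input bumps its mtime past the cache's, invalidating it.
    Ok(cache_mtime >= input_mtime)
}
```

Note that this check trusts the filesystem's timestamps, which is exactly the limitation discussed below.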
Before, I saw this as a "feature", not a bug, as these files are tiny and can be used independently.
However, it had some limitations:
- two more tiny files
- a stale cache may still be used if the input file was changed somehow but remains older than the stats cache
- if there was an identical file with a different name and/or in another location, separate cache files were maintained for them
- there was no easy way to share cache files for reference data with other users on the network
By using the cached crate, we can:
- use a configurable Disk Cache or, alternatively, a Redis Cache
- as the main key is the sha256 hash of the file's contents, an identical file - even with a different name or in another location - will generate a cache hit
- with a Redis Cache, we can share the Cache with other users.
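The content-hash keying above is the core idea: identical bytes produce an identical key, so renamed or relocated copies hit the same cache entry. The sketch below illustrates this with an in-memory map; std's `DefaultHasher` stands in for sha256, and the toy `StatsCache` stands in for the `cached` crate's Disk or Redis stores - names and logic here are illustrative, not qsv's actual code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for the sha256 file hash: the real implementation hashes the
/// file's bytes with sha256; DefaultHasher keeps this sketch dependency-free.
fn file_key(contents: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    contents.hash(&mut h);
    h.finish()
}

/// Toy cache keyed by content hash. With the `cached` crate this role is
/// played by a DiskCache or RedisCache, but the lookup logic is the same:
/// identical bytes -> identical key -> cache hit, regardless of filename.
struct StatsCache {
    store: HashMap<u64, String>,
}

impl StatsCache {
    fn new() -> Self {
        Self { store: HashMap::new() }
    }

    /// Returns (was_hit, stats), computing and caching stats on a miss.
    fn stats_for(&mut self, contents: &[u8]) -> (bool, String) {
        let key = file_key(contents);
        if let Some(stats) = self.store.get(&key) {
            return (true, stats.clone()); // cache hit
        }
        // Placeholder "stats": just count the newlines in the file.
        let stats = format!("rows:{}", contents.iter().filter(|&&b| b == b'\n').count());
        self.store.insert(key, stats.clone());
        (false, stats)
    }
}
```

Because the key is derived from the contents rather than the path, pointing the same cache at a byte-identical copy under a different name is a hit, not a second cache entry.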
Computing the sha256 hash doesn't take that long with our CPU-accelerated version - 250ms for the 520 MB NYC 311 file (contrast that with shasum -a 256 on the same machine, which is at least 4x slower).