Currently, the stats cache is stored in the same location as the input data.
One file holds the cached results (ending in ".stats.csv"), and when the --stats-jsonl option is chosen, a ".jsonl" file is also created - both with the same file stem as the input file. The ".jsonl" file contains the stats options and the cache metadata used to generate the ".stats.csv" file.
The current caching mechanism only checks the filename and filetime metadata, which is serviceable and works IRL: changing the input file necessarily makes it newer than the cache files, automatically invalidating the cache. However, it introduces some "noise" to the file system.
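The filetime check described above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the actual qsv implementation; the helper name `cache_is_fresh` is hypothetical:

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// Sketch of the filetime-based check: the cache is considered valid if the
/// cache file exists and is at least as new as the input file.
/// (Hypothetical helper, not the actual qsv code.)
fn cache_is_fresh(input: &Path, cache: &Path) -> io::Result<bool> {
    if !cache.exists() {
        return Ok(false); // no cache file at all -> recompute stats
    }
    let input_mtime: SystemTime = fs::metadata(input)?.modified()?;
    let cache_mtime: SystemTime = fs::metadata(cache)?.modified()?;
    // Editing the input bumps its mtime past the cache's, invalidating it.
    Ok(cache_mtime >= input_mtime)
}
```

Note that this check trusts the filesystem's timestamps, which is exactly the limitation discussed below.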
Before, I saw this as a "feature", not a bug, as these files are tiny and can be used independently.
However, it had some limitations:
- two more tiny files
- a stale cache may still be used if the input file was changed somehow but remains older than the stats cache
- if there was an identical file with a different name and/or in another location, separate cache files were maintained for them
- there was no easy way to share cache files for reference data with other users on the network
By using the cached crate, we can:
- use a configurable Disk Cache or, alternatively, a Redis Cache
- as the main key is the sha256 hash of the file's contents, an identical file - even with a different name or in another location - will generate a cache hit
- with a Redis Cache, we can share the Cache with other users.
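The content-hash keying above is the core idea: identical bytes produce an identical key, so renamed or relocated copies hit the same cache entry. The sketch below illustrates this with an in-memory map; std's `DefaultHasher` stands in for sha256, and the toy `StatsCache` stands in for the `cached` crate's Disk or Redis stores - names and logic here are illustrative, not qsv's actual code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for the sha256 file hash: the real implementation hashes the
/// file's bytes with sha256; DefaultHasher keeps this sketch dependency-free.
fn file_key(contents: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    contents.hash(&mut h);
    h.finish()
}

/// Toy cache keyed by content hash. With the `cached` crate this role is
/// played by a DiskCache or RedisCache, but the lookup logic is the same:
/// identical bytes -> identical key -> cache hit, regardless of filename.
struct StatsCache {
    store: HashMap<u64, String>,
}

impl StatsCache {
    fn new() -> Self {
        Self { store: HashMap::new() }
    }

    /// Returns (was_hit, stats), computing and caching stats on a miss.
    fn stats_for(&mut self, contents: &[u8]) -> (bool, String) {
        let key = file_key(contents);
        if let Some(stats) = self.store.get(&key) {
            return (true, stats.clone()); // cache hit
        }
        // Placeholder "stats": just count the newlines in the file.
        let stats = format!("rows:{}", contents.iter().filter(|&&b| b == b'\n').count());
        self.store.insert(key, stats.clone());
        (false, stats)
    }
}
```

Because the key is derived from the contents rather than the path, pointing the same cache at a byte-identical copy under a different name is a hit, not a second cache entry.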
Computing the sha256 hash doesn't take that long with our CPU-accelerated version - 250ms for the 520 MB NYC 311 file (contrast that with shasum -a 256 on the same machine, which is at least 4x slower).