stats: use cached crate in addition to existing stats cache, with the blake3 hash of the file as the main key #2972

@jqnatividad

Currently, the stats cache is stored in the same location as the input data.

The cache consists of up to two files that share the input file's stem: the cached results (ending in ".stats.csv") and, when the --stats-jsonl option is chosen, a ".jsonl" file. The .jsonl file contains the stats options and some cache metadata used to generate the ".stats.csv" file.

The current caching mechanism only checks the filename and file-time metadata. That is serviceable and works IRL: changing the input file necessarily makes it newer than the cache files, automatically invalidating the cache. However, it introduces some "noise" to the file system.
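For illustration, the timestamp-based check described above can be sketched roughly as follows. This is a minimal Python sketch, not the actual implementation: the function names `stats_cache_path` and `cache_is_fresh` are hypothetical, and the exact comparison used by the real code is an assumption.

```python
import os

def stats_cache_path(input_path: str) -> str:
    """Derive the cache file name: same location and stem, ".stats.csv" suffix."""
    stem, _ext = os.path.splitext(input_path)
    return stem + ".stats.csv"

def cache_is_fresh(input_path: str, cache_path: str) -> bool:
    """Timestamp-only validity check (assumed semantics).

    The cache counts as valid when it exists and is not older than the input.
    Editing the input bumps its modification time past the cache's,
    which invalidates the cache.
    """
    if not os.path.exists(cache_path):
        return False
    return os.path.getmtime(cache_path) >= os.path.getmtime(input_path)
```

Note that nothing here looks at the file's content, only at its name and modification time.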

Before, I saw it as a "feature", not a bug, as these files are tiny and can be used independently.

However, it had some limitations:

  • two more tiny files
  • the cache may be invalid and still be used if the input file was changed somehow and it's still older than the stats cache
  • if there was an identical file with a different name and/or in another location, separate cache files were maintained for them
  • there was no easy way to share cache files for reference data with other users in the network
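The stale-cache limitation in the list above can be reproduced with a small sketch. It is Python for illustration only, and the helper names (`fresh_by_mtime`, `content_digest`, `is_stale_hit`) are hypothetical. It assumes a digest of the input was recorded when the cache was written, which the current mechanism does not do.

```python
import hashlib
import os

def fresh_by_mtime(input_path, cache_path):
    # Timestamp-only check: the cache passes if it is not older than the input.
    return os.path.getmtime(cache_path) >= os.path.getmtime(input_path)

def content_digest(path):
    # Hash the whole file in one read; fine for a small demo file.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def is_stale_hit(input_path, cache_path, digest_at_cache_time):
    """True when the timestamp check passes although the content has changed."""
    return (
        fresh_by_mtime(input_path, cache_path)
        and content_digest(input_path) != digest_at_cache_time
    )
```

A file restored from a backup or copied with preserved timestamps can end up in this state: its content differs, yet its modification time is still older than the cache.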

By using the cached crate, we can:

  • use a configurable Disk Cache or alternately, a Redis Cache
  • as the sha256 hash of the file is the main key, an identical file will generate a cache hit even if it has a different name
  • with a Redis Cache, we can share the Cache with other users.
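The content-hash keying idea can be sketched as below. This is an illustrative Python sketch, not the cached crate's API: `HashKeyedStatsCache` is a hypothetical in-memory stand-in for a disk or Redis cache, `compute` is a placeholder for the real stats computation, and including the stats options in the key is an assumption.

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 so large inputs are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

class HashKeyedStatsCache:
    """In-memory stand-in for a disk or Redis cache keyed by content hash."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, path, options, compute):
        # Assumption: the options form part of the key, so the same file
        # analyzed with different options gets separate entries.
        key = (file_sha256(path), options)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute(path)
        self._store[key] = result
        return result
```

Because the file name never enters the key, a renamed or relocated copy of the same data resolves to the same entry, and a shared backing store would let other users reuse it as well.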

Computing the sha256 hash doesn't take that long with our CPU-accelerated version: 250 ms for the 520 MB NYC 311 file (in contrast, shasum -a 256 on the same machine is at least 4x slower).

Labels: enhancement, performance
