pycacheR

cachepy tracks your data and code so you don't have to

Python port of cacheR.

What does cachepy do?

It automatically checks for changes in code and input data and re-runs the function only if necessary. Results are cached to disk as pickle files.

It's like snakemake/nextflow, but on the fly.

What is it useful for?

Keeping analysis results up to date
Saving time on expensive computations
Not using obsolete results
Reusing heavy computations safely and transparently

Installation

# From GitHub
pip install git+https://github.com/BIMSBbioinfo/pycacheR.git

# From Guix
guix install -f https://raw.githubusercontent.com/BIMSBbioinfo/pycacheR/main/guix.scm

The import name is cachepy:

from cachepy import cache_file

Basic usage

The package provides:

cache_file() — a caching decorator
cache_tree_nodes() / cache_tree_reset() — functions for inspecting and managing the cache tree
cache_stats() / cache_prune() / cache_list() — cache inspection and cleanup

from cachepy import cache_file

cache_dir = "/tmp/my_cache"

@cache_file(cache_dir)
def inner(x):
    return x + 1

@cache_file(cache_dir)
def outer(x):
    return inner(x) * 2

outer(3)
#> 8

How does cachepy decide to recompute?

A cached call is reused only if all of the following are unchanged:

The function body (source code hash, including inline changes)
The arguments (normalized and hashed — positional, named, and default-filled forms are equivalent)
The tracked files / directories, where relevant (file_args, depends_on_files)
The package versions of imported modules used by the function
The environment variables specified via env_vars
The version string, if provided
Any external variables specified via depends_on_vars

If any of these change, cachepy invalidates the old entry and recomputes.

Features

Feature	How
Argument normalization	`f(1, 2)`, `f(a=1, b=2)`, `f(b=2, a=1)` all hit the same cache entry
File dependency tracking	`file_args=["path"]` hashes file content, not just the path
Body change detection	Redefining a function invalidates its cache
Version parameter	`version="2.0"` for manual cache-busting
Force / skip save	`f(x, _force=True)` re-executes; `f(x, _skip_save=True)` runs without writing
External dependencies	`depends_on_files=[...]` and `depends_on_vars={...}`
Environment variables	`env_vars=["GENOME_BUILD"]` invalidates when env changes
Verbose logging	`verbose=True` logs cache hits, misses, and first executions
Dependency graph	Nested cached calls are tracked as a parent/child graph
Recursive memoization	Recursive cached functions automatically memoize sub-problems
Cache statistics	`cache_stats(dir)` reports entry count and disk usage
Cache pruning	`cache_prune(dir, days_old=30)` removes stale entries
Sentinel / lock files	Concurrent-safe via `.computing` sentinels and `.lock` files
YAML configuration	Load project-level defaults from a YAML file

Limitations & caveats

Package boundaries: cachepy stops tracking when it hits a function imported from an installed package. Instead, it records the package name and version. It does not inspect the internals of those functions.
Native code / C extensions: C/C++ extensions and external tools (e.g. subprocess.run(["bwa", ...])) are not tracked. If they change, cachepy will not notice unless their inputs or outputs change in a tracked place.
Side effects: Functions with side effects (writing to global variables, random seeds, databases, network calls, etc.) are not fully safe to cache. Prefer pure, data-in / data-out functions.
Pickle limitations: Results must be picklable. Objects that cannot be pickled (open file handles, database connections, running threads, lambda closures in some cases) will raise an error at save time. Large results incur serialization overhead on every cache hit.
Argument hashing: Arguments are hashed via pickle + hashlib. Objects with non-deterministic pickling (e.g. sets, dicts with hash-randomized key order in older Python, custom __reduce__) may produce unstable hashes. NumPy arrays, pandas DataFrames, and PyTorch tensors are handled correctly.
No distributed execution: cachepy is a single-machine, single-process cache. It does not coordinate across machines or provide cluster scheduling.

When you probably shouldn't use cachepy

Highly stateful / interactive code where caching would confuse you more than it helps
Situations where you need full workflow orchestration, scheduling, and cluster execution (use snakemake / nextflow / targets / etc. instead)
Functions that return non-picklable objects (wrap them to return serializable data instead)

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
.github/workflows		.github/workflows
cachepy		cachepy
conda-recipe		conda-recipe
docs		docs
notebooks		notebooks
tests		tests
.DS_Store		.DS_Store
._.DS_Store		._.DS_Store
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
data.txt		data.txt
demo.py		demo.py
guix.scm		guix.scm
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pycacheR

What does cachepy do?

What is it useful for?

Installation

Basic usage

How does cachepy decide to recompute?

Features

Limitations & caveats

When you probably shouldn't use cachepy

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

pycacheR

What does cachepy do?

What is it useful for?

Installation

Basic usage

How does cachepy decide to recompute?

Features

Limitations & caveats

When you probably shouldn't use cachepy

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages