Version: 1.0
Repository: EcosystemEcologyLab/lab-principles
Applies to: EcosystemEcologyLab data pipeline projects (R or Python)
Extends: SCIENCE_PRINCIPLES.md (read that file first)
When copying this file into a project repository, record the source commit hash in the project's CLAUDE.md alongside the hash for SCIENCE_PRINCIPLES.md.
These rules apply to any project whose primary purpose is acquiring, processing, quality-controlling, or analysing scientific datasets. They apply regardless of whether the pipeline is written in R, Python, or a combination.
- All R dependencies must be declared and version-locked using `renv`
- All Python dependencies must be declared and version-locked using `pyproject.toml` or `requirements.txt` with pinned versions
- Never install packages interactively without immediately updating the lockfile and committing it
- Never introduce a new dependency without flagging it for discussion first
- The lockfile is a scientific record — treat it as such
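
The pinned-version rule for `requirements.txt` can be checked mechanically. The following is a minimal sketch, assuming that "pinned" means an exact `==` specifier; the function name `unpinned_requirements` is hypothetical, and pip option lines (those starting with `-`) are skipped rather than validated.

```python
import re

# A requirement counts as pinned only if it has the form name==version
# (optionally with extras, e.g. name[extra]==version).
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,\s-]+\])?==[^=\s]+")


def unpinned_requirements(text: str) -> list:
    """Return the requirement lines in `text` that lack an exact '==' pin."""
    problems = []
    for raw in text.splitlines():
        # Drop full-line and trailing comments, then surrounding whitespace.
        line = re.sub(r"(^|\s)#.*$", "", raw).strip()
        # Skip blank lines and pip option lines (assumption: not checked here).
        if not line or line.startswith("-"):
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems
```

A pre-commit hook or CI step could call this on the file contents and fail when the returned list is non-empty.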
Every output file (CSV, RDS, NetCDF, figure, HTML report) must carry provenance
metadata. The metadata may be embedded in the file (e.g. as a header block or
file attributes) or stored in a companion file with the same base name and a
`.meta.json` extension.
Required fields:
| Field | Content | Example |
|---|---|---|
| `run_datetime_utc` | ISO 8601 timestamp of pipeline run | `2026-03-27T18:25:56Z` |
| `pipeline_version` | Git commit hash of the repository at run time | `34befe2` |
| `input_sources` | URLs, DOIs, or file paths of all primary inputs | snapshot CSV path + per-site DOIs |
| `r_session_info` | Saved output of `sessionInfo()` | saved to `outputs/session_info.txt` |
| `notes` | Any manual decisions, overrides, or deviations from defaults | free text |
Implementation rules:
- `pipeline_version` must be captured programmatically at run time: `system("git rev-parse --short HEAD", intern = TRUE)` in R
- `r_session_info` must be written to `outputs/session_info.txt` at the end of every pipeline run; this file is gitignored but must be present with every output set
- `notes` is required even if empty; an empty string is acceptable, a missing field is not
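
For Python pipelines, the companion-file option might look like the sketch below. It is an illustration under stated assumptions, not a prescribed implementation: the function names `git_short_hash` and `write_provenance` are hypothetical, the `r_session_info` field is filled with a pointer to the session info file rather than its contents, and falling back to `"unknown"` when Git is unavailable is a choice a real pipeline may want to replace with a hard failure.

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

REQUIRED_FIELDS = ("run_datetime_utc", "pipeline_version", "input_sources",
                   "r_session_info", "notes")


def git_short_hash() -> str:
    """Short commit hash of the current checkout, or 'unknown' if Git fails."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"  # assumption: a real pipeline may prefer to stop here


def write_provenance(output_path, input_sources, notes="",
                     session_info_path="outputs/session_info.txt"):
    """Write <base name>.meta.json next to output_path and return its path."""
    output_path = Path(output_path)
    meta = {
        "run_datetime_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "pipeline_version": git_short_hash(),
        "input_sources": list(input_sources),
        "r_session_info": str(session_info_path),  # pointer to the saved session info
        "notes": notes,  # always present, even when empty
    }
    meta_path = output_path.with_name(output_path.stem + ".meta.json")
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta_path
```

Calling `write_provenance("outputs/results.csv", ["data/snapshots/manifest.csv"])` would produce `outputs/results.meta.json`; both paths here are placeholders.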
Every record excluded from analysis must be logged. Exclusions and unknowns are distinct categories and must be logged separately.
- File: `outputs/exclusion_log.csv` (gitignored; regenerated each run)
- Required columns:

| Column | Content |
|---|---|
| `site_id` | FLUXNET site ID or equivalent record identifier |
| `variable` | Variable name or `ALL` if the whole record is excluded |
| `timestamp` | Record timestamp or `ALL` if the whole site-year is excluded |
| `reason` | Human-readable reason for exclusion |
| `threshold` | The threshold or rule applied (e.g. `QC_THRESHOLD_YY=0.75`) |
| `excluded_by` | Script name that performed the exclusion |
- File: `outputs/unknown_log.csv` (gitignored; regenerated each run)
- Required columns: `record_id`, `reason`, `logged_by`
- A record is UNKNOWN when it cannot be assessed, not when it fails QC. Failed QC → exclusion log. Missing data → unknown log.
- Both logs must be written even if empty (zero-row CSV with headers)
- Summary counts from both logs must be printed to the console at the end of each QC script run
- No record may be silently dropped: every record removed from analysis must appear in one of the two logs
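
The logging rules above can be sketched in Python as follows. This is a minimal illustration, assuming each log entry is a dict keyed by the required column names; the helper names `write_log` and `write_qc_logs` are hypothetical.

```python
import csv
from pathlib import Path

EXCLUSION_COLUMNS = ["site_id", "variable", "timestamp",
                     "reason", "threshold", "excluded_by"]
UNKNOWN_COLUMNS = ["record_id", "reason", "logged_by"]


def write_log(path, columns, rows):
    """Write rows (a list of dicts) to CSV; the header is written even if rows is empty."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def write_qc_logs(exclusions, unknowns, out_dir="outputs"):
    """Write both logs and print summary counts to the console."""
    n_excluded = write_log(Path(out_dir) / "exclusion_log.csv",
                           EXCLUSION_COLUMNS, exclusions)
    n_unknown = write_log(Path(out_dir) / "unknown_log.csv",
                          UNKNOWN_COLUMNS, unknowns)
    print(f"QC summary: {n_excluded} excluded, {n_unknown} unknown")
    return n_excluded, n_unknown
```

Because `csv.DictWriter` raises `ValueError` for keys outside `fieldnames`, an entry with an unexpected column fails loudly instead of being written in a malformed row.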
When a scientist makes a manual decision that overrides a pipeline default (e.g. manually including or excluding a site, overriding a QC threshold for a specific site-year), that decision must be stored in a human override file.
- File location: `data/overrides/` (git-tracked, because these are scientific decisions)
- File format: CSV with columns `record_id`, `decision`, `reason`, `date`, `author`
- Claude must never modify override files: read them, apply them, flag them in output metadata, but never write to them
- Override files must survive pipeline reruns — they are inputs, not outputs
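
A read-only way of applying overrides could be sketched as below. The column names come from the file format above; everything else is an assumption for illustration, including the function names and the idea that `decision` holds the same kind of value (for example `include` or `exclude`) as the pipeline's default decision for each record.

```python
import csv
from pathlib import Path

OVERRIDE_COLUMNS = ["record_id", "decision", "reason", "date", "author"]


def load_overrides(path):
    """Read an override CSV (opened read-only) and return {record_id: row}."""
    with Path(path).open("r", newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = set(OVERRIDE_COLUMNS) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"override file {path} lacks columns: {sorted(missing)}")
        return {row["record_id"]: row for row in reader}


def apply_overrides(default_decisions, overrides):
    """Return (final decisions, record_ids whose decision came from an override)."""
    final = dict(default_decisions)  # copy, so the pipeline defaults stay intact
    applied = []
    for record_id, row in overrides.items():
        if record_id in final:
            final[record_id] = row["decision"]
            applied.append(record_id)
    return final, applied
```

The `applied` list is what a script would surface in the output metadata, for instance in the `notes` field, so that every override remains visible downstream.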
- Every script must begin with `source("R/pipeline_config.R")` and `check_pipeline_config()` (or the Python equivalent)
- Scripts are numbered and must be run in order; document dependencies between scripts explicitly
- Scripts communicate only via files in `data/` or `outputs/`, never via R global environment variables or Python module-level state
- Every script must write a completion message to the console on success: `message("Script XX complete: N records processed, M excluded, K unknown")`
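
For Python scripts, the completion message could be produced by a small helper such as the one below. The function name `report_completion` is hypothetical; writing to standard error mirrors the behaviour of R's `message()`.

```python
import sys


def report_completion(script_id: str, processed: int, excluded: int, unknown: int) -> str:
    """Print the standard completion message and return it for logging."""
    text = (f"Script {script_id} complete: {processed} records processed, "
            f"{excluded} excluded, {unknown} unknown")
    print(text, file=sys.stderr)  # R's message() also writes to stderr
    return text
```

Calling it as the last statement of a script means the message appears only when every earlier step finished without raising.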
- All scientifically meaningful thresholds (QC cutoffs, minimum data coverage requirements, aggregation rules) must be declared as named constants in a configuration file — never as magic numbers inline in scripts
- Named constants must be documented with their units, valid range, default value, and the scientific rationale for the default
- Changing a threshold is a scientific decision — it must be committed with a clear commit message explaining why
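
One way to keep the documentation attached to each constant in a Python configuration file is sketched below. The `Threshold` class is hypothetical. The name `QC_THRESHOLD_YY` and the value 0.75 come from the example in the exclusion log columns; the units, valid range, and rationale shown are placeholders, not the lab's actual definitions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Threshold:
    """A named pipeline constant together with its required documentation."""
    value: float
    units: str
    valid_range: Tuple[float, float]
    default: float
    rationale: str

    def __post_init__(self):
        # Reject values outside the documented valid range at import time.
        low, high = self.valid_range
        if not low <= self.value <= high:
            raise ValueError(f"value {self.value} outside valid range [{low}, {high}]")


QC_THRESHOLD_YY = Threshold(
    value=0.75,
    units="fraction (0-1)",    # assumed units
    valid_range=(0.0, 1.0),    # assumed range
    default=0.75,
    rationale="placeholder: record the scientific justification here",
)
```

Scripts would then refer to `QC_THRESHOLD_YY.value` instead of a literal, and a change to the value shows up as a one-line diff in the configuration file alongside its rationale.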
| Directory | Git-tracked | Content |
|---|---|---|
| `data/snapshots/` | Yes | Timestamped manifests of input data |
| `data/overrides/` | Yes | Human override files |
| `data/raw/` | No | Downloaded source data |
| `data/extracted/` | No | Unzipped/parsed source data |
| `data/processed/` | No | Pipeline outputs |
| `outputs/` | No | Final analysis outputs and logs |
| `figures/` | No | Generated figures |