Docstring/retrieval tooling built on top of the Pile-of-Rocq data pipeline.
This repository extends:
Use the upstream README for extraction/export details and dataset exploration conventions:
If you only want to consume docstrings/retrieval, you do not need to run the generation pipeline below.
Public dataset:
Structure (single-repo layout):
- one folder per environment: `<env>/`
- one parquet table per artifact:
  - `sources.parquet`
  - `toc_nodes.parquet` (includes `docstring`, `source_id`, spans, kind/name)
  - `env_toc.parquet`
  - `proofs.parquet`, `proof_steps.parquet`, `step_deps.parquet`, `proof_axioms.parquet`, `env_metadata.parquet`
- HF configs are `<env>-<table>`, for example `coq-mathcomp-toc_nodes`
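As a small illustration of the naming scheme (a sketch, not part of the tooling), the `<env>-<table>` config names can be enumerated from an env list; the env names are only examples:

```python
# The eight per-env tables from the single-repo layout above.
TABLES = ["sources", "toc_nodes", "env_toc", "proofs", "proof_steps",
          "step_deps", "proof_axioms", "env_metadata"]

def config_names(envs):
    """Enumerate the HF config names exposed by the single-repo layout:
    one '<env>-<table>' config per env/table pair."""
    return [f"{env}-{table}" for env in envs for table in TABLES]
```

For example, `config_names(["coq-mathcomp"])` yields eight names such as `coq-mathcomp-toc_nodes`.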
Minimal Python snippet:

```python
from datasets import load_dataset

repo = "theostos/pile-of-rocq"
env = "coq-mathcomp"
toc_nodes = load_dataset(repo, f"{env}-toc_nodes", split="train")
print("rows:", len(toc_nodes))
print("columns:", toc_nodes.column_names)

target = next(
    row
    for row in toc_nodes
    if (row.get("docstring") or "").strip()
    and row.get("kind") in {"definition", "notation", "theorem"}
)
print(target["kind"], target.get("name"))
print((target.get("docstring") or "")[:300])
```

Docker images are published at:
Run one image and expose retrieval on port 8010:

```shell
docker run --rm -p 8010:8010 theostos/coq-mathcomp:9.0-2.5.0 rocq-ml-retrieval-server \
  --embeddings-root /home/rocq/docstring_embeddings \
  --host 0.0.0.0 \
  --port 8010
```

Optional quick check:

```shell
curl -s http://127.0.0.1:8010/envs
```

Pipeline flow:
`script/chunk.py` → `script/compute_docstrings.py` → `script/merge_docstrings_exports.py` → `script/export_hf.py`, with `script/peek_env_toc.py` for inspection/debug.
chunk.py:
- reads env JSON/JSONL export files
- extracts target TOC elements and removes proof text
- injects UIDs and builds annotation chunks
- writes the same target UIDs into TOC target nodes (`toc[*].data.uid`)
- writes per-env JSONL ready for `compute_docstrings.py`
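The UID-injection step can be sketched as a depth-first walk that stamps a fresh UID into each node's `data` dict, mirroring the `toc[*].data.uid` convention. This is a minimal sketch; field names beyond `data`/`children` and the actual `chunk.py` logic are assumptions:

```python
from itertools import count

def inject_uids(toc, counter=None):
    """Stamp a fresh uid into each TOC node's data dict, depth-first.
    Sketch only: assumes nodes are dicts with optional 'children' lists."""
    counter = counter if counter is not None else count()
    for node in toc:
        # Create the data dict if missing, then record this node's uid.
        node.setdefault("data", {})["uid"] = next(counter)
        inject_uids(node.get("children", []), counter)
    return toc
```

Sharing one counter across the recursion keeps UIDs unique over the whole tree, which is what lets annotation chunks be matched back to TOC target nodes later.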
Typical setup used here:
- Base extraction dump: `../Pile-of-rocq/exported_v3/*.json` or `*.jsonl`
- Previous annotations (shards + merged): `exported_v3/`
- Merged annotations output: `exported_v3/merged/`
Single env test:

```shell
python -m script.chunk_ter \
  --input-dir ../Pile-of-rocq/exported_v3 \
  --output-dir chunk_ter_output \
  --env coq-actuary
```

All envs:

```shell
python -m script.chunk_ter \
  --input-dir ../Pile-of-rocq/exported_v3 \
  --output-dir chunk_ter_output \
  --workers 4
```

Compute docstrings over the chunked output:

```shell
python -m script.compute_docstrings \
  --input-dir chunk_ter_output \
  --output-dir exported_v3/annotated_jsonl_local \
  --state-dir .compute_docstrings_state_local \
  --config config/annotator/config.yaml \
  --templates config/annotator \
  --dir-template config/annotator/prompt_dir_one_liner.txt \
  --workers 4
```

Notes:
- This script is resumable (`--state-dir`).
- For multiple parallel jobs (HPC), use a different `--output-dir` and `--state-dir` per job, and split envs with repeated `--env`.
- Output rows keep `annotations` unchanged and also mirror them into `toc[*].docstring` using `toc[*].data.uid`.
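The multi-job note can be sketched as a command generator that assigns disjoint env shards and per-job output/state dirs; env names, job count, and directory names below are illustrative, not prescribed:

```python
def job_commands(envs, n_jobs):
    """Build one compute_docstrings invocation per job: disjoint env
    shards plus per-job --output-dir/--state-dir so parallel jobs never
    share resume state. A sketch, not a launcher."""
    shards = [envs[j::n_jobs] for j in range(n_jobs)]
    commands = []
    for j, shard in enumerate(shards):
        cmd = [
            "python", "-m", "script.compute_docstrings",
            "--input-dir", "chunk_ter_output",
            "--output-dir", f"exported_v3/annotated_jsonl_job{j}",
            "--state-dir", f".compute_docstrings_state_job{j}",
        ]
        # Repeated --env flags restrict each job to its shard.
        for env in shard:
            cmd += ["--env", env]
        commands.append(cmd)
    return commands
```

Each command can then be submitted as its own HPC job; because state dirs are disjoint, any job can be resumed independently.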
Merge all shard outputs:

```shell
python -m script.merge_docstrings_exports \
  --input-root exported_v3 \
  --output-dir exported_v3/merged
```

Merge + coverage audit against the latest Pile-of-rocq dump:

```shell
python -m script.merge_docstrings_exports \
  --input-root exported_v3 \
  --output-dir exported_v3/merged \
  --baseline-export-dir ../Pile-of-rocq/export_v2 \
  --report-path exported_v3/merged/coverage_report.json
```

Export to parquet tables:

```shell
python -m script.export_hf \
  --export-path ../Pile-of-rocq/export_v2 \
  --config-path ../Pile-of-rocq/config \
  --docstrings-path exported_v3/merged \
  --env-toc-path exported_v3/merged/toc \
  --output-path ../Pile-of-rocq/hf_export_docstrings_envtoc
```

Output per env contains normalized tables, including:
- `toc_nodes.parquet` with a single `docstring` field per TOC node
- `env_toc.parquet` with env-level tree nodes

`export_hf.py` now prefers TOC-native `docstring`/`data.uid` when present and keeps fallback matching from merged annotation files.

- Most previously `*_json` columns are now native nested Arrow/Parquet fields (lists/structs).
- The `diags` table is not exported.
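Since older exports stored nested data as `*_json` string columns while the new tables are native lists/structs, a consumer that must handle both generations can normalize defensively. This is a sketch under that assumption, not part of the repository's API:

```python
import json

def as_native(value):
    """Return a nested cell as Python objects, accepting either the old
    JSON-string encoding (*_json columns) or the new native list/struct
    encoding, which already arrives as Python lists/dicts."""
    if isinstance(value, str):
        return json.loads(value)
    return value
```

Applying `as_native` to every nested cell lets the same downstream code run against old and new exports.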
Login:

```shell
hf auth login
```

Push all envs into one dataset repo:

```shell
python -m script.export_hf \
  --export-path export_v2 \
  --config-path Pile-of-rocq/config \
  --env-toc-path export_v2/toc \
  --push-to-hub \
  --hf-namespace <your-hf-namespace> \
  --push-layout single-repo \
  --repo-name pile-of-rocq
```

You do not need to pre-create the dataset repo manually; the script creates it if needed. In single-repo mode, configs are `<env>-<table>` subsets (for example `coq-actuary-toc_nodes`).
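Going the other way, a single-repo config name can be split back into env and table. This sketch relies on the observation that the table names listed earlier contain no hyphens while env names may, so cutting at the last hyphen is unambiguous:

```python
# Table names from the single-repo layout; none contain a hyphen.
TABLES = {"sources", "toc_nodes", "env_toc", "proofs", "proof_steps",
          "step_deps", "proof_axioms", "env_metadata"}

def split_config(name):
    """Split an HF config name like 'coq-actuary-toc_nodes' into
    (env, table) by cutting at the last hyphen, validating the table part."""
    env, table = name.rsplit("-", 1)
    if table not in TABLES:
        raise ValueError(f"unexpected table in config name: {name!r}")
    return env, table
```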
Browse env TOC:

```shell
python -m script.peek_env_toc \
  --export-root ../Pile-of-rocq/hf_export_docstrings_envtoc \
  --env coq-actuary \
  --node-path Corelib/Numbers
```

Inspect a file and sample docstrings:

```shell
python -m script.peek_env_toc \
  --export-root ../Pile-of-rocq/hf_export_docstrings_envtoc \
  --env coq-actuary \
  --file Corelib/Lists/List.v \
  --max-docstrings 20
```

Retrieval tooling (embedding precompute + HTTP server + client) is in:
src/retrieval/
See the standalone guide:
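For completeness, the `/envs` quick check from the Docker section can also be done from Python with only the standard library. Only the `/envs` endpoint is taken from this README; the response shape and any other endpoints are not assumed here:

```python
import json
from urllib.request import urlopen

def list_envs(base_url="http://127.0.0.1:8010"):
    """Fetch the retrieval server's /envs endpoint and decode the JSON
    payload (the same request as the curl quick check)."""
    with urlopen(f"{base_url}/envs", timeout=10) as resp:
        return json.load(resp)
```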