This guide captures the exact commands to run each stage of the repository's two pipelines and highlights what to verify after every step. Use it to reproduce an end-to-end run on a fresh clone and to document any failures you encounter.
- Install dependencies: `pip install -r requirements.txt`
- Populate environment variables
  - Confirm `OPENAI_API_KEY` and `GOOGLE_API_KEY` are exported in the shell (or stored in `.env`).
  - Optionally set `OPENAI_MODEL`, `OPENAI_FINETUNE_MODEL`, `STONEY_EXTRACTION_MODEL`, and `STONEY_TASK_MODEL` to override defaults.
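The environment checks above can be scripted before launching any stage. A minimal preflight sketch (the variable names come from this guide; the helper itself is illustrative, not part of the repo):

```python
import os

REQUIRED = ["OPENAI_API_KEY", "GOOGLE_API_KEY"]
OPTIONAL = ["OPENAI_MODEL", "OPENAI_FINETUNE_MODEL",
            "STONEY_EXTRACTION_MODEL", "STONEY_TASK_MODEL"]

def check_env(environ=os.environ):
    """Return the required variables that are missing or empty."""
    return [name for name in REQUIRED if not environ.get(name)]
```

Run `check_env()` in a REPL before kicking off a pipeline; an empty list means the required keys are present.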
- Artifacts directory sanity check
  - Ensure `Dictionaries/`, `OpenAIFineTune/`, and `data/` exist and are writable (the pipeline scripts will create their own sub-directories if needed).
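The directory sanity check can also be automated. A small sketch (directory names taken from this guide; the helper is illustrative):

```python
import os
from pathlib import Path

EXPECTED_DIRS = ["Dictionaries", "OpenAIFineTune", "data"]

def ensure_writable(dirs=EXPECTED_DIRS, base="."):
    """Create any missing directories and return those that are not writable."""
    problems = []
    for name in dirs:
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)
        if not os.access(path, os.W_OK):
            problems.append(str(path))
    return problems
```

An empty return value means all three artifact directories exist and are writable.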
Run the Gemini-backed generator to build the 150k example corpus:

`python bilingual_qa_generator.py`

Expectation:

- `Dictionaries/bilingual_training_set.jsonl` grows toward 150,000 lines (75k English-perspective + 75k Stoney-perspective examples).
- Periodic checkpoints appear under `Dictionaries/checkpoints/` with progress metadata.
- The script logs API calls and warnings when malformed dictionary rows are skipped. 【F:bilingual_qa_generator.py†L23-L209】

Capture failures by noting the last log lines, checkpoint counts, and the batch size (`context_size=5`). Quota or timeout issues can be retried because the generator resumes from rolling buffers.
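When recording checkpoint counts for the failure log, a quick progress helper saves eyeballing the file (illustrative; `wc -l` on the JSONL works just as well):

```python
from pathlib import Path

def corpus_progress(path="Dictionaries/bilingual_training_set.jsonl"):
    """Return the number of non-empty example lines generated so far (0 if absent)."""
    p = Path(path)
    if not p.exists():
        return 0
    with p.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())
```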
Transform the raw corpus into train/validation JSONL files:

`python finetunesetup.py`

Expectation:

- `OpenAIFineTune/stoney_train.jsonl` and `OpenAIFineTune/stoney_valid.jsonl` are produced with an 80/20 split.
- Each entry is a chat-style `messages` array ready for OpenAI fine-tuning. 【F:finetunesetup.py†L10-L78】

If the input file is missing or malformed, the script logs the issue and aborts. Record the offending line number or JSON snippet when that happens.
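To pinpoint the offending line number and snippet for the failure log, a small JSONL validator sketch (an illustrative helper, not part of the repo):

```python
import json

def find_bad_lines(path, limit=10):
    """Return (line_number, error, snippet) for lines that fail to parse as JSON."""
    bad = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # blank lines are skipped, not reported
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                bad.append((lineno, str(exc), line[:80].rstrip()))
                if len(bad) >= limit:
                    break
    return bad
```

Each tuple maps directly onto the "Observed behavior" and "Artifacts" fields of the failure logging template below.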
Submit the formatted datasets to OpenAI:

`python openai_finetune.py`

Expectation:

- The script validates the presence of the train/valid files and the `OPENAI_API_KEY`.
- A fine-tuning job is created for the base model defined by `OPENAI_FINETUNE_MODEL`/`OPENAI_MODEL`.
- Optional Hugging Face dataset publishing and Weights & Biases tracking kick in when the relevant environment variables are set. 【F:openai_finetune.py†L40-L199】

Document the returned job ID, status transitions, and any API errors (quota, billing, validation failures). If dataset uploads fail, capture the exception message and whether retries were attempted.
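Status transitions are easy to miss when watching the console. A polling sketch against the `openai>=1.x` SDK (`client.fine_tuning.jobs.retrieve` is a real endpoint; the helper itself is illustrative):

```python
import time

def wait_for_job(client, job_id, poll_seconds=30):
    """Poll a fine-tuning job, printing each status transition until it terminates."""
    terminal = {"succeeded", "failed", "cancelled"}
    last = None
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status != last:
            print(f"{job_id}: {job.status}")  # copy these transitions into the log
            last = job.status
        if job.status in terminal:
            return job
        time.sleep(poll_seconds)
```

Construct the client with `from openai import OpenAI; client = OpenAI()` and pass the job ID returned by `openai_finetune.py`.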
Execute the single entry point that loads the grammar PDF, extracts rules, curates them, and generates RL tasks:

`python run_stoney_grammar_pipeline.py`

Pipeline flow:

- `pdf_ingest.load_page_assets` renders the source PDF into 127 base64 PNG + text chunks. 【F:stoney_rl_grammar/pdf_ingest.py†L14-L35】
- `StoneyGrammarExtractor.extract_rules` calls the Responses API to convert each chunk into structured grammar rules, persisting per-chunk JSON under `data/grammar_extracted_stoney/`. 【F:stoney_rl_grammar/rule_extractor.py†L62-L165】
- `RuleOrganizer.organize` filters, deduplicates, and writes curated rules to `data/rl_training_rules_stoney.json`. 【F:stoney_rl_grammar/rule_organizer.py†L17-L81】
- `StoneyTaskGenerator.generate_tasks` streams RL-ready tasks to `data/training_datasets_stoney.jsonl`. 【F:stoney_rl_grammar/task_generator.py†L63-L140】
During the extraction stage, the OpenAI Python SDK bundled with the repo rejects the `response_format` argument when calling `client.responses.create`, causing every chunk to fail and the retry loop to exhaust:

`2025-10-29 00:05:48,141 - ERROR - Unable to process page_001_chunk_00: Responses.create() got an unexpected keyword argument 'response_format'`

The error repeats for each page until the run is interrupted. 【d9a4f4†L1-L2】【8b4d30†L1-L16】
Action items while reproducing:

- Confirm the installed `openai` package version (`pip show openai`). Versions prior to 1.12.0 do not yet accept `response_format`.
- Option 1: upgrade the SDK (`pip install --upgrade openai`) so `response_format={"type": "json_object"}` is supported.
- Option 2: remove the `response_format` argument and parse raw JSON manually, but that reintroduces non-JSON responses.
- Document which option you choose and capture the resulting behavior (successful rule extraction counts or new errors).
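If you take Option 2, stripping the argument means replies may arrive wrapped in prose or markdown fences rather than as bare JSON. A best-effort parser sketch (illustrative, not the repo's code):

```python
import json
import re

def parse_json_reply(text):
    """Best-effort JSON extraction for replies that may wrap JSON in prose or fences."""
    # Try the whole reply first (the common case when the model behaves).
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the first {...} span, which covers ```json fences and prose.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        return json.loads(match.group(0))
    raise ValueError("no JSON object found in reply")
```

Note the fallback is greedy and assumes a single top-level object per reply; truly malformed replies still raise, which is the behavior you would log for the run.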
Once `data/training_datasets_stoney.jsonl` exists, install and smoke-test the bundled RL environment:

`pip install -e environments/stoney_nakoda_translation`

`uv run vf-eval stoney-nakoda-translation -a '{"dataset_path": "data/training_datasets_stoney.jsonl", "max_examples": 50}'`

The environment expects the task JSONL generated above and exposes configuration knobs documented in `environments/stoney_nakoda_translation/README.md`. 【F:environments/stoney_nakoda_translation/README.md†L1-L60】

Record evaluation metrics (`exact_match_reward`, `char_overlap_reward`, `pattern_reward`) and any loader issues (e.g., missing dataset path).
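For context when reading the metrics, a character-overlap reward can be computed roughly like this (one plausible definition for sanity-checking numbers; the environment's actual reward functions may differ):

```python
from collections import Counter

def char_overlap_reward(prediction, reference):
    """Fraction of reference characters matched by the prediction (multiset overlap).
    Illustrative only -- not necessarily the environment's implementation."""
    if not reference:
        return 0.0
    pred, ref = Counter(prediction), Counter(reference)
    overlap = sum(min(pred[c], ref[c]) for c in ref)
    return overlap / sum(ref.values())
```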
Failure logging template
Stage: <dictionary generation | format conversion | fine-tune | grammar extraction | task generation | RL eval>
Command: <exact command>
Timestamp (UTC): <YYYY-MM-DD HH:MM>
Observed behavior: <error message or anomaly>
Artifacts: <paths, line counts, checkpoint names>
Next action: <retry, upgrade dependency, open issue>
Keep this document updated as you resolve blockers so future operators have an authoritative runbook.