Codex/0330 #6

Open

PI-33 wants to merge 7 commits into Gen-Verse:main from PI-33:codex/0330

Conversation


@PI-33 PI-33 commented Mar 30, 2026

No description provided.

PI-33 and others added 7 commits October 21, 2025 18:39
…pe-ppo-or-grpo

Refactor RL pipeline for GRPO grouping
- Add environment.yml, requirements.txt for reproducible env setup
- Add start.sh launch script with VLLM_CUDART_SO_PATH for /proc-restricted containers
- Add .gitignore for temp_data, logs, checkpoints, and data files
- Fix optimization_config.py for 2-GPU small-scale training (CodeContests_200 + MBPP)
- Fix openrlhf_deepspeed.py: fallback to torch.optim.AdamW when FusedAdam JIT fails
- Fix trainer.py: handle empty training batches and packed metadata averaging

Co-authored-by: Cursor <cursoragent@cursor.com>
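The FusedAdam fallback in openrlhf_deepspeed.py can be sketched roughly as follows. This is a hedged illustration, not the PR's actual code: the function name `build_optimizer` and the hyperparameter defaults are assumptions; only `deepspeed.ops.adam.FusedAdam` and `torch.optim.AdamW` are real APIs.

```python
import torch


def build_optimizer(params, lr=1e-6, betas=(0.9, 0.95), weight_decay=0.0):
    """Prefer DeepSpeed's FusedAdam, but fall back to torch.optim.AdamW
    when the CUDA extension fails to JIT-compile (e.g. no nvcc, or a
    /proc-restricted container). Illustrative sketch, not the PR's code."""
    try:
        from deepspeed.ops.adam import FusedAdam  # JIT-compiles a CUDA kernel
        return FusedAdam(params, lr=lr, betas=betas, weight_decay=weight_decay)
    except Exception as exc:  # build failures surface as ImportError/RuntimeError
        print(f"FusedAdam unavailable ({exc!r}); using torch.optim.AdamW instead")
        return torch.optim.AdamW(params, lr=lr, betas=betas,
                                 weight_decay=weight_decay)
```

Catching the broad `Exception` is deliberate here: DeepSpeed's JIT build can fail in several ways, and any of them should degrade to the slower but dependency-free AdamW rather than crash the trainer.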
…ging, and 8-GPU config

- Add experiment directory auto-creation with config snapshots and result archiving
- Separate TensorBoard logs into dedicated tb_logs/ dir, pass step index for proper tracking
- Add per-step training metrics logging (policy_loss, kl_loss, clip_ratio, entropy, lr)
- Extend reward.py with estimated reward statistics output
- Update optimization_config for Qwen2.5-7B-Instruct on 8-GPU (4×TP2) setup
- Add config variants (debug, paper-aligned) under optimization/configs/
- Improve .gitignore to exclude logs, experiment checkpoints, and temp data

Made-with: Cursor
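The per-step metrics logging into a dedicated tb_logs/ directory might look like the sketch below. The `StepLogger` class name is hypothetical; the real API used is `torch.utils.tensorboard.SummaryWriter`, with the step index passed as `global_step` so curves stay aligned.

```python
import os

from torch.utils.tensorboard import SummaryWriter


class StepLogger:
    """Write per-step training metrics (policy_loss, kl_loss, clip_ratio,
    entropy, lr, ...) to a dedicated tb_logs/ subdirectory of the
    experiment dir. Illustrative sketch, not the PR's code."""

    def __init__(self, exp_dir):
        self.writer = SummaryWriter(log_dir=os.path.join(exp_dir, "tb_logs"))

    def log(self, step, metrics):
        # Tag each scalar under train/ and key it by the global step index.
        for name, value in metrics.items():
            self.writer.add_scalar(f"train/{name}", value, global_step=step)

    def close(self):
        self.writer.close()
```

Keeping TensorBoard event files in their own tb_logs/ directory (rather than alongside checkpoints) also makes the .gitignore rules and result archiving mentioned above straightforward: the whole directory can be excluded or tarred as a unit.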
- analysis/: training curves, eval summaries, BoN robustness analysis
- figures/: training visualization plots
- scripts: generate_report.py, regenerate_figures.py
- DESIGN_SELF_BOOTSTRAPPED_GRPO.md: self-supervised reward design

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>