Jade Zou1,2,*,‡, Tao Huang2,*,‡, Weijie Kong2,*, Junzhe Li1,2, Yue Wu2, Qi Tian2, Jiangfeng Xiong2, Jianwei Zhang2,†, Liefeng Bo2, Zhao Zhong2,§
1Peking University 2Tencent Hunyuan
*Equal contribution §Corresponding author †Project leader ‡Work done during internship at Tencent Hunyuan
- Code for Precise-SDE is available in this repository, including the Precise sampler, FLUX.2 Klein training entrypoint, evaluation scripts, and reward server integrations.
Reinforcement learning (RL) is an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. For online RL on flow-matching models, a central step is turning the deterministic sampling trajectory into a stochastic policy, usually by replacing the reverse-time ODE with an SDE. The stochastic sampler is therefore part of the policy: it controls exploration, affects denoising stability, and determines the action probabilities used by policy-gradient optimization.
We decompose stochastic sampler design into two coupled problems: choosing an exploration schedule that balances diversity and stability, and discretizing the resulting SDE faithfully at the small step counts used in RL. Existing samplers expose failure modes in this regime: Euler-style stochastic samplers can introduce excess discretization noise, while coefficient-preserving rules can bias the marginal distribution. We propose Precise, an SDE-consistent stochastic sampler with a logSNR-derived exploration schedule and a closed-form finite-step transition. The key approximation freezes the clean-latent posterior mean, which keeps the denoising trajectory faithful while avoiding excess noise.
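To make concrete why the sampler determines the action probabilities: if one stochastic sampling step draws the next latent from an isotropic Gaussian, the log-probability needed by policy-gradient methods has the generic closed form below. This is the standard Gaussian-policy identity, shown as a sketch rather than the specific Precise transition; the symbols $\mu_\theta$ (step mean), $s_k$ (step standard deviation), and $d$ (latent dimension) are notation assumed here for illustration.

$$
x_{k+1} \sim \mathcal{N}\big(\mu_\theta(x_k, k),\; s_k^2 I\big), \qquad
\log \pi_\theta(x_{k+1} \mid x_k) = -\frac{\lVert x_{k+1} - \mu_\theta(x_k, k) \rVert^2}{2 s_k^2} - \frac{d}{2} \log\big(2\pi s_k^2\big)
$$

Under this form, both the mean and the noise scale chosen by the sampler enter the log-probability directly, so a discretization that distorts either one also distorts the policy gradient.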
Across FLUX.2 Klein experiments, Precise improves reward optimization speed and stability, reaches state-of-the-art alignment scores on in-domain rewards such as PickScore and HPSv2.1, and requires less wall-clock training time to match the best in-domain performance of prior samplers.
The main experiments compare Precise against Dance-GRPO, Flow-GRPO, and CPS under matched training recipes. Higher is better for all plotted rewards.
Install uv, then create the root environment from the repository root:

    uv sync --project .

GPU training requires a CUDA-capable Linux environment compatible with the PyTorch and accelerator versions pinned in pyproject.toml.
External model and reward weights are resolved through pinned Hugging Face checkpoints unless you configure a local model mirror. Override local roots with:

    export PRECISE_SDE_MODEL_ROOT=/path/to/models

Pinned checkpoints:

| Name | Hugging Face checkpoint | Revision |
|---|---|---|
| FLUX.2 Klein | black-forest-labs/FLUX.2-klein-base-4B | a3b4f4849157f664bdbc776fd7453c2783562f4d |
| CLIP | openai/clip-vit-large-patch14 | 32bd64288804d66eefd0ccbe215aa642df71cc41 |
| OpenCLIP ViT-H | laion/CLIP-ViT-H-14-laion2B-s32B-b79K | 1c2b8495b28150b8a4922ee1c8edee224c284c0c |
| PickScore | yuvalkirstain/PickScore_v1 | a4e4367c6dfa7288a00c550414478f865b875800 |
| HPSv2 | xswu/HPSv2 | 697403c78157020a1ae59d23f111aa58ced35b0a |
| ImageReward | zai-org/ImageReward | 5736be03b2652728fb87788c9797b0570450ab72 |
| UnifiedReward v2 | CodeGoat24/UnifiedReward-2.0-qwen35-9b | f01548b009741e12ff9817ed91dba94701ed9579 |
| GenEval Mask2Former | tsbpp/geneval_mask2former | 22b5a198cedf6b45e45165cf1c865d58de4a2832 |
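If you want to populate a local mirror ahead of time, the pinned revisions can be turned into download commands. The sketch below only prints `huggingface-cli download` commands (from the `huggingface_hub` CLI) and does not run them. Mirroring each repository to `$MODEL_ROOT/<repo id>` is an assumption made for illustration; the directory layout that the repository expects under PRECISE_SDE_MODEL_ROOT is not specified here, so check it before relying on this.

```shell
#!/bin/sh
# Dry-run sketch: print one `huggingface-cli download` command per pinned checkpoint.
# ASSUMPTION: the "$MODEL_ROOT/<repo id>" target layout is hypothetical; the layout
# expected under PRECISE_SDE_MODEL_ROOT is not documented in this README.
MODEL_ROOT="${PRECISE_SDE_MODEL_ROOT:-./models}"

# "<repo id> <revision>" pairs copied from the pinned-checkpoint table.
PINNED="black-forest-labs/FLUX.2-klein-base-4B a3b4f4849157f664bdbc776fd7453c2783562f4d
openai/clip-vit-large-patch14 32bd64288804d66eefd0ccbe215aa642df71cc41
laion/CLIP-ViT-H-14-laion2B-s32B-b79K 1c2b8495b28150b8a4922ee1c8edee224c284c0c
yuvalkirstain/PickScore_v1 a4e4367c6dfa7288a00c550414478f865b875800
xswu/HPSv2 697403c78157020a1ae59d23f111aa58ced35b0a
zai-org/ImageReward 5736be03b2652728fb87788c9797b0570450ab72
CodeGoat24/UnifiedReward-2.0-qwen35-9b f01548b009741e12ff9817ed91dba94701ed9579
tsbpp/geneval_mask2former 22b5a198cedf6b45e45165cf1c865d58de4a2832"

# Build the command list without executing anything.
download_cmds=$(
  printf '%s\n' "$PINNED" | while read -r repo rev; do
    printf 'huggingface-cli download %s --revision %s --local-dir %s/%s\n' \
      "$repo" "$rev" "$MODEL_ROOT" "$repo"
  done
)

printf '%s\n' "$download_cmds"
```

Review the printed commands, then run the ones you need; pinning `--revision` keeps the mirror on the same commits as the table.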
Use the launcher rather than calling trainer scripts directly:

    bash launch/train.sh --flux --reward mix --sde precise --noise-level 1.5 --step 20

Supported model selectors:

- `--flux`

Supported rewards:

- `mix`
- `pickscore`
- `geneval`

Supported SDE modes:

- `precise`
- `flow_grpo`
- `cps`
- `dance_grpo`
- `dance_precise`

The launcher selects the trainer, config entrypoint, and PRECISE_SDE_LAUNCH_* environment together. The FLUX.2 Klein config builder lives in config/flux2_klein.py; the shared trainer is precise_sde/train/rl_trainer.py.
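To compare samplers, the same launcher call can be repeated with only `--sde` changed. The sketch below is a dry run that prints one command per supported SDE mode, using only the flags from the example invocation. Keeping `--noise-level 1.5` and `--step 20` identical across modes is an assumption for illustration, not a tuned or recommended recipe.

```shell
#!/bin/sh
# Dry-run sketch: print one launcher command per supported SDE mode.
# ASSUMPTION: sharing the same reward, noise level, and step count across
# modes is illustrative; per-sampler settings may differ in practice.
REWARD="mix"
NOISE_LEVEL="1.5"
STEPS="20"
SDE_MODES="precise flow_grpo cps dance_grpo dance_precise"

sweep_cmds=""
for mode in $SDE_MODES; do
  cmd="bash launch/train.sh --flux --reward $REWARD --sde $mode --noise-level $NOISE_LEVEL --step $STEPS"
  # Append the command followed by a newline.
  sweep_cmds="${sweep_cmds}${cmd}
"
done

# Print the commands instead of launching training.
printf '%s' "$sweep_cmds"
```

Printing first makes it easy to inspect the sweep or hand each line to a job scheduler.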
eval/infer_eval.sh runs through the root uv project and supports FLUX.2 Klein checkpoints. Pass at least one checkpoint base explicitly:

    bash eval/infer_eval.sh \
      --flux \
      --ckpt-base checkpoints/logs/run-name/checkpoints \
      --eval-config '1000|precise|0|20|pickscore|{"clipscore": 1.0}'

Use repeated --ckpt-base and --eval-config arguments to evaluate multiple runs in one invocation.
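A multi-run invocation can be assembled programmatically. The sketch below builds the argument list for `eval/infer_eval.sh` and prints it one argument per line rather than executing it. The run names `run-a` and `run-b` are hypothetical, and reusing a single eval config for every run is an assumption; how checkpoint bases and eval configs are paired is not specified here.

```shell
#!/bin/sh
# Dry-run sketch: assemble one eval/infer_eval.sh invocation covering several runs.
# ASSUMPTIONS: the run names are hypothetical, and repeating the same
# --eval-config for each run is illustrative only.
EVAL_CONFIG='1000|precise|0|20|pickscore|{"clipscore": 1.0}'
RUNS="run-a run-b"

# Build the arguments in the positional parameters so the JSON part of the
# eval config, which contains a space and quotes, stays a single argument.
set -- --flux
for run in $RUNS; do
  set -- "$@" --ckpt-base "checkpoints/logs/$run/checkpoints" --eval-config "$EVAL_CONFIG"
done
eval_argc=$#

# Print the command one argument per line instead of executing it.
printf '%s\n' bash eval/infer_eval.sh "$@"
```

Passing the config as a quoted variable avoids the shell splitting it at the space inside the JSON fragment.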
Remote reward services are intentionally isolated from the main training environment when they need heavyweight or version-sensitive dependencies.
GenEval runs as a separate nested uv project:

    bash precise_sde/rewards/servers/geneval/bootstrap.sh
    bash precise_sde/rewards/servers/geneval/start_server.sh
    uv run --project precise_sde/rewards/servers/geneval \
      python precise_sde/rewards/servers/geneval/check_server.py

The server binds to 127.0.0.1:18085 by default. Override the client URL with PRECISE_SDE_GENEVAL_URL.
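Because a reward server may take a while to load its models, it can help to poll the check until it passes before starting training. The helper below is a generic retry sketch, not part of the repository. It assumes the check command exits non-zero while the server is unavailable; the exit-code behavior of check_server.py is not documented here.

```shell
#!/bin/sh
# Sketch of a retry helper for waiting on a reward server that is still starting.
# ASSUMPTION: the wrapped check exits non-zero until the server is ready.

# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have been made.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    # Sleep only if another attempt will follow.
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (not executed here): poll the GenEval check up to 12 times, 5 s apart.
# retry 12 5 uv run --project precise_sde/rewards/servers/geneval \
#   python precise_sde/rewards/servers/geneval/check_server.py
```

The attempt count and delay in the commented example are placeholders; choose values that match how long the server takes to start on your hardware.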
UnifiedReward v2 should run in a fresh conda environment rather than the repo root uv environment:

    conda create -n vllm python=3.12 -y
    conda activate vllm
    pip install -r precise_sde/rewards/servers/unified_reward/requirements.txt
    bash precise_sde/rewards/servers/unified_reward/start_server.sh

Probe a running server with:

    uv run --project . python precise_sde/rewards/servers/unified_reward/test_api.py \
      --base-url http://127.0.0.1:8080 --tests 1,2,4

Point training at the server with PRECISE_SDE_UNIFIEDREWARD_URL or PRECISE_SDE_UNIFIEDREWARD_URLS. The exact request and response contract expected by training is documented in precise_sde/rewards/servers/unified_reward/README.md.
This repository builds on the Flow-GRPO training codebase and compares with Flow-GRPO, Dance-GRPO, and CPS samplers. The experiments use FLUX.2 Klein as the pretrained flow-matching backbone, and rely on community reward and evaluation models including PickScore, HPSv2.1, ImageReward, UnifiedReward v2, CLIP/OpenCLIP, and GenEval. We thank the authors and maintainers of these projects, as well as the Diffusers, Accelerate, PyTorch, and Hugging Face ecosystems that make this codebase possible.
@misc{zou2026precisesdeconsistentstochasticsampling,
title={Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models},
author={Jade Zou and Tao Huang and Weijie Kong and Junzhe Li and Yue Wu and Qi Tian and Jiangfeng Xiong and Jianwei Zhang and Liefeng Bo and Zhao Zhong},
year={2026},
eprint={2605.23522},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.23522},
}
