Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

Jade Zou^1,2,*,‡, Tao Huang^2,*,‡, Weijie Kong^2,*, Junzhe Li^1,2, Yue Wu^2, Qi Tian^2, Jiangfeng Xiong^2, Jianwei Zhang^2,†, Liefeng Bo^2, Zhao Zhong^2,§

^1Peking University ^2Tencent Hunyuan

^*Equal contribution ^§Corresponding author ^†Project leader ^‡Work done during internship at Tencent Hunyuan

News

Code for Precise-SDE is available in this repository, including the Precise sampler, FLUX.2 Klein training entrypoint, evaluation scripts, and reward server integrations.

Abstract

Reinforcement learning (RL) is an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. For online RL on flow-matching models, a central step is turning the deterministic sampling trajectory into a stochastic policy, usually by replacing the reverse-time ODE with an SDE. The stochastic sampler is therefore part of the policy: it controls exploration, affects denoising stability, and determines the action probabilities used by policy-gradient optimization.

We decompose stochastic sampler design into two coupled problems: choosing an exploration schedule that balances diversity and stability, and discretizing the resulting SDE faithfully at the small step counts used in RL. Existing samplers exhibit failure modes in this regime: Euler-style stochastic samplers can introduce excess discretization noise, while coefficient-preserving rules can bias the marginal distribution. We propose Precise, an SDE-consistent stochastic sampler with a logSNR-derived exploration schedule and a closed-form finite-step transition. The key approximation freezes the clean-latent posterior mean, which keeps the denoising trajectory faithful while avoiding excess noise.
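For readers unfamiliar with the terms "logSNR" and "clean-latent posterior mean", the identities below spell them out under the rectified-flow parameterization common to flow-matching image models. That parameterization is an assumption here, since this README does not state it, and these are standard identities rather than the Precise transition itself, which is defined in the paper.

```latex
% Assumed convention (not stated in this README): t in (0, 1), with t = 1 pure noise.
x_t = (1 - t)\,x_0 + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% The network predicts a velocity whose ideal target is
v_\theta(x_t, t) \approx \mathbb{E}\left[\epsilon - x_0 \mid x_t\right]

% Clean-latent posterior mean implied by the velocity prediction
\hat{x}_0(x_t, t) = x_t - t\,v_\theta(x_t, t)

% Log signal-to-noise ratio of the interpolant
\lambda(t) = \log\frac{(1 - t)^2}{t^2} = 2\log\frac{1 - t}{t}
```

In this notation, freezing the clean-latent posterior mean means treating \hat{x}_0 as constant while the noisy state moves from one timestep to the next.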

Across FLUX.2 Klein experiments, Precise improves reward optimization speed and stability, reaches state-of-the-art alignment scores on in-domain rewards such as PickScore and HPSv2.1, and requires less wall-clock training time to match the best in-domain performance of prior samplers.

Training Curves

The main experiments compare Precise against Dance-GRPO, Flow-GRPO, and CPS under matched training recipes. Higher is better for all plotted rewards.

FLUX.2 Klein, 20 NFE

Setup

Install uv, then create the root environment from the repository root:

uv sync --project .

GPU training requires a CUDA-capable Linux environment compatible with the PyTorch and accelerator versions pinned in pyproject.toml.

Models

External model and reward weights are resolved through pinned Hugging Face checkpoints unless you configure a local model mirror. To use a local mirror, set the model root:

export PRECISE_SDE_MODEL_ROOT=/path/to/models
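A mistyped mirror path is easy to miss until a job is already running, so a small pre-flight check can help. The sketch below only tests that the variable, when set, points at an existing directory. The helper name `check_model_root` is introduced here for illustration and is not part of the repository, and the sketch does not assume anything about the layout expected inside the mirror.

```shell
# Hypothetical helper (not part of the repository): fail early when
# PRECISE_SDE_MODEL_ROOT is set but does not point at a directory.
check_model_root() {
  if [ -n "${PRECISE_SDE_MODEL_ROOT:-}" ] && [ ! -d "$PRECISE_SDE_MODEL_ROOT" ]; then
    echo "PRECISE_SDE_MODEL_ROOT is not a directory: $PRECISE_SDE_MODEL_ROOT" >&2
    return 1
  fi
  # Unset or valid: nothing to complain about.
  return 0
}

check_model_root && echo "model root check passed"
```

When the variable is unset the check passes, which matches the default of resolving weights from the pinned checkpoints.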

Pinned checkpoints:

Name	Hugging Face checkpoint	Revision
FLUX.2 Klein	`black-forest-labs/FLUX.2-klein-base-4B`	`a3b4f4849157f664bdbc776fd7453c2783562f4d`
CLIP	`openai/clip-vit-large-patch14`	`32bd64288804d66eefd0ccbe215aa642df71cc41`
OpenCLIP ViT-H	`laion/CLIP-ViT-H-14-laion2B-s32B-b79K`	`1c2b8495b28150b8a4922ee1c8edee224c284c0c`
PickScore	`yuvalkirstain/PickScore_v1`	`a4e4367c6dfa7288a00c550414478f865b875800`
HPSv2	`xswu/HPSv2`	`697403c78157020a1ae59d23f111aa58ced35b0a`
ImageReward	`zai-org/ImageReward`	`5736be03b2652728fb87788c9797b0570450ab72`
UnifiedReward v2	`CodeGoat24/UnifiedReward-2.0-qwen35-9b`	`f01548b009741e12ff9817ed91dba94701ed9579`
GenEval Mask2Former	`tsbpp/geneval_mask2former`	`22b5a198cedf6b45e45165cf1c865d58de4a2832`

Training

Use the launcher rather than calling trainer scripts directly:

bash launch/train.sh --flux --reward mix --sde precise --noise-level 1.5 --step 20

Supported model selectors:

--flux

Supported rewards:

mix
pickscore
geneval

Supported SDE modes:

precise
flow_grpo
cps
dance_grpo
dance_precise

The launcher selects the trainer, the config entrypoint, and the PRECISE_SDE_LAUNCH_* environment variables together. The FLUX.2 Klein config builder lives in config/flux2_klein.py; the shared trainer is precise_sde/train/rl_trainer.py.
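To compare samplers under a matched recipe, one option is to loop over --sde values while holding the other flags fixed. The dry-run sketch below reuses only the flags from the command above and prints the commands rather than running them. Whether --noise-level 1.5 suits every sampler is an assumption, and `sweep_commands` is a name introduced here for illustration.

```shell
# Dry run: print one launcher command per sampler instead of starting training.
# Assumption: the same reward, noise level, and step count are valid for every mode.
sweep_commands() {
  for sde in precise flow_grpo cps dance_grpo; do
    printf 'bash launch/train.sh --flux --reward mix --sde %s --noise-level 1.5 --step 20\n' "$sde"
  done
}

sweep_commands
```

Run the printed commands one at a time once they look right for your setup.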

Evaluation

eval/infer_eval.sh runs through the root uv project and supports FLUX.2 Klein checkpoints. Pass at least one checkpoint base explicitly:

bash eval/infer_eval.sh \
  --flux \
  --ckpt-base checkpoints/logs/run-name/checkpoints \
  --eval-config '1000|precise|0|20|pickscore|{"clipscore": 1.0}'

Use repeated --ckpt-base and --eval-config arguments to evaluate multiple runs in one invocation.
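As a sketch of what a multi-run invocation might look like, the snippet below builds the argument list for two runs and prints it instead of launching anything. The run names `run-a` and `run-b` are hypothetical, the eval config string is copied unchanged from the example above, and the one-to-one pairing of each --ckpt-base with an --eval-config is an assumption that this README does not spell out.

```shell
# Hypothetical run names; substitute the directories under checkpoints/logs/.
# Assumption: each --ckpt-base is followed by the --eval-config it should use.
EVAL_CONFIG='1000|precise|0|20|pickscore|{"clipscore": 1.0}'

# Accumulate arguments in the positional parameters so quoting is preserved.
set -- --flux
for run in run-a run-b; do
  set -- "$@" --ckpt-base "checkpoints/logs/$run/checkpoints" --eval-config "$EVAL_CONFIG"
done

# Dry run: show the arguments one per line.
printf '%s\n' "$@"
# To launch for real: bash eval/infer_eval.sh "$@"
```

Keeping the arguments in "$@" avoids re-quoting the eval config, which contains pipes, braces, and double quotes.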

Reward Server Setup

Remote reward services are intentionally isolated from the main training environment when they need heavyweight or version-sensitive dependencies.

GenEval

GenEval runs as a separate nested uv project:

bash precise_sde/rewards/servers/geneval/bootstrap.sh
bash precise_sde/rewards/servers/geneval/start_server.sh
uv run --project precise_sde/rewards/servers/geneval \
  python precise_sde/rewards/servers/geneval/check_server.py

The server binds to 127.0.0.1:18085 by default. Override the client URL with PRECISE_SDE_GENEVAL_URL.
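If the server is started in the background or in another terminal, training may begin before it is ready. One way to guard against that is a small retry wrapper around the check script. This is a sketch: `wait_for` and `WAIT_INTERVAL` are names introduced here, and it assumes check_server.py exits non-zero until the server answers, which this README does not confirm.

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# WAIT_INTERVAL (seconds between attempts) is a name introduced for this sketch.
wait_for() {
  attempts=$1
  shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "${WAIT_INTERVAL:-2}"
  done
  echo "gave up after $attempts attempts: $*" >&2
  return 1
}

# Intended use, assuming check_server.py exits non-zero until the server answers:
# wait_for 30 uv run --project precise_sde/rewards/servers/geneval \
#   python precise_sde/rewards/servers/geneval/check_server.py
```

With the defaults shown in the comment, the wrapper would wait roughly a minute before giving up.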

UnifiedReward

UnifiedReward v2 should run in a fresh conda environment rather than the repo root uv environment:

conda create -n vllm python=3.12 -y
conda activate vllm
pip install -r precise_sde/rewards/servers/unified_reward/requirements.txt
bash precise_sde/rewards/servers/unified_reward/start_server.sh

Probe a running server with:

uv run --project . python precise_sde/rewards/servers/unified_reward/test_api.py \
  --base-url http://127.0.0.1:8080 --tests 1,2,4

Point training at the server with PRECISE_SDE_UNIFIEDREWARD_URL or PRECISE_SDE_UNIFIEDREWARD_URLS. The exact request and response contract expected by training is documented in precise_sde/rewards/servers/unified_reward/README.md.

Acknowledgements

This repository builds on the Flow-GRPO training codebase and compares with Flow-GRPO, Dance-GRPO, and CPS samplers. The experiments use FLUX.2 Klein as the pretrained flow-matching backbone, and rely on community reward and evaluation models including PickScore, HPSv2.1, ImageReward, UnifiedReward v2, CLIP/OpenCLIP, and GenEval. We thank the authors and maintainers of these projects, as well as the Diffusers, Accelerate, PyTorch, and Hugging Face ecosystems that make this codebase possible.

BibTeX

@misc{zou2026precisesdeconsistentstochasticsampling,
  title={Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models},
  author={Jade Zou and Tao Huang and Weijie Kong and Junzhe Li and Yue Wu and Qi Tian and Jiangfeng Xiong and Jianwei Zhang and Liefeng Bo and Zhao Zhong},
  year={2026},
  eprint={2605.23522},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2605.23522},
}
