GEditBench v2 is a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond the predefined tasks. To reduce reliance on proprietary API-based evaluation, we further propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. In addition, we construct VCReward-Bench, comprising 3,506 expert-annotated preference pairs for evaluating image-editing assessment models along the visual-consistency dimension.
GEditBench_v2/
├── configs/
│   ├── datasets/                        # candidate pools, benchmark definitions
│   ├── lora_sft/                        # LoRA/VLM training configs
│   └── pipelines/                       # annotation and eval pipeline configs
├── data/
│   ├── a_raw_img_prompt_pair_data/      # raw source pairs before filtering
│   ├── b_filtered_img_prompt_pair_data/ # filtered subsets and generated candidate metadata
│   ├── c_annotated_group_data/          # grouped annotation outputs
│   ├── d_train_data/                    # pairwise training data for judge learning
│   ├── e_openedit_pair_res/             # GEditBench v2/OpenEdit pairwise evaluation results
│   ├── f_reward_results/                # reward/judge evaluation outputs
│   └── z_reward_bench/                  # benchmark assets and shuffled benchmark annotations
├── environments/                        # publishable env profiles and lock files
├── scripts/                             # installers and utility launchers
├── src/
│   ├── autogen/
│   ├── autopipeline/
│   ├── autotrain/
│   ├── cli/
│   ├── inference/
│   └── prompts/
└── vllm_deploy_scripts/                 # helper scripts for serving judge backends
The checked-in data/ tree is an open-source skeleton with README.md placeholders in every tracked folder. See data/README.md for the intended contents and file schemas.
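Purely as an illustration of the layout (the field names below are assumptions, not the actual schema; the authoritative schemas are in data/README.md), a record in a_raw_img_prompt_pair_data/ is a single JSONL line along the lines of:

{"task": "background_change", "instruction": "Replace the background with a snowy street", "source_image": "/path/to/source/images/000123.png"}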
GEditBench v2 exposes three primary CLIs:
| CLI | Scope | Representative Commands |
|---|---|---|
| `autogen` | data filtering and candidate generation | `filter`, `run candidates`, `run geditv2` |
| `autopipeline` | annotation, evaluation, and pair construction | `annotation`, `eval`, `train-pairs` |
| `autotrain` | LoRA VLM training launcher | top-level training entry |
In practical terms, the repository supports the following loop (sketched end to end after the list):
- Sample or filter diverse source image-instruction pairs.
- Generate multiple edited candidates with image-editing models.
- Annotate visual consistency automatically with task-specific pipelines.
- Convert grouped results into pairwise preference data.
- Train a VLM judge on those pairs.
- Evaluate the resulting judge on GEditBench v2 or reward benchmarks.
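Concretely, one pass through this loop maps onto the CLIs documented below. The sketch elides several required flags (model paths, user config, GPU options) that the per-step sections spell out:

# 1) sample/filter source image-instruction pairs
autogen filter --task <task> --input-file ./data/a_raw_img_prompt_pair_data/<task>.jsonl \
  --output-dir ./data/b_filtered_img_prompt_pair_data

# 2) generate edited candidates with an image-editing model
autogen run candidates --task <task> --model qwen-image-edit \
  --dataset-path ./data/b_filtered_img_prompt_pair_data

# 3) annotate visual consistency with a task-specific pipeline
autopipeline annotation --edit-task <task> \
  --pipeline-config-path configs/pipelines/object_centric/<task>.yaml \
  --save-path data/c_annotated_group_data

# 4) convert grouped results into pairwise preference data
autopipeline train-pairs --tasks <task> \
  --input-dir data/c_annotated_group_data --output-dir data/d_train_data

# 5) train the VLM judge on those pairs
autotrain --config qwen3_vl_8b_train --config-path configs/lora_sft

# 6) evaluate the judge on GEditBench v2 or reward benchmarks
autopipeline eval --bmk vc_reward --save-path data/f_reward_results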
- autogen CLI (env & usage)
- autopipeline CLI (env & usage)
- autotrain CLI (env & usage)
- PVC-Judge Inference
- Before using: Configuration You Should Update First
- Start Model Arena on GEditBench v2
# if filtering is needed, refer to https://github.com/QwenLM/Qwen3-VL-Embedding
git clone https://github.com/QwenLM/Qwen3-VL-Embedding.git
cd Qwen3-VL-Embedding
bash scripts/setup_environment.sh
# if only generating candidates
pip install git+https://github.com/huggingface/diffusers

Our autogen pipeline can be summarized as follows:
Step0: Download the open-source datasets from HuggingFace, e.g., Pico-Banana-400k, Nano-Consistency-150k, and UnicEdit-10M.
Step1: Prepare data pool
- For Pico-Banana-400k
python ./src/autogen/prepare_pico_data.py \
--task subject-add \
--output-dir ./data/a_raw_img_prompt_pair_data \
--image-save-path /path/to/save/source/images \
--path-to-pico-sft-jsonl /path/to/pico/sft.jsonl

- For Nano-Consistency-150k (--image-save-path is optional)
python ./src/autogen/prepare_nano_consistent_data.py \
--task background_change \
--path-to-nano-data /path/to/nano/consistency/data \
--output-dir ./data/a_raw_img_prompt_pair_data \
--image-save-path /path/to/save/source/images \
--sample-num 4000

- For UnicEdit-10M
python ./src/autogen/prepare_unicedit.py \
--path-to-uniedit-data path/to/XrZUMXM-/xiaotanhua/UnicEdit-10M/data \
--output-dir ./data/a_raw_img_prompt_pair_data \
--max-workers 100

Step2: Filter images using the autogen CLI
# (optional, or you can invoke the CLIs directly with `python -m src.cli.<tool>`)
./scripts/install_autogen.sh
# you can use `python -m src.cli.autogen --help` or autogen --help for detailed information
# filter (--task specifies the edit task of the input file)
autogen filter \
--sample-num 1500 \
--task background_change \
--input-file ./data/a_raw_img_prompt_pair_data/subject_add.jsonl \
--output-dir ./data/b_filtered_img_prompt_pair_data \
--qwen-embedding-model-path /path/to/qwen3/vl/embedding/model \
--image-save-path /path/to/the/clean/source/images \
--embedding-batch-size 256

Step3: Generate candidates using the autogen CLI
autogen run candidates \
--task subject-add \
--model qwen-image-edit \
--dataset-path ./data/b_filtered_img_prompt_pair_data \
--gpus-per-worker 1 \
--output-bucket-prefix /path/to/save/output/images

Step3.5: Generation for GEditBench v2
autogen run geditv2 \
--model qwen-image-edit \
--bench-path /path/to/GEditBench-v2 \
--image-save-dir /path/to/GEditBench-v2/candidates/gallery \
--gpus-per-worker 1 \
--merge-to-metadata /path/to/GEditBenchv2/candidates/gallery/metadata.jsonl # (optional) if you want to join the model comparison

In this work, we propose two novel region-decoupled preference data synthesis pipelines, termed object-centric and human-centric. Detailed documentation is provided here (written entirely by Codex, gpt-5.4 xhigh).
./scripts/install_autopipeline.sh # (optional)
conda env create -f environments/annotate.yml
conda activate annotate
# or:
python3.11 -m venv .venvs/annotate
source .venvs/annotate/bin/activate
python -m pip install -r environments/requirements/annotate.lock.txt

Annotate data using, e.g., the object-centric, human-centric, or vlm-as-a-judge pipelines.
First create the task-specific pipeline config in ./configs/pipelines/object_centric (or human_centric, vlm_as_a_judge), then run the annotation command below.
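As a rough, hypothetical illustration of the shape of such a config (the keys below are assumptions, not the repository's actual schema; check the files shipped under configs/pipelines/):

# hypothetical sketch of configs/pipelines/object_centric/subject_add.yaml
edit_task: subject_add
pipeline: object_centric
judge_backend:
  endpoint: http://localhost:25930/v1   # e.g. a vLLM server launched via vllm_deploy_scripts/
  model_name: PVC-Judge
output:
  save_path: data/c_annotated_group_data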
autopipeline annotation \
--edit-task subject_add \
--pipeline-config-path $(pwd)/configs/pipelines/object_centric/subject_add.yaml \
--save-path $(pwd)/data/c_annotated_group_data \
--user-config $(pwd)/configs/pipelines/user_config.yaml \
--candidate-pool-dir $(pwd)/configs/datasets/candidate_pools

The repository also includes helper scripts under vllm_deploy_scripts/*.sh for serving judge backends with vLLM.
Evaluate models on Reward benchmarks
autopipeline eval \
--bmk vc_reward \
--pipeline-config-path $(pwd)/configs/pipelines/vlm_as_a_judge/openai.yaml \
--user-config $(pwd)/configs/pipelines/user_config.yaml \
--save-path $(pwd)/data/f_reward_results \
--max-workers 200 \
--geditv2-metadata-file metadata.jsonl # (optional) for model comparison on GEditBench v2

Format conversion for PVC-Judge training: convert grouped results into preference pairs (an illustrative pair record is sketched after the command). Set --mode to "group" for the object- and human-centric pipelines or "judge" for the vlm-as-a-judge pipeline.
autopipeline train-pairs \
--tasks color_alter,material_alter \
--prompts-num 1500 \
--input-dir $(pwd)/data/c_annotated_group_data \
--output-dir $(pwd)/data/d_train_data \
--mode auto \
--filt-out-strategy head_tail \
--thresholds-config-file $(pwd)/configs/pipelines/data_construction_configs.json
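For orientation only (field names are illustrative assumptions, not the exact output format), each line written to data/d_train_data pairs two candidates for the same source image and instruction and marks the preferred one:

{"task": "color_alter", "instruction": "Make the car red", "source_image": "/path/to/src.png", "candidate_a": "/path/to/cand_a.png", "candidate_b": "/path/to/cand_b.png", "preferred": "a"}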
./scripts/install_autotrain.sh # (optional)
conda env create -f environments/train.yml
conda activate train
# or:
python3.12 -m venv .venvs/train
source .venvs/train/bin/activate
python -m pip install -r environments/requirements/train.lock.txt
python -m pip install -r environments/requirements/optional/train.txt

The training launcher resolves the YAML config in ./configs/lora_sft/, creates an output directory, and starts DeepSpeed on src/autotrain/train/train_sft_lora.py.
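As a purely hypothetical sketch of the kind of fields a config such as configs/lora_sft/qwen3_vl_8b_train.yaml covers (key names are assumptions; the launcher defines the real schema):

# hypothetical LoRA SFT config sketch
model_name_or_path: /path/to/Qwen3/VL/8B/Instruct
data_path: data/d_train_data
output_dir: outputs/pvc_judge_lora
lora:
  r: 64
  alpha: 128
training:
  per_device_train_batch_size: 2
  learning_rate: 1.0e-4
  num_train_epochs: 1
  bf16: true
deepspeed: configs/lora_sft/ds_zero2.json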
autotrain \
--config qwen3_vl_8b_train \
--config-path $(pwd)/configs/lora_sft \
--num-gpus 8

- Merge LoRA weights into the base model (required environment: torch/peft/transformers; a minimal peft-based sketch follows the command)
python ./scripts/merge_lora.py \
--base-model-path /path/to/Qwen3/VL/8B/Instruct \
--lora-weights-path /path/to/LoRA/Weights \
--model-save-dir /path/to/save/PVC/Judge/model
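scripts/merge_lora.py already does this; the snippet below is only a minimal peft-based sketch of the same idea (the model class and paths are placeholders, not the script's actual code):

# minimal sketch of merging LoRA adapters into the base VLM with peft
# not scripts/merge_lora.py itself; model class and paths are placeholders
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel

base_path = "/path/to/Qwen3/VL/8B/Instruct"
lora_path = "/path/to/LoRA/Weights"
save_dir = "/path/to/save/PVC/Judge/model"

base = AutoModelForVision2Seq.from_pretrained(base_path, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, lora_path)  # attach the trained LoRA adapters
merged = model.merge_and_unload()                   # fold the adapters into the base weights
merged.save_pretrained(save_dir)
AutoProcessor.from_pretrained(base_path).save_pretrained(save_dir)  # keep the processor with the merged model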
- Launch an online server via vLLM
python -m vllm.entrypoints.openai.api_server \
--model /path/to/save/PVC/Judge/model \
--served-model-name PVC-Judge \
--tensor-parallel-size 1 \
--mm-encoder-tp-mode data \
--limit-mm-per-prompt.video 0 \
--host 0.0.0.0 \
--port 25930 \
--dtype bfloat16 \
--gpu-memory-utilization 0.80 \
--max-num-seqs 32 \
--max-model-len 48000 \
--distributed-executor-backend mp

- Use autopipeline for inference (see the autopipeline usage above), or query the served endpoint directly as sketched below.
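Because vLLM exposes an OpenAI-compatible API, you can also sanity-check the served judge directly. The prompt below is illustrative only; the real judge prompts live under src/prompts/:

# quick sanity check against the vLLM server above (port and model name from the launch command)
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:25930/v1", api_key="EMPTY")

def to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="PVC-Judge",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Which edit better preserves the source image? Answer A or B."},
            {"type": "image_url", "image_url": {"url": to_data_url("source.png")}},
            {"type": "image_url", "image_url": {"url": to_data_url("candidate_a.png")}},
            {"type": "image_url", "image_url": {"url": to_data_url("candidate_b.png")}},
        ],
    }],
)
print(resp.choices[0].message.content)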
# For local judge inference
conda env create -f environments/pvc_judge.yml
conda activate pvc_judge
# or:
python3.12 -m venv .venvs/pvc_judge
source .venvs/pvc_judge/bin/activate
python -m pip install -r environments/requirements/pvc_judge.lock.txt
# Run
bash ./scripts/local_eval.sh vc_reward

Before the first run, review these files and replace internal defaults with your own paths and endpoints (a hypothetical user_config.yaml sketch follows the checklist):
- configs/pipelines/user_config.yaml
- configs/datasets/bmk.json
- configs/lora_sft/*.yaml
- src/autogen/constants.py
At minimum, check:
- model checkpoint roots
- benchmark/data roots
- vLLM or API endpoints
- credentials and API keys
- output directories
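As a purely hypothetical example of the kind of values involved (key names are assumptions; the authoritative keys are whatever configs/pipelines/user_config.yaml already defines):

# hypothetical user_config.yaml sketch
model_roots:
  pvc_judge: /path/to/save/PVC/Judge/model
benchmark_roots:
  geditv2: /path/to/GEditBench-v2
endpoints:
  judge_api: http://localhost:25930/v1
api_keys:
  openai: ${OPENAI_API_KEY}
output_dir: ./data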
Step0: Download GEditBench-v2-CandidatesGallery
Step1: Generate images on GEditBench v2 with autogen (or your own scripts) into the candidates gallery folder and merge the generation information into the metadata:
autogen run geditv2 \
--model qwen-image-edit \
--bench-path /path/to/GEditBench-v2 \
--image-save-dir /path/to/GEditBench-v2/candidates/gallery \
--gpus-per-worker 1 \
--merge-to-metadata /path/to/GEditBenchv2/candidates/gallery/metadata.jsonl

Then, you will get:
GEditBench-v2-CandidatesGallery/
├── BAGEL/
├── Step1X_Edit_v1p2/
├── ...
├── Your_model_here/             <- images generated by your model
│   ├── background_change_000000.png
│   ├── background_change_000001.png
│   └── ...
├── metadata.jsonl
└── metadata_{timestamp}.jsonl   <- the merged metadata
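Each line of the merged metadata ties one benchmark sample to the candidates produced by each model. As an illustration only (field names are assumptions, not the actual format):

{"sample_id": "background_change_000000", "task": "background_change", "instruction": "Change the background to a beach at sunset", "candidates": {"BAGEL": "BAGEL/background_change_000000.png", "Your_model_here": "Your_model_here/background_change_000000.png"}}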
Step2: Run pairwise comparison using autopipeline
autopipeline eval \
--bmk geditv2 \
--pipeline-config-path $(pwd)/configs/pipelines/vlm_as_a_judge/openai.yaml \
--user-config $(pwd)/configs/pipelines/user_config.yaml \
--save-path $(pwd)/data/e_geditv2_pair_res \
--max-workers 200 \
--geditv2-metadata-file /path/to/GEditBenchv2/candidates/gallery/metadata_{timestamp}.jsonl

Then, you will get the comparison results in ./data/e_geditv2_pair_res/geditv2/eval_xx_meta_data_{timestamp}/{new_timestamp}.jsonl
Step3: Compute the Elo Score
# you can use the bash script
bash ./scripts/elo.sh 1000 # bootstrap iteration number
# or you can use the Python script directly; --alpha configures the Bradley-Terry model (see the conceptual sketch below)
python ./src/common_utils/elo_score.py \
--result-files "/absolute/path/to/data/e_geditv2_pair_res/geditv2/eval_xx_meta_data_{timestamp}/{new_timestamp}.jsonl" \
--bootstrap 1000 \
--alpha 1 \
--dimension-weighting "balanced" \
--seed 42
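Conceptually, the script fits Bradley-Terry-style ratings to the pairwise outcomes and bootstraps confidence intervals over them. The sketch below shows the core idea only; it is not src/common_utils/elo_score.py, and the (winner, loser) record format is assumed:

# conceptual Bradley-Terry rating from pairwise wins, with bootstrap intervals
import math, random
from collections import defaultdict

def bt_ratings(pairs, iters=200, lr=0.1):
    """pairs: list of (winner, loser) model names; returns log-strengths per model."""
    theta = defaultdict(float)
    for _ in range(iters):
        for w, l in pairs:
            p_w = 1.0 / (1.0 + math.exp(theta[l] - theta[w]))  # P(winner beats loser)
            grad = 1.0 - p_w                                    # gradient of the log-likelihood
            theta[w] += lr * grad
            theta[l] -= lr * grad
    return dict(theta)

def bootstrap_intervals(pairs, n_boot=1000, seed=42):
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n_boot):
        resample = [rng.choice(pairs) for _ in pairs]           # resample battles with replacement
        for model, score in bt_ratings(resample).items():
            samples[model].append(score)
    return {m: (sorted(s)[int(0.025 * len(s))], sorted(s)[int(0.975 * len(s))])
            for m, s in samples.items()}                        # rough 95% intervals

battles = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
print(bt_ratings(battles))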
We would like to express our sincere gratitude to the following projects: the majority of the editing instructions in our open-set category were sourced from these excellent repositories. Thank you for your amazing contributions to the open-source community!