VT-Bench is a unified benchmark for vision–tabular multimodal learning, designed to evaluate multimodal models under two major paradigms:
- Discriminative prediction: image + tabular record → class / value
- Generative reasoning: image + table/database + question → free-form answer
Figure: Two paradigms in vision–tabular multimodal learning and the cross-modal grounding challenge in generative reasoning.
VT-Bench provides:
- Broad model coverage: vision-only models, tabular-only models, vision–tabular fusion models, and general-purpose VLMs
- Diverse real-world scenarios: 756K paired samples spanning 9 domains, covering both prediction and reasoning tasks
- Unified evaluation: consistent preprocessing, standardized interfaces, and optional modality diagnostic metrics for prediction
Clone the repository:
git clone https://anonymous.4open.science/r/VT-Bench-13C2.git
cd VT-Bench
Create and activate the conda environment:
conda env create --file environment.yaml
conda activate vt_bench
Due to version incompatibilities, if you want to fine-tune vision-language models (VLMs), please use environment_vlm.yaml instead:
conda env create --file environment_vlm.yaml
conda activate vt_bench_vlm
Additionally, you need to download LLaMA-Factory:
git clone https://github.com/hiyouga/LLaMA-Factory.git
You need to specify five arguments: task, dataset, model, setting, and diagnostics.
- task: prediction or reasoning
- dataset: choose a dataset identifier from the Datasets section
- model: choose a model name from the Models section
  - For VLMs, use the full Hugging Face identifier, e.g. Qwen/Qwen3-VL-8B-Instruct
- setting: controls the evaluation configuration (varies across datasets; see Datasets)
- diagnostics: whether to compute additional modality diagnostic metrics for prediction tasks
  - options: none / mcr / mir / full
  - MCR can also be computed for a specified checkpoint
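As a rough illustration of how these five arguments fit together, the sketch below mirrors them with `argparse`. It is not the actual parser in run.py: the function name `build_parser`, the defaults, and the way `--checkpoint` is declared are assumptions based only on the argument descriptions and example commands in this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the run.py command line (assumed, not the real parser)."""
    parser = argparse.ArgumentParser(description="VT-Bench style arguments (sketch)")
    parser.add_argument("--task", required=True, choices=["prediction", "reasoning"])
    parser.add_argument("--dataset", required=True, help="dataset identifier, e.g. skin")
    parser.add_argument("--model", required=True, help="model name or Hugging Face identifier")
    # Assumed default: the example commands pass "none" when no setting applies.
    parser.add_argument("--setting", default="none")
    parser.add_argument("--diagnostics", default="none", choices=["none", "mcr", "mir", "full"])
    # Optional checkpoint path, used when computing MCR for a given checkpoint.
    parser.add_argument("--checkpoint", default=None)
    return parser


args = build_parser().parse_args(
    ["--task", "prediction", "--dataset", "skin", "--model", "TIP", "--diagnostics", "mcr"]
)
print(args.task, args.dataset, args.model, args.setting, args.diagnostics)
```

Declaring the fixed option sets as `choices` makes a mistyped task or diagnostics value fail at parse time rather than partway through a run.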
Prediction example:
python run.py --task prediction --dataset skin --model TIP --setting none --diagnostics mcr
Compute MCR for a specific checkpoint:
python run.py --task prediction --dataset skin --model TIP --setting none --diagnostics mcr --checkpoint {YOUR_PRETRAINED_CKPT_PATH}
Reasoning example:
python run.py --task reasoning --dataset ehrxqa --model Qwen/Qwen3-VL-8B-Instruct --setting full --diagnostics none
For API models (GPT-4.1 / Gemini-3-Flash-Preview), export the API keys before running:
export OPENAI_API_KEY='your_openai_api_key'
export GOOGLE_API_KEY='your_google_api_key'

| Dataset | String Identifier | Availability | Source |
|---|---|---|---|
| Skin Cancer | skin | Public | Kaggle |
| Breast Cancer | breast | Public | Kaggle |
| Adoption | adoption | Public | Kaggle |
| CelebA | celebA | Public | Kaggle |
| DVM-Car | dvm | Public | DVM-Car |
| Pawpularity | pawpularity | Public | Kaggle |
| Anime | anime | Public | Kaggle |
| Pneumonia | pneumonia | Constructed | MIMIC-IV & MIMIC-CXR |
| Los | los | Constructed | MIMIC-IV & MIMIC-CXR |
| Respiratory Rate | rr | Constructed | MIMIC-IV & MIMIC-CXR |
The three constructed datasets (pneumonia, los, rr) are built from MIMIC-CXR-JPG (v2.0.0) and MIMIC-IV (v2.2). These source datasets require a credentialed PhysioNet license. Due to the Data Use Agreement (DUA), only credentialed users can access the source data.
To access the source datasets, you must:
- Become a credentialed PhysioNet user
  - If you do not have a PhysioNet account, register for one here.
  - Follow these instructions for credentialing on PhysioNet.
  - Complete the "CITI Data or Specimens Only Research" training course.
- Sign the data use agreement (DUA) for each project
After obtaining access:
- The construction code is under: dataset/Constructed_datasets/
- Update the file paths in built_classification.py and built_regression.py
- Download images via download_newdataset_image.sh
All preprocessing scripts for prediction datasets are under dataset/.
- Processing steps for the DVM dataset can be found here.
| Dataset | String Identifier | Availability | Source |
|---|---|---|---|
| DVM-Car QA | dvm | Constructed | Kaggle |
| MMQA | mmqa | Public | MMQA |
| EHRXQA | ehrxqa | Public (credentialed access) | EHRXQA |
- DVM-Car QA generation script: reasoning/DVM_QA/generate_qa.py.
- setting for DVM-Car QA should be a list specifying the tasks to evaluate:
  - "loc": Row Localization
  - "attr": Attribute Retrieval
  - "count": Constrained Counting
  - "mean": Conditional Mean
  - Example: --setting '["loc","attr","count","mean"]'
- MMQA: see the official project page (MMQA). Only full evaluation is supported.
- EHRXQA: see the official repo (EHRXQA). Supported settings: full / stage1 / stage2.
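The setting conventions for the three reasoning datasets can be checked before launching a run. The sketch below is a hypothetical helper, not part of VT-Bench: the name `parse_reasoning_setting` and the error messages are assumptions, and only the task names and setting values come from this README. It assumes the DVM-Car QA setting is passed as a JSON list string, as in the `--setting` example.

```python
import json

# Task names listed for DVM-Car QA and the settings listed for MMQA and EHRXQA.
DVM_QA_TASKS = ("loc", "attr", "count", "mean")
NAMED_SETTINGS = {"mmqa": ("full",), "ehrxqa": ("full", "stage1", "stage2")}


def parse_reasoning_setting(dataset: str, setting: str):
    """Hypothetical helper: validate a --setting string for a reasoning dataset."""
    if dataset == "dvm":
        tasks = json.loads(setting)  # e.g. '["loc","attr"]' -> ["loc", "attr"]
        if not isinstance(tasks, list) or not tasks:
            raise ValueError("DVM-Car QA expects a non-empty JSON list of tasks")
        unknown = [t for t in tasks if t not in DVM_QA_TASKS]
        if unknown:
            raise ValueError(f"unknown DVM-Car QA tasks: {unknown}")
        return tasks
    if dataset not in NAMED_SETTINGS:
        raise ValueError(f"unknown reasoning dataset: {dataset}")
    if setting not in NAMED_SETTINGS[dataset]:
        raise ValueError(f"{dataset} supports only {NAMED_SETTINGS[dataset]}")
    return setting


print(parse_reasoning_setting("dvm", '["loc","attr","count","mean"]'))
print(parse_reasoning_setting("ehrxqa", "stage1"))
```

On the shell, the single quotes around the JSON list keep the inner double quotes intact, which is why the example setting is written as `'["loc","attr","count","mean"]'`.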
To integrate a new dataset, please provide a custom script that follows VT-Bench’s standard preprocessing pipeline:
- Filter samples with missing labels or images, then split into train/val/test
- Process tabular features:
  - impute missing values (mean for numeric; "MISSING" for categorical)
  - encode categorical features
  - normalize/standardize numeric features
- Process images:
  - resize to 224 × 224
  - store as .npy
- Export model-ready files:
  - .csv feature files
  - .pt tensor files following the naming conventions used in existing datasets
A runnable example is provided here: dataset/skin_cancer.ipynb
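For orientation, the split and tabular steps of the pipeline can be sketched with NumPy as follows. This is a minimal illustration, not the code in the notebook: the function names, the 70/15/15 split fractions, the choice to compute statistics on the training split only, and the fallback for unseen categories are all assumptions. Image resizing and file export are omitted.

```python
import numpy as np


def split_indices(n, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle sample indices and cut them into train/val/test (fractions are assumed)."""
    order = np.random.default_rng(seed).permutation(n)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


def impute_and_standardize(train_col, col):
    """Mean-impute NaNs, then standardize, using statistics of the training split."""
    train_col = np.asarray(train_col, dtype=float)
    col = np.asarray(col, dtype=float)
    mean = np.nanmean(train_col)
    std = np.nanstd(train_col)
    filled = np.where(np.isnan(col), mean, col)
    return (filled - mean) / (std if std > 0 else 1.0)


def encode_categorical(train_col, col):
    """Fill missing entries with "MISSING", then map each category to an integer code."""
    def fill(values):
        return ["MISSING" if v is None else v for v in values]
    vocab = {c: i for i, c in enumerate(sorted(set(fill(train_col)) | {"MISSING"}))}
    # Categories unseen in training fall back to the "MISSING" code (an assumption).
    return np.array([vocab.get(v, vocab["MISSING"]) for v in fill(col)])


train_idx, val_idx, test_idx = split_indices(10)
age = impute_and_standardize([1.0, np.nan, 3.0], [1.0, np.nan, 3.0])
sex = encode_categorical(["a", None, "b"], ["b", None, "c"])
print(age, sex)
```

Fitting the mean, standard deviation, and category vocabulary on the training split and reusing them for validation and test data avoids leaking test statistics into preprocessing. Whether the existing dataset scripts do this should be checked against the scripts under dataset/.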
VT-Bench supports comprehensive evaluation of vision-tabular multi-modal learning models, including vision unimodal models, tabular unimodal models, vision-tabular multi-modal fusion models, and general-purpose vision-language models (VLMs). For discriminative prediction tasks, we implement adaptive hyperparameter optimization based on the Optuna framework. For generative reasoning tasks, VLMs are evaluated in a zero-shot setting using their official pretrained checkpoints.
- ResNet-50: A 50-layer deep residual learning architecture that introduces identity skip connections to ease optimization and improve accuracy for large-scale visual recognition.
- ViT-B/16: A pure Transformer architecture for vision that represents images as sequences of fixed-size patches and applies self-attention directly over patch tokens, achieving strong image classification performance with compute-efficient training.
- LightGBM: An efficient Gradient Boosting Decision Tree implementation optimized for large-scale, high-dimensional data, using Gradient-based One-Side Sampling and Exclusive Feature Bundling to improve training speed and scalability.
- TabTransformer: A Transformer-based architecture that contextualizes categorical feature embeddings via self-attention, producing more informative and robust representations for tabular learning.
- TabPFN v2: A tabular foundation model trained on millions of synthetic datasets to perform general-purpose inference for supervised prediction on tables, frequently outperforming heavily tuned gradient-boosted tree ensembles with lower training overhead.
- Concat: A parameter-efficient multi-task deep model that integrates structural imaging with demographic and clinical features through simple feature concatenation.
- MAX: A multi-modal deep learning framework that learns modality-specific representations from clinical variables and imaging data, then fuses them using an element-wise maximum operation.
- MUL: An integrative CNN that uses channel-wise multiplicative fusion between imaging and non-imaging streams, yielding better performance than simple concatenation baselines.
- DAFT: A lightweight conditioning module for CNNs that fuses imaging features with tabular variables by dynamically predicting per-channel scale and shift parameters and applying affine transforms to convolutional feature maps.
- CHARMS: A cross-modal knowledge transfer method that aligns image channels with tabular features via optimal transport and mutual-information maximization, enabling selective transfer of visually relevant signals.
- MMCL: A multi-modal contrastive pretraining framework that leverages paired imaging and tabular data to learn strong unimodal encoders, combining SimCLR-style image contrast with SCARF-style tabular augmentations.
- TIP: A self-supervised tabular-image pre-training framework designed for multi-modal classification under incomplete tabular inputs, combining masked tabular reconstruction with image-tabular matching and contrastive objectives.
- Table-LLaVA-v1.5-7B: A multi-modal table understanding model that answers table-centric instructions directly from table images, avoiding reliance on serialized formats and substantially outperforming recent open-source multi-modal baselines.
- Qwen3-VL-8B-Instruct: An instruction-tuned variant of the Qwen3-VL family that supports interleaved text-image-video inputs with a native 256K-token context window for general-purpose multi-modal understanding and generation.
- Qwen3-VL-8B-Thinking: A reasoning-oriented variant of Qwen3-VL that emphasizes stronger multi-step multi-modal reasoning across single-image, multi-image, and video tasks with enhanced spatiotemporal modeling capabilities.
- InternVL3-8B: An open-source multi-modal LLM that performs native multi-modal pre-training, jointly learning vision and language from mixed multi-modal data and pure-text corpora in a single stage.
- GLM-4.1V-9B-Thinking: A reasoning-oriented vision-language model trained with a large-scale multi-modal pre-trained vision backbone and curriculum-based reinforcement learning to enhance general-purpose multi-modal reasoning.
- Llama-3.2-11B-Vision-Instruct: A multi-modal large language model that integrates visual reasoning capabilities into the Llama 3 architecture using a separately trained vision adapter with cross-attention layers.
- Pixtral-12B: A 12B vision-language model trained for both natural images and documents, supporting flexible visual tokenization and handling multiple images within a long 128K-token context window.
- GPT-4.1: A flagship large language model optimized for superior coding performance, instruction following, and long-context processing with a 1-million-token context window.
- Gemini-3-Flash-Preview: A lightweight Gemini-family model optimized for low-latency, cost-effective inference while retaining strong general-purpose reasoning and multi-modal capability.
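To make the difference between the simple fusion baselines concrete, the sketch below shows the Concat, MAX, and MUL operators on feature vectors, plus a DAFT-style per-channel affine transform on a feature map. These are simplified NumPy illustrations of the operations named above, not the implementations used in VT-Bench or in the original papers. The function names are hypothetical, and in DAFT the scale and shift would be predicted from the tabular variables by a small network rather than passed in directly.

```python
import numpy as np


def fuse(img_feat, tab_feat, mode):
    """Combine an image feature vector and a tabular feature vector."""
    if mode == "concat":  # Concat: stack both vectors along the feature axis
        return np.concatenate([img_feat, tab_feat], axis=-1)
    if mode == "max":  # MAX: element-wise maximum (requires equal dimensions)
        return np.maximum(img_feat, tab_feat)
    if mode == "mul":  # MUL: element-wise product (requires equal dimensions)
        return img_feat * tab_feat
    raise ValueError(f"unknown fusion mode: {mode}")


def channel_affine(feature_map, scale, shift):
    """DAFT-style step: per-channel scale and shift applied to a (C, H, W) feature map."""
    return feature_map * scale[:, None, None] + shift[:, None, None]


img = np.array([1.0, 5.0])
tab = np.array([3.0, 2.0])
print(fuse(img, tab, "concat"), fuse(img, tab, "max"), fuse(img, tab, "mul"))

fmap = np.ones((2, 4, 4))
out = channel_affine(fmap, scale=np.array([2.0, 3.0]), shift=np.array([1.0, -1.0]))
print(out[:, 0, 0])
```

Concatenation keeps both representations intact and leaves their interaction to later layers, while the maximum and product operators require the two feature vectors to share a dimension.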
- For generative reasoning, add the Hugging Face / API calling logic in the corresponding evaluator scripts for each reasoning dataset.
- For discriminative prediction, add a new model function under prediction/models/.
We would like to thank the following repositories for their great work: