VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

Introduction

VT-Bench is a unified benchmark for vision–tabular multimodal learning, designed to evaluate multimodal models under two major paradigms:

Discriminative prediction: image + tabular record → class / value
Generative reasoning: image + table/database + question → free-form answer

Figure: Two paradigms in vision–tabular multi-modal learning and the cross-modal grounding challenge in generative reasoning.

VT-Bench provides:

Broad model coverage: vision-only models, tabular-only models, vision–tabular fusion models, and general-purpose VLMs
Diverse real-world scenarios: 756K paired samples spanning 9 domains, covering both prediction and reasoning tasks
Unified evaluation: consistent preprocessing, standardized interfaces, and optional modality diagnostic metrics for prediction

Quickstart

1) Download

Clone the repository:

git clone https://anonymous.4open.science/r/VT-Bench-13C2.git
cd VT-Bench-13C2

2) Environment Setup

Create and activate the conda environment:

conda env create --file environment.yaml
conda activate vt_bench

Fine-tuning Vision-Language Models requires package versions that are incompatible with the default environment, so use environment_vlm.yaml for that purpose instead:

conda env create --file environment_vlm.yaml
conda activate vt_bench_vlm

Additionally, you need to download LLaMA-Factory:

git clone https://github.com/hiyouga/LLaMA-Factory.git

3) Run Evaluation

You need to specify five arguments: task, dataset, model, setting, and diagnostics.

task: prediction or reasoning
dataset: choose a dataset identifier from the Datasets section
model: choose a model name from the Models section
- For VLMs, use the Hugging Face full identifier, e.g. Qwen/Qwen3-VL-8B-Instruct
setting: controls evaluation configuration (varies across datasets; see Datasets)
diagnostics: whether to compute additional modality diagnostic metrics for prediction tasks
- options: none / mcr / mir / full
- we also support computing MCR for a specified checkpoint
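
To make the argument list above concrete, here is a minimal sketch of how such a command line could be parsed with Python's standard argparse module. It is not the actual contents of run.py: the function names build_parser and parse_cli are hypothetical, and rejecting diagnostics for reasoning tasks is an assumption based on the description above, which only defines diagnostics for prediction.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the five documented arguments."""
    parser = argparse.ArgumentParser(description="VT-Bench evaluation (illustrative sketch)")
    parser.add_argument("--task", choices=["prediction", "reasoning"], required=True)
    parser.add_argument("--dataset", required=True, help="dataset identifier, e.g. skin")
    parser.add_argument("--model", required=True, help="model name or Hugging Face identifier")
    parser.add_argument("--setting", required=True, help="dataset-specific evaluation setting")
    parser.add_argument("--diagnostics", choices=["none", "mcr", "mir", "full"], required=True)
    parser.add_argument("--checkpoint", default=None, help="optional checkpoint path for MCR")
    return parser


def parse_cli(argv):
    """Parse argv and apply one assumed cross-argument check."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: diagnostics are only meaningful for prediction tasks.
    if args.task == "reasoning" and args.diagnostics != "none":
        parser.error("diagnostics are only defined for prediction tasks")
    return args


args = parse_cli(["--task", "prediction", "--dataset", "skin", "--model", "TIP",
                  "--setting", "none", "--diagnostics", "mcr"])
print(args.task, args.dataset, args.model, args.setting, args.diagnostics)
```

Restricting --task and --diagnostics with choices means a mistyped value fails immediately with a usage message rather than partway through a run.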

Example: Discriminative Prediction

python run.py --task prediction --dataset skin --model TIP --setting none --diagnostics mcr

Compute MCR for a specific checkpoint:

python run.py --task prediction --dataset skin --model TIP --setting none --diagnostics mcr --checkpoint {YOUR_PRETRAINED_CKPT_PATH}

Example: Generative Reasoning

python run.py --task reasoning --dataset ehrxqa --model Qwen/Qwen3-VL-8B-Instruct --setting full --diagnostics none

For API Models (GPT-4.1 / Gemini-3-Flash-Preview), please export API keys before running:

export OPENAI_API_KEY='your_openai_api_key'
export GOOGLE_API_KEY='your_google_api_key'
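
A quick pre-flight check can catch a missing key before a long evaluation starts. The sketch below is illustrative only: the REQUIRED_KEYS mapping and the missing_api_key helper are hypothetical names, and the pairing of each model with its variable is an assumption inferred from the two exports above.

```python
import os

# Assumed mapping from API model name to the environment variable exported above.
REQUIRED_KEYS = {
    "GPT-4.1": "OPENAI_API_KEY",
    "Gemini-3-Flash-Preview": "GOOGLE_API_KEY",
}


def missing_api_key(model, env=None):
    """Return the name of the unset variable for `model`, or None if nothing is missing."""
    env = os.environ if env is None else env
    var = REQUIRED_KEYS.get(model)
    if var is None:
        # Not an API model in this sketch, so no key is needed.
        return None
    # An empty string counts as missing.
    return None if env.get(var) else var


problem = missing_api_key("GPT-4.1")
if problem is not None:
    print(f"Please export {problem} before running.")
```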

Datasets

Discriminative Prediction Datasets

Public Datasets

Dataset	String Identifier	Availability	Source
Skin Cancer	`skin`	Public	Kaggle
Breast Cancer	`breast`	Public	Kaggle
Adoption	`adoption`	Public	Kaggle
CelebA	`celebA`	Public	Kaggle
DVM-Car	`dvm`	Public	DVM-Car
Pawpularity	`pawpularity`	Public	Kaggle
Anime	`anime`	Public	Kaggle
Pneumonia	`pneumonia`	Constructed	MIMIC-IV & MIMIC-CXR
LOS	`los`	Constructed	MIMIC-IV & MIMIC-CXR
Respiratory Rate	`rr`	Constructed	MIMIC-IV & MIMIC-CXR

Constructed Datasets

The three constructed datasets (pneumonia, los, rr) are built from MIMIC-CXR-JPG (v2.0.0) and MIMIC-IV (v2.2). These source datasets require a credentialed PhysioNet license. Due to the Data Use Agreement (DUA), only credentialed users can access the source data.

To access the source datasets, you must:

Become a credentialed PhysioNet user
- If you do not have a PhysioNet account, register for one here.
- Follow these instructions for credentialing on PhysioNet.
- Complete the "CITI Data or Specimens Only Research" training course.
Sign the data use agreement (DUA) for each project
- https://physionet.org/sign-dua/mimic-cxr-jpg/2.0.0/
- https://physionet.org/sign-dua/mimiciv/2.2/

After obtaining access:

The construction code is under: dataset/Constructed_datasets/
Update the file paths in built_classification.py and built_regression.py
Download images via download_newdataset_image.sh

Preprocessing Scripts

All preprocessing scripts for prediction datasets are under dataset/.

Processing steps for the DVM dataset can be found here.

Generative Reasoning Datasets

Dataset	String Identifier	Availability	Source
DVM-Car QA	`dvm`	Constructed	Kaggle
MMQA	`mmqa`	Public	MMQA
EHRXQA	`ehrxqa`	Public (credentialed access)	EHRXQA

DVM-Car QA

Generation script: reasoning/DVM_QA/generate_qa.py.
setting for DVM-Car QA should be a list specifying the tasks to evaluate:
- "loc": Row Localization
- "attr": Attribute Retrieval
- "count": Constrained Counting
- "mean": Conditional Mean

Example:

--setting '["loc","attr","count","mean"]'
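
The setting string in the example is valid JSON, so one plausible way to turn it into a task list is json.loads followed by validation against the four task codes listed above. This is a sketch under that assumption; run.py may parse the value differently, and parse_dvm_setting is a hypothetical helper name.

```python
import json

# Task codes listed above for DVM-Car QA.
DVM_QA_TASKS = {"loc", "attr", "count", "mean"}


def parse_dvm_setting(raw):
    """Parse a --setting string such as '["loc","attr"]' into a validated list."""
    tasks = json.loads(raw)
    if not isinstance(tasks, list) or not tasks:
        raise ValueError("setting must be a non-empty JSON list")
    unknown = [t for t in tasks if t not in DVM_QA_TASKS]
    if unknown:
        raise ValueError(f"unknown DVM-Car QA tasks: {unknown}")
    return tasks


print(parse_dvm_setting('["loc","attr","count","mean"]'))
```

Note the single quotes around the whole value on the command line: they keep the shell from stripping the inner double quotes that JSON requires.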

MMQA

See the official project page: MMQA
Only supports full evaluation.

EHRXQA

See the official repo: EHRXQA
Supported settings: full / stage1 / stage2

How to Add New Datasets

To integrate a new dataset, please provide a custom script that follows VT-Bench’s standard preprocessing pipeline:

Filter samples with missing labels or images, then split into train/val/test
Process tabular features:
- impute missing values (mean for numeric; "MISSING" for categorical)
- encode categorical features
- normalize/standardize numeric features
Process images:
- resize to 224 × 224
- store as .npy
Export model-ready files:
- .csv feature files
- .pt tensor files following the naming conventions used in existing datasets

A runnable example is provided here: dataset/skin_cancer.ipynb
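
The tabular steps of the pipeline above can be sketched for a single column as follows. This is a dependency-free illustration, not the code from the notebook: preprocess_column is a hypothetical helper, integer coding is only one possible categorical encoding, and missing values are assumed to be represented as None. In a real script the mean, standard deviation, and category codes would normally be fitted on the training split and reused for validation and test.

```python
from statistics import mean, pstdev


def preprocess_column(values, categorical):
    """Impute, then encode (categorical) or standardize (numeric) one column."""
    if categorical:
        # Missing categories become the literal token "MISSING".
        filled = ["MISSING" if v is None else v for v in values]
        # Integer codes assigned in sorted order so the mapping is deterministic.
        codes = {cat: i for i, cat in enumerate(sorted(set(filled)))}
        return [codes[v] for v in filled]
    # Numeric column: impute with the mean of the observed values.
    observed = [v for v in values if v is not None]
    mu = mean(observed)
    filled = [mu if v is None else v for v in values]
    # Guard against constant columns, where the standard deviation is zero.
    sigma = pstdev(filled) or 1.0
    return [(v - mu) / sigma for v in filled]


ages = preprocess_column([30.0, None, 50.0], categorical=False)
sexes = preprocess_column(["female", None, "male"], categorical=True)
print(ages)
print(sexes)
```

Image resizing and the export to .npy, .csv, and .pt files are left out here because they depend on libraries and naming conventions defined by the existing dataset scripts.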

Models

VT-Bench supports comprehensive evaluation of vision-tabular multi-modal learning models, including vision unimodal models, tabular unimodal models, vision-tabular multi-modal fusion models, and general-purpose vision-language models (VLMs). For discriminative prediction tasks, we implement adaptive hyperparameter optimization based on the Optuna framework. For generative reasoning tasks, VLMs are evaluated in a zero-shot setting using their official pretrained checkpoints.

Vision Unimodal Models

ResNet-50: A 50-layer deep residual learning architecture that introduces identity skip connections to ease optimization and improve accuracy for large-scale visual recognition.
ViT-B/16: A pure Transformer architecture for vision that represents images as sequences of fixed-size patches and applies self-attention directly over patch tokens, achieving strong image classification performance with compute-efficient training.

Tabular Unimodal Models

LightGBM: An efficient Gradient Boosting Decision Tree implementation optimized for large-scale, high-dimensional data, using Gradient-based One-Side Sampling and Exclusive Feature Bundling to improve training speed and scalability.
TabTransformer: A Transformer-based architecture that contextualizes categorical feature embeddings via self-attention, producing more informative and robust representations for tabular learning.
TabPFN v2: A tabular foundation model trained on millions of synthetic datasets to perform general-purpose inference for supervised prediction on tables, frequently outperforming heavily tuned gradient-boosted tree ensembles with lower training overhead.

Vision-Tabular Multi-Modal Models

Concat: A parameter-efficient multi-task deep model that integrates structural imaging with demographic and clinical features through simple feature concatenation.
MAX: A multi-modal deep learning framework that learns modality-specific representations from clinical variables and imaging data, then fuses them using an element-wise maximum operation.
MUL: An integrative CNN that uses channel-wise multiplicative fusion between imaging and non-imaging streams, yielding better performance than simple concatenation baselines.
DAFT: A lightweight conditioning module for CNNs that fuses imaging features with tabular variables by dynamically predicting per-channel scale and shift parameters and applying affine transforms to convolutional feature maps.
CHARMS: A cross-modal knowledge transfer method that aligns image channels with tabular features via optimal transport and mutual-information maximization, enabling selective transfer of visually relevant signals.
MMCL: A multi-modal contrastive pretraining framework that leverages paired imaging and tabular data to learn strong unimodal encoders, combining SimCLR-style image contrast with SCARF-style tabular augmentations.
TIP: A self-supervised tabular-image pre-training framework designed for multi-modal classification under incomplete tabular inputs, combining masked tabular reconstruction with image-tabular matching and contrastive objectives.
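
The fusion operators behind Concat, MAX, MUL, and DAFT differ only in how an image representation is combined with a tabular one. The sketch below shows the arithmetic on plain Python lists; it is a simplified illustration, not the implementation under prediction/models/ or the original authors' code. All function names are hypothetical, and in DAFT the per-channel scale and shift are predicted by a learned module, whereas here they are passed in as fixed numbers.

```python
def fuse_concat(img, tab):
    """Concatenation: output length is len(img) + len(tab)."""
    return list(img) + list(tab)


def _check_same_dim(img, tab):
    if len(img) != len(tab):
        raise ValueError("element-wise fusion needs embeddings of equal length")


def fuse_max(img, tab):
    """Element-wise maximum of two equal-length embeddings."""
    _check_same_dim(img, tab)
    return [max(a, b) for a, b in zip(img, tab)]


def fuse_mul(img, tab):
    """Element-wise product of two equal-length embeddings."""
    _check_same_dim(img, tab)
    return [a * b for a, b in zip(img, tab)]


def channel_affine(channels, scale, shift):
    """Affine conditioning: each channel's activations become scale * x + shift."""
    if not (len(channels) == len(scale) == len(shift)):
        raise ValueError("need one scale and one shift per channel")
    return [[s * x + b for x in ch] for ch, s, b in zip(channels, scale, shift)]


img_vec, tab_vec = [1.0, 5.0], [3.0, 2.0]
print(fuse_concat(img_vec, tab_vec))
print(fuse_max(img_vec, tab_vec))
print(fuse_mul(img_vec, tab_vec))
print(channel_affine([[1.0, 2.0], [3.0, 4.0]], scale=[2.0, 0.5], shift=[0.0, 1.0]))
```

Concatenation accepts embeddings of different sizes, while the element-wise variants require both modalities to be projected to a common dimension first.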

VLMs

Table-LLaVA-v1.5-7B: A multi-modal table understanding model that answers table-centric instructions directly from table images, avoiding reliance on serialized formats and substantially outperforming recent open-source multi-modal baselines.
Qwen3-VL-8B-Instruct: An instruction-tuned variant of the Qwen3-VL family that supports interleaved text-image-video inputs with a native 256K-token context window for general-purpose multi-modal understanding and generation.
Qwen3-VL-8B-Thinking: A reasoning-oriented variant of Qwen3-VL that emphasizes stronger multi-step multi-modal reasoning across single-image, multi-image, and video tasks with enhanced spatiotemporal modeling capabilities.
InternVL3-8B: An open-source multi-modal LLM that performs native multi-modal pre-training, jointly learning vision and language from mixed multi-modal data and pure-text corpora in a single stage.
GLM-4.1V-9B-Thinking: A reasoning-oriented vision-language model trained with a large-scale multi-modal pre-trained vision backbone and curriculum-based reinforcement learning to enhance general-purpose multi-modal reasoning.
Llama-3.2-11B-Vision-Instruct: A multi-modal large language model that integrates visual reasoning capabilities into the Llama 3 architecture using a separately trained vision adapter with cross-attention layers.
Pixtral-12B: A 12B vision-language model trained for both natural images and documents, supporting flexible visual tokenization and handling multiple images within a long 128K-token context window.
GPT-4.1: A flagship large language model optimized for superior coding performance, instruction following, and long-context processing with a 1-million-token context window.
Gemini-3-Flash-Preview: A lightweight Gemini-family model optimized for low-latency, cost-effective inference while retaining strong general-purpose reasoning and multi-modal capability.

How to Add New Models

For generative reasoning, add the Hugging Face / API calling logic in the corresponding evaluator scripts for each reasoning dataset.
For discriminative prediction, add a new model function under prediction/models/.

Acknowledgements

We would like to thank the following repositories for their great work:
