LAMDA-NeSy/VT-Bench

VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

Introduction

VT-Bench is a unified benchmark for vision–tabular multimodal learning, designed to evaluate multimodal models under two major paradigms:

  • Discriminative prediction: image + tabular record → class / value
  • Generative reasoning: image + table/database + question → free-form answer

Figure: Two paradigms in vision–tabular multi-modal learning and the cross-modal grounding challenge in generative reasoning.

VT-Bench provides:

  • Broad model coverage: vision-only models, tabular-only models, vision–tabular fusion models, and general-purpose VLMs
  • Diverse real-world scenarios: 756K paired samples spanning 9 domains, covering both prediction and reasoning tasks
  • Unified evaluation: consistent preprocessing, standardized interfaces, and optional modality diagnostic metrics for prediction

Quickstart

1) Download

Clone the repository:

git clone https://anonymous.4open.science/r/VT-Bench-13C2.git VT-Bench
cd VT-Bench

2) Environment Setup

Create and activate the conda environment:

conda env create --file environment.yaml
conda activate vt_bench

The dependencies for fine-tuning vision-language models are incompatible with the default environment. To fine-tune VLMs, use environment_vlm.yaml instead:

conda env create --file environment_vlm.yaml
conda activate vt_bench_vlm

Additionally, you need to download LLaMA-Factory:

git clone https://github.com/hiyouga/LLaMA-Factory.git

3) Run Evaluation

You need to specify five arguments: task, dataset, model, setting, and diagnostics.

  • task: prediction or reasoning

  • dataset: choose a dataset identifier from the Datasets section

  • model: choose a model name from the Models section

    • For VLMs, use the Hugging Face full identifier, e.g. Qwen/Qwen3-VL-8B-Instruct
  • setting: controls evaluation configuration (varies across datasets; see Datasets)

  • diagnostics: whether to compute additional modality diagnostic metrics for prediction tasks

    • options: none / mcr / mir / full
    • we also support computing MCR for a specified checkpoint
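
The five arguments above, plus the optional checkpoint flag used in the examples, could be parsed along the following lines. This is a minimal sketch, not the repository's actual run.py: the `build_parser` helper, the defaults, and the help strings are assumptions, and only the flag names and the listed choices come from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented in this README."""
    parser = argparse.ArgumentParser(description="VT-Bench evaluation (illustrative sketch)")
    parser.add_argument("--task", required=True, choices=["prediction", "reasoning"])
    parser.add_argument("--dataset", required=True, help="dataset identifier, e.g. skin")
    parser.add_argument("--model", required=True, help="model name or Hugging Face identifier")
    # The defaults below are assumptions, not documented behavior.
    parser.add_argument("--setting", default="none")
    parser.add_argument("--diagnostics", default="none", choices=["none", "mcr", "mir", "full"])
    parser.add_argument("--checkpoint", default=None, help="optional checkpoint path for MCR")
    return parser


# Parse the same flags as the discriminative prediction example.
args = build_parser().parse_args(
    ["--task", "prediction", "--dataset", "skin", "--model", "TIP",
     "--setting", "none", "--diagnostics", "mcr"]
)
print(args.task, args.dataset, args.model, args.diagnostics)
```

Restricting `--task` and `--diagnostics` with `choices` makes a mistyped value fail at parse time rather than partway through a run.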

Example: Discriminative Prediction

python run.py --task prediction --dataset skin --model TIP --setting none --diagnostics mcr

Compute MCR for a specific checkpoint:

python run.py --task prediction --dataset skin --model TIP --setting none --diagnostics mcr --checkpoint {YOUR_PRETRAINED_CKPT_PATH}

Example: Generative Reasoning

python run.py --task reasoning --dataset ehrxqa --model Qwen/Qwen3-VL-8B-Instruct --setting full --diagnostics none

For API Models (GPT-4.1 / Gemini-3-Flash-Preview), please export API keys before running:

export OPENAI_API_KEY='your_openai_api_key'
export GOOGLE_API_KEY='your_google_api_key'
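
A small pre-flight check can catch a missing key before a long evaluation starts. The sketch below is illustrative only: the `missing_api_key` helper and the exact model-name strings are assumptions, while the environment variable names come from the export commands above.

```python
import os

# Environment variable names are taken from the export commands in this README.
# The model-name keys are assumed spellings of the --model value.
REQUIRED_KEYS = {
    "GPT-4.1": "OPENAI_API_KEY",
    "Gemini-3-Flash-Preview": "GOOGLE_API_KEY",
}


def missing_api_key(model, env=os.environ):
    """Return the name of the unset key for `model`, or None if nothing is missing."""
    key = REQUIRED_KEYS.get(model)
    if key is not None and not env.get(key):
        return key
    return None


missing = missing_api_key("GPT-4.1")
if missing:
    print(f"Please export {missing} before running the evaluation.")
```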

Datasets

Discriminative Prediction Datasets

Public Datasets

| Dataset | String Identifier | Availability | Source |
| --- | --- | --- | --- |
| Skin Cancer | skin | Public | kaggle |
| Breast Cancer | breast | Public | kaggle |
| Adoption | adoption | Public | kaggle |
| CelebA | celebA | Public | kaggle |
| DVM-Car | dvm | Public | DVM-Car |
| Pawpularity | pawpularity | Public | kaggle |
| Anime | anime | Public | kaggle |
| Pneumonia | pneumonia | Constructed | MIMIC-IV & MIMIC-CXR |
| Los | los | Constructed | MIMIC-IV & MIMIC-CXR |
| Respiratory Rate | rr | Constructed | MIMIC-IV & MIMIC-CXR |

Constructed Datasets

The three constructed datasets (pneumonia, los, rr) are built from MIMIC-CXR-JPG (v2.0.0) and MIMIC-IV (v2.2). These source datasets require a credentialed PhysioNet license. Due to the Data Use Agreement (DUA), only credentialed users can access the source data.

To access the source datasets, you must:

  1. Become a credentialed PhysioNet user
    • If you do not have a PhysioNet account, register for one here.
    • Follow these instructions for credentialing on PhysioNet.
    • Complete the "CITI Data or Specimens Only Research" training course.
  2. Sign the data use agreement (DUA) for each project

After obtaining access, build the constructed datasets with the preprocessing scripts described below.

Preprocessing Scripts

All preprocessing scripts for prediction datasets are under dataset/.

  • Processing steps for the DVM dataset can be found here.

Generative Reasoning Datasets

| Dataset | String Identifier | Availability | Source |
| --- | --- | --- | --- |
| DVM-Car QA | dvm | Constructed | kaggle |
| MMQA | mmqa | Public | MMQA |
| EHRXQA | ehrxqa | Public (credentialed access) | EHRXQA |

DVM-Car QA

  • Generation script: reasoning/DVM_QA/generate_qa.py.

  • setting for DVM-Car QA should be a list specifying the tasks to evaluate:

    • "loc": Row Localization
    • "attr": Attribute Retrieval
    • "count": Constrained Counting
    • "mean": Conditional Mean

Example:

--setting '["loc","attr","count","mean"]'
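
The single quotes in the shell example keep the inner double quotes intact, so the value arrives as a JSON-style list. One way such a string could be parsed and validated is sketched below. Whether run.py actually uses `json.loads` is an assumption, and `parse_dvm_setting` is a hypothetical helper; the four task names come from the list above.

```python
import json

# Task identifiers documented for DVM-Car QA.
DVM_QA_TASKS = {"loc", "attr", "count", "mean"}


def parse_dvm_setting(raw):
    """Parse a --setting string such as '["loc","attr"]' into a list of task names."""
    tasks = json.loads(raw)
    if not isinstance(tasks, list) or not tasks:
        raise ValueError("setting must be a non-empty JSON list")
    unknown = [t for t in tasks if not isinstance(t, str) or t not in DVM_QA_TASKS]
    if unknown:
        raise ValueError(f"unknown DVM-Car QA tasks: {unknown}")
    return tasks


print(parse_dvm_setting('["loc","attr","count","mean"]'))
```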

MMQA

  • See the official project page: MMQA
  • Only supports full evaluation.

EHRXQA

  • See the official repo: EHRXQA
  • Supported settings: full / stage1 / stage2

How to Add New Datasets

To integrate a new dataset, please provide a custom script that follows VT-Bench’s standard preprocessing pipeline:

  1. Filter samples with missing labels or images, then split into train/val/test

  2. Process tabular features:

    • impute missing values (mean for numeric; "MISSING" for categorical)
    • encode categorical features
    • normalize/standardize numeric features
  3. Process images:

    • resize to 224 × 224
    • store as .npy
  4. Export model-ready files:

    • .csv feature files
    • .pt tensor files following the naming conventions used in existing datasets

A runnable example is provided here: dataset/skin_cancer.ipynb
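
The tabular part of the pipeline (step 2) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the code in the notebook: the helper names are hypothetical, integer codes are just one possible categorical encoding, and the population standard deviation is an arbitrary choice. A real script would more likely use pandas or scikit-learn and would normally compute the statistics on the training split only.

```python
from statistics import mean, pstdev


def impute_numeric(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Zero-mean, unit-variance scaling; a constant column maps to 0.0."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]


def encode_categorical(values):
    """Fill missing entries with "MISSING", then map categories to integer codes."""
    filled = ["MISSING" if v is None else v for v in values]
    codes = {cat: i for i, cat in enumerate(sorted(set(filled)))}
    return [codes[v] for v in filled], codes


# Toy columns with one missing entry each.
age = standardize(impute_numeric([30.0, None, 50.0]))
sex, sex_codes = encode_categorical(["F", None, "M"])
print(age)
print(sex, sex_codes)
```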

Models

VT-Bench supports comprehensive evaluation of vision-tabular multi-modal learning models, including vision unimodal models, tabular unimodal models, vision-tabular multi-modal fusion models, and general-purpose vision-language models (VLMs). For discriminative prediction tasks, we implement adaptive hyperparameter optimization based on the Optuna framework. For generative reasoning tasks, VLMs are evaluated in a zero-shot setting using their official pretrained checkpoints.

Vision Unimodal Models

  1. ResNet-50: A 50-layer deep residual learning architecture that introduces identity skip connections to ease optimization and improve accuracy for large-scale visual recognition.

  2. ViT-B/16: A pure Transformer architecture for vision that represents images as sequences of fixed-size patches and applies self-attention directly over patch tokens, achieving strong image classification performance with compute-efficient training.

Tabular Unimodal Models

  1. LightGBM: An efficient Gradient Boosting Decision Tree implementation optimized for large-scale, high-dimensional data, using Gradient-based One-Side Sampling and Exclusive Feature Bundling to improve training speed and scalability.

  2. TabTransformer: A Transformer-based architecture that contextualizes categorical feature embeddings via self-attention, producing more informative and robust representations for tabular learning.

  3. TabPFN v2: A tabular foundation model trained on millions of synthetic datasets to perform general-purpose inference for supervised prediction on tables, frequently outperforming heavily tuned gradient-boosted tree ensembles with lower training overhead.

Vision-Tabular Multi-Modal Models

  1. Concat: A parameter-efficient multi-task deep model that integrates structural imaging with demographic and clinical features through simple feature concatenation.

  2. MAX: A multi-modal deep learning framework that learns modality-specific representations from clinical variables and imaging data, then fuses them using an element-wise maximum operation.

  3. MUL: An integrative CNN that uses channel-wise multiplicative fusion between imaging and non-imaging streams, yielding better performance than simple concatenation baselines.

  4. DAFT: A lightweight conditioning module for CNNs that fuses imaging features with tabular variables by dynamically predicting per-channel scale and shift parameters and applying affine transforms to convolutional feature maps.

  5. CHARMS: A cross-modal knowledge transfer method that aligns image channels with tabular features via optimal transport and mutual-information maximization, enabling selective transfer of visually relevant signals.

  6. MMCL: A multi-modal contrastive pretraining framework that leverages paired imaging and tabular data to learn strong unimodal encoders, combining SimCLR-style image contrast with SCARF-style tabular augmentations.

  7. TIP: A self-supervised tabular-image pre-training framework designed for multi-modal classification under incomplete tabular inputs, combining masked tabular reconstruction with image-tabular matching and contrastive objectives.
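
The simpler fusion operators described above (Concat, MAX, MUL, and the per-channel affine transform used by DAFT) can be illustrated on toy feature vectors. This is a conceptual sketch, not the benchmark's implementation: the function names are hypothetical, the image and tabular embeddings are assumed to have already been projected to the same length, and the DAFT scale and shift values are passed in directly rather than predicted by a network.

```python
def concat_fusion(img, tab):
    """Concatenate the two feature vectors."""
    return img + tab


def max_fusion(img, tab):
    """Element-wise maximum of same-length feature vectors."""
    return [max(i, t) for i, t in zip(img, tab)]


def mul_fusion(img, tab):
    """Element-wise (channel-wise) product of same-length feature vectors."""
    return [i * t for i, t in zip(img, tab)]


def affine_fusion(img, scale, shift):
    """DAFT-style per-channel affine transform: scale * feature + shift."""
    return [s * i + b for i, s, b in zip(img, scale, shift)]


img = [0.5, -1.0, 2.0]   # toy image features, one value per channel
tab = [1.0, 2.0, -3.0]   # toy tabular embedding of the same length

print(concat_fusion(img, tab))                                # [0.5, -1.0, 2.0, 1.0, 2.0, -3.0]
print(max_fusion(img, tab))                                   # [1.0, 2.0, 2.0]
print(mul_fusion(img, tab))                                   # [0.5, -2.0, -6.0]
print(affine_fusion(img, [2.0, 1.0, 0.5], [0.0, 1.0, -1.0]))  # [1.0, 0.0, 0.0]
```

Concatenation preserves both inputs at the cost of a wider vector, while the maximum and product keep the dimensionality fixed and require matching lengths.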

VLMs

  1. Table-LLaVA-v1.5-7B: A multi-modal table understanding model that answers table-centric instructions directly from table images, avoiding reliance on serialized formats and substantially outperforming recent open-source multi-modal baselines.

  2. Qwen3-VL-8B-Instruct: An instruction-tuned variant of the Qwen3-VL family that supports interleaved text-image-video inputs with a native 256K-token context window for general-purpose multi-modal understanding and generation.

  3. Qwen3-VL-8B-Thinking: A reasoning-oriented variant of Qwen3-VL that emphasizes stronger multi-step multi-modal reasoning across single-image, multi-image, and video tasks with enhanced spatiotemporal modeling capabilities.

  4. InternVL3-8B: An open-source multi-modal LLM that performs native multi-modal pre-training, jointly learning vision and language from mixed multi-modal data and pure-text corpora in a single stage.

  5. GLM-4.1V-9B-Thinking: A reasoning-oriented vision-language model trained with a large-scale multi-modal pre-trained vision backbone and curriculum-based reinforcement learning to enhance general-purpose multi-modal reasoning.

  6. Llama-3.2-11B-Vision-Instruct: A multi-modal large language model that integrates visual reasoning capabilities into the Llama 3 architecture using a separately trained vision adapter with cross-attention layers.

  7. Pixtral-12B: A 12B vision-language model trained for both natural images and documents, supporting flexible visual tokenization and handling multiple images within a long 128K-token context window.

  8. GPT-4.1: A flagship large language model optimized for superior coding performance, instruction following, and long-context processing with a 1-million-token context window.

  9. Gemini-3-Flash-Preview: A lightweight Gemini-family model optimized for low-latency, cost-effective inference while retaining strong general-purpose reasoning and multi-modal capability.

How to Add New Models

  • For generative reasoning, add the Hugging Face / API calling logic in the corresponding evaluator scripts for each reasoning dataset.
  • For discriminative prediction, add a new model function under prediction/models/.

Acknowledgements

We would like to thank the following repositories for their great work:
