- [2025.01.25] 🔥 GlobMed has been expanded to cover 20 languages, with the addition of PubMedQA and MedMCQA.
- [2025.01.08] 🔥 GlobMed is publicly available through the Hugging Face platform.
We introduce GlobMed, the largest multilingual medical dataset to date, spanning 20 languages.
- 800,000+ entries
- 20 languages
- High-resource: Arabic, Chinese, English, French, German, Hindi, Indonesian, Japanese, Korean, Portuguese, Russian, Spanish, Thai
- Low-resource: Bengali, Malay, Swahili, Urdu, Wolof, Yoruba, Zulu
- Natural Language Inference: BioNLI, MedNLI
- Long-Form Question Answering: ExpertQA-Bio, ExpertQA-Med, LiveQA
- Multiple-choice Question Answering: HeadQA, MedExpQA, MedQA, MMLU-Pro
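Each subset is distributed per language on the Hugging Face Hub. A minimal loading sketch, assuming the other subset repositories follow the same naming and per-language configs as `ruiyang-medinfo/GlobMed_MMLU-Pro` in the quick start below (the MedQA repo name here is an extrapolation, not confirmed):

```python
from datasets import load_dataset

# Repo naming extrapolated from "ruiyang-medinfo/GlobMed_MMLU-Pro";
# verify the exact dataset names on the Hugging Face Hub.
for lang in ["en", "zh", "sw"]:
    medqa = load_dataset("ruiyang-medinfo/GlobMed_MedQA", lang)
    print(lang, medqa)
```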
Building on GlobMed, we establish GlobMed-Bench, which systematically assesses 56 state-of-the-art LLMs across multiple multilingual medical tasks.
- 56 state-of-the-art LLMs
- 40,000+ independent experiments
- 125M+ generated responses

Figure: Overall Performance of 12 Proprietary LLMs

Figure: Overall Performance of 44 Open-Weight LLMs
- Proprietary LLMs generally achieve stronger overall performance
- Open-Weight LLMs exhibit significant performance variance and demonstrate a scaling law
- LLMs show significant performance disparities across languages, especially on low-resource languages
- Reasoning-enhanced LLMs consistently outperform non-reasoning counterparts
- Medical LLMs do not consistently outperform their general counterparts
We introduce GlobMed-LLMs, a suite of multilingual medical LLMs trained on GlobMed, with parameter counts ranging from 1.7B to 8B.
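If the GlobMed-LLM checkpoints are published on Hugging Face, they should load with the standard transformers API; the repo id below is a hypothetical placeholder, not a confirmed release name:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id; substitute the released GlobMed-LLM checkpoint name.
model_id = "ruiyang-medinfo/GlobMed-LLM-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```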
Load GlobMed from the Hugging Face Hub:

```python
from datasets import load_dataset

# Load GlobMed-MMLU-Pro (English)
globmed_mmlu_pro = load_dataset("ruiyang-medinfo/GlobMed_MMLU-Pro", "en")
globmed_mmlu_pro_train = globmed_mmlu_pro["train"]
globmed_mmlu_pro_test = globmed_mmlu_pro["test"]
```
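To sanity-check what you loaded, a quick inspection sketch continuing the snippet above. The field names (`question`, `options`, `answer_index`) are assumed from the result-file metadata shown later; verify them against the actual dataset schema:

```python
# Peek at one test example; field names assumed, see note above.
sample = globmed_mmlu_pro_test[0]
print(sample["question"])
for label, text in sample["options"].items():
    print(f"  ({label}) {text}")
print("gold index:", sample["answer_index"])
```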
```bash
# Create environment
conda create -n globalmed python=3.10 -y && conda activate globalmed

# Install project
pip install -e .
```

For local inference with vLLM:

```bash
pip install -e ".[vllm]"
```

Copy the environment template and fill in your API keys:
```bash
cp .env.example .env
# Edit .env file with your API keys
```

The `.env` file supports the following configurations:
```bash
# OpenAI / OpenRouter / Local vLLM serve
OPENAI_BASE=https://api.openai.com/v1
OPENAI_KEY=your-api-key

# Azure OpenAI
AZURE_OPENAI_BASE=https://your-resource.openai.azure.com/
AZURE_OPENAI_KEY=your-azure-key
AZURE_OPENAI_API_VERSION=2024-12-01-preview
AZURE_OPENAI_DEPLOYMENT=gpt-4o

# HuggingFace (for dataset download)
HF_TOKEN=hf_...
```
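Under the hood, the runner presumably builds an OpenAI-compatible client from these variables. A minimal sketch of that pattern with the official `openai` Python package (illustrative only, not the repository's actual code):

```python
import os
from openai import OpenAI

# Works for OpenAI, OpenRouter, or a local vLLM server alike, since all
# three expose the same OpenAI-compatible API behind OPENAI_BASE.
client = OpenAI(
    base_url=os.environ["OPENAI_BASE"],
    api_key=os.environ["OPENAI_KEY"],
)
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
print(reply.choices[0].message.content)
```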
Using OpenAI, OpenRouter, or other commercial APIs:

```bash
# Set environment variables
export OPENAI_BASE="https://api.openai.com/v1"
export OPENAI_KEY="your-api-key"

# Run evaluation (single language)
python -m globalmed.run_openai \
    --model_id "gpt-4o" \
    --subset "MedQA" \
    --lang "en" \
    --n_thread 32

# Run evaluation (all languages)
python -m globalmed.run_openai \
    --model_id "gpt-4o" \
    --subset "MedQA" \
    --n_thread 32
```

Using Azure OpenAI:
```bash
export AZURE_OPENAI_BASE="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_KEY="your-key"
export AZURE_OPENAI_API_VERSION="2024-12-01-preview"

python -m globalmed.run_openai \
    --model_id "gpt-4o" \
    --subset "MedQA" \
    --use_azure \
    --n_thread 32
```

Start a vLLM server first, then use the OpenAI-compatible interface:
Step 1: Start vLLM Server
```bash
# Basic startup
vllm serve Qwen/Qwen3-8B --port 8000

# Enable thinking mode (Qwen3 series)
vllm serve Qwen/Qwen3-8B --port 8000 --reasoning-parser qwen3

# Multi-GPU tensor parallel
vllm serve Qwen/Qwen3-30B-A3B --port 8000 --tensor-parallel-size 4
```

Step 2: Run Evaluation
```bash
export OPENAI_BASE="http://localhost:8000/v1"
export OPENAI_KEY="EMPTY"

python -m globalmed.run_openai \
    --model_id "Qwen/Qwen3-8B" \
    --subset "MedQA" \
    --n_thread 64
```

Use vLLM directly for batch offline inference:
```bash
python -m globalmed.run_vllm \
    --model_id "meta-llama/Llama-3.3-70B-Instruct" \
    --subset "MedQA" \
    --tensor_parallel_size 4 \
    --batch_size 128
```

If some samples fail during inference (due to API errors, timeouts, etc.), the script will print a warning at the end:
```
WARNING - X items missing from JSON export
WARNING - Missing IDs (first 5): [...]
```
If you see this warning, re-run the same command to resume and fill in the failed samples. The script automatically detects completed samples and only processes the missing ones. This ensures fair scoring with complete results.
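The resume behavior amounts to a completed-ID scan over the existing output file before any new requests are sent. A sketch of that idea (the file path and record handling are illustrative, not the repository's actual implementation):

```python
import json
from pathlib import Path

def completed_ids(results_file: str) -> set:
    """Collect IDs of samples that already have a stored response."""
    done = set()
    path = Path(results_file)
    if path.exists():
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            if record.get("responses"):
                done.add(record["id"])
    return done

# Stand-in for the real sample list; only missing IDs are re-submitted.
samples = [{"id": f"MedQA_en_{i}"} for i in range(3)]
done = completed_ids("results.jsonl")  # illustrative output path
todo = [s for s in samples if s["id"] not in done]
print(f"{len(todo)} samples left to run")
```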
Print the aggregated results:

```bash
python -m globalmed.pretty_print --subset "MedQA"
```

Output example:
```
Dataset: MedQA (shot) [run_idx=0]
Models found: 2
+------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
| Model            | en    | zh    | fr    | de    | ja    | ko    | pt    | es    | sw    | wo    | yo    | zu    |
+------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
| model-a          | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx |
| model-b          | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx |
+------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
```
To display only selected languages:

```bash
python -m globalmed.pretty_print --subset "MedQA" --lang en zh fr de
```

Results are stored in `results/{model_id}/{subset}_{prompt_style}/`:
Each line is a JSON object for one sample.
`run_openai.py` format:

```json
{
  "id": "MedQA_en_0",
  "request": "[{\"role\": \"user\", \"content\": \"...\"}]",
  "responses": [
    {"content": "Let me analyze... The answer is (A).", "reasoning": null, "answer": "A"}
  ],
  "metadata": {"id": "MedQA_en_0", "question": "...", "options": {"A": "...", "B": "..."}, "answer_index": 0}
}
```

`run_vllm.py` format:
```json
{
  "id": "MedQA_en_0",
  "request": "<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n",
  "responses": [
    {"content": "Let me analyze... The answer is (A).", "answer": "A"}
  ],
  "metadata": {"id": "MedQA_en_0", "question": "...", "options": {"A": "...", "B": "..."}, "answer_index": 0}
}
```

Aggregated results for scoring:
```json
[
  {
    "id": "MedQA_en_0",
    "answer": ["A"],
    "ground_truth": "A"
  },
  {
    "id": "MedQA_en_1",
    "answer": ["B"],
    "ground_truth": "C"
  }
]
```
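Scoring against this aggregated format reduces to comparing each prediction with its ground truth. A minimal accuracy sketch (the repository's scorer may apply extra normalization; the file name is illustrative):

```python
import json

# Illustrative file name; "answer" holds one prediction per run,
# so we compare the first run's prediction against the gold label.
with open("MedQA_en_aggregated.json") as f:
    records = json.load(f)

correct = sum(1 for r in records if r["answer"][0] == r["ground_truth"])
print(f"accuracy: {100.0 * correct / len(records):.2f}%")
```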
- Rui Yang: yang.rui@duke-nus.edu.sg
- Weihao Xuan: xuan@ms.k.u-tokyo.ac.jp
```bibtex
@article{yang2026toward,
  title={Toward Global Large Language Models in Medicine},
  author={Yang, Rui and Li, Huitao and Xuan, Weihao and Qi, Heli and Li, Xin and Yu, Kunyu and Chen, Yingjian and Wang, Rongrong and Behmoaras, Jacques and Cai, Tianxi and others},
  journal={arXiv preprint arXiv:2601.02186},
  year={2026}
}
```
This repository is licensed under the MIT license.