Xin Zhou¹, Dingkang Liang¹, Xiwu Chen², Feiyang Tan², Dingyuan Zhang¹, Hengshuang Zhao³, Xiang Bai¹
¹Huazhong University of Science and Technology, ²Mach Drive, ³The University of Hong Kong
Driving world models serve as a pivotal technology for autonomous driving by simulating environmental dynamics. However, existing approaches predominantly focus on future scene generation, often overlooking comprehensive 3D scene understanding. Conversely, while Large Language Models (LLMs) demonstrate impressive reasoning capabilities, they lack the capacity to predict future geometric evolution, creating a significant disparity between semantic interpretation and physical simulation. To bridge this gap, we propose HERMES++, a unified driving world model that integrates 3D scene understanding and future geometry prediction within a single framework. Our approach addresses the distinct requirements of these tasks through synergistic designs. First, a BEV representation consolidates multi-view spatial information into a structure compatible with LLMs. Second, we introduce LLM-enhanced world queries to facilitate knowledge transfer from the understanding branch. Third, a Current-to-Future Link is designed to bridge the temporal gap, conditioning geometric evolution on semantic context. Finally, to enforce structural integrity, we employ a Joint Geometric Optimization strategy that integrates explicit geometric constraints with implicit latent regularization to align internal representations with geometry-aware priors. Extensive evaluations on multiple benchmarks validate the effectiveness of our method. HERMES++ achieves strong performance, outperforming specialist approaches in both future point cloud prediction and 3D scene understanding tasks.
- Unified driving world model: jointly supports 3D scene understanding and future geometry prediction.
- BEV representation for LLMs: compresses multi-view visual inputs into spatially consistent BEV tokens.
- LLM-enhanced world queries: transfer semantic and world knowledge from language reasoning to future generation.
- Current-to-Future Link: bridges current scene understanding and future geometric evolution.
- Textual Injection: uses text embeddings as conditioning signals for future scene generation.
- Joint Geometric Optimization: aligns latent features with geometry-aware priors through explicit and implicit constraints (see the loss sketch after this list).
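To make the last point concrete, below is a minimal PyTorch sketch of how an explicit geometric constraint and an implicit latent regularizer could be combined into a single objective. The function name, the brute-force Chamfer distance, the cosine alignment term, and the loss weights are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn.functional as F

def joint_geometric_loss(pred_points, gt_points, pred_latent, geo_latent,
                         w_explicit=1.0, w_implicit=0.1):
    """Hypothetical sketch of a Joint Geometric Optimization objective.

    Explicit term: symmetric Chamfer distance between predicted and
    ground-truth future point clouds. Implicit term: cosine alignment
    between predicted future latents and geometry-aware target latents.
    """
    # Explicit geometric constraint, computed brute-force for clarity:
    # pred_points is (B, N, 3), gt_points is (B, M, 3).
    d = torch.cdist(pred_points, gt_points)                       # (B, N, M)
    chamfer = d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

    # Implicit latent regularization: pull internal representations
    # toward geometry-aware priors (1 - cosine similarity).
    align = 1.0 - F.cosine_similarity(pred_latent, geo_latent, dim=-1).mean()

    return w_explicit * chamfer + w_implicit * align
```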
- 2026.04.30: The extended HERMES++ paper and code are released.
- 2025.06.26: The HERMES conference version is accepted to ICCV 2025.
- 2025.01.24: The HERMES paper and demo are released.
HERMES++ unifies understanding and generation around a shared BEV representation:
- Multi-view images are encoded and projected into BEV space.
- BEV features are compressed into LLM-compatible visual tokens.
- The LLM performs scene understanding and enriches world queries with semantic knowledge.
- The Current-to-Future Link generates future latent representations conditioned on current BEV features, textual semantics, and future ego-motion (sketched after this list).
- A future geometry decoder predicts future point clouds, optimized with Joint Geometric Optimization.
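The conditioning step above can be summarized in a self-contained sketch. The module structure, tensor shapes, and concatenation-based conditioning memory are assumptions made for illustration; the actual architecture is defined in the released code.

```python
import torch
import torch.nn as nn

class CurrentToFutureLink(nn.Module):
    """Hypothetical sketch of the Current-to-Future Link: produce future
    latents conditioned on current BEV tokens, injected text semantics,
    and future ego-motion."""

    def __init__(self, dim=256, ego_dim=6, n_heads=8, n_layers=2):
        super().__init__()
        self.ego_proj = nn.Linear(ego_dim, dim)   # embed future ego-motion
        self.text_proj = nn.Linear(dim, dim)      # project text embeddings (Textual Injection)
        layer = nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, world_queries, bev_tokens, text_emb, ego_motion):
        # Conditioning memory: current BEV tokens, text tokens, and one
        # embedded ego-motion token, concatenated along the sequence axis.
        ego_tok = self.ego_proj(ego_motion).unsqueeze(1)                   # (B, 1, C)
        memory = torch.cat([bev_tokens, self.text_proj(text_emb), ego_tok], dim=1)
        # LLM-enhanced world queries cross-attend to the memory to yield
        # future latent representations.
        return self.decoder(world_queries, memory)                        # (B, Q, C)

# Shape-only smoke test with random tensors.
link = CurrentToFutureLink()
future_latent = link(
    torch.randn(2, 64, 256),   # world queries
    torch.randn(2, 900, 256),  # current BEV tokens
    torch.randn(2, 16, 256),   # text embeddings
    torch.randn(2, 6),         # future ego-motion (e.g., translation + rotation)
)
```

A future geometry decoder would then map the returned future latents to point clouds, trained under the Joint Geometric Optimization objective sketched earlier.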
We provide separate documents for environment setup, data preparation, and usage.
After preparing the environment and data, train or evaluate with the configs in `projects/configs/hermes`.
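As a hedged illustration only (the real entry points and launch scripts are described in the usage document), a config under `projects/configs/hermes` could be inspected before launch as follows; the config filename below is hypothetical, and an mmcv/mmengine-style config system is assumed (as in BEVFormer-based repositories).

```python
# Hypothetical sketch: inspecting a HERMES config before launching
# training or evaluation. The filename is illustrative.
from mmengine.config import Config  # use `from mmcv import Config` on older stacks

cfg = Config.fromfile("projects/configs/hermes/example_config.py")  # hypothetical path
print(cfg.keys())  # e.g., model, data, optimizer, schedule
```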
- Release demo.
- Release checkpoints.
- Release training code.
- Release processed datasets.
This project builds on HERMES, BEVFormer v2, InternVL, UniPAD, OmniDrive, DriveMonkey, and related open-source autonomous driving research. We thank the authors of these projects for their contributions to the community.
If you find this repository useful for your research, please consider citing the following papers.
@article{zhou2026hermespp,
  title={HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation},
  author={Zhou, Xin and Liang, Dingkang and Chen, Xiwu and Tan, Feiyang and Zhang, Dingyuan and Zhao, Hengshuang and Bai, Xiang},
  journal={arXiv preprint arXiv:2604.28196},
  year={2026}
}
@inproceedings{zhou2025hermes,
  title={HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation},
  author={Zhou, Xin and Liang, Dingkang and Tu, Sifan and Chen, Xiwu and Ding, Yikang and Zhang, Dingyuan and Tan, Feiyang and Zhao, Hengshuang and Bai, Xiang},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}