Xin Zhou¹, Dingkang Liang¹, Xiwu Chen², Feiyang Tan², Dingyuan Zhang¹, Hengshuang Zhao³, Xiang Bai¹
¹Huazhong University of Science and Technology, ²Mach Drive, ³The University of Hong Kong
Driving world models serve as a pivotal technology for autonomous driving by simulating environmental dynamics. However, existing approaches predominantly focus on future scene generation, often overlooking comprehensive 3D scene understanding. Conversely, while Large Language Models (LLMs) demonstrate impressive reasoning capabilities, they lack the capacity to predict future geometric evolution, creating a significant disparity between semantic interpretation and physical simulation. To bridge this gap, we propose HERMES++, a unified driving world model that integrates 3D scene understanding and future geometry prediction within a single framework. Our approach addresses the distinct requirements of these tasks through synergistic designs. First, a BEV representation consolidates multi-view spatial information into a structure compatible with LLMs. Second, we introduce LLM-enhanced world queries to facilitate knowledge transfer from the understanding branch. Third, a Current-to-Future Link is designed to bridge the temporal gap, conditioning geometric evolution on semantic context. Finally, to enforce structural integrity, we employ a Joint Geometric Optimization strategy that integrates explicit geometric constraints with implicit latent regularization to align internal representations with geometry-aware priors. Extensive evaluations on multiple benchmarks validate the effectiveness of our method. HERMES++ achieves strong performance, outperforming specialist approaches in both future point cloud prediction and 3D scene understanding tasks.
- Unified driving world model: jointly supports 3D scene understanding and future geometry prediction.
- BEV representation for LLMs: compresses multi-view visual inputs into spatially consistent BEV tokens.
- LLM-enhanced world queries: transfer semantic and world knowledge from language reasoning to future generation.
- Current-to-Future Link: bridges current scene understanding and future geometric evolution.
- Textual Injection: uses text embeddings as conditioning signals for future scene generation.
- Joint Geometric Optimization: aligns latent features with geometry-aware priors through explicit and implicit constraints (see the loss sketch after this list).
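To make the last point concrete, below is a minimal PyTorch sketch of how an explicit geometric constraint and an implicit latent regularizer could be combined into a single objective. The function name, the brute-force Chamfer distance, the cosine alignment term, and the loss weights are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn.functional as F

def joint_geometric_loss(pred_points, gt_points, pred_latent, geo_latent,
                         w_explicit=1.0, w_implicit=0.1):
    """Hypothetical sketch of a Joint Geometric Optimization objective.

    Explicit term: symmetric Chamfer distance between predicted and
    ground-truth future point clouds. Implicit term: cosine alignment
    between predicted future latents and geometry-aware target latents.
    """
    # Explicit geometric constraint, computed brute-force for clarity:
    # pred_points is (B, N, 3), gt_points is (B, M, 3).
    d = torch.cdist(pred_points, gt_points)                       # (B, N, M)
    chamfer = d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

    # Implicit latent regularization: pull internal representations
    # toward geometry-aware priors (1 - cosine similarity).
    align = 1.0 - F.cosine_similarity(pred_latent, geo_latent, dim=-1).mean()

    return w_explicit * chamfer + w_implicit * align
```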
- 2026.04.30: The extended HERMES++ paper and code are released.
- 2025.06.26: The HERMES conference version is accepted to ICCV 2025.
- 2025.01.24: The HERMES paper and demo are released.
HERMES++ unifies understanding and generation around a shared BEV representation:
- Multi-view images are encoded and projected into BEV space.
- BEV features are compressed into LLM-compatible visual tokens.
- The LLM performs scene understanding and enriches world queries with semantic knowledge.
- The Current-to-Future Link generates future latent representations conditioned on current BEV features, textual semantics, and future ego-motion (sketched after this list).
- A future geometry decoder predicts future point clouds, optimized with Joint Geometric Optimization.
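The conditioning step above can be summarized in a self-contained sketch. The module structure, tensor shapes, and concatenation-based conditioning memory are assumptions made for illustration; the actual architecture is defined in the released code.

```python
import torch
import torch.nn as nn

class CurrentToFutureLink(nn.Module):
    """Hypothetical sketch of the Current-to-Future Link: produce future
    latents conditioned on current BEV tokens, injected text semantics,
    and future ego-motion."""

    def __init__(self, dim=256, ego_dim=6, n_heads=8, n_layers=2):
        super().__init__()
        self.ego_proj = nn.Linear(ego_dim, dim)   # embed future ego-motion
        self.text_proj = nn.Linear(dim, dim)      # project text embeddings (Textual Injection)
        layer = nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, world_queries, bev_tokens, text_emb, ego_motion):
        # Conditioning memory: current BEV tokens, text tokens, and one
        # embedded ego-motion token, concatenated along the sequence axis.
        ego_tok = self.ego_proj(ego_motion).unsqueeze(1)                   # (B, 1, C)
        memory = torch.cat([bev_tokens, self.text_proj(text_emb), ego_tok], dim=1)
        # LLM-enhanced world queries cross-attend to the memory to yield
        # future latent representations.
        return self.decoder(world_queries, memory)                        # (B, Q, C)

# Shape-only smoke test with random tensors.
link = CurrentToFutureLink()
future_latent = link(
    torch.randn(2, 64, 256),   # world queries
    torch.randn(2, 900, 256),  # current BEV tokens
    torch.randn(2, 16, 256),   # text embeddings
    torch.randn(2, 6),         # future ego-motion (e.g., translation + rotation)
)
```

A future geometry decoder would then map the returned future latents to point clouds, trained under the Joint Geometric Optimization objective sketched earlier.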
We provide separate documents for environment setup, data preparation, and usage.
After preparing the environment and data, train or evaluate with the configs in `projects/configs/hermes`.
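As a hedged illustration only (the real entry points and launch scripts are described in the usage document), a config under `projects/configs/hermes` could be inspected before launch as follows; the config filename below is hypothetical, and an mmcv/mmengine-style config system is assumed (as in BEVFormer-based repositories).

```python
# Hypothetical sketch: inspecting a HERMES config before launching
# training or evaluation. The filename is illustrative.
from mmengine.config import Config  # use `from mmcv import Config` on older stacks

cfg = Config.fromfile("projects/configs/hermes/example_config.py")  # hypothetical path
print(cfg.keys())  # e.g., model, data, optimizer, schedule
```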
- Release demo.
- Release checkpoints.
- Release training code.
- Release processed datasets.
This project builds on HERMES, BEVFormer v2, InternVL, UniPAD, OmniDrive, DriveMonkey, and related open-source autonomous driving research. We thank the authors of these projects for their contributions to the community.
If you find this repository useful for your research, please consider citing the following papers.
@article{zhou2026hermespp,
  title={HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation},
  author={Zhou, Xin and Liang, Dingkang and Chen, Xiwu and Tan, Feiyang and Zhang, Dingyuan and Zhao, Hengshuang and Bai, Xiang},
  journal={arXiv preprint arXiv:2604.28196},
  year={2026}
}
@inproceedings{zhou2025hermes,
  title={HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation},
  author={Zhou, Xin and Liang, Dingkang and Tu, Sifan and Chen, Xiwu and Ding, Yikang and Zhang, Dingyuan and Tan, Feiyang and Zhao, Hengshuang and Bai, Xiang},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}