[Eagle Report] [Eagle 2 Report] [Eagle 2.5 Report] [LocateAnything Report]
[🤗 Model Collection] [🤗 LocateAnything Demo] [Project Page]
- [2026/05] 🔥 Release LocateAnything: a generalist vision-language grounding model based on Eagle.
- [2025/12] A native resolution variant of the Eagle model is adopted as the VLM backbone of GR00T-N1.6. Check out the tech blog for more details.
- [2025/10] 🔥 Release Eagle 2.5 source code.
- [2025/09] 🔥 Eagle 2.5 is accepted to NeurIPS 2025.
- [2025/09] Eagle 2 is supported in Torch-TRT.
- [2025/07] Release Eagle 2.5 model.
- [2025/06] 🔥 Eagle 2.5 is adopted as the VLM backbone of GR00T-N1.5. Check out the tech blog for more details.
- [2025/04] Release Eagle 2.5 tech report.
- [2025/03] 🔥 Eagle 2 is adopted as the VLM backbone (System-2) of GR00T-N1. Check out the GTC launch and white paper for more details.
- [2025/01] Release Eagle 2 tech report and models.
- [2025/01] Eagle is accepted as ICLR 2025 Spotlight.
- [2024/08] Release Eagle.
Eagle is a family of frontier vision-language models (VLMs) from NVIDIA that explore data-centric strategies across general-purpose multimodal understanding, long-context reasoning, and embodied applications.
Beyond advancing multimodal foundation model research, Eagle has also served as a research and development platform that supports multiple flagship NVIDIA efforts across enterprise intelligence and Physical AI, including but not limited to Llama-Nemotron Nano VLM, Nemotron VLMs (V2 Nano VL / V3 Nano Omni), NeMo Retriever (Llama Nemoretriever Colembed), NVIDIA Isaac GR00T N1 (Tech Report) / N1.5 / N1.6, and Cosmos.
| Model | Features | Summary |
|---|---|---|
| LocateAnything | Generalist grounding, detection and pointing | Fast & high quality vision-language grounding with Parallel Box Decoding |
| Eagle 2.5 | Frontier VLM with SOTA image & video understanding | Framework & data strategy for long-context multimodal understanding |
| Eagle 2 | Frontier VLM with SOTA image understanding | Exploring the post-training data strategies for frontier VLMs |
| Eagle | VLMs with mixture-of-encoders | Exploring the design space for vision-centric VLMs |
Video: demo.mp4 (Dense Object Detection)
Video: decoding_demo.mp4 (Fast Decoding Speed)
Video: Wukong.mp4 (User Prompt and VLM Captioning)

VLM captioning output, by chapter:

- 1.05 seconds, Introduction. Detailed caption: The clip begins with a majestic scene of a warrior standing atop a mountaintop, surrounded by mystical energy. The warrior is adorned in ornate, decorated armor with intricate patterns and elaborate designs. The background features a misty, ethereal landscape with mountains in the distance. The warrior then climbs a large stone adorned with intricate engravings. The stone is massive, set amidst a mystical and serene environment with other towering rocks nearby. As the warrior ascends, the camera zooms in to reveal the detailed engravings on the stone, showcasing its ancient and mystical nature. The scene transitions to the warrior reaching the top of the stone and spreading his arms wide, standing victorious. Suddenly, the focus shifts to a mystical figure in flowing robes, who appears to be a sage or a mystical character, standing amidst the rocky landscape. This figure is illuminated by an otherworldly light, suggesting his power and wisdom. The clip concludes with this mystical character appearing calm and serene, hinting at a connection to the warrior and the stone, underscoring themes of power, wisdom, and mystical journey.
- 5.99 seconds, Explaining game genre. Detailed caption: The clip begins with a character dressed in ornate armor, moving stealthily through a forest and up a mountain. The camera focuses on the intricate details of the armor, which has elaborate patterns and is adorned with gold accents. As the character ascends, a large, mystical landscape with towering trees and distant mountains comes into view. The character is then seen standing on the mountaintop, with an ethereal glow surrounding them. The camera shifts to show a wide view of the sky with clouds, creating a dramatic backdrop. Next, the character engages in combat, wielding a large weapon amidst an army dressed similarly, with a focus on their coordinated attack. The action intensifies as the character fights against a large, stone statue, which has a menacing expression and rough, textured surface. The clip continues with another character in white robes, who seems to be casting spells or invoking some form of power. The environment transitions to a snowy battlefield where the main character battles against another warrior, engaging in dynamic combat moves. The scene is filled with dramatic lighting effects, showing the two warriors clashing amidst a snowy landscape with large statues looming in the background. The clip concludes with the two characters continuing their intense battle, with the main character executing elaborate and powerful strikes.
- 517.10 seconds, Discussing controls. Detailed caption: The clip begins with two warriors locked in combat, using long, decorated staffs to strike each other in an icy landscape. The scene is intense, with the warriors demonstrating precise and powerful movements. As they clash, the camera shifts to focus on a mystical figure wearing ornate armor and a crown, who appears to be engaging in a magical ritual or challenge. The warrior in white robes seems to be performing a spell, summoning glowing lights and energy. The camera captures the intricate details of the armor, showing dragon and lion motifs intricately etched into the metal. As the scene progresses, the warrior in white robes is seen riding a large, mystical beast, which attacks the main warrior, creating a swirling vortex of energy. The background features a mountainous landscape, adding to the epic and fantastical atmosphere. The clip concludes with the warrior in white robes standing triumphantly, holding his staff aloft, as the spell they cast manifests before them, demonstrating their power and mastery over the mystical forces.
- 614.53 seconds, Talking about story & characters. Detailed caption: The clip begins with a wide shot of a vast forest with tall trees, and mountains visible in the distance partially covered by mist. The sky is bright and clear, suggesting a peaceful morning. The scene then transitions to a close-up of a mystical, swirling black rock with intricate patterns upon its surface. A hand, presumably belonging to a character named Wukong, reaches out and touches the rock, causing it to shift and reveal a hidden passage. As the hand touches the rock, a transformation appears to begin, symbolizing awakening or revelation. Next, the scene shifts to a different location where Wukong stands on a rocky terrain with trees and a mountainous backdrop, suggesting a connection to nature and ancient wisdom. Wukong is dressed in simple, flowing robes, and with an air of contemplation, gazes off into the distance. The clip transitions to another scene where Wukong is now seen in a more rugged outfit, equipped with a large staff, indicating a readiness for adventure or confrontation. The clip captures the essence of a journey filled with mystery, self-discovery, and the pursuit of enlightenment. Throughout the clip, text appears in Chinese characters, providing context and narrative to the unfolding story, including the names 'Wukong' (in both the original script and pinyin) and 'Peng Lao Dang,' adding depth to the characters and their world.
- 698.69 seconds, Describing visuals & sound. Detailed caption: The clip begins with an ancient stone structure partially submerged in the ocean, its smooth, curved surface hinting at its historical significance. As the camera pans out, large waves crash against the structure, creating a powerful and dramatic atmosphere. The water is a deep blue, and the waves crash with great force, splashing white foam around. In the next scene, the setting shifts to a lush forest at dusk, where dense foliage is illuminated by golden light filtering through the trees. The scene captures the tranquility and beauty of nature, with the play of light and shadow creating a mystical ambiance. The forest then transforms into a rocky mountainous terrain, where a waterfall cascades down a cliff. The water is clear, and the rocks are rugged, emphasizing the raw power of nature. Finally, the clip returns to the mystical realm, where a large, ancient stone stands in a serene landscape. This stone is covered in intricate carvings and stands as a testament to the enduring legacy of the place. The camera lingers on the details of the carvings, highlighting the artistry and craftsmanship of the ancient civilization. The clip concludes with a close-up of the stone, capturing the fine details and the texture of the carvings, leaving a lasting impression of the historical and natural beauty depicted throughout the clip.
- 738.62 seconds, Closing. Detailed caption: The clip begins with a character with white hair and a beard, wearing a loose robe and a beaded necklace, extending their hand towards a large, swirling rock on a rocky mountain. The rock features intricate designs that appear to be swirling patterns. The action proceeds as the character touches the rock, and the rock starts to glow with a mystical light, suggesting a magical transformation or unlocking of power. As the character's hand makes contact with the rock, the surroundings become more luminous, highlighting the magical ambiance. The scene transitions to the character walking away from the rock, moving down a series of rocky steps into a valley. As the character descends, more characters dressed in similar attire come into view, implying a gathering or a quest. The video captures the mystical atmosphere of the mountainous landscape, emphasizing the character's interaction with the rock and the mystical surroundings.
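The captioning output above pairs each chapter title with a number of seconds. As a minimal sketch, assuming each number is the start time of its chapter within the video (the README does not say so explicitly), the markers can be turned into a readable chapter index. The names `CHAPTERS`, `to_mmss`, and `chapter_index` are hypothetical helpers for illustration and are not part of the Eagle codebase.

```python
# Chapter start times (seconds) and titles copied from the captioning output above.
# Assumption: each number is the start time of its chapter within the video.
CHAPTERS = [
    (1.05, "Introduction"),
    (5.99, "Explaining game genre"),
    (517.10, "Discussing controls"),
    (614.53, "Talking about story & characters"),
    (698.69, "Describing visuals & sound"),
    (738.62, "Closing"),
]


def to_mmss(seconds: float) -> str:
    """Format a time in seconds as MM:SS, truncating fractional seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def chapter_index(chapters):
    """Return (start as MM:SS, title, duration in seconds) for each chapter.

    The last chapter has no successor, so its duration is None.
    """
    rows = []
    for i, (start, title) in enumerate(chapters):
        end = chapters[i + 1][0] if i + 1 < len(chapters) else None
        duration = round(end - start, 2) if end is not None else None
        rows.append((to_mmss(start), title, duration))
    return rows


for stamp, title, duration in chapter_index(CHAPTERS):
    print(stamp, title, duration)
```

Under that assumption, the second chapter spans most of the video (from about 6 seconds to about 517 seconds), while the remaining chapters are each under two minutes long.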
Smart City & Metropolis

An example of zero-shot ultra-dense pedestrian detection in the wild at a road crossing in Shibuya, Tokyo, one of the busiest areas in the world.
| Model Name | Date | LLM Backbone | Vision Encoder | Max Length | Download |
|---|---|---|---|---|---|
| LocateAnything-3B | 2026.05.26 | Qwen2.5-3B-Instruct | MoonViT-SO-400M | 25K | 🤗 HF Link |

| Model Name | Date | LLM Backbone | Vision Encoder | Max Length | Download |
|---|---|---|---|---|---|
| Eagle2.5-8B | 2025.04.16 | Qwen2.5-7B-Instruct | SigLIP2 | 128K | 🤗 HF Link |

| Model Name | Date | LLM Backbone | Vision Encoder | Max Length | Download |
|---|---|---|---|---|---|
| Eagle2-1B | 2025.01.11 | Qwen2.5-0.5B-Instruct | SigLIP | 16K | 🤗 HF Link |
| Eagle2-2B | 2025.01.11 | Qwen2.5-1.5B-Instruct | SigLIP | 16K | 🤗 HF Link |
| Eagle2-9B | 2025.01.11 | Qwen2.5-7B-Instruct | SigLIP + ConvNeXt | 16K | 🤗 HF Link |
| Eagle2-34B | 2025.01.11 | Qwen2.5-32B-Instruct | SigLIP + ConvNeXt | 16K | 🤗 HF Link |

| Model Name | Date | LLM Backbone | Vision Encoder | Download |
|---|---|---|---|---|
| Eagle-X4-8B-Plus | 2024.09.16 | Llama-3-8B-Instruct | CLIP+ConvNeXt+EVA+Pix2Str | 🤗 HF Link |
| Eagle-X4-13B-Plus | 2024.09.16 | vicuna-13b-v1.5 | CLIP+ConvNeXt+EVA+Pix2Str | 🤗 HF Link |
| Eagle-X5-34B-Plus | 2024.09.16 | Yi-34B | CLIP+ConvNeXt+EVA+Pix2Str+SAM | 🤗 HF Link |
| Eagle-X5-7B | 2024.09.16 | vicuna-7b-v1.5 | CLIP+ConvNeXt+EVA+Pix2Str+SAM | 🤗 HF Link |
| Eagle-X5-13B | 2024.09.16 | vicuna-13b-v1.5 | CLIP+ConvNeXt+EVA+Pix2Str+SAM | 🤗 HF Link |
| Eagle-X5-13B-Chat | 2024.09.16 | vicuna-13b-v1.5 | CLIP+ConvNeXt+EVA+Pix2Str+SAM | 🤗 HF Link |
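When choosing a checkpoint, the "Max Length" column is often the deciding constraint. The sketch below only restates that column as a lookup table and filters it. The names `MAX_LENGTH_K` and `models_with_context` are hypothetical, and the values are kept in the tables' own "K" units without assuming what one "K" equals. The original Eagle models are omitted because their table lists no maximum length.

```python
# "Max Length" values in K, copied from the model tables above.
# The original Eagle (X4/X5) table has no Max Length column, so those models are omitted.
MAX_LENGTH_K = {
    "LocateAnything-3B": 25,
    "Eagle2.5-8B": 128,
    "Eagle2-1B": 16,
    "Eagle2-2B": 16,
    "Eagle2-9B": 16,
    "Eagle2-34B": 16,
}


def models_with_context(min_length_k: int) -> list[str]:
    """Return the names of models whose listed max length is at least min_length_k."""
    return sorted(
        name for name, length_k in MAX_LENGTH_K.items() if length_k >= min_length_k
    )


print(models_with_context(20))
print(models_with_context(100))
```

A threshold of 20 leaves only Eagle2.5-8B and LocateAnything-3B, and a threshold of 100 leaves only Eagle2.5-8B, which matches its positioning as the long-context model in the family.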
If you find this project useful, please consider citing our works:
@article{wang2025locateanything,
title={LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding},
author={Shihao Wang and Shilong Liu and Yuanguo Kuang and Xinyu Wei and Yangzhou Liu and Zhiqi Li and Yunze Man and Guo Chen and Andrew Tao and Guilin Liu and Jan Kautz and Lei Zhang and Zhiding Yu},
journal={arXiv:2605.27365},
year={2026}
}

@article{man2025locateanything3d,
title = {LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight},
author = {Yunze Man and Shihao Wang and Guowen Zhang and Johan Bjorck and Zhiqi Li and Liang-Yan Gui and Jim Fan and Jan Kautz and Yu-Xiong Wang and Zhiding Yu},
journal = {arXiv preprint arXiv:2511.20648},
year = {2025},
}

@inproceedings{chen2025eagle2.5,
title={Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models},
author={Guo Chen and Zhiqi Li and Shihao Wang and Jindong Jiang and Yicheng Liu and Lidong Lu and De-An Huang and Wonmin Byeon and Matthieu Le and Max Ehrlich and Tong Lu and Limin Wang and Bryan Catanzaro and Jan Kautz and Andrew Tao and Zhiding Yu and Guilin Liu},
booktitle={NeurIPS},
year={2025}
}

@article{li2025eagle2,
title={Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models},
author={Zhiqi Li and Guo Chen and Shilong Liu and Shihao Wang and Vibashan VS and Yishen Ji and Shiyi Lan and Hao Zhang and Yilin Zhao and Subhashree Radhakrishnan and Nadine Chang and Karan Sapra and Amala Sanjay Deshmukh and Tuomas Rintamaki and Matthieu Le and Ilia Karmanov and Lukas Voegtle and Philipp Fischer and De-An Huang and Timo Roman and Tong Lu and Jose M. Alvarez and Bryan Catanzaro and Jan Kautz and Andrew Tao and Guilin Liu and Zhiding Yu},
journal={arXiv:2501.14818},
year={2025}
}

@inproceedings{shi2025eagle,
title = {Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders},
author={Min Shi and Fuxiao Liu and Shihao Wang and Shijia Liao and Subhashree Radhakrishnan and De-An Huang and Hongxu Yin and Karan Sapra and Yaser Yacoob and Humphrey Shi and Bryan Catanzaro and Andrew Tao and Jan Kautz and Zhiding Yu and Guilin Liu},
booktitle={ICLR},
year={2025}
}

- The code is released under the Apache 2.0 license as found in the LICENSE file. Portions of the code in this repo are reused and subject to their original licenses. Some files have been modified, with appropriate attribution and additional license headers added where applicable.
- The pretrained model weights are released under either the CC BY-NC 4.0 License or the NVIDIA License. The models are a research preview intended for non-commercial use only.
- Eagle models are improved using Qwen.
- For code contributions to Eagle, please refer to the Contribution Guide.
- Users are reminded to ensure that their use of the dataset and model weights is in compliance with all applicable laws and regulations.
- LLaVA, LLaVA-HR and InternVL: The Eagle codebase has integrated modified components from these repositories. Many thanks for the great open-source projects.
- LMMs-Eval and VLMEvalKit: We use derivatives of these repositories for evaluation. Many thanks for the wonderful tools.
- Thanks to Cambrian, LLaVA-One-Vision, The Cauldron and many other works for the great efforts in open-sourcing data.
- The team would like to give special thanks to the NVIDIA TSE Team, including Chen Fu, Yuchao Jin, Le An, and Josh Park, for their exceptional work on the optimized TensorRT and edge deployment of Eagle.


