Vision Banana

Image Generators are Generalist Vision Learners

Google DeepMind

Technical Report

Overview

🏆 Vision Banana is a SOTA unified model for both image understanding and generation.
🧠 Generative vision pretraining is an effective paradigm for visual understanding.
🔗 Image generation serves as a universal interface for diverse vision tasks.

Figure: Vision Banana overview, from generative pretraining to vision understanding.
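
To make the "universal interface" idea concrete, here is a minimal sketch of how a generated RGB image could be read back as a semantic label map. This is not the report's actual decoding procedure: the palette, the class names, and the nearest-color rule are all hypothetical choices made for illustration.

```python
# Hypothetical palette: these colors and class names are illustrative
# assumptions, not values taken from the Vision Banana report.
PALETTE = {
    "road": (128, 64, 128),
    "car": (0, 0, 142),
    "sky": (70, 130, 180),
}


def nearest_class(pixel, palette=PALETTE):
    """Return the class whose palette color is closest to `pixel` in RGB space."""
    def sq_dist(color):
        return sum((p - c) ** 2 for p, c in zip(pixel, color))

    return min(palette, key=lambda name: sq_dist(palette[name]))


def decode_mask(image, palette=PALETTE):
    """Map an H x W grid of RGB tuples to an H x W grid of class names."""
    return [[nearest_class(px, palette) for px in row] for row in image]


# A 2 x 2 stand-in for a generated image, with slightly noisy colors.
generated = [
    [(126, 66, 130), (2, 1, 140)],
    [(72, 128, 178), (129, 63, 127)],
]
labels = decode_mask(generated)
print(labels)  # [['road', 'car'], ['sky', 'road']]
```

Snapping each pixel to its nearest palette color tolerates small color drift in the generated image, which is why the sketch uses a distance rule rather than exact matching.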

Capabilities

Each example pairs an input image with the result generated by Vision Banana.

Semantic Segmentation

Output: segmentation masks

Instance Segmentation

Output: instance masks

Referring Expression Segmentation

Output: masks of the referred objects

Monocular Metric Depth Estimation

Output: depth maps

Surface Normal Estimation

Output: surface normal maps

Results

Vision Banana achieves state-of-the-art results in the zero-shot transfer setting across 2D and 3D vision tasks.

2D Understanding

Cityscapes: Semantic Segmentation, mIoU (higher is better)

Method                      mIoU
SegMan-L (non zero-shot)    0.842
APE-D                       0.442
OpenSeeD                    0.478
X-Decoder                   0.520
SAM 3                       0.652
Vision Banana 🍌            0.699
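
For readers unfamiliar with the metric, the following is a minimal sketch of mean intersection-over-union (mIoU) on flattened label maps. It assumes integer class ids and averages only over classes present in the prediction or the ground truth. Benchmark tooling typically accumulates intersections and unions over a whole validation set before averaging, so treat this as an illustration rather than the evaluation code behind the numbers above.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    if not ious:
        return float("nan")  # no class present at all
    return sum(ious) / len(ious)


# Toy flattened label maps with three classes (0, 1, 2).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

# Per-class IoU is 1/3, 2/3 and 1/2, so the mean is about 0.5.
score = mean_iou(pred, target, num_classes=3)
print(round(score, 3))
```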

SA-Co/Gold: Instance Segmentation, pmF1 (higher is better)

Method                      pmF1
SAM 3 (non zero-shot)       0.661
APE-D                       0.369
OWLv2                       0.420
Gemini 2.5                  0.461
Vision Banana 🍌            0.540
DINO-X                      0.552

* Evaluated on 500 randomly sampled queries.

RefCOCOg val (UMD): cIoU (higher is better)

Method                              cIoU
HyperSeg + Phi2 (non zero-shot)     0.794
X-SAM + Phi3 (non zero-shot)        0.838
HybridGL                            0.513
Kang + LLaVA                        0.677
SAM 3 + Gemini 2.5 Pro              0.734
Vision Banana 🍌                    0.738

ReasonSeg val: gIoU (higher is better)

Method                                  gIoU
X-SAM + Phi3 3.8B (non zero-shot)       0.566
LISA-13B-LLAVA1.5 (non zero-shot)       0.650
SegZero                                 0.626
RSVP + GPT-4o                           0.647
SAM 3 + Gemini 2.5 Pro                  0.770
Vision Banana 🍌 + Gemini 2.5 Pro       0.793

Note: entries with "+" are methods paired with an MLLM for reasoning.
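
The two referring-segmentation metrics above aggregate IoU differently. The sketch below assumes the definitions commonly used in this literature: gIoU is the mean of per-sample IoUs, while cIoU divides the cumulative intersection by the cumulative union over all samples. Scoring an empty union as 1.0 is an assumption of this sketch, and the toy masks are invented for illustration.

```python
def intersection_union(pred, target):
    """Pixel counts of the intersection and union of two binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union


def giou(samples):
    """Mean of per-sample IoU; every sample counts equally."""
    ious = []
    for pred, target in samples:
        inter, union = intersection_union(pred, target)
        # Assumption: an empty prediction on an empty target scores 1.0.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def ciou(samples):
    """Cumulative intersection over cumulative union; large objects dominate."""
    total_inter = total_union = 0
    for pred, target in samples:
        inter, union = intersection_union(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union


samples = [
    # Large object, half covered: IoU = 4/8.
    ([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]),
    # Small object, matched exactly: IoU = 1/1.
    ([1, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0]),
]
print(giou(samples))  # (0.5 + 1.0) / 2 = 0.75
print(ciou(samples))  # (4 + 1) / (8 + 1), about 0.556
```

Because cIoU pools pixels across the dataset, it is biased toward large objects, which is one reason the two numbers are not directly comparable.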

3D Understanding

Metric Depth: Average over 6 Benchmarks, δ₁ (higher is better)

Method                      δ₁
Depth Pro                   0.715
MoGe-2                      0.802
UniK3D                      0.823
Vision Banana 🍌            0.882

Vision Banana does not use camera intrinsics in training or inference.
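
As a reminder of what δ₁ measures, here is a minimal sketch under the standard definition: the fraction of valid pixels whose predicted depth is within a factor of 1.25 of the ground truth. Treating a ground-truth value of 0 as "no measurement" is an assumption of this sketch, and the depth values are invented.

```python
def delta1(pred, target, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) below `threshold`."""
    # Assumption: target <= 0 marks pixels without ground-truth depth.
    valid = [(p, t) for p, t in zip(pred, target) if t > 0]
    hits = sum(1 for p, t in valid if p > 0 and max(p / t, t / p) < threshold)
    return hits / len(valid)


# Toy depths in meters; the last pixel has no ground truth and is skipped.
pred_m = [1.0, 2.2, 5.0, 10.0]
target_m = [1.1, 2.0, 3.0, 0.0]

# Ratios are 1.1, 1.1 and about 1.67, so two of three valid pixels pass.
score = delta1(pred_m, target_m)
print(round(score, 3))
```

Because the ratio is taken on absolute depth values, δ₁ penalizes scale errors, which is what makes it suitable for metric (rather than relative) depth.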

Surface Normal: Average over 3 Benchmarks, Mean Angular Error in degrees (lower is better)

Method                      Mean Angular Error (°)
Marigold                    19.606
StableNormal                17.168
DSINE                       17.017
Lotus-2                     16.558
Vision Banana 🍌            15.549
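
The mean angular error above can be illustrated with a short sketch: for each pixel, take the angle between the predicted and ground-truth normal vectors, then average in degrees. The vectors below are invented, and per-benchmark details such as valid-pixel masking are omitted.

```python
import math


def angular_error_deg(n_pred, n_gt):
    """Angle in degrees between two 3D normal vectors (not necessarily unit length)."""
    dot = sum(a * b for a, b in zip(n_pred, n_gt))
    norm = math.sqrt(sum(a * a for a in n_pred)) * math.sqrt(sum(b * b for b in n_gt))
    # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))


def mean_angular_error(pred_normals, gt_normals):
    """Average per-pixel angular error in degrees."""
    errors = [angular_error_deg(p, g) for p, g in zip(pred_normals, gt_normals)]
    return sum(errors) / len(errors)


pred_normals = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
gt_normals = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0)]

# Per-pixel errors are about 0, 90 and 45 degrees, so the mean is about 45.
mae = mean_angular_error(pred_normals, gt_normals)
print(round(mae, 2))
```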

Contributors


Project Leads
Valentin Gabeur*  ·  Shangbang Long*  ·  Songyou Peng*
* Equal contribution
Core Contributors
Paul Voigtlaender  ·  Shuyang Sun  ·  Yanan Bao  ·  Karen Truong  ·  Zhicheng Wang  ·  Wenlei Zhou  ·  Jonathan T. Barron  ·  Kyle Genova  ·  Nithish Kannen  ·  Sherry Ben  ·  Yandong Li  ·  Mandy Guo  ·  Suhas Yogin
Project Advisors
Yiming Gu  ·  Huizhong Chen
Leadership Sponsors
Oliver Wang  ·  Saining Xie  ·  Howard Zhou  ·  Kaiming He  ·  Thomas Funkhouser  ·  Jean-Baptiste Alayrac  ·  Radu Soricut
Acknowledgements

We thank Xi Chen, Fei Xia, Kaushik Shivakumar, Abhishek Sinha, Phillip Lippe, Yilin Gao, Javier Rey, Sanghyun Woo, Renshen Wang, Wentao Yuan, Keran Rong, Rundi Wu, Manoj Kumar, Manli Shu, Francesco Piccinno, Ishita Dasgupta, Benigno Uria, Miki Rubinstein, Aäron van den Oord, and Jon Shlens for their helpful discussions, advice, and technical guidance.

BibTeX
@article{visionbanana2026,
  title={Image Generators are Generalist Vision Learners},
  author={Gabeur, Valentin and Long, Shangbang and Peng, Songyou and Voigtlaender, Paul and Sun, Shuyang and Bao, Yanan and Truong, Karen and Wang, Zhicheng and Zhou, Wenlei and Barron, Jonathan T and Genova, Kyle and Kannen, Nithish and Ben, Sherry and Li, Yandong and Guo, Mandy and Yogin, Suhas and Gu, Yiming and Chen, Huizhong and Wang, Oliver and Xie, Saining and Zhou, Howard and He, Kaiming and Funkhouser, Thomas and Alayrac, Jean-Baptiste and Soricut, Radu},
  journal={arXiv preprint arXiv:2604.20329},
  year={2026}
}