Vision Banana | Google DeepMind

Capabilities


Semantic Segmentation

Input Segmentation

Prompt: This image is a per-pixel class labeling of the input. The macaron cakes are represented by (255, 255, 0). The round plates are represented by (255, 192, 128). The slice cakes are depicted in (64, 192, 64). The flowers are shown in (128, 0, 64). The tongs are (255, 0, 192).

Input Segmentation

Prompt: Generate a visualization image of semantic segmentation, using this color mapping: {"cat ears": <255, 165, 0>, "exit sign": <0, 0, 255>, "background": <125, 0, 125>}

Input Segmentation

Prompt: Conduct per-class semantic segmentation for the given image. The sitting person are represented by (255, 255, 0). The standing and walking people are represented by (255, 192, 128). The ocean is depicted in (64, 192, 64). The street lights are in (128, 0, 64). The sky is in (255, 0, 192). The fence is in (0, 0, 255). The backpack is in (255, 0, 0).
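Because these prompts fix one color per class, a generated image can in principle be turned back into a label map by assigning each pixel to the closest palette color. The following is a minimal sketch of that decoding step, not the authors' post-processing: the function name `decode_semantic_map`, the nearest-color rule, and the list-of-tuples image format are all assumptions made for illustration. It is written in plain Python to stay self-contained; a practical version would likely be vectorized.

```python
def decode_semantic_map(image, palette):
    """Assign each pixel to the class whose palette color is nearest.

    image:   list of rows, each a list of (R, G, B) tuples (assumed input format).
    palette: dict mapping class name -> (R, G, B), as stated in the prompt.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    names = list(palette)
    return [
        [min(names, key=lambda n: sq_dist(pixel, palette[n])) for pixel in row]
        for row in image
    ]


# Palette taken from the second prompt above.
palette = {
    "cat ears": (255, 165, 0),
    "exit sign": (0, 0, 255),
    "background": (125, 0, 125),
}

# A tiny 2x2 stand-in for a generated image, with slightly noisy colors.
generated = [
    [(250, 160, 5), (3, 2, 251)],
    [(120, 4, 130), (125, 0, 125)],
]

labels = decode_semantic_map(generated, palette)
# labels == [["cat ears", "exit sign"], ["background", "background"]]
```

Nearest-color matching tolerates small color drift in the generated image, which is why the noisy toy pixels still resolve to the intended classes.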


Instance Segmentation

Input Segmentation

Prompt: Generate an instance segmentation visualization of this image. Each piece of garlic is colored differently.

Input Segmentation

Prompt: Generate an instance segmentation visualization of the input image. Segment all the price on the price tags, color them differently.

Input Segmentation

Prompt: Generate an instance segmentation visualization of this image. Each price tag is colored differently.

Input Segmentation

Prompt: This image shows segmentation masks for the basketballs from the input image. The background is set to #10aa05. Each basketball instance is represented by a solid circular mask, and a different color is used for each mask.
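When the prompt names a background color, as with #10aa05 above, instance masks can be recovered by grouping the remaining pixels by color. The sketch below assumes exact color matches and a list-of-tuples image; `hex_to_rgb` and `split_instances` are hypothetical helper names, and real outputs would probably need a color tolerance and connected-component analysis on top of this.

```python
from collections import defaultdict


def hex_to_rgb(code):
    """Convert '#rrggbb' to an (R, G, B) tuple."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


def split_instances(image, background):
    """Group non-background pixel coordinates (row, col) by their exact color."""
    instances = defaultdict(list)
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if pixel != background:
                instances[pixel].append((y, x))
    return dict(instances)


background = hex_to_rgb("#10aa05")  # background color named in the prompt

# Toy 2x3 output: two instances (red and blue) on the prompt's background.
generated = [
    [(255, 0, 0), background, (0, 0, 255)],
    [(255, 0, 0), background, background],
]

masks = split_instances(generated, background)
# masks == {(255, 0, 0): [(0, 0), (1, 0)], (0, 0, 255): [(0, 2)]}
```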


Referring Expression Segmentation

Input Segmentation

Prompt: This image shows segmentation masks from the given image. The background is black color. The chef's names in both Chinese and English are rendered as cyan color.

Input Segmentation

Prompt: A segmentation map image. The stretching cat is rendered in green, the cat that is cleaning itself is in cyan.

Input Segmentation

Prompt: This image shows segmentation masks from the given image. The background is black color. The game control device is represented by a solid yellow.

Input Segmentation

Prompt: A segmentation map image. The area that corresponds to the man in pink t shirt is rendered solid white; the other man is rendered in green.

Input Segmentation

Prompt: A segmentation map of the input image. The pig not in the glass is rendered cyan, and the pig in the glass reflection is rendered yellow.


Monocular Metric Depth Estimation

Input Depth

Prompt: Predict the metric depth of this scene as an image. Visualized in the rainbow colormap.

Input Depth

Prompt: Predict the metric depth of this scene as an image. Visualized in the rainbow (black-red-yellow-green-cyan-blue-violet-white) color palette.

Input Depth

Prompt: Predict the metric depth of this scene as an image. Visualized in the rainbow (black-red-yellow-green-cyan-blue-violet-white) color palette.

Input Depth

Prompt: Generate a metric depth map of the provided image.

Input Depth

Prompt: Generate a metric depth map of the input image.

Input Depth

Prompt: Generate a metric depth map of the input image.

Input Depth

Prompt: Generate a metric depth map of the input image.
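The rainbow palette named in the prompts above (black-red-yellow-green-cyan-blue-violet-white) suggests that depth is encoded as a position along a color ramp. The sketch below recovers that position as a value in [0, 1] by projecting a pixel onto the nearest palette segment. Several things here are assumptions rather than facts from this page: the exact RGB value of each stop (violet in particular), the even spacing of stops, and the name `palette_position`. The page also does not state how palette position maps to meters or which end is near, so the result is left normalized.

```python
# Assumed RGB values for the stops named in the prompt; the page does not give them.
STOPS = [
    (0, 0, 0),        # black
    (255, 0, 0),      # red
    (255, 255, 0),    # yellow
    (0, 255, 0),      # green
    (0, 255, 255),    # cyan
    (0, 0, 255),      # blue
    (128, 0, 255),    # violet (assumed value)
    (255, 255, 255),  # white
]


def palette_position(pixel, stops=STOPS):
    """Return t in [0, 1]: the position of the palette point nearest to pixel."""
    best_t, best_d = 0.0, float("inf")
    n_seg = len(stops) - 1
    for i in range(n_seg):
        a, b = stops[i], stops[i + 1]
        ab = [q - p for p, q in zip(a, b)]
        ap = [q - p for p, q in zip(a, pixel)]
        # Project the pixel onto segment a->b and clamp to the segment.
        s = sum(u * v for u, v in zip(ab, ap)) / sum(u * u for u in ab)
        s = max(0.0, min(1.0, s))
        closest = [p + s * u for p, u in zip(a, ab)]
        d = sum((c - q) ** 2 for c, q in zip(closest, pixel))
        if d < best_d:
            best_t, best_d = (i + s) / n_seg, d
    return best_t


t_black = palette_position((0, 0, 0))        # start of the ramp
t_white = palette_position((255, 255, 255))  # end of the ramp
t_orange = palette_position((255, 128, 0))   # between red and yellow
```

Converting `t` to metric depth would require the model's actual depth-to-color mapping, which is not given here.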


Surface Normal Estimation

Input Surface Normal

Prompt: Predict the surface normal of this scene.

Input Surface Normal

Prompt: Generate a surface normal map of the input image.

Input Surface Normal

Prompt: Generate a surface normal map of the input image.

Input Surface Normal

Prompt: Generate a surface normal map of the input image.
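Surface normal maps are commonly stored as RGB images in which each channel c in [0, 255] encodes one vector component as 2c/255 - 1. The sketch below decodes a pixel under that convention. Whether Vision Banana's outputs follow this encoding, and which axis convention they use, is not stated on this page, so both are assumptions; `decode_normal` is a hypothetical helper name.

```python
import math


def decode_normal(pixel):
    """Map an (R, G, B) pixel to a unit normal, assuming n = 2 * c / 255 - 1 per channel."""
    v = [2.0 * c / 255.0 - 1.0 for c in pixel]
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return (0.0, 0.0, 0.0)  # degenerate input; no direction can be recovered
    return tuple(x / norm for x in v)


# Under the assumed convention, this bluish pixel points almost exactly along +z.
n = decode_normal((128, 128, 255))
```

Renormalizing after decoding matters because 8-bit quantization means the raw decoded vector rarely has exactly unit length.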


Results

Vision Banana achieves state-of-the-art results in the zero-shot transfer setting across 2D and 3D vision tasks.

2D Understanding

Cityscapes, Semantic Segmentation (mIoU ↑, higher is better)

SegMan-L (Non Zero-Shot): 0.842
APE-D: 0.442
OpenSeeD: 0.478
X-Decoder: 0.520
SAM 3: 0.652
Vision Banana: 0.699
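For reference, mIoU averages per-class intersection-over-union. The sketch below computes it for a single pair of flattened label maps. Benchmark protocols typically accumulate intersections and unions over the whole dataset before averaging, and this page does not describe its exact evaluation code, so treat this as an illustration of the formula only; `mean_iou` is a hypothetical name.

```python
def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes that appear in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and g == c for p, g in zip(pred, gt))
        union = sum(p == c or g == c for p, g in zip(pred, gt))
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Flattened toy label maps with three classes.
pred = [0, 0, 1, 1, 2, 2]
gt = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, gt, num_classes=3)
# Per-class IoUs are 1/3, 2/3 and 1/2, so score is 0.5.
```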

SA-Co/Gold, Instance Segmentation (pmF1 ↑, higher is better)

SAM 3 (Non Zero-Shot): 0.661
APE-D: 0.369
OWLv2: 0.420
Gemini 2.5: 0.461
Vision Banana: 0.540
DINO-X: 0.552

* Evaluated on 500 randomly sampled queries.

RefCOCOg val (UMD) (cIoU ↑, higher is better)

HyperSeg + Phi2 (Non Zero-Shot): 0.794
X-SAM + Phi3 (Non Zero-Shot): 0.838
HybridGL: 0.513
Kang + LLaVA: 0.677
SAM 3 + Gemini 2.5 Pro: 0.734
Vision Banana: 0.738

ReasonSeg val (gIoU ↑, higher is better)

X-SAM + Phi3 3.8B (Non Zero-Shot): 0.566
LISA-13B-LLAVA1.5 (Non Zero-Shot): 0.650
SegZero: 0.626
RSVP + GPT-4o: 0.647
SAM 3 + Gemini 2.5 Pro: 0.770
Vision Banana + Gemini 2.5 Pro: 0.793

Methods paired with MLLMs for reasoning.
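The two referring-segmentation metrics above aggregate differently. cIoU divides the total intersection by the total union over all samples, so large objects dominate. gIoU, as the term is used in reasoning segmentation, is the mean of per-sample IoUs, so every sample counts equally (it is not the "generalized IoU" box loss). The sketch below contrasts them on toy binary masks; the function names are hypothetical and handling of empty masks is omitted.

```python
def iou_counts(pred, gt):
    """Return (intersection, union) pixel counts for two flattened binary masks."""
    inter = sum(p and g for p, g in zip(pred, gt))
    union = sum(p or g for p, g in zip(pred, gt))
    return inter, union


def ciou(pairs):
    """Cumulative IoU: total intersection divided by total union."""
    counts = [iou_counts(p, g) for p, g in pairs]
    return sum(i for i, _ in counts) / sum(u for _, u in counts)


def giou(pairs):
    """Mean of per-sample IoUs."""
    counts = [iou_counts(p, g) for p, g in pairs]
    return sum(i / u for i, u in counts) / len(counts)


# Two toy samples: a small object segmented poorly, a large one segmented perfectly.
pairs = [
    ([1, 0, 0, 0], [1, 1, 0, 0]),              # IoU = 1/2
    ([1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]),  # IoU = 6/6
]

c = ciou(pairs)  # (1 + 6) / (2 + 6) = 0.875
g = giou(pairs)  # (0.5 + 1.0) / 2 = 0.75
```

The gap between the two values shows why scores on the two benchmarks are not directly comparable.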

3D Understanding

Metric Depth, Average over 6 Benchmarks (δ₁ ↑, higher is better)

Depth Pro: 0.715
MoGe-2: 0.802
UniK3D: 0.823
Vision Banana: 0.882

Vision Banana does not use camera intrinsics in training or inference.
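The δ₁ score is conventionally the fraction of valid pixels whose ratio between predicted and true depth, taken in whichever direction is larger, falls below 1.25. The sketch below implements that standard definition on flattened depth lists. The page does not describe its evaluation protocol (valid-pixel masks, depth caps, and so on), so those details are left out, and `delta1` is a hypothetical name.

```python
def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels with valid ground truth where max(pred/gt, gt/pred) < threshold."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    # Non-positive predictions count as misses rather than being dropped.
    hits = sum(p > 0 and max(p / g, g / p) < threshold for p, g in valid)
    return hits / len(valid)


# Toy depths in meters; the third prediction is off by more than 25%.
pred = [1.0, 2.2, 5.0, 10.0]
gt = [1.1, 2.0, 3.0, 9.0]
score = delta1(pred, gt)  # 3 of 4 pixels pass, so 0.75
```

Because the metric compares absolute depths, it penalizes scale errors, which is what makes it meaningful for metric (rather than relative) depth.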

Surface Normal, Average over 3 Benchmarks (Mean Angular Error in degrees ↓, lower is better)

Marigold: 19.606
StableNormal: 17.168
DSINE: 17.017
Lotus-2: 16.558
Vision Banana: 15.549
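Mean angular error is the average angle between predicted and ground-truth normals, obtained from the arccosine of their dot product. A minimal sketch, assuming both inputs are already unit vectors and ignoring any valid-pixel masking the benchmarks may apply; `mean_angular_error_deg` is a hypothetical name.

```python
import math


def mean_angular_error_deg(pred, gt):
    """Mean angle in degrees between paired unit normals."""
    angles = []
    for p, g in zip(pred, gt):
        dot = sum(a * b for a, b in zip(p, g))
        dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
        angles.append(math.degrees(math.acos(dot)))
    return sum(angles) / len(angles)


# One exact match (0 degrees) and one perpendicular prediction (90 degrees).
pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
error = mean_angular_error_deg(pred, gt)  # 45.0
```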

Contributors

Project Leads

Valentin Gabeur* · Shangbang Long* · Songyou Peng*

* Equal contribution

Core Contributors

Paul Voigtlaender · Shuyang Sun · Yanan Bao · Karen Truong · Zhicheng Wang · Wenlei Zhou · Jonathan T. Barron · Kyle Genova · Nithish Kannen · Sherry Ben · Yandong Li · Mandy Guo · Suhas Yogin

Project Advisors

Yiming Gu · Huizhong Chen

Leadership Sponsors

Oliver Wang · Saining Xie · Howard Zhou · Kaiming He · Thomas Funkhouser · Jean-Baptiste Alayrac · Radu Soricut

Acknowledgements

We thank Xi Chen, Fei Xia, Kaushik Shivakumar, Abhishek Sinha, Phillip Lippe, Yilin Gao, Javier Rey, Sanghyun Woo, Renshen Wang, Wentao Yuan, Keran Rong, Rundi Wu, Manoj Kumar, Manli Shu, Francesco Piccinno, Ishita Dasgupta, Benigno Uria, Miki Rubinstein, Aäron van den Oord, and Jon Shlens for their helpful discussions, advice, and technical guidance.

BibTeX

@article{visionbanana2026,
  title={Image Generators are Generalist Vision Learners},
  author={Gabeur, Valentin and Long, Shangbang and Peng, Songyou and Voigtlaender, Paul and Sun, Shuyang and Bao, Yanan and Truong, Karen and Wang, Zhicheng and Zhou, Wenlei and Barron, Jonathan T and Genova, Kyle and Kannen, Nithish and Ben, Sherry and Li, Yandong and Guo, Mandy and Yogin, Suhas and Gu, Yiming and Chen, Huizhong and Wang, Oliver and Xie, Saining and Zhou, Howard and He, Kaiming and Funkhouser, Thomas and Alayrac, Jean-Baptiste and Soricut, Radu},
  journal={arXiv preprint arXiv:2604.20329},
  year={2026}
}
