[Image galleries: Vision Banana's generation results, showing segmentation masks, instance masks, referred object masks, depth maps, and surface normal maps.]
Vision Banana achieves state-of-the-art results in the zero-shot transfer setting across 2D and 3D vision tasks.
We thank Xi Chen, Fei Xia, Kaushik Shivakumar, Abhishek Sinha, Phillip Lippe, Yilin Gao, Javier Rey, Sanghyun Woo, Renshen Wang, Wentao Yuan, Keran Rong, Rundi Wu, Manoj Kumar, Manli Shu, Francesco Piccinno, Ishita Dasgupta, Benigno Uria, Miki Rubinstein, Aäron van den Oord, and Jon Shlens for their helpful discussions, advice, and technical guidance.
@article{visionbanana2026,
  title={Image Generators are Generalist Vision Learners},
  author={Gabeur, Valentin and Long, Shangbang and Peng, Songyou and Voigtlaender, Paul and Sun, Shuyang and Bao, Yanan and Truong, Karen and Wang, Zhicheng and Zhou, Wenlei and Barron, Jonathan T and Genova, Kyle and Kannen, Nithish and Ben, Sherry and Li, Yandong and Guo, Mandy and Yogin, Suhas and Gu, Yiming and Chen, Huizhong and Wang, Oliver and Xie, Saining and Zhou, Howard and He, Kaiming and Funkhouser, Thomas and Alayrac, Jean-Baptiste and Soricut, Radu},
  journal={arXiv preprint arXiv:2604.20329},
  year={2026}
}