Human-level 3D shape perception
emerges from multi-view learning

Tyler Bonnen, Jitendra Malik, and Angjoo Kanazawa

University of California, Berkeley


tl;dr: neural networks trained on multi-view sensory data
are the first to match human-level 3D shape inferences


How do humans perceive the three-dimensional structure of objects from two-dimensional visual inputs?

Understanding this ability has been a longstanding goal for both the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we show that human-level 3D shape inferences emerge naturally when training neural networks with a visual-spatial learning objective on naturalistic sensory data. These results provide a bridge between cognitive theories and current practice in deep learning, offering a novel route towards more human-like vision models.

Let's begin with a simple test from the cognitive sciences

To understand 3D perception, the cognitive sciences have developed incisive tests. This is a good one: In the example trials below, two images depict the same object from different viewpoints, while one depicts a different object. Can you identify which is the different object? Take a moment to decide, then click the 'oddity' when you think you've got it.

This task design lets us evaluate 3D perception using arbitrary objects, which provides a good estimate of our 'zero-shot' visual abilities. For example, we can parametrically vary the task difficulty (e.g., by increasing between-object similarity or varying lighting and viewpoints) in a way that disentangles 3D shape perception from other visual properties (e.g., texture).

Standard vision models fail at this task

Humans can reliably infer the 3D structure of objects like the ones above, but standard vision models (e.g., DINOv2, CLIP) fail on these tasks, even when we scale up model size. We’ve observed this failure across a wide range of architectures, training objectives, and datasets. Understanding why models fail has both technical and conceptual implications for the cognitive sciences.

Model performance vs. scale on MOCHI benchmark

Scaling up model size improves performance on 3D perception tasks, but even the largest models fall far short of humans. These data come from our prior work on Multi-view Object Consistency in Humans and Image models (MOCHI), a large-scale benchmark that combines many stimulus types (like the ones above) with behavioral data from hundreds of people.
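For intuition, here is a minimal sketch of this kind of evaluation for a standard image encoder: embed each image in a trial independently, compute pairwise cosine similarities, and pick the image least similar to the others as the 'oddity'. The model choice (DINOv2 via Hugging Face), the file names, and the `embed` helper are illustrative assumptions rather than the exact MOCHI evaluation code.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# A standard self-supervised vision model (DINOv2 here); any image encoder
# that yields a global embedding would be evaluated the same way.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

def embed(path: str) -> torch.Tensor:
    """Return a single global (CLS) embedding for one image."""
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state[:, 0]  # (1, dim)

def predict_oddity(image_paths: list[str]) -> int:
    """Pick the image whose average similarity to the others is lowest."""
    embs = F.normalize(torch.cat([embed(p) for p in image_paths]), dim=-1)
    sims = embs @ embs.T                      # pairwise cosine similarities
    sims.fill_diagonal_(0.0)
    mean_sim = sims.sum(dim=1) / (len(image_paths) - 1)
    return int(mean_sim.argmin())             # least-similar image = predicted oddity

# One trial: two views of the same object plus one view of a different object.
trial = ["object_A_view1.png", "object_A_view2.png", "object_B_view1.png"]
print("predicted oddity index:", predict_oddity(trial))
```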

Cognitive theories offer competing interpretations of model failure

Two cognitive theories of visual development offer different interpretations of why models fail on 3D tasks:

Nativists argue that perceiving 3D structure requires built-in, domain-specific knowledge—innate priors that provide the constraints necessary for learning. Under this view, models fail because they lack the right inductive biases.

Empiricists argue that 3D perception emerges from general-purpose learning over natural sensory experience. Under this view, models fail because they learn from the wrong kind of data, which doesn't reflect natural visual experience.

There is over a century of empirical data related to these theories, but we lack computational methods to evaluate them.

What kind of sensory data do we learn from?

As infants, we generate structured, multi-sensory experience that is inherently sequential, unfolding over time. For example, retinal inputs, binocular depth, and self-motion signals might provide powerful self-supervision to guide perceptual learning. There is a rich history in the cognitive sciences characterizing the developmental stages associated with these different sensory signals, and in recent years, head-mounted cameras have made it possible to capture these sensory data in unprecedented detail. In short, if we hope to build models of human perception, we need to learn not just from different amounts of data, but from different types of data.

Vision (retinal)
Depth (stereo)
Self-motion (vestibular)

Developmental psychologists (e.g., Bria Long and Mike Frank) have developed powerful methods and datasets to understand the visual experiences of developing children. For illustrative purposes, here we visualize headcam data provided by Bria Long, alongside depth and camera motion signals that we have automatically extracted.
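As a concrete illustration of how such spatial signals can be recovered from raw headcam video, here is a minimal sketch of per-frame monocular depth estimation with an off-the-shelf model (MiDaS, loaded via torch.hub). The file name is a placeholder, and this is not necessarily the pipeline used for the visualization above.

```python
import cv2
import torch

# Off-the-shelf monocular depth estimator (MiDaS), loaded from torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large").eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform  # preprocessing matched to DPT_Large

# Load a single headcam frame (placeholder path) and convert BGR -> RGB.
frame = cv2.cvtColor(cv2.imread("headcam_frame.png"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    prediction = midas(transform(frame))          # (1, H', W') relative depth
    depth = torch.nn.functional.interpolate(      # resize back to the frame size
        prediction.unsqueeze(1),
        size=frame.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

print("per-pixel relative depth map:", depth.shape)
```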

A new class of models leverages these visual-spatial learning signals

A recent class of vision transformers learns from structured visual-spatial data. Concretely, given sets of images from different viewpoints, these models (e.g., DUSt3R, MASt3R, π3, and VGGT) learn to predict spatial information associated with these images, such as depth, camera pose, and geometric correspondence. This modeling strategy explicitly aims to remove hard-coded inductive biases and to let geometric understanding emerge from the predictive relationship between images and spatial information. In a sense, these models are the empiricist's ideal: they must learn the geometric structure of the environment given only visual-spatial data that are analogous to human sensory signals.

Input images
Predicted depth
Predicted cameras

Given a sequence of images (left), multi-view transformers like VGGT learn to predict associated spatial information, such as per-frame depth maps (center) and camera poses (right). These signals are analogous to sensory data available to humans.
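To make the input/output structure concrete, here is a schematic sketch of the multi-view prediction setup. `ToyMultiViewModel` is a hypothetical stand-in for models like VGGT or DUSt3R; it illustrates only the contract (a set of views in, per-view depth, camera pose, and confidence out), not their actual architectures or APIs.

```python
import torch

class ToyMultiViewModel(torch.nn.Module):
    """Hypothetical stand-in for a multi-view transformer (e.g., VGGT, DUSt3R).

    Real models are large ViT-style networks; here we only illustrate the
    input/output contract: a set of views in, per-view spatial predictions out.
    """
    def forward(self, views: torch.Tensor) -> dict[str, torch.Tensor]:
        n, _, h, w = views.shape
        return {
            "depth":      torch.rand(n, h, w),   # per-frame depth map
            "camera":     torch.rand(n, 4, 4),   # per-frame camera pose (extrinsics)
            "confidence": torch.rand(n, h, w),   # per-pixel prediction confidence
        }

# Training pairs image sets with spatial targets (depth, pose) derived from
# multi-view geometry, so no 3D priors are hard-coded into the architecture.
views = torch.rand(3, 3, 224, 224)               # three RGB views of one scene
outputs = ToyMultiViewModel()(views)
print({k: tuple(v.shape) for k, v in outputs.items()})
```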

We develop a zero-shot evaluation framework for 'multi-view' models

To evaluate this new class of vision transformers, we develop a series of zero-shot metrics. To estimate model performance on a given trial, we encode all pairwise combinations of images and extract the model’s internal confidence estimate (a measure of uncertainty used during training). We average across these pairs, select the model’s 'oddity' as the image with the lowest average pairwise confidence, and compare to ground truth. This gives us trial-level behavioral readouts with no fine-tuning and no task-specific training.

Evaluation protocol: encode image pairs, extract pairwise uncertainty, and identify the non-matching object as the image with the lowest average pairwise confidence

Estimating model performance. For each trial (left), we encode all image pairs (center left) and extract the model’s uncertainty (center right), a response the model already produces during training. We average over these pairs to determine the non-matching object (right). We use the confidence margin (Δ) to predict human error patterns, and an independent 'solution layer' analysis to predict human reaction time data.
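In code, this readout reduces to a small computation over pairwise confidences. Below is a minimal sketch: `pairwise_confidence` is a placeholder for running the model on one image pair and returning its scalar confidence, and the toy usage at the end substitutes a stand-in function for the real model.

```python
import itertools
import numpy as np

def predict_oddity_multiview(images: list[np.ndarray], pairwise_confidence) -> tuple[int, float]:
    """Zero-shot oddity readout from pairwise model confidences."""
    n = len(images)
    conf = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        c = pairwise_confidence(images[i], images[j])  # encode one image pair
        conf[i, j] = conf[j, i] = c

    # Average each image's confidence against all other images in the trial;
    # the image the model is least confident about matching is the oddity.
    mean_conf = conf.sum(axis=1) / (n - 1)
    oddity = int(mean_conf.argmin())

    # Confidence margin Δ: gap between the chosen oddity and the runner-up.
    # Smaller margins should correspond to harder trials for humans.
    margin = float(np.sort(mean_conf)[1] - mean_conf[oddity])
    return oddity, margin

# Toy usage with a stand-in confidence function (real code would query the model).
rng = np.random.default_rng(0)
toy_images = [rng.random((224, 224, 3)) for _ in range(3)]
toy_confidence = lambda a, b: float(np.corrcoef(a.ravel(), b.ravel())[0, 1])
print(predict_oddity_multiview(toy_images, toy_confidence))
```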

There is an emergent alignment between model and human perception

These multi-view transformers are the first vision models to match human performance on 3D shape inferences. Using the zero-shot evaluation approach outlined above, we find that VGGT matches human-level accuracy, while large vision models trained only on static images lag far behind (left). Given this correspondence, we develop a series of independent model readouts to probe the granularity of human-model alignment. We find that human error patterns are predicted by the confidence margin (center), and that human reaction time correlates with the amount of compute the model requires to arrive at the correct answer, evident in its 'solution layer' (right). Critically, this human-model correspondence emerges from multi-view learning alone, without training on any experimental behavior or images.

VGGT matches human accuracy, predicts error patterns, and correlates with reaction time

Left: VGGT matches human accuracy and dramatically outperforms standard vision models used in the cognitive sciences. Center: Model confidence margin predicts human error patterns. Right: Model solution layer predicts human RT. Hover to highlight.
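A minimal sketch of how this trial-level alignment can be quantified with rank correlations, assuming per-trial arrays of model confidence margins, human accuracies, model solution layers, and human reaction times (the random placeholders below stand in for the real MOCHI data):

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder per-trial arrays; in practice these come from the model readouts
# described above and the MOCHI human behavioral data.
rng = np.random.default_rng(0)
n_trials = 200
model_margin = rng.random(n_trials)                    # confidence margin Δ per trial
human_accuracy = rng.random(n_trials)                  # proportion of humans correct
model_solution_layer = rng.integers(0, 24, n_trials)   # layer where the answer emerges
human_rt = rng.random(n_trials)                        # mean human reaction time

# Error-pattern alignment: trials with smaller margins should be the trials
# humans get wrong more often.
rho_err, p_err = spearmanr(model_margin, human_accuracy)

# Temporal alignment: trials that take the model more layers to 'solve'
# should take humans longer to answer.
rho_rt, p_rt = spearmanr(model_solution_layer, human_rt)

print(f"margin vs. human accuracy: rho={rho_err:.2f} (p={p_err:.3g})")
print(f"solution layer vs. human RT: rho={rho_rt:.2f} (p={p_rt:.3g})")
```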

Models appear to use hierarchical correspondence to solve this task

These behavioral correspondences raise a mechanistic question: what kind of internal representation supports this human-like performance? To investigate, we visualize VGGT’s cross-image attention across layers. For each query point on image A, we show where the model attends on the matching object (A′) versus the non-matching object (B). In early layers, attention is diffuse and undifferentiated—the model has not yet distinguished the two objects. By intermediate layers, each query point on A elicits focused attention to the corresponding spatial location on A′—the same part of the object, seen from a different angle—while attention on B remains scattered. The model appears to represent object similarity not through abstract category-level features, but through concrete, part-level spatial correspondence that emerges progressively across the network’s depth.


Cross-image attention at layers 0 (early), 12, and 23 (late), shown for the source image (A), the matching object (A′), and the oddity (B).
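For readers who want to probe this themselves, here is a schematic sketch of the visualization: given an attention matrix over the concatenated tokens of the three images, read off how a chosen query patch in A attends to the patches of A′ and B. The patch grid, head count, and token layout below are illustrative assumptions, not VGGT's actual internals.

```python
import torch

grid = 14                                   # 14x14 patch grid per image (224px / 16px patches)
num_patches = grid * grid
num_heads = 12

# Cross-image attention over the concatenated tokens [A | A' | B] at one layer.
# (Random weights here; in practice these come from hooks on the model's attention blocks.)
attn = torch.rand(num_heads, 3 * num_patches, 3 * num_patches).softmax(dim=-1)

def attention_map(attn: torch.Tensor, query_idx: int, view: int) -> torch.Tensor:
    """Average attention from one query patch onto all patches of `view`
    (0 = A, 1 = A', 2 = B), reshaped into the patch grid for display."""
    start, end = view * num_patches, (view + 1) * num_patches
    weights = attn[:, query_idx, start:end].mean(dim=0)  # average over heads
    return weights.reshape(grid, grid)

query = 5 * grid + 7                                 # a query patch on image A (row 5, col 7)
map_on_match = attention_map(attn, query, view=1)    # ideally focused on the matching part of A'
map_on_oddity = attention_map(attn, query, view=2)   # typically diffuse over B
print(map_on_match.shape, map_on_oddity.shape)
```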

Bridging cognitive theory and machine learning

We find that neural networks trained on multi-view correspondence, with no exposure to human experimental data, predict the accuracy, error patterns, and temporal dynamics of human 3D shape inferences. Critically, our zero-shot evaluation approach rules out the possibility that this correspondence is an artifact of task-specific training or linear re-weighting. Rather, the design and optimization of this model lead to a natural alignment with human behavior. It is striking that the first models to match human performance align so closely with empiricist theories of perceptual learning.

These findings provide a computational bridge between cognitive theory and current practices in deep learning. The empiricist claim that perception emerges from general-purpose learning over structured sensory experience—rather than from innate, domain-specific knowledge—is, in many ways, a precursor to the prevailing deep learning paradigm. The emergent alignment between multi-view transformers and human perception suggests that this is not only an intellectual lineage, but an opportunity to formalize and evaluate longstanding theories of human perception.

Open questions and future directions

Our work raises exciting questions at the intersection of cognitive science, neuroscience, and computer science. What learning objectives and data distributions are essential for human-like 3D perception? How can we design models that better reflect the architectural and algorithmic constraints on human vision? Can we learn from infant-scale data? These are questions that extend well beyond 3D shape inferences. We're actively pursuing these ideas and are always looking for new collaborators interested in this multi-disciplinary approach. Feel free to reach out: bonnen@berkeley.edu