Human-level 3D shape perception
emerges from multi-view learning

Tyler Bonnen, Jitendra Malik, and Angjoo Kanazawa

University of California, Berkeley


tl;dr: neural networks trained on multi-view sensory data
are the first to match human-level 3D shape inferences


How do humans perceive the three-dimensional structure of objects from two-dimensional visual inputs?

Understanding this ability has been a longstanding goal for both the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we show that human-level 3D shape inferences emerge naturally when training neural networks with a visual-spatial learning objective on naturalistic sensory data. These results provide a bridge between cognitive theories and current practice in deep learning, offering a novel route towards more human-like vision models.

Let's begin with a simple test from the cognitive sciences

To understand 3D perception, the cognitive sciences have developed incisive tests. This is a good one: In the example trials below, two images depict the same object from different viewpoints, while one depicts a different object. Can you identify which is the different object? Take a moment to decide, then click the 'oddity' when you think you've got it.

This task design lets us evaluate 3D perception using arbitrary objects, which provides a good estimate of our 'zero-shot' visual abilities. For example, we can parametrically vary the task difficulty (e.g., by increasing between-object similarity or varying lighting and viewpoints) in a way that disentangles 3D shape perception from other visual properties (e.g., texture).

Standard vision models fail at this task

Humans can reliably infer the 3D structure of objects like the ones above, but standard vision models (e.g., DINOv2, CLIP) fail on these tasks, even when we scale up model size. We’ve observed this failure across a wide range of architectures, training objectives, and datasets. Understanding why models fail has both technical and conceptual implications for the cognitive sciences.

Model performance vs. scale on MOCHI benchmark

Scaling up model size improves performance on 3D perception tasks, but even the largest models fall far short of humans. These data come from our prior work on Multi-view Object Consistency in Humans and Image models (MOCHI), a large-scale benchmark that combines many stimulus types (like the ones above) with behavioral data from hundreds of people.
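For intuition, here is a minimal sketch of this kind of evaluation for a standard image encoder: embed each image in a trial independently, compute pairwise cosine similarities, and pick the image least similar to the others as the 'oddity'. The model choice (DINOv2 via Hugging Face), the file names, and the `embed` helper are illustrative assumptions rather than the exact MOCHI evaluation code.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# A standard self-supervised vision model (DINOv2 here); any image encoder
# that yields a global embedding would be evaluated the same way.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

def embed(path: str) -> torch.Tensor:
    """Return a single global (CLS) embedding for one image."""
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state[:, 0]  # (1, dim)

def predict_oddity(image_paths: list[str]) -> int:
    """Pick the image whose average similarity to the others is lowest."""
    embs = F.normalize(torch.cat([embed(p) for p in image_paths]), dim=-1)
    sims = embs @ embs.T                      # pairwise cosine similarities
    sims.fill_diagonal_(0.0)
    mean_sim = sims.sum(dim=1) / (len(image_paths) - 1)
    return int(mean_sim.argmin())             # least-similar image = predicted oddity

# One trial: two views of the same object plus one view of a different object.
trial = ["object_A_view1.png", "object_A_view2.png", "object_B_view1.png"]
print("predicted oddity index:", predict_oddity(trial))
```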

Cognitive theories offer competing interpretations of model failure

Two cognitive theories of visual development offer different interpretations of why models fail on 3D tasks:

Nativists argue that perceiving 3D structure requires built-in, domain-specific knowledge—innate priors that provide the constraints necessary for learning. Under this view, models fail because they lack the right inductive biases.

Empiricists argue that 3D perception emerges from general-purpose learning over natural sensory experience. Under this view, models fail because they learn from the wrong kind of data, which doesn't reflect natural visual experience.

There is over a century of empirical data related to these theories, but we lack computational methods to evaluate them.

What kind of sensory data do we learn from?

As infants, we generate structured, multi-sensory experience that is inherently sequential, unfolding over time. For example, retinal inputs, binocular depth, and self-motion signals might provide powerful self-supervision to guide perceptual learning. There is a rich history in the cognitive sciences characterizing the developmental stages associated with these different sensory signals, and in recent years, head-mounted cameras have made it possible to capture these sensory data in unprecedented detail. In short, if we hope to build models of human perception, we need to learn not just from different amounts of data, but from different types of data.

Vision (retinal)
Depth (stereo)
Self-motion (vestibular)

Developmental psychologists (e.g., Bria Long and Mike Frank) have developed powerful methods and datasets to understand the visual experiences of developing children. For illustrative purposes, here we visualize headcam data provided by Bria Long, alongside depth and camera motion signals that we have automatically extracted.
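As a concrete illustration of how such spatial signals can be recovered from raw headcam video, here is a minimal sketch of per-frame monocular depth estimation with an off-the-shelf model (MiDaS, loaded via torch.hub). The file name is a placeholder, and this is not necessarily the pipeline used for the visualization above.

```python
import cv2
import torch

# Off-the-shelf monocular depth estimator (MiDaS), loaded from torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large").eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform  # preprocessing matched to DPT_Large

# Load a single headcam frame (placeholder path) and convert BGR -> RGB.
frame = cv2.cvtColor(cv2.imread("headcam_frame.png"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    prediction = midas(transform(frame))          # (1, H', W') relative depth
    depth = torch.nn.functional.interpolate(      # resize back to the frame size
        prediction.unsqueeze(1),
        size=frame.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

print("per-pixel relative depth map:", depth.shape)
```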

A new class of models leverages these visual-spatial learning signals

A recent class of vision transformers learns from structured visual-spatial data. Concretely, given sets of images from different viewpoints, these models (e.g., DUSt3R, MASt3R, π3, and VGGT) learn to predict spatial information associated with these images, such as depth, camera pose, and geometric correspondence. This modeling strategy explicitly aims to remove hard-coded inductive biases and to let geometric understanding emerge from the predictive relationship between images and spatial information. In a sense, these models are the empiricist's ideal: they must learn the geometric structure of the environment given only visual-spatial data that are analogous to human sensory signals.

Input images
Predicted depth
Predicted cameras

Given a sequence of images (left), multi-view transformers like VGGT learn to predict associated spatial information, such as per-frame depth maps (center) and camera poses (right). These signals are analogous to sensory data available to humans.
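To make the input/output structure concrete, here is a schematic sketch of the multi-view prediction setup. `ToyMultiViewModel` is a hypothetical stand-in for models like VGGT or DUSt3R; it illustrates only the contract (a set of views in, per-view depth, camera pose, and confidence out), not their actual architectures or APIs.

```python
import torch

class ToyMultiViewModel(torch.nn.Module):
    """Hypothetical stand-in for a multi-view transformer (e.g., VGGT, DUSt3R).

    Real models are large ViT-style networks; here we only illustrate the
    input/output contract: a set of views in, per-view spatial predictions out.
    """
    def forward(self, views: torch.Tensor) -> dict[str, torch.Tensor]:
        n, _, h, w = views.shape
        return {
            "depth":      torch.rand(n, h, w),   # per-frame depth map
            "camera":     torch.rand(n, 4, 4),   # per-frame camera pose (extrinsics)
            "confidence": torch.rand(n, h, w),   # per-pixel prediction confidence
        }

# Training pairs image sets with spatial targets (depth, pose) derived from
# multi-view geometry, so no 3D priors are hard-coded into the architecture.
views = torch.rand(3, 3, 224, 224)               # three RGB views of one scene
outputs = ToyMultiViewModel()(views)
print({k: tuple(v.shape) for k, v in outputs.items()})
```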

We develop a zero-shot evaluation framework for 'multi-view' models

To evaluate this new class of vision transformers, we develop a series of zero-shot metrics. To estimate model performance on a given trial, we encode all pairwise combinations of images and extract the model’s internal confidence estimate (a measure of uncertainty used during training). We average across these pairs, select the model’s 'oddity' as the image with the lowest average pairwise confidence, and compare to ground truth. This gives us trial-level behavioral readouts with no fine-tuning and no task-specific training.

Evaluation protocol: encode image pairs, extract pairwise uncertainty, and identify the non-matching object as the image with the lowest average pairwise confidence

Estimating model performance. For each trial (left), we encode all image pairs (center left) and extract the model’s uncertainty (center right), a response the model already produces during training. We average over these pairs to determine the non-matching object (right). We use the confidence margin (Δ) to predict human error patterns, and an independent 'solution layer' analysis to predict human reaction time data.
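In code, this readout reduces to a small computation over pairwise confidences. Below is a minimal sketch: `pairwise_confidence` is a placeholder for running the model on one image pair and returning its scalar confidence, and the toy usage at the end substitutes a stand-in function for the real model.

```python
import itertools
import numpy as np

def predict_oddity_multiview(images: list[np.ndarray], pairwise_confidence) -> tuple[int, float]:
    """Zero-shot oddity readout from pairwise model confidences."""
    n = len(images)
    conf = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        c = pairwise_confidence(images[i], images[j])  # encode one image pair
        conf[i, j] = conf[j, i] = c

    # Average each image's confidence against all other images in the trial;
    # the image the model is least confident about matching is the oddity.
    mean_conf = conf.sum(axis=1) / (n - 1)
    oddity = int(mean_conf.argmin())

    # Confidence margin Δ: gap between the chosen oddity and the runner-up.
    # Smaller margins should correspond to harder trials for humans.
    margin = float(np.sort(mean_conf)[1] - mean_conf[oddity])
    return oddity, margin

# Toy usage with a stand-in confidence function (real code would query the model).
rng = np.random.default_rng(0)
toy_images = [rng.random((224, 224, 3)) for _ in range(3)]
toy_confidence = lambda a, b: float(np.corrcoef(a.ravel(), b.ravel())[0, 1])
print(predict_oddity_multiview(toy_images, toy_confidence))
```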

There is an emergent alignment between model and human perception

These multi-view transformers are the first vision models to match human performance on 3D shape inferences. Using the zero-shot evaluation approach outlined above, we find that VGGT matches human-level accuracy, while large vision models trained only on static images lag far behind (left). Given this correspondence, we develop a series of independent model readouts to probe the granularity of human-model alignment. We find that human error patterns are predicted by the confidence margin (center), and that human reaction time correlates with the amount of compute the model requires to arrive at the correct answer, evident in its 'solution layer' (right). Critically, this human-model correspondence emerges from multi-view learning alone, without training on any experimental behavior or images.

VGGT matches human accuracy, predicts error patterns, and correlates with reaction time

Left: VGGT matches human accuracy and dramatically outperforms standard vision models used in the cognitive sciences. Center: Model confidence margin predicts human error patterns. Right: Model solution layer predicts human RT. Hover to highlight.
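A minimal sketch of how this trial-level alignment can be quantified with rank correlations, assuming per-trial arrays of model confidence margins, human accuracies, model solution layers, and human reaction times (the random placeholders below stand in for the real MOCHI data):

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder per-trial arrays; in practice these come from the model readouts
# described above and the MOCHI human behavioral data.
rng = np.random.default_rng(0)
n_trials = 200
model_margin = rng.random(n_trials)                    # confidence margin Δ per trial
human_accuracy = rng.random(n_trials)                  # proportion of humans correct
model_solution_layer = rng.integers(0, 24, n_trials)   # layer where the answer emerges
human_rt = rng.random(n_trials)                        # mean human reaction time

# Error-pattern alignment: trials with smaller margins should be the trials
# humans get wrong more often.
rho_err, p_err = spearmanr(model_margin, human_accuracy)

# Temporal alignment: trials that take the model more layers to 'solve'
# should take humans longer to answer.
rho_rt, p_rt = spearmanr(model_solution_layer, human_rt)

print(f"margin vs. human accuracy: rho={rho_err:.2f} (p={p_err:.3g})")
print(f"solution layer vs. human RT: rho={rho_rt:.2f} (p={p_rt:.3g})")
```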

Models appear to use hierarchical correspondence to solve this task

These behavioral correspondences raise a mechanistic question: what kind of internal representation supports this human-like performance? To investigate, we visualize VGGT’s cross-image attention across layers. For each query point on image A, we show where the model attends on the matching object (A′) versus the non-matching object (B). In early layers, attention is diffuse and undifferentiated—the model has not yet distinguished the two objects. By intermediate layers, each query point on A elicits focused attention to the corresponding spatial location on A′—the same part of the object, seen from a different angle—while attention on B remains scattered. The model appears to represent object similarity not through abstract category-level features, but through concrete, part-level spatial correspondence that emerges progressively across the network’s depth.


Cross-image attention at layers 0 (early), 12, and 23 (late), shown for the source image (A), the matching object (A′), and the oddity (B).
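For readers who want to probe this themselves, here is a schematic sketch of the visualization: given an attention matrix over the concatenated tokens of the three images, read off how a chosen query patch in A attends to the patches of A′ and B. The patch grid, head count, and token layout below are illustrative assumptions, not VGGT's actual internals.

```python
import torch

grid = 14                                   # 14x14 patch grid per image (224px / 16px patches)
num_patches = grid * grid
num_heads = 12

# Cross-image attention over the concatenated tokens [A | A' | B] at one layer.
# (Random weights here; in practice these come from hooks on the model's attention blocks.)
attn = torch.rand(num_heads, 3 * num_patches, 3 * num_patches).softmax(dim=-1)

def attention_map(attn: torch.Tensor, query_idx: int, view: int) -> torch.Tensor:
    """Average attention from one query patch onto all patches of `view`
    (0 = A, 1 = A', 2 = B), reshaped into the patch grid for display."""
    start, end = view * num_patches, (view + 1) * num_patches
    weights = attn[:, query_idx, start:end].mean(dim=0)  # average over heads
    return weights.reshape(grid, grid)

query = 5 * grid + 7                                 # a query patch on image A (row 5, col 7)
map_on_match = attention_map(attn, query, view=1)    # ideally focused on the matching part of A'
map_on_oddity = attention_map(attn, query, view=2)   # typically diffuse over B
print(map_on_match.shape, map_on_oddity.shape)
```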

Bridging cognitive theory and machine learning

We find that neural networks trained on multi-view correspondence, with no exposure to human experimental data, predict the accuracy, error patterns, and temporal dynamics of human 3D shape inferences. Critically, our zero-shot evaluation approach rules out the possibility that this correspondence is an artifact of task-specific training or linear re-weighting. Rather, the design and optimization of this model lead to a natural alignment with human behavior. It is striking that the first models to match human performance align so closely with empiricist theories of perceptual learning.

These findings provide a computational bridge between cognitive theory and current practices in deep learning. The empiricist claim that perception emerges from general-purpose learning over structured sensory experience—rather than from innate, domain-specific knowledge—is, in many ways, a precursor to the prevailing deep learning paradigm. The emergent alignment between multi-view transformers and human perception suggests that this is not only an intellectual lineage, but an opportunity to formalize and evaluate longstanding theories of human perception.

Open questions and future directions

Our work raises exciting questions at the intersection of cognitive science, neuroscience, and computer science. What learning objectives and data distributions are essential for human-like 3D perception? How can we design models that better reflect the architectural and algorithmic constraints on human vision? Can we learn from infant-scale data? These are questions that extend well beyond 3D shape inferences. We're actively pursuing these ideas and are always looking for new collaborators interested in this multi-disciplinary approach. Feel free to reach out: bonnen@berkeley.edu