What Moves the Eyes: Doubling Mechanistic Model Performance Using Deep Networks to Discover and Test Cognitive Hypotheses
Abstract
Understanding how humans move their eyes to gather visual information is a central question in neuroscience, cognitive science, and vision research. While recent deep learning (DL) models achieve state-of-the-art performance in predicting human scanpaths, their underlying decision processes remain opaque. At the opposite end of the modeling spectrum, cognitively inspired mechanistic models aim to explain scanpath behavior through interpretable cognitive mechanisms but lag far behind in predictive accuracy. In this work, we bridge this gap by using a high-performing deep model, DeepGaze III, to discover and test mechanisms that improve a leading mechanistic model, SceneWalk. By identifying individual fixations where DeepGaze III succeeds and SceneWalk fails, we isolate behaviorally meaningful discrepancies and use them to motivate targeted extensions of the mechanistic framework. These include time-dependent temperature scaling, saccadic momentum, and an adaptive cardinal attention bias: simple, interpretable additions that substantially boost predictive performance. With these extensions, SceneWalk's explained variance on the MIT1003 dataset doubles from 35% to 70%, setting a new state of the art in mechanistic scanpath prediction. Our findings show how performance-optimized neural networks can serve as tools for cognitive model discovery, offering a new path toward interpretable and high-performing models of visual behavior.
TL;DR
A systematic fixation-level comparison of a performance-optimized DNN scanpath model and a mechanistic cognitive model reveals behaviorally relevant mechanisms that can be added to the mechanistic model to substantially improve performance.
Introduction
Science often faces a choice: build models primarily designed to predict, or models that compactly explain. But what if we used them in synergy?
Our paper tackles this head-on. We combine a deep network (DeepGaze III) with an interpretable mechanistic model (SceneWalk).
💡 Our idea: Use the deep model not just to chase performance, but as a tool for scientific discovery.
We isolate "controversial fixations": fixations where DeepGaze's likelihood vastly exceeds SceneWalk's. These reveal where the mechanistic model fails to capture predictable patterns.
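As a concrete illustration, the selection step can be as simple as ranking fixations by the difference in per-fixation log-likelihood under the two models. A minimal sketch in Python, assuming per-fixation log-likelihoods have already been evaluated and saved (the file names and the cutoff are hypothetical, not from the paper):

import numpy as np

# Hypothetical files: per-fixation log-likelihoods of the same scanpaths
# under each model, aligned index by index.
ll_deepgaze = np.load("ll_deepgaze3.npy")   # shape: (n_fixations,)
ll_scenewalk = np.load("ll_scenewalk.npy")  # shape: (n_fixations,)

# A fixation is "controversial" when DeepGaze III assigns it far more
# probability than SceneWalk does.
delta = ll_deepgaze - ll_scenewalk
most_controversial = np.argsort(delta)[::-1][:500]  # inspect these first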
From these systematic failures, we identified three critical mechanisms that SceneWalk was missing. The data pointed to known cognitive principles, but revealed critical new nuances. Our method showed us not just what was missing, but how to formulate it to match human behavior. 👇
New Mechanisms
Time-dependent temperature scaling: We found that DeepGaze shows higher confidence (lower entropy) early on (a) and predicts fixations at more salient locations (b), in agreement with the empirical data. SceneWalk does not show this effect. To address this, we introduced fixation-index-dependent temperature scaling (c), modeled with an exponential decay (d), which improves predictions for both early and late fixations (e).
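To make the idea concrete, here is a minimal sketch of such a scaling. The specific parameterization (a temperature that starts below 1, sharpening early predictions, and relaxes exponentially toward 1) and all parameter values are illustrative assumptions, not the fitted values from the paper:

import numpy as np

def temperature(fix_index, tau0=0.6, rate=0.3):
    # Starts at tau0 (< 1 sharpens the density, mirroring the higher
    # confidence on early fixations) and decays exponentially toward 1,
    # i.e. no rescaling late in the scanpath.
    return 1.0 - (1.0 - tau0) * np.exp(-rate * fix_index)

def apply_temperature(density, fix_index):
    # Raise a 2D fixation density to the power 1/T and renormalize so it
    # remains a proper probability distribution.
    scaled = density ** (1.0 / temperature(fix_index))
    return scaled / scaled.sum()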
Saccadic momentum: Controversial fixations suggested that DeepGaze sometimes prefers saccades that continue in the same direction. We confirmed such a saccadic momentum effect, especially after long saccades (a) and later in scanpaths (b). These effects remain even after controlling for the distribution of salient objects (dashed lines), indicating a genuine directional bias. We added a saccadic momentum and return mechanism to SceneWalk (c), modulated by previous saccade length (d) and fixation index (e), improving predictions for the "ongoing saccades" controversial fixations (f).
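One way to express such a mechanism is as a multiplicative directional weight map over candidate fixation locations. The sketch below uses a von Mises bump around the previous saccade direction plus a weaker bump in the opposite direction for return saccades; the functional form and all parameter values are our illustrative assumptions, not the paper's fitted model:

import numpy as np

def momentum_weights(shape, fix_xy, prev_angle, prev_len, fix_index,
                     kappa=2.0, len_scale=100.0, idx_scale=5.0, w_return=0.3):
    # Angle from the current fixation to every pixel in the image.
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    angle = np.arctan2(ys - fix_xy[1], xs - fix_xy[0])
    # Momentum: prefer continuing in the previous saccade direction;
    # return: a weaker preference for going straight back.
    momentum = np.exp(kappa * np.cos(angle - prev_angle))
    back = np.exp(kappa * np.cos(angle - prev_angle - np.pi))
    # Strength grows with previous saccade length and with fixation index,
    # as the controversial-fixation analysis suggests.
    strength = (1 - np.exp(-prev_len / len_scale)) * (1 - np.exp(-fix_index / idx_scale))
    weights = 1 + strength * (momentum + w_return * back)
    return weights / weights.mean()  # multiply into SceneWalk's priority map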
Horizontal and leftward attention bias: Controversial fixations suggested that DeepGaze predicts early saccades to go to the left. This is indeed the case in the data (a) and in the DeepGaze predictions (b). The effect persists when DeepGaze is run with uniform saliency maps (c), while SceneWalk (d) shows an effect that is too weak and increases over time. We therefore added a cardinal attention bias to SceneWalk that can additionally have a left asymmetry (e) and can adapt over time (f, g). This improves predictions on the relevant controversial fixations (f).
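A simple way to realize such a bias is a directional prior over saccade angles built from von Mises components at the cardinal directions, with an extra leftward component whose weight relaxes over the scanpath. Again, the exact form and all values below are illustrative assumptions:

import numpy as np

def cardinal_bias(angle, fix_index, kappa=1.5, w_horiz=1.0, w_vert=0.3,
                  left0=0.8, rate=0.4):
    # Horizontal (0 and pi) and vertical (+-pi/2) von Mises components.
    horiz = np.exp(kappa * np.cos(angle)) + np.exp(kappa * np.cos(angle - np.pi))
    vert = np.exp(kappa * np.cos(angle - np.pi / 2)) + np.exp(kappa * np.cos(angle + np.pi / 2))
    # Extra weight on the leftward direction (angle pi), strongest for
    # early fixations and decaying as the scanpath unfolds.
    left = np.exp(kappa * np.cos(angle - np.pi))
    prior = w_horiz * horiz + w_vert * vert + left0 * np.exp(-rate * fix_index) * left
    return prior / prior.sum()  # normalize over the discretized angles

# Example: evaluate the prior on a grid of saccade directions for fixation 1.
angles = np.linspace(-np.pi, np.pi, 360, endpoint=False)
prior = cardinal_bias(angles, fix_index=1)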
Results
These three mechanisms double SceneWalk's explained variance on the MIT1003 dataset (from 35% to 70%)! We closed over 56% of the gap to deep networks, setting a new state of the art for mechanistic scanpath prediction.
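For readers who want to check the gap figure: "fraction of the gap closed" is measured relative to the deep model's explained variance. A back-of-the-envelope sketch (the deep-model number below is implied by the stated figures, not quoted from the paper):

# gap_closed = (ev_new - ev_old) / (ev_deep - ev_old)
ev_old, ev_new, gap_closed = 0.35, 0.70, 0.56
# Solving for the deep model's explained variance implied by "over 56%":
ev_deep_implied = ev_old + (ev_new - ev_old) / gap_closed
print(f"implied deep-model explained variance: {ev_deep_implied:.2f}")  # ~0.97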
Conceptually: Deep neural networks should be viewed as scientific instruments. They tell us what is predictable in human behavior. We then use that information to ask why, building fully interpretable models that approach the performance of their black-box counterparts.
BibTeX
@inproceedings{dagostino2025what,
  title={What Moves the Eyes: Doubling Mechanistic Model Performance Using Deep Networks to Discover and Test Cognitive Hypotheses},
  author={D'Agostino, Federico and Schwetlick, Lisa and Bethge, Matthias and Kümmerer, Matthias},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}