From Coarse Attention to Fine-Grained Gaze: A Two-stage 3D Fully Convolutional Network for Predicting Eye Gaze in First Person Video

Zehua Zhang, Sven Bambach, David Crandall, Chen Yu
British Machine Vision Conference (BMVC) 2018
[download paper] [visit website]

Abstract: While predicting where people will look when viewing static scenes has been well-studied, a more challenging problem is to predict gaze within the first-person, egocentric field of view as people go about daily life. This problem is difficult because where a person looks depends not only on their visual surroundings, but also on the task they have in mind, their internal state, their past gaze patterns and actions, and non-visual cues (e.g., sounds) that may attract their attention. Using data from head-mounted cameras and eye trackers that record people's egocentric fields of view and gaze, we propose and learn a two-stage 3D fully convolutional network to predict gaze in each egocentric frame. The model estimates a coarse attention region in the first stage, then combines it with spatial and temporal features to predict a more precise gaze point in the second stage. We evaluate on a public dataset in which adults carry out specific tasks, as well as on a new, challenging dataset in which parents and toddlers freely interact with toys and each other, and show that our model outperforms state-of-the-art baselines.
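To make the coarse-to-fine idea concrete, here is a minimal NumPy sketch of the two-stage pipeline described above. It is an illustration, not the paper's network: simple temporal averaging and block pooling stand in for the stage-1 3D-convolutional features, and elementwise modulation of the last frame by the upsampled attention map stands in for stage 2's fusion of the coarse region with spatial cues. All function names and parameters (`coarse_attention`, `refine_gaze`, `pool`) are hypothetical.

```python
import numpy as np

def coarse_attention(clip, pool=4):
    """Stage 1 (toy version): produce a coarse attention map for a clip.

    clip: (T, H, W) array of grayscale frames. A temporal average followed
    by spatial block pooling stands in for learned 3D-conv features.
    """
    T, H, W = clip.shape
    frame = clip.mean(axis=0)                       # collapse time
    frame = frame[:H // pool * pool, :W // pool * pool]
    coarse = frame.reshape(H // pool, pool,
                           W // pool, pool).mean(axis=(1, 3))
    return coarse / coarse.sum()                    # normalize to a distribution

def refine_gaze(clip, coarse, pool=4):
    """Stage 2 (toy version): refine the coarse map into a gaze point.

    Upsamples the coarse attention map and uses it to modulate the last
    frame, mimicking the fusion of the attention region with spatial
    features; the argmax of the fused score is the predicted gaze point.
    """
    fine_prior = np.kron(coarse, np.ones((pool, pool)))
    H, W = fine_prior.shape
    score = clip[-1, :H, :W] * fine_prior
    y, x = np.unravel_index(np.argmax(score), score.shape)
    return x, y                                     # gaze point in pixel coords

# Example: a clip whose frames contain one bright region; the predicted
# gaze point should fall inside that region.
clip = np.zeros((5, 16, 16))
clip[:, 8:12, 4:8] = 1.0
attn = coarse_attention(clip)
gaze_x, gaze_y = refine_gaze(clip, attn)
```

In the actual model both stages are learned end to end with 3D convolutions over space and time; this sketch only shows the data flow of estimating a coarse region first and resolving a precise point second.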