| Applying artificial vision models to human scene understanding
E. M. Aminoff, M. Toneva, A. Shrivastava, X. Chen, I. Misra, A. Gupta, and M. J. Tarr
Frontiers in Computational Neuroscience, 2015
How do we understand the complex patterns of neural responses that underlie scene understanding? Studies of the network of brain regions held to be scene-selective—the parahippocampal/lingual region (PPA), the retrosplenial complex (RSC), and the occipital place area (TOS)—have typically focused on single visual dimensions (e.g., size), rather than the high-dimensional feature space in which scenes are likely to be neurally represented. Here we leverage well-specified artificial vision systems to explicate a more complex understanding of how scenes are encoded in this functional network. We correlated similarity matrices within three different scene-spaces arising from: (1) BOLD activity in scene-selective brain regions; (2) behavioral measured judgments of visually-perceived scene similarity; and (3) several different computer vision models. These correlations revealed: (1) models that relied on mid- and high-level scene attributes showed the highest correlations with the patterns of neural activity within the scene-selective network; (2) NEIL and SUN—the models that best accounted for the patterns obtained from PPA and TOS—were different from the GIST model that best accounted for the pattern obtained from RSC; (3) The best performing models outperformed behaviorally-measured judgments of scene similarity in accounting for neural data. One computer vision method—NEIL (“Never-Ending-Image-Learner”), which incorporates visual features learned as statistical regularities across web-scale numbers of scenes—showed significant correlations with neural activity in all three scene-selective regions and was one of the two models best able to account for variance in the PPA and TOS. We suggest that these results are a promising first step in explicating more fine-grained models of neural scene understanding, including developing a clearer picture of the division of labor among the components of the functional scene-selective brain network.
|Scene-Space Encoding within the Functional Scene-Selective Network
E. M. Aminoff, M. Toneva, A. Gupta, and M. J. Tarr
Journal of Vision, 2015
High-level visual neuroscience has often focused on how different visual categories are encoded in the brain. For example, we know how the brain responds when viewing scenes as compared to faces or other objects - three regions are consistently engaged: the parahippocampal/lingual region (PPA), the retrosplenial complex (RSC), and the occipital place area/transverse occipital sulcus (TOS). Here we explore the fine-grained responses of these three regions when viewing 100 different scenes. We asked: 1) Can neural signals differentiate the 100 exemplars? 2) Are the PPA, RSC, and TOS strongly activated by the same exemplars and, more generally, are the "scene-spaces" representing how scenes are encoded in these regions similar? In an fMRI study of 100 scenes we found that the scenes eliciting the greatest BOLD signal were largely the same across the PPA, RSC, and TOS. Remarkably, the orderings, from strongest to weakest, of scenes were highly correlated across all three regions (r = .82), but were only moderately correlated with non-scene selective brain regions (r = .30). The high similarity across scene-selective regions suggests that a reliable and distinguishable feature space encodes visual scenes. To better understand the potential feature space, we compared the neural scene-space to scene-spaces defined by either several different computer vision models or behavioral measures of scene similarity. Computer vision models that rely on more complex, mid- to high-level visual features best accounted for the pattern of BOLD signal in scene-selective regions and, interestingly, the better-performing models exceeded the performance of our behavioral measures. These results suggest a division of labor where the representations within the PPA and TOS focus on visual statistical regularities within scenes, whereas the representations within the RSC focus on a more high-level representation of scene category. Moreover, the data suggest the PPA mediates between the processing of the TOS and RSC.
|Towards a model for mid-level feature representation of scenes
M. Toneva, E. M. Aminoff, A. Gupta, and M. Tarr
Oral presentation. WIML NIPS workshop, 2014
Never Ending Image Learner (NEIL) is a semi-supervised learning algorithm that continuously pulls images from the web and learns relationships among them. NEIL has classified over 400,000 images into 917 scene categories using 84 dimensions - termed “attributes”. These attributes roughly correspond to mid-level visual features whose differential combinations define a large scene space. As such, NEIL’s small set of attributes offers a candidate model for the psychological and neural representation of scenes. To investigate this, we tested for significant similarities between the structure of scene space defined by NEIL and the structure of scene space defined by patterns of human BOLD responses as measured by fMRI. The specific scenes in our study were selected by reducing the number of attributes to the 39 that best accounted for variance in NEIL’s scene-attribute co-classification scores. Fifty scene categories were then selected such that each category scored highly on a different set of at most 3 of the 39 attributes. We then selected the two most representative images of the corresponding high-scoring attributes from each scene category, resulting in a total of 100 stimuli used. Canonical correlation analyses (CCA) was used to test the relationship between measured BOLD patterns within the functionally-defined parahippocampal region and NEIL’s representation of each stimulus as a vector containing stimulus-attribute co-classification scores on the 39 attributes. CCA revealed significant similarity between the local structures of the fMRI data and the NEIL representations for all participants. In contrast, neither the entire set of 84 attributes nor 39 randomly-chosen attributes produced significant results using this CCA method. Overall, our results indicate that subsets of the attributes learned by NEIL are effective in accounting for variation in the neural encoding of scenes – as such they represent a first pass compositional model of mid-level features for scene representation.
| An Exploration of Social Grouping: Effects of Behavioral Mimicry, Appearance, and Eye Gaze
A. Nawroj, M. Toneva, H. Admoni, and B. Scassellati
Oral presentation. In proceedings of the 36th Annual Conference of the Cognitive Science Society (Cogsci 2014)
People naturally and easily establish social groupings based on appearance, behavior, and other nonverbal signals. However, psychologists have yet to understand how these varied signals interact. For example, which factor has the strongest effect on establishing social groups? What happens when two of the factors conflict? Part of the difficulty of answering these questions is that people are unique and stochastic stimuli. To address this problem, we use robots as a visually simple and precisely controllable platform for examining the relative in- fluence of social grouping features. We examine how behavioral mimicry, similarity of appearance, and direction of gaze influence peoples’ perception of which group a robot belongs to. Experimental data shows that behavioral mimicry has the most dominant influence on social grouping, though this influence is modulated by appearance. Non-mutual gaze was found to be a weak modulator of the perception of grouping. These results provide insight into the phenomenon of social grouping, and suggest areas for future exploration.
| The Physical Presence of a Robot Tutor Increases Cognitive Learning Gains
D. Leyzberg, S. Spaulding, M. Toneva, and B. Scassellati
Poster. In Proceedings of the 34th Annual Conference of the Cognitive Science Society (Cogsci 2012)
We present the results of a 100 participant study on the role
of a robot's physical presence in a robot tutoring task. Participants
were asked to solve a set of puzzles while being provided
occasional gameplay advice by a robot tutor. Each participant
was assigned one of five conditions: (1) no advice,
(2) robot providing randomized advice, (3) voice of the robot
providing personalized advice, (4) video representation of the
robot providing personalized advice, or (5) physically-present
robot providing personalized advice. We assess the tutor's effectiveness
by the time it takes participants to complete the
puzzles. Participants in the robot providing personalized advice
group solved most puzzles faster on average and improved
their same-puzzle solving time significantly more than participants
in any other group. Our study is the first to assess the
effect of the physical presence of a robot in an automated tutoring
interaction. We conclude that physical embodiment can
produce measurable learning gains.
|Robot gaze does not reflexively cue human attention
H. Admoni, C. Bank, J. Tan, M. Toneva, and B. Scassellati
Poster. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society (Cogsci 2011)
Joint visual attention is a critical aspect of typical human interactions.
Psychophysics experiments indicate that people exhibit
strong reflexive attention shifts in the direction of another
person's gaze, but not in the direction of non-social cues such
as arrows. In this experiment, we ask whether robot gaze elicits
the same reflexive cueing effect as human gaze. We consider
two robots, Zeno and Keepon, to establish whether differences
in cueing depend on level of robot anthropomorphism. Using
psychophysics methods for measuring attention by analyzing
time to identification of a visual probe, we compare attention
shifts elicited by five directional stimuli: a photograph of
a human face, a line drawing of a human face, Zeno's gaze,
Keepon's gaze and an arrow. Results indicate that all stimuli
convey directional information, but that robots fail to elicit attentional
cueing effects that are evoked by non-robot stimuli,
regardless of robot anthropomorphism.