Vision and Autonomous Systems Seminar

  • Gates Hillman Centers
  • Traffic21 Classroom 6501
  • Research Scientist
  • School of Interactive Computing
  • Georgia Institute of Technology

Towards Goal-Driven Visually Grounded Dialog Agents

Communication between human users and artificial intelligences is essential for human-AI cooperative tasks. For these collaborations to extend into real environments, artificial agents must be able to perceive their surroundings (visually, aurally, tactilely, etc.) and to communicate with humans about them in order to accomplish shared goals. For example, a user might talk with an expert agent to learn about some entity in the world (e.g. User: "What kind of bird is that?" AI: "It is a blue jay." User: "How can you tell?" AI: "Its blue crown and wings give it away!") or to sift through large datasets (e.g. User: "Has anyone entered this hallway in the last month?" AI: "Yes, 127 instances are logged on camera." User: "Were any of them carrying a black bag?"). In this talk, I will focus on a line of recent work developing agents that engage in visually grounded, question-and-answer-based dialogs like those in the examples above -- a task we call Visual Dialog. First, I will provide an overview of the Visual Dialog task and the data collection effort culminating in the VisDial dataset of over 1.2 million rounds of dialog. I will then describe a number of deep agents trained for this task and highlight some challenges faced by these supervised models. Next, I will discuss follow-up work in which we address many of these challenges by modeling Visual Dialog as a cooperative game between agents in a reinforcement learning setting -- learning dialog agent policies end-to-end, from pixels through multi-agent, multi-round dialog to game reward. Finally, I'll discuss EmbodiedQA, a recent effort to move beyond static images and ground similar agents in full environments.

Please visit  or  for more information and to interact with a live demo of one of our Visual Dialog agents.

Stefan Lee is a Research Scientist in the School of Interactive Computing at Georgia Tech, collaborating closely with Dhruv Batra and Devi Parikh on problems at the intersection of computer vision and natural language processing. He received his PhD from Indiana University in 2016 under David Crandall and was awarded the Bradley Postdoctoral Fellowship at Virginia Tech. His work has been presented at NIPS, ICCV, CVPR, EMNLP (including a best paper award), and ICCP, and he has held visiting research positions at Virginia Tech, INRIA Willow, and UC Berkeley.