Artifical Intelligence Seminar

  • Gates Hillman Centers
  • ASA Conference Room 6115
  • Ph.D. Student
  • Machine Learning Department
  • Carnegie Mellon University

Embodied Multimodal Multitask Learning

Recent efforts on training visual navigation agents conditioned on language using deep reinforcement learning have been successful in learning policies for two multimodal tasks: learning to follow navigational instructions and embodied question answering. We aim to learn a multitask model capable of jointly learning both tasks, and transferring knowledge of words and their grounding in visual objects across tasks. The proposed model uses a novel Dual-Attention unit to disentangle the knowledge of words in the textual representations and visual objects in the visual representations, and align them with each other. This disentangled task-invariant alignment of representations facilitates grounding and knowledge transfer across both tasks. We show that the proposed model outperforms a range of baselines on both tasks in simulated 3D environments. We also show that this disentanglement of representations makes our model modular, interpretable, and allows for zero-shot transfer to instructions containing new words by leveraging object detectors.

The AI Seminar is generously sponsored by Apple.

For More Information, Please Contact: