CMU Helps Compile Largest Collection of First-Person Videos

Egocentric Data Valuable for Training Computer Vision Models

By Aaron Aupperlee

Senior Director of Media Relations, School of Computer Science
aaupperlee@cmu.edu
412-268-9068

CMU researchers helped compile and will have access to the largest collection of point-of-view videos in the world — like those of this arborist doing his job — as part of Ego4D. The collection of footage will help AI understand the world from a first-person perspective.

Researchers at Carnegie Mellon University helped compile and will have access to the largest collection of point-of-view videos in the world. These videos could enable artificial intelligence to understand the world from a first-person point of view and unlock a new wave of virtual assistants, augmented reality and robotics.

Until now, most of the video used to train computer vision models came from the third-person point of view. The first-person, or egocentric, video included in this collection will allow researchers to train computer vision systems to see the world as humans do.

"For the first time, we'll have enough data to be able to teach computers to see what we see," said Kris Kitani, an associate research professor in the Robotics Institute who led CMU's efforts to collect data. "The applications of this data are endless, from teaching robots how your hands manipulate objects to enabling a virtual assistant to help you find your keys."

The egocentric video was collected by a consortium of more than a dozen universities and academic institutions worldwide, brought together by Facebook AI. Participants wore head-mounted cameras and were trained on how to use them. About 700 participants in nine countries collected more than 2,200 hours of video. The project, known as Ego4D, gathered about 20 times as many hours of footage as were previously accessible to researchers.

Kitani and his group contributed more than 500 hours of footage to Ego4D. The CMU researchers gave cameras to landscapers, woodworkers, mechanics, artists, dog walkers, painters, contractors and other workers. Participants near CMU's campus in Rwanda wore cameras to record cleaning, cooking, washing dishes, gardening and other tasks.

"They just recorded their day at work," Kitani said.

Both humans and algorithms reviewed the videos and scrubbed identifying images, such as faces and license plates. The universities and institutions then uploaded the videos to a server hosted by the Common Visual Data Foundation, a nonprofit established to enable open, community-driven research in computer vision.

The videos will be a rich data source for Kitani's work with computer vision, robotics and object manipulation. Kitani wants to train computer vision systems on what objects look like before and after a task — like a piece of wood before and after it is cut or a shirt before and after it is folded. Robots can then devise their own methods for cutting wood or folding a shirt.

"It doesn't matter how a robot cuts wood. What is important is that the wood is cut properly," Kitani said. "If we can understand how an object has changed, then we can train a robot to take an object from one state to another."

Read more about Ego4D in this blog post from Facebook AI.