CMU Multi-Modal Activity Dataset Annotations

For questions about the annotation process, please contact Ekaterina Taralova at etaralova AT


Please note: the frame offsets differ between the videos currently provided on the main website and an older release. The offsets in the first table below apply to the videos currently available on the website; if you used the older version of the videos, those offsets are still reported further down this page. All annotations were performed using the wearable (first-person) camera, so the frame offsets are with respect to the wearable-camera frames.


Some of the videos are missing from the main dataset webpage. I only have access to half-resolution versions of these, so I am posting links to them in the following table, and I have contacted the main team to see if we can find the originals. Updates will be posted here; email me at etaralova @ if you would like to be notified.

Subject ID | Annotation files | Starting frame* | Ending frame**
S06 | S06_Brownie.avi (half-resolution), Annotations (zip) | no offset, if using the half-resolution file | -
S07 | Annotations (zip) | 508 | 10309
S08 | Annotations (zip) | 300 | 9000
S09 | Annotations (zip) | 226 | 13334
S10 | S10_Brownie.avi (half-resolution), Annotations (zip) | no offset, if using the half-resolution file | -
S12 | Annotations (zip) | 400 (updated 11/03) | 15233
S13 | Annotations (zip) | 290 | 20151
S14 | Annotations (zip) | 386 | 11705
S16 | Annotations (zip) | 168 | 12338
S17 | Annotations (zip) | 236 | 11518
S18 | Annotations (zip) | 316 | 12088
S19 | Annotations (zip) | 354 | 14970
S20 | Annotations (zip) | 212 | 10576
S22 | Annotations (zip) | 262 | 17315
S23 | S23_Brownie.avi (half-resolution), Annotations (zip) | no offset, if using the half-resolution file | -
S24 | Annotations (zip) | 360 | 12391

About the annotation process

These annotations were made by watching the first-person (wearable camera) videos. The annotators chose labels from a predefined list, where each label consists of four optional fields: verb, object1, preposition, object2. The annotations provided here are from a single annotator; annotations from two additional annotators are forthcoming.

A snapshot of the annotation tool can be found here (in collaboration with Moritz Tenorth, TUM). A new annotation tool for Mechanical Turk is being developed by Alex Sorokin, UIUC/CMU (in collaboration with our lab and Moritz Tenorth, TUM). More information will be available soon.

About the data files

In each zip provided, the "labels.dat" file contains three columns: the starting frame of the action, the ending frame of the action, and the action label in the format "verb-object1-preposition-object2". The file "unique_labels.dat" contains a single column with one class ID per frame (the video was recorded at 30 fps); each ID corresponds to one of the action classes across all annotated subjects.
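As an illustration, a minimal parser for the three-column "labels.dat" layout described above might look like the following sketch. The sample label strings below are hypothetical, not taken from the actual release, and the whitespace-separated layout is an assumption based on the description.

```python
def parse_labels(lines):
    """Parse labels.dat-style lines into (start_frame, end_frame, fields) tuples.

    Assumes each non-empty line has three whitespace-separated columns:
    starting frame, ending frame, and a dash-separated action label
    (verb-object1-preposition-object2, with trailing fields optional).
    """
    actions = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        start, end, label = line.split()
        # The label packs up to four optional fields separated by dashes.
        fields = label.split("-")
        actions.append((int(start), int(end), fields))
    return actions


# Hypothetical example rows (not real annotations from the dataset):
sample = ["100 250 take-egg", "251 400 crack-egg-into-bowl"]
for start, end, fields in parse_labels(sample):
    print(start, end, fields)
```

A two-field label like "take-egg" would parse into a verb and object1 only, while a four-field label also carries the preposition and object2.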

About synchronization with sensors

The annotations start from the "starting frame" specified in the tables on this page, which is the point in time when the subject turns the light used for synchronization on and off. Thus, the first row/frame in the annotation files corresponds to the value of the "starting frame."
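Concretely, mapping a row index in a per-frame annotation file back to a frame number in the original first-person video is a simple offset addition. This is a sketch under the assumption that row 0 of the annotation file corresponds exactly to the "starting frame" from the table; the subject/offset used below (S07, starting frame 508) is taken from the first table on this page.

```python
def annotation_row_to_video_frame(row_index, starting_frame):
    """Map a 0-based row index in a per-frame annotation file to a
    frame number in the original wearable-camera video.

    Assumes row 0 of the file corresponds to the table's starting frame.
    """
    return starting_frame + row_index


# Example using S07's starting frame (508) from the table above:
print(annotation_row_to_video_frame(0, 508))    # first annotated frame -> 508
print(annotation_row_to_video_frame(100, 508))  # 101st annotated frame -> 608
```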

About the dataset

The first-person videos and other sensors can be downloaded from

The following offsets apply to the older version of the videos (see the note at the top of this page):

Subject ID | Annotation files | Starting frame* | Ending frame**
S06 | Annotations (zip) | 1192 | 12010
S07 | Annotations (zip) | 1936 | 11737
S08 | Annotations (zip) | 1232 | 9932
S09 | Annotations (zip) | 1877 | 14985
S10 | Annotations (zip) | 1001 | 14060
S12 | Annotations (zip) | 1707 | 16540
S13 | Annotations (zip) | 919 | 20780
S14 | Annotations (zip) | 1910 | 13229
S16 | Annotations (zip) | 1596 | 13766
S17 | Annotations (zip) | 1464 | 12746
S18 | Annotations (zip) | 1198 | 12970
S19 | Annotations (zip) | 1200 | 15816
S20 | Annotations (zip) | 445 | 10809
S22 | Annotations (zip) | 1180 | 18233
S23 | Annotations (zip) | 1186 | 13964
S24 | Annotations (zip) | 841 | 12872

* The "starting frame" is relative to the first frame of the first-person video, when the video is decomposed into single frames (30 fps). This corresponds to the frame when the subject turns the light switch used for synchronization on and off (i.e., the initial setup and calibration frames, which contain no actions, are skipped).

** The "ending frame" is the last frame for which annotations are available. This corresponds to the last action that the subject performs (i.e., the frames where the subject walks back to the middle of the room are skipped, as they don't contain recipe-related actions).
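Given the starting and ending frames, the annotated duration of a recording follows from the 30 fps frame rate. A rough sketch, assuming the frame range is inclusive on both ends:

```python
FPS = 30  # the first-person video was recorded at 30 frames per second

def annotated_seconds(starting_frame, ending_frame):
    """Return the annotated span in seconds, assuming an inclusive
    [starting_frame, ending_frame] range at 30 fps."""
    return (ending_frame - starting_frame + 1) / FPS


# Example using S06's offsets (1192 to 12010) from the table above:
print(round(annotated_seconds(1192, 12010), 1))  # -> 360.6 seconds
```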

For more information, see (note: I am now publishing under Ekaterina H. Taralova):

Temporal Segmentation and Activity Classification from First-person Sensing.
Ekaterina H. Spriggs, Fernando De la Torre Frade, and Martial Hebert.
IEEE Workshop on Egocentric Vision, CVPR 2009, June 2009. Abstract. Download paper (PDF).