With the widespread availability of low-cost devices capable of photo shooting and high-volume video recording, we are facing explosion of both image and video data. The sheer volume of such visual data poses both challenges and opportunities in machine learning and computer vision research. In image classification, most of previous research has focused on small to medium-scale data sets, containing objects from dozens of categories. However, we could easily access images spreading thousands of categories. Unfortunately, despite the well-known advantages and recent advancements of multi-class classification techniques in machine learning, complexity concerns have driven most research on such super large-scale data set back to simple methods such as nearest neighbor search, one-vs-one or one-vs-rest approach. However, facing image classification problem with such huge task space, it is no surprise that these classical algorithms, often favored for their simplicity, will be brought to their knees not only because of the training time and storage cost they incur, but also because of the conceptual awkwardness of such algorithms in massive multi-class paradigms. Therefore, it is our goal to directly address the bigness of image data, not only the large number of training images and high-dimensional image features, but also the large task space. Specifically, we present algorithms capable of efficiently and effectively training classifiers that could differentiate tens of thousands of image classes.
Similar to images, one of the major difficulties in video analysis is also the huge amount of data, in the sense that videos could be hours long or even endless. However, it is often true that only a small portion of video contains important information. Consequently, algorithms that could automatically detect unusual events within streaming or archival video would significantly improve the efficiency of video analysis and save valuable human attention for only the most salient contents. Moreover, given lengthy recorded videos, such as those captured by digital cameras on mobile phones, or surveillance cameras, most users do not have the time or energy to edit the video such that only the most salient and interesting part of the original video is kept. To this end, we also develop algorithm for automatic video summarization. Finally, we propose to study supervised video summarization, where user generated summary videos are provided and used as source of supervised information. For example, we could manually summarize a subset of videos for certain types of events, such as “birthday party videos”, “wedding videos”, etc. For videos capturing similar event, those manually summarized videos would then be used as supervision or side information to how summarization should be constructed for this type of videos. Therefore, the goal of this thesis can be summarized as follows: We aim to design machine learning algorithms to automatically analyze and understand large-scale image and video data. Specifically, we design algorithms to address the bigness in image categorization – not only in the form of large number of data points and/or high-dimensional features, but also the large task space. We aim to scale our algorithm to image collections with tens of thousands of classes. We also propose algorithm to address the bigness of video stream, with hours in temporal length, or even endless, and automatically distill such videos to identify interesting events and summarize its contents.
Eric Xing (Chair)
Kristen Grauman (Department of Computer Science, UT Austin)
diane [atsymbol] cs ~replace-with-a-dot~ cmu ~replace-with-a-dot~ edu