Joint Summarization of Large Sets of Web Images and Videos
We are working on a journal version and will post the code after the journal submission.
Motivation of Research
The objective of this research is to jointly summarize large sets of online Flickr images and YouTube user videos. Because the two media have different yet complementary characteristics, using both is mutually rewarding (i.e., each helps summarize the other).
Consider first why collections of images help video summarization, using the example in Fig.1.(a). One major issue with videos is that they often contain redundant and noisy content, such as backlit subjects, motion blur, overexposure, and frames filled with trivial backgrounds like sky or water. Photos, in contrast, are usually taken more carefully, so they capture subjects from canonical viewpoints in a more semantically meaningful way. Therefore, by using simple similarity votes from crowds of fly fishing images, we can remove such noisy, redundant, or semantically meaningless parts of videos.
In the reverse direction, collections of videos help story-based image summarization (see Fig.1.(b)), illustrated there with an example of a Flickr photo stream. One issue with still images is that they are recorded in fragments, so the sequential structure is often missing even between consecutive images in a single photo stream. Videos, however, are motion pictures that convey temporal smoothness between frames. Therefore, we leverage sets of videos to discover the underlying sequential structure as a coherent storyline thread.
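The idea of similarity votes can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the cosine similarity measure, and the two thresholds are all assumptions made for illustration, and features are plain lists of floats. Each video frame collects one vote from every image that is similar enough to it, and frames with too few votes are dropped.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def vote_frames(frame_feats, image_feats, sim_threshold=0.8, min_votes=2):
    """Keep frames that enough images 'vote' for.

    sim_threshold and min_votes are hypothetical parameters, not values
    taken from the paper.
    Returns a list of (frame_index, vote_count) for the kept frames.
    """
    kept = []
    for idx, frame in enumerate(frame_feats):
        votes = sum(1 for img in image_feats
                    if cosine(frame, img) >= sim_threshold)
        if votes >= min_votes:
            kept.append((idx, votes))
    return kept
```

With toy 2-D features, a frame that resembles most of the images survives, while a frame showing something the images rarely capture (for example, mostly sky) receives few votes and is filtered out.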
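To make the notion of sequential structure concrete, here is a simplified sketch under strong assumptions. It assumes that video frames have already been grouped into labeled clusters (the labels below are hypothetical), counts how often one cluster directly follows another across videos, and keeps only the strongest outgoing edges per node. This count-based sparsification is a stand-in for illustration only; it is not the sparse time-varying graph inference used in the paper.

```python
from collections import Counter, defaultdict


def sparse_transition_graph(sequences, top_k=2):
    """Build a sparse directed graph from ordered label sequences.

    sequences: list of lists of cluster labels, each in temporal order.
    top_k: hypothetical sparsity parameter, the number of outgoing
           edges kept per node.
    Returns {source: {target: transition_probability}}, where each
    probability is normalized over all observed transitions from source.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            if a != b:  # ignore self-transitions within one cluster
                counts[a][b] += 1

    graph = {}
    for a, outgoing in counts.items():
        total = sum(outgoing.values())
        # Sort by descending count, breaking ties by label for determinism.
        top = sorted(outgoing.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        graph[a] = {b: c / total for b, c in top}
    return graph
```

The resulting edges could then suggest a plausible ordering for images from photo streams that fall into the same clusters.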
Method and Experiments
Video summarization is achieved by diversity ranking on the similarity graphs between images and video frames. The storyline graphs are created by inferring sparse time-varying directed graphs from a set of photo streams with the assistance of videos.
For evaluation, we collect datasets of 20 outdoor activities, consisting of 2.7 million Flickr images and 16 thousand YouTube videos. We evaluate our algorithm via crowdsourcing on Amazon Mechanical Turk. Our experiments demonstrate that the proposed joint summarization approach outperforms other baselines as well as variants of our own method that use videos or images only.
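As a rough illustration of diversity ranking on a similarity graph, the sketch below uses a greedy coverage heuristic. This is a swapped-in technique chosen for brevity, not necessarily the ranking algorithm used in this work, and the function name and the dense similarity-matrix input are assumptions. At each step it picks the node that most increases how well the selected set covers all nodes, so near-duplicates of an already chosen node gain little.

```python
def greedy_diverse_rank(sim, k):
    """Select up to k diverse, representative nodes from a similarity matrix.

    sim: square list of lists, sim[i][j] = similarity of node i to node j
         (nodes may be images and video frames alike).
    Returns the indices of the selected nodes in selection order.
    """
    n = len(sim)
    covered = [0.0] * n   # best similarity of each node to the selected set
    selected = []
    for _ in range(min(k, n)):
        best, best_gain = None, -1.0
        for c in range(n):
            if c in selected:
                continue
            # Marginal gain: how much candidate c improves coverage overall.
            gain = sum(max(sim[i][c] - covered[i], 0.0) for i in range(n))
            if gain > best_gain:
                best, best_gain = c, gain
        selected.append(best)
        covered = [max(covered[i], sim[i][best]) for i in range(n)]
    return selected
```

In a toy graph with three mutually similar nodes and one outlier, the first pick comes from the dense group and the second pick is the outlier, because the remaining group members add almost no new coverage.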