Highlight detection from videos has been widely studied due to the fast growth of video content. However, most existing approaches to highlight detection, whether based on handcrafted features or deep learning, rely heavily on human-curated training data, which is very expensive to obtain and thus hinders scalability to large datasets and unlabeled video categories. We observe that widely available Web images can serve as weak supervision for highlight detection. For example, the top-ranked images returned by a search engine for the query 'skiing' may contain a considerable number of positive samples of 'skiing' highlights. Motivated by this observation, we propose a novel triplet deep ranking approach to video highlight detection using Web images as weak supervision. The approach handles the relative preference of highlight scores between highlight frames, non-highlight frames, and Web images through triplet ranking constraints. Our approach iteratively trains two interdependent deep models (i.e., a triplet highlight model and a pairwise noise model) to deal with the noisy Web images in a single framework. We train the two models with relative preferences so that they generalize regardless of the categories of the training data. Therefore, our approach is fully category independent and exploits weakly supervised Web images. We evaluate our approach on two challenging datasets and achieve impressive results compared with the state-of-the-art pairwise ranking support vector machines, a robust recurrent autoencoder, and spatial deep convolutional neural networks. We also empirically verify through cross-dataset evaluation that our category-independent model is fairly generalizable even if two different datasets do not share exactly the same categories.
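As a rough illustration of what a triplet ranking constraint over highlight scores might look like, the following minimal sketch uses a margin-based hinge penalty. It is not the formulation from the paper: the function names, the margin value, and in particular the assumed ordering (a highlight frame and a Web image should each outscore a non-highlight frame) are hypothetical choices made only for this example.

```python
def hinge_rank(higher, lower, margin=1.0):
    """Penalty that is zero once `higher` exceeds `lower` by at least `margin`."""
    return max(0.0, margin - (higher - lower))


def triplet_ranking_loss(s_highlight, s_web, s_non_highlight, margin=1.0):
    """Sum of two hinge terms over one triplet of scores.

    Assumed ordering (an illustration, not taken from the paper):
    the highlight frame and the Web image should each outscore
    the non-highlight frame by at least `margin`.
    """
    return (hinge_rank(s_highlight, s_non_highlight, margin)
            + hinge_rank(s_web, s_non_highlight, margin))


# Scores here are made-up numbers standing in for model outputs.
satisfied = triplet_ranking_loss(2.0, 1.5, 0.0)  # both margins are met
violated = triplet_ranking_loss(0.2, 0.1, 0.5)   # both orderings are violated
```

Because the penalty depends only on score differences within a triplet, a loss of this shape does not refer to any category label, which is one way to read the abstract's claim of category independence.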
Bibliographical note
Funding Information:
Manuscript received March 18, 2017; revised August 23, 2017; accepted January 31, 2018. Date of publication February 14, 2018; date of current version August 14, 2018. This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2016R1A2B4009246) and in part by the MSIP (The Ministry of Science, ICT and Future Planning), Korea and Microsoft Research, under ICT/SW Creative research program supervised by the IITP (Institute for Information & Communications Technology Promotion) (IITP-2015-R2212-15-0015). Part of this work was performed when Hoseong Kim was a research intern at Microsoft Research Asia. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Rita Cucchiara. (Corresponding author: Tao Mei.) H. Kim and H. Byun are with the Department of Computer Science, Yonsei University, Seoul 03722, South Korea (e-mail: firstname.lastname@example.org; email@example.com).
© 1999-2012 IEEE.
All Science Journal Classification (ASJC) codes
- Signal Processing
- Media Technology
- Computer Science Applications
- Electrical and Electronic Engineering