Exploiting web images for video highlight detection with triplet deep ranking

Hoseong Kim, Tao Mei, Hyeran Byun, Ting Yao

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Highlight detection from videos has been widely studied due to the fast growth of video contents. However, most existing approaches to highlight detection, either handcraft feature based or deep learning based, heavily rely on human-curated training data, which is very expensive to obtain and, thus, hinders the scalability to large datasets and unlabeled video categories. We observe that the largely available Web images can be applied as a weak supervision for highlight detection. For example, the top-ranked images in reference to the query 'skiing' returned by a search engine may contain considerable positive samples of 'skiing' highlights. Motivated by this observation, we propose a novel triplet deep ranking approach to video highlight detection using Web images as a weak supervision. The approach handles the relative preference of highlight scores between highlighting frames, nonhighlighting frames, and Web images by the triplet ranking constraints. Our approach can iteratively train two interdependent deep models (i.e., a triplet highlight model and a pairwise noise model) to deal with the noisy Web images in a single framework. We train the two models with relative preferences to generalize the capability regardless of the categories of training data. Therefore, our approach is fully category independent and exploits weakly supervised Web images. We evaluate our approach on two challenging datasets and achieve impressive results compared with the state-of-the-art pairwise ranking support vector machines, a robust recurrent autoencoder, and spatial deep convolution neural networks. We also empirically verify through cross-dataset evaluation that our category-independent model is fairly generalizable even if two different datasets do not share exactly the same categories.
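The triplet ranking constraint described in the abstract can be illustrated as a pair of hinge losses over scalar highlight scores. This is a minimal sketch: the relative ordering of Web images between highlighting and non-highlighting frames, the margin value, and the function names are assumptions for illustration, not the paper's exact formulation.

```python
def triplet_ranking_loss(s_highlight, s_web, s_nonhighlight, margin=1.0):
    """Hinge-style triplet ranking loss over scalar highlight scores.

    Assumed ordering (illustrative only): a highlighting frame should score
    above a Web image, which in turn should score above a non-highlighting
    frame. The margin and ordering are not taken from the paper.
    """
    # Penalize when the highlight frame fails to beat the Web image by the margin.
    loss_hw = max(0.0, margin - (s_highlight - s_web))
    # Penalize when the Web image fails to beat the non-highlight frame by the margin.
    loss_wn = max(0.0, margin - (s_web - s_nonhighlight))
    return loss_hw + loss_wn


# Well-separated scores incur zero loss; ties incur the full double margin.
print(triplet_ranking_loss(3.0, 2.0, 1.0))  # 0.0
print(triplet_ranking_loss(0.0, 0.0, 0.0))  # 2.0
```

In the paper's framework this constraint would be optimized jointly with a pairwise noise model to down-weight mislabeled Web images; that interplay is beyond this sketch.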

Original language: English
Article number: 8291744
Pages (from-to): 2415-2426
Number of pages: 12
Journal: IEEE Transactions on Multimedia
Volume: 20
Issue number: 9
DOI: 10.1109/TMM.2018.2806224
Publication status: Published - 2018 Sep 1

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Media Technology
  • Computer Science Applications
  • Electrical and Electronic Engineering

Cite this

Kim, Hoseong; Mei, Tao; Byun, Hyeran; Yao, Ting. / Exploiting web images for video highlight detection with triplet deep ranking. In: IEEE Transactions on Multimedia. 2018; Vol. 20, No. 9. pp. 2415-2426.
@article{1cbb7a1181e7419a814dce36ca944f8e,
title = "Exploiting web images for video highlight detection with triplet deep ranking",
abstract = "Highlight detection from videos has been widely studied due to the fast growth of video contents. However, most existing approaches to highlight detection, either handcraft feature based or deep learning based, heavily rely on human-curated training data, which is very expensive to obtain and, thus, hinders the scalability to large datasets and unlabeled video categories. We observe that the largely available Web images can be applied as a weak supervision for highlight detection. For example, the top-ranked images in reference to the query 'skiing' returned by a search engine may contain considerable positive samples of 'skiing' highlights. Motivated by this observation, we propose a novel triplet deep ranking approach to video highlight detection using Web images as a weak supervision. The approach handles the relative preference of highlight scores between highlighting frames, nonhighlighting frames, and Web images by the triplet ranking constraints. Our approach can iteratively train two interdependent deep models (i.e., a triplet highlight model and a pairwise noise model) to deal with the noisy Web images in a single framework. We train the two models with relative preferences to generalize the capability regardless of the categories of training data. Therefore, our approach is fully category independent and exploits weakly supervised Web images. We evaluate our approach on two challenging datasets and achieve impressive results compared with the state-of-the-art pairwise ranking support vector machines, a robust recurrent autoencoder, and spatial deep convolution neural networks. We also empirically verify through cross-dataset evaluation that our category-independent model is fairly generalizable even if two different datasets do not share exactly the same categories.",
author = "Hoseong Kim and Tao Mei and Hyeran Byun and Ting Yao",
year = "2018",
month = "9",
day = "1",
doi = "10.1109/TMM.2018.2806224",
language = "English",
volume = "20",
pages = "2415--2426",
journal = "IEEE Transactions on Multimedia",
issn = "1520-9210",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "9",
}


TY - JOUR

T1 - Exploiting web images for video highlight detection with triplet deep ranking

AU - Kim, Hoseong

AU - Mei, Tao

AU - Byun, Hyeran

AU - Yao, Ting

PY - 2018/9/1

Y1 - 2018/9/1

AB - Highlight detection from videos has been widely studied due to the fast growth of video contents. However, most existing approaches to highlight detection, either handcraft feature based or deep learning based, heavily rely on human-curated training data, which is very expensive to obtain and, thus, hinders the scalability to large datasets and unlabeled video categories. We observe that the largely available Web images can be applied as a weak supervision for highlight detection. For example, the top-ranked images in reference to the query 'skiing' returned by a search engine may contain considerable positive samples of 'skiing' highlights. Motivated by this observation, we propose a novel triplet deep ranking approach to video highlight detection using Web images as a weak supervision. The approach handles the relative preference of highlight scores between highlighting frames, nonhighlighting frames, and Web images by the triplet ranking constraints. Our approach can iteratively train two interdependent deep models (i.e., a triplet highlight model and a pairwise noise model) to deal with the noisy Web images in a single framework. We train the two models with relative preferences to generalize the capability regardless of the categories of training data. Therefore, our approach is fully category independent and exploits weakly supervised Web images. We evaluate our approach on two challenging datasets and achieve impressive results compared with the state-of-the-art pairwise ranking support vector machines, a robust recurrent autoencoder, and spatial deep convolution neural networks. We also empirically verify through cross-dataset evaluation that our category-independent model is fairly generalizable even if two different datasets do not share exactly the same categories.

UR - http://www.scopus.com/inward/record.url?scp=85042121805&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85042121805&partnerID=8YFLogxK

U2 - 10.1109/TMM.2018.2806224

DO - 10.1109/TMM.2018.2806224

M3 - Article

AN - SCOPUS:85042121805

VL - 20

SP - 2415

EP - 2426

JO - IEEE Transactions on Multimedia

JF - IEEE Transactions on Multimedia

SN - 1520-9210

IS - 9

M1 - 8291744

ER -