Tracking Persons-of-Interest via Unsupervised Representation Adaptation

Shun Zhang, Jia-Bin Huang, Jongwoo Lim, Yihong Gong, Jinjun Wang, Narendra Ahuja, Ming-Hsuan Yang

Research output: Contribution to journal › Article

Abstract

Multi-face tracking in unconstrained videos is a challenging problem as faces of one person often can appear drastically different in multiple shots due to significant variations in scale, pose, expression, illumination, and make-up. Existing multi-target tracking methods often use low-level features which are not sufficiently discriminative for identifying faces with such large appearance variations. In this paper, we tackle this problem by learning discriminative, video-specific face representations using convolutional neural networks (CNNs). Unlike existing CNN-based approaches which are only trained on large-scale face image datasets offline, we automatically generate a large number of training samples using the contextual constraints for a given video, and further adapt the pre-trained face CNN to the characters in the specific videos using discovered training samples. The embedding feature space is fine-tuned so that the Euclidean distance in the space corresponds to the semantic face similarity. To this end, we devise a symmetric triplet loss function which optimizes the network more effectively than the conventional triplet loss. With the learned discriminative features, we apply an EM clustering algorithm to link tracklets across multiple shots to generate the final trajectories. We extensively evaluate the proposed algorithm on two sets of TV sitcoms and YouTube music videos, analyze the contribution of each component, and demonstrate significant performance improvement over existing techniques.
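The abstract names a symmetric triplet loss as the key training objective but the record does not give its formulation. The sketch below is an illustrative, non-authoritative rendering of one common symmetric variant: in addition to the conventional anchor–positive vs. anchor–negative comparison, the negative is also pushed away from the positive, so the gradient acts on both ends of the matched pair. The function name, the 0.5 weighting, and the margin value are assumptions for illustration, not the paper's exact definition.

```python
import numpy as np

def symmetric_triplet_loss(anchor, positive, negative, margin=1.0):
    """Illustrative sketch of a symmetric triplet loss (assumed form).

    Conventional triplet loss compares only d(anchor, positive) against
    d(anchor, negative). The symmetric variant sketched here also
    penalizes a small positive-negative distance, pushing the negative
    away from both members of the matched pair.
    """
    d_ap = np.sum((anchor - positive) ** 2)    # pull matched pair together
    d_an = np.sum((anchor - negative) ** 2)    # push negative from anchor
    d_pn = np.sum((positive - negative) ** 2)  # push negative from positive
    return max(0.0, d_ap - 0.5 * (d_an + d_pn) + margin)

# A far-away negative incurs no loss; a nearby one does.
far = symmetric_triplet_loss(np.zeros(2), np.zeros(2), np.array([10.0, 0.0]))
near = symmetric_triplet_loss(np.zeros(2), np.zeros(2), np.array([0.5, 0.0]))
```

In such a formulation, a triplet contributes zero loss once the negative is sufficiently separated from both the anchor and the positive, which matches the abstract's goal of making Euclidean distance in the embedding space track semantic face similarity.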

Original language: English
Journal: International Journal of Computer Vision
DOI: 10.1007/s11263-019-01212-1
Publication status: Accepted/In press - 2019 Jan 1

Fingerprint

Neural networks
Target tracking
Clustering algorithms
Lighting
Semantics
Trajectories

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Vision and Pattern Recognition
  • Artificial Intelligence

Cite this

Zhang, Shun; Huang, Jia-Bin; Lim, Jongwoo; Gong, Yihong; Wang, Jinjun; Ahuja, Narendra; Yang, Ming-Hsuan. / Tracking Persons-of-Interest via Unsupervised Representation Adaptation. In: International Journal of Computer Vision. 2019.
@article{89672cde82574eec9f4873514b28516f,
title = "Tracking Persons-of-Interest via Unsupervised Representation Adaptation",
abstract = "Multi-face tracking in unconstrained videos is a challenging problem as faces of one person often can appear drastically different in multiple shots due to significant variations in scale, pose, expression, illumination, and make-up. Existing multi-target tracking methods often use low-level features which are not sufficiently discriminative for identifying faces with such large appearance variations. In this paper, we tackle this problem by learning discriminative, video-specific face representations using convolutional neural networks (CNNs). Unlike existing CNN-based approaches which are only trained on large-scale face image datasets offline, we automatically generate a large number of training samples using the contextual constraints for a given video, and further adapt the pre-trained face CNN to the characters in the specific videos using discovered training samples. The embedding feature space is fine-tuned so that the Euclidean distance in the space corresponds to the semantic face similarity. To this end, we devise a symmetric triplet loss function which optimizes the network more effectively than the conventional triplet loss. With the learned discriminative features, we apply an EM clustering algorithm to link tracklets across multiple shots to generate the final trajectories. We extensively evaluate the proposed algorithm on two sets of TV sitcoms and YouTube music videos, analyze the contribution of each component, and demonstrate significant performance improvement over existing techniques.",
author = "Shun Zhang and Huang, {Jia Bin} and Jongwoo Lim and Yihong Gong and Jinjun Wang and Narendra Ahuja and Yang, {Ming Hsuan}",
year = "2019",
month = "1",
day = "1",
doi = "10.1007/s11263-019-01212-1",
language = "English",
journal = "International Journal of Computer Vision",
issn = "0920-5691",
publisher = "Springer Netherlands",
}

Tracking Persons-of-Interest via Unsupervised Representation Adaptation. / Zhang, Shun; Huang, Jia-Bin; Lim, Jongwoo; Gong, Yihong; Wang, Jinjun; Ahuja, Narendra; Yang, Ming-Hsuan.

In: International Journal of Computer Vision, 01.01.2019.

Research output: Contribution to journal › Article

TY - JOUR

T1 - Tracking Persons-of-Interest via Unsupervised Representation Adaptation

AU - Zhang, Shun

AU - Huang, Jia-Bin

AU - Lim, Jongwoo

AU - Gong, Yihong

AU - Wang, Jinjun

AU - Ahuja, Narendra

AU - Yang, Ming-Hsuan

PY - 2019/1/1

Y1 - 2019/1/1

N2 - Multi-face tracking in unconstrained videos is a challenging problem as faces of one person often can appear drastically different in multiple shots due to significant variations in scale, pose, expression, illumination, and make-up. Existing multi-target tracking methods often use low-level features which are not sufficiently discriminative for identifying faces with such large appearance variations. In this paper, we tackle this problem by learning discriminative, video-specific face representations using convolutional neural networks (CNNs). Unlike existing CNN-based approaches which are only trained on large-scale face image datasets offline, we automatically generate a large number of training samples using the contextual constraints for a given video, and further adapt the pre-trained face CNN to the characters in the specific videos using discovered training samples. The embedding feature space is fine-tuned so that the Euclidean distance in the space corresponds to the semantic face similarity. To this end, we devise a symmetric triplet loss function which optimizes the network more effectively than the conventional triplet loss. With the learned discriminative features, we apply an EM clustering algorithm to link tracklets across multiple shots to generate the final trajectories. We extensively evaluate the proposed algorithm on two sets of TV sitcoms and YouTube music videos, analyze the contribution of each component, and demonstrate significant performance improvement over existing techniques.

AB - Multi-face tracking in unconstrained videos is a challenging problem as faces of one person often can appear drastically different in multiple shots due to significant variations in scale, pose, expression, illumination, and make-up. Existing multi-target tracking methods often use low-level features which are not sufficiently discriminative for identifying faces with such large appearance variations. In this paper, we tackle this problem by learning discriminative, video-specific face representations using convolutional neural networks (CNNs). Unlike existing CNN-based approaches which are only trained on large-scale face image datasets offline, we automatically generate a large number of training samples using the contextual constraints for a given video, and further adapt the pre-trained face CNN to the characters in the specific videos using discovered training samples. The embedding feature space is fine-tuned so that the Euclidean distance in the space corresponds to the semantic face similarity. To this end, we devise a symmetric triplet loss function which optimizes the network more effectively than the conventional triplet loss. With the learned discriminative features, we apply an EM clustering algorithm to link tracklets across multiple shots to generate the final trajectories. We extensively evaluate the proposed algorithm on two sets of TV sitcoms and YouTube music videos, analyze the contribution of each component, and demonstrate significant performance improvement over existing techniques.

UR - http://www.scopus.com/inward/record.url?scp=85072210398&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85072210398&partnerID=8YFLogxK

U2 - 10.1007/s11263-019-01212-1

DO - 10.1007/s11263-019-01212-1

M3 - Article

AN - SCOPUS:85072210398

JO - International Journal of Computer Vision

JF - International Journal of Computer Vision

SN - 0920-5691

ER -