Robust visual tracking via convolutional networks without training

Kaihua Zhang, Qingshan Liu, Yi Wu, Ming-Hsuan Yang

Research output: Contribution to journal › Article

231 Citations (Scopus)

Abstract

Deep networks have been successfully applied to visual tracking by learning a generic representation offline from numerous training images. However, the offline training is time-consuming, and the learned generic representation may be less discriminative for tracking specific objects. In this paper, we show that, even without offline training on a large amount of auxiliary data, simple two-layer convolutional networks can be powerful enough to learn robust representations for visual tracking. In the first frame, we extract a set of normalized patches from the target region as fixed filters, which integrate a series of adaptive contextual filters surrounding the target to define a set of feature maps in the subsequent frames. These maps measure similarities between each filter and useful local intensity patterns across the target, thereby encoding its local structural information. Furthermore, all the maps together form a global representation, via which the inner geometric layout of the target is also preserved. A simple soft shrinkage method that suppresses noisy values below an adaptive threshold is employed to denoise the global representation. Our convolutional networks have a lightweight structure and perform favorably against several state-of-the-art methods on the recent tracking benchmark data set with 50 challenging videos.
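For intuition, the following is a minimal NumPy sketch of the kind of representation the abstract describes: normalized first-frame patches act as fixed correlation filters, their response maps are stacked into a global representation, and soft shrinkage suppresses small responses. The function names, the random patch sampling, the zero-mean/unit-norm normalization, and the median-based threshold are illustrative assumptions, not the paper's exact design; in particular, the adaptive contextual filters are omitted.

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def normalize(patch):
    # zero-mean, unit-norm patch (one common normalization; an assumption)
    patch = patch - patch.mean()
    norm = np.linalg.norm(patch)
    return patch / norm if norm > 0 else patch

def sample_filters(target, num_filters=64, size=6, seed=0):
    # randomly sample normalized patches from the first-frame target
    # region; these stay fixed and act as the first-layer filters
    rng = np.random.default_rng(seed)
    h, w = target.shape
    ys = rng.integers(0, h - size + 1, num_filters)
    xs = rng.integers(0, w - size + 1, num_filters)
    return np.stack([normalize(target[y:y + size, x:x + size])
                     for y, x in zip(ys, xs)])            # (k, size, size)

def feature_maps(region, filters):
    # correlate every fixed filter with the candidate region; each map
    # scores how similar each location is to one local intensity pattern
    size = filters.shape[-1]
    wins = sliding_window_view(region, (size, size))      # (H', W', s, s)
    wins = wins.reshape(wins.shape[0], wins.shape[1], -1) # (H', W', s*s)
    flat = filters.reshape(len(filters), -1)              # (k, s*s)
    return np.einsum('hwp,kp->khw', wins, flat)           # (k, H', W')

def soft_shrink(rep, lam=None):
    # soft shrinkage: responses below the threshold go to zero, the rest
    # shrink toward zero; the median threshold here is an assumption,
    # standing in for the paper's own adaptive threshold
    if lam is None:
        lam = np.median(np.abs(rep))
    return np.sign(rep) * np.maximum(np.abs(rep) - lam, 0.0)

# usage on a synthetic grayscale frame
frame = np.random.rand(120, 160)
target = frame[30:70, 40:80]          # first-frame target region
filters = sample_filters(target)
rep = soft_shrink(feature_maps(target, filters))
print(rep.shape)                      # global representation, (64, 35, 35)

In tracking, the candidate window whose representation best matches the target template would be selected in each new frame; that matching step, like the contextual filters, is beyond this sketch.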

Original language: English
Article number: 7410052
Pages (from-to): 1779-1792
Number of pages: 14
Journal: IEEE Transactions on Image Processing
Volume: 25
Issue number: 4
DOI: 10.1109/TIP.2016.2531283
Publication status: Published - Apr 2016

Fingerprint

  • Adaptive filters

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Graphics and Computer-Aided Design

Cite this

Zhang, Kaihua; Liu, Qingshan; Wu, Yi; Yang, Ming-Hsuan. / Robust visual tracking via convolutional networks without training. In: IEEE Transactions on Image Processing. 2016; Vol. 25, No. 4. pp. 1779-1792.
@article{0a7d436a986c49d08905e39a1f8cfe0d,
title = "Robust visual tracking via convolutional networks without training",
abstract = "Deep networks have been successfully applied to visual tracking by learning a generic representation offline from numerous training images. However, the offline training is time-consuming, and the learned generic representation may be less discriminative for tracking specific objects. In this paper, we show that, even without offline training on a large amount of auxiliary data, simple two-layer convolutional networks can be powerful enough to learn robust representations for visual tracking. In the first frame, we extract a set of normalized patches from the target region as fixed filters, which integrate a series of adaptive contextual filters surrounding the target to define a set of feature maps in the subsequent frames. These maps measure similarities between each filter and useful local intensity patterns across the target, thereby encoding its local structural information. Furthermore, all the maps together form a global representation, via which the inner geometric layout of the target is also preserved. A simple soft shrinkage method that suppresses noisy values below an adaptive threshold is employed to denoise the global representation. Our convolutional networks have a lightweight structure and perform favorably against several state-of-the-art methods on the recent tracking benchmark data set with 50 challenging videos.",
author = "Zhang, Kaihua and Liu, Qingshan and Wu, Yi and Yang, {Ming-Hsuan}",
year = "2016",
month = apr,
doi = "10.1109/TIP.2016.2531283",
language = "English",
volume = "25",
pages = "1779--1792",
journal = "IEEE Transactions on Image Processing",
issn = "1057-7149",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "4",
}

Robust visual tracking via convolutional networks without training. / Zhang, Kaihua; Liu, Qingshan; Wu, Yi; Yang, Ming-Hsuan.

In: IEEE Transactions on Image Processing, Vol. 25, No. 4, 7410052, 04.2016, pp. 1779-1792.

Research output: Contribution to journal › Article

TY  - JOUR
T1  - Robust visual tracking via convolutional networks without training
AU  - Zhang, Kaihua
AU  - Liu, Qingshan
AU  - Wu, Yi
AU  - Yang, Ming-Hsuan
PY  - 2016/4
Y1  - 2016/4
N2  - Deep networks have been successfully applied to visual tracking by learning a generic representation offline from numerous training images. However, the offline training is time-consuming, and the learned generic representation may be less discriminative for tracking specific objects. In this paper, we show that, even without offline training on a large amount of auxiliary data, simple two-layer convolutional networks can be powerful enough to learn robust representations for visual tracking. In the first frame, we extract a set of normalized patches from the target region as fixed filters, which integrate a series of adaptive contextual filters surrounding the target to define a set of feature maps in the subsequent frames. These maps measure similarities between each filter and useful local intensity patterns across the target, thereby encoding its local structural information. Furthermore, all the maps together form a global representation, via which the inner geometric layout of the target is also preserved. A simple soft shrinkage method that suppresses noisy values below an adaptive threshold is employed to denoise the global representation. Our convolutional networks have a lightweight structure and perform favorably against several state-of-the-art methods on the recent tracking benchmark data set with 50 challenging videos.
AB  - Deep networks have been successfully applied to visual tracking by learning a generic representation offline from numerous training images. However, the offline training is time-consuming, and the learned generic representation may be less discriminative for tracking specific objects. In this paper, we show that, even without offline training on a large amount of auxiliary data, simple two-layer convolutional networks can be powerful enough to learn robust representations for visual tracking. In the first frame, we extract a set of normalized patches from the target region as fixed filters, which integrate a series of adaptive contextual filters surrounding the target to define a set of feature maps in the subsequent frames. These maps measure similarities between each filter and useful local intensity patterns across the target, thereby encoding its local structural information. Furthermore, all the maps together form a global representation, via which the inner geometric layout of the target is also preserved. A simple soft shrinkage method that suppresses noisy values below an adaptive threshold is employed to denoise the global representation. Our convolutional networks have a lightweight structure and perform favorably against several state-of-the-art methods on the recent tracking benchmark data set with 50 challenging videos.
UR  - http://www.scopus.com/inward/record.url?scp=84964645658&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84964645658&partnerID=8YFLogxK
U2  - 10.1109/TIP.2016.2531283
DO  - 10.1109/TIP.2016.2531283
M3  - Article
AN  - SCOPUS:84964645658
VL  - 25
SP  - 1779
EP  - 1792
JO  - IEEE Transactions on Image Processing
JF  - IEEE Transactions on Image Processing
SN  - 1057-7149
IS  - 4
M1  - 7410052
ER  -