Recognizing human-human interaction activities using visual and textual information

Sunyoung Cho, Sooyeong Kwak, Hyeran Byun

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

We exploit textual information for recognizing human-human interaction activities in YouTube videos. YouTube videos are generally accompanied by various types of textual information, such as titles, descriptions, and tags. In particular, since some tags describe the visual content of a video, making good use of them can aid activity recognition. The proposed method uses two-fold information for activity recognition: (i) visual information: correlations among activities, human poses, configurations of human body parts, and image features extracted from the visual content; and (ii) textual information: correlations with activities extracted from tags. For tag analysis, we discover a set of relevant tags and extract the meaningful words. Correlations between words and activities are learned from expanded tags obtained from the tags of related videos. We develop a model that jointly captures both kinds of information for activity recognition. We formulate the model as a structured learning task with latent variables and estimate its parameters using a non-convex minimization procedure. The proposed approach is evaluated on a dataset of highly challenging real-world videos and their assigned tags collected from YouTube. Experimental results demonstrate that, by exploiting visual and textual information in a structured framework, the proposed method significantly improves activity recognition results.
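The abstract outlines two technical pieces: learning word-activity correlations from expanded tags, and a joint visual-textual model trained as structured learning with latent variables. As a rough illustration only, the Python sketch below estimates smoothed word-activity correlations from tagged training videos and fuses them with per-activity visual scores. Everything in it is an assumption made for this sketch: the function names, the add-one-smoothed co-occurrence estimate, and the weighted late fusion are not from the paper.

# Hypothetical illustration only -- NOT the authors' code. Names,
# smoothing, and the linear fusion below are assumptions.
from collections import Counter, defaultdict
import math

def word_activity_correlations(videos):
    """Estimate P(activity | word) from (expanded-tag words, activity)
    training pairs via co-occurrence counts with add-one smoothing."""
    word_counts = Counter()
    pair_counts = defaultdict(Counter)
    activities = set()
    for words, activity in videos:
        activities.add(activity)
        for w in set(words):  # count each word at most once per video
            word_counts[w] += 1
            pair_counts[w][activity] += 1
    return {w: {a: (pair_counts[w][a] + 1.0) / (n_w + len(activities))
                for a in activities}
            for w, n_w in word_counts.items()}

def predict(visual_scores, words, corr, alpha=0.5):
    """Fuse per-activity visual scores (latent variables assumed already
    maximized out) with the log-probability of the video's tag words."""
    fused = {}
    for a, v in visual_scores.items():
        text = sum(math.log(corr[w][a]) for w in words if w in corr)
        fused[a] = alpha * v + (1.0 - alpha) * text
    return max(fused, key=fused.get)

# Toy usage with made-up data:
train = [(["hug", "friends"], "hugging"),
         (["handshake", "office"], "shaking_hands"),
         (["hug", "reunion"], "hugging")]
corr = word_activity_correlations(train)
print(predict({"hugging": 0.2, "shaking_hands": 0.4},
              ["hug", "reunion"], corr))  # -> "hugging"

Note that this late-fusion toy differs from the paper's approach, which combines the visual and textual terms inside a single structured score with latent variables and learns all parameters jointly via a non-convex minimization procedure.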

Original language: English
Pages (from-to): 1840-1848
Number of pages: 9
Journal: Pattern Recognition Letters
Volume: 34
Issue number: 15
DOI: 10.1016/j.patrec.2012.10.022
Publication status: Published - 2013 Jan 1

All Science Journal Classification (ASJC) codes

  • Software
  • Signal Processing
  • Computer Vision and Pattern Recognition
  • Artificial Intelligence

Cite this

@article{4e7e7f3f41cc48559f9279edc7e37ea9,
title = "Recognizing human-human interaction activities using visual and textual information",
author = "Sunyoung Cho and Sooyeong Kwak and Hyeran Byun",
year = "2013",
month = "1",
day = "1",
doi = "10.1016/j.patrec.2012.10.022",
language = "English",
volume = "34",
pages = "1840--1848",
journal = "Pattern Recognition Letters",
issn = "0167-8655",
publisher = "Elsevier",
number = "15",
}
