We exploit textual information for recognizing human-human interaction activities in YouTube videos. YouTube videos are generally accompanied by various types of textual information, such as title, description, and tags. In particular, since some of the tags describe the visual content of the video, making good use of tags can aid activity recognition in the video. The proposed method uses two-fold information for activity recognition: (i) visual information: correlations among activities, human poses, configurations of human body parts, and image features extracted from visual content and (ii) textual information: correlations with activities extracted from tags. For tag analysis we discover a set of relevant tags and extract the meaningful words. Correlations between words and activities are learned from expanded tags obtained from tags of related videos. We develop a model that jointly captures two-fold information for activity recognition. We consider the model as a structured learning task with latent variables, and estimate the parameters of the model by using a non-convex minimization procedure. The proposed approach is evaluated using a dataset that consists of highly challenging real world videos and their assigned tags collected from YouTube. Experimental results demonstrate that by exploiting the visual and textual information in a structured framework, the proposed method can significantly improve the activity recognition results.
Bibliographical noteFunding Information:
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MEST) ( 2012R1A2A2A01046155 ).
All Science Journal Classification (ASJC) codes
- Signal Processing
- Computer Vision and Pattern Recognition
- Artificial Intelligence