Putting humans in a scene: Learning affordance in 3D indoor environments

Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

43 Citations (Scopus)


Affordance modeling plays an important role in visual understanding. In this paper, we aim to predict affordances of 3D indoor scenes, specifically what human poses are afforded by a given indoor environment, such as sitting on a chair or standing on the floor. In order to predict valid affordances and learn possible 3D human poses in indoor scenes, we need to understand the semantic and geometric structure of a scene as well as its potential interactions with a human. To learn such a model, a large-scale dataset of 3D indoor affordances is required. In this work, we build a fully automatic 3D pose synthesizer that fuses semantic knowledge from a large number of 2D poses extracted from TV shows as well as 3D geometric knowledge from voxel representations of indoor scenes. With the data created by the synthesizer, we introduce a 3D pose generative model to predict semantically plausible and physically feasible human poses within a given scene (provided as a single RGB, RGB-D, or depth image). We demonstrate that our human affordance prediction method consistently outperforms existing state-of-the-art methods.
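As the abstract notes, predicted poses must be physically feasible within the scene's voxel representation. The following is a minimal illustrative sketch (not the authors' implementation) of one such geometric check: a candidate pose, given as 3D joint positions, is tested against a voxel occupancy grid for collisions with scene geometry. All names (`is_pose_feasible`, `voxel_size`) are hypothetical.

```python
import numpy as np

def is_pose_feasible(joints, occupancy, voxel_size=0.05):
    """Toy geometric feasibility check for a candidate 3D pose.

    joints: (N, 3) array of joint positions in scene coordinates (meters).
    occupancy: 3D boolean voxel grid; True = occupied by scene geometry.
    Returns True if every joint lies inside the scene and in free space.
    """
    idx = np.floor(joints / voxel_size).astype(int)
    # Poses that reach outside the scene volume are invalid.
    if (idx < 0).any() or (idx >= np.array(occupancy.shape)).any():
        return False
    # A joint inside an occupied voxel means the pose intersects geometry.
    return not occupancy[idx[:, 0], idx[:, 1], idx[:, 2]].any()

# Example: a 1 m^3 scene with one solid occupied region (e.g., furniture).
grid = np.zeros((20, 20, 20), dtype=bool)
grid[10:, :, :10] = True  # occupied block in one corner of the scene
pose = np.array([[0.2, 0.5, 0.3], [0.2, 0.5, 0.8]])  # two joints, in meters
print(is_pose_feasible(pose, grid))  # True: both joints in free space
colliding = np.array([[0.6, 0.5, 0.2]])
print(is_pose_feasible(colliding, grid))  # False: joint inside the block
```

A full system would also need to test for support (e.g., feet on the floor or a sittable surface) and semantic plausibility, which this sketch omits.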

Original language: English
Title of host publication: Proceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
Publisher: IEEE Computer Society
Number of pages: 9
ISBN (Electronic): 9781728132938
Publication status: Published - 2019 Jun
Event: 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019 - Long Beach, United States
Duration: 2019 Jun 16 - 2019 Jun 20

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919


Conference: 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
Country/Territory: United States
City: Long Beach

Bibliographical note

Funding Information:
In this work, we propose to predict where and what human poses can be placed in 3D scenes using a two-stage pipeline. We develop a 3D pose synthesizer that automatically produces millions of ground-truth poses in 3D scenes by fusing semantic and geometric knowledge from the Sitcom dataset [27] and a 3D scene dataset [26, 30]. We then learn an end-to-end generative model that predicts both locations and gestures of human poses that are semantically plausible and geometrically feasible. Experimental results demonstrate the effectiveness of our proposed method against the state-of-the-art human affordance prediction method. Acknowledgement. We thank Soumyadip Sengupta and Jinwei Gu for providing the SUNCG-PBR dataset. This work is supported in part by the NSF CAREER Grant #1149783.

Publisher Copyright:
© 2019 IEEE.

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Vision and Pattern Recognition


