Affordance modeling plays an important role in visual understanding. In this paper, we aim to predict affordances of 3D indoor scenes, specifically what human poses are afforded by a given indoor environment, such as sitting on a chair or standing on the floor. In order to predict valid affordances and learn possible 3D human poses in indoor scenes, we need to understand the semantic and geometric structure of a scene as well as its potential interactions with a human. To learn such a model, a large-scale dataset of 3D indoor affordances is required. In this work, we build a fully automatic 3D pose synthesizer that fuses semantic knowledge from a large number of 2D poses extracted from TV shows as well as 3D geometric knowledge from voxel representations of indoor scenes. With the data created by the synthesizer, we introduce a 3D pose generative model to predict semantically plausible and physically feasible human poses within a given scene (provided as a single RGB, RGB-D, or depth image). We demonstrate that our human affordance prediction method consistently outperforms existing state-of-the-art methods.
|Title of host publication||Proceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019|
|Publisher||IEEE Computer Society|
|Number of pages||9|
|Publication status||Published - 2019 Jun|
|Event||32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019 - Long Beach, United States|
Duration: 2019 Jun 16 → 2019 Jun 20
|Name||Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition|
|Conference||32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019|
|Period||19/6/16 → 19/6/20|
Bibliographical noteFunding Information:
In this work, we propose to predict where and what human poses can be put in 3D scenes using a two stage pipeline. We develop a 3D pose synthesizer that can produce millions of ground truth poses in 3D scenes automatically by fusing semantic and geometric knowledge from the Sitcom dataset  and a 3D scene dataset [26, 30]. Then we learn an end-to-end generative model that predicts both locations and gestures of human poses that are semantically plausible and geometrically feasible. Experimental results demonstrate the effectiveness of our proposed method against the stage-of-the-art human affordance prediction method. Acknowledgement. We thank Soumyadip Sengupta and Jinwei Gu for providing the SUNCG-PBR dataset. This work is supported in part by the NSF CAREER Grant #1149783.
All Science Journal Classification (ASJC) codes
- Computer Vision and Pattern Recognition