In this work we present SUGO, a depth video-based system for translating sign language to text using a smartphone's front camera. While exploiting depth-only videos offers benefits such as being less privacy-invasive than using RGB videos, it introduces new challenges, including low video resolutions and the sensors' sensitivity to user motion. We overcome these challenges by diversifying our sign language video dataset via data augmentation, making it robust to various usage scenarios, and by designing a set of schemes that emphasize human gestures in the input images for effective sign detection. The inference engine of SUGO is based on a 3-dimensional convolutional neural network (3DCNN) that classifies a sequence of video frames as one of a set of pre-trained words. Furthermore, the overall operations are designed to be lightweight so that sign language translation takes place in real time using only the resources available on a smartphone, without help from cloud servers or external sensing components. Specifically, to train and test SUGO, we collect sign language data from 20 individuals for 50 Korean Sign Language words, yielding a dataset of ∼5,000 sign gestures, and collect additional in-the-wild data to evaluate the performance of SUGO in real-world usage scenarios with different lighting conditions and daily activities. Our extensive evaluations show that SUGO can correctly classify sign words with an accuracy of up to 91% and suggest that the system is suitable (in terms of resource usage, latency, and environmental robustness) for a fully mobile sign language translation solution.
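The abstract's core idea is that a 3DCNN convolves learned filters jointly over the spatial and temporal axes of a depth-frame sequence, so a whole gesture clip is classified rather than individual frames. The sketch below illustrates that mechanism with a single naive spatio-temporal convolution in NumPy; the frame count, resolution, and kernel size are illustrative assumptions, not values from the paper, and a real system like SUGO would stack many learned layers.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid'-mode 3-D convolution over a (T, H, W) depth-frame volume.

    Each output value mixes information across several consecutive frames
    and a spatial neighborhood, which is what lets a 3DCNN capture motion.
    """
    kt, kh, kw = kernel.shape
    T, H, W = volume.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(volume[t:t + kt, i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
clip = rng.random((8, 16, 16))      # 8 low-resolution depth frames (illustrative sizes)
kernel = rng.random((3, 3, 3))      # one 3x3x3 spatio-temporal filter
features = np.maximum(conv3d_valid(clip, kernel), 0.0)  # ReLU nonlinearity
pooled = features.mean()            # global average pooling -> one scalar feature
print(features.shape)               # (6, 14, 14): time and space both shrink
```

In a full classifier, many such feature maps would feed a final layer with one output per vocabulary word (50 in the paper's setup), and the clip is labeled with the highest-scoring word.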
Journal: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Publication status: Published - June 2021
Bibliographical note (Funding Information):
The authors would like to thank all the study participants and the members of the Gyeonggi-Do Sign Language Education Institute. The authors would especially like to express gratitude to Mr. Gwangchul Park for his help in offering detailed application needs on a sign language user’s perspective, and for being a wonderful sign language teacher to the research team. This work was financially supported by the National Research Foundation of Korea (NRF) Grant funded by the Ministry of Science and ICT (No. 2021R1A2C4002380), and by the Yonsei University Research Fund (Grant No. 2020-22-0513).
© 2021 ACM.
All Science Journal Classification (ASJC) codes
- Human-Computer Interaction
- Hardware and Architecture
- Computer Networks and Communications