Sign language is a visual form of communication used among the deaf and is recognized as an official language in many countries. While there have been many efforts to achieve efficient translation between sign and verbal languages, much of this previous work applies only in non-mobile contexts or relies on RGB images, which can potentially invade users' privacy. This work presents our preliminary efforts in designing a mobile device-based sign language translation system that uses depth-only images. Our system performs image processing on smartphone-collected depth images to emphasize the subject's hand and upper-body gestures, and exploits a convolutional neural network for feature extraction. The sequence of features extracted from word-level videos is passed through a Long Short-Term Memory (LSTM) model for word-level sign language translation. We train and test our system using a total of 2,200 samples collected from 26 people for 17 words. With an efficient image preprocessing phase, our proposed system achieves 92% classification accuracy on the self-collected data.
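The pipeline described above can be sketched as follows: per-frame CNN features are fed to an LSTM, and the final hidden state is classified into one of the 17 word classes. This is a minimal illustrative sketch, not the paper's actual model; all dimensions (`D_FEAT`, `D_HID`, `N_FRAMES`), the random stand-in for CNN features, and the single-cell LSTM are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed dimensions for illustration only (not from the paper).
D_FEAT, D_HID, N_WORDS, N_FRAMES = 64, 32, 17, 30

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal LSTM cell operating on 1-D feature vectors."""
    def __init__(self, d_in, d_h):
        self.Wx = rng.standard_normal((d_in, 4 * d_h)) * 0.1
        self.Wh = rng.standard_normal((d_h, 4 * d_h)) * 0.1
        self.b = np.zeros(4 * d_h)

    def step(self, x, h, c):
        z = x @ self.Wx + h @ self.Wh + self.b
        i, f, g, o = np.split(z, 4)           # input, forget, candidate, output
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)            # update cell state
        h = o * np.tanh(c)                    # new hidden state
        return h, c

# Stand-in for CNN features of each depth frame (one vector per frame);
# in the actual system these would come from the convolutional network.
frame_features = rng.standard_normal((N_FRAMES, D_FEAT))

cell = LSTMCell(D_FEAT, D_HID)
h, c = np.zeros(D_HID), np.zeros(D_HID)
for x in frame_features:                      # run the LSTM over the sequence
    h, c = cell.step(x, h, c)

# Classify the final hidden state into one of the 17 words (softmax).
Wo = rng.standard_normal((D_HID, N_WORDS)) * 0.1
logits = h @ Wo
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted_word = int(np.argmax(probs))
```

In a trained system the weights would of course be learned from the 2,200 collected samples rather than drawn at random; the sketch only shows how a variable-length frame sequence collapses into a single word-level prediction.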