Estimating depth from a single monocular image is a fundamental problem in computer vision. Traditional methods for such estimation usually require complicated and sometimes labor-intensive processing. In this paper, we propose a new perspective on this problem and introduce a gradient-domain learning framework that is much simpler and more efficient. Inspired by the observation that image edges and depth discontinuities substantially co-occur in natural scenes, we learn the relationship between local appearance features and the corresponding depth gradients by applying K-means clustering in the image feature space. We then encode each cluster centroid with its associated depth gradients, which defines visual-depth words that model the image-depth relationship. The scene depth of an arbitrary image can then be estimated by simply selecting the proper depth gradients from a compact dictionary of visual-depth words, followed by a Poisson surface reconstruction. Experimental results demonstrate that the proposed gradient-domain approach outperforms state-of-the-art methods both qualitatively and quantitatively, and that it generalizes to unseen scene categories that are not used for training.
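As a rough illustration of the pipeline outlined above, the following Python sketch clusters feature vectors with K-means, pairs each centroid with a depth gradient, looks up gradients for new features, and integrates a gradient field into a depth map by least squares (a discrete Poisson problem). It is a minimal sketch under stated assumptions, not the implementation used in the paper: the function names (`nearest`, `kmeans`, `build_dictionary`, `estimate_gradients`, `integrate_gradients`), the choice of the mean member gradient as the encoding of each centroid, and the dense least-squares solver are all hypothetical choices made for illustration.

```python
import numpy as np

def nearest(X, centroids):
    """Index of the closest centroid (squared Euclidean) for each row of X."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means on the rows of X; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = nearest(X, centroids)
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its old centroid
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, nearest(X, centroids)

def build_dictionary(features, depth_grads, k):
    """Pair each feature centroid with a depth gradient.

    Assumption: each visual-depth word stores the mean depth gradient of
    the training samples assigned to its cluster.
    """
    centroids, labels = kmeans(features, k)
    words = np.zeros((k, depth_grads.shape[1]))
    for j in range(k):
        if np.any(labels == j):
            words[j] = depth_grads[labels == j].mean(axis=0)
    return centroids, words

def estimate_gradients(features, centroids, words):
    """Select, for each feature row, the gradient of its nearest centroid."""
    return words[nearest(features, centroids)]

def integrate_gradients(gx, gy):
    """Recover a depth map from forward-difference gradients.

    gx has shape (H, W-1) and gy has shape (H-1, W). The least-squares
    problem min ||D z - g||^2 is solved densely, which is only practical
    for small grids. Depth is defined up to an additive constant; the
    minimum-norm solution returned by lstsq has zero mean.
    """
    H, W = gx.shape[0], gy.shape[1]
    idx = np.arange(H * W).reshape(H, W)
    A = np.zeros((gx.size + gy.size, H * W))
    r = 0
    for plus, minus, g in ((idx[:, 1:], idx[:, :-1], gx),
                           (idx[1:, :], idx[:-1, :], gy)):
        rows = np.arange(r, r + g.size)
        A[rows, plus.ravel()] = 1.0    # z at the forward neighbor
        A[rows, minus.ravel()] = -1.0  # minus z at the current pixel
        r += g.size
    b = np.concatenate([gx.ravel(), gy.ravel()])
    z, *_ = np.linalg.lstsq(A, b, rcond=None)
    return z.reshape(H, W)
```

In practice, a sparse or FFT-based Poisson solver would replace the dense system for images of realistic size; the dense version is used here only to keep the sketch short and dependent on NumPy alone.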