We present a multi-scale deep convolutional neural network (CNN) for automatic 2D-to-3D conversion. Traditional methods, which synthesize a virtual view from a reference view, consist of separate stages, i.e., depth (or disparity) estimation for the reference image and depth image-based rendering (DIBR) with the estimated depth. In contrast, we reformulate view synthesis as an image reconstruction problem with a spatial transformer module and directly generate stereo image pairs within a unified CNN framework, without ground-truth depth as supervision. We further propose a multi-scale architecture that captures large displacements between images at the coarse level and restores fine detail at the fine level. Experimental results on the KITTI driving dataset demonstrate the effectiveness of the proposed method over state-of-the-art approaches, both qualitatively and quantitatively.
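To make the reformulation concrete, the sketch below illustrates the general spatial-transformer-style warping step that such a framework relies on: a CNN predicts a disparity map, and the right view is synthesized from the left image by differentiable horizontal bilinear sampling, so the only supervision is a photometric loss against the real right image. This is a minimal illustration under assumed conventions (PyTorch, rectified stereo, positive disparity); the function and variable names are hypothetical and not the authors' code.

```python
# Minimal sketch of differentiable stereo synthesis via a spatial-transformer-style
# warp. Assumptions: PyTorch, rectified stereo pair, non-negative disparity in pixels.
# Names (warp_right_from_left, disparity) are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def warp_right_from_left(left, disparity):
    """left: (N, 3, H, W) image; disparity: (N, 1, H, W), in pixels."""
    n, _, h, w = left.shape
    # Base sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h, device=left.device),
        torch.linspace(-1.0, 1.0, w, device=left.device),
        indexing="ij",
    )
    # For rectified stereo, a right-view pixel at x samples the left image at
    # x + d along the horizontal epipolar line; convert d to normalized units.
    xs = xs.unsqueeze(0) + 2.0 * disparity.squeeze(1) / (w - 1)
    grid = torch.stack((xs, ys.unsqueeze(0).expand_as(xs)), dim=-1)  # (N, H, W, 2)
    # Bilinear sampling keeps the warp differentiable, so gradients flow back
    # into the disparity-predicting CNN without any ground-truth depth.
    return F.grid_sample(left, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Training signal: only the real right image, no depth labels, e.g.
# loss = F.l1_loss(warp_right_from_left(left, predicted_disparity), right)
```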