This paper presents a novel convolutional neural network (CNN) based image compression framework via scalable auto-encoder (SAE). Specifically, our SAE based deep image codec consists of hierarchical coding layers, each of which is an end-to-end optimized auto-encoder. The coarse image content and texture are encoded through the first (base) layer while the consecutive (enhance) layers iteratively code the pixel-level reconstruction errors between the original and former reconstructed images. The proposed SAE structure alleviates the need to train multiple models for different bit-rate points by recently proposed auto-encoder based codecs. The SAE layers can be combined to realize multiple rate points, or to produce a scalable stream. The proposed method has similar rate-distortion performance in the low-to-medium rate range as the state-of-the-art CNN based image codec (which uses different optimized networks to realize different bit rates) over a standard public image dataset. Furthermore, the proposed codec generates better perceptual quality in this bit rate range.