Many successful models for scene or object recognition transform low-level descriptors (such as Gabor filter responses, or SIFT descriptors) into richer representations of intermediate complexity. This process can often be broken down into two steps: (1) a coding step, which performs a pointwise transformation of the descriptors into a representation better adapted to the task, and (2) a pooling step, which summarizes the coded features over larger neighborhoods. Several combinations of coding and pooling schemes have been proposed in the literature. The goal of this paper is threefold. We seek to establish the relative importance of each step of mid-level feature extraction through a comprehensive cross-evaluation of several types of coding modules (hard and soft vector quantization, sparse coding) and pooling schemes (by taking the average, or the maximum), which obtains state-of-the-art performance or better on several recognition benchmarks. We show how to improve the best performing coding scheme by learning a supervised discriminative dictionary for sparse coding. We provide theoretical and empirical insight into the remarkable performance of max pooling. By teasing apart components shared by modern mid-level feature extractors, our approach aims to facilitate the design of better recognition architectures.
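The two-step pipeline described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the dictionary here is random rather than learned, and hard vector quantization stands in for the full set of coding modules studied. It shows how one coding scheme (hard VQ) combines with the two pooling schemes compared in the paper (average and max):

```python
import numpy as np

# Illustrative sketch (hypothetical sizes): 50 local descriptors of
# dimension 128 (e.g. SIFT), and a codebook of 100 atoms.
rng = np.random.default_rng(0)
descriptors = rng.standard_normal((50, 128))
dictionary = rng.standard_normal((100, 128))  # random stand-in for a learned codebook

# Coding step: hard vector quantization assigns each descriptor to its
# nearest codeword, yielding a one-hot code per descriptor.
dists = np.linalg.norm(descriptors[:, None, :] - dictionary[None, :, :], axis=2)
codes = np.zeros((50, 100))
codes[np.arange(50), dists.argmin(axis=1)] = 1.0

# Pooling step: summarize the codes over a neighborhood (here, all of them).
avg_pooled = codes.mean(axis=0)  # average pooling -> normalized codeword histogram
max_pooled = codes.max(axis=0)   # max pooling -> indicator of codeword presence
```

With hard VQ, average pooling recovers the classical bag-of-features histogram, while max pooling only records whether each codeword was activated at all; the contrast between these two summaries is at the heart of the pooling comparison.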