As the rate of single-thread CPU performance improvement per generation has diminished due to slower transistor-speed scaling and energy-related issues, researchers and industry have shifted their interest towards multi-core and many-core architectures for improving performance. Performance comparisons of optimized applications across parallel architectures have been reported many times in the literature, but the results are contradictory, mainly because of biased methods of evaluating and comparing these architectures. In this paper, we present memory-oblivious optimizations of the widely used Discrete Wavelet Transform (DWT), and provide detailed comparisons of the algorithm on Intel and AMD multi-core CPUs, Nvidia many-core GPUs, and Intel's Xeon Phi many-core coprocessor. Our results indicate that, compared to their respective non-optimized single-threaded implementations, memory-oblivious optimization delivers peak performance improvements ranging from 17.9× to 197.2×, depending on the architecture. Furthermore, the presented CPU and GPU memory-oblivious implementations are 2.6× and 1.3× faster, respectively, than the fastest implementations of the DWT currently available in the literature. No comparison to the state-of-the-art can be made for the Xeon Phi because, to the best of our knowledge, this is the first study that optimizes the DWT for this relatively new architecture.