TY - GEN
T1 - Efficient parallelization of the Discrete Wavelet Transform algorithm using memory-oblivious optimizations
AU - Keliris, Anastasis
AU - Dimitsas, Vasilis
AU - Kremmyda, Olympia
AU - Gizopoulos, Dimitris
AU - Maniatakos, Michail
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2015/12/4
Y1 - 2015/12/4
N2 - As the rate of single-thread CPU performance improvement per generation has diminished due to lower transistor-speed scaling and energy-related issues, researchers and industry have shifted their interest towards multi-core and many-core architectures for improving performance. The performance of optimized applications on parallel architectures has been compared many times in the literature, but contradictory results have been reported, mainly due to biased methods of evaluating and comparing these architectures. In this paper, we present memory-oblivious optimizations of the widely used Discrete Wavelet Transform (DWT) and provide detailed comparisons of the algorithm on Intel and AMD multi-core CPUs, Nvidia many-core GPUs, as well as Intel's Xeon Phi many-core coprocessor. Our results indicate that, compared to their respective non-optimized single-thread implementations, memory-oblivious optimization delivers performance improvements ranging from 17.9× to 197.2× across the various architectures examined. Furthermore, compared to the state of the art, the presented CPU and GPU memory-oblivious implementations are 2.6× and 1.3× faster, respectively, than the fastest DWT implementations currently available in the literature. No comparison to the state of the art can be made for the Xeon Phi, as, to the best of our knowledge, this is the first study that optimizes the DWT for this new architecture.
AB - As the rate of single-thread CPU performance improvement per generation has diminished due to lower transistor-speed scaling and energy-related issues, researchers and industry have shifted their interest towards multi-core and many-core architectures for improving performance. The performance of optimized applications on parallel architectures has been compared many times in the literature, but contradictory results have been reported, mainly due to biased methods of evaluating and comparing these architectures. In this paper, we present memory-oblivious optimizations of the widely used Discrete Wavelet Transform (DWT) and provide detailed comparisons of the algorithm on Intel and AMD multi-core CPUs, Nvidia many-core GPUs, as well as Intel's Xeon Phi many-core coprocessor. Our results indicate that, compared to their respective non-optimized single-thread implementations, memory-oblivious optimization delivers performance improvements ranging from 17.9× to 197.2× across the various architectures examined. Furthermore, compared to the state of the art, the presented CPU and GPU memory-oblivious implementations are 2.6× and 1.3× faster, respectively, than the fastest DWT implementations currently available in the literature. No comparison to the state of the art can be made for the Xeon Phi, as, to the best of our knowledge, this is the first study that optimizes the DWT for this new architecture.
UR - http://www.scopus.com/inward/record.url?scp=84959346263&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84959346263&partnerID=8YFLogxK
U2 - 10.1109/PATMOS.2015.7347583
DO - 10.1109/PATMOS.2015.7347583
M3 - Conference contribution
AN - SCOPUS:84959346263
T3 - Proceedings - 2015 25th International Workshop on Power and Timing Modeling, Optimization and Simulation, PATMOS 2015
SP - 25
EP - 32
BT - Proceedings - 2015 25th International Workshop on Power and Timing Modeling, Optimization and Simulation, PATMOS 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 25th International Workshop on Power and Timing Modeling, Optimization and Simulation, PATMOS 2015
Y2 - 1 September 2015 through 4 September 2015
ER -