TY - GEN
T1 - A massively parallel adaptive fast-multipole method on heterogeneous architectures
AU - Lashuk, Ilya
AU - Chandramowlishwaran, Aparna
AU - Langston, Harper
AU - Nguyen, Tuan Anh
AU - Sampath, Rahul
AU - Shringarpure, Aashay
AU - Vuduc, Richard
AU - Ying, Lexing
AU - Zorin, Denis
AU - Biros, George
PY - 2009
Y1 - 2009
N2 - We present new scalable algorithms and a new implementation of our kernel-independent fast multipole method (Ying et al. ACM/IEEE SC '03), in which we employ both distributed memory parallelism (via MPI) and shared memory/streaming parallelism (via GPU acceleration) to rapidly evaluate two-body non-oscillatory potentials. On traditional CPU-only systems, our implementation scales well up to 30 billion unknowns on 65K cores (AMD/CRAY-based Kraken system at NSF/NICS) for highly non-uniform point distributions. On GPU-enabled systems, we achieve 30x speedup for problems of up to 256 million points on 256 GPUs (Lincoln at NSF/NCSA) over a comparable CPU-only based implementations. We achieve scalability to such extreme core counts by adopting a new approach to scalable MPI-based tree construction and partitioning, and a new reduction algorithm for the evaluation phase. For the sub-components of the evaluation phase (the direct- and approximate-interactions, the target evaluation, and the source-to-multipole translations), we use NVIDIA's CUDA framework for GPU acceleration to achieve excellent performance. To do so requires carefully constructed data structure transformations, which we describe in the paper and whose cost we show is minor. Taken together, these components show promise for ultrascalable FMM in the petascale era and beyond.
AB - We present new scalable algorithms and a new implementation of our kernel-independent fast multipole method (Ying et al. ACM/IEEE SC '03), in which we employ both distributed memory parallelism (via MPI) and shared memory/streaming parallelism (via GPU acceleration) to rapidly evaluate two-body non-oscillatory potentials. On traditional CPU-only systems, our implementation scales well up to 30 billion unknowns on 65K cores (AMD/CRAY-based Kraken system at NSF/NICS) for highly non-uniform point distributions. On GPU-enabled systems, we achieve 30x speedup for problems of up to 256 million points on 256 GPUs (Lincoln at NSF/NCSA) over a comparable CPU-only based implementations. We achieve scalability to such extreme core counts by adopting a new approach to scalable MPI-based tree construction and partitioning, and a new reduction algorithm for the evaluation phase. For the sub-components of the evaluation phase (the direct- and approximate-interactions, the target evaluation, and the source-to-multipole translations), we use NVIDIA's CUDA framework for GPU acceleration to achieve excellent performance. To do so requires carefully constructed data structure transformations, which we describe in the paper and whose cost we show is minor. Taken together, these components show promise for ultrascalable FMM in the petascale era and beyond.
UR - http://www.scopus.com/inward/record.url?scp=74049157044&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=74049157044&partnerID=8YFLogxK
U2 - 10.1145/1654059.1654118
DO - 10.1145/1654059.1654118
M3 - Conference contribution
AN - SCOPUS:74049157044
SN - 9781605587448
T3 - Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09
BT - Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09
T2 - Conference on High Performance Computing Networking, Storage and Analysis, SC '09
Y2 - 14 November 2009 through 20 November 2009
ER -