TY - GEN
T1 - A framework for low communication approaches for large scale 3D convolution
AU - Kulkarni, Anuva
AU - Kovacevic, Jelena
AU - Franchetti, Franz
N1 - Publisher Copyright: © 2022 Owner/Author.
PY - 2022/8/29
Y1 - 2022/8/29
N2 - Large-scale 3D convolutions computed using parallel Fast Fourier Transforms (FFTs) demand multiple all-to-all communication steps, which cause bottlenecks on computing clusters. Since data transfer speeds to/from memory have not increased proportionally to computational capacity (in terms of FLOPs), 3D FFTs become bounded by communication and are difficult to scale, especially on modern heterogeneous computing platforms consisting of accelerators like GPUs. Existing HPC frameworks focus on optimizing the isolated FFT algorithm or communication patterns, but still require multiple all-to-all communication steps during convolution. In this work, we present a strategy for scalable convolution such that it avoids multiple all-to-all exchanges, and also optimizes necessary communication. We provide proof-of-concept results under assumptions of a use case, the MASSIF Hooke's law simulation convolution kernel. Our method localizes computation by exploiting properties of the data, and approximates the convolution result by data compression, resulting in increased scalability of 3D convolution. Our preliminary results show scalability of 8 times more than traditional methods in the same compute resources without adversely affecting result accuracy. Our method can be adapted for first-principle scientific simulations and leverages cross-disciplinary knowledge of the application, the data and computing to perform large-scale convolution while avoiding communication bottlenecks. In order to make our approach widely usable and adaptable for emerging challenges, we discuss the use of FFTX, a novel framework which can be used for platform-agnostic specification and optimization for algorithmic approaches similar to ours.
AB - Large-scale 3D convolutions computed using parallel Fast Fourier Transforms (FFTs) demand multiple all-to-all communication steps, which cause bottlenecks on computing clusters. Since data transfer speeds to/from memory have not increased proportionally to computational capacity (in terms of FLOPs), 3D FFTs become bounded by communication and are difficult to scale, especially on modern heterogeneous computing platforms consisting of accelerators like GPUs. Existing HPC frameworks focus on optimizing the isolated FFT algorithm or communication patterns, but still require multiple all-to-all communication steps during convolution. In this work, we present a strategy for scalable convolution such that it avoids multiple all-to-all exchanges, and also optimizes necessary communication. We provide proof-of-concept results under assumptions of a use case, the MASSIF Hooke's law simulation convolution kernel. Our method localizes computation by exploiting properties of the data, and approximates the convolution result by data compression, resulting in increased scalability of 3D convolution. Our preliminary results show scalability of 8 times more than traditional methods in the same compute resources without adversely affecting result accuracy. Our method can be adapted for first-principle scientific simulations and leverages cross-disciplinary knowledge of the application, the data and computing to perform large-scale convolution while avoiding communication bottlenecks. In order to make our approach widely usable and adaptable for emerging challenges, we discuss the use of FFTX, a novel framework which can be used for platform-agnostic specification and optimization for algorithmic approaches similar to ours.
KW - Fast Fourier Transform
KW - GPU
KW - Green's functions
KW - Scalable Convolutions
KW - scientific simulations
UR - http://www.scopus.com/inward/record.url?scp=85147433837&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85147433837&partnerID=8YFLogxK
U2 - 10.1145/3547276.3548626
DO - 10.1145/3547276.3548626
M3 - Conference contribution
AN - SCOPUS:85147433837
T3 - ACM International Conference Proceeding Series
BT - 51st International Conference on Parallel Processing, ICPP 2022 - Workshop Proceedings
PB - Association for Computing Machinery
T2 - 51st International Conference on Parallel Processing, ICPP 2022
Y2 - 29 August 2022 through 1 September 2022
ER -