Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication
Penporn Koanantakool,
Ariful Azad,
Aydın Buluç,
Dmitriy Morozov,
Sang-Yun Oh,
Leonid Oliker,
Katherine Yelick.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 842-853, 2016.


Abstract
Multiplication of a sparse matrix with a dense matrix is a building block of an increasing number of applications in many areas such as machine learning and graph algorithms. However, most previous work on parallel matrix multiplication considered only the cases where both operands are dense or both are sparse. This paper analyzes the communication lower bounds and compares the communication costs of various classic parallel algorithms in the context of sparse-dense matrix-matrix multiplication. We also present new communication-avoiding algorithms based on a 1D decomposition, called 1.5D, which, while suboptimal in the dense-dense and sparse-sparse cases, outperform the 2D and 3D variants both theoretically and in practice for sparse-dense multiplication. Our analysis separates one-time costs from per-iteration costs in an iterative machine learning context. Experiments demonstrate speedups of up to 100x over a baseline 3D SUMMA implementation and show parallel scaling over 10 thousand cores.
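To make the operation under discussion concrete, the following is a minimal sketch (not the paper's implementation) of sparse-dense matrix-matrix multiplication, C = A * B with A sparse and B dense, using a 1D block-row decomposition of A simulated sequentially. The function name `spmm_1d`, the row-list sparse format, and the loop structure are illustrative assumptions, not code from the paper.

```python
# Hypothetical sketch of sparse-dense matmul with a 1D block-row split.
# A is sparse, stored as a list of rows; each row is a list of (col, val)
# pairs for its nonzeros. B and C are dense, stored as lists of lists.

def spmm_1d(A_rows, B, p):
    """Compute C = A * B, splitting A's rows into p block rows as a
    1D parallel algorithm would (here the 'processes' run sequentially)."""
    n = len(A_rows)          # rows of A (and C)
    k = len(B[0])            # columns of B (and C)
    C = [[0.0] * k for _ in range(n)]
    block = (n + p - 1) // p  # rows owned by each simulated process
    for rank in range(p):     # each "process" handles its own block row of A
        for i in range(rank * block, min((rank + 1) * block, n)):
            for j, a in A_rows[i]:   # only the nonzeros of A contribute
                for t in range(k):
                    C[i][t] += a * B[j][t]
    return C

# Tiny example: a 3x3 sparse A times a 3x2 dense B on p = 2 "processes".
A_rows = [[(0, 2.0)], [(1, 1.0), (2, 3.0)], []]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(spmm_1d(A_rows, B, 2))  # -> [[2.0, 4.0], [18.0, 22.0], [0.0, 0.0]]
```

The key property this sketch illustrates is that work and communication scale with the nonzeros of A rather than its full dimensions, which is why decompositions tuned for dense-dense multiplication are not automatically best here.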