The most efficient distributed matrix multiplication at large scale is the SUMMA algorithm. The nature of SUMMA requires that the matrices be memory-resident; otherwise the bottleneck becomes memory reads and writes rather than approaching the machine's FLOPS rate. And since multiplying N×N matrices is O(N^3), large matrix multiplication is already slow enough that it is much cheaper to buy enough memory than to wait for the computation to finish. Extrapolate how long it would take when N = 40,000 and up (easy to fit in memory) and you will see what I mean.
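To make the extrapolation concrete, here is a rough back-of-envelope sketch. The 100 GFLOP/s sustained rate is an assumption I picked for illustration; plug in your own machine's numbers. It just counts the ~2N^3 flops of a dense matmul against float64 storage for the three matrices:

```python
def matmul_estimate(n, flops_per_sec=1e11):
    """Rough cost of an n x n double-precision matmul.

    flops_per_sec is an assumed sustained rate (100 GFLOP/s here);
    adjust it for your own hardware.
    """
    flops = 2 * n**3                  # multiply-adds in a dense matmul
    mem_bytes = 3 * n**2 * 8          # A, B, and C in float64
    return flops / flops_per_sec, mem_bytes / 2**30   # seconds, GiB

for n in (40_000, 80_000, 160_000):
    secs, gib = matmul_estimate(n)
    print(f"N={n:>7,}: ~{secs / 3600:.1f} h compute, ~{gib:.0f} GiB resident")
```

Note the asymmetry: doubling N multiplies compute by 8 but memory by only 4, which is exactly why adding RAM is the cheaper fix.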
If you have an example of a matrix too large for your memory, but with sufficient FLOPS to compute the result in less than an hour or so, please provide it; I would be glad to consider it.