Shape-Oblivious Parallel Compiler Optimization for Matrix Computations
ZOU Yan-yan¹, AN Hong¹, CUI Hui-min², ZHOU Jun-rui¹
¹ (School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China)
² (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China)
Abstract: Matrix computations play an important role in scientific computing. Traditional compiler optimizations can greatly improve the performance of general matrix multiplication. For special matrix multiplication (such as triangular or banded matrices), however, performance remains very poor even with aggressive compiler optimizations, reaching only 1% of the performance of code hand-tuned by domain experts. In this paper, we present a pattern-based compiler optimization methodology that treats matrix multiplication as a pattern and defines a specialized optimization strategy for that pattern, one that works for both general and special matrix multiplication. The key step of the strategy is data layout reorganization, coupled with loop optimizations such as loop tiling. Data layout reorganization rearranges the matrix data to match the memory access order, which improves data locality. Experimental results show that our pattern-based compiler optimization achieves near-peak performance for both general and special matrix multiplication, with 34% and 43× speedups, respectively, over Intel's compiler (icc), and that the approach scales well.
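The combination of data layout reorganization and loop tiling described in the abstract can be illustrated with a minimal sketch. This is not the authors' compiler transformation; it is a hand-written illustration under stated assumptions: square n × n matrices stored as lists of rows, a tile size T that divides n, and hypothetical helper names `pack_tiles` and `tiled_matmul`. Each matrix is first copied into a tile-major layout so that every T × T tile occupies one contiguous run of memory, and the tiled loop nest then reads those runs in order. For a lower-triangular left operand, tiles strictly above the diagonal are all zero and are skipped.

```python
def pack_tiles(M, n, T):
    """Copy an n x n row-major matrix into a flat, tile-major list.

    Tile (bi, bj) starts at offset (bi * (n // T) + bj) * T * T, and the
    element at local position (i, j) inside a tile sits at i * T + j.
    Assumption for this sketch: T divides n.
    """
    assert n % T == 0, "this sketch assumes the tile size divides n"
    packed = []
    for i0 in range(0, n, T):
        for j0 in range(0, n, T):
            for i in range(i0, i0 + T):
                for j in range(j0, j0 + T):
                    packed.append(M[i][j])
    return packed


def tiled_matmul(A, B, n, T, a_lower_triangular=False):
    """Compute C = A * B with loop tiling over tile-major copies of A and B.

    If a_lower_triangular is True, A is assumed to be lower triangular, so
    tiles of A with tile-column index greater than the tile-row index are
    entirely zero and their contribution is skipped.
    """
    nb = n // T                      # number of tiles per dimension
    pa = pack_tiles(A, n, T)         # data layout reorganization of A
    pb = pack_tiles(B, n, T)         # data layout reorganization of B
    C = [[0] * n for _ in range(n)]
    for bi in range(nb):
        # For a lower-triangular A only tiles bk <= bi can be nonzero.
        k_tiles = range(bi + 1) if a_lower_triangular else range(nb)
        for bj in range(nb):
            for bk in k_tiles:
                a0 = (bi * nb + bk) * T * T   # start of tile A[bi][bk]
                b0 = (bk * nb + bj) * T * T   # start of tile B[bk][bj]
                for i in range(T):
                    row = C[bi * T + i]
                    for k in range(T):
                        a = pa[a0 + i * T + k]
                        for j in range(T):
                            row[bj * T + j] += a * pb[b0 + k * T + j]
    return C
```

In this sketch the innermost loops touch only two contiguous T × T blocks at a time, which is the locality benefit the layout reorganization is meant to provide. The triangular case reuses the same loop nest and differs only in the range of tile indices visited. A production version would also handle edge tiles when T does not divide n and would pack only the nonzero tiles of a triangular or banded operand.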