Loop Transforming for Reducing Data Alignment on Multi-Core SIMD Processors期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

Loop Transforming for Reducing Data Alignment on Multi-Core SIMD Processors

Authors:	Yi Wang Linfeng Pan Zili Shao Yong Guan Minyi Guo

Affiliation:	1. Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong 2. Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China 3. College of Information Engineering, Capital Normal University, Beijing, China

Abstract:	Multimedia SIMD extensions are commonly employed today to speed up media processing. When performing vectorization for SIMD architectures, one of the major issues is to handle the problem of memory alignment. Prior study focused on either vectorizing loops with all memory references being properly aligned, or introducing extra operations to deal with the misaligned memory references. On the other hand, multi-core SIMD architectures require coarse-grain parallelism. Therefore, it is an important problem to study how to parallelize and vectorize loop nests with the awareness of data misalignments. This paper presents a loop transformation scheme that maximizes the parallelism of outermost loops, while the misaligned memory references in innermost loops are reduced. The basic idea of our technique is to align each level of loops in the nest, considering the constraint of dependence relations. To reduce the data misalignments, we establish a mathematical model with a concept of offset-collection and propose an effective heuristic algorithm. For coarser-grain parallelism, we propose some rules to analyze the outermost loop. When transformations are applied, the inner loops are involved to maximize the parallelism. To avoid introducing more data misalignments, the involved innermost loop is handled from other levels of loops. Experimental results show that 7 % to 37 % (on average 18.4 %) misaligned memory references can be reduced. The simulations on CELL show that 1.1x speedup can be reached by reducing the misaligned data, while 6.14x speedup can be achieved by enhancing the parallelism for multi-core.

Keywords:
本文献已被 SpringerLink 等数据库收录！

设为首页 | 免责声明 | 关于勤云 | 加入收藏