
High-performance optimizations on tiled many-core embedded systems: a matrix multiplication case study

Authors:	Arslan Munir, Farinaz Koushanfar, Ann Gordon-Ross, Sanjay Ranka

Affiliation:	1. Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA
2. Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA
3. NSF Center for High-Performance Reconfigurable Computing (CHREC), University of Florida, Gainesville, FL, USA
4. Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA

Abstract:	Technological advancements in the silicon industry, as predicted by Moore’s law, have resulted in an increasing number of processor cores on a single chip, giving rise to multicore and, subsequently, many-core architectures. This work focuses on identifying key architecture and software optimizations to attain high performance from tiled many-core architectures (TMAs), an architectural innovation in multicore technology. Although embedded systems design is traditionally power-centric, there has been a recent shift toward high-performance embedded computing due to the proliferation of compute-intensive embedded applications. TMAs are suitable for these embedded applications because many TMAs incorporate low-power design features. We discuss performance optimizations on a single tile (processor core) as well as parallel performance optimizations for TMAs, such as application decomposition, cache locality, tile locality, memory balancing, and horizontal communication. We elaborate on compiler-based optimizations that are applicable to TMAs, such as function inlining, loop unrolling, and feedback-based optimizations. We present a case study with optimized dense matrix multiplication algorithms for Tilera’s TILEPro64 to experimentally demonstrate performance and performance-per-watt optimizations on TMAs. Our results quantify the effectiveness of algorithmic choices, cache blocking, compiler optimizations, and horizontal communication in attaining high performance and performance per watt on TMAs.
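
To make the cache blocking mentioned in the abstract concrete, the following is a minimal sketch of a blocked dense matrix multiplication next to a plain triple-loop reference. It is not the authors' TILEPro64 implementation: the function names `matmul_naive` and `matmul_blocked`, the list-of-lists matrix layout, and the default block size of 2 are assumptions chosen for illustration. Pure Python will not show the speedup; the sketch only shows the loop structure that lets each sub-block be reused while it is still in cache.

```python
def matmul_naive(a, b):
    """Reference triple-loop product of two list-of-lists matrices."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same product, accumulated one block-by-block sub-matrix at a time.

    `block` is a hypothetical tuning parameter; in practice it would be
    chosen so that the working sub-blocks fit in the target cache level.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # The three outer loops walk over blocks of C, A, and B.
    for ii in range(0, n, block):
        for pp in range(0, k, block):
            for jj in range(0, m, block):
                # The three inner loops multiply one pair of sub-blocks.
                # min() handles edge blocks when sizes are not multiples of block.
                for i in range(ii, min(ii + block, n)):
                    row_c = c[i]
                    for p in range(pp, min(pp + block, k)):
                        a_ip = a[i][p]
                        row_b = b[p]
                        for j in range(jj, min(jj + block, m)):
                            row_c[j] += a_ip * row_b[j]
    return c
```

Both functions compute the same result; only the order in which elements are visited changes, which is the property a cache-blocked kernel relies on.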

This article is indexed in SpringerLink and other databases.
