Accelerating engineering software on modern multi-core processors
Affiliation:1. Department of Multimedia and M-Commerce, Kainan University, Taiwan;2. Department of Information Communication, and Innovation Center for Big Data and Digital Convergence, Yuan Ze University, 135 Yuan-Tung Rd., Chung-Li 32003, Taiwan;1. Thrombosis & Atherosclerosis Research Institute (TAARI), McMaster University, Hamilton, Ontario, Canada;2. Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC, Canada;3. Division of Haematology-Oncology, Department of Pediatrics, McMaster University, Hamilton, Ontario, Canada;4. Department of Pediatrics, The Hong Kong University Shenzhen Hospital, China;5. Division of Haematology & Thromboembolism, Department of Medicine, McMaster University, Hamilton, Ontario, Canada;1. School of Aerospace Engineering, Xiamen University, Xiamen 361005, PR China;2. Department of Materials Science and Engineering, Pennsylvania State University, University Park, PA 16802, USA;3. College of Materials, and Research Centre of Materials Design and Applications, Xiamen University, Xiamen 361005, PR China;4. Fujian Key Laboratory of Materials Genome, Xiamen 361005, PR China
Abstract: Recent multi-core designs have migrated from Symmetric Multiprocessing (SMP) to cache-coherent Non-Uniform Memory Access (ccNUMA) architectures. In this paper we discuss performance issues that arise when designing parallel Finite Element programs for a 64-core ccNUMA computer and explore solutions to them. We first present an overview of the computer architecture and show that highly parallel code that ignores the system's memory organization scales poorly, achieving only a 2.8× speedup with 64 threads. We then describe how we identified the sources of overhead and evaluate three possible solutions. The first solution requires no changes to the application's code, but achieves a speedup of only 10.6×. The second scales performance up to 30.9×, but requires the programmer to manually schedule threads, allocate related data on local CPUs and memory banks, and rely on ccNUMA-aware libraries that are not portable across operating systems. We also propose and evaluate "copy-on-thread", an alternative solution that scales performance up to 25.5× without relying on specialized libraries or requiring specific data allocation and thread scheduling. Finally, we argue that the reported issues arise only for large data sets, and we conclude with recommendations to help programmers design algorithms and programs that perform well on this kind of machine.
Keywords: Parallel programming; Parallel processing; Cache-coherent Non-Uniform Memory Access; Finite Element Methods; Multi-core processors; Shared memory
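To put the speedups quoted in the abstract in perspective, they can be converted to parallel efficiency (speedup divided by thread count). The sketch below is only an illustrative calculation: the four speedup figures and the 64-thread count come from the abstract, while the dictionary labels and the helper name `parallel_efficiency` are assumptions introduced here, not terms from the paper.

```python
# Speedups reported in the abstract, all measured with 64 threads.
# The labels are informal descriptions chosen for this sketch.
THREADS = 64
speedups = {
    "NUMA-unaware baseline": 2.8,
    "solution 1 (no code changes)": 10.6,
    "solution 2 (manual placement, ccNUMA-aware libraries)": 30.9,
    "copy-on-thread": 25.5,
}


def parallel_efficiency(speedup, threads):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / threads


efficiencies = {
    name: parallel_efficiency(s, THREADS) for name, s in speedups.items()
}

for name, eff in efficiencies.items():
    print(f"{name}: {speedups[name]:.1f}x speedup, {eff:.1%} efficiency")

# How close copy-on-thread comes to the best (manually tuned) result.
relative_to_best = speedups["copy-on-thread"] / max(speedups.values())
print(f"copy-on-thread reaches {relative_to_best:.0%} of the best speedup")
```

By this measure the NUMA-unaware code uses under 5% of the machine's ideal capacity, while copy-on-thread recovers roughly four fifths of the speedup obtained with manual placement.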
This article is indexed in ScienceDirect and other databases.