LAPT: A locality-aware page table for thread and data mapping
Affiliation: 1. Informatics Institute, Federal University of Rio Grande do Sul, Porto Alegre, Brazil; 2. Department of Informatics and Statistics, Federal University of Santa Catarina, Florianópolis, Brazil
Abstract: The performance and energy efficiency of current systems are influenced by accesses to the memory hierarchy. One important aspect of memory hierarchies is that they introduce different memory access times, depending on which core requested the transaction and which cache or main memory bank responded to it. In this context, the locality of memory accesses plays a key role in the performance and energy efficiency of parallel applications: accesses to remote caches and NUMA nodes are more expensive than accesses to local ones. With information about the memory access pattern, pages can be migrated to the NUMA nodes that access them (data mapping), and threads that communicate can be migrated to the same node (thread mapping).

In this paper, we present LAPT, a hardware-based mechanism that stores the memory access pattern of parallel applications in the page table. The operating system uses the detected memory access pattern to perform an optimized thread and data mapping during the execution of the parallel application. Experiments with a wide range of parallel applications (from the NAS and PARSEC benchmark suites) on a NUMA machine showed significant performance and energy efficiency improvements of up to 19.2% and 15.7%, respectively (6.7% and 5.3% on average).
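The data mapping policy the abstract describes can be illustrated with a small sketch. The data structures below (a per-page table of access counts from each NUMA node) are hypothetical stand-ins for illustration, not LAPT's actual hardware page-table format: each page is migrated to the node that accesses it most often.

```python
# Illustrative sketch of locality-aware data mapping (hypothetical data
# layout, not LAPT's actual page-table extension): each page records how
# many times each NUMA node has accessed it, and the page's preferred
# node is the one with the highest access count.

def preferred_node(access_counts):
    """Return the index of the NUMA node that accessed this page most."""
    return max(range(len(access_counts)), key=lambda n: access_counts[n])

def data_mapping(page_access_counts):
    """Map each page address to its preferred NUMA node."""
    return {page: preferred_node(counts)
            for page, counts in page_access_counts.items()}

# Example: 3 pages, 2 NUMA nodes; counts[n] = accesses from node n.
pages = {
    0x1000: [10, 90],   # mostly accessed from node 1
    0x2000: [75, 25],   # mostly accessed from node 0
    0x3000: [50, 50],   # tie -> lowest-numbered node wins
}
print(data_mapping(pages))  # {4096: 1, 8192: 0, 12288: 0}
```

In a real system the migration itself would be performed by the operating system (e.g. via page migration between NUMA nodes); thread mapping follows the same idea, grouping threads that share pages onto the same node.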
Indexed by ScienceDirect and other databases.