Similar Documents
 20 similar documents found; search time: 515 ms
1.
Threads of parallel applications need to communicate in order to fulfill their tasks. The communication performance between the cores in modern multi‐core architectures differs because of the memory and interconnection hierarchies. In these architectures, it is important to map the threads of parallel applications by taking into account the communication between them, to improve their performance and energy consumption. In parallel applications based on shared memory, communication is implicit, which makes it difficult to detect the communication pattern between the threads. In this paper, we introduce a new lightweight mechanism to detect the communication pattern between threads of shared memory applications using the translation lookaside buffer. Our mechanism relies on hardware features, which make it transparent to the programmer and allow the detection to be performed by the operating system during the execution of the application. We also developed a heuristic mapping algorithm that uses the detected pattern to dynamically map the threads to cores. Experiments were performed with applications from the NAS‐OMP and PARSEC parallel benchmark suites in a simulated machine as well as a real machine. Results show that our mechanism can substantially improve parallel application performance, as well as processor and DRAM energy consumption. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   
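As a rough illustration of the mapping step in this abstract (not the authors' TLB-based detector or their in-kernel mapper), the sketch below greedily pairs the most heavily communicating threads onto adjacent cores and pins them with pthread_setaffinity_np; the thread count, the communication matrix and the core numbering are made-up assumptions.

```c
/* Illustrative greedy thread-to-core mapping driven by a communication
 * matrix with hypothetical values; not the paper's TLB-based mechanism. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NTHREADS 4

/* comm[i][j]: detected communication volume between threads i and j. */
static const int comm[NTHREADS][NTHREADS] = {
    {0, 90, 5, 2}, {90, 0, 3, 1}, {5, 3, 0, 80}, {2, 1, 80, 0}
};

/* Greedy heuristic: repeatedly take the heaviest unmapped pair and place
 * it on the next two (assumed adjacent) cores. */
static void map_threads(int core_of[NTHREADS])
{
    int mapped[NTHREADS] = {0};
    int next_core = 0;
    for (;;) {
        int bi = -1, bj = -1, best = -1;
        for (int i = 0; i < NTHREADS; i++)
            for (int j = i + 1; j < NTHREADS; j++)
                if (!mapped[i] && !mapped[j] && comm[i][j] > best) {
                    best = comm[i][j]; bi = i; bj = j;
                }
        if (bi < 0) break;
        core_of[bi] = next_core++;
        core_of[bj] = next_core++;
        mapped[bi] = mapped[bj] = 1;
    }
    /* Any leftover (odd) thread gets the next free core. */
    for (int i = 0; i < NTHREADS; i++)
        if (!mapped[i]) core_of[i] = next_core++;
}

/* Called by each worker thread to pin itself to its assigned core. */
void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    int core_of[NTHREADS];
    map_threads(core_of);
    for (int t = 0; t < NTHREADS; t++)
        printf("thread %d -> core %d\n", t, core_of[t]);
    return 0;
}
```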

2.
Transactional Memory (TM) is a programmer-friendly alternative to traditional lock-based concurrency. Although it intends to simplify concurrent programming, the performance of the applications still relies on how frequently they synchronize and on the way they access shared data. These aspects must be taken into consideration if one intends to exploit the full potential of modern multicore platforms. Since these platforms feature complex memory hierarchies composed of different levels of cache, applications may suffer from memory latencies and bandwidth problems if threads are not properly placed on cores. An interesting approach to efficiently exploit the memory hierarchy is called thread mapping. However, a single fixed thread mapping cannot deliver the best performance when dealing with a large range of transactional workloads, TM systems and platforms. In this article, we propose and implement in a TM system a set of adaptive thread mapping strategies for TM applications to tackle this problem. They range from simple strategies that do not require any prior knowledge to strategies based on Machine Learning techniques. Taking the Linux default strategy as the baseline, we achieved performance improvements of up to 64.4% on a set of synthetic applications and an overall performance improvement of up to 16.5% on the standard STAMP benchmark suite.

3.
The performance and energy efficiency of current systems are influenced by accesses to the memory hierarchy. One important aspect of memory hierarchies is the introduction of different memory access times, depending on the core that requested the transaction, and which cache or main memory bank responded to it. In this context, the locality of the memory accesses plays a key role for the performance and energy efficiency of parallel applications. Accesses to remote caches and NUMA nodes are more expensive than accesses to local ones. With information about the memory access pattern, pages can be migrated to the NUMA nodes that access them (data mapping), and threads that communicate can be migrated to the same node (thread mapping). In this paper, we present LAPT, a hardware-based mechanism to store the memory access pattern of parallel applications in the page table. The operating system uses the detected memory access pattern to perform an optimized thread and data mapping during the execution of the parallel application. Experiments with a wide range of parallel applications (from the NAS and PARSEC Benchmark Suites) on a NUMA machine showed significant performance and energy efficiency improvements of up to 19.2% and 15.7%, respectively (6.7% and 5.3% on average).
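A minimal user-space sketch of the data-mapping step, assuming the per-node access counts have already been collected by some mechanism such as the one the abstract describes: the page is migrated to its most frequent accessor with the Linux move_pages(2) call (via libnuma). The counts and the two-node layout are illustrative.

```c
/* Minimal sketch of "data mapping": migrate a page to the NUMA node that
 * accesses it most.  The access counts are hypothetical stand-ins for what
 * a mechanism such as LAPT would record.  Link with -lnuma. */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Pick the node with the highest recorded access count for this page. */
static int preferred_node(const long counts[], int nnodes)
{
    int best = 0;
    for (int n = 1; n < nnodes; n++)
        if (counts[n] > counts[best]) best = n;
    return best;
}

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    char *buf = aligned_alloc(page_size, page_size);
    buf[0] = 1;                       /* touch so the page is allocated   */

    long counts[2] = {3, 17};         /* hypothetical per-node accesses   */
    int target = preferred_node(counts, 2);

    void *pages[1]  = {buf};
    int   nodes[1]  = {target};
    int   status[1];
    /* Ask the kernel to move the page; status[0] reports the new node. */
    if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) == 0)
        printf("page moved, now on node %d\n", status[0]);
    else
        perror("move_pages");
    free(buf);
    return 0;
}
```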

4.
Servet is a suite of benchmarks focused on detecting a set of parameters with high influence on the overall performance of multicore systems. These parameters can be used for autotuning codes to increase their performance on multicore clusters. Although Servet has been shown to accurately detect cache hierarchies, memory access bandwidths and bottlenecks, and the communication overhead among cores, the impact of using this information for application performance optimization has not been assessed so far. This paper presents a novel algorithm that automatically uses Servet for mapping parallel applications on multicore systems and analyzes its impact on three testbeds using three different parallel programming models: message-passing, shared memory and partitioned global address space (PGAS). Our results show that a suitable mapping policy based on the data provided by this tool can significantly improve the performance of parallel applications without source code modification.

5.
In this paper, we propose a new parallel genome matching algorithm using graphics processing units (GPUs). Our approach is based on the Aho–Corasick algorithm and was developed with the architectural features of existing GPUs with a hundred or more cores in mind. We provide an appropriate task partitioning method that runs on multiple threads and fully utilizes the cache memory and the shared memory structures available in GPUs. In particular, we propose a tiled access method for rapid data transfer from global memory to shared memory. We also provide new models for a cache-friendly state transition table to improve the performance of pattern matching operations on GPUs. The maximum throughput we achieved in various experiments was 15.3 Gbps. Moreover, our design outperformed an earlier approach by 15.4%.
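For reference, the core idea of a dense, cache-friendly Aho–Corasick state transition table (one table lookup per input byte) can be sketched in plain C as below; the GPU-specific tiling and shared-memory staging from the abstract are not reproduced, and the patterns and text are toy examples.

```c
/* Sketch of an Aho-Corasick matcher that walks a dense state transition
 * table (one row per state, one column per byte value).  Failure links are
 * folded into the table at build time, so matching is a single lookup per
 * input byte -- the "cache-friendly" layout the abstract refers to. */
#include <stdio.h>

#define MAXS   64      /* max automaton states (enough for this demo) */
#define SIGMA  256     /* byte alphabet                                */

static int trans[MAXS][SIGMA];   /* dense goto table                  */
static int fail[MAXS];
static int out[MAXS];            /* bitmask of patterns ending here   */
static int nstates = 1;

static void add_pattern(const char *p, int id)
{
    int s = 0;
    for (; *p; p++) {
        unsigned char c = (unsigned char)*p;
        if (!trans[s][c]) trans[s][c] = nstates++;
        s = trans[s][c];
    }
    out[s] |= 1 << id;
}

/* Breadth-first pass that folds failure transitions into the dense table. */
static void build(void)
{
    int queue[MAXS], head = 0, tail = 0;
    for (int c = 0; c < SIGMA; c++)
        if (trans[0][c]) { fail[trans[0][c]] = 0; queue[tail++] = trans[0][c]; }
    while (head < tail) {
        int s = queue[head++];
        out[s] |= out[fail[s]];
        for (int c = 0; c < SIGMA; c++) {
            int t = trans[s][c];
            if (t) { fail[t] = trans[fail[s]][c]; queue[tail++] = t; }
            else   trans[s][c] = trans[fail[s]][c];
        }
    }
}

int main(void)
{
    const char *pats[] = {"he", "she", "his", "hers"};
    for (int i = 0; i < 4; i++) add_pattern(pats[i], i);
    build();

    const char *text = "ushers";
    int s = 0;
    for (int i = 0; text[i]; i++) {
        s = trans[s][(unsigned char)text[i]];   /* one lookup per byte */
        for (int p = 0; p < 4; p++)
            if (out[s] & (1 << p))
                printf("pattern \"%s\" ends at offset %d\n", pats[p], i);
    }
    return 0;
}
```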

6.
Weighted Shared Cache Partitioning for Multithreaded Multiprogrammed Workloads
When parallel applications execute on a multi-core processor with a shared cache, conflicting accesses to the shared cache cause performance degradation and non-deterministic execution times. Shared cache partitioning, which allocates the shared cache exclusively among multiple processes, is an effective way to solve this problem. Because of data sharing among threads, applications with different thread counts utilize the shared cache differently, but traditional shared cache partitioning algorithms that minimize the miss rate (such as UCP) do not distinguish applications by their number of threads. This paper designs a weighted shared cache partitioning framework for multithreaded multiprogrammed workloads (Weighted Cache Partitioning, WCP), consisting of an application-oriented miss-rate monitor and a weighted cache partitioning algorithm. The miss-rate monitor dynamically tracks, per process, the application's miss rate under different cache capacities; the weighted cache partitioning algorithm extends the traditional miss-rate-optimal partitioning algorithm by assigning applications different weights according to their thread counts during partitioning, so that applications with more threads obtain more of the shared cache and overall system performance improves. Experimental results show that although the weighted cache partitioning algorithm slightly increases the miss rate, it improves IPC throughput, weighted speedup, and fairness. On multiprogrammed workloads composed of scientific and engineering applications, the IPC throughput of WCP-1 is up to 10.8% higher, and on average 5.5% higher, than that of a shared cache partitioning algorithm whose objective is the lowest miss rate.
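A small sketch of the weighting idea, assuming hypothetical per-way miss curves and thread-count-derived weights (this is not the WCP implementation): a greedy loop hands each cache way to the application with the largest weighted miss reduction, so applications with more threads, and thus higher weights, tend to receive more ways.

```c
/* Illustrative greedy way-partitioning loop: each round gives one cache way
 * to the application with the largest *weighted* miss reduction.  The miss
 * curves and weights below are made-up numbers, not WCP's measured data. */
#include <stdio.h>

#define NAPPS 2
#define NWAYS 8

/* misses[a][w]: misses of application a when given w ways (hypothetical). */
static const double misses[NAPPS][NWAYS + 1] = {
    {100, 80, 65, 55, 48, 44, 42, 41, 40},   /* app 0: 1 thread  */
    {200, 150, 110, 80, 60, 50, 45, 42, 40}, /* app 1: 4 threads */
};
static const double weight[NAPPS] = {1.0, 4.0}; /* e.g. proportional to threads */

int main(void)
{
    int ways[NAPPS] = {0};
    for (int given = 0; given < NWAYS; given++) {
        int best = -1;
        double best_gain = -1.0;
        for (int a = 0; a < NAPPS; a++) {
            if (ways[a] == NWAYS) continue;
            double gain = weight[a] *
                          (misses[a][ways[a]] - misses[a][ways[a] + 1]);
            if (gain > best_gain) { best_gain = gain; best = a; }
        }
        ways[best]++;                 /* hand one more way to the winner */
    }
    for (int a = 0; a < NAPPS; a++)
        printf("app %d gets %d ways\n", a, ways[a]);
    return 0;
}
```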

7.
GPUs provide megabytes of registers and shared memories to maintain the contexts for thousands of threads and enable fast data sharing amongst threads of a thread block, respectively. Besides, GPUs employ L1 cache to provide the high bandwidth service for memory requests. However, the average L1 cache capacity per thread is very limited, resulting in cache thrashing which in turn impairs the performance. Meanwhile, many registers and shared memories are unassigned to any warps or thread blocks. Moreover, registers and shared memories that are assigned can be idle when warps or thread blocks are finished. Exploiting the above insights, we propose Virtual-Cache to cost-effectively increase the effective size of L1 cache by utilizing the unassigned and released registers and shared memories as cache-lines in this paper. Specifically, we leverage the unassigned registers and shared memories to serve cache requests directly. Regarding the registers assigned to a warp, they can work as cache-lines after the warp completes the execution and before they are accessed again by a new launched warp. Regarding the shared memories of a thread block, they are enabled to serve cache requests when the thread block is finished till they are referenced by shared memory instructions of the relaunched thread block. The register file, shared memory and L1 cache are physically independent but logically unified as a large virtual cache with redesigned cache-line management. We develop the control and data path for the register file, making the register file accessible for cache requests by borrowing an operand collector to serve the cache requests. We also expand the control and data path for the shared memory to serve the cache requests. Our evaluation results show that Virtual-Cache makes the performance improved by 28% over the previously proposed cache management technique for cache-sensitive applications.  相似文献   

8.
Memory affinity has become a key element to achieve scalable performance on multi-core platforms. Mechanisms such as thread scheduling, page allocation and cache prefetching are commonly employed to enhance memory affinity, which keeps data close to the cores that access it. In particular, software transactional memory (STM) applications exhibit irregular memory access behavior that makes it harder to determine which data will be needed by each core, and when. Additionally, existing STM runtime systems are decoupled from issues such as thread and memory management. In this paper, we thus propose a skeleton-driven mechanism to improve memory affinity on STM applications that fit the worklist pattern, employing a two-level approach. First, it addresses memory affinity at the DRAM level by automatically selecting page allocation policies. Then it employs data prefetching helper threads to improve affinity at the cache level. It relies on a skeleton framework to exploit the application pattern in order to provide automatic memory page allocation and cache prefetching. Our experimental results on the STAMP benchmark suite show that our proposed mechanism can achieve performance improvements of up to 46%, with an average of 11%, over a baseline version on two NUMA multi-core machines.
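A minimal sketch of the DRAM-level step, assuming libnuma is available and using a made-up "number of sharers" heuristic in place of the skeleton framework's automatic choice: data shared by many threads is interleaved across nodes, otherwise it stays node-local.

```c
/* Sketch of selecting a page allocation policy for a worklist's shared data
 * before the threads start.  The choose_policy() rule is an illustrative
 * assumption, not the paper's mechanism.  Link with -lnuma. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

enum policy { POLICY_LOCAL, POLICY_INTERLEAVE };

/* Toy heuristic: data touched by threads on several nodes is interleaved,
 * data used mostly by one thread stays local to the allocating node. */
static enum policy choose_policy(int sharers)
{
    return sharers > 1 ? POLICY_INTERLEAVE : POLICY_LOCAL;
}

static void *alloc_shared(size_t bytes, int sharers)
{
    if (numa_available() < 0) return malloc(bytes);
    if (choose_policy(sharers) == POLICY_INTERLEAVE)
        return numa_alloc_interleaved(bytes);   /* spread pages over nodes */
    return numa_alloc_local(bytes);             /* keep pages on this node */
}

int main(void)
{
    size_t bytes = 1u << 20;
    double *worklist = alloc_shared(bytes, 8 /* hypothetical sharers */);
    printf("worklist data allocated at %p\n", (void *)worklist);
    /* ... run the worklist phase on the parallel threads ... */
    if (numa_available() >= 0) numa_free(worklist, bytes);
    else free(worklist);
    return 0;
}
```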

9.
In this paper, we conduct performance scaling analysis of multithreaded multicore processors (MMPs) for parallel computing. We propose a thread-level closed-queuing network model covering a fairly large design space, accounting for hardware scaling models, coarse-grain, fine-grain, and simultaneous multithreading (SMT) cores, shared resources, including cache, memory, and critical sections. We then derive a closed-form solution for this model in terms of speedup performance measure. This solution makes it possible to analyze performance scaling properties of MMPs along multiple dimensions. In particular, we show that for the parallelizable part of the workload, the speedup, in the absence of resource contention, is no longer just a linear function of parallel processing unit counts, as predicted by Amdahl’s law, but also a strong function of workload characteristics, ranging from strong memory-bound to strong CPU-bound workloads. We also find that with core multithreading, super linear speedup, higher than that predicted by Amdahl’s law, may be achieved for the parallelizable part of the workload, if core threads exhibit strong cache affinity and the workload is strongly memory-bound. Then, we derive a tight speedup upper bound in the presence of both memory resource contention and critical section for multicore processors with single-threaded cores. This speedup upper bound indicates that with resource contention among threads, whether it is due to shared memory or critical section, a sequential term is guaranteed to emerge from the parallelizable part of the workload, fundamentally limiting the scalability of multicore processors for parallel computing, in addition to the sequential part of the workload, as dictated by Amdahl’s law. As a result, to improve speedup performance for MMPs, one should strive to enhance memory parallelism and confine critical sections as locally as possible, e.g., to the smallest possible number of threads in the same core.  相似文献   
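For context, Amdahl's law for a parallelizable fraction p on n processing units is shown below, followed by one illustrative (not necessarily the paper's exact) way to write a bound in which contention serializes a fraction f of the parallel work and thus caps the speedup even as n grows.

```latex
% Amdahl's law for a workload whose parallelizable fraction is p, run on n units:
\[
  S_{\mathrm{Amdahl}}(n) = \frac{1}{(1-p) + p/n}.
\]
% An illustrative way (an assumption, not the paper's derived expression) to model
% the sequential term that contention forces out of the parallel part: if a fraction
% f of the parallelizable work is serialized by memory contention or critical
% sections, the achievable speedup is bounded by
\[
  S(n) \le \frac{1}{(1-p) + pf + p(1-f)/n},
  \qquad
  \lim_{n\to\infty} S(n) \le \frac{1}{(1-p) + pf}.
\]
```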

10.
Design and Implementation of a Memory Model Simulator
Memory consistency and cache coherence are two of the most critical issues in shared-memory parallel computers. This work studies them quantitatively through simulation, designing and implementing a memory model simulator, MMS. Based on MMS, the behavior of several memory consistency models is simulated under different parallel machine architecture models; the consistency models are compared for different types of computational problems and the experimental results are analyzed; several cache coherence protocols are also implemented and their performance compared.

11.
This paper proposes an algorithm for solving the cache coherence and synchronization problems in PACT01, a novel processor architecture that integrates a dynamically programmable gate array (DPGA), and also addresses fast context switching and fast user-level operations for multithreading support. The memory replacement mechanism is a hardware implementation that maintains cache coherence and replaces data from local or remote memory into the cache on a cache miss; conflicts arise when multiple threads write and read the same location in parallel, or read or write the same location. An associative mapping strategy is chosen, together with a least recently used (LRU) replacement algorithm, i.e., on a cache miss the least recently used reference block is replaced; a counter associated with each block is used to implement the LRU algorithm.
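A toy software model of the per-block LRU counters described above (the actual mechanism is hardware in the PACT01 design; the cache geometry and address split here are assumed for illustration):

```c
/* Toy model of LRU bookkeeping with a counter per block: on a hit the line's
 * counter is cleared and the other lines in the set age; on a miss the line
 * with the largest counter (the least recently used one) is replaced. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 6          /* 64-byte blocks (assumed) */
#define NSETS      4
#define WAYS       2

struct line { int valid; uint32_t tag; unsigned age; };
static struct line cache[NSETS][WAYS];

static void touch(struct line set[], int hit_way)
{
    for (int w = 0; w < WAYS; w++)
        if (set[w].valid && w != hit_way) set[w].age++;
    set[hit_way].age = 0;
}

static void access_addr(uint32_t addr)
{
    uint32_t block = addr >> BLOCK_BITS;
    uint32_t set_i = block % NSETS;
    uint32_t tag   = block / NSETS;
    struct line *set = cache[set_i];

    for (int w = 0; w < WAYS; w++)
        if (set[w].valid && set[w].tag == tag) {
            printf("0x%08x hit  (set %u, way %d)\n", addr, set_i, w);
            touch(set, w);
            return;
        }
    /* Miss: prefer an empty way, otherwise evict the largest counter. */
    int victim = -1;
    for (int w = 0; w < WAYS; w++)
        if (!set[w].valid) { victim = w; break; }
    if (victim < 0) {
        victim = 0;
        for (int w = 1; w < WAYS; w++)
            if (set[w].age > set[victim].age) victim = w;
    }
    printf("0x%08x miss (set %u, fill/evict way %d)\n", addr, set_i, victim);
    set[victim].valid = 1;
    set[victim].tag   = tag;
    touch(set, victim);
}

int main(void)
{
    uint32_t trace[] = {0x000, 0x400, 0x800, 0x000, 0xC00, 0x400};
    for (unsigned i = 0; i < sizeof trace / sizeof *trace; i++)
        access_addr(trace[i]);
    return 0;
}
```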

12.
In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communications depend on cache coherence, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communications support for multithreaded applications. Prepushing is a software controlled data forwarding technique that sends data to destination’s cache before it is needed, eliminating cache misses in the destination’s cache as well as reducing the coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement with the addition of these architecture optimizations to multi-core processors.  相似文献   

13.
This paper presents a helper thread prefetching scheme that is designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have an advantage in that resources such as processor and L1 cache resources are not contended by the application and helper threads, hence preserving the speed of the application. However, interprocessor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important in ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely coupled system can be done effectively, we evaluate our prefetching by simulating a standard unmodified CMP system and an intelligent memory system where a simple processor in memory executes the helper thread. Evaluating our scheme with nine memory-intensive applications with the memory processor in DRAM achieves an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33. Using a real CMP system with a shared L2 cache between two cores, our helper thread prefetching plus hardware L2 prefetching achieves an average speedup of 1.15 over the hardware L2 prefetching for the subset of applications with high L2 cache misses per cycle.  相似文献   
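A user-level sketch of the run-ahead control idea, assuming two host threads instead of the paper's CMP or processor-in-memory setting: the helper prefetches within a fixed window ahead of the application thread, skips forward if it falls behind, and stalls if it gets too far ahead. The array, window size and work function are made up.

```c
/* Sketch of throttled helper-thread prefetching: the helper stays within
 * RUNAHEAD items of the application thread.  Illustrative only. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N        (1 << 22)
#define RUNAHEAD 256

static int data[N];
static atomic_long app_pos = 0;       /* next index the application will use */

static void *helper_thread(void *arg)
{
    (void)arg;
    long i = 0;
    while (i < N) {
        long a = atomic_load(&app_pos);
        if (i < a)                 /* fell behind: jump forward              */
            i = a;
        else if (i < a + RUNAHEAD) /* inside the window: prefetch, advance   */
            __builtin_prefetch(&data[i++], 0, 0);
        /* else: too far ahead, spin until the application catches up        */
    }
    return NULL;
}

int main(void)
{
    pthread_t h;
    pthread_create(&h, NULL, helper_thread, NULL);

    long sum = 0;
    for (long i = 0; i < N; i++) {
        sum += data[i];                       /* the "application" work      */
        atomic_store(&app_pos, i + 1);        /* publish progress to helper  */
    }
    pthread_join(h, NULL);
    printf("sum = %ld\n", sum);
    return 0;
}
```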

14.
黄光奇  李子木  周兴铭  窦勇 《计算机学报》2001,24(12):1318-1323
With the rapid development of semiconductor process technology, the single-chip multiprocessor (SCMP) is an effective way to improve processor performance. Based on an analysis of the characteristics of SCMP architectures, this paper proposes an SCMP implementation: the Shared Multi-Ported Data Cache Architecture (SMPDCA). SMPDCA has three prominent advantages: minimal communication latency, no cache coherence maintenance overhead, and an improved data cache hit rate. Simulation results show that, compared with a private data cache architecture, these advantages of SMPDCA clearly improve application performance, with the most pronounced effect on applications with a large amount of inter-processor communication and interaction.

15.
This paper extends research into rhombic overlapping-connectivity interconnection networks into the area of parallel applications. As a foundation for a shared-memory non-uniform access bus-based multiprocessor, these interconnection networks create overlapping groups of processors, buses, and memories, forming a clustered computer architecture where the clusters overlap. This overlapping-membership characteristic is shown to be useful for matching parallel application communication topology to the architecture's bandwidth characteristics. Many parallel applications can be mapped to the architecture topology so that most or all communication is localized within an overlapping cluster, at the low latency of processor direct to cache (or memory) over a bus. The latency of communication between parallel threads does not degrade parallel performance or limit the graininess of applications. Parallel applications can execute with good speedup and scaling on a proposed architecture which is designed to obtain maximum advantage from the overlapping-cluster characteristic, and also allows dynamic workload migration without moving the instructions or data. Scalability limitations of bus-based shared-memory multiprocessors are overcome by judicious workload allocation schemes, that take advantage of the overlapping-cluster memberships. Bus-based rhombic shared-memory multiprocessors are examined in terms of parallel speedup models to explain their advantages and justify their use as a foundation for the proposed computer architecture. Interconnection bandwidth is maximized with bi-directional circular and segmented overlapping buses. Strategies for mapping parallel application communication topologies to rhombic architectures are developed. Analytical models of enhanced rhombic multiprocessor performance are developed with a unique bandwidth modeling technique, and are compared with the results of simulation.  相似文献   

16.
Moving threads is a new kind of approach for multicore processor architectures. Traditionally, each thread stays in the core where it is created, and data is moved from the main memory via caches to each core and thread. In the moving threads approach, each core can access only a certain portion of the main memory via its local memory block, and thus extremely lightweight threads are moved between the cores. As a consequence, all kinds of cache coherence problems and need for read reply messages are eliminated. Also Lamport’s sequential consistency of shared memory multiprocessor systems is achieved for free. In this paper, we propose a processor architecture (MTPA) for the moving threads paradigm. We describe the overall structure, operation, instruction set, and thread management mechanism as well as evaluate the proposed architecture with different functional unit settings with simulations and give early silicon area and power consumption estimates.  相似文献   

17.
Lightweight threads have an important role to play in parallel systems: they can be used to exploit shared-memory parallelism, to mask communication and I/O latencies, to implement remote memory access, and to support task-parallel and irregular applications. In this paper, we address the question of how to integrate threads and communication in high-performance distributed-memory systems. We propose an approach based on global pointer and remote service request mechanisms, and explain how these mechanisms support dynamic communication structures, asynchronous messaging, dynamic thread creation and destruction, and a global memory model via interprocessor references. We also explain how these mechanisms can be implemented in various environments. Our global pointer and remote service request mechanisms have been incorporated in a runtime system called Nexus that is used as a compiler target for parallel languages and as a substrate for higher-level communication libraries. We report the results of performance studies conducted using a Nexus implementation; these results indicate that Nexus mechanisms can be implemented efficiently on commodity hardware and software systems.  相似文献   

18.
In a GPU, all threads within a warp execute the same instruction in lockstep. The memory requests of some threads can be serviced quickly, while the remaining requests take much longer; the warp cannot execute its next instruction until the slowest request completes, which causes memory divergence. This work studies the heterogeneity among warps in a GPU and implements and optimizes a cache management mechanism and a memory scheduling policy based on inter-warp heterogeneity, in order to reduce the negative effects of memory divergence and cache queuing delay. Warps are classified according to their cache hit ratio, and this classification drives three components: (1) a warp-type-based cache bypassing technique, which makes warps with low cache utility bypass the L2 cache; (2) a warp-type-based cache insertion/promotion policy, which prevents data from warps with high cache utility from being evicted prematurely; (3) a warp-type-based memory controller, which prioritizes requests received from warps with high cache utility and prioritizes requests from the same warp. Across eight different GPGPU applications, the cache management and memory scheduling mechanism based on inter-warp heterogeneity achieves an average speedup of 18.0% over the baseline GPU.
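A host-side illustration of the warp classification rule and the decisions it drives (the real mechanism is implemented in GPU hardware; the hit-ratio thresholds and the warp statistics below are assumptions):

```c
/* Illustration of binning warps by L2 hit ratio and letting the bin decide
 * cache bypassing and memory-controller priority.  Thresholds and numbers
 * are made up; this is not the hardware implementation. */
#include <stdio.h>

enum warp_type { ALL_MISS, MOSTLY_MISS, BALANCED, MOSTLY_HIT, ALL_HIT };

static enum warp_type classify(unsigned hits, unsigned accesses)
{
    double r = accesses ? (double)hits / accesses : 0.0;
    if (r == 0.0) return ALL_MISS;
    if (r < 0.4)  return MOSTLY_MISS;    /* thresholds are assumptions */
    if (r < 0.6)  return BALANCED;
    if (r < 1.0)  return MOSTLY_HIT;
    return ALL_HIT;
}

static int should_bypass_l2(enum warp_type t)
{
    return t == ALL_MISS || t == MOSTLY_MISS;   /* low-utility warps bypass */
}

static int mem_priority(enum warp_type t)
{
    return t >= MOSTLY_HIT ? 1 : 0;             /* high-utility served first */
}

int main(void)
{
    unsigned hits[]     = {0, 12, 55, 98};      /* hypothetical per-warp stats */
    unsigned accesses[] = {40, 40, 100, 100};
    for (int w = 0; w < 4; w++) {
        enum warp_type t = classify(hits[w], accesses[w]);
        printf("warp %d: type %d, bypass=%d, prio=%d\n",
               w, t, should_bypass_l2(t), mem_priority(t));
    }
    return 0;
}
```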

19.
A NUCA Substrate for Flexible CMP Cache Sharing   总被引:1,自引:0,他引:1  
We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency, and completely shared, in which every processor shares the entire cache, thus minimizing misses, and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.  相似文献   

20.
With the rapid development of multi-/many-core processors, contention in the shared cache has become increasingly serious and restricts the performance of parallel programs. Recent research has employed page coloring to realize cache partitioning on real systems and to reduce contention in the shared cache. However, page coloring-based cache partitioning has side effects: page coloring restricts the memory space an application can allocate, which may lead to memory pressure, and changing the cache partition dynamically requires massive page copying, which incurs a large overhead. To make page coloring-based cache partitioning more practical, this paper proposes a malloc allocator-based dynamic cache partitioning mechanism with page coloring. Memory allocated by our malloc allocator can be dynamically partitioned among different applications according to the partitioning policy. Coloring only the dynamically allocated pages relieves memory pressure and reduces the page copying overhead caused by re-coloring, compared to coloring all pages. To further alleviate the overhead, we introduce a minimum-distance page copying strategy and a lazy flush strategy. Experiments on a real system show that these strategies work well for reducing cache misses and re-coloring overhead.
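A sketch of the page-coloring arithmetic such a mechanism relies on, under an assumed cache geometry (2 MiB, 16-way, 64-byte lines, 4 KiB pages): the color is the part of the cache set index that lies above the page offset, so restricting an application to certain colors confines it to the matching slice of the shared cache.

```c
/* Page-color computation for an assumed cache geometry; the split of colors
 * between applications A and B is an illustrative policy, not the paper's. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12                  /* 4 KiB pages                      */
#define LINE_SHIFT  6                   /* 64-byte cache lines              */
#define SET_BITS    11                  /* 2 MiB / 16 ways / 64 B = 2048 sets */
#define COLOR_BITS  (LINE_SHIFT + SET_BITS - PAGE_SHIFT)   /* = 5           */
#define NCOLORS     (1u << COLOR_BITS)                     /* = 32 colors   */

/* Color = physical-page-number bits that overlap the cache set index. */
static unsigned page_color(uint64_t phys_addr)
{
    return (unsigned)((phys_addr >> PAGE_SHIFT) & (NCOLORS - 1));
}

int main(void)
{
    /* Example policy: application A owns colors 0..15, B owns 16..31,
     * i.e. each application gets half of the shared cache's sets. */
    uint64_t some_page = 0x12345000ull;     /* hypothetical physical address */
    unsigned c = page_color(some_page);
    printf("page 0x%llx has color %u -> app %c's partition\n",
           (unsigned long long)some_page, c, c < NCOLORS / 2 ? 'A' : 'B');
    return 0;
}
```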
