1.

The Coming Wave of Multithreaded Chip Multiprocessors

James Laudon Lawrence Spracklen 《International journal of parallel programming》2007,35(3):299-330

The performance of microprocessors has increased exponentially for over 35 years. However, process technology challenges, chip power constraints, and difficulty in extracting instruction-level parallelism are conspiring to limit the performance of future individual processors. To address these limits, the computer industry has embraced chip multiprocessing (CMP), predominantly in the form of multiple high-performance superscalar processors on the same die. We explore the trade-off between building CMPs from a few high-performance cores or building CMPs from a large number of lower-performance cores and argue that CMPs built from a larger number of lower-performance cores can provide better performance and performance/Watt on many commercial workloads. We examine two multi-threaded CMPs built using a large number of processor cores: Sun’s Niagara and Niagara 2 processors. We also explore the programming issues for CMPs with a large number of threads. The programming model for these CMPs is similar to the widely used programming model for symmetric multiprocessors (SMPs), but the greatly reduced cost of communicating data through the on-chip shared secondary cache allows more fine-grain parallelism to be effectively exploited by the CMP. Finally, we present performance comparisons between Sun’s Niagara and more conventional dual-core processors built from large superscalar processor cores. For several key server workloads, Niagara shows significant performance and even more significant performance/Watt advantages over the CMPs built from traditional superscalar processors.

2.

SCMP: A Single-Chip Message-Passing Parallel Computer

Baker James M. Gold Brian Bucciero Mark Bennett Sidney Mahajan Rajneesh Ramachandran Priyadarshini Shah Jignesh 《The Journal of supercomputing》2004,30(2):133-149

As technology improves and transistor feature sizes continue to shrink, the effects of on-chip interconnect wire latencies on processor clock speeds will become more important. In addition, as we reach the limits of instruction-level parallelism that can be extracted from application programs, there will be an increased emphasis on thread-level parallelism. To continue to improve performance, computer architects will need to focus on architectures that can efficiently support thread-level parallelism while minimizing the length of on-chip interconnect wires. The SCMP (Single-Chip Message-Passing) parallel computer system is one such architecture. The SCMP system includes up to 64 processors on a single chip, connected in a 2-D mesh with nearest neighbor connections. Memory is included on-chip with the processors and the architecture includes hardware support for communication and the execution of parallel threads. Since there are no global signals or shared resources between the processors, the length of the interconnect wires will be determined by the size of the individual processors, not the size of the entire chip. Avoiding long interconnect wires will allow the use of very high clock frequencies, which, when coupled with the use of multiple processors, will offer tremendous computational power.

3.

Prefetching with Helper Threads for Loosely Coupled Multiprocessor Systems

Lee Jaejin Jung Changhee Lim Daeseob Solihin Yan 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(9):1309-1324

This paper presents a helper thread prefetching scheme that is designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have an advantage in that resources such as processor and L1 cache resources are not contended by the application and helper threads, hence preserving the speed of the application. However, interprocessor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important in ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely coupled system can be done effectively, we evaluate our prefetching by simulating a standard unmodified CMP system and an intelligent memory system where a simple processor in memory executes the helper thread. Evaluating our scheme with nine memory-intensive applications with the memory processor in DRAM achieves an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33. Using a real CMP system with a shared L2 cache between two cores, our helper thread prefetching plus hardware L2 prefetching achieves an average speedup of 1.15 over the hardware L2 prefetching for the subset of applications with high L2 cache misses per cycle.
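
The idea of bounding how far a helper thread may run ahead of the application thread can be illustrated with a small sketch. This is not the synchronization mechanism from the paper, whose details the abstract does not give; it is a generic counting-semaphore scheme, and the names `run_with_helper`, `n_iters`, and `max_ahead` are hypothetical.

```python
import threading

def run_with_helper(n_iters, max_ahead):
    """Run a helper and an application thread over n_iters loop iterations.

    Assumption (not from the paper): a counting semaphore limits the helper
    to at most max_ahead iterations in front of the application thread.
    Returns the largest lead of the helper that was observed.
    """
    slots = threading.Semaphore(max_ahead)  # remaining run-ahead budget
    lock = threading.Lock()
    progress = {"helper": 0, "app": 0, "max_lead": 0}

    def helper():
        for i in range(n_iters):
            slots.acquire()  # blocks once the helper is max_ahead in front
            # A real helper thread would issue the prefetch for iteration i here.
            with lock:
                progress["helper"] = i + 1
                lead = progress["helper"] - progress["app"]
                progress["max_lead"] = max(progress["max_lead"], lead)

    def application():
        for i in range(n_iters):
            # The real application would execute iteration i here.
            with lock:
                progress["app"] = i + 1
            slots.release()  # lets the helper advance by one more iteration

    threads = [threading.Thread(target=helper),
               threading.Thread(target=application)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return progress["max_lead"]
```

A small `max_ahead` keeps prefetched data from being evicted before it is used, while a larger value gives the helper more slack to hide memory latency; the sketch only enforces the bound and does not model either effect.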

4.

Architectural Support and Mechanisms for Object Caching in Dynamic Multithreaded Computations

《Journal of Parallel and Distributed Computing》1999,58(2):260-300

High-level parallel programming models supporting dynamic fine-grained threads in a global object space are becoming increasingly popular for expressing irregular applications based on sophisticated adaptive algorithms and pointer-based data structures. However, implementing these multithreaded computations on scalable parallel machines poses significant challenges, particularly with respect to object caching. Object caching techniques must be able to tolerate unresponsive processors and protocol handler occupancy delays. This paper examines whether these challenges can be offset by leveraging responsive general-purpose communication architectural features (such as remote memory access and atomic operations), possibly compensating for the lack of more sophisticated hardware primitives by relying upon increased involvement of the run-time system and the compiler. A detailed performance analysis of four irregular applications, using the Illinois Concert System on the Cray T3D and the SGI Origin 2000, finds that existing software distributed shared memory (DSM) systems are capable of delivering good performance only in the presence of a high level of responsive communication architecture support (specifically, support for remote atomic operations). Recognizing that this situation stems from the synchronous request–reply nature of DSM protocols, we present a composable object caching framework, called view caching, which exploits knowledge of application data access semantics to construct custom protocols that require reduced processor synchronization. View caching protocols are more tolerant to responsiveness and occupancy delays and are able to exploit even lower level responsive communication primitives (such as nonatomic remote memory accesses) for a performance benefit.

5.

A Study of Dynamic Branch Predictor Design Schemes for Simultaneous Multithreading Processors

Ren Jian An Hong Lu Fang Liang Bo 《Computer Science》(计算机科学) 2006,33(3):239-243

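A simultaneous multithreading (SMT) processor can issue instructions from multiple threads in each cycle, which greatly increases the instruction throughput of a superscalar microprocessor. However, executing multiple threads at the same time also causes many conflicts over shared hardware resources. In particular, sharing the branch prediction hardware among threads can significantly affect branch prediction accuracy. Studying how branch handling schemes in SMT processors affect overall processor performance is important for guiding SMT processor design. Using an SMT processor simulator, this paper experimentally evaluates several well-known branch prediction schemes on an SMT architecture in which each thread runs an independent application. It analyzes the impact of the branch prediction scheme on branch prediction accuracy and on overall processor performance in both the single-threaded and multithreaded cases. It concludes that, in such an SMT architecture, giving each thread its own predictor is a good choice, and that because each independent predictor can use a small, simple structure, this does not incur much hardware overhead.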

6.

An Evaluation of an OS-Based Coherence Scheme for Tiled CMPs

Christian Fensch Marcelo Cintra 《International journal of parallel programming》2011,39(3):271-295

The interconnect mechanisms (shared bus or crossbar) used in current chip-multiprocessors (CMPs) are expected to become a bottleneck that prevents these architectures from scaling to a larger number of cores. Tiled CMPs offer better scalability by integrating relatively simple cores with a lightweight point-to-point interconnect. However, such interconnects make snooping impractical and, thus, require alternative solutions to cache coherence. In this article, we investigate a novel, cost-effective mechanism to support shared-memory parallel applications that forgoes hardware-maintained cache coherence. This mechanism is based on the key ideas that mapping of lines to physical caches is done at the page level with OS support and that hardware supports remote cache accesses. We extend our previous work by investigating in detail the impact of system design parameters and extending the system to support multi-level cache hierarchies. Results show that the choice of implementation of multi-level cache hierarchies can have a significant impact on performance.

7.

The Nexus Approach to Integrating Multithreading and Communication

《Journal of Parallel and Distributed Computing》1996,37(1):70-82

Lightweight threads have an important role to play in parallel systems: they can be used to exploit shared-memory parallelism, to mask communication and I/O latencies, to implement remote memory access, and to support task-parallel and irregular applications. In this paper, we address the question of how to integrate threads and communication in high-performance distributed-memory systems. We propose an approach based on global pointer and remote service request mechanisms, and explain how these mechanisms support dynamic communication structures, asynchronous messaging, dynamic thread creation and destruction, and a global memory model via interprocessor references. We also explain how these mechanisms can be implemented in various environments. Our global pointer and remote service request mechanisms have been incorporated in a runtime system called Nexus that is used as a compiler target for parallel languages and as a substrate for higher-level communication libraries. We report the results of performance studies conducted using a Nexus implementation; these results indicate that Nexus mechanisms can be implemented efficiently on commodity hardware and software systems.

8.

A Study of the EARTH-MANNA Multithreaded System

Herbert H. J. Hum Olivier Maquelin Kevin B. Theobald Xinmin Tian Guang R. Gao Laurie J. Hendren 《International journal of parallel programming》1996,24(4):319-348

Multithreaded architectures have been proposed for future multiprocessor systems. However, some open issues remain. Can multithreading be supported in a multiprocessor so that it can tolerate synchronization and communication latencies, with little intrusion on the performance of sequentially-executed code? How much does such support contribute to scalable performance when communication and synchronization demands are high? In this paper, we describe the design of EARTH, an architecture which addresses these issues. Each processor in EARTH has an off-the-shelf Execution Unit (EU) for executing threads, and an ASIC Synchronization Unit (SU) supporting dataflow-like thread synchronizations, scheduling, and remote requests. In preparation for an implementation of the SU, we have emulated a basic EARTH model on MANNA 2.0, an existing multiprocessor whose hardware configuration closely matches EARTH. This EARTH-MANNA testbed is fully functional, enabling us to experiment with large benchmarks with impressive speed. With this platform, we demonstrate that multithreading support can be efficiently implemented (with little emulation overhead) in a multiprocessor without a major impact on uniprocessor performance. Also, we measure how much basic multithreading support can help in tolerating increasing communication/synchronization demands.

9.

Improving execution unit occupancy on SMT-based processors through hardware-aware thread scheduling

《Future Generation Computer Systems》2014

Modern processor architectures are increasingly complex and heterogeneous, often requiring software solutions tailored to the specific hardware characteristics of each processor model. In this article, we address this problem by targeting two processors featuring Simultaneous MultiThreading (SMT) to improve the occupancy of their internal execution units through a sustained stream of instructions coming from more than one thread. We target the AMD Bulldozer and IBM POWER7 processors as case studies for specific hardware-oriented performance optimizations that increase the variety of instructions sent to each core to maximize the occupancy of all its execution units. WorkOver, presented in this article, improves thread scheduling by increasing the performance of floating point-intensive workloads on Linux-based operating systems. WorkOver is a user-space monitoring tool that automatically identifies FPU-intensive threads and schedules them in a more efficient way without requiring any patches or modifications at the kernel level. Our measurements using standard benchmark suites show that speedups of up to 20% can be achieved by simply allowing WorkOver to monitor applications and schedule their threads, without any modification of the workload.

10.

Asynchronous problems on SIMD parallel computers

Shu W. Wu M.-Y. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(7):704-713

One of the essential problems in parallel computing is: Can SIMD machines handle asynchronous problems? This is a difficult, unsolved problem because of the mismatch between asynchronous problems and SIMD architectures. We propose a solution to let SIMD machines handle general asynchronous problems. Our approach is to implement a runtime support system which can run MIMD-like software on SIMD hardware. The runtime support system, named P kernel, is thread-based. There are two major advantages of the thread-based model. First, for application problems with irregular and/or unpredictable features, automatic scheduling can move some threads from overloaded processors to underloaded processors. Second, and more importantly, the granularity of threads can be controlled to reduce system overhead. The P kernel is also able to handle bookkeeping and message management, as well as to make these low-level tasks transparent to users. Substantial performance has been obtained on Maspar MP-1

11.

Developing Program Performance Analysis Tools Using the Itanium2 PMU

Zhang Yufeng 《Microcomputer Development》(微机发展) 2006,16(8):69-71

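The Itanium2 processor provides a performance monitoring unit, exposed as a set of registers, that captures microarchitectural events while a program runs. This paper describes a method for developing relatively high-level performance analysis tools on top of perfmon, the interface that Linux provides to the Itanium2 performance monitoring unit, so that the data supplied by the performance monitoring hardware can be processed and used in an integrated way.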

12.

Thread prioritization: A thread scheduling mechanism for multiple-context parallel processors

Stuart Fiske William J. Dally 《Future Generation Computer Systems》1995,11(6):503-518

Multiple-context processors provide register resources that allow rapid context switching between several threads as a means of tolerating long communication and synchronization latencies. When scheduling threads on such a processor, we must first decide which threads should have their state loaded into the multiple contexts, and second, which loaded thread is to execute instructions at any given time. In this paper we show that both decisions are important, and that incorrect choices can lead to serious performance degradation. We propose thread prioritization as a means of guiding both levels of scheduling. Each thread has a priority that can change dynamically, and that the scheduler uses to allocate as many computation resources as possible to critical threads. We briefly describe its implementation, and we show simulation performance results for a number of simple benchmarks in which synchronization performance is critical.
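
The two scheduling decisions described above (which threads to load into hardware contexts, and which loaded thread to run) can be sketched with priorities as the single guide. This is an illustrative toy, not the mechanism or implementation from the paper; `load_contexts`, `pick_running`, and the `stalled` set are hypothetical names.

```python
import heapq

def load_contexts(priorities, num_contexts):
    """First-level decision: load the num_contexts highest-priority threads.

    priorities maps a thread id to its current (dynamic) priority.
    """
    return heapq.nlargest(num_contexts, priorities, key=priorities.get)

def pick_running(loaded, priorities, stalled):
    """Second-level decision: run the highest-priority loaded thread that is
    not stalled on a long-latency operation; return None if all are stalled."""
    ready = [t for t in loaded if t not in stalled]
    if not ready:
        return None
    return max(ready, key=priorities.get)

# Example with four threads and two hardware contexts.
priorities = {"a": 5, "b": 1, "c": 9, "d": 3}
loaded = load_contexts(priorities, 2)                     # "c" and "a" are loaded
running = pick_running(loaded, priorities, stalled={"c"})  # "a" runs while "c" waits
```

Because priorities can change dynamically, both functions would be re-evaluated whenever a priority is updated or a thread stalls or resumes.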

13.

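An Effective Fetch Control Mechanism for Simultaneous Multithreading Processors (Cited by: 1; self-citations: 0; citations by others: 1)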

He Liqiang Liu Zhiyong 《Chinese Journal of Computers》(计算机学报) 2006,29(4):535-543

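A simultaneous multithreading processor greatly improves processor performance by fetching and executing instructions from multiple running threads in each clock cycle. The prediction accuracy of the branch predictor and the efficiency of the fetch policy are key to the performance of such a processor. This paper proposes a new fetch control mechanism that combines a value-based branch predictor with a fetch policy based on the rate at which each thread makes progress. The structure has low hardware overhead and low implementation complexity. Experimental results show that the mechanism effectively improves processor performance: its speedup over a conventional fetch control mechanism is 28%, and this speedup is also higher than that of current fetch control mechanisms based on stream buffers and on branch classifiers.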

14.

Generation, Optimization, and Evaluation of Multithreaded Code

Lucas Roh Walid A. Najjar A.P.Wim Böhm 《Journal of Parallel and Distributed Computing》1996,32(2):188

The recent advent of multithreaded architectures holds many promises: the exploitation of intrathread locality and the latency tolerance of multithreaded synchronization can result in a more efficient processor utilization and higher scalability. The challenge for a code generation scheme is to make effective use of the underlying hardware by generating large threads with a large degree of internal locality without limiting the program level parallelism or increasing latency. Top-down code generation, where threads are created directly from the compiler's intermediate form, is effective at creating a relatively large thread. However, having only a limited view of the code at any one time limits the quality of threads generated. These top-down generated threads can therefore be optimized by global, bottom-up optimization techniques. In this paper, we introduce the Pebbles multithreaded model of computation and analyze a code generation scheme whereby top-down code generation is combined with bottom-up optimizations. We evaluate the effectiveness of this scheme in terms of overall performance and specific thread characteristics such as size, length, instruction level parallelism, number of inputs, and synchronization costs.

15.

CMP Support for Large and Dependent Speculative Threads (Cited by: 1; self-citations: 0; citations by others: 1)

Colohan C.B. Ailamaki A.C. Steffan J.G. Mowry T.C. 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(8):1041-1054

Thread-level speculation (TLS) has proven to be a promising method of extracting parallelism from both integer and scientific workloads, targeting speculative threads that range in size from hundreds to several thousand dynamic instructions and which have minimal dependences between them. However, recent work has shown that TLS can offer compelling performance improvements when targeting much larger speculative threads of more than 50,000 dynamic instructions per thread with many frequent data dependences between them. To support such large and dependent speculative threads, the hardware must be able to buffer the additional speculative state and must also address the more challenging problem of tolerating the resulting cross-thread data dependences. In this paper, we present chip-multiprocessor (CMP) support for large speculative threads that integrates several previous proposals for TLS hardware. We also present support for subthreads: a mechanism for tolerating cross-thread data dependences by checkpointing speculative execution. Through an evaluation that exploits the proposed hardware support in the database domain, we find that the transaction response time for three of the five transactions from TPC-C (on a simulated four-processor chip-multiprocessor) improves by a factor of 1.9 to 2.9.

16.

A Scheduling Method Balancing Application Fairness and Energy Consumption on Heterogeneous Multicore Processors

Yang Yaqi Luan Zhongzhi Yang Hailong Yang Shu Qian Depei 《Computer Engineering & Science》(计算机工程与科学) 2016,38(5):848-856

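Heterogeneous multicore processors usually consist of high-performance big cores and low-power small cores, and scheduling threads sensibly on them can effectively improve resource utilization and save energy. The fairness-aware scheduling for big and small cores proposed in earlier papers did not consider cores with different frequency/voltage states, yet processors that support DVFS are increasingly common, so the calculation of inter-thread fairness needs to be extended and improved. This paper proposes a method for computing thread fairness on heterogeneous multicore processors when each core has several DVFS states, and improves an existing performance prediction model by using an adaptive algorithm to adjust the model's coefficients. On this basis it proposes a scheduling policy that keeps inter-thread fairness and processor power within preset thresholds while selecting the most energy-efficient configuration, thereby reducing the energy consumed by running applications. Experimental results show that, compared with the proposed scheduling policy, the static, DVFS-only, and swap-only scheduling methods consume on average more than 20% more energy at nearly the same total running time, and for some applications as much as 50% more.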

17.

Unified reliability estimation and management of NoC based chip multiprocessors

Alexandre Yasuo Yamamoto Cristinel Ababei 《Microprocessors and Microsystems》2014

We present a new architecture-level unified reliability evaluation methodology for chip multiprocessors (CMPs). The proposed reliability estimation (REST) is based on a Monte Carlo algorithm. What distinguishes REST from previous work is that both the computational and communication components are considered in a unified manner to compute the reliability of the CMP. We utilize the REST tool to develop a new dynamic reliability management (DRM) scheme to address time-dependent dielectric breakdown and negative-bias temperature instability aging mechanisms in network-on-chip (NoC) based CMPs. Designed as a control loop, the proposed DRM scheme uses an effective neural network based reliability estimation module. The neural-network predictor is trained using the REST tool. We investigate how the system’s lifetime changes when the NoC as the communication unit of the CMP is considered or not during the reliability evaluation process and find that differences can be as high as 60%. Full-system based simulations using a customized GEM5 simulator show that reliability can be improved by up to 52% using the proposed DRM scheme in a best-effort scenario with 2–9% performance penalty (using a user set target lifetime of 7 years) over the case when no DRM is employed.
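
Why including the NoC changes the estimated lifetime can be seen in a toy Monte Carlo experiment. The sketch below is not the REST tool: it assumes a series system (the chip fails when its first component fails) and Weibull-distributed component lifetimes, neither of which is stated in the abstract, and `estimate_mttf` and all parameter values are hypothetical.

```python
import random

def estimate_mttf(core_scales, noc_scales, shape=2.0, trials=10000, seed=0):
    """Monte Carlo estimate of mean time to failure (MTTF).

    Assumptions (illustrative only): component lifetimes are Weibull with the
    given scale parameters and a common shape, and the system fails as soon as
    any considered component fails.
    Returns (mttf_cores_only, mttf_cores_and_noc).
    """
    rng = random.Random(seed)
    total_cores_only = 0.0
    total_unified = 0.0
    for _ in range(trials):
        # Sample one lifetime per component and take the earliest failure.
        core_life = min(rng.weibullvariate(s, shape) for s in core_scales)
        noc_life = min(rng.weibullvariate(s, shape) for s in noc_scales)
        total_cores_only += core_life
        total_unified += min(core_life, noc_life)
    return total_cores_only / trials, total_unified / trials

# Four cores and four routers with made-up scale parameters (in years).
cores_only, unified = estimate_mttf([10.0] * 4, [12.0] * 4, trials=2000)
```

Under these assumptions the unified estimate can never exceed the cores-only estimate, since every sampled system lifetime is the minimum over a superset of components; how large the gap is depends entirely on the assumed distributions.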

18.

Verification of STM on relaxed memory models

Rachid Guerraoui Thomas A. Henzinger Vasu Singh 《Formal Methods in System Design》2011,39(3):297-331


19.

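An Efficient Memory Access Conflict Recording Method Supporting Deterministic Replay of Multicore Parallel Programs (Cited by: 2; self-citations: 0; citations by others: 2)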

Liu Lei Huang He Tang Zhimin 《Journal of Computer Research and Development》(计算机研究与发展) 2012,49(1):64-75

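The nondeterminism of parallel program execution on multicore systems makes debugging very difficult. Accurately recording the order of conflicting memory accesses in the original execution is the basis for deterministic replay of parallel programs. This paper proposes a method that records memory access conflicts by establishing precise happens-before relations. The method uses a simple and efficient address conflict detection mechanism to determine where each conflicting memory access falls in the happens-before order of the execution, which suppresses the generation of some records and so effectively reduces the amount of recorded information. Compared with other methods, it further reduces the number of log entries by 17%. It uses logical vector clocks to describe the happens-before relations between conflicting memory accesses; compared with scalar clocks, this avoids misidentifying happens-before relations and reduces the loss of parallelism during replay.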
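
The advantage of vector clocks over scalar clocks for happens-before tracking can be shown with the standard textbook comparison. The sketch below is generic and does not reproduce the paper's recording mechanism; `happens_before` and `concurrent` are hypothetical helper names, and each clock is a tuple with one entry per thread.

```python
def happens_before(a, b):
    """True if vector clock a precedes b: a <= b in every entry and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a, b):
    """True if neither access is ordered before the other."""
    return a != b and not happens_before(a, b) and not happens_before(b, a)

# Two threads; each clock is (events seen from thread 0, events seen from thread 1).
store_t0 = (2, 0)   # second event on thread 0, nothing observed from thread 1
load_t1 = (0, 1)    # first event on thread 1, nothing observed from thread 0
sync_t1 = (2, 2)    # thread 1 after it has observed thread 0's store

unordered = concurrent(store_t0, load_t1)     # no happens-before edge exists
ordered = happens_before(store_t0, sync_t1)   # a real happens-before edge
```

With scalar timestamps, `store_t0` and `load_t1` would carry values such as 2 and 1, and comparing them would suggest an order that does not exist; a replayer that enforced it would serialize two accesses that could safely run in parallel.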

20.

Threaded Runtime Support for Execution of Fine Grain Parallel Code on Coarse Grain Multiprocessors

Richard Neves Robert B. Schnabel 《Journal of Parallel and Distributed Computing》1997,42(2):128

The goal of this research is to provide systems support that allows fine grain, data parallel code to execute efficiently on much coarser grain multiprocessors. The task of writing parallel applications is simplified by allowing the programmer to assume a number of processors convenient to the algorithm being implemented. This paper describes and evaluates a runtime approach that efficiently manages thousands of virtual processors per actual processor. The limits in using user-level threads as fine grain virtual processors are identified. Key techniques used are tight integration and specialization of scheduling, communication, optimized context switching, and fine-tuned stack management. A prototype of this runtime approach is evaluated by comparing implementations of three problems, a smoothing kernel of a thin-layer Navier–Stokes code, a five point stencil problem, and a block bordered system of linear equations on an Intel Paragon multiprocessor and on a network of DEC Alpha workstations. The additional cost relative to an efficient manually contracted code can be as low as 15% for granularities of 50 floating point operations per virtual processor and is typically 5–20% for granularities of about 100 floating point operations per virtual processor. The overhead is analyzed in detail to show the costs of scheduling, communication, context switching, reduced memory performance, and ensuring data consistency. The implementation and analysis indicate that fine grain code can be efficiently executed on a coarse grain multiprocessor using very lightweight, specialized threads.