1.

Energy-efficient multithreading for a hierarchical heterogeneous multicore through locality-cognizant thread generation

Patrick A. La Fratta Peter M. Kogge 《Journal of Parallel and Distributed Computing》2013

Energy costs have become increasingly problematic for high performance processors, but the rising number of cores on-chip offers promising opportunities for energy reduction. Further, emerging architectures such as heterogeneous multicores present new opportunities for improved energy efficiency. While previous work has presented novel memory architectures, multithreading techniques, and data mapping strategies for reducing energy, little consideration has been given to thread generation mechanisms that take data locality into account for this purpose. This study presents methodologies for the joint partitioning of data and threads to parallelize sequential codes across an innovative heterogeneous multicore processor called the Passive/Active Multicore (PAM), reducing energy consumption from on-chip data transport and cache access components while also improving execution time. Experimental results show that the design with automatic thread partitioning offered reductions in energy-delay product (EDP) of up to 48%.
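For readers unfamiliar with the metric, the energy-delay product is conventionally the product of energy consumed and execution time. The sketch below computes it and the relative reduction between two designs. The function names and all numbers are hypothetical illustrations and are not taken from the paper.

```python
def energy_delay_product(energy_joules, delay_seconds):
    """EDP as conventionally defined: energy multiplied by execution time."""
    return energy_joules * delay_seconds

def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline

# Hypothetical measurements, chosen only to illustrate the arithmetic.
edp_base = energy_delay_product(10.0, 2.0)   # 20.0 J*s
edp_new = energy_delay_product(8.0, 1.5)     # 12.0 J*s
reduction = relative_reduction(edp_base, edp_new)
print(f"EDP reduction: {reduction:.0%}")
```

Because EDP multiplies the two quantities, a design that saves energy only by running proportionally longer shows no EDP improvement.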

2.

SMA: A Speculative Multithreading Architecture (Cited by: 4; self-citations: 1; citations by others: 3)

肖刚 周兴铭 徐明 邓鹍 《计算机学报》 (Chinese Journal of Computers) 1999,22(6):582-590

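This paper proposes a new ILP processor architecture, the speculative multithreading architecture (SMA). It combines speculative execution with multithreaded execution: speculation proceeds at the granularity of whole threads, and multiple threads execute in parallel while sharing the processor's hardware resources. The processor thereby combines the instruction windows of the individual threads into one large dynamic instruction window, exposing more of the program's ILP, and uses multithreaded execution to hide long-latency operations, achieving high resource utilization. The paper introduces the SMA execution model and discusses the implementation of an SMA processor and its key tech...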

3.

Helper threads via virtual multithreading

《Micro, IEEE》2004,24(6):74-82

Memory latency dominates the performance of many applications on modern processors, despite advances in caches and prefetching techniques. Numerous prefetching techniques, both in hardware and software, try to alleviate the memory bottleneck. One such technique, known as helper threading, improves single-thread performance on a simultaneous multithreaded architecture (SMT), which shares processor resources, including caches, among logical threads. It uses otherwise idle hardware thread contexts to execute speculative threads on behalf of the main thread. Helper threading accelerates a program by exploiting a processor's multithreading capability to run assist threads. Based on the helper threading usage model, virtual multithreading (VMT), a form of switch-on-event user-level multithreading, can improve performance for real-world workloads with a wall-clock speedup of 5.0 to 38.5 percent.

4.

A Study of Dynamic Branch Predictor Design Schemes for Simultaneous Multithreading Processors

任建 安虹 路放 梁博 《计算机科学》 (Computer Science) 2006,33(3):239-243

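A simultaneous multithreading (SMT) processor can issue instructions from multiple threads every cycle, greatly increasing the instruction throughput of a superscalar microprocessor, but executing multiple threads simultaneously also creates many conflicts over shared hardware resources. In particular, having multiple threads share the branch prediction hardware can substantially affect branch prediction accuracy. Studying how branch handling schemes in SMT processors affect overall processor performance is therefore important for guiding SMT processor design. Using an SMT processor simulator, this paper experimentally evaluates several well-known branch prediction schemes for an SMT configuration in which each thread runs an independent application. It analyzes the effect of the branch prediction scheme on branch prediction accuracy and overall processor performance in both the single-threaded and multithreaded cases, and concludes that, in such an SMT configuration, giving each thread its own predictor is a good choice; because each independent predictor can use a small, simple structure, this does not add much hardware overhead.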

5.

Software-directed register deallocation for simultaneous multithreaded processors

Lo J.L. Parekh S.S. Eggers S.J. Levy H.M. Tullsen D.M. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(9):922-933

This paper proposes and evaluates software techniques that increase register file utilization for simultaneous multithreading (SMT) processors. SMT processors require large register files to hold multiple thread contexts that can issue instructions out of order every cycle. By supporting better interthread sharing and management of physical registers, an SMT processor can reduce the number of registers required and can improve performance for a given register file size. Our techniques specifically target register deallocation. While out-of-order processors with register renaming are effective at knowing when a new physical register must be allocated, they have limited knowledge of when physical registers can be deallocated. We propose architectural extensions that permit the compiler and operating system to: 1) free registers immediately upon their last use, and 2) free registers allocated to idle thread contexts. Our results, based on detailed instruction-level simulations of an SMT processor, show that these techniques can increase performance significantly for register-intensive, multithreaded programs.

6.

A Method for Accelerating Single-Threaded Applications on the EDGE Architecture through Hyperblock Reorganization

魏学超 安虹 毛梦捷 《小型微型计算机系统》 (Journal of Chinese Computer Systems) 2012,(10):2249-2254

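The Explicit Data Graph Execution (EDGE) ISA is an instruction set architecture designed specifically for dataflow-like tiled many-core processors. In contrast to conventional control-flow-driven processors, an EDGE architecture uses the hyperblock rather than the individual instruction as its unit of execution: execution is dataflow within a hyperblock and control flow, in speculative order, between hyperblocks, which helps to exploit instruction-level parallelism. However, the EDGE compiler organizes hyperblocks according to the program's sequential execution order, and data dependences between and within hyperblocks weaken the potential data-level and thread-level parallelism of the running program, which makes it hard to exploit the advantages of the tiled EDGE architecture. By analyzing the characteristics of the EDGE compiler's hyperblock organization together with the execution model specific to EDGE, this paper proposes a general hyperblock organization framework that emulates the effect of multithreaded execution on an EDGE architecture, further exploiting instruction-level parallelism when an EDGE architecture runs sequential single-threaded programs. The paper uses the TRIPS microprocessor as the example EDGE processor and verifies the feasibility of the proposed framework with three experiments, including matrix multiplication; the results show that these applications achieve good performance improvements on TRIPS.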

7.

A comparison of concurrent programming and cooperative multithreading under load balancing applications

Justin T. Maris Aaron W. Keen Takashi Ishihara Ronald A. Olsson 《Concurrency and Computation》2004,16(4):345-369

Two models of thread execution are the general concurrent programming execution model (CP) and the cooperative multithreading execution model (CM). CP provides nondeterministic thread execution where context switches occur arbitrarily. CM provides threads that execute one at a time until they explicitly choose to yield the processor. This paper focuses on a classic application to reveal the advantages and disadvantages of load balancing during thread execution under CP and CM styles; results from a second classic application were similar. These applications are programmed in two different languages (SR and Dynamic C) on different hardware (standard PCs and embedded system controllers). An SR‐like run‐time system, DesCaRTeS, was developed to provide interprocess communication for the Dynamic C implementations. This paper compares load balancing and non‐load balancing implementations; it also compares CP and CM style implementations. The results show that in cases of very high or very low workloads, load balancing slightly hindered performance; and in cases of moderate workload, both SR and Dynamic C implementations of load balancing generally performed well. Further, for these applications, CM style programs outperform CP style programs in some cases, but the opposite occurs in some other cases. This paper also discusses qualitative tradeoffs between CM style programming and CP style programming for these applications. Copyright © 2004 John Wiley & Sons, Ltd.
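The CM model described above, in which a thread runs until it explicitly yields, can be illustrated with Python generators. This is a minimal sketch assuming a round-robin scheduler; the names `worker` and `run_cooperative` are hypothetical, and the code is unrelated to the SR, Dynamic C, or DesCaRTeS implementations the paper evaluates.

```python
from collections import deque

def worker(name, steps, log):
    """A cooperative thread: does one unit of work, then yields the processor."""
    for i in range(steps):
        log.append((name, i))  # the "work" is just recording progress
        yield                  # explicit yield point; no switch happens elsewhere

def run_cooperative(tasks):
    """Round-robin scheduler: resume each task until its next yield."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run the task up to its next yield
            ready.append(task)  # still alive, so requeue it
        except StopIteration:
            pass                # task finished; drop it

log = []
run_cooperative([worker("A", 2, log), worker("B", 3, log)])
print(log)  # [('A', 0), ('B', 0), ('A', 1), ('B', 1), ('B', 2)]
```

The interleaving is fully determined by where the tasks yield, which is the property that distinguishes CM from the arbitrary context switches of CP.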

8.

Power-efficient error tolerance in chip multiprocessors

Rashid M.W. Tan E.J. Huang M.C. Albonesi D.H. 《Micro, IEEE》2005,25(6):60-70

The microprocessor industry is rapidly moving to the use of multicore chips as general-purpose processors. Whereas the current generation of chip multiprocessors (CMPs) targets server applications, future desktop processors will likely have tens of multithreaded cores on a single die. Various redundant multithreading (RMT) approaches exploit the multithreaded capability of current general-purpose microprocessors. These approaches replicate the entire program, running it as a separate thread using time or space redundancy. This guards the processor core against all errors, including those in combinational logic. Because RMT exploits the existing multithreaded hardware, it requires only a modest amount of additional hardware support for comparing results and, depending on the implementation, duplicating inputs.

9.

Pipeline Parallelism Optimization Techniques on the Cell Heterogeneous Multicore Processor

曹倩 胡长军 李士刚 《计算机应用研究》 (Application Research of Computers) 2011,28(9):3344-3347

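To address the problem of how to exploit the strengths of heterogeneous multicore processors to improve program execution efficiency, this paper proposes two optimization techniques for the Cell heterogeneous multicore processor: thread-synchronized pipeline parallelism and iteration-synchronized pipeline parallelism. These techniques effectively speed up code with irregular writes and irregular control structures. Tests on the Cell processor with IS, EP, and LU from the NAS benchmarks and MOLDYN from SPEC2001 show that the pipelining scheme effectively improves the efficiency of critical sections and flush operations and markedly increases program execution speed.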

10.

DAFT: Decoupled Acyclic Fault Tolerance

Yun Zhang Jae W. Lee Nick P. Johnson David I. August 《International journal of parallel programming》2012,40(1):118-140

Higher transistor counts, lower voltage levels, and reduced noise margin increase the susceptibility of multicore processors to transient faults. Redundant hardware modules can detect such faults, but software techniques are more appealing for their low cost and flexibility. Recent software proposals have not achieved widespread acceptance because they either increase register pressure, double memory usage, or are too slow in the absence of hardware extensions. This paper presents DAFT, a fast, safe, and memory efficient transient fault detection framework for commodity multicore systems. DAFT replicates computation across multiple cores and schedules fault detection off the critical path. Where possible, values are speculated to be correct and only communicated to the redundant thread at essential program points. DAFT is implemented in the LLVM compiler framework and evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a commodity multicore system. Evaluation results demonstrate that speculation allows DAFT to improve the performance of software redundant multithreading by 2.17× with no degradation of fault coverage.

11.

Architectural support for thread communications in multi-core processors

Sevin Varoglu Stephen Jenks 《Parallel Computing》2011,37(1):26-41

In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communications depend on cache coherence, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communications support for multithreaded applications. Prepushing is a software controlled data forwarding technique that sends data to destination’s cache before it is needed, eliminating cache misses in the destination’s cache as well as reducing the coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement with the addition of these architecture optimizations to multi-core processors.

12.

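Maximum Steady-State Throughput Analysis for Temperature-Constrained Multicore Processors (Cited by: 1; self-citations: 0; citations by others: 1)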

张必英 陈红松 崔刚 傅忠传 《计算机研究与发展》 (Journal of Computer Research and Development) 2015,52(9):2083-2093

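As the power density of multicore processors keeps rising, performance analysis under temperature constraints has become an important part of early-stage design optimization for multicore processors. Processor temperature varies widely depending on the tasks being run, yet existing work does not consider how differences among tasks affect processor performance. For multicore processors that use dynamic voltage and frequency scaling as their thermal management technique, this paper proposes a new maximum-throughput analysis method that accounts for differences among tasks in order to improve the accuracy of steady-state throughput analysis under temperature constraints. It introduces task characteristics into the performance analysis model, establishes the relationship among the task characteristics on the individual cores when the multicore processor's throughput reaches its maximum, and reduces maximum steady-state throughput analysis to a linear programming problem. Simulation results show that the proposed method is accurate and that task characteristics have a very large effect on the maximum throughput of a multicore processor.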

13.

A Study of Parallelization Methods for Embedded ARM Multicore Processors

杨川 杨斌 《单片机与嵌入式系统应用》 (Microcontrollers & Embedded Systems) 2014,(7):9-12

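As embedded processor technology advances and performance expectations for embedded devices keep rising, embedded processors have moved from the single-core era into the multicore era. However, traditional embedded software development is still based on the single-core model; it does not exploit the multicore parallelism of embedded multicore processors and so does not realize their full performance. Although multicore parallelization methods are relatively mature on the PC platform, embedded multicore processors differ considerably in core count, cache, and bus, so embedded multicore parallelization cannot simply borrow PC practice. Studying multicore parallelization methods specifically for embedded platforms is therefore worthwhile.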

14.

Information-flow control on ARM and POWER multicore processors

Smith Graeme Coughlin Nicholas Murray Toby 《Formal Methods in System Design》2021,58(1-2):251-293

Weak memory models implemented on modern multicore processors are known to affect the correctness of concurrent code. They can also affect whether or not the concurrent code is secure. This is particularly the case in programs where the security levels of variables are value-dependent, i.e., depend on the values of other variables. In this paper, we illustrate how instruction reordering allowed by ARM and POWER multicore processors leads to vulnerabilities in such programs, and present a compositional, flow-sensitive information-flow logic which can be used to detect such vulnerabilities. The logic allows step-local reasoning (one instruction at a time) about a thread’s security by tracking information about dependencies between instructions which guarantee their order of occurrence. Program security can then be established from individual thread security using rely/guarantee reasoning. The logic has been proved sound with respect to existing operational semantics using Isabelle/HOL, and implemented in an automatic symbolic execution tool.


15.

A dynamic execution time estimation model to save energy in heterogeneous multicores running periodic tasks

《Future Generation Computer Systems》2016

Nowadays, real-time embedded applications have to cope with an increasing demand for functionality, which requires increasing processing capability. To this end, real-time systems are being implemented on top of high-performance multicore processors that run multithreaded periodic workloads by allocating threads to individual cores. In addition, to improve both performance and energy savings, the industry is introducing new multicore designs such as ARM’s big.LITTLE that include heterogeneous cores in the same package. A key issue in improving energy savings in multicore embedded real-time systems and reducing the number of deadline misses is to accurately estimate the execution time of the tasks considering the supported processor frequencies. Two main aspects make this estimation difficult. First, the running threads compete with one another for shared resources. Second, almost all current microprocessors implement Dynamic Voltage and Frequency Scaling (DVFS) regulators to dynamically adjust the voltage/frequency at run-time according to the workload behavior. Existing execution time estimation models rely on off-line analysis or on the assumption that the task execution time scales linearly with the processor frequency, which can cause significant deviations since the memory system uses a different power supply. In contrast, this paper proposes the Processor–Memory (Proc–Mem) model, which dynamically predicts the distinct task execution times depending on the implemented processor frequencies. A power-aware EDF (Earliest Deadline First)-based scheduler using the Proc–Mem approach has been evaluated and compared against the same scheduler using a typical Constant Memory Access Time model, namely CMAT. Results on a heterogeneous multicore processor show that the average deviation of Proc–Mem is only 5.55% with respect to the actual measured execution time, while the average deviation of the CMAT model is 36.42%. These results translate into significant energy savings, of 18% on average and up to 31% in some mixes, in comparison to CMAT for a similar number of deadline misses.
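To see why assuming that execution time scales linearly with processor frequency can mislead, consider a task whose memory stall time does not change with core frequency. The sketch below is only an illustration of that point under a deliberately simple assumption (a fixed memory time); it is neither the paper's Proc–Mem model nor its CMAT baseline, and every name and number in it is hypothetical.

```python
def linear_scaling_estimate(t_ref, f_ref, f_new):
    """Assume the whole execution time scales inversely with core frequency."""
    return t_ref * f_ref / f_new

def split_estimate(t_cpu_ref, t_mem, f_ref, f_new):
    """Scale only the core-bound part; treat memory time as frequency-independent."""
    return t_cpu_ref * f_ref / f_new + t_mem

# Hypothetical task: 10 ms at 2.0 GHz, of which 4 ms is spent waiting on memory.
t_ref, t_mem, f_ref, f_new = 10.0, 4.0, 2.0, 1.0
linear = linear_scaling_estimate(t_ref, f_ref, f_new)
split = split_estimate(t_ref - t_mem, t_mem, f_ref, f_new)
print(linear, split)  # 20.0 16.0
```

With these made-up numbers the purely linear estimate overshoots by 4 ms when the frequency is halved, and the gap grows as a task becomes more memory-bound.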

16.

Taxonomy of Data Prefetching for Multicore Processors

Surendra Byna (Member, IEEE) Yong Chen 《计算机科学技术学报》 (Journal of Computer Science and Technology) 2009,24(3):405-417

Data prefetching is an effective data access latency hiding technique to mask the CPU stall caused by cache misses and to bridge the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to a processor before it is actually needed. Many prefetching techniques have been developed for single-core processors. Recent developments in processor technology have brought multicore processors into mainstream. While some of the single-core prefetching t...

17.

Performance, optimization, and fitness: Connecting applications to architectures

Mohammad A. Bhuiyan Melissa C. Smith Vivek K. Pallipuram 《Concurrency and Computation》2011,23(10):1066-1100

Recent trends involving multicore processors and graphical processing units (GPUs) focus on exploiting task‐ and thread‐level parallelism. In this paper, we have analyzed various aspects of the performance of these architectures, including NVIDIA GPUs and multicore processors such as Intel Xeon, AMD Opteron, and IBM's Cell Broadband Engine. The case study used in this paper is a biological spiking neural network (SNN), implemented with the Izhikevich, Wilson, Morris–Lecar, and Hodgkin–Huxley neuron models. The four SNN models have varying requirements for communication and computation, making them useful for performance analysis of the hardware platforms. We report and analyze the variation of performance with network (problem size) scaling, available optimization techniques, and execution configuration. A Fitness performance model that predicts the suitability of the architecture for accelerating an application is proposed and verified with the SNN implementation results. The Roofline model, another existing performance model, has also been utilized to determine the hardware bottleneck(s) and attainable peak performance of the architectures. Significant speedups for the four SNN neuron models utilizing these architectures are reported; the maximum speedup of 574x was observed in our GPU implementation. Our results and analysis show that a proper match of architecture with algorithm complexity provides the best performance. Copyright © 2010 John Wiley & Sons, Ltd.
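The Roofline model mentioned in this abstract bounds attainable performance by the lower of peak compute throughput and memory bandwidth multiplied by arithmetic intensity. A minimal sketch of that bound follows; the machine parameters are hypothetical and are not taken from the paper's platforms.

```python
def roofline(peak_gflops, bandwidth_gbs, intensity):
    """Attainable GFLOP/s for a kernel with `intensity` FLOPs per byte moved."""
    return min(peak_gflops, bandwidth_gbs * intensity)

# Hypothetical machine: 100 GFLOP/s peak compute, 25 GB/s memory bandwidth.
PEAK, BW = 100.0, 25.0
ridge = PEAK / BW                      # intensity at which the two limits meet
memory_bound = roofline(PEAK, BW, 2.0)   # below the ridge: bandwidth-limited
compute_bound = roofline(PEAK, BW, 8.0)  # above the ridge: compute-limited
print(ridge, memory_bound, compute_bound)  # 4.0 50.0 100.0
```

A kernel to the left of the ridge point gains from reducing memory traffic, while one to the right gains only from more compute throughput.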

18.

Scheduling dense linear algebra operations on multicore processors

Jakub Kurzak Hatem Ltaief Jack Dongarra Rosa M. Badia 《Concurrency and Computation》2010,22(1):15-44

State‐of‐the‐art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to its inability to fully exploit thread‐level parallelism. At the same time, the coarse‐grain dataflow model gains popularity as a paradigm for programming multicore architectures. This work looks at implementing classic dense linear algebra workloads, the Cholesky factorization, the QR factorization and the LU factorization, using dynamic data‐driven execution. Two emerging approaches to implementing coarse‐grain dataflow are examined, the model of nested parallelism, represented by the Cilk framework, and the model of parallelism expressed through an arbitrary Directed Acyclic Graph, represented by the SMP Superscalar framework. Performance and coding effort are analyzed and compared against code manually parallelized at the thread level. Copyright © 2009 John Wiley & Sons, Ltd.

19.

Implementing a Parallelized Genetic Algorithm with Cilk++

杨川 杨斌 《微计算机应用》 (Microcomputer Applications) 2012,1(5):54-60

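A genetic algorithm is a computational model that simulates biological evolution; the selection, crossover, and mutation of genes within a single generation of the population are highly parallelizable. In practice, the population is often large and the volume of data processed is huge, so the algorithm performs poorly. Processors have entered the multicore era, but traditional programs are still written for a single core, so program performance does not improve as the number of cores grows. Parallelizing the genetic algorithm therefore lets it make full use of multicore processor resources and greatly improves its performance. Implementing a parallel genetic algorithm is in line with the future direction of multicore programming and helps genetic algorithms find wider use.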

20.

VCluster: a thread‐based Java middleware for SMP and heterogeneous clusters with thread migration support

Hua Zhang Joohan Lee Ratan Guha 《Software》2008,38(10):1049-1071

Clusters, composed of symmetric multiprocessor (SMP) machines and heterogeneous machines, have become increasingly popular for high‐performance computing. Message‐passing libraries, such as message‐passing interface (MPI) and parallel virtual machine (PVM), are de facto parallel programming libraries for clusters that usually consist of homogeneous and uni‐processor machines. For SMP machines, MPI is combined with multithreading libraries like POSIX Thread and OpenMP to take advantage of the architecture. In addition to existing parallel programming libraries that are in C/C++ and FORTRAN programming languages, the Java programming language presents itself as another alternative with its object‐oriented framework, platform neutral byte code, and ever‐increasing performance. This paper presents a new parallel programming model and a library, VCluster, which implements this model. VCluster is based on migrating virtual threads instead of processes to support clusters of SMP machines more efficiently. The implementation uses thread migration, which can be used in dynamic load balancing. VCluster was developed in pure Java, utilizing the portability of Java to support clusters of heterogeneous machines. Several applications are developed to illustrate the use of this library and compare the usability and performance of VCluster with other approaches. Copyright © 2007 John Wiley & Sons, Ltd.