首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
《Parallel Computing》1999,25(13-14):1741-1783
Over the past two decades tremendous progress has been made in both the design of parallel architectures and the compilers needed for exploiting parallelism on such architectures. In this paper we summarize the advances in compilation techniques for uncovering and effectively exploiting parallelism at various levels of granularity. We begin by describing the program analysis techniques through which parallelism is detected and expressed in form of a program representation. Next compilation techniques for scheduling instruction level parallelism (ILP) are discussed along with the relationship between the nature of compiler support and type of processor architecture. Compilation techniques for exploiting loop and task level parallelism on shared-memory multiprocessors (SMPs) are summarized. Locality optimizations that must be used in conjunction with parallelization techniques for achieving high performance on machines with complex memory hierarchies are also discussed. Finally we provide an overview of compilation techniques for distributed memory machines that must perform partitioning of both code and data for parallel execution. Communication optimization and code generation issues that are unique to such compilers are also briefly discussed.  相似文献   

2.
Global software pipelining is a complex but efficient compilation technique to exploit instruction-level parallelism for loops with branches.This paper presents a novel global software pipelining technique,called Trace Software Pipelining,targeted to the instruction-level parallel processors such as Very Long Instruction Word (VLIW) and superscalar machines.Trace software pipelining applies a global code scheduling technique to compact the original loop body.The resulting loop is called a trace software pipelined (TSP) code.The trace softwrae pipelined code can be directly executed with special architectural support or can be transformed into a globally software pipelined loop for the current VLIW and superscalar processors.Thus,exploiting parallelism across all iterations of a loop can be completed through compacting the original loop body with any global code scheduling technique.This makes our new technique very promising in practical compilers.Finally,we also present the preliminary experimental results to support our new approach.  相似文献   

3.
Explicit Data Graph Execution(EDGE)ISA是一种专门为类数据流驱动的分片式众核处理器而设计的指令集体系结构.相较于传统的采用控制流驱动的处理器,EDGE结构以超块(Hyperblock)而不是单个指令作为其执行单位,在超块内部实现数据流执行,超块之间按照推测序保持控制流执行,有利于挖掘指令级并行性.但是,EDGE编译器按照程序的串行执行顺序组织超块,超块间和超块内部受限于数据依赖,削弱了整个程序运行时的潜在数据级并行性和线程级并行性,不利于发挥EDGE分片式结构的优势.本文通过分析EDGE编译器超块组织的特点,结合EDGE结构特有的执行模型,提出一种普适性的超块组织框架来模拟EDGE结构上多线程运行的效果,进一步挖掘EDGE结构运行串行单线程程序时的指令级并行性.本文选用TRIPS微处理器作为EDGE结构的实例处理器,利用矩阵乘法等三个实验验证了我们所提出的框架的可行性,实验结果表明这些应用在TRIPS上获得了较好的性能提升.  相似文献   

4.
In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communications depend on cache coherence, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communications support for multithreaded applications. Prepushing is a software controlled data forwarding technique that sends data to destination’s cache before it is needed, eliminating cache misses in the destination’s cache as well as reducing the coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement with the addition of these architecture optimizations to multi-core processors.  相似文献   

5.
Multicore architectures are becoming the main design paradigm for current and future processors. The main reason is that multicore designs provide an effective way of overcoming instruction-level parallelism (ILP) limitations by exploiting thread-level parallelism (TLP). In addition, it is a power and complexity-effective way of taking advantage of the huge number of transistors that can be integrated on a chip. On the other hand, today's higher than ever power densities have made temperature one of the main limitations of microprocessor evolution. Thermal management in multicore architectures is a fairly new area. Some works have addressed dynamic thermal management in bi/quad-core architectures. This work provides insight and explores different alternatives for thermal management in multicore architectures with 16 cores. Schemes employing both energy reduction and activity migration are explored and improvements for thread migration schemes are proposed.  相似文献   

6.
This paper presents a unified framework that optimizes out-of-core programs by exploiting locality and parallelism, and reducing communication overhead. For out-of-core problems where the data set sizes far exceed the size of the available in-core memory, it is particularly important to exploit the memory hierarchy by optimizing the I/O accesses. We present algorithms that consider both iteration space (loop) and data space (file layout) transformations in a unified framework. We show that the performance of an out-of-core loop nest containing references to out-of-core arrays can be improved by using a suitable combination of file layout choices and loop restructuring transformations. Our approach considers array references one-by-one and attempts to optimize each reference for parallelism and locality. When there are references for which parallelism optimizations do not work, communication is vectorized so that data transfer can be performed before the innermost loop. Results from hand-compiles on IBM SP-2 and Inter Paragon distributed-memory message-passing architectures show that this approach reduces the execution times and improves the overall speedups. In addition, we extend the base algorithm to work with file layout constraints and show how it is useful for optimizing programs that consist of multiple loop nests  相似文献   

7.
This paper presents the design and implementation of a parallelization framework and OpenMP runtime support in Intel® C++ & Fortran compilers for exploiting nested parallelism in applications using OpenMP pragmas or directives. We conduct the performance evaluation of two multimedia applications parallelized with OpenMP pragmas and compiled with the Intel C++ compiler on Hyper-Threading Technology (HT) enabled multiprocessor systems. The performance results show that the multithreaded code generated by the Intel compiler achieved a speedup up to 4.69 on 4 processors with HT enabled for five different input video sequences for the H.264 encoder workload, and a 1.28 speedup on an HT enabled single-CPU system and 1.99 speedup on an HT-enabled dual-CPU system for the audio–visual speech recognition workload. The performance gain due to exploiting nested parallelism for leveraging Hyper-Threading Technology is up to 70% for two multimedia workloads under different multiprocessor system configurations. These results demonstrate that hyper-threading benefits can be achieved by exploiting nested parallelism through Intel compiler and runtime system support for OpenMP programs.  相似文献   

8.
何裕南  安虹  郭锐  梁博 《计算机科学》2007,34(1):248-254
CPU设计正在由仅开发指令级并行性的单线程单核结构转向利用线程级并行性的多线程多核结构,但至今还没有一个可移植性好并被广泛使用的开源多核处理器模拟器,限制了在这样的结构上开展高质量的研究工作。我们开发了一个多核处理器体系结构模拟器OpenCMP,用于支持当前和未来对多线程多核处理器体系结构关键技术的研究。该模拟器适当地抽象了多核处理器结构,为主流的多核处理器结构研究提供一个可扩展、灵活的模拟工具框架,包括支持对乱序、顺序的处理器核和同时多线程处理器核的模拟,以便对更大的多核设计空间进行比较性研究。本文以支持事务存储模型的多核处理器结构模拟器为例,详细描述了如何通过抽象多核结构和事务存储模型的最基本特性和组成部分,扩展单核处理器模拟器SimpleScalar,设计与实现一个多核处理器模拟器。初步研究表明,与现有的多核处理器模拟器相比,该模拟器能够较好地支持对事务存储模型和基于事务存储模型的多核处理器体系结构的研究.  相似文献   

9.
This paper presents the results of an experiment to measure empirically the remaining opportunities for exploiting loop-level parallelism that are missed by the Stanford SUIF compiler, a state-of-the-art automatic parallelization system targeting shared-memory multiprocessor architectures. For the purposes of this experiment, we have developed a run-time parallelization test called the Extended Lazy Privatizing Doall (ELPD) test, which is able to simultaneously test multiple loops in a loop nest. The ELPD test identifies a specific type of parallelism where each iteration of the loop being tested accesses independent data, possibly by making some of the data private to each processor. For 29 programs in three benchmark suites, the ELPD test was executed at run time for each candidate loop left unparallelized by the SUIF compiler to identify which of these loops could safely execute in parallel for the given program input. The results of this experiment point to two main requirements for improving the effectiveness of parallelizing compiler technology: incorporating control flow tests into analysis and extracting low-cost run-time parallelization tests from analysis results  相似文献   

10.
Exploiting the parallelism available in loops   总被引:1,自引:0,他引:1  
Lilja  D.J. 《Computer》1994,27(2):13-26
Because a loop's body often executes many times, loops provide a rich opportunity for exploiting parallelism. This article explains scheduling techniques and compares results on different architectures. Since parallel architectures differ in synchronization overhead, instruction scheduling constraints, memory latencies, and implementation details, determining the best approach for exploiting parallelism can be difficult. To indicate their performance potential, this article surveys several architectures and compilation techniques using a common notation and consistent terminology. First we develop the critical dependence ratio to determine a loop's maximum possible parallelism, given infinite hardware. Then we look at specific architectures and techniques. Loops can provide a large portion of the parallelism available in an application program, since the iterations of a loop may be executed many times. To exploit this parallelism, however, one must look beyond a single basic block or a single iteration for independent operations. The choice of technique depends on the underlying architecture of the parallel machine and the characteristics of each individual loop  相似文献   

11.
1.引言目前分布式处理系统的并行编译技术还处于研究阶段,对于实际应用程序还难以获得高性能。在程序中充分发掘并行性和极小化通信的开销是需要协同考虑的两个方面。考虑全局数据分布和计算分割时,在应用程序中仍然可能存在这样一些情况:在程序中按全局优化数据分布策略,某数组采用一种数据分布方式会取得好的效果,但对某些循环而言,该数  相似文献   

12.
This paper presents a time stamp algorithm for runtime parallelization of general DOACROSS loops that have indirect access patterns. The algorithm follows the INSPECTOR/EXECUTOR scheme and exploits parallelism at a fine-grained memory reference level. It features a parallel inspector and improves upon previous algorithms of the same generality by exploiting parallelism among consecutive reads of the same memory element. Two variants of the algorithm are considered: One allows partially concurrent reads (PCR) and the other allows fully concurrent reads (FCR). Analyses of their time complexities derive a necessary condition with respect to the iteration workload for runtime parallelization. Experimental results for a Gaussian elimination loop, as well as an extensive set of synthetic loops on a 12-way SMP server, show that the time stamp algorithms outperform iteration-level parallelization techniques in most test cases and gain speedups over sequential execution for loops that have heavy iteration workloads. The PCR algorithm performs best because it makes a better trade-off between maximizing the parallelism and minimizing the analysis overhead. For loops with light or unknown iteration loads, an alternative speculative runtime parallelization technique is preferred  相似文献   

13.
SIMD扩展部件是集成到通用处理器中的加速部件,旨在发掘多媒体和科学计算等领域程序的数据级并行.当前两种基本的向量发掘方法分别是发掘迭代间并行的Loop-based方法和发掘迭代内并行的SLP方法.Loop-aware方法是对SLP方法的改进,其思想是首先通过循环展开将迭代间并行转换为迭代内并行,使循环体内的同构语句条数足够多,再利用SLP方法进行向量发掘.但当循环展开不合法或者并行度低于向量化因子时,Loop-aware方法无法实现程序向量并行性的发掘.因此提出了向量并行度指导的循环向量化方法,依据迭代间并行度、迭代内并行度和向量化因子,构建循环向量化方法选择方案,同时提出不充分向量化方法发掘并行度低于向量化因子的循环向量并行性,最后依据向量并行度对生成的向量循环进行展开.经过标准测试集测试,向量并行度指导的循环SIMD向量化方法比Loop-aware方法识别率提升107.5%,性能提升12.1%.  相似文献   

14.
Research in the area of parallel evaluation mechanisms for logic programs have led to the proposal of a number of schemes exploiting various forms of parallelism. Many of the early models have been based on the conventional approach of organising concurrent components of computations as communicating processes. More recently, however, models based on more novel computation organisations, in particular, data-driven organisations, have been proposed. This paper describes the development of one such model, its implementation and the design of a data-driven machine to support it. The model exploits a form of parallelism known as OR-parallelism and is particularly suited to database applications, although it would also support general applications. It is envisaged that the proposed machine may be refined into an efficient database engine, which can then be a component of a more powerful and integrated logic programming machine.  相似文献   

15.
Symmetric multiprocessor systems are increasingly common, not only as high-throughput servers, but as a vehicle for executing a single application in parallel in order to reduce its execution latency. This article presents Pedigree, a compilation tool that employs a new partitioning heuristic based on the program dependence graph (PDG). Pedigree creates overlapping, potentially interdependent threads, each executing on a subset of the SMP processors that matches the thread’s available parallelism. A unified framework is used to build threads from procedures, loop nests, loop iterations, and smaller constructs. Pedigree does not require any parallel language support; it is post-compilation tool that reads in object code. The SDIO Signal and Data Processing Benchmark Suite has been selected as an example of real-time, latency-sensitive code. Its coarse-grained data flow parallelism is naturally exploited by Pedigree to achieve speedups of 1.63×/2.13× (mean/max) and 1.71×/2.41× on two and four processors, respectively. There is roughly a 20% improvement over existing techniques that exploit only data parallelism. By exploiting the unidirectional flow of data for coarse-grained pipelining, the synchronization overhead is typically limited to less than 6% for synchronization latency of 100 cycles, and less than 2% for 10 cycles. This research was supported by ONR contract numbers N00014-91-J-1518 and N00014-96-1-0347. We would like to thank the Pittsburgh Supercomputing Center for use of their Alpha systems.  相似文献   

16.
Most scientific and digital signal processing (DSP) applications are recursive or iterative. The execution of these applications on a chip multiprocessor (CMP) encounters two challenges. First, as most of the digital signal processing applications are both computation intensive and data intensive, an inefficient scheduling scheme may generate huge amount of write operation, cost a lot of time, and consume significant amount of energy. Second, because CPU speed has been increased dramatically compared with memory speed, the slowness of memory hinders the overall system performance. In this paper, we develop a Two-Level Partition (TLP) algorithm that can minimize write operation while achieving full parallelism for multi-dimensional DSP applications running on CMPs which employ scratchpad memory (SPM) as on-chip memory (e.g., the IBM Cell processor). Experiments on DSP benchmarks demonstrate the effectiveness and efficiency of the TLP algorithm, namely, the TLP algorithm can completely hide memory latencies to achieve full parallelism and generate the least amount of write operation to main memory compared with previous approaches. Experimental results show that our proposed algorithm is superior to all known methods, including the list scheduling, rotation scheduling, Partition Scheduling with Prefetching (PSP), and Iterational Retiming with Partitioning (IRP) algorithms. Furthermore, the TLP scheduling algorithm can reduce write operation to main memory by 45.35% and reduce the schedule length by 23.7% on average compared with the IRP scheduling algorithm, the best known algorithm.  相似文献   

17.
Current trend of research on multithreading processors is toward the chip multithreading (CMT), which exploits thread level parallelism (TLP) and improves performance of softwares built on traditional threading components, e.g., Pthread. There exist commercially available processors that support simultaneous multithreading (SMT) on multicore processors. But they are basically based on the conventional sequential execution model, and execute multiple threads in parallel under the control of OS that handles interruptions. Moreover, there exist few languages or programming techniques to utilize the multicore processors effectively. We are taking another approach to develop a multithreading processor, which is dedicated to TLP. Our processor, named Fuce, is based on the continuation-based multithreading. A thread is defined as a block of sequentially ordered instructions which are executed without interruption. Every thread execution is triggered only by the event called continuation. This paper first introduces the continuation-based multithread execution model and its processor architecture then gives multithreaded programming techniques and the continuation-based multithreading language system CML. Last, the performance of the Fuce processor is evaluated by means of the clock-level software simulation.  相似文献   

18.
面向SCMP的多线程前瞻控制分析与设计   总被引:1,自引:0,他引:1       下载免费PDF全文
单芯片多处理器一直是处理器微体系结构发展的一个热点。对于通用串行应用程序,高效的线程控制方法是实现线程级前瞻、挖据线程级并行性的一个重要组成部分。本文结合一个具体的SCMP模型即Griffon,提出并实现了一种简单、高效的分布式线程控制方法。该方法易于实现,可扩展性强。实验结果表明,线程的控制可以在数个周期内实现
,能够满足片内并行处理的要求  相似文献   

19.
为挖掘可重构处理器的内在并行性,需要编译器通过分析程序的并行性来决定可重构处理器硬件最好的执行模式。为此,提出一种基于可重构处理器的并行优化算法。将有向无环图的并行计算部分映射到可重构处理器上,对任务实现3个不同层次的并行性(指令级并行、循环级并行、线程级并行)。测试结果表明,该算法使得可重构处理器在处理任务时比未用并行优化算法的性能提升1.2倍左右。  相似文献   

20.
循环分块是一种广泛用于改善数据局部性和开发并行性的程序变换优化技术.主要分为2类:固定分块技术和参数化分块技术,系统地总结了这2类技术,并分析了其优缺点.由于分块大小的选择会严重影响分块代码的性能,因此介绍分析了选择最优分块大小的各种方法.此外,总结了循环分块在多级分块、并行性开发和不完美嵌套循环等方面应用的各项技术.通过对循环分块技术当前研究现状的分析,得出如下结论:1)循环分块技术中的计算复杂度和生成代码效率问题还未得到完全解决,如何利用循环边界有效地约束迭代空间并提高数据局部性还需要更深入的研究;2)最优分块大小的选择依然是一个开放式难题,研究清楚分级存储架构中每级分块对性能的影响具有重要的意义;3)从循环分块的应用角度,如何有效地构建面向任意嵌套循环集的自动分块代码生成系统,同时充分利用深度共享存储资源和多核架构实现分块代码的高并行度,也是一个需要深入研究的问题.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号