Similar Literature
18 similar documents found (search time: 171 ms)
1.
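Research and Design of a Parallel Programming Model for Multicore Processors   Cited by: 2 (self-citations: 0, citations by others: 2)
To make full use of multicore resources and improve program performance on multicore processors, this work studies multicore processor architecture and the factors that may affect the performance of parallel programs in a multicore environment, and implements a task-based parallel programming model. The model provides two forms of parallel processing: single-task data parallelism and multi-task parallelism. Single-task data parallelism partitions the data set using cache blocking, and multi-task parallelism schedules tasks using work stealing. A recursive algorithm for computing the Fibonacci sequence was implemented with the model, and the experimental results show that multicore parallel programs written with it achieve a high speedup relative to serial computation.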
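The work-stealing discipline named in this abstract can be illustrated with a minimal, single-threaded Python simulation. It is a sketch under stated assumptions, not the paper's implementation: the function name `fib_work_stealing`, the round-robin worker loop, and the "first non-empty deque" victim rule are hypothetical choices made only for illustration. Each task is an integer `n`; a task with `n >= 2` spawns the tasks `n - 1` and `n - 2`, and the leaf values sum to the Fibonacci number.

```python
from collections import deque


def fib_work_stealing(n, num_workers=4):
    """Simulate work stealing; return (fib(n), number of steals).

    Assumption: workers take turns in a fixed round-robin order so the
    simulation is deterministic. A real runtime would use threads.
    """
    deques = [deque() for _ in range(num_workers)]
    deques[0].append(n)  # the root task starts on worker 0
    total = 0
    steals = 0
    while any(deques):  # an empty deque is falsy
        for w in range(num_workers):
            if deques[w]:
                # The owner takes its newest task (LIFO end of its deque).
                task = deques[w].pop()
            else:
                # An idle worker steals the oldest task (FIFO end) of a victim.
                victim = next((v for v in range(num_workers) if deques[v]), None)
                if victim is None:
                    continue
                task = deques[victim].popleft()
                steals += 1
            if task < 2:
                total += task  # leaf: fib(0) = 0, fib(1) = 1
            else:
                deques[w].append(task - 1)
                deques[w].append(task - 2)
    return total, steals
```

With a single worker no steal can occur, so `fib_work_stealing(10, 1)` returns `(55, 0)`. With several workers the sum is unchanged and only the distribution of tasks differs, which is the property a work-stealing scheduler relies on.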

2.
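Iteration-Space Alternate Tiling Parallel Gauss-Seidel Algorithm   Cited by: 1 (self-citations: 0, citations by others: 1)
胡长军, 张纪林, 王珏, 李建江. Journal of Software (《软件学报》), 2008, 19(6): 1274-1282
To address the poor data locality and the high synchronization and communication overhead of parallel Gauss-Seidel (GS) iteration, the traditional GS iteration is first improved into a multi-layer symmetric GS iteration algorithm. A serial execution model is then given that uses the order of iteration-space tiles as the execution order. By applying a time-skewed partitioning to the iteration space and performing several iterations inside each tile, the model improves the data locality of the algorithm. Finally, a parallel execution model based on iteration-space tiles is proposed. It improves the grid partitioning of the iteration space and reorders the grid tiles to reduce the cache miss rate, the number of communication start-ups, and the number of synchronizations. Experimental results show that the alternate-tiling parallel algorithm achieves better parallel efficiency and scalability than the traditional domain decomposition method and the red-black ordering parallel algorithm.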
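The idea of time-skewed tiling of the iteration space can be sketched on a one-dimensional in-place Gauss-Seidel sweep. This is a simplified illustration under stated assumptions, not the multi-layer symmetric algorithm of the paper: the function names `gs_naive` and `gs_time_skewed`, the averaging update, and the skew `j = i + t` are choices made for this example. After skewing, every dependence points forward in both `t` and `j`, so tiles of `j` can be executed one after another, each performing all time steps while its data is still in cache.

```python
def gs_naive(u, steps):
    """Run `steps` in-place Gauss-Seidel sweeps over the interior of u."""
    u = list(u)
    n = len(u)
    for _ in range(steps):
        for i in range(1, n - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1])
    return u


def gs_time_skewed(u, steps, tile=4):
    """Same computation, executed tile by tile in the skewed space (t, j = i + t)."""
    u = list(u)
    n = len(u)
    j_end = n + steps - 2  # exclusive bound: largest j is (n - 2) + (steps - 1)
    for j0 in range(1, j_end, tile):          # one tile of skewed columns
        for t in range(steps):                # all time steps inside the tile
            for j in range(j0, min(j0 + tile, j_end)):
                i = j - t                     # undo the skew
                if 1 <= i <= n - 2:           # skip points outside the grid
                    u[i] = 0.5 * (u[i - 1] + u[i + 1])
    return u
```

Because each grid point is updated from exactly the same operands in both orders, the two functions return identical lists for any tile size; only the traversal order, and therefore the reuse of recently touched elements, changes.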

3.
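To make full use of the hardware resources and computing power of multicore processors, an optimization scheme is proposed that applies multicore parallel programming to a Chinese word segmentation program. Based on the characteristics of the forward maximum matching algorithm for Chinese word segmentation, the traditional serial program is converted into a parallel one. Drawing on multicore parallel programming patterns, a hybrid parallel programming pattern is designed; Intel's performance analysis tools are used to locate the algorithm's hotspots and bottlenecks, which are then optimized. Experimental results show that the optimized program runs in 50% to 60% less time than the original serial program, with a corresponding improvement in speedup.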
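Forward maximum matching, the segmentation algorithm this abstract refers to, can be sketched in a few lines of Python. The code is an illustration under stated assumptions, not the authors' program: the names `forward_max_match` and `segment_parallel`, the fallback to single characters, and the per-sentence decomposition are hypothetical. The thread pool only shows how independent sentences could be distributed; in CPython the global interpreter lock limits the speedup of pure-Python code.

```python
from concurrent.futures import ThreadPoolExecutor


def forward_max_match(text, dictionary, max_len=None):
    """Greedily take the longest dictionary word starting at each position."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def segment_parallel(sentences, dictionary, workers=4):
    """Segment independent sentences concurrently (assumed decomposition)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: forward_max_match(s, dictionary), sentences))
```

For the dictionary `{"研究", "研究生", "生命", "的", "起源"}`, `forward_max_match("研究生命的起源", ...)` returns `["研究生", "命", "的", "起源"]`, which shows the greedy behaviour of the method: the longest match `研究生` wins even though `研究` followed by `生命` would be the better segmentation.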

4.
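岳峰, 庞建民, 张一弛, 余勇. Computer Engineering (《计算机工程》), 2012, 38(24): 279-282
Compared with porting traditional serial programs, porting code between parallel systems is extremely complex because of the large differences between architectures. To port Compute Unified Device Architecture (CUDA) programs to other heterogeneous multicore platforms, a mapping scheme from the CUDA architecture to Cell is proposed. Through model mapping, coarsening of parallel granularity, elimination of shared variables, and runtime optimization, the massively parallel threads of a CUDA program can execute correctly on the Cell platform. Experimental results show that the translated programs reach 72% of the execution efficiency of programs written by hand for the Cell platform.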

5.
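Cache Optimization for Inter-Loop Data Reuse on Multiprocessor Systems   Cited by: 2 (self-citations: 0, citations by others: 2)
Caches ease the large speed gap between the CPU and main memory, and they also make the cache hit rate an important factor in the performance of multiprocessor systems. Researchers have actively explored how to strengthen data locality and raise the cache hit rate so that multiprocessor systems perform better. Past work, however, has concentrated on strengthening data locality within parallel loops and on reducing or eliminating the cache thrashing caused by true and false sharing of cache lines within parallel loops; the exploitation of data reuse across loops on multiprocessor systems has rarely been discussed. This paper analyzes and discusses how to exploit such inter-loop data reuse and proposes several practical, easily implemented methods. These methods …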

6.
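The characteristics of stream applications, and the shortcomings of traditional processors in handling them, have made the design of stream processors that support data parallelism a hot topic in computer architecture research. Targeting the architecture of the Imagine stream processor, this paper proposes two ways of optimizing stream organization: stream partitioning and stream compression. Simulation results show that stream partitioning and stream compression let stream applications make full use of Imagine's parallel structure, pipeline structure, and multi-level bandwidth memory hierarchy, thereby reducing the execution time of stream programs.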

7.
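Parallel programming is generally divided into two models: data parallelism and message passing. Of the two, message passing is more widely used. An automatic parallelization tool targeting message-passing FORTRAN (MPF) can greatly relieve the programming burden on users and is of considerable practical value. Iteration partitioning and locality analysis are important parts of automatic parallelization. This paper describes the iteration partitioning, array access locality analysis, and communication optimization analysis in FAX, an automatic parallelization tool that converts serial FORTRAN programs into MPF.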

8.
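Based on the Imagine architecture, a locality optimization method for scientific programs is proposed that aims to raise the bandwidth utilization of stream programs while sustaining Imagine's high computing power. The key technique is to exploit the architecture's strengths through computation and data transformations of loops. Experiments on four typical scientific programs show that the optimization effectively increases computational intensity and reduces index streams, thereby improving program locality.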

9.
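刘颖, 黄磊, 吕方, 崔慧敏, 王蕾, 冯晓兵. Journal of Software (《软件学报》), 2016, 27(8): 2168-2184
With the rapid development of heterogeneous architectures, it is very important to rely on the compiler to exploit the data locality of applications and take full advantage of the on-chip caches of accelerator devices. Traditional reuse distance, however, faces the challenge of platform differences in a heterogeneous setting and lacks a unified computation framework. To better characterize and optimize the locality of heterogeneous programs, a reuse distance computation mechanism and a data layout optimization framework that are uniform across platforms are established. Based on how an application executes in parallel on a heterogeneous architecture, the framework proposes a relaxed reuse distance defined in a statistical-average sense and, taking OpenCL programs as an example, gives a method for computing it, providing a unified basis for data layout optimization decisions across platforms. To validate the method, experiments were conducted on three platforms (Intel Xeon Phi, AMD Opteron CPU, and Tilera TileGX-36); the results show that the method achieves an average speedup of at least 1.14x across the platforms.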
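For orientation, the traditional reuse distance that this abstract takes as its starting point can be computed as follows. The sketch shows only the classic definition (the number of distinct addresses touched between two consecutive accesses to the same address); it does not implement the relaxed reuse distance of the paper, whose exact formulation is not given in the abstract. The function names `reuse_distances` and `lru_hits` are hypothetical, and the quadratic-time scan is chosen for clarity rather than speed.

```python
def reuse_distances(trace):
    """Return the classic reuse distance of every access (None on first use)."""
    last_seen = {}
    distances = []
    for pos, addr in enumerate(trace):
        if addr in last_seen:
            # Distinct addresses touched strictly between the two accesses.
            between = trace[last_seen[addr] + 1:pos]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[addr] = pos
    return distances


def lru_hits(trace, capacity):
    """Count hits in a fully associative LRU cache holding `capacity` addresses.

    An access hits exactly when its reuse distance is smaller than the capacity.
    """
    return sum(1 for d in reuse_distances(trace) if d is not None and d < capacity)
```

For the trace `a b c a b b`, `reuse_distances` returns `[None, None, None, 2, 2, 0]`, so a cache of capacity 2 sees one hit and a cache of capacity 3 sees three. This link between the distance histogram and the hit count is what makes reuse distance useful for data layout decisions.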

10.
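陈俊朴. Computer Engineering (《计算机工程》), 2009, 35(10): 33-36
Network processors have a parallel architecture, whereas their high-level languages usually have sequential semantics. Compiling a sequential program for parallel execution requires introducing synchronization, and the quality of that synchronization affects the efficiency of the generated code. For programs on network processors, a program partitioning algorithm that optimizes synchronization is proposed to increase program parallelism. Experimental data show that, on several representative network applications, the algorithm increases program parallelism and improves performance.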

11.
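Efficient parallel finite-difference stencil algorithms are very important for solving large systems of linear equations. To address the poor data locality and the high synchronization and communication overhead of parallel finite-difference stencil algorithms, the traditional algorithm is first improved into a multi-layer symmetric-traversal finite-difference stencil algorithm. A serial algorithm is then given that uses the order of iteration-space tiles as the execution order: by time-skewed partitioning of the iteration space along the time axis, several iterations are performed inside each tile without changing the properties of the iterative algorithm, which improves data locality. Finally, a parallel algorithm based on iteration-space tiles is proposed; it partitions the iteration-space grid using an improved polyhedral model and reorders the grid tiles to reduce the cache miss rate, the number of communication start-ups, and the number of synchronizations. Theoretical analysis and experimental results show that the parallel model has better data locality, parallel efficiency, and scalability than the traditional domain decomposition method and the red-black ordering parallel algorithm.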

12.
This paper presents a new method that can be applied by a parallelizing compiler to find, without user intervention, the iteration and data decompositions that minimize communication and load imbalance overheads in parallel programs targeted at NUMA architectures. One of the key ingredients in our approach is the representation of locality as a locality-communication graph (LCG) and the formulation of the compiler technique as a mixed integer nonlinear programming (MINLP) optimization problem on this graph. The objective function and constraints of the optimization problem model communication costs and load imbalance. The solution to this optimization problem is a decomposition that minimizes the parallel execution overhead. This paper summarizes how the compiler extracts the locality information from nonannotated code and focuses on how the compiler can derive the optimization problem, solve it, and generate the parallel code with the automatically selected iteration and data distributions. In addition, we include a discussion of our model and the solutions (the decompositions) that it provides. The approach presented in the paper is evaluated using several benchmarks. The experimental results demonstrate that the MINLP formulation does not increase compilation time significantly and that our framework generates very efficient iteration/data distributions for a variety of NUMA machines.

13.
Exploiting cache locality of parallel programs at runtime is a complementary approach to compiler optimization. This is particularly important for applications with dynamic memory access patterns. We propose a memory-layout oriented technique to exploit cache locality of parallel loops at runtime on Symmetric Multiprocessor (SMP) systems. Guided by application-dependent and targeted architecture-dependent hints, our system, called Cacheminer, reorganizes and partitions a parallel loop using the memory-access space of its execution. Through effective runtime transformations, our system maximizes the data reuse in each partitioned data region assigned to a cache, and minimizes the data sharing among the partitioned data regions assigned to all caches. The executions of tasks in the partitions are scheduled in an adaptive and locality-preserving way to minimize the execution time of programs by trading off load balance and locality. We have implemented the Cacheminer runtime library on two commercial SMP servers and a SimCS-simulated SMP. Our simulation and measurement results show that our runtime approach can achieve performance comparable to compiler optimizations for programs with regular computation and memory-access patterns, whose load balance and cache locality can be well optimized by tiling and other program transformations. However, our experimental results show that our approach is able to significantly improve the memory performance of applications with irregular computation and dynamic memory access patterns. These types of programs are usually hard to optimize by static compiler optimizations.

14.
Dynamic programming (DP) is a popular technique used to solve combinatorial search and optimization problems. This paper focuses on one type of DP, called nonserial polyadic dynamic programming (NPDP). Owing to the nonuniform data dependencies of NPDP, it is difficult to exploit either parallelism or locality. Worse still, the emerging multi/many-core architectures with small on-chip memory make these issues more challenging. In this paper, we address the challenges of exploiting the fine-grain parallelism and locality of NPDP on multicore architectures. We describe a latency-tolerant model and a percolation technique for programming on multicore architectures. On an algorithmic level, both parallelism and locality benefit from a specific data dependence transformation of NPDP. Next, we propose a parallel pipelining algorithm by decomposing computation operators and percolating data through a memory hierarchy to create just-in-time locality. In order to predict the execution time, we formulate an analytical performance model of the parallel algorithm. The parallel pipelining algorithm achieves not only high scalability on the 160-core IBM Cyclops64, but also portable performance across the 8-core Sun Niagara and the quad-core Intel Clovertown.

15.
Because multicore CPUs have become the standard with all major hardware manufacturers, it becomes increasingly important for programming languages to provide programming abstractions that can be mapped effectively onto parallel architectures. Stream processing is a programming paradigm where computations are expressed as independent actors that communicate via FIFO data-channels. The coarse-grained parallelism exposed in stream programs facilitates such an efficient mapping of actors onto the underlying multicore hardware. We propose a stream-parallel programming abstraction that extends object-oriented languages with stream-programming facilities. StreamPI consists of a class hierarchy for actor-specification together with a language-independent runtime system that supports the execution of stream programs on multicore architectures. We show that the language-specific part of StreamPI, i.e., the class hierarchy, can be implemented as a library-level programming language extension. A library-level extension has the advantage that an existing programming language implementation need not be touched. Legacy code can be mixed with a stream-parallel application, and the use of sequential legacy code with actors is supported. Unlike previous approaches, StreamPI allows dynamic creation and subsequent execution of stream programs. StreamPI actors are typed. Type-safety is achieved through type-checks at stream graph creation time. We have implemented StreamPI's language-independent runtime system and language interfaces for Ada 2005 and C++ for Intel multicore architectures. We have evaluated StreamPI for up to 16 cores on a two-CPU 8-core Intel Xeon X7560 server, and we provide a performance comparison with StreamIt (Gordon et al. in International Conference on Architectural Support for Programming Languages and Operating Systems, 2006), which is the de facto standard for stream-parallel programming. Although our approach provides greater programming flexibility than StreamIt, the performance of StreamPI compares favorably to the static compilation model of StreamIt.

16.
Tiled multi-core architectures have become an important kind of multi-core design because of their good scalability and low power consumption. Stream programming has been productively applied to a number of important application domains and provides an attractive way to exploit parallelism. However, the architectural characteristics of tiled multi-cores (large numbers of cores, a memory hierarchy, and exposed communication between tiles) present a performance challenge for stream programs. In this paper, we present StreamTMC, an efficient stream compilation framework that optimizes the execution of stream applications for tiled multi-cores. The framework is composed of three optimization phases. First, a software pipelining schedule is constructed to exploit the parallelism. Second, an efficient hybrid SPM and cache buffer allocation algorithm and a data copy elimination mechanism are proposed to improve the efficiency of data access. Last, a communication-aware mapping is proposed to reduce the network communication and synchronization overhead. We implement the StreamTMC compiler on Godson-T, a 64-core tiled architecture, and conduct an experimental study to verify its effectiveness. The experimental results indicate that StreamTMC achieves an average improvement of 58% over the performance before optimization.

17.
Suggestions for locality optimizations (SLO), a cache profiling tool, analyzes runtime reuse paths to find the root causes of poor data locality, and suggests the most promising code optimizations. Refactoring using the hints of the SLO analyzer doubles the average execution speed of several SPEC2000 benchmark programs.

18.
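With the coming of the exascale supercomputing era, power efficiency has become the most important obstacle to building an exascale system. Dataflow architecture has native advantages in achieving high power efficiency for scientific applications. However, state-of-the-art dataflow architectures fail to exploit high parallelism for loop processing. To address this issue, we propose a pipelining loop optimization method (PLO), which makes loop iterations flow through the processing element (PE) array of a dataflow accelerator. The method consists of two techniques: architecture-assisted hardware iteration and instruction-assisted software iteration. In the hardware iteration execution model, an on-chip loop controller is designed to generate loop indexes, which reduces the complexity of the computing kernel and lays a good foundation for pipelined execution. In the software iteration execution model, additional loop instructions are introduced to solve the iteration dependency problem. With these two techniques, the average number of instructions ready to execute per cycle is increased, keeping the floating-point units busy. Simulation results show that the proposed method outperforms the static and dynamic loop execution models in floating-point efficiency by 2.45x and 1.1x on average, respectively, while the hardware cost of the two techniques is acceptable.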
