Similar literature
 19 similar articles found
1.
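Thread-Level Speculation (TLS) is a thread-level automatic parallelization technique for accelerating sequential programs on multicore processors. Loops have a regular structure and account for a large share of execution time, which makes them ideal candidates for exploiting parallelism. However, deciding which loops to parallelize in order to improve a program's speedup is difficult. To address this problem, this paper proposes a loop selection method based on performance prediction. Profiling information gathered from a pre-execution of the program on a training input set is combined with various speculation factors to build a performance prediction model for loop structures. The prediction quantitatively estimates the speedup of speculatively parallelizing a loop and determines whether the loop is suitable for parallel execution at run time. Experimental results show that the proposed method effectively predicts the parallelism available in loops and, based on the predictions, accurately selects the loops that benefit from speculative parallelization; on the Olden benchmark suite, speedup improved by 12.34% on average.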

2.
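For the self-dependent loop structures common in CFD programs, this article analyzes why wavefront parallelization cannot be applied to them. Based on the nature of the dependences involved, it proposes a mirror decomposition technique for self-dependent loops. By eliminating cross-iteration anti-dependences, the technique enables wavefront parallelization of self-dependent loop structures and thereby parallelizes these loops.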

3.
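Based on the characteristics of the angular mode algorithm in HEVC intra prediction, a parallelization method for the angular prediction modes is proposed. Using the BWDSP1041 simulation platform, the method analyzes the parallelizability of the angular mode algorithm, proposes a data allocation scheme suited to parallel computation on multiple multipliers, and, taking into account the hardware resources of the processor, designs an algorithm in which multiple arithmetic units work in parallel. Experimental results show that, when implemented in parallel on the hardware resources of the BWDSP1041, angular prediction mode 20 and vertical mode 26 reach parallel speedups of 161.68 and 344.65, respectively. The parallel algorithm reduces video encoding time, and its data allocation scheme offers a useful reference for research on parallelizing intra prediction algorithms on multicore and multi-arithmetic-unit architectures.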

4.
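A parallel database system is a new generation of high-performance database system built on massively parallel processors (MPP) and cluster parallel computing environments. The technology originated in the database machine research of the 1970s, which concentrated on parallelizing relational algebra operations and on designing special-purpose hardware for relational operations; that research eventually ended in failure. From the 1990s to the present, with advances in processors, storage, networking, and other underlying technologies, the focus of parallel database research has shifted to the temporal and spatial parallelism of data operations. Operations such as loading data, building indexes, and executing queries are parallelized by using multiple CPUs and disks in parallel, which improves the performance of the database system. The two most critical aspects are parallelism and distribution.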

5.
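Fan Zhihua (范植华), Fan Lu (范路). 《电子学报》 (Acta Electronica Sinica), 1999, 27(8): 120-122
Multiway control transfers, represented by statements such as CASE in PASCAL, SWITCH in C, and the computed GOTO in FORTRAN, are among the most complex control structures in programming languages. To date, both in China and abroad, these constructs, whether on their own or combined with unconditional GOTO, have been excluded from parallelism detection; that is, they are unconditionally kept sequential, which forfeits the considerable parallel potential of the hardware. Through parallelizing restructuring, this paper equivalently eliminates the various forms of multi-valued logic, then performs parallelism analysis on them and extracts all of the latent parallelism hidden within them.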

6.
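Qi Jing (亓静), Liu Ping (刘萍). 《现代电子技术》 (Modern Electronics Technique), 2009, 32(14): 10-13
Exploiting the parallelism of FPGAs, this paper proposes a structure suited to parallel hardware implementation for character segmentation, an important component of license plate recognition systems, and builds and optimizes the model in System Generator. The parallel operation is divided into two phases. In the first phase, the upper and lower character boundaries are found by loop iteration. In the second phase, the enable control derived from the upper and lower boundary positions and the control signal for the character segmentation positions act in parallel on the data path to produce the valid pixels. Hardware simulation results meet the timing requirements, confirming the feasibility of the structure. Because of the parallel logic, the implementation is much faster, which demonstrates the considerable advantage of FPGA parallelism for improving performance.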

7.
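Fan Zhihua (范植华), Fan Lu (范路). 《电子学报》 (Acta Electronica Sinica), 1999, 27(8): 120-122
Multiway control transfers, represented by statements such as CASE in PASCAL, SWITCH in C, and the computed GOTO in FORTRAN, are among the most complex control structures in programming languages. To date, both in China and abroad, these constructs, whether on their own or combined with unconditional GOTO, have been excluded from parallelism detection; that is, they are unconditionally kept sequential, which forfeits the considerable parallel potential of the hardware. Through parallelizing restructuring, this paper equivalently eliminates the various forms of multi-valued logic, then performs parallelism analysis on them and extracts all of the latent parallelism hidden within them.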

8.
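9905156 Application of dependence distance in the parallelizing restructuring of loop statements [journal article] / Zhou Peng (周鹏) // 《计算机工程与应用》 (Computer Engineering and Applications). 1998, 34(8): 54-56 (C)
L is a sequentially executed DO loop that contains assignment statements or IF-THEN-ELSE conditional statements. Through data dependence analysis and computation of the dependence distance, the parallelism inherent in L can be extracted, and L can be transformed fully or partially into a DOALL loop.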
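The dependence-distance idea in this abstract can be illustrated with a minimal sketch. It assumes a single-statement loop of the hypothetical form `a[i + w] = a[i + r] + 1`; the function names and the loop form are illustrative assumptions and are not taken from the paper. A distance of zero means no loop-carried dependence (a full DOALL); a nonzero distance `d` means that blocks of `d` consecutive iterations are mutually independent (a partial transformation).

```python
def classify_loop(write_offset, read_offset, n_iterations):
    """Classify the hypothetical loop a[i + w] = a[i + r] + 1.

    Returns (kind, width): width is how many consecutive iterations
    may run concurrently without violating the dependence.
    """
    d = abs(write_offset - read_offset)  # dependence distance in iterations
    if d == 0 or d >= n_iterations:
        return ("DOALL", n_iterations)   # no dependence is carried by the loop
    return ("partial", d)                # blocks of d iterations are independent


def run_sequential(a, w, r, lo, hi):
    """Reference: execute the loop in its original sequential order."""
    a = list(a)
    for i in range(lo, hi):
        a[i + w] = a[i + r] + 1
    return a


def run_in_blocks(a, w, r, lo, hi, block):
    """Simulate parallel execution of each block: all reads, then all writes."""
    a = list(a)
    for start in range(lo, hi, block):
        idx = range(start, min(start + block, hi))
        reads = [a[i + r] for i in idx]   # every iteration of the block reads first
        for i, v in zip(idx, reads):      # then every iteration writes
            a[i + w] = v + 1
    return a
```

Running `run_in_blocks` with the width returned by `classify_loop` gives the same array as `run_sequential`; a larger block would read stale values.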

9.
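Based on the characteristics of the Planar and DC mode algorithms in HEVC intra prediction, a parallelization method for these two modes is proposed. The method analyzes and derives the parallelizability of the Planar and DC mode algorithms and builds on PAAG (Polymorphic Array Architecture for Graphics and Image Processing), an array processor for graphics and image applications designed in-house at Xi'an University of Posts and Telecommunications. Using an optimal data allocation scheme, it designs an algorithm in which multiple processing elements work in parallel. Experimental results show that the parallel implementations of the Planar and DC prediction modes on multiple processing elements are 84% and 81% faster, respectively, than serial execution on a single core, with serial/parallel speedups of 6.34 and 5.44. The parallel algorithm reduces video encoding and decoding time, and its data allocation scheme offers a useful reference for research on parallelizing intra prediction algorithms on multicore architectures.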

10.
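TP31 01061771 Mining parallelism in software processes / Li Tong (李彤), Wang Lixia (王黎霞) (Yunnan University) // 《计算机应用与软件》 (Computer Applications and Software). 2001, 18(5): 27-31
Mining the parallelism in a software process, so that its activities proceed in parallel as far as possible, is an important means of raising software productivity. The paper proposes a technique that analyzes the dependences among activities to find the parallelizable factors in a software process, extracts the activities that can proceed in parallel, and then constructs a parallelized software process model expressed as a Petri net; it achieves fairly satisfactory parallelism mining results. 5 figures, 11 references.
[Fragment of an adjacent record:] … characterizes components in three respects, including behavior. With JB/5 ADL, software architectures can be conveniently constructed, refined, and verified; it also provides rapid prototyping and supports automatic generation of code frameworks and of the system architecture …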

11.
To speed up data-intensive programs, two complementary techniques, nested-loop parallelization and data locality optimization, should be considered. Effective parallelization distributes the computation and the data it needs across different processors, whereas data locality optimization keeps data on the same processor. Locality and parallelization may therefore demand different loop transformations, and an integrated approach that combines the two can produce much better results than either approach alone. This paper proposes a unified approach that integrates the two techniques to obtain an appropriate loop transformation. Applying this transformation yields coarse-grain parallelism by exploiting the largest possible groups of outer permutable loops, and data locality through dependence satisfaction at the inner loops. These groups can be further tiled to improve data locality by exploiting data reuse in multiple dimensions.

12.
In this paper, we consider programmable tightly-coupled processor arrays consisting of interconnected small light-weight VLIW cores, which can exploit both loop-level parallelism and instruction-level parallelism. These arrays are well suited for compute-intensive nested loop applications and often provide higher power and area efficiency than commercial off-the-shelf processors. They are ideal candidates for accelerating the computation of nested loop programs in future heterogeneous systems, where energy efficiency is one of the most important design goals for overall system-on-chip design. In this context, we present a novel design methodology for the mapping of nested loop programs onto such processor arrays. Key features of our approach are: (1) design entry in the form of a functional programming language and loop parallelization in the polyhedron model, and (2) support of zero-overhead looping not only for innermost loops but also for arbitrarily nested loops. Processors of such arrays are often limited in instruction memory size to reduce area and power consumption. Hence, (3) we present methods for code compaction and code generation and integrate these methods into a design tool. Finally, (4) we evaluate selected benchmarks by comparing our code generator with the Trimaran and VEX compiler frameworks. As the results show, our approach can reduce the size of the generated processor code by up to 64% (Trimaran) and 55% (VEX) while at the same time achieving a significantly higher throughput.

13.
Multimedia SIMD extensions are commonly employed today to speed up media processing. When performing vectorization for SIMD architectures, one of the major issues is handling memory alignment. Prior studies focused either on vectorizing loops in which all memory references are properly aligned, or on introducing extra operations to deal with misaligned memory references. On the other hand, multi-core SIMD architectures require coarse-grain parallelism. It is therefore important to study how to parallelize and vectorize loop nests with awareness of data misalignment. This paper presents a loop transformation scheme that maximizes the parallelism of outermost loops while reducing the misaligned memory references in innermost loops. The basic idea of our technique is to align each level of loops in the nest, subject to the constraints of the dependence relations. To reduce data misalignment, we establish a mathematical model based on the concept of an offset-collection and propose an effective heuristic algorithm. For coarser-grain parallelism, we propose rules for analyzing the outermost loop. When transformations are applied, the inner loops are involved to maximize the parallelism. To avoid introducing more data misalignments, the involved innermost loop is handled from other levels of loops. Experimental results show that misaligned memory references can be reduced by 7% to 37% (18.4% on average). Simulations on CELL show that a 1.1x speedup can be reached by reducing the misaligned data, while a 6.14x speedup can be achieved by enhancing the parallelism for multi-core.

14.
In the area of automatic parallelization of programs, analyzing and transforming loop nests with parametric affine loop bounds requires fundamental mathematical results. The most common geometrical model of iteration spaces, called the polytope model, is based on mathematics dealing with convex and discrete geometry, linear programming, combinatorics and geometry of numbers. In this paper, we present automatic methods for computing the parametric vertices and the Ehrhart polynomial, i.e., a parametric expression of the number of integer points, of a polytope defined by a set of parametric linear constraints. These methods have many applications in the analysis and transformation of nested loop programs. The paper is illustrated with exact symbolic array dataflow analysis, estimation of execution time, and the computation of the maximum available parallelism of given loop nests.
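To make the notion of an Ehrhart polynomial concrete, the sketch below uses a hypothetical triangular iteration space `0 <= j <= i <= N`, which is an assumption chosen for illustration and is not an example from the paper. The number of integer points in this polytope is the parametric expression (N + 1)(N + 2) / 2, and the brute-force count of loop iterations agrees with it.

```python
def count_iterations(n: int) -> int:
    """Count the iterations of: for i in 0..n: for j in 0..i (brute force)."""
    return sum(1 for i in range(n + 1) for j in range(i + 1))


def ehrhart_triangle(n: int) -> int:
    """Closed-form point count for the triangle 0 <= j <= i <= n."""
    return (n + 1) * (n + 2) // 2
```

A closed form of this kind gives the iteration count of a loop nest as a function of the parameter without enumerating the iterations.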

15.
Most hardware compilers apply loop pipelining to increase the parallelism achieved, but pipelining is restricted to only the innermost level of a nested loop. In this work we extend and adapt an existing outer loop pipelining approach known as single dimension software pipelining to generate schedules for field-programmable gate-array (FPGA) hardware coprocessors. Each loop level in nine test loops is pipelined and the resulting schedules are implemented in VHDL and targeted to an Altera Stratix II FPGA. The results show that the fastest solution for all but one of the loops occurs when pipelining is applied one to three levels above the innermost loop. Across the nine test loops we achieve an acceleration over the innermost loop solution of up to seven times, with a mean speedup of 3.2 times. The results suggest that inclusion of outer loop pipelining in future hardware compilers may be worthwhile, as it can allow significantly improved results to be achieved at the cost of a small increase in compile time.

16.
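This paper revisits the loop skewing transformation proposed by Wolfe in 1986. By introducing the concepts of the dependence distance matrix and the dependence direction matrix, it gives a general method for applying the skewing transformation to perfectly nested multi-level loops. It then analyzes the effect of loop skewing on parallelism and data locality, and finally discusses its relationship to other transformation techniques.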
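A minimal sketch of how loop skewing acts on dependence distance vectors follows. It assumes a hypothetical two-level stencil `a[i][j] = a[i-1][j] + a[i][j-1]` with distance vectors (1, 0) and (0, 1); the stencil and all function names are illustrative assumptions, not material from the paper. Skewing the inner index by the outer one and then interchanging the loops leaves every dependence carried by the outer loop, so the inner loop runs as a wavefront of independent iterations.

```python
def skew(d, factor=1):
    """Skew the inner index by the outer one: (di, dj) -> (di, dj + factor * di)."""
    di, dj = d
    return (di, dj + factor * di)


def interchange(d):
    """Swap the two loop levels."""
    return (d[1], d[0])


def inner_loop_parallel(distances):
    """The inner loop is parallel if the outer loop carries every dependence."""
    return all(d[0] > 0 for d in distances)


def stencil_sequential(n):
    a = [[1] * n for _ in range(n)]
    for i in range(1, n):
        for j in range(1, n):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a


def stencil_wavefront(n):
    """Same stencil after skew + interchange: outer loop over t = i + j."""
    a = [[1] * n for _ in range(n)]
    for t in range(2, 2 * n - 1):
        # All iterations on one anti-diagonal are independent of each other.
        for i in range(max(1, t - (n - 1)), min(n - 1, t - 1) + 1):
            j = t - i
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a
```

With the original vectors the inner loop is not parallel, because (0, 1) is carried by the inner loop; after `interchange(skew(d))` the vectors become (1, 1) and (1, 0), both carried by the outer loop.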

17.
The majority of scientific and Digital Signal Processing (DSP) applications are recursive or iterative. Transformation techniques are generally applied to increase parallelism for these nested loops. Most existing loop transformation techniques either cannot achieve maximum parallelism, or can achieve maximum parallelism only with complicated calculations of loop bounds and loop indexes. This paper proposes a new technique, loop striping, that can maximize parallelism while maintaining the original row-wise execution sequence with minimum overhead. Loop striping groups iterations into stripes, where all iterations in a stripe are independent and can be executed in parallel. Theorems and efficient algorithms are proposed for loop striping transformations. The experimental results show that loop striping always achieves a better iteration period than software pipelining and loop unfolding, improving the average iteration period by 50% and 54%, respectively.

18.
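Zhang Huanming (张焕明), Ye Wu (叶梧), Feng Suili (冯穗力). 《电讯技术》 (Telecommunication Engineering), 2007, 47(4): 166-168
LDPC codes are decoded with the BP (belief propagation) algorithm, but the presence of cycles causes the decoder to iterate repeatedly, and short cycles in particular degrade the performance of LDPC codes. This paper therefore uses a tree-graph method to analyze the cycles of LDPC codes and their properties, and gives a method for finding the length of each cycle and the nodes it passes through, which is well suited to solution by computer. The tree-graph method is also used to construct LDPC codes, so that the number and lengths of the cycles can be tracked as the tree is grown.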
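One common way to find short cycles by growing a tree is breadth-first search on the Tanner graph of the parity-check matrix. The sketch below is an illustration under that assumption and is not the specific tree-graph procedure of the paper; the function names and the example matrices are hypothetical. A BFS tree is grown from every node, and any edge that closes back onto an already-visited node other than the parent gives a cycle-length candidate; the minimum over all roots is the girth.

```python
from collections import deque


def tanner_graph(H):
    """Build the bipartite Tanner graph of a binary parity-check matrix H."""
    m, n = len(H), len(H[0])
    adj = {("v", j): [] for j in range(n)}          # variable nodes
    adj.update({("c", i): [] for i in range(m)})    # check nodes
    for i in range(m):
        for j in range(n):
            if H[i][j]:
                adj[("v", j)].append(("c", i))
                adj[("c", i)].append(("v", j))
    return adj


def girth(adj):
    """Length of the shortest cycle, or None if the graph has no cycle."""
    best = None
    for root in adj:
        dist = {root: 0}
        parent = {root: None}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:            # tree edge: extend the BFS tree
                    dist[v] = dist[u] + 1
                    parent[v] = u
                    queue.append(v)
                elif parent[u] != v:         # non-tree edge: closes a cycle
                    length = dist[u] + dist[v] + 1
                    if best is None or length < best:
                        best = length
    return best
```

For example, two checks that share two variables form a length-4 cycle, the shortest possible in a Tanner graph and the kind that code constructions usually try to avoid.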

19.
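The performance of low-density parity-check (LDPC) codes depends on several factors, including the degree distribution pair, the codeword length, and the distribution of cycles. The presence of cycles affects the decoding threshold and the error floor of LDPC codes, and short cycles in particular have a large impact on performance. It is therefore necessary to eliminate short cycles when constructing LDPC codes. This paper provides an effective cycle-elimination algorithm that lowers the error floor of LDPC codes.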
