期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

高伟赵荣彩于海宁张庆花《计算机科学》2016,43(1):226-231, 245

循环展开是一项常用的循环优化技术。当前针对串行程序的循环展开技术已经比较成熟,但是在实际应用中没有针对向量程序进行有效的循环展开。为了解决这个问题,提出了一种面向向量程序的循环展开技术。首先,针对向量寄存器压力和代码膨胀等限制因素,提出了一种自动计算展开因子的CUFVL算法;其次,根据向量循环展开的特点,制定了完全展开策略;最后结合CUFVL算法和完全展开策略,设计了向量循环展开的总体流程。实验结果表明,该方案能够计算出合适的展开因子,进而对向量程序进行适当的循环展开或完全展开,从而有效提升应用程序的性能。相似文献

2.

Compiler assisted architectural exploration framework for coarse grained reconfigurable arrays

Grigorios Dimitroulakos Nikos Kostaras Michalis D. Galanis Costas E. Goutis 《The Journal of supercomputing》2009,48(2):115-151

相似文献

3.

符号执行中的循环依赖分析方法

下载免费PDF全文

刘杰曹琰魏强彭建山《计算机工程》2012,38(22):24-27

符号执行方法处理循环时存在路径爆炸的问题。为此,提出一种基于归纳变量的循环依赖分析方法。通过识别循环归纳变量及符号表达式,结合边界约束条件生成可达归纳变量分支的路径约束,并采用符号化映射方法分析嵌套循环归纳变量依赖问题,从而在不展开循环的情况下生成覆盖归纳变量分支的测试用例。对开源工具Libxml2进行实验,该方法能发现其中2个while循环所引发的数组访问越界错误。相似文献

4.

On effective execution of nonuniform DOACROSS loops

Ding-Kai Chen Pen-Chung Yew 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(5):463-476

It is extremely difficult to parallelize DOACROSS loops with nonuniform loop-carried dependences. In this paper, we present a static scheduling scheme with an accompanying synchronization strategy that can execute such DOACROSS loops effectively and efficiently. Our approach uses one of the parallelization techniques called Dependence Uniformization, which finds a small set of uniform dependence vectors to cover all possible nonuniform dependences in a DOACROSS loop. It differs from the previous schemes in that we demonstrate a better way to select the uniform dependence vectors. When used with the Static Strip Scheduling scheme, the proposed uniform dependence vector set allows us to enforce dependences with more locality, which reduces the requirement of explicit synchronization considerably while retaining most of the parallelism. This paper describes the uniform dependence vectors selection strategy and the static strip scheduling scheme. The performance analysis and examples are also presented 相似文献

5.

向量并行度指导的循环SIMD向量化方法

高伟韩林赵荣彩徐金龙陈超然《软件学报》2017,28(4):925-939

SIMD扩展部件是集成到通用处理器中的加速部件,旨在发掘多媒体和科学计算等领域程序的数据级并行.当前两种基本的向量发掘方法分别是发掘迭代间并行的Loop-based方法和发掘迭代内并行的SLP方法.Loop-aware方法是对SLP方法的改进,其思想是首先通过循环展开将迭代间并行转换为迭代内并行,使循环体内的同构语句条数足够多,再利用SLP方法进行向量发掘.但当循环展开不合法或者并行度低于向量化因子时,Loop-aware方法无法实现程序向量并行性的发掘.因此提出了向量并行度指导的循环向量化方法,依据迭代间并行度、迭代内并行度和向量化因子,构建循环向量化方法选择方案,同时提出不充分向量化方法发掘并行度低于向量化因子的循环向量并行性,最后依据向量并行度对生成的向量循环进行展开.经过标准测试集测试,向量并行度指导的循环SIMD向量化方法比Loop-aware方法识别率提升107.5%,性能提升12.1%. 相似文献

6.

Optimal code parallelization using unimodular transformations

Michael L Dowling 《Parallel Computing》1990,16(2-3):157-171

Lamport's parallelization algorithm (cf. [7]) is generalized to a broader class of loops, and the complexity of the transformation process has been estimated. It is shown that every loop can be parallelized using methods similar to those in [7]; moreover, they also have the property that all their inner loops are devoid of data dependencies, and so are fully parallelizable. Unfortunately, without restricting the nature of the loop to be parallelized, the negative solution to Hilbert's tenth problem (cf. [3]) can be applied to show that the parallelizing transformations are not computable. The class of affine loops was therefore introduced. This class is more general than that considered by Lamport, and it is shown that parallelizing transformations for affine loops are computable. In general, however, the complexity estimates for finding such loops suggest that the parallelization procedure will take longer than executing the original loop sequentially. It is further shown that, if the loop satisfies an additional, nondegeneracy condition, then the loop can be efficiently transformed.

Finally, although more generally applicable, these methods are best applied to vectorization problems. 相似文献

7.

一种面向众核处理器的嵌套循环多维并行识别方法*

李颖颖庞建民李雁冰翟胜伟《计算机应用研究》2018,35(11)

现有并行识别方法用于众核处理器时存在一定不足,当选择的循环并行维迭代数较少时可能导致严重地负载不均衡。针对这一问题,提出了一种面向众核处理器的多维并行识别方法,在现有并行识别方法无法做到较好的负载均衡时,选择嵌套循环的多个维进行并行,将多个并行维的迭代空间合并后再做任务划分,减少负载不均衡对程序并行效率的影响。此方法已在课题组开发的自动并行化系统中进行了实现,实际应用过程中能够提升一些应用程序在众核处理器上并行执行的效率。相似文献

8.

基于执行频率的循环展开优化方法

余小喜迟万庆《微计算机信息》2012,(3):49-51

本文提出一种面向嵌入式低功耗的基于执行频率的循环展开优化方法,根据循环的执行频率,积极展开一些频繁被执行的循环,不展开那些很少被执行的循环。所有这些都在GCC4.0.0上进行了实现,并在sim-panalyzer功耗模拟器上对12个Benchmarks进行了测试,结果表明,相对于传统的循环展开优化,新的优化方法可以有效的降低功耗,并且减少了代码量的增加。相似文献

9.

A thread partitioning approach for speculative multithreading

Bin Liu Yinliang Zhao Yuxiang Li Yanjun Sun Boqin Feng 《The Journal of supercomputing》2014,67(3):778-805

Speculative multithreading (SpMT) is a thread-level automatic parallelization technique, which partitions sequential programs into multithreads to be executed in parallel. This paper presents different thread partitioning strategies for nonloops and loops. For nonloops, we propose a cost estimation based on combined run-time effects of various speculation factors to predict the resulting performance of candidate threads to guide the thread partitioning. For loops, we parallelize all the profitable loops that can potentially offer additional performance benefits by multilevel spawning in loop bodies, loop iterations, and inner loops. Then we select a proper thread boundary located in the front of loop branch instruction to reduce invalid spawning threads that waste core resources. Experimental results show that the proposed approach can obtain a significant increase in speedup and Olden benchmarks reach a performance improvement of 6.62 % on average. 相似文献

10.

Using knowledge-based systems for research on parallelizing compilers

Chao-Tung Yang Shian-Shyong Tseng Yun-Woei Fann Ting-Ku Tsai Ming-Huei Hsieh Cheng-Tien Wu 《Concurrency and Computation》2001,13(3):181-208

相似文献

11.

基于LLVM Pass的复杂嵌套循环自动并行化框架

马春燕吕炳旭叶许姣张雨《软件学报》2023,34(7):3022-3042

随着多核处理器的普及应用,针对嵌入式遗留系统中串行代码的自动并行化方法是研究热点.其中,针对具有非完美嵌套结构、非仿射依赖关系特征的复杂嵌套循环的自动并行化方法存在技术挑战.提出了一种基于LLVMPass的复杂嵌套循环的自动并行化框架(CNLPF).首先,提出了一种复杂嵌套循环的表示模型,即循环结构树,并将嵌套循环的正则区域自动转换为循环结构树表示;然后,对循环结构树进行数据依赖分析,构建循环内和循环间的依赖关系;最后,基于OpenMP共享内存的编程模型生成并行的循环程序.针对SPEC2006数据集中包含近500个复杂嵌套循环的6个程序案例,分别对其进行复杂嵌套循环占比统计和并行性能加速测试.结果表明,提出的自动并行化框架可以处理LLVMPolly无法优化的复杂嵌套循环,增强了LLVM的并行编译优化能力,且该方法结合Polly的组合优化,比单独采用Polly优化的加速效果提升了9%-43%. 相似文献

12.

EFFECTIVE PARALLELIZATION TECHNIQUES FOR LOOP NESTS WITH NON-UNIFORM DEPENDENCES

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(1):37-64

The parallelism of loop nests with non-uniform dependences is difficult to extract and ineffectively explored by the existing parallelization schemes. In this paper, we propose new efficient techniques in extracting parallelism of loop nests with non-uniform dependences using their irregularity. By this way, current highly parallel multiprocessor systems such as multithreaded and clustering multiprocessor systems can be fully utilized. These four mechanisms are (a) parallelization part splitting, (b) partial parallelization decomposition, (c) irregular loop interchange and (d) growing pattern detection. They explore parallelisms of special parallel patterns for nested loops with non-uniform dependences. The loop transformations used in uniform loops are also applied in non-uniform dependence loops after legality tests. We apply the results of classical convex theory and detect special parallel patterns of dependence vectors. We also proposed an algorithm that combines above mechanisms to enhance parallelism. We demonstrate that our technique gives much better speedup and extracts more parallelism than the existing techniques. Thus, we are encouraged by these apparent enhancements to pursue further development. 相似文献

13.

Optimizing modulo scheduling to achieve reuse and concurrency for stream processors

Li Wang Jingling Xue Xuejun Yang 《The Journal of supercomputing》2012,59(3):1229-1251

Both reuse and concurrency are performance-critical for stream processors. When applying loop unrolling and software pipelining separately to stream-level loops, either reuse or concurrency or both may be inadequately exploited. In this paper, we optimize modulo scheduling to maximize stream reuse and improve concurrency for stream-level loops. The key insight is that an unrolled and software-pipelined stream-level loop could be described by a set of reuse equations. Guided by reuse equations, a reuse-aware modulo scheduling algorithm is developed to simultaneously optimize the two performance objectives, reuse, and concurrency, for a loop in a unified framework. Moreover, we describe a code generation algorithm to automatically produce the optimized loop from a given loop. The experimental results obtained on FT64 and by simulation demonstrate the effectiveness of the proposed approach. 相似文献

14.

Loop coalescing and scheduling for barrier MIMD architectures

O'Keefe M.T. Dietz H.G. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(9):1060-1064

Barrier MIMD's are asynchronous multiple instruction stream, multiple data stream architectures capable of parallel execution of variable execution time instructions and arbitrary control flow (e.g., while loops and calls); however, they differ from conventional MIMD's in that the need for run-time synchronization is significantly reduced. The authors consider the problem of scheduling nested loop structures on a barrier MIMD. The basic approach employs loop coalescing, a technique for transforming a multiply-nested loop into a single loop. Loop coalescing is extended to nested triangular loops, in which inner loop bounds are functions of outer loop indices. In addition, a more efficient scheme to generate the original loop indices from the coalesced index is proposed for the case of constant loop bounds. These results are general, and can be applied to extend previous work using loop coalescing techniques. The authors concentrate on using loop coalescing for scheduling barrier MIMDs, and show how previous work in loop transformations and linear scheduling theory can be applied to this problem 相似文献

15.

基于指令Cache和寄存器压力的循环展开优化

王翠霞韩林刘浩浩《计算机工程与科学》2022,44(12):2111-2119

循环展开是一种常用的编译优化技术,能够有效减少循环开销,提升指令级并行程度和数据局部性,提升循环的执行效能。然而,过度的循环展开会造成指令Cache溢出,增大寄存器压力,循环展开次数太少又会浪费潜在的性能提升机会,因此寻找恰当的展开因子是研究循环展开问题的核心。基于GCC开源编译器,面向循环展开问题开展深入的分析与研究,针对指令Cache和寄存器资源对循环展开的影响,提出了一种基于指令Cache和寄存器压力的循环展开因子计算方法,并在GCC编译器中实现了该计算方法。申威和海光平台上的实验结果显示,相较于目前GCC中存在的其它展开因子计算方法,所提出的方法可以获得更为有效的循环展开因子,提升了程序性能。在SPEC CPU 2006测试集上的平均性能分别提升了2.7%和3.1%,在NPB-3.3.1测试集上的分别为5.4%和6.1%。相似文献

16.

面向OpenMP自动并行化的代价模型

李雁冰赵荣彩刘晓娴赵捷《软件学报》2014,25(S2):101-110

现有的OpenMP代价模型较为简单,既没有充分考虑OpenMP程序的执行细节,也无法适应不同的循环并行执行方式.针对上述问题,对最先进的产品级优化编译器Open64中已有的代价模型进行扩展,以单个并行候选循环为对象,建立一种用于OpenMP自动并行收益分析的代价模型.该模型在改进了Open64原有DOALL并行代价模型的基础上,又增加了DOACROSS流水并行代价模型和DSWP并行代价模型.实验结果表明,建立的代价模型能够较好地评估循环并行执行开销的趋势,为OpenMP自动并行化中的收益分析提供了有效的支持. 相似文献

17.

Compile-time scheduling of multithread data localities on multiple vector processors

Chih-Yung Chang Jang-Ping Sheu 《Concurrency and Computation》1995,7(5):349-369

A large class of loop programs applied in solving differential equations, Fourier transforms, image processing and neural processing can be translated or rewritten into a vector execution form with a π-block dependence graph. In the paper we propose a multithreading strategy to partition such vectorized loops into multithread execution form. Each partitioned thread consists of instances of statements with localities in vector registers. The multithreading scheme gives a novel combination of loop unrolling, statement instances reordering, index shifting, vector register reuse exploiting and multithreading. For some cases of loop program with π-block dependence graph, experimental results show that our scheme assists vector compilers of the Convex C38 series to reduce the number of memory accesses and synchronizations among CPUs. 相似文献

18.

A flexible algorithm for calculating pair interactions on SIMD architectures

Szilárd Páll Berk Hess 《Computer Physics Communications》2013

Calculating interactions or correlations between pairs of particles is typically the most time-consuming task in particle simulation or correlation analysis. Straightforward implementations using a double loop over particle pairs have traditionally worked well, especially since compilers usually do a good job of unrolling the inner loop. In order to reach high performance on modern CPU and accelerator architectures, single-instruction multiple-data (SIMD) parallelization has become essential. Avoiding memory bottlenecks is also increasingly important and requires reducing the ratio of memory to arithmetic operations. Moreover, when pairs only interact within a certain cut-off distance, good SIMD utilization can only be achieved by reordering input and output data, which quickly becomes a limiting factor. Here we present an algorithm for SIMD parallelization based on grouping a fixed number of particles, e.g. 2, 4, or 8, into spatial clusters. Calculating all interactions between particles in a pair of such clusters improves data reuse compared to the traditional scheme and results in a more efficient SIMD parallelization. Adjusting the cluster size allows the algorithm to map to SIMD units of various widths. This flexibility not only enables fast and efficient implementation on current CPUs and accelerator architectures like GPUs or Intel MIC, but it also makes the algorithm future-proof. We present the algorithm with an application to molecular dynamics simulations, where we can also make use of the effective buffering the method introduces. 相似文献

19.

选择性循环的并行方法

下载免费PDF全文

吴悦雷超付杨洪斌《计算机工程》2010,36(9):35-37,40

针对含有大量循环的串行程序存在的问题,提出一种基于线程级前瞻技术的循环选择方案。该方案对循环进行最优选择后建立一个可并行运行的循环集。对于该集合中的循环,选择并行效率高的代码段作并行处理,以加快串行程序运行速度。实验表明,相对于一般的简单内部循环或外部循环并行方法,该方案使9种基准代码的加速比平均上升23.8%,从而提高串行程序并行运行的效率。相似文献

20.

选择性循环的并行方法

下载免费PDF全文

吴悦雷超付杨洪斌《计算机工程》2010,36(9):35-37,4

针对含有大量循环的串行程序存在的问题,提出一种基于线程级前瞻技术的循环选择方案。该方案对循环进行最优选择后建立一个可并行运行的循环集。对于该集合中的循环,选择并行效率高的代码段作并行处理,以加快串行程序运行速度。实验表明,相对于一般的简单内部循环或外部循环并行方法,该方案使9种基准代码的加速比平均上升23.8%,从而提高串行程序并行运行的效率。相似文献