共查询到20条相似文献,搜索用时 15 毫秒
1.
2.
3.
Traditionally, loop nests are fused only when the data dependences in the loop nests are not violated. This paper presents
a new loop fusion algorithm that is capable of fusing loop nests in the presence of fusion-preventing anti-dependences. All
the violated anti-dependences are removed by automatic array copying. As a case study, this aggressive loop fusion strategy
is applied to a Jacobi solver. The performance of iterative methods is typically limited by the speed of the memory system.
Fusing the two loop nests in the Jacobi solver into one reduces data cache misses, and consequently, improves the performance
results of both sequential and parallel versions of the Jacobi program, as validated by our experimental results on an HP
AlphaServer SC45 supercomputer. 相似文献
4.
The I test is an efficient and precise data dependence method to ascertain whether integer solutions exist for one-dimensional arrays with constant bounds. For one-dimensional arrays with variable limits, the I test assumes that there may exist integer solutions. In this paper, we extend the I test. The extended I test can be applied towards determining whether integer solutions exist for one-dimensional arrays with variable limits, improving the applicable range of the I test. Experiments with benchmark cited from EISPACK, LINPACK, Parallel loops, etc. showed that among 1189 pairs of one-dimensional arrays tested, 183 had their data dependence analysis amended by the extended I test. That is, the extended I test increases the success rate of the I test by approximately 15.4%. Comparing with the Power test and the Omega test, the extended I test has higher accuracy than the Power test and shares the same accuracy with the Omega test for these 1189 pairs of arrays, but has much better efficiency over these two well-known tests. 相似文献
5.
6.
7.
Many scientific applications are comprised of irregular reductions on large data sets. In shared-memory parallel programs, these irregular reductions are typically computed in parallel using replicated buffers, then combined using synchronization. We develop L
W
, a new technique which partitions irregular reductions so that each processor computes values only for locally assigned data, eliminating the need for buffers or synchronized writes. Computation is replicated if its results are needed on multiple processors. We experimentally evaluate its performance for three irregular codes on a software DSM running on a distributed-memory multiprocessor and two shared-memory multiprocessors while varying connectivity, locality, and adaptivity. Results show L
W
improves performance significantly compared to using replicated buffers, and can match or exceed explicit message-passing gather/scatter for applications with low locality or high adaptivity. 相似文献
8.
Most scientific applications rely on parallel multiprocessor computing to enhance performance. However, the irregular loops within these applications obstruct the parallelism analysis at compile-time. Rauchwerger et al. presented a run-time method to extract the hidden parallelism in a program using dependence chains. The relative overhead degrades this approach’s performance due to the mass storage requirement and huge array reference processing. In this study, a new predecessor/successor approach is developed in which high-level predecessor/successor information is recorded and processed efficiently. A predecessor/successor table is constructed first in the inspector phase so that only the successor iterations in the current wavefront need to be examined, instead of the entire loop iterations during wavefront scheduling. Usually, the performance of dependence chain approach degrades dramatically for a hot-spot access pattern, but our scheme works very efficiently in this case. The experimental results using synthetic code and real programs are presented to prove the superiority of the proposed approach. 相似文献
9.
SIMD扩展部件是近年来集成到通用处理器中的加速部件,旨在发掘多媒体和科学计算等程序的数据级并行.控制依赖给发掘程序中的数据级并行带来了阻碍,当前不论基于loop-based还是SLP的控制流向量化方法都需要if转换,而没有考虑循环内蕴含的向量并行度,导致生成的向量代码效率较低.此外不精确的代价模型指导控制流向量化,同样导致生成的向量代码效率较低.为此提出了改进的控制流SIMD向量化方法,首先提出了含有控制依赖的循环分布算法,分离循环的可向量化部分和不可向量化部分,同时考虑分布时数据的局部性;其次提出了一种直接向量化控制流的方法,该方法考虑了基本块间的向量重用;最后利用精确的代价模型指导超字选择指令和超字条件分支指令的生成.实验结果表明,与现有的控制流向量化方法相比,本文提出的改进方法生成的向量代码性能提高24%. 相似文献
10.
11.
12.
循环分布是开发向量化程序的一个有效的方法。但是由于程序中的数据相关性,当前的自动向量化编译器实现完全的循环分布非常困难。因此,当前的自动向量化编译器一般采用简单的循环分布方法。以数据依赖关系分析为基础,从有无依赖环的角度分析了程序中语句的向量化能力,提出了基于语句向量化识别的循环分布算法,并在自动向量化中加以实现。通过此方法,可以充分地分析语句或依赖环的向量化能力,最终采用循环分布,将可向量化的语句与不可向量化的语句分布在不同的循环中。该方法可以处理当前的自动向量化编译器无法向量化的循环,对一些语句间有依赖关系的循环可达到较好的效果。 相似文献
13.
14.
15.
Ozcan Ozturk Author Vitae 《Journal of Parallel and Distributed Computing》2011,71(2):280-287
Embedded applications are becoming increasingly complex and processing ever-increasing datasets. In the context of data-intensive embedded applications, there have been two complementary approaches to enhancing application behavior, namely, data locality optimizations and improving loop-level parallelism. Data locality needs to be enhanced to maximize the number of data accesses satisfied from the higher levels of the memory hierarchy. On the other hand, compiler-based code parallelization schemes require a fresh look for chip multiprocessors as interprocessor communication is much cheaper than off-chip memory accesses. Therefore, a compiler needs to minimize the number of off-chip memory accesses. This can be achieved by considering multiple loop nests simultaneously. Although compilers address these two problems, there is an inherent difficulty in optimizing both data locality and parallelism simultaneously. Therefore, an integrated approach that combines these two can generate much better results than each individual approach. Based on these observations, this paper proposes a constraint network (CN)-based formulation for data locality optimization and code parallelization. The paper also presents experimental evidence, demonstrating the success of the proposed approach, and compares our results with those obtained through previously proposed approaches. The experiments from our implementation indicate that the proposed approach is very effective in enhancing data locality and parallelization. 相似文献
16.
程序并行化工具由它能有效地解决了多种并行机结构间的代码可移植性和大大地减轻用户使用并行机的困难,已成为当今并行处理领域的一个热门研究课题。相信随着对并行机系统越来越广泛的使用。它还将会得到不断的发展和完善。本文着重介绍了并行化关键技术和工具系统的研究历史与现状,并就这一研究课题今后的发展趋势提出一些看法。 相似文献
17.
一种基于非正规域的区域依赖关系分析法 总被引:1,自引:0,他引:1
在自动并行编译中,并行性的识别主要集中在循环及语句级,而许多程序实际上可通过挖掘子程序级这种“任务“并行性来提高性能。本文提出了基于非正规域的区域依赖分析方法,旨在发掘这类并行性,它能精确地刻划程序中的数据访问区域。克服了现有区域分析技术中趋于保守的弱点,从而提出了并行度,依赖关系的测试算法简单而有效。 相似文献
18.
19.
针对现有通信优化算法无法使MPI自动并行化编译器生成加速比理想的消息传递程序问题,提出了一种基于重排序变换和循环分布的通信优化算法。该算法根据给出的过程间副作用集合和基于mpi_wait/mpi_irecv移动的重排序变换规则,有序地采用重排序变换和循环分布,尽可能安全地扩大点到点非阻塞通信中通信与计算的重叠窗口,使MPI自动并行化编译器生成具有更多计算重叠通信的消息传递代码。实验结果表明,该算法能够隐藏更多的点到点非阻塞通信开销,并且明显提升消息传递程序的加速比。 相似文献