
1.

Choosing Between Multiprocessors and Non-Multiprocessors

Zhang Zhengxing 《抗恶劣环境计算机》1995,9(3):10-17,45


2.

Multiprocessors

Yan Baokang 《计算机技术》1992,(3):58-62


3.

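Research on Automatic Parallelization of Branch Nested Loops

Ding Lili, Li Yanbing, Zhang Suping, Wang Pengxiang, Zhang Qinghua 《计算机科学》2017,44(5):14-19, 52

GCC is an open-source optimizing compiler favored by many researchers, but it can perform dependence analysis only on perfectly nested loops. To better exploit coarse-grained parallelism in nested loops, this paper studies the data dependence analysis process of GCC 5.1 in depth and proposes a dependence test that can handle branch nested loops. The method first identifies branch nested loops, then analyzes the relationship between array subscripts and the outer-loop index variable of the branch nested loop, and finally computes the distance vector of the outer-loop index variable and checks that vector to decide whether the loop carries a dependence. Experimental results show that the method analyzes the dependences of branch nested loops correctly and effectively.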
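
As a rough illustration of the distance-vector check described in this abstract (not the paper's GCC implementation), the sketch below covers only the simplest case: a write and a read of the same array whose subscripts on the outer index have the form `i + c`. The function names and the restricted subscript form are assumptions made for this example, and loop bounds are ignored.

```python
from typing import List, Optional, Tuple


def outer_distance(write_offset: int, read_offset: int) -> Optional[int]:
    """Distance on the outer index between a write A[i + write_offset]
    and a read A[i + read_offset].

    Iteration i_w writes the element that iteration i_r reads when
    i_w + write_offset == i_r + read_offset, so the distance is
    i_r - i_w = write_offset - read_offset.
    Returns None when the distance is zero (nothing is carried by the loop).
    """
    d = write_offset - read_offset
    return d if d != 0 else None


def outer_loop_is_parallel(accesses: List[Tuple[int, int]]) -> bool:
    """The outer loop may run in parallel only if no (write, read) pair
    has a nonzero distance on the outer index."""
    return all(outer_distance(w, r) is None for w, r in accesses)


# Hypothetical loop body: A[i][j] = A[i - 1][j] + 1  ->  write offset 0, read offset -1
carried = outer_distance(0, -1)
```

Here `carried` is nonzero, so the outer loop in this hypothetical body would have to stay sequential, while a body that only reads and writes `A[i][j]` would pass the check.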

4.

Multiprocessor Communication Design

Zhou Zhongyu 《计算机杂志》1990,18(5):6-10


5.

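Mapping Production Systems onto Multiprocessors

M.F.M. Tenono, D.I. Moldovan, Chen Xiaohua 《计算机工程与科学》1988,(2)

This paper presents a method for mapping production systems (PS) onto multiprocessor computer architectures. A graph grammar model of a PS is given first, and the model is used to construct an algorithm that analyzes the dependences among rules. The problem of assigning rules to processors is then described on the basis of the dependences inherent in the PS. We study the allocation problem for the case where enough processing elements are available and the interconnection network is given. We then study the partitioning problem for the case where the number of rules exceeds the number of processing elements and a balanced load must be achieved. An algorithm is proposed for the allocation and partitioning problems, and an example maps a PS with 32 rules onto multiprocessor systems with 8 processing elements and different interconnection networks.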
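
The balanced-load partitioning problem mentioned in this abstract can be pictured with a simple greedy heuristic. This is a stand-in technique, not the paper's algorithm: it ignores the dependences between rules and the interconnection network, and the rule names and unit costs are hypothetical.

```python
import heapq
from typing import Dict, List


def partition_rules(rule_costs: Dict[str, int], num_processors: int) -> List[List[str]]:
    """Greedy balanced partition: take the heaviest remaining rule and
    place it on the processor with the smallest load so far."""
    # Heap entries are (current load, processor index).
    heap = [(0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    groups: List[List[str]] = [[] for _ in range(num_processors)]
    for rule in sorted(rule_costs, key=lambda r: (-rule_costs[r], r)):
        load, p = heapq.heappop(heap)
        groups[p].append(rule)
        heapq.heappush(heap, (load + rule_costs[rule], p))
    return groups


# Hypothetical instance with the sizes quoted in the abstract: 32 rules, 8 processors.
costs = {f"R{i}": 1 for i in range(32)}
groups = partition_rules(costs, 8)
```

With equal costs every processor ends up with four rules; a real mapping would also need to weigh communication between dependent rules.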

6.

Operating Systems for Multiprocessors

Zhang Zhongyun, Fan Jianping 《中国计算机用户》1991,(10):10-13


7.

Resource Management in Multiprocessor Operating Systems

Zhang Yu 《计算机工程》1989,(5):35-39,44


8.

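Decomposition of Non-Uniform Linear Recurrence Computations onto Multiprocessors (total citations: 1; self-citations: 0; citations by others: 1)

Zhang Defu, Chen Ling 《计算机学报》1998,21(Z1):46-51

Many good methods already exist for independently decomposing a uniform linear recurrence (ULR) computation into independent sets that can be computed separately and therefore assigned to a distributed-memory multiprocessor for parallel execution. This paper proposes several methods for the independent decomposition of non-uniform linear recurrence (NLR) computations, which decompose an NLR into independent sets made up of one or more sub-planes or of lattice points. The paper also gives necessary and sufficient conditions for applying each method.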
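
To make the notion of an independent set concrete, here is a minimal sketch for the uniform one-dimensional case only, assuming a recurrence of the form x[i] = a * x[i - d] + b with a fixed stride d. The non-uniform decompositions proposed in the paper are not reproduced here, and all names are illustrative.

```python
from typing import List


def independent_sets(n: int, d: int) -> List[List[int]]:
    """Split indices 0..n-1 of x[i] = f(x[i - d]) into d independent chains.

    Index i depends only on index i - d, so indices that share i % d form
    one chain, and different chains never exchange data.
    """
    return [list(range(r, n, d)) for r in range(d)]


def solve_chain(x: List[float], chain: List[int], a: float, b: float, d: int) -> None:
    """Evaluate x[i] = a * x[i - d] + b along one chain.
    The first element of the chain is an initial value and is left as given."""
    for i in chain[1:]:
        x[i] = a * x[i - d] + b


# Sequential reference versus chain-by-chain evaluation (chains could run on separate processors).
x_seq = [1.0, 2.0, 3.0] + [0.0] * 7
for i in range(3, 10):
    x_seq[i] = 2.0 * x_seq[i - 3] + 1.0

x_par = [1.0, 2.0, 3.0] + [0.0] * 7
for chain in independent_sets(10, 3):
    solve_chain(x_par, chain, 2.0, 1.0, 3)
```

Each chain touches a disjoint set of array elements, which is why the chains can be assigned to different processors without communication.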

9.

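An Analysis of Multiprocessor Operating Systems

Chen Zuoning 《电子计算机》2000,(3):2-9

By reviewing the history of multiprocessor operating systems and studying how their system structures developed, this paper analyzes how operating system structure affects the scalability, efficiency, usability, and adaptability of high-performance computer systems. Drawing on an analysis of several typical operating systems for scalable parallel machines, it summarizes the advantages and disadvantages of operating systems with different structures.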

10.

Implementation Techniques for Multiprocessor UNIX

Jin Guohua, Cao Lin 《计算机研究与发展》1993,30(2):15-20,14


11.

Generation of Efficient Nested Loops from Polyhedra (total citations: 1; self-citations: 0; citations by others: 1)

Fabien Quilleré Sanjay Rajopadhye Doran Wilde 《International journal of parallel programming》2000,28(5):469-498

Automatic parallelization in the polyhedral model is based on affine transformations from an original computation domain (iteration space) to a target space-time domain, often with a different transformation for each variable. Code generation is an often ignored step in this process that has a significant impact on the quality of the final code. It involves making a trade-off between code size and control code simplification/optimization. Previous code generation methods are based on loop splitting; however, they behave nonoptimally on parameterized programs. We present a general parameterized method for code generation based on dual representation of polyhedra. Our algorithm uses a simple recursion on the dimensions of the domains, and enables fine control over the trade-off between code size and control overhead.
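
The trade-off between code size and control overhead can be seen on a toy case written by hand (it is not output of the authors' algorithm). Assume statement S1 is defined for 0 <= i < n and S2 for m <= i < n, with parameters n and m >= 0; the two functions below scan the union of these domains in the same order.

```python
from typing import List, Tuple

Trace = List[Tuple[str, int]]


def scan_guarded(n: int, m: int) -> Trace:
    """One loop: smaller code, but a guard is evaluated on every iteration."""
    trace: Trace = []
    for i in range(n):
        trace.append(("S1", i))
        if i >= m:  # control overhead inside the loop body
            trace.append(("S2", i))
    return trace


def scan_split(n: int, m: int) -> Trace:
    """Two loops: larger code, but no guard inside either loop body.
    Assumes m >= 0."""
    trace: Trace = []
    for i in range(min(m, n)):  # region where only S1 is defined
        trace.append(("S1", i))
    for i in range(m, n):  # region where both statements are defined
        trace.append(("S1", i))
        trace.append(("S2", i))
    return trace
```

Both versions execute the same statement instances in the same order for any n and any m >= 0; they differ only in how much code is emitted and how many conditions are tested at run time.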

12.

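Parallelism Identification Techniques Based on Nested Loop Classification

Zhao Jie, Zhao Rongcai, Ding Rui, Huang Pinfeng 《软件学报》2012,23(10):2695-2704

Traditional parallelizing compilers for distributed memory are mostly developed on top of parallelizing compilers for shared memory. The parallelism identification techniques of shared-memory parallelizing compilers suit OpenMP code generation and process every nested loop with the same identification method; applied to a distributed-memory parallelizing compiler, they inevitably fail to exploit program parallelism efficiently. A distributed-memory parallelizing compiler should instead classify nested loops by their structural characteristics and use parallelism identification techniques suited to MPI code generation. To address this, a new nested loop classification method is proposed based on the structure of nested loops and the characteristics of MPI parallel programs, with a corresponding parallelism identification technique for each class of nested loop. Experimental results show that, compared with a distributed-memory parallelizing compiler using traditional identification techniques, a compiler that classifies nested loops with the proposed method and applies the corresponding techniques identifies parallel loops in benchmark programs more efficiently, and the speedup of the automatically generated MPI parallel code improves by more than 20%.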

13.

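An Automatic Parallelization Framework for Complex Nested Loops Based on LLVM Pass

Ma Chunyan, Lü Bingxu, Ye Xujiao, Zhang Yu 《软件学报》2023,34(7):3022-3042

With the widespread adoption of multicore processors, automatic parallelization of serial code in legacy embedded systems has become a research hotspot. Automatically parallelizing complex nested loops with imperfect nesting structure and non-affine dependences remains technically challenging. This paper proposes CNLPF, an automatic parallelization framework for complex nested loops based on LLVM Pass. First, a representation model for complex nested loops, the loop structure tree, is proposed, and the regular regions of nested loops are automatically converted into loop structure trees. Next, data dependence analysis is performed on the loop structure tree to build intra-loop and inter-loop dependences. Finally, parallel loop code is generated using the OpenMP shared-memory programming model. For 6 program cases from the SPEC2006 data set containing nearly 500 complex nested loops, the proportion of complex nested loops was measured and parallel speedup was tested. The results show that the proposed framework can handle complex nested loops that LLVM Polly cannot optimize, strengthening LLVM's parallel compilation and optimization capability, and that combining the method with Polly improves speedup by 9%-43% over Polly alone.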
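
The abstract does not spell out the layout of the loop structure tree, so the following is only a guess at what such a representation might look like: loops are interior nodes, statements are leaves, and a helper distinguishes perfect from imperfect nests. The class and function names are hypothetical and are not taken from CNLPF.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Stmt:
    text: str


@dataclass
class Loop:
    index: str
    body: List[Union["Loop", Stmt]] = field(default_factory=list)


def depth(node: Union[Loop, Stmt]) -> int:
    """Nesting depth: 0 for a statement, 1 + deepest child for a loop."""
    if isinstance(node, Stmt):
        return 0
    return 1 + max((depth(child) for child in node.body), default=0)


def is_perfect_nest(loop: Loop) -> bool:
    """A nest is perfect when every loop except the innermost contains
    exactly one child, and that child is a loop."""
    inner_loops = [c for c in loop.body if isinstance(c, Loop)]
    if not inner_loops:
        return True  # innermost loop: any statements are allowed
    return len(loop.body) == 1 and is_perfect_nest(inner_loops[0])


perfect = Loop("i", [Loop("j", [Stmt("a[i][j] = 0")])])
imperfect = Loop("i", [
    Stmt("s = 0"),
    Loop("j", [Stmt("s += a[i][j]")]),
    Stmt("b[i] = s"),
])
```

The `imperfect` example has statements between the loop headers, which is the kind of structure the abstract describes as hard for tools that expect perfect nests.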

14.

Mapping Parallel Application Communication Topology to Rhombic Overlapping-Cluster Multiprocessors

Hoganson Kenneth 《The Journal of supercomputing》2000,17(1):67-90

This paper extends research into rhombic overlapping-connectivity interconnection networks into the area of parallel applications. As a foundation for a shared-memory non-uniform access bus-based multiprocessor, these interconnection networks create overlapping groups of processors, buses, and memories, forming a clustered computer architecture where the clusters overlap. This overlapping-membership characteristic is shown to be useful for matching parallel application communication topology to the architecture's bandwidth characteristics. Many parallel applications can be mapped to the architecture topology so that most or all communication is localized within an overlapping cluster, at the low latency of processor direct to cache (or memory) over a bus. The latency of communication between parallel threads does not degrade parallel performance or limit the graininess of applications. Parallel applications can execute with good speedup and scaling on a proposed architecture which is designed to obtain maximum advantage from the overlapping-cluster characteristic, and also allows dynamic workload migration without moving the instructions or data. Scalability limitations of bus-based shared-memory multiprocessors are overcome by judicious workload allocation schemes that take advantage of the overlapping-cluster memberships. Bus-based rhombic shared-memory multiprocessors are examined in terms of parallel speedup models to explain their advantages and justify their use as a foundation for the proposed computer architecture. Interconnection bandwidth is maximized with bi-directional circular and segmented overlapping buses. Strategies for mapping parallel application communication topologies to rhombic architectures are developed. Analytical models of enhanced rhombic multiprocessor performance are developed with a unique bandwidth modeling technique, and are compared with the results of simulation.

15.

A New Approach to Parallelization of Serial Nested Loops Using Genetic Algorithms

Saeed Parsa Shahriar Lotfi 《The Journal of supercomputing》2006,36(1):83-94

Loop parallelization is an important issue in accelerating the execution of scientific programs. To exploit parallelism in loops, a system of equations representing the dependencies between loop iterations and a system of inequalities describing the loop boundary conditions has to be solved. This is an NP-complete problem. Our major contribution in this paper is to apply a genetic algorithm to solve the system of equations and inequalities produced by loop dependency analysis techniques in order to find two dependent loop iterations. We then use the distance vector to find the rest of the dependencies.

16.

Processor Array Synthesis from Shift-Variant Deep Nested Do Loops

Kittitornkun Surin Hu Yu Hen 《The Journal of supercomputing》2003,24(3):229-249

The consolidation of Internet devices into a universal/portable device will soon be accomplishable through the incorporation of reconfigurable computing in system-on-a-chip (SOC). At any particular moment, it could be a video/audio mobile phone, an MP3 song player, or another device. The basic construct of these multimedia processing algorithms can be described as deep nested Do loop algorithms. They are considered the most demanding data-intensive algorithms and hence ideal candidates for an array of reconfigurable nanoprocessors. Therefore, an algorithm-to-hardware synthesis methodology is important for an efficient exploitation of both spatial parallelism and temporal pipelining. In this paper, we propose a processor array synthesis methodology. It can map an n-level nested Do loop represented by a nonuniform or shift-variant data dependence graph to a near-optimal one- or two-dimensional processor array under the available resource constraints to satisfy high-throughput computation demands.

17.

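Research on and Application of a Method for Mapping XML Files to Nested Tables

Zhu Jing, Sun Zhonglin, Wei Yongshan 《数字社区&智能家居》2010,(1)

XML files are widely used as an intermediate data format because of their extensibility and generality, while nested tables describe hierarchically structured complex objects simply and intuitively and are well suited to presenting complex data structures in end-user programming. This paper proposes a method for mapping XML files with complex hierarchical structure to nested tables; using an intermediate data structure, it gives an algorithm for mapping such XML files to that intermediate structure. Experiments show that the method effectively reduces the complexity of mapping XML files to nested tables.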
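
One plausible shape for such a mapping, offered only as a sketch: each element becomes a row, and its child elements become a nested table stored inside that row. The dictionary layout (`tag`, `attrs`, `text`, `rows`) and the sample document are assumptions for this example; the paper's intermediate data structure is not described in the abstract.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict


def to_nested_table(elem: ET.Element) -> Dict[str, Any]:
    """Map an element to a row; child elements become a nested table
    (a list of rows) stored under the "rows" key."""
    row: Dict[str, Any] = {"tag": elem.tag, "attrs": dict(elem.attrib)}
    text = (elem.text or "").strip()
    if text:
        row["text"] = text
    children = [to_nested_table(child) for child in elem]
    if children:
        row["rows"] = children
    return row


# Hypothetical input document.
doc = "<order id='7'><item><name>disk</name><qty>2</qty></item></order>"
table = to_nested_table(ET.fromstring(doc))
```

Rendering `table` as HTML or a grid is then a separate step that walks the `rows` lists recursively.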

18.

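Performance Advantages of Single-Chip Multiprocessors (total citations: 6; self-citations: 0; citations by others: 6)

Huang Guangqi, Zhou Xingming 《计算机工程与科学》2001,23(1):35-38

Targeting a chip design with an area of about 300 mm^2, this paper describes three different chip architectures: one superscalar architecture and two single-chip multiprocessor architectures. Simulation results show that, owing to the inherent limitations of superscalar techniques, the single-chip multiprocessor architectures have a clear performance advantage over the superscalar architecture and exploit parallelism more effectively.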

19.

Optimized Unrolling of Nested Loops

Vivek Sarkar 《International journal of parallel programming》2001,29(5):545-581

Loop unrolling is a well known loop transformation that has been used in optimizing compilers for over three decades. In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include (i) a more detailed cost model that includes register locality, instruction-level parallelism and instruction-cache considerations; (ii) a new code generation algorithm that generates more compact code than the unroll-and-jam transformation; and (iii) a new algorithm for efficiently enumerating feasible unroll vectors. Our experimental results confirm the wide applicability of our approach by showing a 2.2× speedup on matrix multiply, and an average 1.08× speedup on seven of the SPEC95fp benchmarks (with a 1.2× speedup for two benchmarks). Larger performance improvements can be expected on processors that have larger numbers of registers and larger degrees of instruction-level parallelism than the processor used for our measurements (PowerPC 604).
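
For readers unfamiliar with the baseline this abstract compares against, the sketch below shows classic unroll-and-jam on matrix multiply with a fixed unroll factor of 2 on the outer loop. It is not the paper's code generation algorithm or cost model; the unroll factor and function names are chosen for illustration, and Python is used only to show the loop structure, not to demonstrate a speedup.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference triple loop."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_unroll_jam(a: Matrix, b: Matrix) -> Matrix:
    """Outer i loop unrolled by 2, with both copies jammed into one inner nest."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    main = n - n % 2  # largest even number of rows
    for i in range(0, main, 2):
        for j in range(p):
            for k in range(m):
                bkj = b[k][j]  # loaded once, reused by both unrolled copies
                c[i][j] += a[i][k] * bkj
                c[i + 1][j] += a[i + 1][k] * bkj
    for i in range(main, n):  # remainder row when n is odd
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
```

The reuse of `bkj` across the two jammed copies is the register-locality benefit that unroll factors are meant to buy; the separate remainder loop is part of the code-size cost.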