Similar Literature
 Found 20 similar articles; search took 156 ms
1.
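Zhao Jie, Zhao Rongcai, Ding Rui, Huang Pinfeng. Journal of Software (《软件学报》), 2012, 23(10): 2695-2704
Traditional distributed-memory parallelizing compilers are mostly built on top of shared-memory parallelizing compilers. The parallelism recognition techniques of shared-memory parallelizing compilers are suited to OpenMP code generation and handle all nested loops with the same recognition method; applied to a distributed-memory parallelizing compiler, they inevitably fail to exploit program parallelism efficiently. A distributed-memory parallelizing compiler should instead classify nested loops according to their structure and provide parallelism recognition techniques suited to MPI code generation. To address this problem, a new nested-loop classification method is proposed based on the structure of nested loops and the characteristics of MPI parallel programs, together with a corresponding parallelism recognition technique for each class of nested loop. Experimental results show that, compared with a distributed-memory parallelizing compiler using traditional parallelism recognition techniques, a compiler that classifies nested loops with the proposed method and applies the corresponding recognition techniques identifies parallel loops in benchmark programs more efficiently, and the speedup of the automatically generated MPI parallel code improves by more than 20%.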

2.
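Swarm intelligence systems accomplish swarm-level application tasks through information exchange among neighboring individuals and offer good robustness and flexibility. At the same time, most developers find it difficult to describe distributed, parallel individual interaction mechanisms. Some high-level languages let users program parallel swarm intelligence computing tasks with a sequential mindset and from a global system perspective, without considering low-level interaction details such as communication protocols and data distribution. However, the large semantic gap between user-facing, globally declarative swarm intelligence applications and the parallel execution logic of individuals complicates compilation and lowers application development efficiency. This paper proposes a compilation system and its supporting tools that translate high-level swarm intelligence applications into safe, efficient distributed implementations. Through parallel information recognition, computation partitioning, and interaction information generation, the compilation system compiles sequentially programmed, global-view swarm intelligence applications into parallel target code executed independently by each individual, so users need not understand the complex interaction mechanisms among individuals. A standardized intermediate representation is designed that converts complex swarm intelligence computing tasks into a sequence of standardized semantic modules composed of swarm intelligence operators and input/output variables; it represents source program information in a platform-independent form and hides the heterogeneity of target hardware platforms. The compilation system was deployed and tested on a swarm intelligence case-study platform. The results show that it effectively compiles swarm intelligence applications into platform-executable target code and improves development efficiency, and that the generated code outperforms that of existing compilers on a series of benchmarks.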

3.
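Software DSM (distributed shared memory) systems build a shared-memory programming environment on clusters, combining the programmability of shared memory with the scalability of clusters, and have attracted extensive research. Because a software DSM system is a distributed system with a high risk of failure, fault tolerance is needed to make it practical. Using user-level checkpointing, a recoverable and highly portable software DSM system, JIACKPT (JIAjia with ChecKPoinTing), was designed and implemented on top of JIAJIA, a software DSM system that supports the scope consistency memory model. Thanks to a strong globally consistent state suited to software DSM systems and a number of optimizations, JIACKPT is easy to implement and performs well. Application tests on an 8-node PC cluster show that even with one checkpoint per minute, the checkpointing overhead of most applications is below 10%. JIACKPT is also highly portable. These results indicate that JIACKPT has become a fairly practical system.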

4.
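An Effective Parallelism Analysis Algorithm   Total citations: 3, self-citations: 0, citations by others: 3
Parallelism analysis plays a very important role in parallelizing compilers, and its quality directly determines the success or failure of the compiler. With the development of cluster systems and their parallel development environments, most parallel systems can run multiply nested parallel loops. For programming systems that support only a single level of parallel loops, choosing the loop that runs most efficiently in parallel is also important. To this end, this paper proposes an effective loop parallelism analysis scheme that not only reports the parallelism of multi-level loops but also handles the vast majority of parallelism problems in practical applications. The paper modifies the traditional parallelism analysis algorithm and presents an effective …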

5.
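Design and Implementation of a Cluster OpenMP System   Total citations: 5, self-citations: 0, citations by others: 5
OpenMP has become the programming standard for shared-memory architectures thanks to its ease of use and support for incremental parallelization. Cluster systems are now the mainstream platform for high-performance computing, so research on cluster OpenMP systems is highly valuable for advancing the development and adoption of parallel applications. Using the software DSM system JIAJIA as the OpenMP runtime system, combined with a front-end compiler, OMP2JIA, the authors implemented the OpenMP/JIAJIA computing environment on a cluster; to improve performance, they extended the OpenMP directives according to the characteristics of cluster systems and optimized the back-end runtime library. Using 11 OpenMP applications, the authors compared the performance of this environment with that of a hardware cc-NUMA system supporting OpenMP (SGI 2100). The results show an average speedup of 4.62 on 7 machines for the cluster OpenMP system and 4.55 for the SGI 2100, so the two perform comparably.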

6.
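Heterogeneous many-core architectures offer an extremely high performance-to-power ratio and have become an important direction for supercomputer architecture. However, the more complex parallelism and memory hierarchies of many-core systems pose great challenges for programming and optimization. Research on parallel programming techniques for many-core systems is therefore important both for reducing the programming difficulty of parallel applications on domestic (Chinese) many-core systems and for improving parallel program performance. A multi-mode parallel programming model with a unified architecture is proposed, comprising a heterogeneous-fusion accelerated computing model and an autonomous computing model programmed in a homogeneous manner. Based on this programming model, the Parallel C language was designed; it effectively describes the heterogeneous parallelism of domestic many-core systems. Compared with the MPI+X usage pattern on other many-core systems, both programming and system optimization take a global view, and the language has distinctive features in multi-level locality description, one-sided messaging, and compatibility with existing multi-core applications. The Parallel C compilation system was built on Open64; it fully supports the accelerated and autonomous computing models, and optimizations including data layout with automatic DMA, compiler-directed thread proxies, and topology-aware collective communication are proposed and implemented. Test data from micro benchmarks and real applications on the Sunway TaihuLight computer system show that the Parallel C language and compilation system have good performance and scalability and can effectively support large-scale applications.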

7.
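With the development of multi-core processors, hardware platforms now provide ample parallel capability, which places higher demands on parallel software programming. The traditional lock-based parallel programming model suffers from many difficulties. Borrowing the idea of transactions from databases, transactional memory was proposed to provide a synchronization mechanism with good programmability. The speed and efficiency of hardware transactional memory have made it a research hotspot. This paper describes the basic concepts, execution model, and programming interface of transactional memory; introduces the three core components of hardware transactional memory systems; compares two typical hardware transactional memory systems; and discusses the current hot topics and open problems in hardware transactional memory research. Finally, it introduces the platforms and test programs used in hardware transactional memory research.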

8.
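With the widespread use of multi-core chips, exploiting thread-level parallelism has become critical. Transactions let programmers achieve parallelism through a very simple multithreaded programming model, and transactional memory (TM) provides a simple way to guarantee the atomicity and isolation of transaction execution. This paper introduces the current mainstream transactional memory systems TCC, LogTM, and PTM, analyzes their system architectures and the corresponding operating system support, and on that basis reveals the relationship between the hardware design of transactional memory systems and operating system support. It concludes with some basic patterns and characteristics of TM development.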

9.
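Of the two classes of parallel computer systems, the distributed-memory (DM) type outperforms the shared-memory (SM) type, but developing software for it is particularly difficult for users. This paper describes in detail the principles and implementation of using software to turn a physically DM system into a logically SM system, that is, a distributed shared memory (DSM) system.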

10.
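Comparison of Message Passing and Shared Memory on the Dawning 1000A   Total citations: 12, self-citations: 2, citations by others: 12
Although distributed shared memory is easy to program, it is often considered inefficient, and this is especially true of distributed shared memory systems implemented entirely in software (also known as virtual shared memory systems). Taking the typical message-passing system PVM and the distributed shared memory system JIAJIA as examples, this paper describes the characteristics of the two parallel programming environments and compares the performance of the two systems on the Dawning 1000A using 7 applications. The experimental results show that JIAJIA's performance is comparable to PVM's, while parallel programming on JIAJIA is much simpler than on PVM.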

11.
This paper introduces hybrid address spaces as a fundamental design methodology for implementing scalable runtime systems on many-core architectures without hardware support for cache coherence. We use hybrid address spaces for an implementation of MapReduce, a programming model for large-scale data processing, and the implementation of a remote memory access (RMA) model. Both implementations are available on the Intel SCC and are portable to similar architectures. We present the design and implementation of HyMR, a MapReduce runtime system whereby different stages and the synchronization operations between them alternate between a distributed memory address space and a shared memory address space, to improve performance and scalability. We compare HyMR to a reference implementation and we find that HyMR improves performance by a factor of 1.71× over a set of representative MapReduce benchmarks. We also compare HyMR with Phoenix++, a state-of-the-art implementation for systems with hardware-managed cache coherence, in terms of scalability and sustained-to-peak data processing bandwidth, where HyMR demonstrates improvements of a factor of 3.1× and 3.2× respectively. We further evaluate our hybrid remote memory access (HyRMA) programming model and assess its performance to be superior to that of message passing.

12.
This paper presents a unified evaluation of the I/O behavior of a commercial clustered DSM machine, the HP Exemplar. Our study has the following objectives: 1) to evaluate the impact of different interacting system components, namely, architecture, operating system, and programming model, on the overall I/O behavior and identify possible performance bottlenecks, and 2) to provide hints to the users for achieving high out-of-box I/O throughput. We find that for the DSM machines that are built as a cluster of SMP nodes, integrated clustering of computing and I/O resources, both hardware and software, is not advantageous for two reasons. First, within an SMP node, the I/O bandwidth is often restricted by the performance of the peripheral components and cannot match the memory bandwidth. Second, since the I/O resources are shared as a global resource, the file-access costs become nonuniform and the I/O behavior of the entire system, in terms of both scalability and balance, degrades. We observe that the buffered I/O performance is determined not only by the I/O subsystem, but also by the programming model, global-shared memory subsystem, and data-communication mechanism. Moreover, programming-model support can be used effectively to overcome the performance constraints created by the architecture and operating system. For example, on the HP Exemplar, users can achieve high I/O throughput by using features of the programming model that balance the sharing and locality of the user buffers and file systems. Finally, we believe that at present, the I/O subsystems are being designed in isolation, and there is a need for mending the traditional memory-oriented design approach to address this problem.

13.
Parallel logic programming (PLP) systems are sophisticated examples of symbolic computing systems. PLP systems address problems such as allocating dynamic memory, scheduling irregular computations, and managing different types of implicit parallelism. Most PLP systems have been developed for bus-based architectures. However, the complexity of PLP systems and the large amount of data they process raise the question of whether logic programming systems can still achieve good performance on modern scalable architectures, such as distributed shared-memory (DSM) systems. In this work we use execution-driven simulation of a cache-coherent DSM architecture to investigate the performance of Andorra-I, a state-of-the-art PLP system, on a modern multiprocessor. The results of this simulation show that Andorra-I exhibits reasonable running time performance, but it does not scale well. Our detailed analysis of cache misses and their sources exposes several opportunities for improvements in Andorra-I. Based on this analysis, we modify Andorra-I using a set of simple techniques that lead to significantly better running time and scalability. These results suggest that Andorra-I can and should perform well on modern multiprocessors. Furthermore, as Andorra-I shares its main data structures with several PLP systems, we conclude that the methodology and techniques used in our work can greatly benefit these other PLP systems.

14.
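Distributed shared memory systems are developed to build a logically shared memory model on top of distributed memory. Traditional distributed shared memory systems have paid too little attention to how virtual address spaces for user processes should be constructed on top of the shared memory model. Only when distributed shared memory, memory management, and the file system are combined within the operating system can the capabilities of distributed shared memory be fully exploited. Based on this idea, the paper proposes an operating system model that implements distributed shared memory, analyzes a prototype implementation of the model, and discusses the prototype's strengths and weaknesses. By eliminating the process's logical address space in the operating system so that processes run directly on files, the model not only achieves distributed shared memory but also offers many advantages over many traditional operating systems and traditional distributed shared memory systems. The operating system tightly integrates distributed shared memory with the operating system's memory management and file system.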

15.
Distributed shared memory (DSM) systems have gained popular acceptance by combining the scalability and low cost of distributed systems with the ease of use of a single address space. Many new hardware DSM and software DSM systems have been proposed in recent years. In general, benchmarking is widely used to demonstrate the performance advantages of new systems. However, the common method used to summarize the measured results is the arithmetic mean of ratios, which is incorrect in some cases. Furthermore, many published papers only list a large amount of data without summarizing it effectively, which greatly confuses users. In fact, many users want a single number as a conclusion, which older summarizing techniques do not provide. Therefore, a new data-summarizing technique based on confidence intervals is proposed in this paper. The new technique includes two data-summarizing methods: (1) the paired confidence interval method; (2) the unpaired confidence interval method. With this new technique, it can be concluded that, at a given confidence level, one system is better than others. Four examples are shown to demonstrate the advantages of this new technique. Furthermore, with the help of the confidence level, it is proposed to standardize the benchmarks used for evaluating DSM systems so that convincing results can be obtained. In addition, the new summarizing technique is suitable not only for evaluating DSM systems, but also for evaluating other systems, such as memory systems and communication systems.
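The paired confidence interval method mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulation, so the code below assumes the textbook Student-t interval on per-benchmark differences; the function names `paired_ci95` and `verdict`, the 95% level, and the sample timings in the usage note are all hypothetical choices made for illustration.

```python
import math
import statistics

# Two-sided 95% Student-t critical values for 1..10 degrees of freedom
# (standard table values; the 95% level is an assumption of this sketch).
T95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
       6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def paired_ci95(times_a, times_b):
    """Return the 95% confidence interval for the mean of (a - b),
    where times_a[i] and times_b[i] come from the same benchmark."""
    if len(times_a) != len(times_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(times_a, times_b)]
    n = len(diffs)
    if n - 1 not in T95:
        raise ValueError("this sketch only tabulates 2..11 paired samples")
    mean = statistics.mean(diffs)
    # Half-width = t * (sample standard deviation / sqrt(n)).
    half = T95[n - 1] * statistics.stdev(diffs) / math.sqrt(n)
    return mean - half, mean + half


def verdict(ci):
    """Turn an interval on (time_a - time_b) into a single conclusion."""
    low, high = ci
    if low > 0:
        return "B is faster"
    if high < 0:
        return "A is faster"
    # The interval contains zero, so neither system can be called better.
    return "no significant difference"
```

For example, with hypothetical running times `[10, 12, 14, 16]` for system A and `[8, 9, 13, 12]` for system B on the same four benchmarks, `verdict(paired_ci95(...))` reports that B is faster, because the whole interval lies above zero. This gives the single-number style of conclusion the abstract asks for without averaging ratios.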

16.
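Techniques and Implementation of Distributed Shared Memory   Total citations: 3, self-citations: 0, citations by others: 3
Distributed shared memory combines the advantages of distributed-memory and shared-memory architectures, offering scalability, generality, and ease of use. This paper describes the problems that arise in implementing DSM systems and discusses the work done and measures taken in DSM systems on both the software and hardware sides.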

17.
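In multi-core processor chips, distributed shared memory (DSM) provides a unified, globally addressable memory space but introduces the overhead of translating virtual addresses to physical addresses, which hurts performance. We observe that the attribute (private or shared) of the data being processed is not fixed during the execution of a parallel program. Different data in a parallel program have different attributes, and even the same data may have different attributes in different execution phases. This paper first describes in detail a hybrid distributed shared memory space that supports globally addressed virtual-address access for shared data and fast physical-address access for private data. It then proposes a runtime partitioning technique for the hybrid distributed shared memory space. Based on the attributes of the data in a parallel program, the technique adjusts and partitions the distributed shared memory space at run time. When data are private, physical-address access speeds up data access; when data are shared, virtual-address access is retained. This reduces the address translation overhead over the whole run of the parallel program and improves system performance. Experimental results with real applications show that, compared with a traditional distributed shared memory space, the runtime-partitioned hybrid distributed shared memory space has a performance advantage, and the size of the improvement depends on the network size, the computation size, the mapping of the parallel program, and other factors. In our experiments, the performance improvement ranged from 6.98% to 13.14%.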

18.
Complex coupled multiphysics simulations are playing increasingly important roles in scientific and engineering applications such as fusion, combustion, and climate modeling. At the same time, extreme scales, increased levels of concurrency, and the advent of multicores are making programming of high‐end parallel computing systems on which these simulations run challenging. Although partitioned global address space (PGAS) languages attempt to address the problem by providing a shared memory abstraction for parallel processes within a single program, the PGAS model does not easily support data coupling across multiple heterogeneous programs, which is necessary for coupled multiphysics simulations. This paper explores how multiphysics‐coupled simulations can be supported by the PGAS programming model. Specifically, in this paper, we present the design and implementation of the XpressSpace programming system, which extends existing PGAS data sharing and data access models with a semantically specialized shared data space abstraction to enable data coupling across multiple independent PGAS executables. XpressSpace supports a global‐view style programming interface that is consistent with the PGAS memory model, and provides an efficient runtime system that can dynamically capture the data decomposition of global‐view data‐structures such as arrays, and enable fast exchange of these distributed data‐structures between coupled applications. In this paper, we also evaluate the performance and scalability of a prototype implementation of XpressSpace by using different coupling patterns extracted from real world multiphysics simulation scenarios, on the Jaguar Cray XT5 system at Oak Ridge National Laboratory. Copyright © 2013 John Wiley & Sons, Ltd.

19.
In this paper, we describe Teapot, a domain-specific language for writing cache coherence protocols. Cache coherence is of concern when parallel and distributed systems make local replicas of shared data to improve scalability and performance. In both distributed shared memory systems and distributed file systems, a coherence protocol maintains agreement among the replicated copies as the underlying data are modified by programs running on the system. Cache coherence protocols are notoriously difficult to implement, debug, and maintain. Moreover, protocols are not off-the-shelf, reusable components, because their details depend on the requirements of the system under consideration. The complexity of engineering coherence protocols can discourage users from experimenting with new, potentially more efficient protocols. We have designed and implemented Teapot, a domain-specific language that attempts to address this complexity. Teapot's language constructs, such as a state-centric control structure and continuations, are better suited to expressing protocol code than those of a typical systems programming language. Teapot also facilitates automatic verification of protocols, so hard-to-find protocol bugs, such as deadlocks, can be detected and fixed before encountering them in an actual execution. We describe the design rationale of Teapot, present an empirical evaluation of the language using two case studies, and relate the lessons that we learned in building a domain-specific language for systems programming.

20.
Large-scale distributed shared-memory multiprocessors (DSMs) provide a shared address space by physically distributing the memory among different processors. A fundamental DSM communication problem that significantly affects scalability is an increase in remote memory latency as the number of system nodes increases. Remote memory latency, caused by accessing a memory location in a processor other than the one originating the request, includes both communication latency and remote memory access latency over I/O and memory buses. The proposed architecture reduces remote memory access latency by increasing connectivity and maximizing channel availability for remote communication. It also provides efficient and fast unicast, multicast, and broadcast capabilities, using a combination of aggressively designed multiplexing techniques. Simulations show that this architecture provides excellent interconnect support for a highly scalable, high-bandwidth, low-latency network.
