1.

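Wang Panfeng, Du Yunfei, Fu Hongyi, Yang Xuejun, Zhou Haifang 《计算机科学》2009,36(3):21-25

Checkpointing is the most widely used fault-tolerance technique in high-performance computing, but its performance degrades rapidly as the number of processors grows. This paper proposes a new method for tolerating single-process failures in parallel computing: parallel recomputing. Its main feature is that it achieves low-overhead fault tolerance by using the computing power of redundant processors rather than the storage capacity of redundant disks. The paper also proposes an optimization that combines parallel recomputing with checkpointing to further reduce fault-tolerance overhead, and shows by example how to develop a parallel program based on parallel recomputing and its optimized variant. Finally, the method is evaluated experimentally. The results show that, as the number of processors grows, the overhead of parallel recomputing is lower than that of checkpointing, and the optimized method performs better than plain parallel recomputing.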
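The core idea of parallel recomputing, spreading a failed process's lost work over the surviving processes, can be illustrated with a toy model. This is a minimal sketch and not the authors' implementation: the function names, the round-robin split, and the one-item-per-step cost model are all assumptions made for illustration.

```python
def split_workload(items, survivors):
    """Divide a failed process's work items across surviving ranks.

    Round-robin assignment is an illustrative assumption; the abstract
    does not describe how the lost work is actually partitioned.
    """
    if not survivors:
        raise ValueError("at least one surviving process is required")
    shares = {rank: [] for rank in survivors}
    for i, item in enumerate(items):
        shares[survivors[i % len(survivors)]].append(item)
    return shares


def recovery_steps(lost_items, n_survivors):
    """Steps to redo the lost work if every survivor handles one item per step."""
    if n_survivors < 1:
        raise ValueError("at least one surviving process is required")
    return -(-lost_items // n_survivors)  # ceiling division
```

Under this toy cost model, `recovery_steps(10, 1)` gives 10 steps for a single replacement processor, while `recovery_steps(10, 3)` gives 4 steps when three survivors share the work.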

2.

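An OpenMP Fault-Tolerance Mechanism Implemented with Parallel Recomputing (total citations: 1; self-citations: 0; citations by others: 1)

Fu Hongyi, Ding Yan, Song Wei, Yang Xuejun 《软件学报》2012,23(2):411-427

Fault recovery based on parallel recomputing distributes the recovery computation across the nodes that have not failed and executes it in parallel, which significantly shortens recomputation time, effectively reduces recovery overhead, and improves the fault-tolerance performance of parallel programs. Building on this recovery technique, the paper proposes PR-OMP, a fault-tolerance mechanism for OpenMP parallel programs that effectively solves problems such as segmented recomputation and redistribution of the recomputation load. It also extends traditional compiler data-flow analysis into a data-flow analysis technique for OpenMP parallel programs and uses that technique to compute and optimize the state-saving overhead. The compiler tool GiFT-OMP was designed and implemented to support PR-OMP. Experiments demonstrate the effectiveness of the PR-OMP mechanism and its supporting tool, and evaluate and analyze its performance and scalability.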

3.

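Design and Implementation of a Parallel Algorithm for the Global (z) Bicubic Numerical Model

Zhao Jun, Wu Jianping, Song Junqiang, Zhang Lei 《计算机应用研究》2013,30(5):1337-1339

This paper studies a parallel algorithm for the bicubic numerical weather prediction model. Using one-dimensional domain decomposition and borrowing from the block-checkerboard-partitioned matrix transpose algorithm, the authors design and implement a data-transposition communication algorithm, and overlap computation with communication to reduce the impact of communication time on parallel efficiency. The resulting parallel algorithm for the bicubic numerical weather prediction model was tested and evaluated on a cluster system. The results show that the parallel algorithm achieves fairly high parallel efficiency and that both the data-transposition communication algorithm and the computation-communication overlap work well.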

4.

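A Grid Resource Allocation Method Based on a Parallel Genetic Algorithm

Li Huixian, Cheng Chuntian 《计算机工程》2006,32(5):175-177,180

This paper proposes a grid resource allocation method based on a parallel genetic algorithm, designed using the coarse-grained model. To evaluate its performance, the parallel algorithm and a serial genetic algorithm were both implemented on a PC cluster. A comparison of the two algorithms' execution times and solution quality shows that the parallel algorithm greatly improves both the speed and the quality of the solution, making it an efficient resource allocation method.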
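In a coarse-grained (island) parallel genetic algorithm, subpopulations evolve independently and occasionally exchange individuals. The migration step can be sketched as follows. The abstract does not describe the migration policy, so the ring topology, the best-replaces-worst rule, and the name `migrate_ring` are assumptions made for illustration only.

```python
def migrate_ring(islands, fitness):
    """Copy each island's best individual over the worst one on the next island.

    `islands` is a list of lists (one subpopulation per island) and is
    modified in place; `fitness` maps an individual to a number, higher
    being better. Island i receives from island i - 1 (ring topology).
    """
    # Select all emigrants first so that migration within one round
    # does not depend on the order in which islands are visited.
    best = [max(pop, key=fitness) for pop in islands]
    for i, pop in enumerate(islands):
        incoming = best[(i - 1) % len(islands)]
        worst_idx = min(range(len(pop)), key=lambda j: fitness(pop[j]))
        pop[worst_idx] = incoming
    return islands
```

In a full algorithm each island would run its own selection, crossover, and mutation for several generations between calls to this step, typically on separate processors.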

5.

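A Survey of Hardware Fault-Tolerance Techniques for Large-Scale Parallel Computer Systems

Fu Hongyi, Yang Xuejun 《计算机工程与科学》2010,32(10):38-43

Fault tolerance in computer systems is a problem that cannot be ignored. In recent years, as system architectures have grown more complex, semiconductor manufacturing processes have advanced, line widths have shrunk, and integration density has increased, power consumption and reliability have become prominent problems everywhere from user desktop systems to distributed computing environments and large-scale parallel computer systems. This paper first introduces the basic concepts, methods, and ideas of computer system reliability and fault tolerance, then reviews representative hardware fault-detection and fault-recovery techniques of recent years, with emphasis on fault-tolerance methods proposed for large-scale parallel computer systems. It also presents an optimized fault-recovery technique proposed in the authors' earlier work, called the fault-tolerant parallel algorithm. Finally, it summarizes some possible research directions.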

6.

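Satisfactory Fault-Tolerant Control with Delay Memory for Nonlinear Time-Delay Systems

Bo Cuimei, Feng Kangkang, Zhang Shi 《控制工程》2012,19(1):36-40

For a class of nonlinear time-delay systems described by T-S fuzzy models, this paper studies the design of a robust H∞ fault-tolerant controller with delay memory under a general actuator failure model. For arbitrary continuous actuator failure modes, a state-feedback controller with memory is designed using the parallel distributed compensation principle, and a robust stabilization criterion is given for the nonlinear time-delay system under actuator failures. A design method and design steps are then given for a satisfactory fault-tolerant controller under an H∞ performance constraint. The proposed state-feedback control method with delay memory ensures that, when actuators fail, the closed-loop system is not only asymptotically stable but also has a degree of disturbance rejection, and the design is far less conservative than memoryless controller designs. A simulation example verifies the effectiveness of the robust fault-tolerant control strategy.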

7.

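A Parallel CMD_Join Algorithm for Parallel Databases

Li Jianzhong, Du Wei 《软件学报》1998,(4)

The way a parallel database is distributed among multiple processors (the data distribution method, for short) strongly affects the performance of parallel data operation algorithms. Algorithms designed to exploit the characteristics of the data distribution method can be very efficient. This paper studies how to design parallel data operation algorithms that fully exploit those characteristics, and proposes a parallel CMD_Join algorithm based on the CMD multidimensional data distribution method. Theoretical analysis and experimental results show that the parallel CMD_Join algorithm is more efficient than other parallel join algorithms.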

8.

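A Task-Parallel Programming Model with Fault-Tolerance Support

Wang Yizhuo, Chen Xu, Ji Weixing, Su Yan, Wang Xiaojun, Shi Feng 《软件学报》2016,27(7):1789-1804

Task-parallel programming models have become the mainstream of parallel programming; they improve the performance of parallel computer systems by exploiting task parallelism. This paper proposes a task-parallel programming model that supports fault tolerance, integrating fault-tolerance techniques into the model to improve system reliability while preserving performance. The model takes the task as the basic unit of scheduling, execution, error detection, and recovery, and provides fault tolerance at the application level. It uses a Buffer-Commit computation model to detect and recover from transient errors, application-level diskless checkpointing to recover from permanent errors of the node-failure type, and a fault-tolerant work-stealing task scheduling strategy to achieve dynamic load balancing. Experimental results show that the model provides fault tolerance for hardware errors at low performance overhead.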
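Diskless checkpointing keeps checkpoint data in the memory of other nodes instead of writing it to disk. The abstract does not say which encoding this model uses; one common choice is XOR parity, sketched below purely as an assumption. The function names are hypothetical, and the sketch handles the loss of exactly one equally sized block.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_lost(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])
```

For example, if three nodes hold checkpoints `a`, `b`, and `c` and a fourth node holds `xor_blocks([a, b, c])`, then after the node holding `b` fails, `recover_lost([a, c], parity)` reproduces `b`.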

9.

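A Parallel CMD-Join Algorithm for Parallel Databases (total citations: 3; self-citations: 1; citations by others: 3)

Li Jianzhong, Du Wei 《软件学报》1998,9(4):256-262

The way a parallel database is distributed among multiple processors (the data distribution method, for short) strongly affects the performance of parallel data operation algorithms. Algorithms designed to exploit the characteristics of the data distribution method can be very efficient. This paper studies how to design parallel data operation algorithms that fully exploit those characteristics, and proposes a parallel CMD-Join algorithm based on the CMD multidimensional data distribution method. Theoretical analysis and experimental results show that the parallel CMD-Join algorithm is more efficient than other parallel join algorithms.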

10.

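A Discussion of Parallel Algorithm Design Starting from the Object Itself

Wang Huajun 《福建电脑》2006,(10):48-48,79

Parallel processing is the key technology of parallel computers. It covers parallel architectures, parallel algorithms, parallel operating systems, parallel languages and their compilers, and so on, with parallel algorithm design being the most fundamental and important part. This paper discusses one of the three approaches to parallel algorithm design: design that starts from the object itself.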

11.

FTPA: Supporting Fault-Tolerant Parallel Computing through Parallel Recomputing

Yang Xuejun, Du Yunfei, Wang Panfeng, Fu Hongyi, Jia Jia 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(10):1471-1486

As the size of large-scale computer systems increases, their mean time between failures is becoming significantly shorter than the execution time of many current scientific applications. To complete the execution of scientific applications, they must tolerate hardware failures. Conventional rollback-recovery protocols redo the computation of the crashed process since the last checkpoint on a single processor. As a result, the recovery time of all protocols is no less than the time between the last checkpoint and the crash. In this paper, we propose a new application-level fault-tolerant approach for parallel applications called the Fault-Tolerant Parallel Algorithm (FTPA), which provides fast self-recovery. When fail-stop failures occur and are detected, all surviving processes recompute the workload of failed processes in parallel. FTPA, however, requires the user to be involved in fault tolerance. In order to ease the FTPA implementation, we developed Get it Fault-Tolerant (GiFT), a source-to-source precompiler tool to automate the FTPA implementation. We evaluate the performance of FTPA with parallel matrix multiplication and five kernels of NAS Parallel Benchmarks on a cluster system with 1,024 CPUs. The experimental results show that the performance of FTPA is better than the performance of the traditional checkpointing approach.

12.

Performance measurement, visualization and modeling of parallel and distributed programs using the AIMS toolkit

Jerry Yan, Sekhar Sarukkai, Pankaj Mehra 《Software》1995,25(4):429-461

Writing large-scale parallel and distributed scientific applications that make optimum use of the multiprocessor is a challenging problem. Typically, computational resources are underused due to performance failures in the application being executed. Performance-tuning tools are essential for exposing these performance failures and for suggesting ways to improve program performance. In this paper, we first address fundamental issues in building useful performance-tuning tools and then describe our experience with the AIMS toolkit for tuning parallel and distributed programs on a variety of platforms. AIMS supports source-code instrumentation, run-time monitoring, graphical execution profiles, performance indices and automated modeling techniques as ways to expose performance problems of programs. Using several examples representing a broad range of scientific applications, we illustrate AIMS' effectiveness in exposing performance problems in parallel and distributed programs.

13.

Optimal design of wireless sensor network for ionising radiation detection using Neyman–Pearson criteria

《International Journal of Parallel, Emergent and Distributed Systems》2013,28(1):75-94

This paper deals with the optimal design of a wireless sensor network (WSN) with parallel configuration, using the Neyman–Pearson methodology, for monitoring and detecting the possible presence of ionising radiation in the vicinity of a nuclear power plant. We derive the design equations for the WSN with parallel configuration, focusing only on the signal processing task under certain assumptions. We present the detection performance of an individual node and of a network of sensor nodes under two different operating options, for different network parameters and sensor characteristics, to understand the design trade-offs between sensor network parameters and performance measures. We also assess the robustness of the designed network against node failures.

14.

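Robust Guaranteed-Cost Control of Uncertain Nonlinear Systems with Actuator Failures

Zhang Gang, Wang Zhiquan, Han Xianglan 《信息与控制》2006,35(4):474-479

For a class of uncertain nonlinear systems with actuator failures, this paper studies the design of a robust guaranteed-cost controller subject to an L2-gain disturbance-attenuation performance constraint. A more practical and more general actuator failure model is proposed, and the concept and properties of a robust guaranteed-cost control system with L2-gain disturbance attenuation are given. Using the HJI inequality approach, sufficient conditions are derived under which the faulty closed-loop system is asymptotically stable, has a prescribed level of disturbance rejection, and has an upper bound on its cost function. A simulation example verifies the results.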

15.

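An Automated Instrumentation and Monitoring System for Multicomputers

Su Ming, Song Zongyu, Wang Hua 《计算机工程与应用》2002,38(4):79-82

When designing large-scale parallel applications, making optimal use of the multiprocessor is a major challenge for programmers. In general, computing resources are underused because of performance defects in the application at run time. There is therefore a pressing need for "performance debugging" of applications: once the program is correct, exposing these defects and fine-tuning the program to improve its performance. This article introduces a software toolkit, the Automated Instrumentation and Monitoring System (AIMS), which integrates program instrumentation, run-time monitoring, and performance analysis, and supports performance evaluation of parallel applications on multiprocessors. The article first discusses some fundamental issues in building performance-tuning tools, then describes in detail the architecture of AIMS and the experience gained in developing performance-tuning tools with the AIMS toolkit, and finally uses two examples to describe in detail the process of performance tuning with AIMS.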

16.

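Research and Implementation of High-Availability Industrial Ethernet Technology

Jiang Liqun, Xu Aidong, Song Yan, Wang Jing 《计算机工程》2009,35(11):260-262

Fieldbus control systems demand high reliability and stability, which can limit the use of industrial Ethernet. To address this, a parallel redundant link is added to the existing network to improve communication availability and reduce the impact of communication-link failures on system operation. Building on the EPA communication protocol stack, a parallel network redundancy solution is designed, covering the protocol stack structure, the frame structure, and the communication procedure. A performance analysis of the solution is given.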

17.

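Conditional Strong Matching Preclusion for k-ary n-cubes

Feng Kai 《计算机应用》2017,37(9):2454-2456

To measure how well a k-ary n-cube preserves its matchability when faults occur, this paper analyzes the construction of the fault sets that leave a k-ary n-cube with neither a perfect matching nor an almost-perfect matching under the conditional fault model, and studies the minimum number of faults needed to make the k-ary n-cube unmatchable under conditional faults. For even k ≥ 4 and n ≥ 2, the exact value of this fault-tolerance parameter is obtained and all corresponding minimum fault sets are characterized; for odd k ≥ 3 and n ≥ 2, an attainable lower bound and an attainable upper bound on the parameter are given. The results show that a parallel computer system whose underlying interconnection topology is a k-ary n-cube with odd k preserves its matchability well under conditional faults. Moreover, such a system remains matchable when the number of faults does not exceed 2n, and at most 4n-3 faulty elements are needed to make it unmatchable.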
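For readers unfamiliar with the topology, the sketch below enumerates the neighbors of a node in a k-ary n-cube under the standard definition: nodes are n-tuples over {0, ..., k-1}, and two nodes are adjacent when they differ by ±1 modulo k in exactly one coordinate. The function name `neighbors` is an assumption for illustration. The code shows only the topology and does not implement the paper's matching-preclusion analysis.

```python
def neighbors(node, k):
    """Return the neighbors of `node` (a tuple of n digits in 0..k-1).

    Each coordinate contributes the nodes reached by +1 and -1 modulo k,
    so for k >= 3 every node has exactly 2n distinct neighbors.
    """
    result = []
    for i, x in enumerate(node):
        for step in (1, -1):
            neighbor = list(node)
            neighbor[i] = (x + step) % k
            result.append(tuple(neighbor))
    return result
```

For instance, in the 3-ary 2-cube the node `(0, 0)` has the four neighbors `(1, 0)`, `(2, 0)`, `(0, 1)`, and `(0, 2)`.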

18.

An adaptive control scheme for systems with unknown actuator failures (total citations: 1; self-citations: 0; citations by others: 1)

Gang Tao, Shuhao Chen 《Automatica》2002,38(6):1027-1034

A state feedback output tracking adaptive control scheme is developed for plants with actuator failures characterized by the failure pattern that some inputs are stuck at some unknown fixed values at unknown time instants. New controller parametrization and adaptive law are developed under some relaxed system conditions. All closed-loop signals are bounded and the plant output tracks a given reference output asymptotically, despite the uncertainties in actuator failures and plant parameters. Simulation results verify the desired adaptive control system performance in the presence of actuator failures.

19.

Organization of Parallel Query Processing in Multiprocessor Database Machines with Hierarchical Architecture

Sokolinsky L. B. 《Programming and Computer Software》2001,27(6):297-308

The development of database systems with hierarchical hardware architecture is currently a promising trend in the field of parallel database machines. Hierarchical architectures have been suggested with the aim of combining the advantages of shared-nothing architectures and architectures with shared memory and disks. A commonly accepted way of constructing hierarchical systems is to combine shared-memory (shared-everything) clusters in a single system without shared resources. However, such architectures cannot ensure data accessibility under hardware failures at the processor cluster level, which limits their use in systems with high fault-tolerance requirements. In this paper, an alternative approach to the construction of hierarchical systems is suggested. In accordance with this approach, the system is constructed as an assembly of processor clusters with shared disks, with each cluster being a two-level multiprocessor structure with a standard strongly connected topology of interprocessor connections. A stream model for the organization of parallel query processing in systems with the suggested hierarchical architecture is described. This model has been implemented in a prototype parallel database management system, Omega, designed for the Russian multiprocessor computational systems MBC-100/1000. Our experiments show that the total performance of the processor clusters in the Omega system is comparable with that of processor clusters with shared resources, even in the case of great data skew. At the same time, the clusters of the Omega system are capable of ensuring a higher degree of data availability than clusters with shared-memory architectures.

20.

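A Scalable Triple Modular Redundancy Fault-Tolerance Mechanism for Large-Scale MPI Parallel Computing

Wang Zhiyuan, Yang Xuejun, Zhou Yun 《软件学报》2012,23(4):1022-1035

As system scale grows, the performance of parallel computing keeps improving, but reliability keeps declining, so some fault-tolerance mechanism is needed to tolerate or recover from hardware failures and data errors. The commonly used mechanisms, Checkpoint/Restart and N-modular redundancy, both introduce extra overhead that to some degree limits the scalability of parallel computing. With demand for high-performance computing still growing, designing scalable fault-tolerance mechanisms is therefore especially urgent and important. Taking triple modular redundancy (TMR) as a typical case, the paper describes how traditional TMR is implemented for large-scale MPI parallel computing, analyzes the practical problems the mechanism faces, and shows that traditional TMR limits the scaling of parallel computing. To address these problems, it designs scalable triple modular redundancy (STMR) and verifies its effectiveness and scalability. The mechanism handles not only the fail-stop failures that Checkpoint/Restart targets but also the vast majority of data errors that hardware cannot detect directly. Finally, a simulation using BlueGene/L system parameters predicts how the scalability of parallel computing changes as system size grows under TMR and under STMR. The results further confirm that STMR is a scalable fault-tolerance mechanism.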
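The voting step at the heart of triple modular redundancy can be sketched in a few lines. This shows only generic majority voting over three replica results. It is not the paper's STMR design or its MPI implementation, and the function name `tmr_vote` is an assumption for illustration.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three replicas.

    Results must be hashable. If all three disagree, no majority exists
    and the error cannot be masked, so a ValueError is raised.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three replicas disagree")
    return value
```

A single corrupted replica is masked: `tmr_vote(42, 42, 7)` returns `42`. Voting of this kind catches silent data errors that a fail-stop detector would miss, at the cost of running every computation three times.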