1.

丁凯高扬《计算机工程与设计》2004,25(10):1778-1780

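Traditional distributed file systems cannot provide a strict single system image (SSI) for cluster systems, which makes cluster management complex. CSFS (Cluster Single File System), a Linux-based cluster file system, provides a single system image and effectively solves this problem. Based on an analysis of the characteristics of the Linux virtual file system (VFS), CSFS is added as a file system on top of the VFS, improving the availability and manageability of the cluster system.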

2.

Design and Implementation of MFS, a Cluster Streaming Media File System

庞丽萍蒙廷友石柯程斌唐维《计算机工程与科学》2005,27(6):32-34

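This article describes the design and implementation of the cluster streaming media file system (MFS) on the WanLan cluster video server. MFS is a distributed streaming media file system that supports the MPEG file format; it consists of MFS clients, a management node, data nodes, and metadata service nodes. MFS provides a single logical system image, high availability of data and metadata, and automatic system configuration.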

3.

Key Technologies of Single System Image in NAS Clusters

鲁宏伟李悦《计算机应用研究》2003,20(7):108-109,112

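NAS (Network Attached Storage) and SAN (Storage Area Network) are the mainstream technologies in data storage today, but both have shortcomings. This paper introduces the concept of, and trends in, building NAS clusters and discusses the key technologies of their single system image.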

4.

Real-Time Cluster Node Control Based on the User Datagram Protocol (UDP) and Its Implementation  Cited: 4 total, 1 self-citation, 4 by others

向建军左继章白欣《计算机工程与应用》2002,38(19):48-50

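Cluster computing is a current research focus in high-performance parallel computer systems. Using the User Datagram Protocol (UDP), this article controls the mutually trusting nodes of a cluster in real time, implements a single system image for the real-time cluster, and builds a real-time cluster computer system from commodity off-the-shelf components, extending cluster computers to real-time applications.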

5.

Design and Implementation of an Image-Based Cluster Deployment System

董小社孙发龙李纪云胡雷均《计算机工程》2005,31(24):132-134,168

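Using image-based installation and Intel's PXE protocol, the authors designed and implemented a Linux cluster deployment system based on a distributed network model that can rapidly deploy large-scale clusters. The system scales well and offers a single point of control. Practical use shows that it simplifies cluster installation and maintenance and reduces the system administrator's workload.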

6.

An SSI-Based Remote Cluster Management System

童端韩忠愿苏杭丽《计算机工程》2008,34(20):34-36

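Owing to inherent characteristics of cluster architecture, cluster management problems are becoming increasingly prominent. Early cluster systems were managed from the command line, an approach that suffered from incomplete functionality, a monolithic structure, poor usability, and no support for remote management. This paper analyzes the functional requirements of cluster management software and the related technologies, and designs and implements an SSI-based remote cluster management system. The system uses a standardized modular design; its functions can be flexibly configured, it is readily extensible, and it implements a fairly complete single system image, providing simple and efficient management. The system is tested and evaluated, and directions for future research are proposed.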

7.

A Performance Evaluation Method for Cluster Load Balancing

刘楠翁楚良李明禄《计算机工程与设计》2011,32(10):3407-3409,3456

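As single system image (SSI) clusters have developed, the easy-to-use, high-performance, and highly available computing environment they provide has become increasingly attractive to users. However, existing benchmarks for SSI clusters distribute the load statically and evenly across processors when measuring performance, ignoring the system's dynamic load-balancing behavior. To address this, the paper proposes a performance evaluation method for load-balancing clusters that uses an unbalanced tree search algorithm to generate dynamic load and evaluates performance in a realistic load-balancing environment from the running time of load units.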

8.

Design of a Cost-Effective Network Disaster Recovery and High-Availability Cluster  Cited: 1 total, 0 self-citations, 1 by others

妙全兴武海鹰《微机发展》2003,13(9):40-42,45

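This paper discusses the concepts, principles, and state of research of network disaster recovery and high-availability cluster technologies, designs a combined network disaster recovery and high-availability cluster system, and examines several key technologies for implementing it. The aim is to explore a way of using cluster and disaster recovery technologies to build a cost-effective cluster system that offers both high availability and disaster recovery. The results show that the system has a simple, sound structure, runs stably, and has considerable value for application and wider adoption.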

9.

Research and Application of High-Availability Cluster Technology

汪筱红《数字社区&智能家居》2011,(20)

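Cluster technology is a relatively new technology. High availability requires that, when a hardware or software system fails, the data on that system is not lost and the application system returns to normal operation in as short a time as possible. Based on a high-availability cluster application example, this article explains how high-availability cluster technology is implemented and what its technical advantages are. Adopting cluster technology greatly improved system availability and achieved good results.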

10.

Research on an Open Calculating Management System

冯萍刘君瑞孙蓬《计算机科学》2004,31(Z1):184-186

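For large-scale scientific computing, high-availability systems remain a key technology still to be solved. This article introduces the Open Calculating Management System (OCMS). The OCMS architecture is a collection of computing services based on Web technology and conforming to the Open Grid Services Architecture (OGSA). The cluster systems in the grid nodes are implemented with a BrowserS/Server/ServerS architecture, whose services support multi-user operation. The Java-based BrowserS/Server/ServerS architecture provides a single system image. Finally, the article reports the performance of mesoscale atmospheric numerical model computations run on the OCMS system.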

11.

Research and Implementation of Cluster Fault Monitoring on the Hadoop Platform

朱娜娜《软件》2013,(12):73-77

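Cloud platforms built with Hadoop are widely used, for example at Amazon, Yahoo, and Facebook. Cluster stability and reliability strongly affect the quality of service of a cloud platform, and as enterprise demands grow for real-time production monitoring, mass storage, and scientific analysis and decision-making, cluster fault monitoring becomes increasingly important. PDM (Integrated Parallel Mining), motivated by China Mobile's business intelligence requirements, aims to provide efficient, accurate, and convenient analysis services for massive data, so being able to monitor the performance of its Hadoop cluster and raise fault alarms is very important. Ganglia and Nagios each have strengths in cluster fault monitoring; by combining those strengths, a relatively complete cluster fault monitoring platform was designed for the enterprise project.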

12.

A Checkpointing System Suited to Large-Scale Cluster Parallel Computing  Cited: 5 total, 1 self-citation, 4 by others

周恩强卢宇彤沈志宇《计算机研究与发展》2005,42(6):987-992

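Distributed checkpointing is an important means of fault tolerance for large-scale parallel computing systems. Protocol overhead and checkpoint image storage are the two main bottlenecks limiting the scalability of parallel checkpointing systems. Taking into account the execution characteristics of parallel applications and the architecture of high-performance clusters, the C system uses dynamic virtual connections and distributed storage of checkpoint images, respectively, to reduce the overhead of coordinated checkpointing and improve the scalability of the checkpointing system. Preliminary test results show that the C system's design strategy is suitable for fault tolerance in large-scale parallel computing.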

13.

Fault-Aware Runtime Strategies for High-Performance Computing

Yawei Li Zhiling Lan Gujrati P. Xian-He Sun 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(4):460-473

As the scale of parallel systems continues to grow, fault management of these systems is becoming a critical challenge. While existing research mainly focuses on developing or improving fault tolerance techniques, a number of key issues remain open. In this paper, we propose runtime strategies for spare node allocation and job rescheduling in response to failure prediction. These strategies, together with failure predictor and fault tolerance techniques, construct a runtime system called FARS (Fault-Aware Runtime System). In particular, we propose a 0-1 knapsack model and demonstrate its flexibility and effectiveness for reallocating running jobs to avoid failures. Experiments, by means of synthetic data and real traces from production systems, show that FARS has the potential to significantly improve system productivity (i.e., performance and reliability).
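
The 0-1 knapsack formulation mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the model's weights, values, or constraints, so everything below is assumed for illustration: each running job is a hypothetical `(name, nodes_needed, value)` triple, and `spare_nodes` is the capacity available for reallocation. This is a generic dynamic-programming knapsack, not the FARS algorithm itself.

```python
from typing import List, Tuple

Job = Tuple[str, int, int]  # (name, nodes_needed, value): hypothetical fields


def select_jobs(jobs: List[Job], spare_nodes: int) -> Tuple[int, List[str]]:
    """Generic 0-1 knapsack: pick jobs to reallocate onto spare nodes.

    Returns the maximum total value and the names of the chosen jobs.
    """
    n = len(jobs)
    # best[i][c] = maximum value using the first i jobs with c spare nodes
    best = [[0] * (spare_nodes + 1) for _ in range(n + 1)]
    for i, (_, need, value) in enumerate(jobs, start=1):
        for c in range(spare_nodes + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip job i
            if need <= c:  # option 2: reallocate job i if it fits
                best[i][c] = max(best[i][c], best[i - 1][c - need] + value)

    # Walk the table backwards to recover which jobs were chosen.
    chosen: List[str] = []
    c = spare_nodes
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            name, need, _ = jobs[i - 1]
            chosen.append(name)
            c -= need
    chosen.reverse()
    return best[n][spare_nodes], chosen


if __name__ == "__main__":
    demo = [("a", 2, 3), ("b", 3, 4), ("c", 4, 5), ("d", 5, 6)]
    print(select_jobs(demo, spare_nodes=5))
```

Each job is either moved in full or left in place, which is what makes the problem 0-1 rather than fractional; the table costs time and memory proportional to the number of jobs times the number of spare nodes.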

14.

Application of Neural Networks in Reliability Analysis of Fault-Tolerant Systems

胡华平戴葵金士尧王清元《计算机工程与科学》2000,22(3):10-13

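Neural networks offer massive parallelism and collective computation, together with strong adaptive, self-learning, fault-tolerance, and generalization capabilities, and have therefore been applied in many areas of reliability engineering. This paper applies a cyclic feedforward neural network to the reliability analysis of fault-tolerant systems, simplifying their reliability analysis and design. Moreover, because neural networks can learn and adapt, they are more competitive when analyzing the reliability of more complex fault-tolerant systems.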

15.

CEFT: A cost-effective, fault-tolerant parallel virtual file system

《Journal of Parallel and Distributed Computing》2006,66(2):291-306

The vulnerability of computer nodes due to component failures is a critical issue for cluster-based file systems. This paper studies the development and deployment of mirroring in cluster-based parallel virtual file systems to provide fault tolerance and analyzes the tradeoffs between the performance and the reliability in the mirroring scheme. It presents the design and implementation of CEFT, a scalable RAID-10 style file system based on PVFS, and proposes four novel mirroring protocols depending on whether the mirroring operations are server-driven or client-driven, whether they are asynchronous or synchronous. The comparisons of their write performances, measured in a real cluster, and their reliability and availability, obtained through analytical modeling, show that these protocols strike different tradeoffs between the reliability and performance. Protocols with higher peak write performance are less reliable than those with lower peak write performance, and vice versa. A hybrid protocol is proposed to optimize this tradeoff.

16.

On the Use of Cluster Technology in Office Automation Systems

李宗慧《信息安全与技术》2011,(11):88-89

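As network users increasingly depend on networked office systems, those systems must meet stricter reliability requirements and ensure the security of the system and its data. Cluster technology keeps the system running when a single server's operating system or hardware fails, thereby improving overall system availability. This paper discusses the use of cluster technology in office automation systems.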

17.

Research and Practice of Forward Recovery Techniques in Distributed Real-Time Systems

文梅李宏亮《计算机工程与科学》1999,21(5):28-31

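This paper discusses forward recovery techniques in distributed real-time systems, focusing on forward recovery in real-time, highly available duplex systems.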

18.

Design of a fault tolerant control system incorporating reliability analysis and dynamic behaviour constraints

F. Guenab P. Weber Y.M. Zhang 《International journal of systems science》2013,44(1):219-233

In highly automated aerospace and industrial systems where maintenance and repair cannot be carried out immediately, it is crucial to design control systems capable of ensuring desired performance when taking into account the occurrence of faults/failures on a plant/process; such a control technique is referred to as fault tolerant control (FTC). The control system possessing such fault tolerance capability is referred to as a fault tolerant control system (FTCS). The objective of FTC is to maintain system stability and current performance of the system close to the desired performance in the presence of system component and/or instrument faults; in certain circumstances a reduced performance may be acceptable. Various control design methods have been developed in the literature with the target to modify or accommodate baseline controllers which were originally designed for systems operating under fault-free conditions. The main objective of this article is to develop a novel FTCS design method, which incorporates both reliability and dynamic performance of the faulty system in the design of a FTCS. Once a fault has been detected and isolated, the reconfiguration strategy proposed in this article will find possible structures of the faulty system that best preserve pre-specified performances based on on-line calculated system reliability and associated costs. The new reconfigured controller gains will also be synthesised and finally the optimal structure that has the ‘best’ control performance with the highest reliability will be chosen for control reconfiguration. The effectiveness of this work is illustrated by a heating system benchmark used in a European project entitled intelligent Fault Tolerant Control in Integrated Systems (IFATIS EU-IST-2001-32122).

19.

Reliability growth modeling and optimal release policy under fuzzy environment of an N-version programming system incorporating the effect of fault removal efficiency

P. K. Kapur Anshu Gupta P.C. Jha 《国际自动化与计算杂志》2007,4(4):369-379

Failure of a safety critical system can lead to big losses. Very high software reliability is required for automating the working of systems such as aircraft controller and nuclear reactor controller software systems. Fault-tolerant softwares are used to increase the overall reliability of software systems. Fault tolerance is achieved using the fault-tolerant schemes such as fault recovery (recovery block scheme), fault masking (N-version programming (NVP)) or a combination of both (Hybrid scheme). These softwares incorporate the ability of system survival even on a failure. Many researchers in the field of software engineering have done excellent work to study the reliability of fault-tolerant systems. Most of them consider the stable system reliability. Few attempts have been made in reliability modeling to study the reliability growth for an NVP system. Recently, a model was proposed to analyze the reliability growth of an NVP system incorporating the effect of fault removal efficiency. In this model, a proportion of the number of failures is assumed to be a measure of fault generation while an appropriate measure of fault generation should be the proportion of faults removed. In this paper, we first propose a testing efficiency model incorporating the effect of imperfect fault debugging and error generation. Using this model, a software reliability growth model (SRGM) is developed to model the reliability growth of an NVP system. The proposed model is useful for practical applications and can provide the measures of debugging effectiveness and additional workload or skilled professional required. It is very important for a developer to determine the optimal release time of the software to improve its performance in terms of competition and cost. In this paper, we also formulate the optimal software release time problem for a 3VP system under fuzzy environment and discuss a fuzzy optimization technique for solving the problem with a numerical illustration.

20.

Reliability-aware performance model for optimal GPU-enabled cluster environment

Supada Laosooksathit Raja Nassar Chokchai Leangsuksun Mihaela Paun 《The Journal of supercomputing》2014,68(3):1630-1651

Given that the reliability of a very large-scaled system is inversely related to the number of computing elements, fault tolerance has become a major concern in high performance computing including the most recent deployments with graphic processing units (GPUs). Many fault tolerance strategies, such as the checkpoint/restart mechanism, have been studied to mitigate failures within such systems. However, fault tolerance mechanisms generate additional costs and these may cause a significant performance drop if they are not used carefully. This paper presents a novel fault tolerance scheduling model that explores the interplay between the GPGPU application performance and the reliability of a large GPU system. This work focuses on the checkpoint scheduling model that aims to minimize fault tolerance costs. Additionally, a GPU performance analysis is conducted. Furthermore, the effect of a checkpoint/restart mechanism on the application performance is thoroughly studied and discussed.
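
The trade-off this abstract describes, between checkpoint cost and lost work, can be made concrete with a small sketch. The abstract does not state the paper's scheduling model, so the code below substitutes two textbook approximations and labels them as assumptions: system MTBF taken as node MTBF divided by node count (independent, identical nodes with exponentially distributed failures), and Young's first-order estimate of the checkpoint interval, sqrt(2 × checkpoint cost × MTBF). Function and parameter names are hypothetical.

```python
import math


def system_mtbf(node_mtbf_hours: float, num_nodes: int) -> float:
    """Assumed model: independent, identical nodes with exponential failures,
    so the system MTBF shrinks in proportion to the number of nodes."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return node_mtbf_hours / num_nodes


def young_interval(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    This is a stand-in, not the scheduling model of the paper above.
    """
    if checkpoint_cost_hours < 0 or mtbf_hours <= 0:
        raise ValueError("cost must be non-negative and MTBF positive")
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


if __name__ == "__main__":
    mtbf = system_mtbf(node_mtbf_hours=10_000.0, num_nodes=1_000)
    interval = young_interval(checkpoint_cost_hours=0.05, mtbf_hours=mtbf)
    print(f"system MTBF: {mtbf:.1f} h, checkpoint every {interval:.2f} h")
```

Under these assumptions, checkpointing more often than the estimate wastes time writing checkpoints, while checkpointing less often loses more work per failure.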