Similar Documents
18 similar documents found (search time: 171 ms)
1.
李静  罗金飞  李炳超 《计算机应用》2021,41(4):1113-1121
Proactive fault tolerance detects disks that are about to fail ahead of time and prompts the system to migrate and back up the at-risk data in advance, which significantly improves storage system reliability. To address the problem that existing studies cannot accurately evaluate the reliability of proactively fault-tolerant replica storage systems, several state-transition models of replica storage systems are proposed and implemented with a Monte Carlo simulation algorithm that emulates the operation of a proactively fault-tolerant replica storage system; the expected number of data-loss events within a given operating period is then counted. Weibull distribution functions are used to model the time distributions of device failure and repair events, and the effects of the proactive fault tolerance mechanism, node failures, node repairs, disk failures, and disk repairs on storage system reliability are evaluated quantitatively. Experimental results show that when the prediction model reaches 50% accuracy, system reliability improves by a factor of 1 to 3, and that a three-replica system is more sensitive to system parameters than a two-replica system. The proposed models can help system administrators compare and weigh reliability levels under different fault tolerance schemes and system parameters, and thereby build highly reliable and highly available storage systems.
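As a rough illustration of the kind of simulation described above, the sketch below draws Weibull-distributed failure and repair times and counts runs that lose data. It is a minimal, self-contained example rather than the authors' model; every parameter value and helper name (weibull_sample, simulate_data_loss, predict_acc) is an assumption introduced here.

```python
import math
import random

def weibull_sample(scale, shape):
    """Inverse-transform sample of a Weibull-distributed time (hours)."""
    return scale * (-math.log(1.0 - random.random())) ** (1.0 / shape)

def simulate_data_loss(n_disks=100, replicas=3, years=10, runs=200,
                       fail_scale=87600.0, fail_shape=1.2,
                       repair_scale=24.0, repair_shape=1.0,
                       predict_acc=0.5):
    """Estimate the fraction of simulated periods in which data is lost."""
    horizon = years * 8760.0
    loss_runs = 0
    for _ in range(runs):
        outages = []  # (start, end) of each *unpredicted* disk failure until repaired
        for _ in range(n_disks):
            t = weibull_sample(fail_scale, fail_shape)
            while t < horizon:
                # A correctly predicted failure is assumed to be migrated in time
                # by the proactive mechanism, so it never threatens data.
                if random.random() >= predict_acc:
                    outages.append((t, t + weibull_sample(repair_scale, repair_shape)))
                t += weibull_sample(fail_scale, fail_shape)
        # Crude data-loss criterion: `replicas` unrepaired failures overlap in time.
        # (A real model would also track which disks hold which replicas.)
        for start, _ in outages:
            concurrent = sum(1 for s, e in outages if s <= start < e)
            if concurrent >= replicas:
                loss_runs += 1
                break
    return loss_runs / runs

print(simulate_data_loss(predict_acc=0.0))   # purely reactive
print(simulate_data_loss(predict_acc=0.5))   # 50% accurate failure prediction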

2.
Proactive fault tolerance addresses the redundancy overhead of reactive fault tolerance by predicting latent disk failures in advance and handling them proactively, which significantly improves storage system reliability. However, civil aviation storage systems that rely only on reactive fault tolerance cannot fully guarantee system reliability. This paper exploits the advantages of proactive fault tolerance and, based on a disk failure prediction model, builds a state-transition model for a multi-replica civil aviation storage system. The model comprehensively considers three factors, namely disk, node, and rack failures, and uses Weibull distributions to model the occurrence of events in the storage system. Based on the state-transition model, an improved event-driven Monte Carlo simulation method is used to carry out a full reliability analysis of the civil aviation storage system. Experimental results show that the proposed model significantly improves the reliability of the civil aviation storage system. In addition, sensitivity analysis shows that combining proactive and reactive mechanisms effectively slows the decline of system reliability and saves network bandwidth.

3.
Reliability prediction for storage systems can be used to evaluate and compare different fault tolerance mechanisms and to assess how different system parameters affect storage system reliability, which helps in building highly reliable storage systems. Research on storage system reliability prediction has therefore long been a hot topic in the field. This survey reviews and analyzes recent work from the perspectives of two prediction targets: individual disks and whole storage systems. It first classifies the current state of reliability prediction research along several axes: individual disks versus disk populations, proactive versus reactive fault tolerance, and erasure coding versus replication as the redundancy mechanism. It then points out open problems in the field and possible future directions. The analysis shows that reliability prediction for replica-based storage systems and for proactively fault-tolerant storage systems is still insufficiently studied and is a valuable direction for future work.

4.
Design of Highly Reliable Disk Arrays   (total citations: 7; self-citations: 0; by others: 7)
Reliability is one of the most important metrics of networked storage systems, and the redundant array of independent disks (RAID) is their foundation and key component. RAID strengthens fault tolerance through redundant data or parity information on the one hand, and improves system reliability through redundant hardware components on the other. This work focuses on key techniques for improving RAID reliability. It analyzes the main factors that reduce disk array reliability, namely system faults, uncorrectable bit errors, and correlated disk failures, studies the corresponding countermeasures, and investigates key fault tolerance techniques such as dual-parity redundancy, hardware component redundancy, and fast rebuilding of failed disks.

5.
Addressing the redundancy characteristics of cloud services and the need for reliability guarantees, this work explores effective ways to improve cloud service reliability. Based on a reliability framework and management model for cloud services, an overall trust-redundancy framework for enhancing cloud service reliability is proposed. In the redundancy design phase during service preparation, a trust-aware fault-tolerant service selection algorithm is designed on top of an election-protocol-based polling detection mechanism for cloud services, and a method for solving for the minimum number of fault-tolerant services is given. Based on a run-time fault handling framework for service composition, a failure-rule-based cloud service invocation strategy that guarantees service response time is proposed. Experimental results show that the proposed fault-tolerant service selection algorithm and cloud service invocation strategy are practical and effective.

6.
Data layouts based on single-fault-tolerant codes can no longer meet the ever-increasing reliability requirements of storage systems. Data layouts based on multi-fault-tolerant codes have therefore attracted wide attention, and several triple-fault-tolerant layout algorithms, such as HDD1 and HDD2, have appeared, but they generally suffer from poor redundancy efficiency and heavy computational load. This work proposes TP-RAID (Triple Parity RAID), a multi-fault-tolerant data layout based on triple parity. The algorithm only needs to add two parity disks to a RAID-5 array; with horizontal, forward-diagonal, and reverse-diagonal parity, it can tolerate three simultaneous disk failures. Encoding and decoding are simple, the three parity stripes are of equal length, the computational load is small, and the algorithm is easy to implement. In addition, because logical coupling among the three parities is minimized, the small-write performance of the algorithm is greatly improved compared with other triple-fault-tolerant algorithms.
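To make the horizontal/diagonal/anti-diagonal idea concrete, here is a minimal XOR parity sketch. It illustrates generic row, forward-diagonal, and reverse-diagonal parity over a p x p block of data symbols, not the actual TP-RAID layout; the helper name triple_parity is hypothetical.

```python
import numpy as np

def triple_parity(D):
    """D: p x p array of integer data symbols. Returns three parity vectors,
    one horizontal, one forward-diagonal, one reverse-diagonal, built with XOR."""
    p = D.shape[0]
    row  = np.zeros(p, dtype=D.dtype)   # horizontal parity per row
    diag = np.zeros(p, dtype=D.dtype)   # forward (i + j) diagonals
    anti = np.zeros(p, dtype=D.dtype)   # reverse (i - j) diagonals
    for i in range(p):
        for j in range(p):
            row[i]            ^= D[i, j]
            diag[(i + j) % p] ^= D[i, j]
            anti[(i - j) % p] ^= D[i, j]
    return row, diag, anti

D = np.random.randint(0, 256, size=(5, 5))
row, diag, anti = triple_parity(D)
# Corrupting a single data symbol changes exactly one entry in each parity vector,
# which is what provides independent equations for reconstructing lost symbols.
print(row, diag, anti)
```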

7.
To address the single points of failure and performance bottlenecks of current DAS and NAS storage schemes, this work introduces a new Intelligent Network Disk (IND) storage system architecture and proposes a file data fault tolerance algorithm for IND storage systems. Theoretical analysis and simulation results show that when multiple INDs use this fault tolerance algorithm, parallel data read performance is good and strong single-point fault tolerance is achieved, indicating that the proposed algorithm enables fault tolerance across intelligent network disks.

8.
Erasure coding is the underlying technique that gives RAID-6 (redundant array of independent disks, level 6) its double fault tolerance, and its performance is an important factor in overall RAID-6 performance. To address the I/O imbalance and slow data recovery of the array erasure codes commonly used in RAID-6, a hybrid XOR-based array code called J-code is proposed. J-code uses a new parity generation rule: first, diagonal parity bits are computed from a two-dimensional array built from the original data and a new array is constructed; then, anti-diagonal parity bits are computed from the positional relationships among the data blocks in the new array. In addition, J-code stores the original data and part of the parity on the same disk, which reduces the number of XOR operations during encoding and decoding and the number of data blocks read during single-disk recovery, thereby lowering coding complexity and the I/O cost of single-disk repair and alleviating disk hot spots. Simulation results show that, compared with array codes such as RDP (Row-Diagonal Parity) and EaR (Endurance-aware RAID-6), J-code reduces encoding time by 0.30% to 28.70%, and reduces repair time for single-disk and double-disk failures by 2.23% to 31.62% and 0.39% to 36.00%, respectively.

9.
张航  唐聃  蔡红亮 《计算机科学》2021,48(5):130-139
Erasure codes consume less storage space while providing high data reliability, so they are widely adopted in distributed storage systems, but their high repair cost limits their application. To reduce the repair cost of erasure codes, researchers have done extensive work on block codes and regenerating codes. Since both are reactive fault tolerance approaches, proactive fault tolerance can further reduce repair cost and maintain system reliability for nodes that are prone to failure. This paper therefore proposes PPyramid (Proactive basic-Pyramid), a proactive, prediction-based erasure code. PPyramid uses disk failure prediction to adjust the association between redundant blocks and data blocks in the basic-Pyramid code, placing disks predicted to fail soon into the same group so that all reads during repair stay within that group, which reduces the number of data blocks read and saves repair cost. In a distributed storage system built on Ceph, PPyramid is compared with other commonly used erasure codes when repairing multiple disk failures. Experimental results show that, compared with basic-Pyramid, PPyramid reduces repair cost by 6.3% to 34.9% and repair time by 7.6% to 63.6%; compared with the LRC, pLRC, SHEC, and DLRC codes, it reduces repair cost by 8.6% to 52% and repair time by 10.8% to 52.4%. PPyramid is also flexible to construct and has strong practical value.
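The grouping idea, keeping disks that the predictor flags as likely to fail inside the same local-repair group so that repair reads stay within the group, can be sketched in a few lines. This is a toy illustration under assumptions introduced here (the names group_disks, risk, and threshold are hypothetical), not the PPyramid construction itself.

```python
from typing import Dict, List

def group_disks(risk: Dict[str, float], group_size: int,
                threshold: float = 0.5) -> List[List[str]]:
    """Place high-risk disks (predicted failure probability >= threshold) into
    the same groups first, then fill the remaining groups with low-risk disks."""
    risky = sorted((d for d, p in risk.items() if p >= threshold),
                   key=risk.get, reverse=True)
    safe = [d for d, p in risk.items() if p < threshold]
    ordered = risky + safe
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]

risk = {"d1": 0.9, "d2": 0.1, "d3": 0.8, "d4": 0.05, "d5": 0.7, "d6": 0.2}
print(group_disks(risk, group_size=3))   # risky disks d1, d3, d5 end up together
```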

10.
This work applies the idea of distributed backup to large-scale parallel file systems, providing a fast recovery mechanism in systems whose data is built with redundancy. A Markov model is used to build a distribution model of the mean time to data loss, which guides the balance between data reliability requirements and redundant data overhead. Analysis of the reliability model shows that, under the fast recovery mechanism, an m-of-n scheme satisfies the reliability requirements of large-scale storage systems as long as n ≥ m + 2 and the computation time needed for data recovery is negligible compared with disk I/O time.
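For reference, the standard birth-and-death Markov approximation for mean time to data loss (MTTDL), assuming independent exponential failures with rate λ and repairs with rate μ, is written out below. This is a generic textbook form used to reason about m-of-n redundancy, not necessarily the exact model built in this entry.

```latex
% Generic Markov (birth-death) approximation, assuming \mu \gg \lambda:
% a group of n devices storing data with an m-of-n scheme tolerates
% f = n - m concurrent failures before data is lost.
\[
  \mathrm{MTTDL} \;\approx\; \frac{\mu^{\,f}}{\lambda^{\,f+1}\,\prod_{i=0}^{f}(n-i)},
  \qquad f = n - m .
\]
% Special case f = 1 (a single-parity stripe of n disks),
% written with MTTF = 1/\lambda and MTTR = 1/\mu:
\[
  \mathrm{MTTDL} \;\approx\; \frac{\mathrm{MTTF}^{2}}{n\,(n-1)\,\mathrm{MTTR}} .
\]
```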

11.
Redundant arrays of independent disks (RAID) provide an efficient stable storage system for parallel access and fault tolerance. The most common fault tolerant RAID architecture is RAID-1 or RAID-5. The disadvantage of RAID-1 lies in excessive redundancy, while the write performance of RAID-5 is only 1/4 of that of RAID-0. In this paper, we propose a high performance and highly reliable disk array architecture, called stripped mirroring disk array (SMDA). It is a new solution to the small-write problem for disk array. SMDA stores the original data in two ways, one on a single disk and the other on a plurality of disks in RAID-0 by stripping. The reliability of the system is as good as RAID-1, but with a high throughput approaching that of RAID-0. Because SMDA omits the parity generation procedure when writing new data, it avoids the write performance loss often experienced in RAID-5.

12.
Cloud computing is a recent trend in IT that has attracted a lot of attention. In cloud computing, service reliability and service performance are two important issues. To improve cloud service reliability, fault tolerance techniques such as fault recovery may be used, which in turn have an impact on cloud service performance; this impact deserves detailed study. Although some research exists on cloud/grid service reliability and performance, very little of it addresses fault recovery and its impact on service performance. In this paper, we conduct a detailed study of cloud service performance evaluation that takes fault recovery into account. We consider recovery on both processing nodes and communication links. The commonly adopted assumption of Poisson arrivals of users' service requests is relaxed, and the interarrival times of service requests can follow an arbitrary probability distribution. The precedence constraints of subtasks are also considered. The probability distribution of the service response time is derived, and a numerical example is presented. The proposed cloud performance evaluation models and methods yield realistic results and are thus of practical value for related decision-making in cloud computing.

13.
As the scale of supercomputers rapidly grows, the reliability problem dominates the system availability. Existing fault tolerance mechanisms, such as periodic checkpointing and process redundancy, cannot effectively fix this problem. To address this issue, we present a new fault tolerance framework using process replication and prefetching (FTRP), combining the benefits of proactive and reactive mechanisms. FTRP incorporates a novel cost model and a new proactive fault tolerance mechanism to improve the application execution efficiency. The novel cost model, called the ‘work-most’ (WM) model, makes runtime decisions to adaptively choose an action from a set of fault tolerance mechanisms based on failure prediction results and application status. Similar to program locality, we observe the failure locality phenomenon in supercomputers for the first time. In the new proactive fault tolerance mechanism, process replication with process prefetching is proposed based on the failure locality, significantly avoiding losses caused by the failures regardless of whether they have been predicted. Simulations with real failure traces demonstrate that the FTRP framework outperforms existing fault tolerance mechanisms with up to 10% improvement in application efficiency for common failure prediction accuracy, and is effective for petascale systems and beyond.

14.
Video services are likely to dominate the traffic in future broadband networks. Most of these services will be provided by large-scale public-access video servers. Research to date has shown that disk arrays are a promising technology for providing the storage and throughput required to serve many independent video streams to a large customer population. Large disk arrays, however, are susceptible to disk failures which can greatly affect their reliability. In this paper, we discuss suitable redundancy mechanisms to increase the reliability of disk arrays and compare the performance of the RAID-3 and RAID-5 redundancy schemes. We use cost and performability analyses to rigorously compare the two schemes over a variety of conditions. Accurate cost models are developed and Markov reward models (with time-dependent reward structures) are developed and used to give insight into the tradeoffs between system cost and revenue earning potential. The paper concludes that for large-scale video servers, coarse-grained striping in a RAID-5 style of disk array is most cost effective.
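As a reminder of what a Markov reward model computes, the generic performability quantities are written out below; the paper's concrete time-dependent reward structure is not reproduced here, so treat r_i(t) and π_i(t) purely as placeholder notation.

```latex
% State i of the availability Markov chain has probability \pi_i(t) at time t
% and earns reward (e.g. revenue) at rate r_i(t).
\[
  \text{instantaneous expected reward rate:}\quad
  E[R(t)] \;=\; \sum_{i \in S} r_i(t)\,\pi_i(t),
\]
\[
  \text{expected accumulated reward over } [0,T]:\quad
  E[Y(T)] \;=\; \int_{0}^{T}\sum_{i \in S} r_i(t)\,\pi_i(t)\,\mathrm{d}t .
\]
```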

15.
杨娜  刘靖 《软件学报》2019,30(4):1191-1202
Providing efficient and continuously available fault tolerance services to guarantee the reliable operation of cloud applications is critically important. Adopting the fault-tolerance-as-a-service model, this work proposes an optimized method for dynamically providing cloud fault tolerance services. Cloud application fault tolerance requirements are described in terms of component reliability and response time. Building on common fault tolerance techniques such as replication, checkpointing, and N-version programming (NVP), and fully accounting for the overhead of dynamically switching fault tolerance services, optimal solution methods for feasible fault-tolerance-as-a-service provisioning plans are given for the cases where the underlying cloud resources supporting the services are and are not sufficient. Experimental results show that the proposed method reduces both the fault tolerance service fees paid by cloud applications and the cost of the underlying cloud resources supporting the services, and improves a provider's ability to deliver efficient and reliable fault tolerance as a service to multiple cloud applications.

16.
The vulnerability of computer nodes due to component failures is a critical issue for cluster-based file systems. This paper studies the development and deployment of mirroring in cluster-based parallel virtual file systems to provide fault tolerance and analyzes the tradeoffs between the performance and the reliability in the mirroring scheme. It presents the design and implementation of CEFT, a scalable RAID-10 style file system based on PVFS, and proposes four novel mirroring protocols depending on whether the mirroring operations are server-driven or client-driven, whether they are asynchronous or synchronous. The comparisons of their write performances, measured in a real cluster, and their reliability and availability, obtained through analytical modeling, show that these protocols strike different tradeoffs between the reliability and performance. Protocols with higher peak write performance are less reliable than those with lower peak write performance, and vice versa. A hybrid protocol is proposed to optimize this tradeoff.

17.
With the rapid development of supercomputers, their scale and complexity keep increasing, and reliability and resilience face ever greater challenges. Many important fault tolerance techniques exist, such as proactive failure avoidance based on failure prediction, reactive fault tolerance based on checkpointing, and scheduling techniques that improve reliability. A qualitative and quantitative characterization of system faults is critical to all of these techniques. This study analyzes the sources of failures on two typical petascale supercomputers, Sunway BlueLight (based on multicore CPUs) and Sunway TaihuLight (based on heterogeneous many-core CPUs). It uncovers several interesting fault characteristics and finds previously unknown correlations among failures of major components. Finally, the paper analyzes the failure times of the two supercomputers at various resource granularities and over different time spans, and builds a unified multi-dimensional failure-time model for petascale supercomputers.

18.
Failures are normal rather than exceptional in cloud computing environments, and fault tolerance is one of the major obstacles to opening up a new era of highly serviceable cloud computing, since it plays a key role in ensuring cloud serviceability. Fault tolerant service is an essential part of Service Level Objectives (SLOs) in clouds. To achieve a high level of cloud serviceability and to meet demanding cloud SLOs, a robust fault tolerance strategy is needed. In this paper, the definitions of fault, error, and failure in a cloud are given, and the principles for high fault tolerance objectives are systematically analyzed by referring to the fault tolerance theories suitable for large-scale distributed computing environments. Based on the principles and semantics of cloud fault tolerance, a dynamic adaptive fault tolerance strategy DAFT is put forward. It includes: (i) analyzing the mathematical relationship between different failure rates and two different fault tolerance strategies, namely a checkpointing fault tolerance strategy and a data replication fault tolerance strategy; (ii) building a dynamic adaptive checkpointing fault tolerance model and a dynamic adaptive replication fault tolerance model, and combining the two models to maximize serviceability and meet the SLOs; and (iii) evaluating the dynamic adaptive fault tolerance strategy under various conditions in large-scale cloud data centers, considering different system-centric parameters such as fault tolerance degree, fault tolerance overhead, and response time. Theoretical as well as experimental results conclusively demonstrate that the dynamic adaptive fault tolerance strategy DAFT has high potential, as it provides efficient fault tolerance enhancements, significant cloud serviceability improvement, and strong SLO satisfaction. It efficiently and effectively achieves a trade-off among fault tolerance objectives in cloud computing environments.
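To illustrate the kind of trade-off DAFT formalizes between checkpointing and replication as the failure rate varies, here is a deliberately simplified sketch. The cost formulas (Young's checkpoint-interval approximation and a flat replication overhead) and all numbers are assumptions made for illustration, not the paper's models.

```python
import math

def checkpoint_overhead(failure_rate, checkpoint_cost, task_time):
    """Expected overhead of periodic checkpointing, using Young's approximation
    for the optimal checkpoint interval: T_opt ~ sqrt(2 * C / lambda)."""
    interval = math.sqrt(2.0 * checkpoint_cost / failure_rate)
    n_checkpoints = task_time / interval
    rework = 0.5 * interval * failure_rate * task_time   # expected recomputation
    return n_checkpoints * checkpoint_cost + rework

def replication_overhead(failure_rate, task_time, replicas=2):
    """Overhead of running `replicas` copies: extra resource-time consumed."""
    return (replicas - 1) * task_time

def choose_strategy(failure_rate, checkpoint_cost, task_time):
    """Pick whichever strategy has the lower expected overhead for this task."""
    cp = checkpoint_overhead(failure_rate, checkpoint_cost, task_time)
    rep = replication_overhead(failure_rate, task_time)
    return ("checkpointing", cp) if cp <= rep else ("replication", rep)

# Low failure rates favor checkpointing; high failure rates favor replication.
print(choose_strategy(failure_rate=1e-4, checkpoint_cost=60, task_time=36_000))
print(choose_strategy(failure_rate=1e-2, checkpoint_cost=60, task_time=36_000))
```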
