1.

Connective fault tolerance in multiple bus systems

Hung-Kuei Ku Hayes J.P. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(6):574-586

We present an efficient approach to characterizing the fault tolerance of multiprocessor systems that employ multiple shared buses for interprocessor communication. Of concern is connective fault tolerance, which is defined as the ability to maintain communication between any two fault-free processors in the presence of faulty processors, buses, or processor-bus links. We introduce a model called processor-bus-link (PBL) graphs to represent a multiple-bus system's interconnection structure. The model is more general than previously proposed models, and has the advantages of simple representation, broad application, and the ability to model partial bus failures. The PBL graph implies a set of component adjacency graphs that highlights various connectivity features of the system. Using these graphs, we propose a method for analyzing the maximum number of faults a multiple-bus system can tolerate, and for identifying every minimum set of faulty components that disconnects the processors of the system. We also analyze the connective fault tolerance of several proposed multiple-bus systems to illustrate the application of our method.
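The notion of connective fault tolerance described in this abstract can be illustrated with a small sketch. The code below is not the authors' PBL-graph method; it is a minimal check, under assumed conventions, of whether the fault-free processors of a multiple-bus system can still reach one another once some processors, buses, or processor-bus links have failed. The function name `processors_connected`, its parameters, and the assumption that messages may be relayed through fault-free processors are all hypothetical choices made for illustration.

```python
from collections import deque


def processors_connected(processors, buses, links,
                         faulty_processors=frozenset(),
                         faulty_buses=frozenset(),
                         faulty_links=frozenset()):
    """Return True if all fault-free processors can still communicate.

    links is an iterable of (processor, bus) pairs. Assumption: a message
    may be relayed through any fault-free processor or bus.
    """
    alive_procs = [p for p in processors if p not in faulty_processors]
    alive_buses = [b for b in buses if b not in faulty_buses]
    if len(alive_procs) <= 1:
        return True  # zero or one survivor is trivially connected

    # Build a bipartite graph over the surviving processors and buses.
    # Nodes are tagged so processor and bus names cannot collide.
    adjacency = {("P", p): set() for p in alive_procs}
    adjacency.update({("B", b): set() for b in alive_buses})
    for p, b in links:
        if (p, b) in faulty_links:
            continue
        if ("P", p) in adjacency and ("B", b) in adjacency:
            adjacency[("P", p)].add(("B", b))
            adjacency[("B", b)].add(("P", p))

    # Breadth-first search from one surviving processor.
    start = ("P", alive_procs[0])
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)

    return all(("P", p) in seen for p in alive_procs)
```

Enumerating fault sets of increasing size and calling this check on each one would give a brute-force estimate of how many faults a small system tolerates; the paper's adjacency-graph analysis is presumably far more efficient than that.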

2.

Improved extra group network: a new fault-tolerant multistage interconnection network

Fathollah Bistouni Mohsen Jahanshahi 《The Journal of supercomputing》2014,69(1):161-199

Supersystems are shown to provide enough computational power to solve complex problems on a real-time basis. In all these systems, the computational parallelism is obtained from multiple processors. Multistage interconnection networks (MINs) play a vital role in the performance of these multiprocessor systems. This paper introduces a new fault-tolerant MIN named the improved extra group network (IEGN). IEGN is derived from the existing extra group network (EGN), which is a regular multipath network with limited fault tolerance. IEGN provides four times more paths between any source–destination pair compared with EGN. The performance of IEGN has been evaluated in terms of permutation capability, fault tolerance, reliability, path length, and cost. It has also been proved that the IEGN can achieve better results in terms of fault tolerance, reliability, path length, and cost-effectiveness, in comparison to known networks, namely, EGN, augmented baseline network, augmented shuffle-exchange network, fault-tolerant double tree, Benes network, and Replicated MIN.

3.

A multipath network with cross links

《Journal of Parallel and Distributed Computing》1988,5(2):185-193

A fault-tolerant multistage interconnection network, called the H-network, and a fault-tolerant control algorithm for this network are introduced. The novel feature of this network lies in its design, which has connections not only between switching elements of successive stages but also between switching elements of the same stage. The control algorithm is a simple modification of the destination tag algorithm, but it provides for fault tolerance and is dynamic in nature. It is shown that this design technique is effective in reducing hardware, improving fault tolerance, and giving better performance than other fault-tolerant networks with comparable hardware cost.

4.

Fault-tolerant, real-time communication in FDDI-based networks

Biao Chen Kamat S. Wei Zhao 《Computer》1997,30(4):83-90

The first high-speed network to meet the Safenet standard's bandwidth requirements, the Fiber Distributed Data Interface (FDDI) needs help to meet Safenet's fault tolerance requirement. Researchers have proposed a number of FDDI-based network architecture designs for improving fault tolerance. An architecture called FBRN (FDDI-Based Reconfigurable Network) provides enhanced fault tolerance by using (a) multiple FDDI networks to connect hosts, and (b) efficient fault detection and network configuration algorithms. To provide fault-tolerant real-time communication with the FBRN architecture, users must manage network resources properly. We sought to accomplish this by using a fault-tolerant, real-time management mechanism with online and offline components. We focused on achieving high performance by designing efficient and effective online and offline management algorithms to work around multiple faults.

5.

VTFTR: A deadlock-free fault-tolerant routing algorithm for high-dimensional fat trees

Liu Boyang Hu Shukai Shi Dejun Lu Hongsheng 《Computer Engineering》2022,48(12):38

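With the rapid growth in the scale of high-performance computing systems in recent years, the reliability of high-performance interconnection networks has become an increasingly important issue. The high-dimensional fat tree is a network topology that combines the advantages of the fat tree and the multidimensional torus; owing to its good scalability and network performance, it has broad application prospects in the exascale era. However, research on fault-tolerant routing algorithms for high-dimensional fat trees is currently limited, and the reliability problem urgently needs to be solved. To improve the fault tolerance of the high-dimensional fat-tree topology in high-performance interconnection networks, and to further improve the operating efficiency of the corresponding supercomputing systems, a fault-tolerant routing algorithm, VTFTR, is proposed for leaf-switch faults in high-dimensional fat trees. The algorithm combines the ideas of the turn model and virtual-channel switching: by strictly controlling the turns that packets make on fault-free paths and fault-tolerant paths, it achieves deadlock-free fault tolerance in high-dimensional fat trees using a small number of fault-tolerant virtual channels and extra hops. Experimental results show that, with a single fault, the fault-tolerant paths of VTFTR are 2 to 4 hops shorter than those of the compared algorithms. In a network of 4,096 nodes with 10 faulty leaf switches, under different distributions of the faulty leaf switches, the algorithm maintains connectivity among all fault-free nodes in the network at the cost of a 1.4% to 2.0% drop in throughput.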

6.

New encoding/decoding methods for designing fault-tolerant matrix operations

Tao D.L. Hartmann C.R.P. Han Y.S. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(9):931-938

Algorithm-based fault tolerance (ABFT) can provide low-cost error protection for array processors and multiprocessor systems. Several ABFT techniques (weighted checksum) have been proposed to design fault-tolerant matrix operations. In these schemes, encoding/decoding uses either multiplications or divisions, so the overhead is high. In this paper, new encoding/decoding methods are proposed for designing fault-tolerant matrix operations. The unique feature of these new methods is that only additions and subtractions are used in encoding/decoding. New algorithms are also proposed to construct error detecting/correcting codes with minimum Hamming distance 3 and 4. We will show that the overhead introduced by incorporating fault tolerance is drastically reduced by using these new coding schemes.

7.

On the Construction of Fault-Tolerant Cube-Connected Cycles Networks

《Journal of Parallel and Distributed Computing》1995,25(1):98-106

This paper presents a new approach to tolerating edge faults and node faults in Cube-Connected Cycles (CCC) networks in a worst-case scenario. Our constructions of fault-tolerant CCC networks are obtained by adding extra edges to the CCC. The main objective is to reduce the cost of the fault-tolerant network by minimizing the degree of the network. Specifically, we have two main results. (i) We have created a fault-tolerant CCC that can tolerate any single fault, either a node fault or an edge fault. When the dimension of the CCC is odd, the degree of the fault-tolerant graph is 4. In the even case, there is a single node per cycle that is of degree 5 and the rest are of degree 4. (ii) We have created a fault-tolerant CCC, where every node has degree y + 2, which can tolerate any 2y − 1 cube-edge faults. Our constructions are extremely efficient for the case of edge faults: they result in healthy CCC networks that utilize all of the processors.

8.

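Design and performance analysis of a hybrid real-time fault-tolerant scheduling algorithm (total citations: 15; self-citations: 2; citations by others: 15)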

Qin Xiao Han Zongfen Pang Liping Li Shengli 《Journal of Software》2000,11(5):686-693

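The real-time fault-tolerant scheduling algorithms studied in earlier literature can only schedule a single kind of task, namely tasks with fault-tolerance requirements. This paper establishes a hybrid real-time fault-tolerant scheduling model and proposes a static real-time fault-tolerant scheduling algorithm that can simultaneously schedule real-time tasks with fault-tolerance requirements and real-time tasks without them. The paper also proposes an algorithm for finding the minimum number of processors, which is used in the simulation analysis of the static algorithm's performance. To improve the scheduling performance of the static algorithm, a dynamic scheduling algorithm is proposed. Finally, the performance of the static and dynamic scheduling algorithms is analyzed through simulation experiments. The experiments show that scheduling performance depends on system parameters such as the number of real-time tasks, task computation times, task periods, and the number of processors.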

9.

Fault tolerant aggregation in heterogeneous sensor networks

Laukik Chitnis Alin Dobra Sanjay Ranka 《Journal of Parallel and Distributed Computing》2009

Fault tolerance and scalability are important considerations in the design of sensor network applications. Data aggregation is an essential operation in sensor networks. Multiple techniques have been proposed recently to tackle the issues of scalability and fault tolerance of aggregation in sensor networks. In this article, we analyze the impact of using a few of the more reliable, though expensive, nodes (such as the Intel XScale), called microservers, in addition to the standard motes, on the fault tolerance and scalability of the aggregation algorithms in sensor networks. In particular, we propose a simple model that captures the essence of tree aggregation in such heterogeneous sensor networks. We validate this theoretical model with simulation results. We also study the effective impact on the sustainable probability of failure, and perform cost-benefit analysis. We also show how hybrid aggregation can be used instead of tree aggregation to improve the performance of aggregation in heterogeneous sensor networks. We show that our work can be applied for effectively optimizing the use of expensive hardware while designing fault-tolerant, distributed sensor networks.

10.

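Fault injection method and soft-error sensitivity analysis for the Godson-1 (Loongson-1) processor (total citations: 12; self-citations: 0; citations by others: 12)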

Huang Hailin Tang Zhimin Xu Tong 《Journal of Computer Research and Development》2006,43(10):1820-1827

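With nanometer-scale manufacturing processes, and in special application settings such as aerospace, reliability will be an important consideration in processor design. Taking the Godson-1 processor as the object of study, this paper discusses fault-injection methods for processor reliability design and proposes a fault-injection and analysis method that runs two RTL models of the processor simultaneously, enabling continuous, fast simulated fault injection. On this basis, the soft-error sensitivity of the Godson-1 processor is further analyzed. Rapidly injecting about 300,000 soft errors ensures that the analysis results are statistically meaningful and can effectively guide subsequent fault-tolerance and reliability design.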

11.

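Research progress on network-layer fault-tolerance techniques in wireless sensor networks (total citations: 3; self-citations: 1; citations by others: 2)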

Li Hongbing Xiong Qingyu Shi Weiren Wang Xiaogang He Dong Li Rui 《Application Research of Computers》2013,30(7):1921-1928

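Fault tolerance improves the stability and reliability of wireless sensor networks and is one of their key technologies. Network-layer fault tolerance and cross-layer cooperative optimization design are important research topics in network fault tolerance; this paper surveys and summarizes research on network-layer fault-tolerance techniques. Network-layer fault-tolerant control techniques fall mainly into multipath routing, erasure coding/network coding, data retransmission mechanisms, cross-layer cooperative optimization and composite fault tolerance, and bio-inspired intelligent fault tolerance. Research trends in network-layer fault-tolerant control are also discussed.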

12.

Eventual strong consensus with fault detection in the presence of dual failure mode on processors under dynamic networks

《Journal of Network and Computer Applications》2012,35(4):1260-1276

The fault tolerance capability and reliability of a distributed system can be enhanced if the Strong Consensus (SC) problem can be properly addressed. Most of the extant SC protocols are designed for static networks. Moreover, the number of rounds of message exchange required by all of the extant SC protocols is determined by the total number of processors in the network rather than by the actual number of faulty processors. Even if there are only a few faulty processors in the network, or none, the SC protocols may waste a lot of time and memory space on many unnecessary rounds of message exchange. Thus, this paper revisits the SC problem in dynamic networks and uses two rules, Detection Rule for Malicious fault in dynamic network (DRM_dyn) and Early Stopping Rule for Strong Consensus protocol in dynamic networks (ESRSC_dyn), to reduce the time consumption and space complexity of SC protocols. DRM_dyn is a rule that detects malicious processors, and ESRSC_dyn is a rule that determines whether the messages collected are enough for reaching a strong consensus. In short, the proposed SC protocol not only works in dynamic networks consisting of both dormant processors and malicious processors (dual failure mode) but also ensures that all correct processors reach an SC value within fewer rounds of message exchange than required by the extant SC protocols.

13.

Bandwidth availability of m-level hierarchical multiprocessor systems

I. O. Mahgoub A. K. Elmagarmid 《Information Sciences》1992,60(3):185-215

In previous work, we proposed an m-level hierarchical multiprocessor system. The proposed system reduces the network complexity by employing m levels of hierarchy. The system performance was analyzed, and the results showed that, for a higher rate of local requests, the system performed close to the crossbar system, and better than a typical multiple-bus system (with the number of buses equal to half the number of processors). In this paper, we study the effect of failures on the performance of the m-level hierarchical multiprocessor systems. We develop analytical modeling techniques to compute the reliability and the bandwidth availability of the m-level system, for hierarchically nonuniform reference (HNR) and uniform reference (UR) models. The results obtained for the m-level system are compared with those of the crossbar and multiple-bus systems.

14.

Investigating the fault tolerance of neural networks

Tchernev EB Mulvaney RG Phatak DS 《Neural computation》2005,17(7):1646-1664

Particular levels of partial fault tolerance (PFT) in feedforward artificial neural networks of a given size can be obtained by redundancy (replicating a smaller normally trained network), by design (training specifically to increase PFT), and by a combination of the two (replicating a smaller PFT-trained network). This letter investigates the method of achieving the highest PFT per network size (total number of units and connections) for classification problems. It concludes that for non-toy problems, there exists a normally trained network of optimal size that produces the smallest fully fault-tolerant network when replicated. In addition, it shows that for particular network sizes, the best level of PFT is achieved by training a network of that size for fault tolerance. The results and discussion demonstrate how the outcome depends on the levels of saturation of the network nodes when classifying data points. With simple training tasks, where the complexity of the problem and the size of the network are well within the ability of the training method, the hidden-layer nodes operate close to their saturation points, and classification is clean. Under such circumstances, replicating the smallest normally trained correct network yields the highest PFT for any given network size. For hard training tasks (difficult classification problems or network sizes close to the minimum), normal training obtains networks that do not operate close to their saturation points, and outputs are not as close to their targets. In this case, training a larger network for fault tolerance yields better PFT than replicating a smaller, normally trained network. However, since fault-tolerant training on its own produces networks that operate closer to their linear areas than normal training, replicating normally trained networks ultimately leads to better PFT than replicating fault-tolerant networks of the same initial size.

15.

Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations (total citations: 1; self-citations: 0; citations by others: 1)

Ouyang Jinsong Maheshwari Piyush 《The Journal of supercomputing》1999,14(3):207-232

In this paper we present an approach to reliable distributed computing, which incorporates fault tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. In our model fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface including message-passing and file I/O primitives, which hides from the application level the complexity of both fault-tolerant network communications and the checkpointing and recovery of user files. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.

16.

Multiple-bus shared-memory system: Aquarius project

Carlton M. Despain A. 《Computer》1990,23(6):80-83

A multiple-bus architecture called a multi-multi is presented. The architecture is designed to handle several dimensions with a moderate number of processors per bus. It provides scaling to a large number of processors in a system. A key characteristic of the architecture is the large amount of bandwidth it provides. Each node in the architecture contains a microprocessor, memory, and a cache. The cache-coherence protocol for the multi-multi architecture combines features of snooping cache schemes, to provide consistency on individual buses, with features of directory schemes, to provide consistency between buses. The snooping cache component can take advantage of the low-latency communication possible on shared buses for efficiency, yet the complete protocol will support many more processors than a single bus can. The resulting protocol naturally extends cache coherence from a multi to a multi-multi. Cache and directory states are described. Concepts that allow efficient performance, namely, local sharing, root node, and bus addresses in the directory, are discussed.

17.

On fault tolerance of 3-dimensional mesh networks (total citations: 5; self-citations: 0; citations by others: 5)

Gao-Cai Wang Jian-Er Chen Guo-Jun Wang 《Journal of Computer Science and Technology》2004,19(2):0-0

In this paper, the concept of the k-submesh and a k-submesh connectivity fault tolerance model are proposed. The fault tolerance of 3-D mesh networks is studied under a more realistic model in which each network node has an independent failure probability. It is first observed that if the node failure probability is fixed, then the connectivity probability of 3-D mesh networks can be arbitrarily small when the network size is sufficiently large. Thus, it is practically important for multicomputer system manufacturers to determine the upper bound for node failure probability when the probability of network connectivity and the network size are given. A novel technique is developed to formally derive lower bounds on the connectivity probability for 3-D mesh networks. The study shows that 3-D mesh networks of practical size can tolerate a large number of faulty nodes and are thus reliable enough for multicomputer systems. A number of advantages of 3-D mesh networks over other popular network topologies are given.

18.

Algorithm-based fault tolerance applied to high performance computing

George Bosilca Rémi Delmas Jack Dongarra Julien Langou 《Journal of Parallel and Distributed Computing》2009

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithm-Based Fault Tolerance technique [K. Huang, J. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers (Spec. Issue Reliable & Fault-Tolerant Comp.) 33 (1984) 518–528] to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix–matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix–matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result even when one process failure has occurred. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly.
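The checksum idea behind the Algorithm-Based Fault Tolerance technique cited in this abstract can be sketched in a few lines. The code below is not the parallel subroutine the authors describe; it is a minimal, single-process illustration of the classic row/column checksum scheme for matrix multiplication, assuming integer entries (so that comparisons are exact) and at most one corrupted element in the data part of the product. All function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row[:] + [sum(row)] for row in b]


def locate_and_correct(c):
    """Repair a single wrong data element of a full-checksum matrix in place.

    c has n data rows plus a checksum row, and m data columns plus a
    checksum column. Returns the (row, column) that was fixed, or None
    if every checksum is consistent.
    """
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        others = sum(c[i][:m]) - c[i][j]
        c[i][j] = c[i][m] - others  # rebuild the element from the row checksum
        return (i, j)
    raise ValueError("error pattern is not a single data-element fault")
```

Multiplying `with_column_checksum(a)` by `with_row_checksum(b)` yields a product whose last row and last column are checksums of its data part, which is what lets `locate_and_correct` find the intersection of the one inconsistent row and the one inconsistent column.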

19.

Design of 4-disjoint gamma interconnection network layouts and reliability analysis of gamma interconnection networks

S. Rajkumar Neeraj Kumar Goyal 《The Journal of supercomputing》2014,69(1):468-491

Multistage interconnection networks (MINs) are widely used for reliable data communication in tightly coupled large-scale multiprocessor systems. High reliability of MINs can be achieved using fault tolerance techniques. Fault tolerance is generally achieved by disjoint paths available through multiple connectivity options. The gamma interconnection network (GIN) is a class of fault-tolerant MINs providing alternate paths for source–destination node pairs. Various 2-disjoint and 3-disjoint GIN architectures have been presented in the literature. In this paper, two new designs of 4-disjoint-path multistage interconnection networks, called 4-disjoint gamma interconnection networks (4DGIN-1 and 4DGIN-2), are proposed. The proposed 4DGINs provide four disjoint paths for each source–destination pair and can tolerate three switch/link failures in intermediate interconnection layers. The proposed designs are highly reliable GINs with higher fault-tolerant capability than other gamma networks at low cost. Terminal pair reliabilities of the proposed designs and of various other 2-disjoint and 3-disjoint GINs are evaluated, analyzed, and compared. Reliability values of the proposed designs are found to be higher.
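To see why more disjoint paths raise terminal pair reliability, a simplified series-parallel calculation may help. The sketch below is not the evaluation method used in the paper; it assumes that the paths between a source and a destination share no components, that components fail independently, and that a path works only if every component on it works. The function names and the per-component reliability values are hypothetical.

```python
from math import prod


def path_reliability(component_reliabilities):
    """A path is a series system: it works only if every component works."""
    return prod(component_reliabilities)


def terminal_pair_reliability(paths):
    """Disjoint paths form a parallel system: the pair is connected
    unless every one of the paths has failed."""
    return 1.0 - prod(1.0 - path_reliability(p) for p in paths)


# Each path is a list of assumed component reliabilities.
two_paths = [[0.9, 0.9]] * 2
four_paths = [[0.9, 0.9]] * 4
gain = terminal_pair_reliability(four_paths) - terminal_pair_reliability(two_paths)
```

Under these assumptions each additional disjoint path multiplies the probability of total disconnection by that path's failure probability, so the benefit of a fourth path is real but smaller than that of the second. Real networks share source and destination switches among paths, which this sketch ignores.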

20.

The STAR fault manager for distributed operating environments: design, implementation and performance

Pierre Sens Bertil Folliot 《Software》1998,28(10):1079-1099

This paper presents the design, implementation and performance evaluation of a software fault manager for distributed applications. Dubbed Star, it uses the natural redundancy existing in networks of workstations to offer a high level of fault tolerance. Fault management is transparent to the supported parallel applications. To improve the response time of fault-tolerant applications, Star implements non-blocking and incremental checkpointing to perform an efficient backup of process state. Moreover, Star is application independent and highly configurable. Star currently runs on top of SunOS and is easily portable to UNIX™-like operating systems. The current implementation is based on independent checkpointing and message logging. Measurements show the efficiency and the limits of this implementation. The challenge is to show that a software approach to fault tolerance can efficiently be implemented in a standard networked environment. © 1998 John Wiley & Sons, Ltd.