期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

张元瑞武继刚段新明《计算机科学》2012,39(3):295-298

可重构多处理器阵列上的容错技术可用来重构含有故障单元的处理器阵列,以便获得最大可用的目标阵列。现有的研究成果主要侧重于重构算法的构造,还没有涉及对重构后目标阵列的同步通讯性能的研究。提出了一种改善目标阵列同步通讯性能的电路优化算法,用来降低目标阵列行与行之间通讯的延时,使得相邻两行处理器的通讯尽可能达到同步。实验结果表明,提出的算法对不同大小、不同故障率的阵列都有相应的同步通讯性能的改善。相似文献

2.

Parallel reconfiguration algorithms for mesh-connected processor arrays

Jigang Wu Guiyuan Jiang Yuze Shen Siew-Kei Lam Jizhou Sun Thambipillai Srikanthan 《The Journal of supercomputing》2014,69(2):610-628

Effective fault tolerance techniques are essential for improving the reliability of multiprocessor systems. At the same time, fault tolerance must be achieved at high speed to meet the real-time constraints of embedded systems. While parallelism has often been exploited to increase performance, to the best of our knowledge, there has been no previously reported work on parallel reconfiguration of mesh-connected processor arrays with faults. This paper presents two parallel algorithms to accelerate reconfiguration of the processor arrays. The first algorithm reconfigures a host array in parallel in a multithreading manner. The threads in the parallel algorithm execute independently within a safe rerouting distance. The second algorithm is based on a divide-and-conquer approach to first generate the leftmost segments in parallel and then merge the segments in parallel. When compared to the conventional algorithm, simulation results from a large number of instances confirm that the proposed algorithms significantly accelerate the reconfiguration without loss of harvest. 相似文献

3.

容错处理器阵列的多逻辑列并行重构算法

章子凯武继刚姜文超刘竹松《计算机工程与科学》2018,40(1):24-33

处理器阵列的容错重构技术是片上网络多核、众核高性能体系结构的可靠性技术之一。现有的最大逻辑阵列并行重构技术仅对单条逻辑列的构造实现了并行化,而对多条逻辑列的同步并行仍未见可行算法。依据处理器阵列的潜在并行性,在分治策略的基础上,提出了一种阵列分块的并行重构算法。算法对处理器阵列实施横向分块划分,对每个阵列块进行并行重构,并对所得逻辑子阵列进行归并,实现了多条逻辑列的同步并行重构。与现有的并行算法相比,新算法同样能够生成最大逻辑列,并且减少了通信开销与计算中的数据冗余,有效提高了运行速度。实验结果表明,在物理阵列大小为64×64的处理器阵列上,运行速度比现有并行算法提高39.55%,并且具有良好的可扩展性。相似文献

4.

Compiling Array Expressions for Efficient Execution on Distributed-Memory Machines

S.K.S. Gupta S.D. Kaushik C.-H. Huang P. Sadayappan 《Journal of Parallel and Distributed Computing》1996,32(2):155

Array statements are often used to express data-parallelism in scientific languages such as Fortran 90 and High Performance Fortran. In compiling array statements for a distributed-memory machine, efficient generation of communication sets and local index sets is important. We show that for arrays distributed block-cyclically on multiple processors, the local memory access sequence and communication sets can be efficiently enumerated as closed forms using regular sections. First, closed form solutions are presented for arrays that are distributed using block or cyclic distributions. These closed forms are then used with avirtual processor approachto give an efficient solution for arrays with block-cyclic distributions. This approach is based on viewing a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processors, which are cyclically (or block-wise) mapped to physical processors. These views are referred to asvirtual-blockorvirtual-cyclicviews, depending on whether a block or cyclic distribution of the array on the virtual processors is used. The virtual processor approach permits different schemes based on the combination of the virtual processor views chosen for the different arrays involved in an array statement. These virtualization schemes have different indexing overhead. We present a strategy for identifying the virtualization scheme which will have the best performance. Performance results on a Cray T3D system are presented for hand-compiled code for array assignments. These results show that using the virtual processor approach, efficient code can be generated for execution of array statements involving block-cyclically distributed arrays. 相似文献

5.

Flexible rerouting schemes for reconfiguration of multiprocessor arrays

Guiyuan Jiang Jigang Wu Jizhou Sun Yiyi Gao 《Journal of Parallel and Distributed Computing》2014

In a multiprocessor array, some processing elements (PEs) fail to function normally due to hardware defects or soft faults caused by overheating, overload or occupancy by other running applications. Fault-tolerant reconfiguration reorganizes fault-free PEs to a new regular topology by changing the interconnection among PEs. This paper investigates the problem of constructing as large as possible logical array with short interconnects from a physical array with faults. A flexible rerouting scheme is developed to improve the efficiency of utilizing fault-free PEs. Under the scheme, two efficient reconfiguration algorithms are proposed. The first algorithm is able to generate the maximum logical array (MLA) in linear time. The second algorithm reduces the interconnect length of the MLA, and it is capable of producing nearly optimal logical arrays in comparison to the lower bound of the interconnect length, that is also proposed in this paper. Experimental results validate the efficiency of the flexible rerouting schemes and the proposed algorithms. For 128×128 host arrays with 30% unavailable PEs, the proposed approaches improve existing algorithm up to 44% in terms of logical array size, while reducing the interconnection redundancy by 49.6%. In addition, the proposed algorithms are more scalable than existing approaches. On host arrays with 50% unavailable PEs, our algorithms can produce logical arrays with harvest over 56% while existing approaches fail to construct a feasible logical array. 相似文献

6.

A new algorithm based on Givens rotations for solving linearequations on fault-tolerant mesh-connected processors

Murthy K.N.B. Bhuvaneswari K. Ram Murthy C.S. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(8):825-832

In this paper, we propose a new I/O overhead free Givens rotations based parallel algorithm for solving a system of linear equations. The algorithm uses a new technique called two-sided elimination and requires an N×(N+1) mesh-connected processor array to solve N linear equations in (5N-log N-4) time steps. The array is well suited for VLSI implementation as identical processors with simple and regular interconnection pattern are required. We also describe a fault-tolerant scheme based on an algorithm based fault tolerance (ABFT) approach. This scheme has small hardware and time overhead and can tolerate up to N processor failures 相似文献

7.

Availability-based noncontiguous processor allocation policies for 2D mesh-connected multicomputers 总被引：1，自引：0，他引：1

Ismail Ababneh 《Journal of Systems and Software》2008,81(7):1081-1092

Various contiguous and noncontiguous processor allocation policies have been proposed for mesh-connected multicomputers. Contiguous allocation suffers from high external processor fragmentation because it requires that the processors allocated to a parallel job be contiguous and have the same topology as the multicomputer. The goal of lifting the contiguity condition in noncontiguous allocation is reducing processor fragmentation. However, this can increase the communication overhead because the distances traversed by messages can be longer, and messages from different jobs can interfere with each other by competing for communication resources. The extra communication overhead depends on how the allocation request is partitioned and mapped to free processors. In this paper, we investigate a new class of noncontiguous allocation schemes for two-dimensional mesh-connected multicomputers. These schemes are different from previous ones in that request partitioning is based on the submeshes available for allocation. The available submeshes selected for allocation to a job are such that a high degree of contiguity among their processors is achieved. The proposed policies are compared to previous noncontiguous policies using detailed simulations, where several common communication patterns are considered. The results show that the proposed policies can reduce the communication overhead and improve performance substantially. 相似文献

8.

The efficient computation of ownership sets in HPF

Joisha P.G. Banerjee P. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(8):769-788

Ownership sets are fundamental to the partitioning of program computations across processors by the owner-computes rule. These sets arise due to the mapping of arrays onto processors. In this paper, we focus on how ownership sets can be efficiently determined in the context of the HPF language and show how the structure of these sets can be symbolically characterized in the presence of arbitrary array alignment and array distribution directives. Our starting point is a system of equalities and inequalities due to Ancourt et al. (1995) that captures the array mapping problem in HPF. We arrive at a refined system that enables us to efficiently solve for the ownership set using the Fourier-Motzkin Elimination technique and that requires the course vector as the only auxiliary vector. The formulation makes it possible to enumerate the elements of the ownership set exactly once, a feature that is very beneficial when such sets are applied to handle DO loops qualified by HPF's INDEPENDENT directive. We develop important and general properties pertaining to HPF alignments and distributions and show how they can be used to eliminate redundant communication due to array replication. Polynomial-time schemes that determine whether the ownership set of a particular processor, with respect to some array, is the empty set or whether the ownership set of every processor, with respect to some array, is the empty set, are presented. We show how distribution directives with unspecified processor meshes can be efficiently handled at compile time. We also show how to avoid the generation of communication code when pairs of array references are ultimately mapped onto the same processors. Experimental data demonstrating the improved code performance that the latter optimization enables is presented and discussed 相似文献

9.

Some results concerning linear iterative (systolic) arrays

Oscar H. Ibarra Michael A. Palis Sam M. Kim 《Journal of Parallel and Distributed Computing》1985,2(2):182-218

Characterizations of various types of linear iterative (systolic) arrays in terms of single processor sequential machines are given. Using these characterizations, new or improved results concerning the properties, power, and limitations of the different linear array models are proved. For example, a speed-up theorem is proved that is stronger than what has previously appeared in the literature and, moreover, works for arrays with one-way communication lines. Also investigated are the effects of augmenting the array with a supplemental control mechanism called global control. It is shown that in many cases arrays with global control can be simulated in real-time by arrays without global control. The result remains true even if one of the processors is augmented by a stack. Cases are also exhibited where the addition of global control makes the array strictly more powerful, even if the control lines are restricted to only a few processors of the array. 相似文献

10.

环网处理器阵列的容错重构技术

祝龙婷武继刚姜桂圆王超《计算机工程与科学》2015,37(8):1423-1429

高效的容错技术对于提高多处理器系统的可靠性至关重要。环网(Torus)是连接多处理器阵列的重要网络结构,而环网处理器阵列上的容错重构技术目前尚属空白。针对环网阵列的特殊连接方式,将环网阵列重构问题转化为矛盾图上求解最大独立集问题。矛盾图上的结点表示故障处理器的替换方案,而边代表了不同替换方案之间的不可共存特性。主要是根据三种不同的冗余处理器分布方案,设计生成矛盾图算法,求解最大独立集算法,以及由独立集生成逻辑处理器阵列算法,取得了令人满意的结果。实验结果表明,当阵列规模较小或故障率较低时,一行一列和十字型的冗余单元分布的重构能力较好;而随着阵列规模或故障率的增大,三种冗余单元分布策略的重构成功率都随之下降,但可通过增加冗余单元以及调整冗余分布来改善容错效果。此外,从实验结果中还可以看出,环网处理器阵列的容错能力显然优于网格(Mesh)处理器阵列。相似文献

11.

Fault-tolerant rate-monotonic first-fit scheduling inhard-real-time systems

Bertossi A.A. Mancini L.V. Rossini F. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(9):934-945

Hard-real-time systems require predictable performance despite the occurrence of failures. In this paper, fault tolerance is implemented by using a novel duplication technique where each task scheduled on a processor has either an active backup copy or a passive backup copy scheduled on a different processor. An active copy is always executed, while a passive copy is executed only in the case of a failure. First, the paper considers the ability of the widely-used rate-monotonic scheduling algorithm to meet the deadlines of periodic tasks in the presence of a processor failure. In particular, the completion time test is extended so as to check the schedulability on a single processor of a task set including backup copies. Then, the paper extends the well-known rate-monotonic first-fit assignment algorithm, where all the task copies, included the backup copies, are considered by rate-monotonic priority order and assigned to the first processor in which they fit. The proposed algorithm determines which tasks must use the active duplication and which can use the passive duplication. Passive duplication is preferred whenever possible, so as to overbook each processor with many passive copies whose primary copies are assigned to different processors. Moreover, the space allocated to active copies is reclaimed as soon as a failure is detected. Passive copy overbooking and active copy deallocation allow many passive copies to be scheduled sharing the same time intervals on the same processor, thus reducing the total number of processors needed. Simulation studies reveal a remarkable saving of processors with respect to those needed by the usual active duplication approach in which the schedule of the non-fault-tolerant case is duplicated on two sets of processors 相似文献

12.

Effect of fault tolerance and communication delay on response time in a multiprocessor system with a bus topology

Sameer Bataineh Mohammad Al-Ibrahim 《Computer Communications》1994,17(12):843-851

The effect of interprocessor communication and fault tolerance on the response time of N processors (nodes) interconnected through a bus type communication medium is discussed. Deterministic as well as probabilistic approaches are considered. Four correction methods to handle the unprocessed data by the faulty processor(s) are studied and compared. It is found that the effect of interprocessor communication and fault tolerance on the response time for communication-extensive programs (I/O bound) is more than that for computation-extensive programs (CPU bound). It is also found that the effect of fault tolerance on the response time is significant, and cannot be ignored when evaluating the performance of multiprocessor systems. We have shown that the work presented in this paper for a bus topology can be generalized and readily adopted by other multiprocessor network topologies. 相似文献

13.

New encoding/decoding methods for designing fault-tolerant matrixoperations

Tao D.L. Hartmann C.R.P. Han Y.S. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(9):931-938

Algorithm-based fault tolerance (ABFT) can provide a low-cost error protection for array processors and multiprocessor systems. Several ABFT techniques (weighted check-sum) have been proposed to design fault-tolerant matrix operations. In these schemes, encoding/decoding uses either multiplications or divisions so that overhead is high. In this paper, new encoding/decoding methods are proposed for designing fault-tolerant matrix operations. The unique feature of these new methods is that only additions and subtractions are used in encoding/decoding. In this paper, new algorithms are proposed to construct error detecting/correcting codes with the minimum Hamming distance 3 and 4. We will show that the overhead introduced due to the incorporation of fault tolerance is drastically reduced by using these new coding schemes 相似文献

14.

A message combining approach for efficient array redistribution in non-all-to-all communication networks

《国际计算机数学杂志》2012,89(11):1609-1619

The Array redistribution problem is the heart of a number of applications in parallel computing. This paper presents a message combining approach for scheduling runtime array redistribution of one-dimensional arrays. The important contribution of the proposed scheme is that it eliminates the need for local data reorganization, as noted by Sundar in 2001; the blocks destined for each processor are combined in a series of messages exchanged between neighbouring nodes, so that the receiving processors do not need to reorganize the incoming data blocks before storing them to memory locations. Local data reorganization is of great importance, especially in networks where there is no direct communication between all nodes (like tori, meshes, and trees). Thus, a block must travel through a number of relays before reaching the target processor. This requires a higher number of messages generated, therefore, a higher number of data permutations within the memory of each target processor should be made to assure correct data order. The strategy is based on a relation between groups of communicating processor pairs called superclasses. 相似文献

15.

Efficient Index Set Generation for Compiling HPF Array Statements on Distributed-Memory Machines

S.D. Kaushik C.-H. Huang P. Sadayappan 《Journal of Parallel and Distributed Computing》1996,38(2):237

In languages such as High Performance Fortran (HPF), array statements are used to express data parallelism. In compiling array statements for distributed-memory machines, efficient enumeration of local index sets and commmunication sets is important. A method based on a virtual processor approach has been proposed for efficient index set enumeration for array statements involving arrays distributed using block-cyclic distributions. The virtual processor approach is based on viewing a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processors, which are cyclically (or block-wise) mapped to the physical processors. The key idea of the method is to first develop closed forms in terms of simple regular sections for the index sets for arrays distributed using block or cyclic distributions. These closed forms are then used with the virtual processor approach to give an efficient solution for arrays with the block-cyclic distribution. HPF supports a two-level mapping of arrays to processors. Arrays are first aligned with a template at an offset and a stride and the template is then distributed among the processors using a regular data distribution. The introduction of a nonunit stride in the alignment creates “holes” in the distributed arrays which leads to memory wastage. In this paper, using simple mathematical properties of regular sections, we extend the virtual processor approach to address the memory allocation and index set enumeration problems for array statements involving arrays mapped using the two-level mapping. We develop a methodology for translating the closed forms for block and cyclically distributed arrays mapped using a one-level mapping to closed forms for arrays mapped using the two-level mapping. Using these closed forms, the virtual processor approach is extended to handle array statements involving arrays mapped using two-level mappings. Performance results on the Cray T3D are presented to demonstrate the efficacy of the extensions and identify various trade-offs associated with the proposed method. 相似文献

16.

面向通讯同步的多处理器阵列重构

吴亚兰武继刚姜文超刘竹松《计算机科学》2017,44(7):47-56

从多处理器阵列中获取所需大小并且同步通讯性能优良的子阵列,是高性能拓扑重构的核心问题之一。基于不同的逻辑列剔除策略提出了3种面向通讯同步的拓扑重构算法:基于分治思想剔除逻辑列的重构算法(SCA_01),该算法能够使被优化的逻辑列相对均匀地分布在物理阵列中;优先剔除长逻辑列的贪心重构算法(SCA_02),该算法能够使被优化的逻辑列的长链接总数最少;基于分治与长链接数的混成重构算法(SCA_03),该算法将某一区域内的最长逻辑列剔除,且尽可能将剩余逻辑列均匀分布在物理阵列中。同时,对逻辑阵列的最大通讯延时给出了下界的求解算法。实验结果表明,3种算法在故障率小于1%、逻辑列的剔除率超过20%时,算法重构出的逻辑阵列的通讯延时特别接近计算出的性能下界。在多数情况下SCA_01优于SCA_02和SCA_03,而后两者的性能相近。在小阵列上且故障率与剔除率较小时,SCA_02具有性能优势,但在大阵列上SCA_03具有优势。在32×32的阵列上,SCA_01构造的阵列产生的通讯延时较SCA_02和SCA_03产生的延时平均减少25%,并且运行速度也提升了19.4%。相似文献

17.

Task allocation model for distributed systems

G. SAGAR ANIL K. SARJE 《International journal of systems science》2013,44(9):1671-1678

The problem of distributing tasks to processors in a distributed computing system is addressed. A task should be assigned to a processor whose capabilities are most appropriate for the execution of that task and excessive interprocessor communication is avoided. A simple algorithm for task allocation is presented. The execution costs and communication costs of the tasks are represented by arrays. A task is either assigned to a processor or fused with another task using a simple criterion. The execution and communication costs are then modified suitably. The process continues until all the tasks are assigned to processors. This algorithm also facilitates incorporation of various system constraints. It is applicable to random program structures and to a system containing any number of processors. 相似文献

18.

On the design of communication-aware fault-tolerant scheduling algorithms for precedence constrained tasks in grid computing systems with dedicated communication devices

Qin Zheng Bharadwaj Veeravalli 《Journal of Parallel and Distributed Computing》2009

Fault-tolerant scheduling is an imperative step for large-scale computational Grid systems, as often geographically distributed nodes co-operate to execute a task. By and large, primary-backup approach is a common methodology used for fault tolerance wherein each task has a primary and a backup on two different processors. In this paper, we address the problem of how to schedule DAGs in Grids with communication delays so that service failures can be avoided in the presence of processors faults. The challenge is, that as tasks in a DAG have dependence on each other, a task must be scheduled to make sure that it will succeed when any of its predecessors fails due to a processor failure. We first propose a communication model and determine when communications between a backup and backups of its successors are necessary. Then we determine when a backup can start and its eligible processors so as to guarantee that every DAG can complete upon any processor failure. We develop two algorithms to schedule backups, which minimize response time and replication cost, respectively. We also develop a suboptimal algorithm which targets minimizing replication cost while not affecting response time. We conduct extensive simulation experiments to quantify the performance of the proposed algorithms. 相似文献

19.

Efficient task migration algorithm for distributed systems

Suen T.T.Y. Wong J.S.K. 《Parallel and Distributed Systems, IEEE Transactions on》1992,3(4):488-499

The objective of the study was to achieve balanced load among processors, reduce the communication overhead of the load balancing algorithm, and improve respource utilization, which results in better average resonse time. A communication protocol and a fully distributed algorithm for dynamic load balancing through task migration in a connected N-processor network are presented. Each processor communicates its load directly with only a subset (of the size √ N) of processors, reducing communication traffic and average response time. It is proved that the given algorithm will perform task migration even if there is only one light load processor and one heavy load processor in the system. Simulation results show that the proposed scheme can save up to 60% of the protocol messages used by the broadcast algorithms and can reduce the average response time 相似文献

20.

Using task migration to improve non-contiguous processor allocation in NoC-based CMPs

《Journal of Systems Architecture》2013,59(7):468-481

In this paper, a processor allocation mechanism for NoC-based chip multiprocessors is presented. Processor allocation is a well-known problem in parallel computer systems and aims to allocate the processing nodes of a multiprocessor to different tasks of an input application at run time. The proposed mechanism targets optimizing the on-chip communication power/latency and relies on two procedures: processor allocation and task migration. Allocation is done by a fast heuristic algorithm to allocate the free processors to the tasks of an incoming application when a new application begins execution. The task-migration algorithm is activated when some application completes execution and frees up the allocated resources. Task migration uses the recently deallocated processors and tries to rearrange the current tasks in order to find a better mapping for them. The proposed method can also capture the dynamic traffic pattern of the network and perform task migration based on the current communication demands of the tasks. Consequently, task migration adapts the task mapping to the current network status. We adopt a non-contiguous processor allocation strategy in which the tasks of the input application are allowed to be mapped onto disjoint regions (groups of processors) of the network. We then use virtual point-to-point circuits, a state-of-the-art fast on-chip connection designed for network-on-chips, to virtually connect the disjoint regions and make the communication latency/power closer to the values offered by contiguous allocation schemes. The experimental results show considerable improvement over existing allocation mechanisms. 相似文献