Similar Articles
20 similar articles found (search time: 467 ms)
1.
This article is devoted to the redistribution of arrays that are distributed in a GEN_BLOCK fashion over a processor grid. While GEN_BLOCK redistribution is essential for load balancing, prior research on redistribution has focused on block-cyclic redistribution. The proposed scheduling algorithm exploits spatial locality in the message passing of a seemingly irregular array redistribution, and attempts to obtain near-optimal schedules by minimizing both the communication step size and the number of steps. According to experiments on a CRAY T3E and an IBM SP2, the algorithm performs well on typical distributed memory machines.
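As a rough illustration of the communication pattern underlying such a redistribution (not the paper's scheduling algorithm), the messages to be scheduled are exactly the overlaps between source and destination GEN_BLOCK intervals; the block sizes below are made up:

```python
# Sketch: each required message in a GEN_BLOCK redistribution is the
# overlap between one source interval and one destination interval.
# Scheduling these messages into contention-free steps is the hard
# part addressed by the paper and is not shown here.

def gen_block_messages(src_sizes, dst_sizes):
    """Return (src_proc, dst_proc, length) for every required message."""
    def intervals(sizes):
        s, out = 0, []
        for n in sizes:
            out.append((s, s + n))   # half-open interval [s, s+n)
            s += n
        return out

    msgs = []
    for i, (a0, a1) in enumerate(intervals(src_sizes)):
        for j, (b0, b1) in enumerate(intervals(dst_sizes)):
            lo, hi = max(a0, b0), min(a1, b1)
            if lo < hi:                      # the intervals overlap
                msgs.append((i, j, hi - lo))
    return msgs

# 12 array elements redistributed over 3 processors with uneven blocks:
print(gen_block_messages([6, 4, 2], [3, 5, 4]))
# → [(0, 0, 3), (0, 1, 3), (1, 1, 2), (1, 2, 2), (2, 2, 2)]
```

Because each source interval overlaps only a few destination intervals, the message graph is sparse, which is the locality a scheduling algorithm can exploit.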

2.
The block-cyclic data distribution is commonly used to organize array elements over the processors of a coarse-grained distributed memory parallel computer. In many scientific applications, the data layout must be reorganized at run-time in order to enhance locality and reduce remote memory access overheads. In this paper we present a general framework for developing array redistribution algorithms. Using this framework, we have developed efficient algorithms that redistribute an array from one block-cyclic layout to another. Block-cyclic redistribution consists of index set computation, wherein the destination locations for individual data blocks are calculated, and data communication, wherein these blocks are exchanged between processors. The framework treats both these operations in a uniform and integrated way. We have developed efficient and distributed algorithms for index set computation that do not require any interprocessor communication. To perform data communication in a conflict-free manner, we have developed direct, indirect, and hybrid algorithms. In the direct algorithm, a data block is transferred directly to its destination processor. In an indirect algorithm, data blocks are moved from source to destination processors through intermediate relay processors. The hybrid algorithm is a combination of the direct and indirect algorithms. Our framework is based on a generalized circulant matrix formalism of the redistribution problem and a general purpose distributed memory model of the parallel machine. Our algorithms sustain excellent performance over a wide range of problem and machine parameters. We have implemented our algorithms using MPI, to allow for easy portability across different HPC platforms. Experimental results on the IBM SP-2 and the Cray T3D show superior performance over previous approaches. When the block size of the cyclic data layout changes by a factor of K, the redistribution can be performed in O(log K) communication steps.
This is true even when K is a prime number. In contrast, previous approaches take O(K) communication steps for redistribution. Our framework can be used for developing scalable redistribution libraries, for efficiently implementing parallelizing compiler directives, and for developing parallel algorithms for various applications. Redistribution algorithms are especially useful in signal processing applications, where the data access patterns change significantly between computational phases. They are also necessary in linear algebra programs, to perform matrix transpose operations. Received June 1, 1997; revised March 10, 1998.

3.
A given binary resolution proof, represented as a binary tree, is said to be minimal if the resolutions cannot be reordered to generate an irregular proof. Minimality extends Tseitin's regularity restriction and still retains completeness. A linear-time algorithm is introduced to decide whether a given proof is minimal. This algorithm can be used by a deduction system that avoids redundancy by retaining only minimal proofs and thus lessens its reliance on subsumption, a more general but more expensive technique. Any irregular binary resolution tree is made strictly smaller by an operation called Surgery, which runs in time linear in the size of the tree. After surgery the result proved by the new tree is nonstrictly more general than the original result and has fewer violations of the regularity restriction. Furthermore, any nonminimal tree can be made irregular in linear time by an operation called Splay. Thus a combination of splaying and surgery efficiently reduces a nonminimal tree to a minimal one. Finally, a close correspondence between clause trees, recently introduced by the authors, and binary resolution trees is established. In that sense this work provides the first linear-time algorithms that detect minimality and perform surgery on clause trees.

4.
We propose an algorithm to quadrangulate an N‐sided patch (2 ≤ N ≤ 6) with prescribed numbers of edge subdivisions at its boundary. Our algorithm is guaranteed to succeed for arbitrary valid input, which is proved using a canonical simplification of the input and a small set of topological patterns that are sufficient for supporting all possible cases. Our algorithm produces solutions with minimal number of irregular vertices by default, but it also allows the user to choose other feasible solutions by solving a set of small integer linear programs. We demonstrate the effectiveness of our algorithm by integrating it into a sketch‐based quad remeshing system. A reference C++ implementation of our algorithm is provided as a supplementary material.

5.
In many scientific applications, dynamic data redistribution is used to enhance algorithm performance and achieve data locality in parallel programs on distributed memory multicomputers. In this paper, we present a new method, the Compressed Diagonals Remapping (CDR) technique, which improves the efficiency of runtime data redistribution on banded sparse matrices. The main idea of the proposed technique is first to compress the source matrix into a Compressed Diagonal Matrix (CDM) form. Based on the compressed diagonal matrix, a one-dimensional local and global index transformation method can be applied to carry out data redistribution on the compressed diagonal matrix; this process is equivalent to redistributing data in the two-dimensional banded sparse matrix. The CDR technique uses an efficient one-dimensional indexing scheme to perform data redistribution on a banded sparse matrix. A significant improvement of this approach is that a processor does not need to determine the complicated sending or receiving data sets for dynamic data redistribution, so the indexing cost is reduced significantly. The second advantage of the present techniques is the achievement of optimal packing/unpacking stages owing to the consecutive layout of column elements in a compressed diagonal matrix. Another contribution of our methods is the ability to handle sparse matrix redistribution under two disjoint processor grids in the source and destination phases. A theoretical model to analyze the performance of the proposed technique is also presented in this paper. To evaluate the performance of our methods, we have implemented the present techniques on an IBM SP2 parallel machine along with the v2m algorithm and a dense redistribution strategy. The experimental results show that our technique provides significant improvement for runtime data redistribution of banded sparse matrices in most test samples.
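The compression step can be sketched as follows; the storage layout shown is an assumed textbook diagonal format, not necessarily the paper's exact CDM layout:

```python
# Sketch: pack an n x n banded matrix into an n x (lower+upper+1)
# compressed diagonal form, so that redistribution can use 1-D row
# indexing. The exact CDM layout in the paper may differ.

def to_cdm(matrix, lower, upper):
    """Pack a banded matrix; column (j - i + lower) holds diagonal j - i."""
    n = len(matrix)
    width = lower + upper + 1
    cdm = [[0] * width for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - lower), min(n, i + upper + 1)):
            cdm[i][j - i + lower] = matrix[i][j]
    return cdm

A = [[1, 2, 0, 0],
     [3, 4, 5, 0],
     [0, 6, 7, 8],
     [0, 0, 9, 1]]
print(to_cdm(A, lower=1, upper=1))
# → [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 1, 0]]
```

Each row of the compressed form stores only the in-band elements, so a one-dimensional index scheme over rows suffices for the remapping.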

6.
Early consensus in an asynchronous system with a weak failure detector
Summary. Consensus is one of the most fundamental problems in the context of fault-tolerant distributed computing. Given a set Ω of processes, each with an initial value v_i, the problem consists in deciding among Ω on a common value v. In 1985, Fischer, Lynch and Paterson proved that the consensus problem is not solvable in an asynchronous system subject to a single process crash. In 1991, Chandra and Toueg showed that, by augmenting the asynchronous system model with a well-defined unreliable failure detector, consensus becomes solvable. They also gave an algorithm that solves consensus using the ◇S failure detector. In this paper we propose a new consensus algorithm, also using the ◇S failure detector, that is more efficient than the Chandra-Toueg consensus algorithm. We measure efficiency by introducing the notion of latency degree, which defines the minimal number of communication steps needed to solve consensus. The Chandra-Toueg algorithm has a latency degree of 3 (it requires at least three communication steps), whereas our early consensus algorithm requires only two communication steps (latency degree of 2). We believe that this is an interesting result, which adds to our current understanding of the cost of consensus algorithms based on ◇S. Received: April 1995 / Accepted: October 1996

7.
An algorithm combining direct and iterative solution procedures is proposed for the solution of the discrete equations resulting from static or dynamic finite element models. In this technique, the discrete model is broken into regular or irregular blocks of elements. The algorithm is composed of two analysis steps, which are repetitively carried out. In the first step, using the specified boundary data on the block boundaries, the solutions within the blocks are defined directly. In the second step the variables on the block boundaries are updated iteratively. The algorithm would be particularly useful in solving large nonlinear or dynamic problems.
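A minimal numeric sketch of the two-step idea, assuming a simple block Gauss-Seidel interpretation (the paper's finite element setting is more general):

```python
# Toy 4x4 linear system split into two 2x2 blocks. Step 1: each block
# is solved directly with its boundary data (the other block's current
# values) held fixed. Step 2: the boundary values are updated and the
# two steps repeat until convergence.

def solve2(a, b, c, d, r0, r1):
    """Direct solution of the 2x2 system [[a, b], [c, d]] x = [r0, r1]."""
    det = a * d - b * c
    return (r0 * d - r1 * b) / det, (a * r1 - c * r0) / det

A = [[4, 1, 1, 0],
     [1, 4, 0, 1],
     [1, 0, 4, 1],
     [0, 1, 1, 4]]
rhs = [6, 6, 6, 6]           # exact solution is [1, 1, 1, 1]
x = [0.0] * 4
for _ in range(50):          # iterative update of the block boundaries
    # block 1 (unknowns 0, 1), boundary data: current x[2], x[3]
    r0 = rhs[0] - A[0][2] * x[2] - A[0][3] * x[3]
    r1 = rhs[1] - A[1][2] * x[2] - A[1][3] * x[3]
    x[0], x[1] = solve2(A[0][0], A[0][1], A[1][0], A[1][1], r0, r1)
    # block 2 (unknowns 2, 3), boundary data: just-updated x[0], x[1]
    r2 = rhs[2] - A[2][0] * x[0] - A[2][1] * x[1]
    r3 = rhs[3] - A[3][0] * x[0] - A[3][1] * x[1]
    x[2], x[3] = solve2(A[2][2], A[2][3], A[3][2], A[3][3], r2, r3)
print([round(v, 6) for v in x])
# → [1.0, 1.0, 1.0, 1.0]
```

The direct block solves stay cheap while the iteration handles only the coupling across block boundaries, which is what makes the scheme attractive for large problems.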

8.
Fingerprint segmentation is one of the most important preprocessing steps in an automatic fingerprint identification system (AFIS). Accurate segmentation of a fingerprint greatly reduces the computation time of the subsequent processing steps and, most importantly, excludes many spurious minutiae located at the boundary of the foreground. In this paper, a new fingerprint segmentation algorithm is presented. First, two new features, block entropy and block gradient entropy, are proposed. Then, an AdaBoost classifier is designed to discriminate between foreground and background blocks based on these two features and five other commonly used features. The classification error rate (Err) and McNemar's test are used to evaluate the performance of our method. Experimental results on FVC2000, FVC2002 and FVC2004 show that our method outperforms other methods proposed in the literature in both accuracy and stability.
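The block-entropy feature can be sketched as below; the precise definition used in the paper (block size, normalization) is an assumption here:

```python
# Sketch: Shannon entropy of a block's gray-level histogram. A flat
# background block has low entropy; a ridge-patterned foreground
# block has higher entropy.
import math

def block_entropy(block):
    """Entropy of the gray-level distribution, as sum p * log2(1/p)."""
    hist = {}
    for v in block:
        hist[v] = hist.get(v, 0) + 1
    n = len(block)
    return sum((c / n) * math.log2(n / c) for c in hist.values())

flat_background = [128] * 64             # one gray level -> entropy 0
ridge_pattern = [0] * 32 + [255] * 32    # two equal levels -> entropy 1
print(block_entropy(flat_background), block_entropy(ridge_pattern))
# → 0.0 1.0
```

Block gradient entropy would presumably apply the same computation to the block's gradient magnitudes rather than its raw gray levels.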

9.
Irregular array redistribution has attracted attention recently because it can distribute data segments of different sizes to heterogeneous processors according to their computational ability, which also makes it important for load balancing. High Performance Fortran Version 2 (HPF2) provides the GEN_BLOCK distribution format, which facilitates generalized block distributions. In this paper, we present a two-phase degree-reduction (TPDR) method for scheduling HPF2 irregular array redistribution. Using a bipartite communication graph, the first phase of TPDR schedules communication links adjacent to processors with degree greater than two; a communication step is scheduled after each degree-reduction iteration. The second phase of TPDR schedules the remaining messages of all processors with degree 2 or 1 using an adjustable coloring mechanism. An extended algorithm based on TPDR is also presented in this paper. The proposed methods not only avoid node contention but also shorten the overall communication cost, and they are practical owing to their low algorithmic complexity. To evaluate the performance of our methods, we have implemented both algorithms along with the divide-and-conquer algorithm and two other scheduling mechanisms. The simulation results show an improvement in total communication cost.
Chao-Yang Lan

10.
We consider the problem of symbol-by-symbol a posteriori probability (APP) decoding for information symbols of nonsystematically encoded block codes. This problem arises at soft concatenated decoding of generalized concatenated block codes. The well-known BCJR algorithm for efficient APP decoding is not able to solve the problem if it runs on the minimal code trellis of a block code. We introduce an extended trellis representation for block codes, which includes encoding information and thus makes it possible to apply the BCJR algorithm as well as trellis-based decoding in the dual code space. Complexity properties of the extended trellis are investigated.

11.
Efficient use of wireless spectrum resources and guaranteed user quality of experience (QoE) are primary goals of future wireless networks. This paper therefore proposes a QoE-based multi-service resource allocation algorithm for LTE. User priority is computed from QoE in addition to channel state information, QoS requirements, and fairness. In particular, a minimum-QoE constraint is introduced to guarantee the QoE requirements of real-time (RT) users, and a suboptimal resource block (RB) allocation algorithm is proposed to solve the complex resource allocation optimization problem. The algorithm proceeds in two steps: first guarantee the minimum QoE of RT users, then maximize the weighted sum rate of the system. Simulation results show that, compared with existing RT/NRT resource allocation algorithms, the proposed algorithm improves user packet loss rate, average QoE, and cell spectral efficiency.
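The weighted sum-rate step of such an allocation can be illustrated by a greedy per-RB assignment; the rates and weights below are invented, and the paper's suboptimal algorithm additionally enforces the minimum-QoE constraint, which this sketch omits:

```python
# Sketch: assign each resource block (RB) to the user maximizing
# weight * achievable rate. This illustrates only the weighted
# sum-rate maximization step, not the minimum-QoE guarantee.

def allocate_rbs(rates, weights):
    """rates[u][rb]: achievable rate of user u on RB rb."""
    alloc = {}
    for rb in range(len(rates[0])):
        alloc[rb] = max(range(len(rates)),
                        key=lambda u: weights[u] * rates[u][rb])
    return alloc

rates = [[3.0, 1.0, 2.0],   # user 0 (e.g. NRT)
         [1.0, 4.0, 2.0]]   # user 1 (e.g. RT, weighted more heavily)
weights = [1.0, 1.5]
print(allocate_rbs(rates, weights))
# → {0: 0, 1: 1, 2: 1}
```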

12.
Remote sensing data are widely used for marine monitoring, marine rescue, and marine pollution monitoring because of their broad coverage and fast response time; over the last 10 years, satellite-borne and airborne optical and SAR sensor data have accordingly been used increasingly in ocean monitoring. However, little work has addressed detecting irregular debris patches from a wrecked ship or aircraft in dynamic video, which greatly limits the efficiency of maritime rescue. Building on dynamic object detection, color space quantization, and properties of the human visual system, this paper takes irregular wreckage patches as its subject and develops a detection algorithm for dynamic objects based on the Lab color channels. To verify the approach, the Euclidean distance between the characteristic parameters of irregular wreckage patches and regular floating objects was computed: the average distance is 0.8245 between irregular patch 1 and the regular patch, and 4.3645 between irregular patch 2 and the regular patch. The Hausdorff distance was also computed and verified: the average distance between irregular patch 1 and the regular patch is 2.5975, and between patch 2 and the regular patch is 13.8962. The two measures agree, which supports the soundness and operability of the method. It can thus be seen that parameters such as divergence, elongation, and eccentricity differ sharply between irregular patches and the regular one, and can be extracted to recognize or distinguish the patches. This makes the approach valuable for marine rescue and detection of objects on the sea surface.
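The verification step compares feature vectors of patches by Euclidean distance; a minimal sketch with hypothetical feature values (divergence, elongation, eccentricity):

```python
# Sketch: Euclidean distance between shape-feature vectors of two
# patches. The feature values below are invented for illustration.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

regular   = [0.10, 1.20, 0.30]   # near-regular floating object
irregular = [0.90, 3.10, 0.85]   # jagged wreckage patch
print(round(euclidean(regular, irregular), 4))
```

A larger distance indicates a patch whose shape departs more from the regular reference, mirroring how the paper separates irregular wreckage from regular floaters.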

13.
Run-time array redistribution is necessary to enhance the performance of parallel programs on distributed memory supercomputers. In this paper, we present an efficient algorithm for array redistribution from cyclic(x) on P processors to cyclic(Kx) on Q processors. The algorithm reduces the overall time for communication by considering the data transfer, communication schedule, and index computation costs. The proposed algorithm is based on a generalized circulant matrix formalism. Our algorithm generates a schedule that minimizes the number of communication steps and eliminates node contention in each communication step. The network bandwidth is fully utilized by ensuring that equal-sized messages are transferred in each communication step. Furthermore, the time to compute the schedule and the index sets is significantly smaller. It takes O(max(P, Q)) time and is less than 1 percent of the data transfer time. In comparison, the schedule computation time using the state-of-the-art scheme (which is based on the bipartite matching scheme) is 10 to 50 percent of the data transfer time for similar problem sizes. Therefore, our proposed algorithm is suitable for run-time array redistribution. To evaluate the performance of our scheme, we have implemented the algorithm using C and MPI on an IBM SP2. Results show that our algorithm performs better than the previous algorithms with respect to the total redistribution time, which includes the time for data transfer, schedule, and index computation
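The underlying index computation (where a global element lives under each layout) can be sketched with the standard cyclic(b) mapping; this is textbook indexing, not the paper's schedule construction:

```python
# Sketch: owner and local index of global element g under a
# cyclic(block) distribution over nprocs processors. Comparing the
# source and destination mappings yields the messages to schedule.

def cyclic_owner(g, block, nprocs):
    blk = g // block                       # which size-`block` chunk g is in
    proc = blk % nprocs                    # chunks are dealt round-robin
    local = (blk // nprocs) * block + g % block
    return proc, local

# cyclic(2) on P = 3 processors -> cyclic(4) on Q = 2 processors:
P, Q, x, K = 3, 2, 2, 2
for g in (0, 5, 11):
    print(g, cyclic_owner(g, x, P), cyclic_owner(g, K * x, Q))
# → 0 (0, 0) (0, 0)
#   5 (2, 1) (1, 1)
#   11 (2, 3) (0, 7)
```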

14.
This paper presents some novel theoretical results as well as practical algorithms and computational procedures on fuzzy relation equations (FRE). These results refine and improve what has already been reported in a significant manner. In a previous paper, the authors proved that solving a system of fuzzy relation equations is an NP-hard problem; therefore, it is practically impossible to determine all minimal solutions for a large system if P ≠ NP. In this paper, an existence theorem is proven: there exists a special branch-point solution that is greater than all minimal solutions and less than the maximum solution. Such a branch-point solution can be calculated based on the solution-base matrix. Furthermore, a procedure for determining all branch-point solutions is designed. We also provide efficient algorithms capable of determining as well as searching for certain types of minimal solutions. We have thus obtained: (1) a fast algorithm to determine whether a solution is a minimal solution; (2) an algorithm to search for the minimal solutions that have at least a minimum value at a component of the solution vector; and (3) a procedure for determining whether a system of fuzzy relation equations has a unique minimal solution. Other properties are also investigated.
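For max-min fuzzy relation equations x ∘ A = b, the classical background the paper builds on can be sketched as follows (the maximum solution via the Gödel implication is standard theory; the paper's branch-point and minimal-solution algorithms are not reproduced here):

```python
# Sketch: max-min composition and the classical greatest solution of
# x o A = b. The system is solvable iff this greatest solution x_max
# actually satisfies the equations.

def compose(x, A):
    """(x o A)_j = max_i min(x_i, A[i][j])."""
    return [max(min(xi, row[j]) for xi, row in zip(x, A))
            for j in range(len(A[0]))]

def max_solution(A, b):
    """x_i = min_j (A[i][j] -> b_j) with the Goedel implication."""
    impl = lambda a, c: 1.0 if a <= c else c
    return [min(impl(aij, bj) for aij, bj in zip(row, b)) for row in A]

A = [[0.7, 0.2],
     [0.5, 0.9]]
b = [0.5, 0.6]
x_max = max_solution(A, b)
print(x_max, compose(x_max, A) == b)
# → [0.5, 0.6] True
```

Every minimal solution lies componentwise below x_max, which is why the branch-point solutions discussed in the paper sit between the minimal solutions and this maximum solution.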

15.
To relieve the storage pressure of wireless capsule endoscopy images on electronic devices and servers, a lossless compression algorithm adaptive to irregular textures is proposed. Within each image block, an extended angular prediction mode finds the five reference pixels nearest to the pixel to be predicted and assigns different weights to three of them; meanwhile, according to the gradient variation of neighboring pixel values, the range of candidate predictions along irregular texture directions is enlarged, the optimal prediction is selected based on the block's minimum information entropy, and the prediction residual is obtained as the difference between the true and predicted values, adapting the scheme to images with irregular texture. A cross-component prediction mode then selects optimal prediction coefficients and builds a linear relationship matching the distribution of prediction residuals within the block, eliminating redundancy among the three components of the current pixel. The remaining residuals after multi-angle and cross-component prediction are entropy-coded with the Deflate algorithm. Experimental results on the Kvasir-Capsule dataset show an average lossless compression ratio of 5.81; compared with WebP, SAP, MDIP, and other algorithms, the proposed algorithm achieves better compression performance and effectively improves the redundancy elimination rate, which is about 1.9% higher than that of WebP.

16.
Image inpainting based on block partition of the damaged region
Objective: To propose an algorithm that enables a computer to automatically repair images with large damaged regions and complex structural information. Methods: Imitating the manual procedure for repairing such images, the image is restored in two steps. 1) Partition of the damaged region: first, the broken boundary curves are matched in pairs; then the matched curves are connected directly, dividing the damaged region into blocks to be repaired. 2) Repair of each block: the transport and diffusion equations of the BSCB algorithm iteratively propagate and diffuse the selected neighborhood information into each damaged block, repairing the highest-priority blocks first. If blocks of the next priority remain, a boundary-line deletion algorithm removes some redundant boundary lines and the next-priority blocks are repaired in the same way; otherwise the repair is complete. Results: Based on these steps, an inpainting algorithm with block partition of the damaged region is proposed. Compared with three other algorithms on images with large damaged regions and complex structure, the proposed algorithm improves the PSNR of repaired images by 1.49 dB on average while producing better visual quality. Conclusion: Compared with the three other algorithms, the proposed algorithm is better suited to repairing images with large damaged regions and complex structural information.

17.
刘锐  董社勤  洪先龙  龙迪  顾钧 《软件学报》2004,15(5):641-649
In analog integrated circuit design, stacks that are symmetric about both the X and Y axes, together with module merging, are crucial for improving device matching and controlling parasitics. This paper describes algorithms for generating two-axis-symmetric stacks and for module merging in analog ICs. Through a study of symmetric Euler graphs and symmetric Euler paths, several theoretical results are obtained. On this basis, an O(n)-time dummy-device insertion algorithm, a symmetric Euler path construction algorithm, and a two-axis-symmetric stack generation algorithm are proposed. The generated stacks are not only symmetric about the X and Y axes but also have a common-centroid structure. A module merging algorithm is also described, together with a formula for computing the maximum merging distance. The algorithms are essentially independent of any topological representation. Experimental results verify their effectiveness.

18.
Modifying the data distribution over the course of a program to adapt to variations in the data access patterns may lead to significant computational benefits in many scientific applications. Therefore, dynamic realignment of data is used to enhance algorithm performance in parallel programs on distributed memory machines. This paper presents a new method that improves the efficiency of block-cyclic data realignment of sparse matrices. The main idea of the proposed technique is first to develop closed forms for generating the Vector Index Set (VIS) of each source/destination processor. Based on the vector index set and the nonzero structure of the sparse matrix, two efficient algorithms, vector2message (v2m) and message2vector (m2v), can be derived. The proposed technique uses v2m to extract nonzero elements from source compressed structures and pack them into messages in the source stage, and uses m2v to unpack each received message and construct the destination matrix in the destination stage. A significant improvement of this approach is that a processor does not need to determine the complicated sending or receiving data sets for dynamic data redistribution, so the indexing cost is reduced considerably. The second advantage of the present techniques is the achievement of optimal packing/unpacking stages owing to the informative VIS tables. Another contribution of our methods is the ability to handle sparse matrix redistribution under two disjoint processor grids in the source and destination phases. A theoretical model to analyze the performance of the proposed technique is also presented in this work. To evaluate the performance of our methods, we have implemented the present algorithms on an IBM SP2 parallel machine along with the Histogram method and a dense redistribution strategy. The experimental results show that our technique provides significant improvement for runtime data redistribution of sparse matrices in most test samples.

19.
Summary. We present an optimal algorithm to determine whether a placement of N isothetic non-overlapping rectangles (macros) can be represented by a slicing tree, and if so, to find a representation of minimal height. A slicing is a recursive partition of the overall bounding rectangle, by straight horizontal or vertical cuts, into rectangular regions, each one containing exactly one macro. The algorithm first determines a representation of the empty space of the placement by means of maximally extended horizontal and vertical channels. A second phase then generates a maximal slicing tree (an ordered tree with unbounded degree and maximal branching, i.e., minimal height) in a top-down fashion. The complexity of each phase is O(N log N). The problem arises in steps (1) and (2) of our top-down approach to VLSI custom chip design, which consists of (1) floorplanning by slicing, (2) hierarchical global wiring, and (3) detailed layout of macros. On leave from: Laboratory of Information Processing Science, Helsinki University of Technology, Espoo, Finland. On leave from: Dipartimento di Elettrotecnica, Elettronica e Informatica, Università degli Studi di Trieste, Via Valerio, 10-34127 Trieste, Italy.
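Whether a placement admits any slicing tree can also be checked by a brute-force guillotine recursion; this is only a naive sketch (far less efficient than the paper's O(N log N) channel-based algorithm) and says nothing about minimal height:

```python
# Sketch: a placement is sliceable iff some full horizontal or
# vertical cut avoids crossing every macro and both sides are
# recursively sliceable. Candidate cuts need only be macro edges,
# since any valid cut can be slid to a macro edge.

def sliceable(rects):
    """rects: non-overlapping (x0, y0, x1, y1) macros."""
    if len(rects) <= 1:
        return True
    for axis in (0, 1):                       # 0: vertical cut, 1: horizontal
        cuts = {r[axis] for r in rects} | {r[axis + 2] for r in rects}
        for cut in cuts:
            left = [s for s in rects if s[axis + 2] <= cut]
            right = [s for s in rects if s[axis] >= cut]
            if 0 < len(left) < len(rects) and len(left) + len(right) == len(rects):
                return sliceable(left) and sliceable(right)
    return False

print(sliceable([(0, 0, 1, 2), (1, 0, 2, 2)]))   # side by side: True
pinwheel = [(0, 0, 2, 1), (2, 0, 3, 2), (1, 2, 3, 3), (0, 1, 1, 3)]
print(sliceable(pinwheel))                        # classic pinwheel: False
```

Taking the first valid cut is safe because any subset of a sliceable placement is again sliceable, so no valid cut can destroy slicability.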

20.
The problem of finding the minimal spanning tree of an undirected weighted graph has been investigated by many people, and O(n^2) algorithms are well known. P. M. Spira and A. Pan (SIAM J. Computing 4 (1975), 375-380) present an O(n) algorithm for updating the minimal spanning tree when a new vertex is inserted into the graph. In this paper, we present another O(n) algorithm, simpler than that of Spira and Pan, for the insertion of a vertex. Spira and Pan further show that the deletion of a vertex requires O(n^2) steps. If all the vertices are considered, O(n^3) steps may be used. The algorithm we present here takes only O(n^2) steps and labels the vertices of the graph in such a way that any vertex may be deleted from the graph and the minimal spanning tree updated in constant time. Similar results are obtained for the insertion and the deletion of an edge.
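The vertex-insertion update rests on the standard observation that the new MST uses only the old tree edges plus the new vertex's edges; for simplicity this sketch recomputes with Kruskal in O(n log n), whereas the algorithms discussed above achieve O(n):

```python
# Sketch: after inserting vertex 3, only the old MST edges plus the
# new vertex's edges can appear in the updated MST, so Kruskal over
# those 2n - 1 edges suffices.

def kruskal(n, edges):
    parent = list(range(n))
    def find(v):                       # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    mst = []
    for w, u, v in sorted(edges):      # edges as (weight, u, v)
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

old_mst = [(1, 0, 1), (2, 1, 2)]                  # MST on vertices 0..2
new_edges = [(1, 3, 0), (5, 3, 1), (1, 3, 2)]     # vertex 3 inserted
updated = kruskal(4, old_mst + new_edges)
print(sorted(updated))
# → [(1, 0, 1), (1, 3, 0), (1, 3, 2)]
```

Here the cheap new edges (3,0) and (3,2) displace the old weight-2 edge, exactly the kind of exchange an O(n) update detects without a full recomputation.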
