首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
In this paper we analyze the scalability of a number of load balancing algorithms which can be applied to problems that have the following characteristics: the work done by a processor can be partitioned into independent work pieces; the work pieces are of highly variable sizes; and it is not possible (or very difficult) to estimate the size of total work at a given processor. Such problems require a load balancing scheme that distributes the work dynamically among different processors. Our goal here is to determine the most scalable load balancing schemes for different architectures such as hypercube, mesh, and network of workstations. For each of these architectures, we establish lower bounds on the scalability of any possible load balancing scheme. We present the scalability analysis of a number of load balancing schemes that have not been analyzed before. This gives us valuable insights into their relative performance for different problem and architectural characteristics. For each of these architectures, we are able to determine near optimal load balancing schemes. Results obtained from implementation of these schemes in the context of the Tautology Verification problem on the Ncube/2 (a trademark of the Ncube Corporation) multicomputer are used to validate our theoretical results for the hypercube architecture. These results also demonstrate the accuracy and viability of our framework for scalability analysis.  相似文献   

2.
In this paper, we revisit the problem of load-balancing structured peer-to-peer systems with on-line protocols. Load-balancing is of major significance for large-scale decentralized networks in terms of enhanced scalability and performance. The main incentives behind balancing schemes are under-utilization of bandwidth and computer resources. Therefore, our methods focus mainly on task-skew. Specifically, we address the problem with on-line protocols on the basis of migration and enhanced availability. In particular, the cornerstones of our methods are the notions of virtual nodes, replication and Multiple realities, combined altogether with allocation techniques based on balls-in-bins games. The rationale of our dynamic protocol to depend exclusively on peer load distribution preserves intact the structural properties and search efficiency of the overlay used as an indexing infrastructure, while preserving the semantic information of the data (e.g., range partitioned network). We also propose an effective load-aware mechanism to facilitate robust operations that counteract against contingent churn failures. Finally, our work is complemented with extensive experiments using both real and synthetic data sets.  相似文献   

3.
This paper describes changes made to a previous implementation of an N-body tree code developed for a fine-grained, SIMD computer architecture. These changes include (1) switching from a balanced binary tree to a balanced oct tree, (2) addition of quadrupole corrections, and (3) having the particles search the tree in groups rather than individually. An algorithm for limiting errors is also discussed. In aggregate, these changes have led to a performance increase of over a factor of 10 compared to the previous code. For problems several times larger than the processor array, the code now achieves performance levels of ∼ 1 Gflop on the Maspar MP-2 or roughly 20% of the quoted peak performance of this machine. This percentage is competitive with other parallel implementations of tree codes on MIMD architectures. This is significant, considering the low relative cost of SIMD architectures.  相似文献   

4.
This paper presents a parallel formulation of depth-first search which retains the storage efficiency of sequential depth-first search and can be mapped on to anyMIMD architecture. To study its effectiveness it has been implemented to solve the 15-puzzle problem on three commercially available multiprocessors—Sequent Balance 21000, the Intel Hypercube and BBN Butterfly. We have been able to achieve fairly linear speedup on Sequent up to 30 processors (the maximum configuration available) and on the Intel Hypercube andBBN Butterfly up to 128 processors (the maximum configurations available). Many researchers considered the ring architecture to be quite suitable for parallel depth-first search. Our experimental results show that hypercube and sharedmemory architectures are significantly better.At the heart of our parallel formulation is a dynamic work distribution scheme that divides the work between different processors. The effectiveness of the parallel formulation is strongly influenced by the work distribution scheme and architectural features such as presence/absence of shared memory, the diameter of the network, relative speed of the communication network, etc. In a companion paper,(1) we analyze the effectiveness of different load-balancing schemes and architectures, and also present new improved work distribution schemes.This work was supported by Army Research Office Grant No. DAAG29-84-K-0060 to the Artificial Intelligence Laboratory, and Office of Naval Research Grant N00014-86-K-0763 to the Computer Science Department at the University of Texas ay Austin  相似文献   

5.
We present a parallel formulation for enumerative search in high dimensional spaces and apply it to planning paths for a 6-dof manipulator robot. Participating processors perform local A* search towards the goal configuration. To exploit all the processors at their maximum capacity at all times, a dynamic load-balancing scheme matches idle and busy processors for load transfer. For comparison purposes, we have also implemented an existing parallel static load-balancing formulation based on regular domain decomposition. Both methods achieved almost linear speed-up in our experiments. The two methods follow different search strategies in parallel and the implementation of the existing method (with tuned space decomposition) was more time efficient on average. However, the planning time of that method is highly dependent on the distribution of the search space among the processors and its tuned decomposition varies for different obstacle placements. Empirical selection of the space decomposition parameters for the existing method does not guarantee minimal planning time in all environments and leads to slower planning than our dynamic load-balancing method in some cases. The performance of the developed dynamic method is independent of the obstacle placements and the method can achieve consistent speed-up in all environments.  相似文献   

6.
Image processing problems frequently involve large structured arrays of data and a need for very rapid computation. Special parallel processing schemes have evolved over the last 20 years to deal with these problems. In this paper many parallel systems which have been developed for image processing are outlined and the features of their underlying architectures are discussed. Most of these special architectures may be loosely classified as either SIMD or pipeline structures although some MIMD structures have been designed for high level image analysis. In recent years several multiple SIMD (MSIMD) schemes have been proposed as suitable architectures for image processing. The fundamental problems of developing an effective MSIMD system are discussed and a simple SIMD/MIMD computational model for comparison with such systems is proposed.  相似文献   

7.
Finger vein image retrieval is a biometric identification technology that has recently attracted a lot of attention. It has the potential to reduce the search space and has attracted a considerable amount of research effort recently. It is a challenging problem owing to the large number of images in biometric databases and the lack of efficient retrieval schemes. We apply a hierarchical vocabulary tree modelbased image retrieval approach because of its good scalability and high efficiency.However, there is a large accumulative quantization error in the vocabulary tree (VT)model thatmay degrade the retrieval precision. To solve this problem, we improve the vector quantization coding in the VT model by introducing a non-negative locality-constrained constraint: the non-negative locality-constrained vocabulary tree-based image retrieval model. The proposed method can effectively improve coding performance and the discriminative power of local features. Extensive experiments on a large fused finger vein database demonstrate the superiority of our encoding method. Experimental results also show that our retrieval strategy achieves better performance than other state-of-theart methods, while maintaining low time complexity.  相似文献   

8.
Search of discrete spaces is important in combinatorial optimization. Such problems arise in artificial intelligence, computer vision, operations research, and other areas. For realistic problems, the search spaces to be processed are usually huge, necessitating long computation times, pruning heuristics, or massively parallel processing. We present an algorithm that reduces the computation time for graph matching by employing both branch-and-bound pruning of the search tree and massively-parallel search of the as-yet-unpruned portions of the space. Most research on parallel search has assumed that a multiple-instruction-stream/multiple-data-stream (MIMD) parallel computer is available. Since massively parallel stream (SIMD) computers are much less expensive than MIMD systems with equal numbers of processors, the question arises as to whether SIMD systems can efficiently handle state-space search problems. We demonstrate that the answer is yes, and in particular, that graph matching has a natural and efficient implementation on SIMD machines  相似文献   

9.
负载平衡是提高分布式系统性能不可缺少的技术,同时也是系统高可用性、可扩展性、容错性的必然要求。该文在分析和研究负载平衡模型的基础上,提出了一种基于消息队列的负载平衡模型,并在此模型下改进了基于阈值的动态负载平衡算法,给出了一种自适应的动态负载平衡算法。最后,在J2EE平台下,进行了实验和性能比较。  相似文献   

10.
陈丽娜  赵建民 《计算机科学》2011,38(2):144-147,165
在传统的基于时序逻辑的模型检查框架下验证Statcchart模型面临三大挑战:全状态空间搜索、多次重复搜索和复杂时序逻辑公式难写。基于上述问题和实践工作,提出一种新的Statechart模型验证方法。该方法的中心是一种强化了的属性描述语言—属性状态图,并利用属性状态图中存在的先后关系和并发关系,把各个属性状态图有机地结合成一个树结构—属性树。属性树涵盖了目标系统要求验证的属性空间,因此可自上而下的验证整棵属性树。在验证过程中系统Statechart模型对应状态空间是逐步展开的,每验证部分属性就展开相应的部分状态空间并对其进行验证,验证过程是基于属性树转换并以step为单位,验证step的初始status和结束status是否满足对应属性树节点公式对其的属性约束,这样既能够迅速找出错误又能屏蔽step内部系统Statcchart模型的状态变化,使得验证过程更简单快捷。为了说明属性状态图和基于其的验证算法是实用和易用的,通过一个例子说明了从模型设计到具体验证整个过程。  相似文献   

11.
通过改进ACO算法达到一种能实现通信网络负载平衡的群体智能路由策略。采用的主要方法为:将蚁群划分为若干个子群,不同子群的蚂蚁释放不同类型的信息素。通过不同类型信息素之间的相互制约作用,以及链路负载的测量,提出了三种策略实现负载平衡路由。在模拟网络上进行了不同策略的对比实验,以及与已有的群体智能路由算法的运行测试比较。实验结果表明,本文的路由策略具有较好的效果和一定的优势。  相似文献   

12.
The bulge-chasing kernel in the small-bulge multi-shift QR algorithm for the non-symmetric dense eigenvalue problem becomes a sequential bottleneck when the QR algorithm is run in parallel on a multicore platform with shared memory. The duration of each kernel invocation is short, but the critical path of the QR algorithm contains a long sequence of calls to the bulge-chasing kernel. We study the problem of parallelizing the bulge-chasing kernel itself across a handful of processor cores in order to reduce the execution time of the critical path. We propose and evaluate a sequence of four algorithms with varying degrees of complexity and verify that a pipelined algorithm with a slowly shifting block column distribution of the Hessenberg matrix is superior. The load-balancing problem is non-trivial and computational experiments show that the load-balancing scheme has a large impact on the overall performance. We propose two heuristics for the load-balancing problem and also an effective optimization method based on local search. Numerical experiments show that speed-ups are obtained for problems as small as 40 × 40 on two different multicore architectures.  相似文献   

13.
In the multisignature scheme with distinguished signing authorities, each group member is responsible for his own signing of a partial message. Recently, Wu and Hsu proposed the multisignature schemes with distinguished signing authorities for sequential and broadcasting architectures. However, in their schemes, the computational complexity for a verifier to verify the multisignature is dependent on the number of members in the group. It is very time-consuming for a verifier to verify the multisignature when the number of group members increases. In this paper, we propose new multisignature schemes with distinguished signing authorities for sequential and broadcasting architectures. In our methods, the computational complexity for a verifier to verify the multisignature will not be significantly affected by the number of signers in the group G. Hence, the verification time for any multisignature is almost equivalent to that of an individual signature.  相似文献   

14.
In this paper, we consider the design of high performance SIMD architectures. We examine three mechanisms by which the performance of this class of machines may be improved, and which have been largely unexplored by the SIMD community. The mechanisms are pipelined instruction broadcast, pipelining of the PE architecture, and the introduction of a novel memory hierarchy in the PE address space which we denote the direct only data cache, (dod-cache). For each of the performance improvements, we develop analytical models of the potential speedup, and apply those models to real program traces obtained on a MasPar MP-2 system. In addition, we consider the impact of all improvements taken together  相似文献   

15.
基于位宽控制提高SIMD架构并行度的优化算法   总被引:1,自引:0,他引:1  
随着SIMD功能单元作为多媒体加速部件的广泛应用,如何有效利用这一构架优化应用程序成为编译优化研究的热点.目前典型的SIMD结构为同一操作对不同的数据化宽提供了不同的指令版本,随着操作数位宽的增加,对应的SIMD指令可同时完成的操作个数也随之降低.因此,如何有效识别操作数的有效位宽,对提高优化过程中SIMD指令内操作的并行度将产生至关重要的影响.文中针对SIMD优化面临的并行度问题,提出了一种优化算法,该算法在对操作数的有效位进行分析的基础上,进行溢出控制,从而减少操作数对宽位宽数据类型的依赖.实验数据表明,该算法可以有效提高多媒体程序优化的并行度,对多媒体程序获得较好的加速效果.  相似文献   

16.
Heterogeneity, parallelization and vectorization are key techniques to improve the performance and energy efficiency of modern computing systems. However, programming and maintaining code for these architectures poses a huge challenge due to the ever-increasing architecture complexity. Task-based environments hide most of this complexity, improving scalability and usage of the available resources. In these environments, while there has been a lot of effort to ease parallelization and improve the usage of heterogeneous resources, vectorization has been considered a secondary objective. Furthermore, there has been a swift and unstoppable burst of vector architectures at all market segments, from embedded to HPC. Vectorization can no longer be ignored, but manual vectorization is tedious, error-prone and not practical for the average programmer. This work evaluates the feasibility of user-directed vectorization in task-based applications. Our evaluation is based on the OmpSs programming model, extended to support user-directed vectorization for different SIMD architectures (i.e., SSE, AVX2, AVX512). Results show that user-directed codes achieve manually optimized code performance and energy efficiency with minimal code modifications, favoring portability across different SIMD architectures.  相似文献   

17.
Clusters of workstations, connected by a fast network, are emerging as a viable architecture for building high-throughput fault-tolerant servers. This type of architecture is more scalable and more cost-effective than a tightly coupled multiprocessor and may achieve as good a throughput. Two of the most important issues that a designer of such clustered servers must consider in order for the system to meet its fault-tolerance and throughput goals are the load-balancing scheme and the fault-tolerance scheme that the system will use. This paper explores several combinations of such fault-tolerance and load-balancing schemes and compares their impact on the maximum throughout achievable by the system, and on its survivability. In particular, we show that a fault-tolerance scheme may have an effect on the throughput of the system, while a load-balancing scheme may affect the ability of the system to override failures. We study the scalability of the different schemes under different loads and failure conditions. Our simulations take into consideration the overhead of each scheme, the network contention, and the resource loads.  相似文献   

18.
We present a unified framework for applying iteration reordering transformations. This framework is able to represent traditional transformations such as loop interchange, loop skewing and loop distribution as well as compositions of these transformations. Using a unified framework rather than a sequence of ad-hoc transformations makes it easier to analyze and predict the effects of these transformations. Our framework is based on the idea that all reordering transformations can be represented as a mapping from the original iteration space to a new iteration space. An optimizing compiler would use our framework by finding a mapping that both corresponds to a legal transformation and produces efficient code. We present the mapping selection problem as a search problem by decomposing it into a sequence of smaller choices. We then characterize the set of all legal mappings by defining a search tree. As part of this process we use a new operation called affine closure.  相似文献   

19.
《Computer Networks》2007,51(8):1861-1881
The success of a P2P file-sharing network highly depends on the scalability and versatility of its search mechanism. Two particularly desirable search features are scope (ability to find infrequent items) and support for partial-match queries (queries that contain typos or include a subset of keywords). While centralized-index architectures (such as Napster) can support both these features, existing decentralized architectures seem to support at most one: prevailing unstructured P2P protocols (such as Gnutella and FastTrack) deploy a “blind” search mechanism where the set of peers probed is unrelated to the query; thus they support partial-match queries but have limited scope. On the other extreme, the recently-proposed distributed hash tables (DHTs) such as CAN and CHORD, couple index location with the item’s hash value, and thus have good scope but can not effectively support partial-match queries. Another hurdle to DHTs deployment is their tight control of the overlay structure and the information (part of the index) each peer maintains, which makes them more sensitive to failures and frequent joins and disconnects.We develop a new class of decentralized P2P architectures. Our design is based on unstructured architectures such as Gnutella and FastTrack, and retains many of their appealing properties including support for partial match queries, and relative resilience to peer failures. Yet, we obtain orders of magnitude improvement in the efficiency of locating rare items. Our approach exploits associations inherent in human selections to steer the search process to peers that are more likely to have an answer to the query. We demonstrate the potential of associative search using models, analysis, and simulations.  相似文献   

20.
Peer-to-peer (P2P) networks are beginning to form the infrastructure of future applications. Computers are organized in P2P overlay networks to facilitate search queries with reasonable cost. So, scalability is a major aim in design of P2P networks. In this paper, to obtain a high factor of scalability, we partition network search space using a consistent static shared upper ontology. We name our approach semantic partition tree (SPT). All resources and queries are annotated using the upper ontology and queries are semantically routed in the overlay network. Also, each node indexes addresses of other nodes that possess contents expressible by the concept it maintains. So, our approach can be conceived as an ontology-based distributed hash table (DHT). Also, we introduce a lookup service for the network which is very scalable and independent of the network size and just depends on depth of the ontology tree. Further, we introduce a broadcast algorithm on the network. We present worst case analysis of both lookup and broadcast algorithms and measure their performance using simulation. The results show that our scheme is highly scalable and can be used in real P2P applications.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号