Similar Documents
20 similar documents found (search time: 31 ms)
1.
Parallelizing the Data Cube   (cited 1 time: 0 self-citations, 1 by others)
This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two different partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both partitioning strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce interprocessor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel. Our partitioning strategies create a small number of coarse tasks, which allows prefixes and sort orders to be shared between different group-by computations. Our methods enable code reuse by permitting the use of existing sequential (external memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting. The bottom-up partitioning strategy balances the number of single-attribute external memory sorts made by each processor. The top-down strategy partitions a weighted tree in which the weights reflect algorithm-specific cost measures such as estimated group-by sizes. Both partitioning approaches can be implemented on any shared-disk parallel machine composed of p processors connected via an interconnection fabric with access to a shared parallel disk array. We have implemented our parallel top-down data cube construction method in C++ with the MPI message passing library for communication and the LEDA library for the required graph algorithms. We tested our code on an eight-processor cluster, using a variety of data sets with a range of sizes, dimensions, densities, and skews. Comparison tests were performed on a SunFire 6800. The tests show that our partitioning strategies generate a close to optimal load balance between processors. The actual run times observed show an optimal speedup of p.
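
As a concrete illustration of the load-balancing idea behind both partitioning strategies, the following Python sketch greedily assigns coarse subcube tasks to p processors by estimated cost. It is a minimal sketch under assumed cost estimates, not the paper's C++/MPI implementation; all names and the example sizes are illustrative.

    # Minimal sketch: balance coarse subcube tasks across p processors by
    # estimated cost (e.g. estimated group-by sizes), largest task first.
    # Illustrates the load-balancing idea only; it is not the paper's code.
    import heapq

    def partition_tasks(task_costs, p):
        """Assign tasks (with cost estimates) to p processors, balancing total cost."""
        loads = [(0.0, proc, []) for proc in range(p)]      # (load, processor id, tasks)
        heapq.heapify(loads)
        for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
            load, proc, tasks = heapq.heappop(loads)        # least-loaded processor
            tasks.append(task)
            heapq.heappush(loads, (load + cost, proc, tasks))
        return {proc: tasks for _, proc, tasks in loads}

    # Example: group-bys of a three-dimensional cube with estimated result sizes.
    print(partition_tasks({"ABC": 8, "AB": 5, "AC": 4, "BC": 4,
                           "A": 2, "B": 2, "C": 1, "ALL": 1}, p=3))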

2.
Initially, parallel algorithms were designed by parallelising the existing sequential algorithms for frequently occurring problems on available parallel architectures.

More recently, parallel strategies have been identified and utilised resulting in many new parallel algorithms. However, the analysis of such techniques reveals that further strategies can be applied to increase the parallelism. One of these strategies, i.e., increasing the computational work in each processing node, can reduce the memory accesses and hence congestion in a shared memory multiprocessor system. Similarly, when network message passing is minimised in a distributed memory processor system, dramatic improvements in the performance of the algorithm ensue.

A frequently occurring computational problem in digital signal processing (DSP) is the solution of symmetric positive definite Toeplitz linear systems. The Levinson algorithm for solving such linear systems exploits the Toeplitz property of the matrix to advantage in the elimination of each element. It can be shown, however, that in the Parallel Implicit Elimination (PIE) method, where more than one element is eliminated simultaneously, the Toeplitz structure can again be exploited to advantage. This relatively simple strategy yields a reduction in shared-memory accesses or network message passing, resulting in a significant improvement in the performance of the algorithm [2].
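
For reference, the sequential building block the paper starts from can be sketched as follows: a standard Levinson recursion for a symmetric positive definite Toeplitz system T x = y, given the first column t of T. This is a plain-Python sketch of the classical O(n^2) algorithm, not the PIE method or the parallel variant discussed above.

    def levinson_spd_toeplitz(t, y):
        """Solve T x = y where T is symmetric positive definite Toeplitz with
        first column t (t[0] on the diagonal).  O(n^2) work, O(n) extra space."""
        n = len(y)
        f = [1.0 / t[0]]                  # forward vector: T_1 f = e_1
        x = [y[0] / t[0]]                 # solution of the 1x1 leading system
        for k in range(1, n):
            # error terms of the zero-padded forward vector and current solution
            eps_f = sum(t[k - i] * f[i] for i in range(k))
            eps_x = sum(t[k - i] * x[i] for i in range(k))
            denom = 1.0 - eps_f * eps_f   # symmetric case: forward/backward errors coincide
            a = 1.0 / denom
            c = -eps_f / denom
            b_prev = f[::-1]              # backward vector = reversed forward (symmetric T)
            f = [a * fi for fi in f] + [0.0]
            for i in range(k):            # f_{k+1} = a*[f_k;0] + c*[0;b_k]
                f[i + 1] += c * b_prev[i]
            b = f[::-1]                   # backward vector of order k+1
            x = x + [0.0]
            corr = y[k] - eps_x
            x = [xi + corr * bi for xi, bi in zip(x, b)]
        return x

    # Example: [[2,1],[1,2]] x = [1,1]  ->  x = [1/3, 1/3]
    print(levinson_spd_toeplitz([2.0, 1.0], [1.0, 1.0]))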

3.
Design of a Message-Memory Network Interface for Network Parallel Computing Systems   (cited 4 times: 0 self-citations, 4 by others)
Through a qualitative analysis of typical parallel application programs, this paper proposes and defines a message-passing factor R, the proportion of all message passing accounted for by transfers of heap data. Trace statistics subsequently collected for a set of typical parallel applications in a real NPC (network parallel computing) environment confirm the analysis that R is close to 1. Based on this qualitative analysis and the quantitative statistics, and taking advances in memory technology into account, a message memory is introduced at the network interface of the NPC, allowing each node to directly access the message memory of other nodes; it is concluded that in an NPC whose network interfaces are equipped with message memory …

4.
We present a simple and efficient mutual exclusion algorithm whose optimal message passing complexity is O(N), where N is the number of processors in the network. The message complexity is measured by counting the number of communication hops in a network of a given topology. The algorithm reduces its message passing complexity by a token-chasing method and enhances its effectiveness by dynamically adjusting the state information stored in each processor. Moreover, it shortens the request delay by fully exploiting dynamic network status information. The performance of the algorithm is also modeled for analytical evaluation. We have conducted a group of experiments on a network of workstations to compare our algorithm with two other existing mutual exclusion algorithms. The experimental results show the effectiveness of our algorithm, especially when a large number of requests access the critical region in a distributed system. Finally, the token-chasing algorithm is further enhanced for fault tolerance under message loss and link crash conditions.
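
The token-chasing idea can be illustrated with a toy, single-threaded simulation in which each node keeps a guess of the token holder and requests are forwarded along those guesses until they reach the token. This is only a sketch of the mechanism; the paper's actual state adjustment, request queuing, and fault-tolerance handling are not reproduced here.

    # Toy, sequential simulation of token-chasing mutual exclusion: each node
    # remembers its best guess of the token holder ("probable owner") and a
    # request is forwarded along that chain until it reaches the token.
    class Node:
        def __init__(self, nid, probable_owner):
            self.nid = nid
            self.probable_owner = probable_owner   # guess of who holds the token
            self.has_token = (nid == probable_owner)

        def request(self, nodes, requester):
            hops, node = 0, self
            while not node.has_token:              # chase the token along the guesses
                nxt = nodes[node.probable_owner]
                node.probable_owner = requester    # later requests chase the new owner
                node = nxt
                hops += 1
            node.has_token = False                 # hand the token to the requester
            node.probable_owner = requester
            nodes[requester].has_token = True
            nodes[requester].probable_owner = requester
            return hops

    nodes = {i: Node(i, probable_owner=0) for i in range(5)}
    print("hops:", nodes[3].request(nodes, requester=3))   # node 3 enters the critical region
    print("hops:", nodes[1].request(nodes, requester=1))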

5.
Moving data between processes has often been discussed as one of the major bottlenecks in parallel computing—there is a large body of research striving to improve communication latency and bandwidth on different networks, measured with ping-pong benchmarks of different message sizes. In practice, the data to be communicated generally originates from application data structures and needs to be serialized before being communicated over serial network channels. This serialization is often done by explicitly copying the data to communication buffers. The Message Passing Interface (MPI) standard defines derived datatypes to allow zero-copy formulations of non-contiguous data access patterns. However, many applications still choose to implement manual pack/unpack loops, partly because they are more efficient than some MPI implementations. MPI implementers, on the other hand, do not have good benchmarks that represent important application access patterns. We demonstrate that data serialization can consume up to 80% of the total communication overhead for important applications. This indicates that most of the current research on optimizing serial network transfer times may be targeted at the smaller fraction of the communication overhead. To support the scientific community, we extracted the send/recv-buffer access patterns of a representative set of scientific applications to build a benchmark that includes serialization and communication of application data and thus reflects all communication overheads. It can be used like traditional ping-pong benchmarks to determine the holistic communication latency and bandwidth as observed by an application. It supports serialization loops in C and Fortran as well as MPI datatypes for representative application access patterns. Our benchmark, consisting of seven micro-applications, unveils significant performance discrepancies between the MPI datatype implementations of state-of-the-art MPI implementations. Our micro-applications aim to provide a standard benchmark for MPI datatype implementations to guide optimizations, similarly to the established SPEC CPU and Livermore Loops benchmarks.
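
The manual pack/unpack loops the benchmark targets can be illustrated with a small sketch: a non-contiguous access pattern (here, a matrix column) is copied into a contiguous buffer before sending and scattered back after receiving. The layout and element types are illustrative assumptions; with MPI derived datatypes (for example MPI_Type_vector in C), the same layout would be described to the library instead of copied.

    # Sketch of the explicit serialization the paper measures: a matrix column
    # (non-contiguous in row-major storage) is packed into a contiguous send
    # buffer and unpacked on the receiving side.
    def pack_column(matrix, col):
        """Serialize one column of a row-major matrix into a contiguous buffer."""
        return [row[col] for row in matrix]          # the explicit copy loop

    def unpack_column(buf, matrix, col):
        """Scatter a received contiguous buffer back into a matrix column."""
        for i, value in enumerate(buf):
            matrix[i][col] = value

    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    sendbuf = pack_column(a, col=1)                  # [2, 5, 8]: cost measured by the benchmark
    b = [[0] * 3 for _ in range(3)]
    unpack_column(sendbuf, b, col=1)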

6.
This study presents a method to construct formal rules used to verify, at run time, message passing between clients in distributed systems. Rule construction proceeds in four steps: (1) visual specification of the expected behavior of the sender, receiver, and network in sending and receiving a message; (2) extraction of the properties of the sender, receiver, and network from the visual specification; (3) specification of the constraints that should govern message passing in distributed systems; and (4) construction of verifier rules from the properties and the constraints. The rules are used to verify the actual sender, receiver, and network behavior. The expected behavior of a client (process) is the behavior it should exhibit, while the actual behavior is the one verified against the rules. The rules were applied to verify the behavior of clients and servers that communicated with each other to compute Fibonacci numbers in parallel, and some violations were discovered.

7.
A reconfigurable network termed the reconfigurable multi-ring network (RMRN) is described. The RMRN is shown to be a truly scalable network in that each node in the network has a fixed degree of connectivity and the reconfiguration mechanism ensures a network diameter of O(log2 N) for an N-processor network. Algorithms for the two-dimensional mesh and the SIMD or SPMD n-cube are shown to map very elegantly onto the RMRN. Basic message passing and reconfiguration primitives for the SIMD/SPMD RMRN are designed for use as building blocks for more complex parallel algorithms. The RMRN is shown to be a viable architecture for image processing and computer vision problems, using the parallel computation of the stereocorrelation imaging operation as an example. Stereocorrelation is one of the most computationally intensive imaging tasks. It is used as a visualization tool in many applications, including remote sensing, geographic information systems and robot vision. An earlier version of this paper was presented at the 1995 International Conference on Parallel and Distributed Processing Techniques and Applications.
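
A minimal sketch of the stereocorrelation kernel mentioned above: for a window in the left scanline, find the disparity whose window in the right scanline maximizes the normalized cross-correlation. Window size, search range, and the sample data are illustrative assumptions, not values from the paper.

    # Block matching along a scanline: the disparity with the highest normalized
    # cross-correlation (NCC) is selected.  Illustrative sketch only.
    def ncc(u, v):
        mu, mv = sum(u) / len(u), sum(v) / len(v)
        num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
        du = sum((a - mu) ** 2 for a in u) ** 0.5
        dv = sum((b - mv) ** 2 for b in v) ** 0.5
        return num / (du * dv) if du and dv else 0.0

    def best_disparity(left, right, x, win=3, max_disp=4):
        ref = left[x:x + win]
        scores = [(ncc(ref, right[x - d:x - d + win]), d)
                  for d in range(min(max_disp, x) + 1)]
        return max(scores)[1]            # disparity with the highest correlation

    left  = [1, 3, 2, 7, 5, 9, 4, 6]
    right = [2, 7, 5, 9, 4, 6, 0, 0]     # right view: the scene shifted by 2 pixels
    print(best_disparity(left, right, x=4))   # -> 2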

8.
One of the grand challenges for computer applications is the creation of a system that will provide accurate computer simulations of physical objects coupled with powerful design optimization tools to allow optimum prototyping and the final design of a broad range of physical objects. We refer to such a software environment as electronic prototyping for physical object design (EPPOD). The research challenges in building such systems lie in software integration, in utilizing massive parallelism to satisfy their large computational requirements, in incorporating knowledge into the entire electronic prototyping process, in creating intelligent user interfaces for such systems, and in advancing the algorithmic infrastructure needed to support the desired functionality. In this paper we address issues related to the parallel processing of the computationally intensive components of the EPPOD problem-solving environment on message passing parallel machines and present its software architecture. The parallel methodology adopted to map the underlying computations to parallel machines is based on the optimal decomposition of the continuous and discrete geometric data associated with the physical object. One of the main goals of this methodology is the reuse of existing software parts while implementing the various components of the EPPOD system in parallel computational environments. Finally, some performance data for the parallel algorithmic infrastructure developed are listed and discussed.

9.
We investigate the effectiveness of Stackelberg strategies for atomic congestion games with unsplittable demands. In our setting, only a fraction of the players are selfish, while the rest are willing to follow a predetermined strategy. A Stackelberg strategy assigns the coordinated players to appropriately selected strategies trying to minimize the performance degradation due to the selfish players. We consider two orthogonal cases, namely congestion games with affine latency functions and arbitrary strategies, and congestion games on parallel links with arbitrary non-decreasing latency functions. We restrict our attention to pure Nash equilibria and derive strong upper and lower bounds on the pure Price of Anarchy (PoA) under different Stackelberg strategies.

10.
A model of a parallel program that can be efficiently interpreted on a development computer is studied, guaranteeing a sufficiently precise prediction of the real run time of the simulated parallel program on the prescribed computer system. The model is developed for parallel programs with explicit message passing written in Java with the MPI library and is included in the ParJava environment. The model is obtained by transforming the program's control tree, which can be constructed for Java programs by modifying the abstract syntax tree. To model communication functions, the LogGP model is used, which makes it possible to take into account the specifics of the communication network of the distributed computer system.
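
A common textbook form of the LogGP point-to-point estimate used by such models is sketched below: a k-byte message costs sender overhead o, (k-1)G for the bytes after the first, network latency L, and receiver overhead o. The parameter values in the example are illustrative assumptions, not measurements from the paper.

    # LogGP estimate of the one-way time of a single k-byte message.
    # Parameters: L = network latency, o = per-message CPU overhead,
    # G = gap per byte (inverse bandwidth).  Values below are illustrative.
    def loggp_time(k, L, o, G):
        """Predicted one-way time (seconds) of a k-byte message under LogGP."""
        return o + (k - 1) * G + L + o

    # e.g. L = 5 us latency, o = 1 us overhead, G = 1 ns per byte, 64 KiB message:
    print(loggp_time(64 * 1024, L=5e-6, o=1e-6, G=1e-9))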

11.
An Efficient Minimum Spanning Tree Algorithm on Message-Passing Parallel Machines   (cited 5 times: 0 self-citations, 5 by others)
王光荣  顾乃杰 《软件学报》2000,11(7):889-898
Building on the classical sequential Borůvka minimum spanning tree algorithm, an efficient minimum spanning tree algorithm for message-passing parallel machines is proposed. Three techniques are used to improve its efficiency: two-pass merging and packed contraction reduce the communication overhead, and balanced data distribution keeps the computational load of the processors even. The computation and communication complexities of the algorithm are O(n^2/p) and O((ts·p + tw·n)·n/p), respectively. On the Dawning-1000 parallel machine, for a sparse graph with 10,000 vertices, the algorithm achieves a speedup of 12 on 16 nodes.
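
For reference, the sequential Borůvka algorithm that the parallel method builds on can be sketched as follows; each round adds every component's cheapest outgoing edge. The parallel data distribution, two-pass merging, and packed contraction of the paper are not shown, and the sketch assumes distinct edge weights.

    # Sequential Borůvka sketch: repeatedly add each component's cheapest
    # outgoing edge until one component remains.  Assumes distinct edge weights
    # and a connected graph; union-find tracks components.
    def boruvka_mst(n, edges):
        """edges: list of (weight, u, v); returns the total MST weight."""
        parent = list(range(n))

        def find(x):                       # union-find with path compression
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        total, components = 0, n
        while components > 1:
            cheapest = {}                  # component root -> best outgoing edge
            for w, u, v in edges:
                ru, rv = find(u), find(v)
                if ru == rv:
                    continue
                for r in (ru, rv):
                    if r not in cheapest or w < cheapest[r][0]:
                        cheapest[r] = (w, ru, rv)
            for w, ru, rv in cheapest.values():
                if find(ru) != find(rv):   # may have been merged earlier this round
                    parent[find(ru)] = find(rv)
                    total += w
                    components -= 1
        return total

    print(boruvka_mst(4, [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 0, 3), (5, 0, 2)]))  # -> 6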

12.
We incorporate a prewrite operation before a write operation in a mobile transaction to improve data availability. A prewrite operation does not update the state of a data object but only makes visible the future value that the data object will have after the final commit of the transaction. Once a transaction reads all the values and declares all the prewrites, it can pre-commit at the mobile host (MH), a computer connected to an unreliable mobile communication network. The remainder of the transaction's execution (the writes on the database) is shifted to the mobile service station (MSS), a computer connected to the reliable fixed network. Writes on the database consume time and resources and are therefore shifted to the MSS and delayed; this reduces wireless network traffic congestion. Since responsibility for the expensive part of the transaction's execution is shifted to the MSS, the approach also reduces computing expenses at the mobile host. A pre-committed transaction's prewrite values are made visible both at mobile and at fixed database servers before the final commit of the transaction, which increases data availability during the frequent disconnections common in mobile computing. Since a pre-committed transaction does not abort, no undo recovery needs to be performed in our model. A mobile host needs to cache only the prewrite values of the data objects, which take less memory, transmission time, and energy, and can be transmitted over low bandwidth. We have analysed the various possible schedules of transactions running concurrently both at mobile and at fixed database servers. We discuss the concurrency control algorithm for our transaction model and prove that concurrent execution under our transaction processing model produces only serializable schedules. Our performance study shows that our model increases throughput and decreases the transaction-abort ratio in comparison to other lock-based schemes. We also briefly discuss recovery issues and the implementation of our model.
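
A minimal sketch of the prewrite idea: after a transaction pre-commits at the mobile host, the announced future value of a data object becomes visible to readers, while the expensive database write is deferred to the MSS. Class and method names are illustrative, not the paper's interfaces, and concurrency control is omitted.

    # Prewrite sketch: the future value is visible before the deferred final write.
    class DataObject:
        def __init__(self, value):
            self.value = value        # committed state in the database
            self.prewrite = None      # future value announced by a pre-committed txn

        def declare_prewrite(self, value):
            self.prewrite = value     # done at the mobile host at pre-commit time

        def read(self):
            # readers may use the announced future value, e.g. during disconnection
            return self.prewrite if self.prewrite is not None else self.value

        def final_commit(self):
            # performed later at the MSS: apply the deferred write to the database
            if self.prewrite is not None:
                self.value, self.prewrite = self.prewrite, None

    x = DataObject(10)
    x.declare_prewrite(42)     # transaction pre-commits at the mobile host
    print(x.read())            # 42 is already visible
    x.final_commit()           # deferred write applied at the fixed server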

13.
This paper introduces a model for parallel computation, called the distributed random-access machine (DRAM), in which the communication requirements of parallel algorithms can be evaluated. A DRAM is an abstraction of a parallel computer in which memory accesses are implemented by routing messages through a communication network. A DRAM explicitly models the congestion of messages across cuts of the network. We introduce the notion of a conservative algorithm as one whose communication requirements at each step can be bounded by the congestion of pointers of the input data structure across cuts of a DRAM. We give a simple lemma that shows how to shortcut pointers in a data structure so that remote processors can communicate without causing undue congestion. We give O(lg n)-step, linear-processor, linear-space, conservative algorithms for a variety of problems on n-node trees, such as computing treewalk numberings, finding the separator of a tree, and evaluating all subexpressions in an expression tree. We give O(lg^2 n)-step, linear-processor, linear-space, conservative algorithms for problems on graphs of size n, including finding a minimum-cost spanning forest, computing biconnected components, and constructing an Eulerian cycle. Most of these algorithms use as a subroutine a generalization of the prefix computation to trees. We show that any such treefix computation can be performed in O(lg n) steps using a conservative variant of Miller and Reif's tree-contraction technique. This research was supported in part by the Defense Advanced Research Projects Agency under Contract N00014-80-C-0622 and by the Office of Naval Research under Contract N00014-86-K-0593. Charles Leiserson is supported in part by an NSF Presidential Young Investigator Award with matching funds provided by AT&T Bell Laboratories and Xerox Corporation. Bruce Maggs is supported in part by an NSF Fellowship.
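
The pointer-shortcutting flavor behind the treefix/prefix computations can be illustrated with a sequential simulation of pointer jumping on a linked list: each node halves its distance to the tail every round, so O(lg n) rounds suffice. This shows the shortcutting idea only; it is not the paper's conservative DRAM algorithm.

    # Sequential simulation of pointer jumping ("shortcutting") for list ranking:
    # every node jumps over its successor each round, all nodes "in parallel".
    def list_ranking(nxt):
        """nxt[i] is i's successor (or i itself at the tail); returns distance to tail."""
        n = len(nxt)
        rank = [0 if nxt[i] == i else 1 for i in range(n)]
        nxt = nxt[:]
        changed = True
        while changed:                       # O(lg n) rounds
            changed = False
            new_rank, new_nxt = rank[:], nxt[:]
            for i in range(n):
                if nxt[i] != nxt[nxt[i]]:
                    changed = True
                new_rank[i] = rank[i] + rank[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]     # shortcut: jump over the successor
            rank, nxt = new_rank, new_nxt
        return rank

    # list 0 -> 1 -> 2 -> 3 (node 3 is the tail)
    print(list_ranking([1, 2, 3, 3]))        # [3, 2, 1, 0]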

14.
We proposed in Ref. 5) a new, message-oriented implementation technique for Moded Flat GHC that compiled unification for data transfer into message passing. The technique was based on constraint-based program analysis and significantly improved the performance of programs that used goals and streams to implement reconfigurable data structures. In this paper we discuss how the technique can be parallelized. We focus on a method for shared-memory multiprocessors, called the shared-goal method, though a different method could be used for distributed-memory multiprocessors. Unlike other parallel implementations of concurrent logic languages, which we call process-oriented, the unit of parallel execution is not an individual goal but a chain of message sends caused successively by an initial message send. Parallelism comes from the existence of different chains of message sends that can be executed independently or in a pipelined manner. Mutual exclusion based on busy waiting and on message buffering controls access to individual, shared goals. Typical goals allow last-send optimization, the message-oriented counterpart of last-call optimization. We have built an experimental implementation on Sequent Symmetry. In spite of the simple scheduling currently adopted, preliminary evaluation shows good parallel speedup and good absolute performance for concurrent operations on binary process trees.

15.
This paper presents an improved analysis of a randomized parallel backtrack search algorithm (RPBS). Our analysis uses the single-node-donation model, in which each donation contains a single tree node. It is shown that with high probability the total number of messages generated by RPBS is O(phd), where p is the number of processors and h and d are the height and degree of the backtrack search tree. Under the assumption of unit-time message delivery, it is shown that with high probability the execution time of RPBS is n/p + O(hd), where n is the number of nodes of the backtrack search tree and the leading term n/p has no constant factor. As a result of its limited communication requirements, RPBS can be efficiently implemented on message-passing or shared-memory multiprocessor systems. A general analysis of a network implementation of RPBS is presented. The concept of total routing time, the sum of the routing times of all messages, is introduced as a measure of communication cost. It is shown that the overall effect of message delay on the execution time of RPBS is small if the total routing time is small. Some experimental data on a shared-memory machine are reported. Received November 23, 1996; revised February 15, 1998.
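
The single-node-donation model can be illustrated with a toy sequential simulation: each processor keeps a stack of unexplored backtrack-tree nodes, and an idle processor receives exactly one node per donation from a busy one. Tree shape, donor selection, and scheduling are illustrative assumptions, not the analyzed algorithm's exact policy.

    # Toy simulation of parallel backtrack search with single-node donations.
    import random

    def children(node, depth, degree=3, height=4):
        return [] if depth == height else [(node, i) for i in range(degree)]

    def rpbs_simulation(p=4, height=4, degree=3, seed=0):
        rng = random.Random(seed)
        stacks = [[] for _ in range(p)]
        stacks[0].append((("root",), 0))          # processor 0 starts with the whole tree
        visited = donations = 0
        while any(stacks):
            for me in range(p):
                if not stacks[me]:                # idle: request a donation
                    donors = [q for q in range(p) if len(stacks[q]) > 1]
                    if donors:
                        stacks[me].append(stacks[rng.choice(donors)].pop(0))
                        donations += 1
                    continue
                node, depth = stacks[me].pop()    # expand one node of my subtree
                visited += 1
                stacks[me].extend((c, depth + 1)
                                  for c in children(node, depth, degree, height))
        return visited, donations

    print(rpbs_simulation())    # (tree nodes visited, number of donations)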

16.
In this paper, we describe a traffic simulation system that analyzes traffic flow in a road network. The system is constructed and operated on the AP1000 parallel computer. A database of the road network is implemented with object-oriented programming techniques: elements of the road network are regarded as objects, and vehicles that move on the lanes are regarded as attributive data. The database is divided into sub-databases that are assigned to the processors, and the system calculates the behavior of the vehicles in parallel. Data are sent and received among the objects with message passing over the communication network. Finally, we show the resulting vehicle behavior and evaluate the parallel efficiency.

17.
In this paper, a robust decentralized congestion control strategy is developed for a large scale network with Differentiated Services (Diff-Serv) traffic. The network is modeled by a nonlinear fluid flow model corresponding to two classes of traffic, namely the premium traffic and the ordinary traffic. The proposed congestion controller does take into account the associated physical network resource limitations and is shown to be robust to the unknown and time-varying delays. Our proposed decentralized congestion control strategy is developed on the basis of Diff-Serv architecture by utilizing a robust adaptive technique. A Linear Matrix Inequality (LMI) condition is obtained to guarantee the ultimate boundedness of the closed-loop system. Numerical simulation implementations are presented by utilizing the QualNet and Matlab software tools to illustrate the effectiveness and capabilities of our proposed decentralized congestion control strategy.

18.
Web caching is used to alleviate network access latency and network congestion, and the cache replacement policy directly affects the cache hit rate. This paper therefore proposes a Web cache replacement policy in which a Naive Bayes (NB) classifier predicts the probability that an object will be revisited. Based on the user's previous access log, a partitioning operation extracts multiple features to represent each accessed object and builds a feature data set. An NB classifier is trained to determine the probability that an object in the cache will be accessed again and to assign weights to the objects, and it is combined with the LRU policy to evict objects sensibly. Simulation results show that the proposed policy maintains a high hit rate while effectively reducing execution time.
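
A minimal sketch of the scoring idea described above: a tiny Naive Bayes classifier estimates an object's re-access probability from discrete features, and the policy evicts the least promising object among the least recently used ones. The feature names, Laplace smoothing, and candidate-set size are assumptions for illustration.

    # Naive Bayes re-access scoring combined with LRU-based victim selection.
    from collections import Counter, defaultdict

    class NaiveBayes:
        def fit(self, samples, labels):            # labels: 1 = re-accessed, 0 = not
            self.prior = Counter(labels)
            self.cond = defaultdict(Counter)       # (feature index, label) -> value counts
            for feats, y in zip(samples, labels):
                for i, v in enumerate(feats):
                    self.cond[(i, y)][v] += 1
            self.n = len(labels)

        def prob_reaccess(self, feats):
            scores = {}
            for y in self.prior:                   # unnormalized P(y) * prod P(f_i | y)
                s = self.prior[y] / self.n
                for i, v in enumerate(feats):
                    s *= (self.cond[(i, y)][v] + 1) / (self.prior[y] + 2)  # Laplace smoothing
                scores[y] = s
            return scores[1] / (scores[0] + scores[1])

    def choose_victim(cache, nb, lru_candidates=3):
        """cache: list of (object id, features), ordered most- to least-recently used."""
        candidates = cache[-lru_candidates:]       # least recently used objects
        return min(candidates, key=lambda item: nb.prob_reaccess(item[1]))[0]

    nb = NaiveBayes()
    nb.fit(samples=[("html", "small"), ("img", "large"), ("html", "large")],
           labels=[1, 0, 1])
    cache = [("a", ("img", "large")), ("b", ("html", "small")), ("c", ("img", "small"))]
    print(choose_victim(cache, nb))    # evicts the least promising LRU candidate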

19.
In this paper, we present a parallel programming and execution model based on a logical ordering of control flows. We show that it is possible to provide a unifying framework consisting of a synchronous programming model, thereby facilitating the mastery of programs, and an asynchronous execution model yielding efficient executions. Our approach is based on a SPMD and task parallel programming language, called –Chan. Communications take place through channels and rely on explicit send/receive instructions. In contrast to classical message passing models, synchronizations and communications are dissociated. We show that it is possible to perform a data-driven automatic translation of sequential and arbitrary DOACROSS loops into –Chan by using nonmatching send/receive instructions. Our parallelization technique allows us to handle irregular control and leads to optimizations of communications in irregular computations.
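
The kind of transformation described above can be sketched as follows: a DOACROSS loop with a carried dependence (a[i] depends on a[i-1]) is split across workers that forward the needed value through an explicit channel (a Queue here), so communication becomes explicit send/receive. The chunking and names are illustrative assumptions, not the –Chan language itself.

    # DOACROSS loop split into chunks; the carried value crosses chunk boundaries
    # through explicit send/receive on channels.
    from queue import Queue
    from threading import Thread

    def doacross_chunk(lo, hi, a, b, recv, send):
        carried = recv.get()                  # value of a[lo-1] from the previous worker
        for i in range(lo, hi):
            a[i] = carried + b[i]             # loop body with a carried dependence
            carried = a[i]
        send.put(carried)                     # forward a[hi-1] to the next worker

    n = 8
    a, b = [0] * n, list(range(n))
    chunks = [(1, 4), (4, 8)]
    chans = [Queue() for _ in range(len(chunks) + 1)]
    chans[0].put(a[0])                        # seed the first chunk with a[0]
    threads = [Thread(target=doacross_chunk,
                      args=(lo, hi, a, b, chans[w], chans[w + 1]))
               for w, (lo, hi) in enumerate(chunks)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(a)                                  # [0, 1, 3, 6, 10, 15, 21, 28]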

20.
肖嵩  吴成柯  周有喜  杜建超 《软件学报》2007,18(11):2882-2892
A robust algorithm for video transmission over wireless networks is proposed that combines source characteristics with network congestion control. Through scene modeling and characteristics analysis, all bitstream layers produced by scalable coding are classified into different types and, according to their different contributions to network congestion control and to reconstructed image quality, assigned to two different queues. The system dynamically adjusts the source rate, the strength of unequal error protection, and the congestion control strategy according to the network packet-loss state (i.e., whether losses are caused by network congestion or by unreliable transmission over the wireless channel). Simulation results show that, compared with MPEG-4 source coding plus fixed-rate Turbo coding, and with a congestion control scheme that dynamically adjusts the source and channel coding rates and selectively drops I, B, and P packets, the proposed method provides better performance.

