Similar Documents
20 similar documents retrieved.
1.
In this paper, we analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort) with the LogP model. LogP characterizes the performance of modern parallel machines with a small set of parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). We develop implementations of these algorithms in Split-C, a parallel extension to C, and compare the performance predicted by LogP to actual performance on a CM-5 of 32 to 512 processors for a range of problem sizes. We evaluate the robustness of the algorithms by varying the distribution and ordering of the key values. We also briefly examine the sensitivity of the algorithms to the communication parameters. We show that the LogP model is a valuable guide in the development of parallel algorithms and a good predictor of implementation performance. The model encourages the use of data layouts that minimize communication and balanced communication schedules that avoid contention. With an empirical model of local processor performance, LogP predictions closely match observed execution times on uniformly distributed keys across a broad range of problem and machine sizes. We find that communication performance is oblivious to the distribution of the key values, whereas the local processor performance is not; some communication phases are sensitive to the ordering of keys due to contention. Finally, our analysis shows that overhead is the most critical communication parameter in the sorting algorithms.
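To make the role of the four LogP parameters concrete, the following is a minimal C sketch (an illustration only, not the paper's Split-C code) of how a balanced communication phase of a sort might be costed under LogP; the structure name, the parameter values, and the all-to-all phase model are assumptions.

#include <stdio.h>

typedef struct {
    double L;   /* latency: time for a small message to cross the network */
    double o;   /* overhead: CPU time spent sending or receiving one message */
    double g;   /* gap: minimum interval between message injections (inverse bandwidth) */
    int    P;   /* number of processors */
} LogP;

/* Predicted time for each processor to send one small message to every other
 * processor (a balanced all-to-all): the sender pays max(o, g) per message,
 * and the last message still has to cross the network and be received. */
static double alltoall_cost(const LogP *m, int msgs_per_pair)
{
    double per_msg = (m->g > m->o) ? m->g : m->o;   /* injection-rate limit */
    long   nmsgs   = (long)(m->P - 1) * msgs_per_pair;
    return nmsgs * per_msg + m->L + m->o;
}

int main(void)
{
    LogP machine = { 6.0, 2.2, 4.0, 64 };   /* made-up microsecond values */
    printf("predicted phase time: %.1f us\n", alltoall_cost(&machine, 1));
    return 0;
}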

2.
Advances in MapReduce-Based Big Data Processing Platforms and Algorithms (cited 1 time: 1 self-citation, 0 by others)
This paper surveys recent advances in big data processing platforms and algorithms based on the MapReduce programming model. It first introduces twelve representative MapReduce-based big data processing platforms, compares their implementation principles and applicable scenarios, and abstracts what they have in common. It then reviews MapReduce-based big data analysis algorithms, including search, data cleaning/transformation, aggregation, join, sorting, preference query, optimization, graph, and data mining algorithms, classifies them by how they are realized in MapReduce, and analyzes the factors that affect their performance. Finally, big data processing algorithms are abstracted as external-memory algorithms; their characteristics are summarized, and general research directions and open questions for optimizing external-memory algorithm performance are proposed for reference, specifically optimizing the disk I/O of external-memory algorithms, optimizing their locality, and designing incremental iterative algorithms. Existing research on big data platforms and algorithms concentrates on dynamic platform performance optimization through resource allocation and task scheduling, parallelization of specific algorithms, and performance tuning of specific algorithms. The external-memory algorithm optimization proposed here is a static optimization approach that complements existing work and leaves researchers ample room for further study.

3.
It is widely acknowledged that improving parallel I/O performance is critical for widespread adoption of high performance computing. In this paper, we show that communication in out-of-core distributed memory problems may require both interprocessor communication and file I/O. Thus, in order to improve I/O performance, it is necessary to minimize the I/O costs associated with a communication step. We present three methods for performing communication in out-of-core distributed memory problems. The first method, called the generalized collective communication method, follows a loosely synchronous model; computation and communication phases are clearly separated, and communication requires permutation of data in files. The second method, called the receiver-driven in-core communication, communicates only the in-core data. The third method, called the owner-driven in-core communication, goes one step further and tries to identify the potential future use of data (by the recipients) while it is in the sender's memory. We provide performance results for two out-of-core applications: a two-dimensional FFT code and a two-dimensional elliptic Jacobi solver.

4.
Parallel Computing, 2014, 40(10): 754-767
The processing of massive amounts of data on clusters with a finite amount of memory has become an important problem facing the parallel/distributed computing community. While MapReduce-style technologies provide an effective means for addressing various problems that fit within the MapReduce paradigm, there are many classes of problems for which this paradigm is ill-suited. In this paper we present a runtime system for traditional MPI programs that enables the efficient and transparent out-of-core execution of distributed-memory parallel programs. This system, called BDMPI, leverages the semantics of MPI's API to orchestrate the execution of a large number of MPI processes on far fewer compute nodes, so that the running processes maximize the amount of computation that they perform with the data fetched from disk. BDMPI enables the development of efficient out-of-core parallel distributed-memory codes without the high engineering and algorithmic complexities associated with multiple levels of blocking. BDMPI achieves significantly better performance than existing technologies on a single node as well as on a small cluster, and performs within 30% of optimized out-of-core implementations.

5.
In recent years, many researchers have investigated optical interconnections for parallel computing. Optical interconnections are attractive due to their high bandwidth and concurrent access to the bus in a pipelined fashion. The Linear Array with Reconfigurable Pipelined Bus System (LARPBS) model is a powerful optical bus system that combines the advantages of both optical buses and reconfiguration. To increase the scalability of the LARPBS model, we propose a two-dimensional extension: a simplified two-dimensional Array with Reconfigurable Pipelined Bus System (2D ARPBS). While achieving better scalability, we show the effectiveness of this newly proposed model by designing two novel optimal sorting algorithms on it. The first sorting algorithm is an extension of Leighton's seven-phase columnsort algorithm that eliminates the restriction of sorting only an r × s array, where r ≥ s^2, and sorts an n × n array in O(log n) time. The second is an optimal multiway mergesort algorithm that uses a novel processor-efficient two-way mergesort algorithm and a novel multiway merge scheme to sort n^2 items in O(log n) time. Using an optimal sorting algorithm, Pipelined Mergesort, designed for the LARPBS model as a building block, we extend our research on parallel sorting on the LARPBS to the more scalable 2D ARPBS model and achieve optimality in both sorting algorithms.

6.
Several I/O Optimization Methods for Out-of-Core Computing (cited 1 time: 0 self-citations, 1 by others)
Applications with very large data sets introduce the out-of-core computation model. Because access to data on disk is slow, I/O becomes a major limiting factor for out-of-core performance. This paper proposes a runtime-library approach to I/O optimization and presents three effective optimization strategies: regular-region filtering, data prefetching, and boundary reuse. Programmers can apply the corresponding optimization APIs to different application problems to shorten execution time. Experimental results show that, by reducing the number of I/O operations and the volume of data exchanged between memory and disk, and by hiding part of the I/O latency, these techniques effectively improve out-of-core performance.
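As a rough illustration of the data-prefetching strategy mentioned above (not the paper's runtime library, whose API is not shown here), the following C sketch hints to the operating system that it should read the next file block while the current one is being processed; the block size and file handling are assumptions.

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_BYTES (8 * 1024 * 1024)   /* assumed in-core block size */

static void process_block(const char *buf, ssize_t len) { (void)buf; (void)len; }

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <datafile>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(BLOCK_BYTES);
    off_t offset = 0;
    ssize_t got;
    while ((got = pread(fd, buf, BLOCK_BYTES, offset)) > 0) {
        /* Prefetch: ask the kernel to start reading the next block now,
         * so the disk read overlaps with computation on the current block. */
        posix_fadvise(fd, offset + got, BLOCK_BYTES, POSIX_FADV_WILLNEED);
        process_block(buf, got);
        offset += got;
    }
    free(buf);
    close(fd);
    return 0;
}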

7.
Several fast and space-optimal sequential and parallel algorithms for solving the satisfaction problem of functional and multivalued dependencies (FDs and MVDs) are presented. Two frameworks to verify an MVD for a relation, and their implementation by exploiting existing fast space-optimal sorting techniques, are described. Space optimality means that only a constant amount of extra memory space is needed for the sequential implementations, and O(M) extra memory space for parallel algorithms that use M processors. This feature makes the algorithms attractive whenever space is a critical resource and I/O transfers should be reduced to a minimum, as is often the case for relational database systems. The time requirements for in-place FD and MVD verification are given in terms of M and of N, the number of tuples in a relation. The effect of relation modification on FD and MVD verification is also examined.

8.
9.
In this paper, we present a framework for synthesizing I/O-efficient out-of-core programs for block recursive algorithms, such as the fast Fourier transform (FFT) and block matrix transposition algorithms. Our framework uses an algebraic representation based on tensor products and other matrix operations. The programs are optimized for the striped Vitter and Shriver two-level memory model, in which data can be distributed using various cyclic(B) distributions, in contrast to the normally used physical track distribution cyclic(B_d), where B_d is the physical disk block size. We first introduce tensor bases to capture the semantics of block-cyclic data distributions of out-of-core data as well as data access patterns to out-of-core data. We then present program generation techniques for tensor products and matrix transposition. We accurately represent the number of parallel I/O operations required by the synthesized programs for tensor products and matrix transposition as a function of the tensor bases and data distributions. We introduce an algorithm to determine the data distribution that optimizes the performance of the synthesized programs. Further, we formalize the procedure of synthesizing efficient out-of-core programs for tensor product formulas with various block-cyclic distributions as a dynamic programming problem. We demonstrate the effectiveness of our approach through several examples. We show that the choice of an appropriate data distribution can reduce the number of passes over out-of-core data by as much as a factor of eight for a single tensor product, and that the dynamic programming approach can greatly reduce the number of passes for the overall tensor product formulas.

10.
In this paper, we present efficient algorithms for sorting, selection, and packet routing on the AROB (Array with Reconfigurable Optical Buses) model. One of our sorting algorithms sorts n general keys in O(1) time on an AROB of size n^ε × n for any constant ε > 0. We also show that selection from n elements can be done in randomized O(1) time employing n processors. Our routing algorithm can route any h-relation in randomized O(h) time. All these algorithms are clearly optimal.

11.
This paper presents a unified framework that optimizes out-of-core programs by exploiting locality and parallelism and reducing communication overhead. For out-of-core problems, where the data set sizes far exceed the size of the available in-core memory, it is particularly important to exploit the memory hierarchy by optimizing the I/O accesses. We present algorithms that consider both iteration space (loop) and data space (file layout) transformations in a unified framework. We show that the performance of an out-of-core loop nest containing references to out-of-core arrays can be improved by using a suitable combination of file layout choices and loop restructuring transformations. Our approach considers array references one by one and attempts to optimize each reference for parallelism and locality. When there are references for which the parallelism optimizations do not work, communication is vectorized so that data transfer can be performed before the innermost loop. Results from hand-compiled codes on the IBM SP-2 and Intel Paragon distributed-memory message-passing architectures show that this approach reduces execution times and improves overall speedups. In addition, we extend the base algorithm to work with file layout constraints and show how it is useful for optimizing programs that consist of multiple loop nests.
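As an illustration of the kind of loop restructuring such a framework applies (a hand-written sketch under assumed file names and sizes, not output of the compiler described above), the C code below tiles an out-of-core loop nest so that each slab of a row-major matrix file is read once and then fully processed in core:

#include <stdio.h>
#include <stdlib.h>

#define N    8192L         /* logical matrix dimension (assumed) */
#define TILE 256L          /* number of rows that fit in core at once (assumed) */

int main(void)
{
    FILE *f = fopen("matrix.bin", "rb");   /* hypothetical out-of-core array */
    if (!f) { perror("fopen"); return 1; }

    double *slab = malloc((size_t)(TILE * N) * sizeof *slab);
    double sum = 0.0;

    for (long i0 = 0; i0 < N; i0 += TILE) {          /* out-of-core (file) loop */
        long rows = (i0 + TILE <= N) ? TILE : N - i0;
        if (fread(slab, sizeof *slab, (size_t)(rows * N), f) != (size_t)(rows * N))
            break;                                   /* short read: stop */
        for (long i = 0; i < rows; i++)              /* in-core loops over the slab */
            for (long j = 0; j < N; j++)
                sum += slab[i * N + j];
    }
    printf("sum = %f\n", sum);
    free(slab);
    fclose(f);
    return 0;
}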

12.
Many routing problems in parallel processing, such as concentration and permutation problems, can be cast as sorting problems. In this paper, we consider the problem of sorting on a new model, called an adaptive sorting network. We show that any sequence of n bits can be sorted on this model in O(lg^2 n) bit-level delay using O(n) constant-fanin gates. This improves the cost complexity of K.E. Batcher's binary sorters (1968) by a factor of O(lg^2 n) while matching their sorting time. The only other network that can sort binary sequences at O(n) cost is the network version of the columnsort algorithm, but it requires excessive pipelining. In addition, using binary sorters, we construct permutation networks with O(n lg n) bit-level cost and O(lg^3 n) bit-level delay. These results provide the asymptotically least-cost practical concentrators and permutation networks to date. We note, of course, that the well-known AKS sorting network has O(lg n) sorting time and O(n lg n) cost, but the constants hidden in these complexities are so large that our complexities outperform those of the AKS sorting network until n becomes extremely large.
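To illustrate why constant-fanin gates suffice for sorting bits (a generic observation, not the paper's construction): on single-bit keys a comparator degenerates to an AND gate, which outputs the smaller bit, and an OR gate, which outputs the larger, so a comparator network on bits becomes a bounded-fanin Boolean circuit. A minimal C sketch:

#include <stdio.h>

/* Compare-exchange on two bits: AND yields the minimum, OR yields the maximum. */
static void compare_exchange_bit(int *lo, int *hi)
{
    int a = *lo, b = *hi;
    *lo = a & b;
    *hi = a | b;
}

int main(void)
{
    int x = 1, y = 0;
    compare_exchange_bit(&x, &y);
    printf("%d %d\n", x, y);   /* prints "0 1": zeros come before ones */
    return 0;
}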

13.
With the popularity of parallel database machines based on the shared-nothing architecture, it has become important to find external sorting algorithms that lead to a load-balanced computation, i.e., balanced execution, communication, and output. If each processor is equally loaded during the course of the sorting algorithm, parallelism is fully exploited. Similarly, balanced communication will not congest the network. Since sorting can be used to support a number of other relational operations (joins, duplicate elimination, building indexes, etc.), data skew produced by sorting can further lead to execution skew at later stages of these operations. In this paper we present a load-balanced parallel sorting algorithm for shared-nothing architectures. It is a multiple-input multiple-output algorithm with four stages, based on a generalization of Batcher's odd-even merge. At each stage the n keys are evenly distributed among the p processors (i.e., there is no final sequential merge phase), and the distribution of keys between stages guards against network congestion. No assumption is made on the key distribution, and the algorithm performs equally well in the presence of duplicate keys. Hence our approach always guarantees its performance as long as n is greater than p^3, which is the case of interest for sorting large relations. In addition, processors can be added incrementally. Recommended by: Patrick Valduriez
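For reference, the following is a sequential C sketch of Batcher's classic odd-even merge for a power-of-two span (the well-known recursive formulation, not the paper's multiple-input multiple-output generalization); it shows the comparator pattern that the four-stage parallel algorithm distributes across the p shared-nothing processors:

#include <stdio.h>

static void cmp_swap(int *a, int i, int j)
{
    if (a[i] > a[j]) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}

/* Merge the two sorted halves of a[lo .. lo+n-1]; r is the distance between
 * compared elements, and n must be a power of two. */
static void odd_even_merge(int *a, int lo, int n, int r)
{
    int m = r * 2;
    if (m < n) {
        odd_even_merge(a, lo, n, m);        /* even subsequence */
        odd_even_merge(a, lo + r, n, m);    /* odd subsequence  */
        for (int i = lo + r; i + r < lo + n; i += m)
            cmp_swap(a, i, i + r);
    } else {
        cmp_swap(a, lo, lo + r);
    }
}

int main(void)
{
    int a[8] = { 2, 4, 6, 7, 1, 3, 5, 8 };  /* two sorted runs of length 4 */
    odd_even_merge(a, 0, 8, 1);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);
    printf("\n");                            /* prints 1 2 3 4 5 6 7 8 */
    return 0;
}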

14.
Sorting networks of fixed I/O size p have been used, thus far, for sorting a set of p elements. Somewhat surprisingly, the important problem of using such a sorting network for sorting arbitrarily large data sets has not been addressed in the literature. Our main contribution is to propose a simple sorting architecture whose main feature is the pipelined use of a sorting network of fixed I/O size p to sort an arbitrarily large data set of N elements. A noteworthy feature of our design is that no extra data memory space is required other than what is used for storing the input. As it turns out, our architecture is feasible for VLSI implementation, and its time performance is virtually independent of the cost and depth of the underlying sorting network. Specifically, we show that with our design N elements can be sorted in Θ(N/p log N/p) time without memory access conflicts. Finally, we show how to use an AT²-optimal sorting network of fixed I/O size p to construct a similar architecture that sorts N elements in Θ(N/p log N/p log p) time.

15.
Parallel Algorithms for Discovery of Association Rules (cited 2 times: 0 self-citations, 2 by others)
Discovery of association rules is an important data mining task. Several parallel and sequential algorithms have been proposed in the literature to solve this problem. Almost all of these algorithms make repeated passes over the database to determine the set of frequent itemsets (a subset of database items), thus incurring high I/O overhead. In the parallel case, most algorithms perform a sum-reduction at the end of each pass to construct the global counts, also incurring high synchronization cost. In this paper we describe new parallel association mining algorithms. The algorithms use novel itemset clustering techniques to approximate the set of potentially maximal frequent itemsets. Once this set has been identified, the algorithms make use of efficient traversal techniques to generate the frequent itemsets contained in each cluster. We propose two clustering schemes based on equivalence classes and maximal hypergraph cliques, and study two lattice traversal techniques based on bottom-up and hybrid search. We use a vertical database layout to cluster related transactions together. The database is also selectively replicated so that the portion of the database needed for the computation of associations is local to each processor. After the initial set-up phase, the algorithms do not need any further communication or synchronization. The algorithms minimize I/O overheads by scanning the local database portion only twice: once in the set-up phase, and once when processing the itemset clusters. Unlike previous parallel approaches, the algorithms use simple intersection operations to compute frequent itemsets and do not have to maintain or search complex hash structures. Our experimental testbed is a 32-processor DEC Alpha cluster interconnected by the Memory Channel network. We present results on the performance of our algorithms on various databases and compare them against a well-known parallel algorithm. The best new algorithm outperforms it by an order of magnitude.
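The "simple intersection operations" mentioned above refer to intersecting per-item transaction-ID lists in the vertical layout; the C sketch below (with made-up item names and data, not the paper's code) shows how the support of a 2-itemset is obtained from two sorted tid-lists:

#include <stdio.h>
#include <stddef.h>

/* Support of the itemset {A, B} is the size of the intersection of the
 * sorted transaction-ID lists of A and B. */
static size_t tidlist_intersect(const int *a, size_t na,
                                const int *b, size_t nb)
{
    size_t i = 0, j = 0, count = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])      i++;
        else if (a[i] > b[j]) j++;
        else { count++; i++; j++; }   /* transaction contains both items */
    }
    return count;
}

int main(void)
{
    int tids_A[] = { 1, 3, 4, 7, 9, 12 };   /* transactions containing A */
    int tids_B[] = { 2, 3, 7, 8, 12 };      /* transactions containing B */
    size_t support = tidlist_intersect(tids_A, 6, tids_B, 5);
    printf("support({A,B}) = %zu\n", support);   /* transactions 3, 7, 12 -> 3 */
    return 0;
}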

16.
We generalize the well-known odd-even merge sorting algorithm, originally due to Batcher (1968), and show how this generalized algorithm can be applied to sorting on product networks. If G is an arbitrary factor graph with N nodes, its r-dimensional product contains N^r nodes. Our algorithm sorts N^r keys stored in the r-dimensional product of G in O(r^2 F(N)) time, where F(N) depends on G. We show that, for any factor graph G, F(N) is at most O(N), establishing an upper bound of O(r^2 N) for the time complexity of sorting N^r keys on any product network. For product networks with bounded r (e.g., grids), this leads to an asymptotic complexity of O(N) to sort N^r keys, which is optimal for several instances of product networks. There are factor graphs for which F(N) = O(log^2 N), which leads to an asymptotic running time of O(log^2 N) to sort N^r keys. For networks with bounded N (e.g., the hypercube, where N = 2 is fixed), the asymptotic complexity becomes O(r^2). We show how to apply the algorithm to several well-known product networks, as well as to others introduced recently. We compare the performance of our algorithm to well-known algorithms developed specifically for these networks, as well as to others. The results of these comparisons led us to conjecture that the proposed algorithm is probably the best deterministic algorithm that can be found in terms of low asymptotic complexity with a small constant.

17.
This paper presents bitonic sorting schemes for special-purpose parallel architectures, such as sorting networks, and for general-purpose parallel architectures, such as SIMD and/or MIMD computers. First, bitonic sorting algorithms for shared-memory SIMD and/or MIMD computers are developed. Shared-memory accesses through the interconnection network of such computers can be very time consuming. A scheme is introduced that reduces the number of such accesses. This scheme is based on the parity strategy, which is the main idea of the paper. By reducing the communication through the network, a performance improvement is achieved. Second, a recirculating bitonic sorting network is presented, composed of one level of N/2 comparators plus an Ω-network of (log N - 1) switch levels. This network reduces the cost complexity to O(N log N), compared with the O(N log^2 N) of the original bitonic sorting network, while preserving the same time complexity. Finally, a simplified multistage bitonic sorting network is presented. To simplify the interlevel wiring, the parity strategy is used, so N/2 keys are wired straight through the network.
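For context, a sequential C sketch of the standard bitonic sort on a power-of-two number of keys is given below; it shows only the comparator pattern that the networks above build on, and none of the parity strategy or wiring simplifications described in the abstract:

#include <stdio.h>
#include <stdbool.h>

static void cmp_exchange(int *a, int i, int j, bool ascending)
{
    if ((a[i] > a[j]) == ascending) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}

/* Turn a bitonic sequence a[lo .. lo+n-1] into a sorted one. */
static void bitonic_merge(int *a, int lo, int n, bool ascending)
{
    if (n > 1) {
        int m = n / 2;
        for (int i = lo; i < lo + m; i++)
            cmp_exchange(a, i, i + m, ascending);
        bitonic_merge(a, lo, m, ascending);
        bitonic_merge(a, lo + m, m, ascending);
    }
}

static void bitonic_sort(int *a, int lo, int n, bool ascending)
{
    if (n > 1) {
        int m = n / 2;
        bitonic_sort(a, lo, m, true);        /* first half ascending   */
        bitonic_sort(a, lo + m, m, false);   /* second half descending */
        bitonic_merge(a, lo, n, ascending);  /* bitonic sequence -> sorted */
    }
}

int main(void)
{
    int a[8] = { 5, 1, 8, 2, 7, 3, 6, 4 };
    bitonic_sort(a, 0, 8, true);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);
    printf("\n");                            /* prints 1 2 3 4 5 6 7 8 */
    return 0;
}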

18.
VI-attached database storage (cited 1 time: 0 self-citations, 1 by others)
This work presents a VI-attached database storage architecture to improve database transaction rates. More specifically, we examine how VI-based interconnects can be used to improve I/O path performance between a database server and a storage subsystem. To facilitate the interaction between client applications and a VI-aware storage system, we design and implement a software layer called DSA, which is layered between applications and VI. DSA takes advantage of specific VI features and deals with many of its shortcomings. We provide and evaluate one kernel-level and two user-level implementations of DSA. These implementations trade transparency and generality for performance to different degrees and, unlike research prototypes, are designed to be suitable for real-world deployment. We have also investigated many design trade-offs in the storage cluster. We present detailed measurements using a commercial database management system with both microbenchmarks and industrial database workloads on a mid-size, 4-CPU, and a large, 32-CPU, database server. We also compare the effectiveness of VI-attached storage with an iSCSI configuration and conclude that storage protocols implemented using DSA over VI have significant performance advantages. More generally, our results show that VI-based interconnects and user-level communication can improve all aspects of the I/O path between the database system and the storage back-end. We also find that to make effective use of VI in I/O-intensive environments, we need to provide substantially more functionality than VI currently offers. Finally, new storage APIs that help minimize kernel involvement in the I/O path are needed to fully exploit the benefits of VI-based communication.

19.
For typical pixel-level remote sensing image fusion algorithms such as IHS, HPF, and DWT, this paper proposes and implements corresponding data-parallel fusion algorithms, P-IHS, P-HPF, and P-DWT, and applies communication and I/O optimizations based on an analysis of the algorithms' time and space complexity. Test results on IKONOS satellite imagery on a cluster system show that the proposed parallel algorithms achieve good speedup and high parallel efficiency. These three classes of algorithms are well suited to remote sensing applications with demanding real-time requirements.

20.
Implementation and Application of an Itanium 2-Based Cluster System (cited 2 times: 0 self-citations, 2 by others)
This paper designs and implements a cluster computing system based on the Itanium 2 processor. Drawing on the characteristics of the Itanium 2 processor and the cluster system, a parallel meteorological application was optimized for I/O, communication, computation cost, and the cache utilization of communication data, so as to exploit the cluster system's strengths and avoid its weaknesses. Test results show that the cluster system is well suited to the efficient parallel execution of meteorological application software.
