Similar Literature
20 similar documents found (search time: 31 ms)
1.
In this paper we present a simple parallel sorting algorithm and illustrate its application in general sorting, disk sorting, and hypercube sorting. The algorithm, called the (l,m)-mergesort (LMM), is an extension of the bitonic and odd-even mergesorts. Literature on parallel sorting is abundant, but many of the proposed algorithms, though theoretically important, may not perform satisfactorily in practice owing to large constants in their time bounds. The algorithm presented in this paper has the potential to be practical. We present an application to the parallel disk sorting problem, for which the algorithm is asymptotically optimal (assuming that N is a polynomial in M, where N is the number of records to be sorted and M is the internal memory size), and the underlying constant is very small. This algorithm performs better than the disk-striped mergesort (DSM) algorithm when the number of disks is large, and our implementation is as simple as that of DSM (requiring no fancy data structures or prefetch techniques). As a second application, we show how to obtain a sparse enumeration sort on the hypercube that is simpler than the classical algorithm of Nassimi and Sahni [16]. We also show that Leighton's columnsort algorithm is a special case of LMM. Online publication December 26, 2000.
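For orientation, here is a minimal Python sketch of Batcher's odd-even merge, one of the two merging primitives the abstract says LMM extends (this is not the (l,m)-merge itself, which the abstract does not spell out; the input length is assumed to be a power of two with both halves sorted):

```python
def odd_even_merge(a):
    """Batcher's odd-even merge: sorts a list whose two halves are each
    sorted; the length must be a power of two."""
    n = len(a)
    if n <= 1:
        return list(a)
    if n == 2:
        return [min(a), max(a)]
    # Recursively merge the even- and odd-indexed subsequences (whose
    # halves are again sorted), then fix up adjacent pairs.
    evens = odd_even_merge(a[0::2])
    odds = odd_even_merge(a[1::2])
    merged = [None] * n
    merged[0::2] = evens
    merged[1::2] = odds
    for i in range(1, n - 1, 2):
        if merged[i] > merged[i + 1]:
            merged[i], merged[i + 1] = merged[i + 1], merged[i]
    return merged

print(odd_even_merge([1, 4, 6, 7, 2, 3, 5, 8]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```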

2.
External mergesort is normally implemented so that each run is stored contiguously on disk and blocks of data are read exactly in the order they are needed during merging. We investigate two ideas for improving the performance of external mergesort: interleaved layout and a new reading strategy. Interleaved layout places blocks from different runs in consecutive disk addresses, in the hope that interleaving will reduce seek overhead during merging. The new reading strategy precomputes the order in which data blocks are to be read, according to where they are located on disk and when they are needed for merging. Extra buffer space makes it possible to read blocks in an order that reduces seek overhead, instead of reading them exactly in the order they are needed for merging. A detailed simulation model was used to compare the two layout strategies and three reading strategies. The effects of using multiple work disks were also investigated. We found that, in most cases, interleaved layout does not improve performance, but that the new reading strategy consistently performs better than double buffering and forecasting.
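The abstract describes the new reading strategy only at a high level; the sketch below is a hedged greedy interpretation of the core idea -- with extra buffer space, fetch whichever soon-needed block is nearest the disk head instead of the next-needed one. The function and parameter names are illustrative, and a real planner must also guarantee that no merge input buffer runs dry:

```python
def plan_reads(consumption_order, disk_addr, window):
    """Among the next `window` blocks the merge will consume, always fetch
    the one whose disk address is closest to the current head position.
    Block ids are assumed unique."""
    pending = list(consumption_order)
    schedule, head = [], 0
    while pending:
        candidates = pending[:window]    # blocks we may safely prefetch now
        nxt = min(candidates, key=lambda b: abs(disk_addr[b] - head))
        schedule.append(nxt)
        head = disk_addr[nxt]
        pending.remove(nxt)
    return schedule

order = ['a', 'b', 'c', 'd']                 # order needed by the merge
addr = {'a': 90, 'b': 10, 'c': 20, 'd': 95}  # disk addresses
print(plan_reads(order, addr, window=2))     # ['b', 'c', 'a', 'd']
```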

3.
We study the performance of various run placement policies on disks for the merge phase of concurrent mergesorts using parallel prefetching. The initial sorted runs (input) of a merge and its final sorted run (output) are stored on multiple disks but each run resides only on a single disk. In this paper, we examine through detailed simulations three different run placement policies and the impact of buffer thrashing. The results show that, with buffer thrashing avoidance, the best performance can be achieved by a run placement policy that uses a proper subset of the disks dedicated for writing the output runs while the rest of the disks are used for prefetching the input runs in parallel. However, the proper number of write disks is workload dependent, and if not carefully chosen, it can adversely affect the system performance. In practice, a reasonably good performance can be achieved by a run placement policy that does not place the output run of a merge on any of the disks that store its own input runs but allows the output run to share the same disk with some of the input runs of other merges.
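A minimal sketch of the practical placement rule in the last sentence; choose_output_disk, load, and the tie-breaking by load are illustrative choices, not the paper's exact policy:

```python
def choose_output_disk(input_disks, num_disks, load):
    """Pick the least-loaded disk that holds none of this merge's own input
    runs; runs of *other* merges may already live there. `load` maps
    disk -> number of runs currently placed on it."""
    eligible = [d for d in range(num_disks) if d not in set(input_disks)]
    if not eligible:        # degenerate case: every disk holds an input run
        eligible = list(range(num_disks))
    return min(eligible, key=lambda d: load.get(d, 0))

# Merge reads runs from disks 0 and 2 out of 4 disks; disk 1 is busier.
print(choose_output_disk([0, 2], 4, {0: 2, 1: 3, 2: 1, 3: 1}))  # 3
```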

4.
Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both single-instruction multiple-data (SIMD) instructions and thread-level parallelism. In this paper, we propose a new high-performance sorting algorithm, called aligned-access sort (AA-sort), that exploits both the SIMD instructions and thread-level parallelism available on today's multicore processors. Our algorithm consists of two phases, an in-core sorting phase and an out-of-core merging phase. The in-core sorting phase uses our new sorting algorithm that extends combsort to exploit SIMD instructions. The out-of-core algorithm is based on mergesort with our novel vectorized merging algorithm. Both phases can take advantage of SIMD instructions. The key to high performance is eliminating unaligned memory accesses that would reduce the effectiveness of SIMD instructions in both phases. We implemented and evaluated the AA-sort on PowerPC 970MP and Cell Broadband Engine platforms. In summary, a sequential version of the AA-sort using SIMD instructions outperformed IBM's optimized sequential sorting library by 1.8 times and bitonic mergesort using SIMD instructions by 3.3 times on PowerPC 970MP when sorting 32 million random 32-bit integers. Also, a parallel version of AA-sort demonstrated better scalability with increasing numbers of cores than a parallel version of bitonic mergesort on both platforms. Copyright © 2011 John Wiley & Sons, Ltd.
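For reference, a scalar Python sketch of combsort, the algorithm the in-core phase extends; the fixed-stride compare-exchanges are what map onto SIMD compare and select operations (the SIMD and multicore parts of AA-sort are not shown here):

```python
def combsort(a):
    """Scalar combsort: compare-exchange elements `gap` apart, shrinking the
    gap by a factor of ~1.3 each pass until the list is sorted. The fixed
    stride makes each pass a sequence of independent compare-exchanges."""
    gap, swapped = len(a), True
    while gap > 1 or swapped:
        gap = max(1, int(gap / 1.3))
        swapped = False
        for i in range(len(a) - gap):
            if a[i] > a[i + gap]:
                a[i], a[i + gap] = a[i + gap], a[i]
                swapped = True
    return a

print(combsort([5, 1, 4, 2, 8, 0, 2]))  # [0, 1, 2, 2, 4, 5, 8]
```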

5.
This paper studies workfile disk management for concurrent mergesorts in a multiprocessor database system. Specifically, we examine the impact of workfile disk allocation and data striping on the average mergesort response time. Concurrent mergesorts in a multiprocessor system can create severe I/O interference, in which a large number of sequential write requests are continuously issued to the same workfile disk and block other read requests for a long period of time. We examine through detailed simulations a logical partitioning approach to workfile disk management and evaluate the effectiveness of data striping. The results show that (1) without data striping, the best performance is achieved by using all the workfile disks as a single partition if workfile disks are abundant (or the system workload is light); (2) if workfile disks are limited (or the system workload is heavy), however, the disks should be partitioned into multiple groups, and the optimal partition size is workload dependent; (3) data striping is beneficial only if the striping unit size is properly chosen; and (4) with a proper striping size, the best performance is generally achieved by using all the disks as a single logical partition.

6.
External sorting: run formation revisited
External mergesort begins with a run formation phase creating the initial sorted runs. Run formation can be done by a load-sort-store algorithm or by replacement selection. A load-sort-store algorithm repeatedly fills available memory with input records, sorts them, and writes the result to a run file. Replacement selection produces longer runs than load-sort-store algorithms and completely overlaps sorting and I/O, but it has poor locality of reference, resulting in frequent cache misses, and the classical algorithm works only for fixed-length records. This paper introduces batched replacement selection: a cache-conscious version of replacement selection that also works for variable-length records. The new algorithm resembles AlphaSort in the sense that it creates small in-memory runs and merges them to form the output runs. Its performance is experimentally compared with three other run formation algorithms: classical replacement selection, Quicksort, and AlphaSort. The experiments show that batched replacement selection is considerably faster than classical replacement selection. For small records (average 100 bytes), CPU time was reduced by about 50 percent and elapsed time by 47-63 percent. It was also consistently faster than Quicksort, but it did not always outperform AlphaSort. Replacement selection produces fewer runs than Quicksort and AlphaSort. The experiments confirmed that this reduces the merge time, whereas the effect on the overall sort time depends on the number of disks available.
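A sketch of classical replacement selection, the baseline that batched replacement selection improves on (illustrative Python: fixed-size records, runs collected in memory rather than streamed to run files):

```python
import heapq
from itertools import islice

def replacement_selection(records, memory_size):
    """Keep `memory_size` records in a min-heap keyed by (run, record).
    On random input, runs average about twice the memory size."""
    it = iter(records)
    heap = [(0, r) for r in islice(it, memory_size)]
    heapq.heapify(heap)
    runs = []
    while heap:
        run, r = heapq.heappop(heap)
        if len(runs) == run:          # first record of a new run
            runs.append([])
        runs[run].append(r)
        nxt = next(it, None)
        if nxt is not None:
            # A record smaller than the last output cannot extend the
            # current run, so it is deferred to the next run.
            heapq.heappush(heap, (run + 1 if nxt < r else run, nxt))
    return runs

print(replacement_selection([5, 3, 8, 1, 9, 2, 7, 4, 6], memory_size=3))
# [[3, 5, 8, 9], [1, 2, 4, 6, 7]]
```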

7.
The computational complexity of a parallel algorithm depends critically on the model of computation. We describe a simple and elegant rule-based model of computation in which processors apply rules asynchronously to pairs of objects from a global object space. Application of a rule to a pair of objects results in the creation of a new object if the objects satisfy the guard of the rule. The model can be efficiently implemented as a novel MIMD array processor architecture, the Intersecting Broadcast Machine. For this model of computation, we describe an efficient parallel sorting algorithm based on mergesort. The computational complexity of the sorting algorithm is O(n log^2 n), comparable to that for specialized sorting networks and an improvement on the O(n^1.5) complexity of conventional mesh-connected array processors.

8.
Salzberg, Betty. Acta Informatica, 1989, 27(3):195-215
Summary. External sorting is usually accomplished by first creating sorted runs, then merging the runs. In the merge phase, writing and calculating can be overlapped by reading if two input buffers are used for each sorted run. If the memory is very large, the input buffers will be large, and using two input buffers per sorted run will be more efficient than using only one input buffer per run and risking reduced overlap of reading and writing. In many cases, merging time can be cut in half. We derive a formula for estimating the total merging time for a given memory size, file size, number of merging passes, and disk drive. We present an extreme example where, in spite of having two buffers per run, significant non-overlap occurs. However, for realistic problems, we show that making one merge pass with two input buffers per run is near optimal. This contradicts earlier results on merging which do not take large memory into account.
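The trade-off can be sanity-checked with back-of-the-envelope arithmetic; this is not the paper's formula, which also models the characteristics of a given disk drive, and the numbers below are illustrative:

```python
def merge_buffer_plan(memory_bytes, num_runs, run_bytes, double_buffer=True):
    """Split memory evenly over the input runs (two buffers per run when
    double buffering) and count the resulting reads per run."""
    buffers_per_run = 2 if double_buffer else 1
    buf = memory_bytes // (num_runs * buffers_per_run)  # bytes per buffer
    reads_per_run = -(-run_bytes // buf)                # ceiling division
    return buf, reads_per_run

# 64 MB of memory, 16 runs of 256 MB each: double buffering halves the
# buffer size and doubles the reads per run, but lets reads overlap writes.
print(merge_buffer_plan(64 << 20, 16, 256 << 20, True))   # (2097152, 128)
print(merge_buffer_plan(64 << 20, 16, 256 << 20, False))  # (4194304, 64)
```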

9.
Sanders, Egner, Korst. Algorithmica, 2003, 35(1):21-55
Abstract. High performance applications involving large data sets require the efficient and flexible use of multiple disks. In an external memory machine with D parallel, independent disks, only one block can be accessed on each disk in one I/O step. This restriction leads to a load balancing problem that is perhaps the main inhibitor for the efficient adaptation of single-disk external memory algorithms to multiple disks. We solve this problem for arbitrary access patterns by randomly mapping blocks of a logical address space to the disks. We show that a shared buffer of O(D) blocks suffices to support efficient writing. The analysis uses the properties of negative association to handle dependencies between the random variables involved. This approach might be of independent interest for probabilistic analysis in general. If two randomly allocated copies of each block exist, N arbitrary blocks can be read within ⌈N/D⌉ + 1 I/O steps with high probability. The redundancy can be further reduced from 2 to 1+1/r for any integer r without a big impact on reading efficiency. From the point of view of external memory models, these results rehabilitate Aggarwal and Vitter's "single-disk multi-head" model [1], which allows access to D arbitrary blocks in each I/O step. This powerful model can be emulated on the physically more realistic independent disk model [2] with small constant overhead factors. Parallel disk external memory algorithms can therefore be developed in the multi-head model first. The emulation result can then be applied directly or further refinements can be added.
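A toy illustration of why two randomly placed copies per block balance the read load: a greedy shortest-queue sketch, not the paper's (stronger) scheduling analysis; all names are illustrative:

```python
import random
from collections import defaultdict

def schedule_reads(blocks, num_disks, copies=2, seed=0):
    """Place `copies` replicas of each block on distinct random disks, then
    serve each read from the replica on the currently least-loaded disk.
    The returned maximum queue length is the number of parallel I/O steps
    this greedy schedule needs."""
    rng = random.Random(seed)
    placement = {b: rng.sample(range(num_disks), copies) for b in blocks}
    queue = defaultdict(int)
    for b in blocks:
        disk = min(placement[b], key=lambda d: queue[d])
        queue[disk] += 1
    return max(queue.values())

# 10,000 blocks on 10 disks: the greedy schedule stays close to the
# ideal 1,000 steps; with copies=1 the maximum queue is noticeably longer.
print(schedule_reads(range(10000), 10))
```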

10.
An external sorting algorithm based on quicksort is presented. The file to be sorted is kept on disk, and only those blocks that are currently needed are fetched into main memory. A block is kept in main memory if the expected space-time cost of holding it until its next use is smaller than the expected space-time cost of removing it and fetching it again. The efficiency of the algorithm is tested through simulation experiments, and the results are compared to those achieved with mergesort in a corresponding environment. The total execution time and the main memory space-time integral are used for measuring the performance.

When equal block sizes are used, external quicksort results in a much smaller average space requirement than mergesort. On the other hand, mergesort is somewhat faster than external quicksort. The main memory space-time integral of quicksort is always considerably smaller than that of mergesort. External quicksort is less sensitive to the block size and to the file size. With faster disks, the performance of external quicksort improves faster than that of mergesort. The relative difference of the algorithms is independent of the file size.

The external quicksort is also compared analytically to some previous external versions of quicksort. It is shown to require less space and fewer block fetches than the others.
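A hedged sketch of the block-retention rule from the first paragraph; the cost parameters are illustrative stand-ins for the paper's expected space-time estimates:

```python
def keep_in_memory(block_size, time_until_next_use, write_cost, fetch_cost):
    """Hold the block while that is cheaper than evicting and refetching it.
    All four parameters are illustrative, expressed in common space-time
    units (space occupied multiplied by time held)."""
    holding_cost = block_size * time_until_next_use
    eviction_cost = write_cost + fetch_cost
    return holding_cost < eviction_cost
```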


11.
Mirrored disk systems provide high reliability by multiplexing disks. Performance is improved with parallel reads and shorter read seeks. However, writes must be performed by both disks, limiting performance. We introduce distorted mirrors, a mirroring system which combines write-anywhere semantics with traditional database-specified block locations. This technique radically reduces the cost of small writes, making it attractive for random access applications such as OLTP, while retaining the ability to efficiently perform large sequential accesses. Distorted mirrors also scale better than traditional mirrors in terms of both disk caching and large mirrored sets. We show the effectiveness of distorted mirrors on the TP1 benchmark.

12.
In this paper we present efficient algorithms for sorting on the Parallel Disks Model (PDM). Numerous asymptotically optimal algorithms have been proposed in the literature; however, many of these merge-based algorithms have large underlying constants in their time bounds, because they suffer from a lack of read parallelism on the PDM. The irregular consumption of the runs during the merge affects the read parallelism and contributes to the increased sorting time. In this paper we first introduce a novel idea called dirty sequence accumulation that improves the read parallelism. Next, we show analytically that this idea can reduce the number of parallel I/Os required to sort the input close to the lower bound of Ω((N/(DB)) log_{M/B}(N/B)) (where B is the block size and M the internal memory size). We verify our dirty sequence idea experimentally with the standard R-way merge and show that it can significantly reduce the number of parallel I/Os needed to sort on the PDM.

13.
The high latencies of access to background memory like hard disks or flash memory can be reduced by caching or hidden by prefetching. We consider the problem of scheduling the resulting I/Os when the available fast cache memory is limited and when we have real-time constraints, where for each requested data block we are given a time interval during which this block needs to be in main memory. We give a near-linear-time algorithm for this problem which produces a feasible schedule whenever one exists. Another algorithm additionally minimizes I/Os and still runs in polynomial time. For the online variant of the problem, we give a competitive algorithm that uses lookahead and augmented disk speed. We show a tight relationship between the amount of lookahead and the speed required to get a competitive algorithm.

14.
We consider the following partition problem: Given a set S of n elements that is organized as k sorted subsets of size n/k each, and given a parameter h with 1/k ≤ h ≤ n/k, partition S into g = O(n/(hk)) subsets D_1, D_2, ..., D_g of size Θ(hk) each, such that, for any two indices i and j with 1 ≤ i < j ≤ g, no element in D_i is bigger than any element in D_j. Note that with various combinations of the values of the parameters h and k, several fundamental problems, such as merging, sorting, and finding an approximate median, can be formulated as or be reduced to this partition problem. The partition problem also finds many applications in solving problems of parallel computing and computational geometry. In this paper we present efficient parallel algorithms for solving the partition problem and a number of its applications. Our parallel partition algorithm runs in O(log n) time in the EREW PRAM model. The complexity bounds of our parallel partition algorithm on the respective special cases match those of the optimal EREW PRAM algorithms for merging, sorting, and finding an approximate median. Using our parallel partition algorithm, we are also able to obtain better complexity bounds (even possibly on a weaker parallel model) than the previously best known parallel algorithms for several important problems, including parallel multiselection, parallel multiranking, and parallel sorting of k sorted subsets. Received May 5, 1996; revised July 30, 1998.
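A sequential reference implementation that pins down the partition problem's output contract (the parallel algorithm is the point of the paper precisely because it avoids the full merge performed here):

```python
import heapq

def partition(sorted_subsets, h):
    """Reference solution: fully merging the k sorted subsets and chopping
    the result into groups of size about h*k trivially satisfies the
    requirement that no element of D_i exceeds any element of D_j (i < j)."""
    k = len(sorted_subsets)
    size = max(1, int(h * k))                 # target group size Theta(hk)
    merged = list(heapq.merge(*sorted_subsets))
    return [merged[i:i + size] for i in range(0, len(merged), size)]

print(partition([[1, 4, 7], [2, 5, 8], [3, 6, 9]], h=1))
# [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```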

15.
Abstract. Blockwise access to data is a central theme in the design of efficient external memory (EM) algorithms. A second important issue, when more than one disk is present, is fully parallel disk I/O. In this paper we present a simple, deterministic simulation technique which transforms certain Bulk Synchronous Parallel (BSP) algorithms into efficient parallel EM algorithms. It optimizes blockwise data access and parallel disk I/O and, at the same time, utilizes multiple processors connected via a communication network or shared memory. We obtain new improved parallel EM algorithms for a large number of problems including sorting, permutation, matrix transpose, several geometric and GIS problems including three-dimensional convex hulls (two-dimensional Voronoi diagrams), and various graph problems. We show that certain parallel algorithms known for the BSP model can be used to obtain EM algorithms that meet well known I/O complexity lower bounds for various problems, including sorting.

16.
A parallel algorithm is presented for quick generation of long sorted runs as the first phase of sorting a large file. It can generate runs of length about 2(m + 2)p on a linear array of p processors each with a heap of size m. Internal computations can be completely overlapped with I/O, and only n + (m + 2)p steps are required to generate the runs from a file of n items. Experiments with the algorithm show that it can give highly satisfactory results in a real environment. A variant of the algorithm is also presented. It is possible that no merge phase is required for sorting a file occupying a whole disk.

17.
Dehne, Dittrich, Hutchinson. Algorithmica, 2008, 36(2):97-122
Abstract. External memory (EM) algorithms are designed for large-scale computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Typical EM algorithms are specially crafted for the EM situation. In the past, several attempts have been made to relate the large body of work on parallel algorithms to EM, but with limited success. The combination of EM computing, on multiple disks, with multiprocessor parallelism has been posed as a challenge by the ACM Working Group on Storage I/O for Large-Scale Computing. In this paper we provide a simulation technique which produces efficient parallel EM algorithms from efficient BSP-like parallel algorithms. The techniques obtained can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine. When applied to existing BSP-like algorithms, our simulation technique produces improved parallel EM algorithms for a large number of problems.

18.
Summary. Using the techniques of specification and transformation by parts, algorithms are derived for the longest upsequence problem. First Dijkstra's algorithm and then two new modified merge algorithms are derived and presented in detail. The merge algorithms take advantage of natural runs in the input sequence and have a worst-case O(n log n) time complexity when appropriate merging techniques are used, but can be linear if long runs are present in the sequence. The first merge algorithm is logically equivalent to Dijkstra's algorithm; the second algorithm is based on the first one but uses a different merging technique. Expository remarks describe related results which evolved out of our work in programming by transformation; in particular, parallels are drawn between algorithms for the longest upsequence problem and algorithms for sorting. Work on this paper has been supported in part by: ONR Grant N00014-75-C-0571; NSF Grant MCS-80-04349; USDOE Contract EY-76-C-02-3077.
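For comparison, the standard O(n log n) longest-upsequence computation via binary search; this is not the paper's transformation-derived merge algorithm, but it solves the same problem within the same worst-case bound:

```python
from bisect import bisect_right

def longest_upsequence_length(seq):
    """tails[i] is the smallest possible last element of a non-decreasing
    upsequence of length i + 1; each input element either extends the
    longest upsequence or improves one entry, found by binary search."""
    tails = []
    for x in seq:
        i = bisect_right(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

print(longest_upsequence_length([3, 1, 4, 1, 5, 9, 2, 6]))  # 4 (e.g. 1, 4, 5, 9)
```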

19.
A number of recent technological trends have made data-intensive applications such as continuous media (audio and video) servers a reality. These servers store and retrieve large volumes of data using magnetic disks. Servers consisting of multiple nodes and large arrays of heterogeneous disk drives have become a fact of life for several reasons. First, magnetic disks might fail, and failed disks are almost always replaced with newer disk models, because the current technological trend for these devices is one of annual increases in both performance and storage capacity. Second, storage requirements are ever increasing, forcing servers to be scaled up progressively. In this study, we present a framework to enable parity-based data protection for heterogeneous storage systems and to compute their mean lifetime. We describe the tradeoffs associated with three alternative techniques: independent subservers, dependent subservers, and disk merging. The disk merging approach provides a solution for systems that require highly available secondary storage in environments that also necessitate maximum flexibility.

20.
We consider the problem of sorting n integers when the elements are drawn from the restricted domain [1..n]. A new deterministic parallel algorithm for sorting n integers is obtained. Its running time is O(log n log(n/log n)) using n/log n processors on an EREW (exclusive read exclusive write) PRAM (parallel random access machine). The algorithm can also be modified to become cost-optimal by adjusting the number of processors. This algorithm belongs to class EP (Efficient, Polynomial fast).
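A sequential sketch of why the restricted domain [1..n] matters: counting sort does O(n) work with no comparisons, which is the kind of work budget the parallel algorithm distributes across its n/log n processors (illustrative code, not the paper's PRAM algorithm):

```python
def counting_sort_1_to_n(a):
    """Counting sort for keys drawn from [1..n], where n = len(a):
    O(n) time and space, with no comparisons."""
    n = len(a)
    count = [0] * (n + 1)
    for x in a:
        count[x] += 1
    out = []
    for v in range(1, n + 1):
        out.extend([v] * count[v])
    return out

print(counting_sort_1_to_n([3, 1, 4, 1]))  # [1, 1, 3, 4]
```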
