首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Sorting is a very important task in computer science and becomes a critical operation for programs making heavy use of sorting algorithms. General‐purpose computing has been successfully used on Graphics Processing Units (GPUs) to parallelize some sorting algorithms. Two GPU‐based implementations of the quicksort were presented in literature: the GPU‐quicksort, a compute‐unified device architecture (CUDA) iterative implementation, and the CUDA dynamic parallel (CDP) quicksort, a recursive implementation provided by NVIDIA Corporation. We propose CUDA‐quicksort an iterative GPU‐based implementation of the sorting algorithm. CUDA‐quicksort has been designed starting from GPU‐quicksort. Unlike GPU‐quicksort, it uses atomic primitives to perform inter‐block communications while ensuring an optimized access to the GPU memory. Experiments performed on six sorting benchmark distributions show that CUDA‐quicksort is up to four times faster than GPU‐quicksort and up to three times faster than CDP‐quicksort. An in‐depth analysis of the performance between CUDA‐quicksort and GPU‐quicksort shows that the main improvement is related to the optimized GPU memory access rather than to the use of atomic primitives. Moreover, in order to assess the advantages of using the CUDA dynamic parallelism, we implemented a recursive version of the CUDA‐quicksort. Experimental results show that CUDA‐quicksort is faster than the CDP‐quicksort provided by NVIDIA, with better performance achieved using the iterative implementation. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

2.
A new parallel sorting algorithm, called parsort, suitable for implementation on tightly coupled multiprocessors is presented. The algorithm is based upon quicksort and two-way merging. An asynchronous parallel partitioning algorithm is used to distribute work evenly during merging to ensure a good load balance amongst processors, which is crucial if we are to achieve high efficiency. The implementation of this parallel sorting algorithm exhibits theoretical and measured near linear speed-up when compared to sequential quicksort. This is illustrated by the results of experiments carried out on the Sequent Balance 8000 multiprocessor.  相似文献   

3.
We present a new scheme for evaluating the performance of multithreaded computers and demonstrate its application to the Cray MTA‐2 and XMT supercomputers. Our scheme is based on the concept of clock cycles per element, ${\cal C}$, plotted against both problem size and the number of processors. This scheme clearly shows if an implementation has achieved its asymptotic efficiency and is more general than (but includes) the commonly used speedup metric. It permits the discovery of any imperfections in both the software as well as the hardware, and is expected to permit a unified comparison of many different parallel architectures. Measurements on a number of well‐known parallel algorithms, ranging from matrix multiply to quicksort, are presented for the MTA‐2 and XMT and highlight some interesting differences between these machines. The performance of sequence alignment using dynamic programming is evaluated on the MTA‐2, XMT, IBM x3755 and SGI Altix 350 and provides a useful comparison of the capabilities of the Cray machines with more conventional shared memory architectures. Copyright © 2009 John Wiley & Sons, Ltd.  相似文献   

4.
Data distributions have a serious impact on time complexity of parallel programs, developed based on domain decomposition. A new kind of distributions—set distributions, based on set-valued mappings, is introduced. These distributions assign a data object to more than one process. The set distributions can be used especially when the number of processes is greater than the data input size, but, sometimes using set distributions can lead to efficient general parallel algorithms. The work-load properties of these distributions and their impact on the number of communications are discussed. In order to illustrate the implications of data distributions in the construction of parallel programs, some examples are presented. Two parallel algorithms for computation of Lagrange interpolation polynomial are developed, starting from simple distributions and set distributions.  相似文献   

5.
We propose a model of parallel computation, the YPRAM, that allows general parallel algorithms to be designed for a wide class of parallel models. The basic model captures locality among processors, which is measured as a function of two parameters; latency and bandwidth.

We design YPRAM algorithms for solving several fundamental problems: parallel prefix, sorting, sorting numbers from a bounded range, and list ranking. We show that our model predicts, reasonably accurately, the actual known performances of several basic parallel models — PRAM, hypercube, mesh and tree — when solving these problems.  相似文献   


6.
随着多处理器的出现,并行技术受到了广泛的关注,成为了加速处理问题速度的重要技术.但是使用并行技术在加速计算的同时也带来了对处理器数量需求的急剧提升,并行成本的显著增加.针对这一问题,通过研究基于PRAM (Parallel Random Access Machine)下的3种最大值查找并行算法中的不足,提出了一种比平衡树算法,快速查找法,双对数深度树方法并行成本(cost)更优的基于数据划分方法的最大值查找并行算法.基于数据划分方法的最大值查找算法有效的解决了现有并行方法中处理器工作量分配不均,对处理器需求过大,实现条件苛刻等问题.为此后类似并行算法降低并行成本提供一个方向.  相似文献   

7.
Based on the current fiber optic technology, a new computational model, called a linear array with a reconfigurable pipelined abus system (LARPBS), is proposed in this paper. A parallel quicksort algorithm is implemented on the model, and its time complexity is analyzed. For a set of N numbers, the quicksort algorithm reported in this paper runs in O(log2 N) average time on a linear array with a reconfigurable pipelined bus system of size N. If the number of processors available is reduced to P, where P < N, the algorithm runs in O((N/P) log2 N) average time and is still scalable. Besides proposing a new algorithm on the model, some basic data movement operations involved in the algorithm are discussed. We believe that these operations can be used to design other parallel algorithms on the same model. Future research in this area is also identified in this paper.  相似文献   

8.
9.
计算机图形生成的并行处理是结合图形学、并行处理及并行算法交叉而产生的一个新课题.本文首先介绍了近几年来这一领域的发展,然后针对计算机图形学领域发展最为迅速的分支之一——物理场的图形显示,利用最新提出的并行算法进行了研究和实现,并就其在两种并行处理机环境下的具体实现进行了结果分析.  相似文献   

10.
本文详细讨论了串行快速排序的并行化过程,并在Windows2000 Professional和MPI群集系统的基础上实现了并行快速排序算法,然后对算法的性能进行分析和改进.  相似文献   

11.
NAF点乘算法的并行计算研究   总被引:2,自引:0,他引:2  
分析了目前常用的NAF点乘算法,并提出了改进的并行NAF点乘算法,改进后的算法具有并行调度点加和点倍的特点,实验表明改进后的算法比原算法效率有明显提高。  相似文献   

12.
There are two ways, other than the standard fast Fourier transform (FFT) algorithm, of computing Fourier transforms of real data, namely, (1)the real fast Fourier transform (RFFT) algorithm, and (2) the fast Hartley transform (FHT) algorithm. On a sequential computer, it has been shown that both the RFFT and the FHT algorithms are faster than the FFT algorithm. However, it is not obvious that the same is true on a parallel machine. The communication requirements of the RFFT and the FHT algorithms, which are critical to the cost of any parallel implementation, are different from those of the FFT algorithm. In this paper we present efficient implementations of the RFFT and the FHT algorithms on a hypercube machine. Experimental results are given for the implementation of the RFFT and the FHT algorithms on the NCUBE machine.  相似文献   

13.
A new method of classification for numerical stability of parallel algorithms is proposed based on the theoretical foundation of forward error analysis. It partitions the algorithms according to their asymptotic stability—a measure introduced to relate the limiting behavior of the stability to the size of the problem. Using this method, the stability aspect of the pipelined solution technique for first-order and second-order linear recurrences—the core of a tridiagonal linear equation solver—is studied. In particular, it shows that the pipelined solution method of the first-order linear recurrences has the same degree of stability as the commonly used sequential evaluation algorithms. The stability problems of sequential and pipelined solution methods of the second-order linear recurrences are also studied.  相似文献   

14.
An external sorting algorithm based on quicksort is presented. The file to be sorted is kept on a disk and only those blocks are fetched into the main memory which are currently needed. At each time, a block is kept in the main memory, if the expected space-time cost of holding it until its next use is smaller than the expected space-time cost of removing it and fetching it again. The efficiency of the algorithm is tested through simulation experiments and the results are compared to those achieved with mergesort in a corresponding environment. The total execution time and the main memory space-time integral are used for measuring the performance.

When equal block sizes are used, external quicksort results in a much smaller average space requirement than mergesort. On the other hand, mergesort is somewhat faster than external quicksort. The main memory space-time integral of quicksort is always considerably smaller than that of mergesort. External quicksort is less sensitive to the block size and to the file size. With faster disks, the performance of external quicksort improves faster than that of mergesort. The relative difference of the algorithms is independent of the file size.

The external quicksort is also analytically compared to some previous external versions of quicksort. It is shown to be satisfied with less space and fewer block fetches than the others.  相似文献   


15.
51.引言 随着科学技术的发展,对大规模科学计算提出的需求越来越高.一是求解问题的规模越来越大,例如,三维油正模拟、大气和海洋之间的相互作用和核安全分析等都要求解超大规模的非线性方程组(未知数个数高达106~108).另一方面是实时性要求越来越迫切,电力系统安全分析、气象预报等方面提出的实时性需求是最好的铭证. 传统的单机串行式地解决问题的方法已经无法满足客观需求,因此各种形式的向量化和并行(乃至并行十向量化)算法的研究受到普遍的重视. 无论用什么方法求解非线性偏微分方程(组);最终都导致成千上万…  相似文献   

16.
Multigrid methods are powerful techniques to accelerate the solution of computationally-intensive problems arising in a broad range of applications. Used in conjunction with iterative processes for solving partial differential equations, multigrid methods speed up iterative methods by moving the computation from the original mesh covering the problem domain through a series of coarser meshes. But this hierarchical structure leaves domain-parallel versions of the standard multigrid algorithms with a deficiency of parallelism on coarser grids. To compensate, several parallel multigrid strategies with more parallelism, but also more work, have been designed. We examine these parallel strategies and compare them to simpler standard algorithms to try to determine which techniques are more efficient and practical. We consider three parallel multigrid strategies: (1) domain-parallel versions of the standard V-cycle and F-cycle algorithms; (2) a multiple coarse grid algorithm, proposed by Fredrickson and McBryan, which generates several coarse grids for each fine grid; and (3) two Rosendale algorithm, which allow computation on all grids simultaneously. We study an elliptic model problem on simple domains, discretized with finite difference techniques on block-structured meshes in two or three dimensions with up to 106 or 109 points, respectively. We analyze performance using three models of parallel computation: the PRAM and two bridging models. The bridging models reflect the salient characteristics of two kinds of parallel computers: SIMD fine-grain computers, which contain a large number of small (bitserial) processors, and SPMD medium-grain computers, which have a more modest number of powerful (single chip) processors. Our analysis suggests that the standard algorithms are substantially more efficient than algorithms utilizing either parallel strategy. Both parallel strategies need too much extra work to compensate for their extra parallelism. They require a highly impractical number of processors to be competitive with simpler, standard algorithms. The analysis also suggests that the F-cycle, with the appropriate optimization techniques, is more efficient than the V-cycle under a broad range of problem, implementation, and machine characteristics, despite the fact that it exhibits even less parallelism than the V-cycle. Research at Princeton University partially supported by the National Science Foundation, Grant No. CCR-8920505, and the Office of Naval Research, Contract No. N0014-91-J-1463.  相似文献   

17.
基于Java平台先对经典快速排序的改进方法作了介绍,通过测试得出了一个合适的经验阈值,改善了快速排序在小数据量情况下的低效问题。然后对快速排序作了多线程优化,并进行了单、多线程的对比测试,结果显示在多核主机上能有几倍的速度提升。最后对多线程快速排序算法进行了理论分析,得出了该算法速度的理论上限。  相似文献   

18.
《Information Sciences》2005,169(3-4):383-408
Quicksort is usually the best practical choice for sorting because it is, on average, remarkably efficient. Unfortunately, this popular algorithm has a significant drawback: the slowest performance is obtained in the simplest cases when input data are already initially sorted or only a slight perturbation occurs. In this paper, we propose a combination of quicksort and a new algorithm, which shows excellent time performance in sorting such crucial data arrays, and which is not much slower than quicksort in random cases. Our work was inspired by problems met when sorting polygon vertices in the sweep-line algorithms of computational geometry and, therefore, we have named the new algorithm ‘vertex sort’. It splits the input array into three sub-arrays. Two of them are already sorted, and the third one is handled iteratively. A simple test decides whether to continue recursively with vertex sort or to employ quicksort in the second iteration. In this way, we achieve a situation where the worst case time complexity does not exceed the running times of quicksort, but the simplest cases are handled much faster (in linear time) than random cases. We have named the combined algorithm ‘smart quicksort’ because of this desired property. In the last part of the paper, we prove its efficiency by employing it in a well-known sweep-line-based polygon triangulation algorithm.  相似文献   

19.
A unified distance transform algorithm and architecture   总被引:1,自引:0,他引:1  
Standard distance transform algorithms produce approximate results and are unsuitable for real-time implementation since they require massive parallelism. A new unified algorithm that computes distance and related nearest feature transforms concurrently for arbitrary bit maps based on any distance function from a broad class is presented. The algorithm has an efficient implementation on serial processors and a unified transform architecture is proposed for feasible real-time performance based on parallel row followed by parallel column scanning. Its importance lies in that it supports real-time performance and a broader set of machine vision applications than the standard approach.  相似文献   

20.
Object-oriented database management systems (OODBMSs) provide rich facilities for the modeling and processing of structural as well as behavioral properties of complex application objects. However, due to their inherent generality and continuously evolving functionalities, efficient implementations are important for these OODBMSs to support the present and future applications, particularly when the databases are very large. In this paper, we present several parallel, multi-wavefront algorithms based on two processing approaches, i.e., identification and elimination approaches, to verify association patterns specified in queries. Both approaches allow more processors to operate concurrently on a query than the traditional tree-structured query processing approach, thus introducing a higher degree of parallelism in query processing. A heuristic method is presented for partitioning an object-oriented database (OODB). The main consideration for partitioning the database is load balancing. This method also tries to reduce the communication time by reducing the length of the path that wavefronts need to be propagated. Multiple wavefront algorithms based on the two approaches for tree-structured queries have been implemented on an nCUBE 2 parallel computer. The implementation of the query processor allows multiple queries to be executed simultaneously. This implementation provides an environment for evaluating the algorithms and the heuristic method for partitioning the database. The evaluation results are presented in this paper.Recommended by: Patrick Valduriez  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号