Similar Documents (20 results)
1.
This paper presents a model of parallel computing. Six examples illustrate the method of programming. An implementation scheme for programs is also presented.

2.
Embedded Parallel computing architecture with Unique Memory Access (ePUMA) is a domain-specific embedded heterogeneous 9-core chip multiprocessor with a unique low-power, silicon-efficient design for high-throughput DSP in emerging telecommunication and multimedia applications. Sorting is one of the most widely studied algorithms, and many embedded applications also need efficient sorting. This paper proposes eSORT, an efficient bitonic sorting algorithm for the novel ePUMA DSP. eSORT consists of two parts: an in-core sorting algorithm and an intra-core sorting algorithm. Both algorithms are adapted to the novel architecture and take advantage of the ePUMA platform. The paper implements and evaluates eSORT on the ePUMA multi-core DSP for datasets of varying size and compares its performance with the Cell BE processor, which has the same SIMD parallelization structure. Results show that bitonic sort on the ePUMA multi-core DSP has much better performance and scalability. Compared with an optimized bitonic sort on the Cell BE, the in-core sort is 11 times faster and the intra-core sort is 15 times faster on average.
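For context, the sketch below is a plain sequential C version of bitonic sort for power-of-two input sizes. It is not the eSORT code; it only illustrates the compare-exchange structure that SIMD and multi-core variants such as eSORT parallelize.

```c
#include <stdio.h>

/* Compare-exchange: order a[i] and a[j] ascending (dir=1) or descending (dir=0). */
static void compare_exchange(int a[], int i, int j, int dir) {
    if ((a[i] > a[j]) == dir) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

/* Merge a bitonic sequence a[lo..lo+n-1] into ascending (dir=1) or descending order. */
static void bitonic_merge(int a[], int lo, int n, int dir) {
    if (n > 1) {
        int m = n / 2;
        for (int i = lo; i < lo + m; i++)
            compare_exchange(a, i, i + m, dir);
        bitonic_merge(a, lo, m, dir);
        bitonic_merge(a, lo + m, m, dir);
    }
}

/* Recursively build bitonic sequences, then merge them; n must be a power of two. */
static void bitonic_sort(int a[], int lo, int n, int dir) {
    if (n > 1) {
        int m = n / 2;
        bitonic_sort(a, lo, m, 1);        /* ascending half  */
        bitonic_sort(a, lo + m, m, 0);    /* descending half */
        bitonic_merge(a, lo, n, dir);
    }
}

int main(void) {
    int a[8] = {7, 3, 6, 1, 8, 2, 5, 4};
    bitonic_sort(a, 0, 8, 1);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```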

3.
A sorting classification of parallel rendering
We describe a classification scheme that we believe provides a more structured framework for reasoning about parallel rendering. The scheme is based on where the sort from object coordinates to screen coordinates occurs, which we believe is fundamental whenever both geometry processing and rasterization are performed in parallel. This classification scheme supports the analysis of computational and communication costs, and encompasses the bulk of current and proposed highly parallel renderers - both hardware and software. We begin by reviewing the standard feed-forward rendering pipeline, showing how different ways of parallelizing it lead to three classes of rendering algorithms. Next, we consider each of these classes in detail, analyzing their aggregate processing and communication costs, possible variations, and constraints they may impose on rendering applications. Finally, we use these analyses to compare the classes and identify when each is likely to be preferable.

4.
5.
Hybrid parallel programming with the message passing interface (MPI) for internode communication in conjunction with a shared-memory programming model to manage intranode parallelism has become a dominant approach to scalable parallel programming. While this model provides a great deal of flexibility and performance potential, it saddles programmers with the complexity of utilizing two parallel programming systems in the same application. We introduce an MPI-integrated shared-memory programming model that is incorporated into MPI through a small extension to the one-sided communication interface. We discuss the integration of this interface with the MPI 3.0 one-sided semantics and describe solutions for providing portable and efficient data sharing, atomic operations, and memory consistency. We describe an implementation of the new interface in the MPICH2 and Open MPI implementations and demonstrate an average performance improvement of 40% in the communication component of a five-point stencil solver.
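The shared-memory windows standardized in MPI 3.0 are exposed through MPI_Comm_split_type and MPI_Win_allocate_shared. The following minimal sketch (illustrative buffer size, no error checking) shows the typical usage pattern; it is not code from the paper.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Split the world communicator into per-node communicators of ranks
       that can physically share memory. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Each rank contributes a segment; the window memory is directly
       load/store accessible by all ranks on the node. */
    const MPI_Aint count = 1024;   /* illustrative size */
    double *my_segment;
    MPI_Win win;
    MPI_Win_allocate_shared(count * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, node_comm, &my_segment, &win);

    /* Query the base address of rank 0's segment and access it directly. */
    MPI_Aint seg_size;
    int disp_unit;
    double *rank0_segment;
    MPI_Win_shared_query(win, 0, &seg_size, &disp_unit, &rank0_segment);

    MPI_Win_lock_all(0, win);
    my_segment[0] = (double)node_rank;   /* direct store into shared memory */
    MPI_Win_sync(win);
    MPI_Barrier(node_comm);
    printf("rank %d sees rank 0's value %.1f\n", node_rank, rank0_segment[0]);
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```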

6.
Parallel prefix circuits are parallel algorithms performing the prefix operation in the combinational circuit model. The size of a prefix circuit is the number of operation nodes in the circuit, and the depth is the maximum level of operation nodes. A circuit with n inputs is depth-size optimal if its depth plus size equals 2n−2. Smaller depth implies faster computation, while smaller size implies less power consumption, smaller VLSI area, and less cost. A circuit should have small fan-out and small depth for it to be of practical use. In this paper, we present a new approach that eases the design of parallel prefix circuits, and construct a depth-size optimal parallel prefix circuit, named WE4, with fan-out 4. For many values of n, WE4 has the smallest depth among all known depth-size optimal prefix circuits with bounded fan-out.
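For reference, the prefix operation computed by such circuits is shown below as a minimal sequential C sketch with addition as the associative operator; a prefix circuit such as WE4 produces the same n outputs using a network of operation nodes with bounded depth, size, and fan-out.

```c
#include <stdio.h>

/* Sequential prefix sums: out[i] = in[0] + in[1] + ... + in[i].
   A parallel prefix circuit computes exactly these outputs, but arranges
   the additions as a network whose depth, size, and fan-out are bounded. */
static void prefix_sums(const int *in, int *out, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += in[i];
        out[i] = acc;
    }
}

int main(void) {
    int in[8] = {3, 1, 4, 1, 5, 9, 2, 6};
    int out[8];
    prefix_sums(in, out, 8);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);  /* 3 4 8 9 14 23 25 31 */
    printf("\n");
    return 0;
}
```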

7.
This paper presents a new generalized particle (GP) approach to dynamic optimization of network bandwidth allocation, which can also be used to optimize other resource assignments in networks. Using the GP model, the complicated network bandwidth allocation problem is transformed into the kinematics and dynamics of numerous particles in two reciprocal dual force fields. The proposed model and algorithm are characterized by powerful processing ability in a complex environment that involves the various interactions among network entities, the market mechanism between demand and service, and other phenomena common in networks, such as congestion, metabolism, and breakdown of network entities. The GP approach also has advantages in terms of higher parallelism, lower computational complexity, and ease of hardware implementation. The properties of the approach, including correctness, convergence, and stability, are discussed in detail. Simulation results attest to the effectiveness and suitability of the proposed approach.

8.
On parallel integer sorting
We present an optimal algorithm for sorting n integers in the range [1, n^c] (for any constant c) on the EREW PRAM model where the word length is n^ε, for any ε > 0. Using this algorithm, the best known upper bound for integer sorting on the (O(log n) word length) EREW PRAM model is improved. In addition, a novel parallel range-reduction algorithm which yields a near-optimal randomized integer sorting algorithm is presented. For the case when the keys are uniformly distributed integers in an arbitrary range, we give an algorithm whose expected running time is optimal. Supported by NSF-DCR-85-03251 and ONR contract N00014-87-K-0310.
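As background for the bounded-range setting above, the following sequential C sketch shows counting sort for keys in [1, k]; it illustrates why integer keys drawn from a polynomial range can be sorted faster than the comparison-based bound. It is only an illustration, not the EREW PRAM algorithm of the paper.

```c
#include <stdlib.h>
#include <stdio.h>

/* Counting sort for keys in [1, k]: O(n + k) time, stable.
   Sequential illustration only; the cited paper parallelizes this
   bounded-range setting on the EREW PRAM. */
static void counting_sort(const int *in, int *out, int n, int k) {
    int *count = calloc((size_t)k + 1, sizeof(int));
    for (int i = 0; i < n; i++) count[in[i]]++;
    for (int v = 1; v <= k; v++) count[v] += count[v - 1];      /* prefix sums */
    for (int i = n - 1; i >= 0; i--) out[--count[in[i]]] = in[i];
    free(count);
}

int main(void) {
    int in[] = {5, 2, 9, 2, 7, 1, 9, 3};
    int out[8];
    counting_sort(in, out, 8, 9);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);  /* 1 2 2 3 5 7 9 9 */
    printf("\n");
    return 0;
}
```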

9.
High-speed electronic sorting networks are difficult to implement with VLSI technology because of the dense and global connectivity required. Optics eliminates this bottleneck by offering global interconnections, massive parallelism, and noninterfering communications. We present a parallel sorting algorithm and its efficient optical implementation using currently available optical hardware. The algorithm sorts n data elements in a few steps, independent of the number of elements to be sorted. Thus, it is a constant-time sorting algorithm, that is, O(1) time.

10.
With the popularity of parallel database machines based on the shared-nothing architecture, it has become important to find external sorting algorithms which lead to a load-balanced computation, i.e., balanced execution, communication and output. If during the course of the sorting algorithm each processor is equally loaded, parallelism is fully exploited. Similarly, balanced communication will not congest the network traffic. Since sorting can be used to support a number of other relational operations (joins, duplicate elimination, building indexes, etc.), data skew produced by sorting can further lead to execution skew at later stages of these operations. In this paper we present a load-balanced parallel sorting algorithm for shared-nothing architectures. It is a multiple-input multiple-output algorithm with four stages, based on a generalization of Batcher's odd-even merge. At each stage the n keys are evenly distributed among the p processors (i.e., there is no final sequential merge phase), and the distribution of keys between stages ensures against network congestion. No assumption is made on the key distribution, and the algorithm performs equally well in the presence of duplicate keys. Hence our approach always guarantees its performance, as long as n is greater than p^3, which is the case of interest for sorting large relations. In addition, processors can be added incrementally. Recommended by: Patrick Valduriez.
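The merge that the paper generalizes is Batcher's odd-even merge; the sketch below is a plain sequential C version of odd-even merge sort for power-of-two sizes, shown only to make the compare-exchange structure concrete. It is not the multiple-input multiple-output shared-nothing algorithm described above.

```c
#include <stdio.h>

static void swap_if_greater(int a[], int i, int j) {
    if (a[i] > a[j]) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}

/* Batcher's odd-even merge: merges the two sorted halves of a[lo..lo+n-1],
   comparing elements that are r apart; n and r are powers of two. */
static void odd_even_merge(int a[], int lo, int n, int r) {
    int m = r * 2;
    if (m < n) {
        odd_even_merge(a, lo, n, m);       /* even subsequence */
        odd_even_merge(a, lo + r, n, m);   /* odd subsequence  */
        for (int i = lo + r; i + r < lo + n; i += m)
            swap_if_greater(a, i, i + r);
    } else {
        swap_if_greater(a, lo, lo + r);
    }
}

/* Odd-even merge sort; n must be a power of two. */
static void odd_even_merge_sort(int a[], int lo, int n) {
    if (n > 1) {
        int m = n / 2;
        odd_even_merge_sort(a, lo, m);
        odd_even_merge_sort(a, lo + m, m);
        odd_even_merge(a, lo, n, 1);
    }
}

int main(void) {
    int a[8] = {6, 3, 8, 1, 7, 2, 5, 4};
    odd_even_merge_sort(a, 0, 8);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```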

11.
A new approach to Kanerva's sparse distributed memory
The sparse distributed memory (SDM) was originally developed to tackle the problem of storing large binary data patterns. The model succeeded well in storing random input data; however, its efficiency, particularly in handling nonrandom data, was poor, and in its original form it is a static and inflexible system. Most recent work on the SDM has concentrated on improving the efficiency of a modified form that treats the memory as a single-layer neural network. This paper introduces an alternative SDM, the SDM signal model, which retains the essential characteristics of the original SDM while giving the memory greater scope for plasticity and self-evolution. By removing many of the problematic features of the original SDM, the new model is less dependent upon a priori input values, which gives it increased robustness in learning either random or correlated input patterns. The improvements in the new SDM signal model should also benefit modified SDM neural network models.

12.
Batcher proposed a parallel sorting scheme whose basic operation sorts two elements. Tseng and Lee extended his results and proposed a parallel sorting scheme whose basic operation sorts three elements. In this paper, we propose a parallel sorting scheme whose basic operation sorts n elements, where n is an arbitrary number. A proof of the correctness of this algorithm is given. This research was partially supported by the National Science Council of the Republic of China under contract NSC-73-0201-E-007-01.

13.
This paper describes a computer memory system intended for storing an arbitrary sequence of multidimensional arrays. The memory system permits parallel access to the cuts obtained from a given array by fixing one of its coordinates, and to a large set of parallelepipeds that are subarrays of the given array of the same dimensionality.

14.
15.
Sorting on a parallel architecture is a communications-intensive operation which can incur a high penalty in applications where it is required. In the case of particle simulation, only integer sorting is necessary, and sequential implementations easily attain the minimum performance bound of O(N) for N particles. Parallel implementations, however, have to cope with the parallel sorting problem which, in addition to incurring a heavy communications cost, can make the minimum performance bound difficult to attain. This paper demonstrates how the sorting problem in a particle simulation can be reduced to a merging problem, and describes an efficient data-parallel algorithm to solve this merging problem in a particle simulation. The new algorithm is shown to be optimal under conditions usual for particle simulation, and its fieldwise implementation on the Connection Machine is analysed in detail. The new algorithm is about four times faster than a fieldwise implementation of radix sort on the Connection Machine.

16.
The computational complexity of a parallel algorithm depends critically on the model of computation. We describe a simple and elegant rule-based model of computation in which processors apply rules asynchronously to pairs of objects from a global object space. Application of a rule to a pair of objects results in the creation of a new object if the objects satisfy the guard of the rule. The model can be efficiently implemented as a novel MIMD array processor architecture, the Intersecting Broadcast Machine. For this model of computation, we describe an efficient parallel sorting algorithm based on mergesort. The computational complexity of the sorting algorithm is O(n log² n), comparable to that of specialized sorting networks and an improvement on the O(n^1.5) complexity of conventional mesh-connected array processors.

17.
This paper describes a special-purpose embedded multiprocessor architecture developed for performing real-time multi-line optical character recognition (MLOCR). MLOCR is a computationally intensive real-time application involving pattern recognition, character image extraction, gray-scale thresholding, rotation and scaling of individual characters, and character identification. The computational complexity of the MLOCR application dictated the development of custom hardware in a parallel processing environment in order to meet the real-time system requirements. The overall system organization is described, along with the functional partitioning of algorithms onto processors, development of specific custom hardware to implement the algorithms in real time, interprocess communications, and system control.

18.
Molecular docking is a widely used computational technique for studying structure-based interaction complexes between biological objects at the molecular scale. The purpose of the current work is to develop a set of tools for performing inverse docking, i.e., testing a chemical ligand at large scale against a large dataset of proteins, which has several applications in the field of drug research. We developed different strategies to parallelize and distribute the docking procedure, as a way to efficiently exploit the computational performance of multi-core and multi-machine (cluster) environments. The experiments conducted to compare these strategies encourage the use of decomposition strategies, since they improve the execution of inverse docking.

19.
Performance Evaluation, 1986, 6(2): 135-145
To compare the performance of external sorting and internal sorting in virtual memory, (external) mergesort and (internal) quicksort are performed in corresponding environments. Quicksort is run in a virtual memory with fetch-on-demand and a working-set replacement policy. Mergesort uses sublists presorted by replacement selection, double buffering for input and output, and two disks. Performance is measured by main memory space allocation, execution time, and the main memory space-time integral. To perform well, quicksort needs a smaller space allocation than mergesort, and it behaves more consistently with respect to space allocation. Mergesort needs far less execution time than quicksort, mainly because of its efficient overlapping of file handling and processor time. With respect to the space-time integral, quicksort outperforms mergesort only when small files (less than 1000 records) are sorted. With large files, mergesort is better, and the relative difference increases with increasing file size. The optimal page size and window size for quicksort are smaller than those typical of existing virtual memory systems, and they tend to decrease with increasing file size.
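The run-generation step mentioned above (replacement selection) can be sketched as follows; this is an illustrative in-memory C version with a tiny buffer and a linear scan in place of a real min-heap, not code from the paper. On random input it produces initial runs about twice the size of the buffer.

```c
#include <stdio.h>

#define MEM 4   /* number of records that fit in "main memory" */

/* Replacement selection for initial run generation in external mergesort.
   A buffer of MEM records is kept (a min-heap in practice; a linear scan
   here for brevity).  Each record written out frees a slot for the next
   input record; records smaller than the one just written are held back
   for the following run. */
static void generate_runs(const int *input, int n) {
    int buf[MEM], held[MEM], count = 0, pos = 0, run = 1;

    while (pos < n && count < MEM) {            /* fill the buffer */
        buf[count] = input[pos++];
        held[count++] = 0;
    }
    printf("run %d:", run);
    while (count > 0) {
        int best = -1;                          /* smallest record not held back */
        for (int i = 0; i < count; i++)
            if (!held[i] && (best < 0 || buf[i] < buf[best])) best = i;

        if (best < 0) {                         /* current run exhausted */
            for (int i = 0; i < count; i++) held[i] = 0;
            printf("\nrun %d:", ++run);
            continue;
        }
        int out = buf[best];
        printf(" %d", out);                     /* "write" record to the run */

        if (pos < n) {                          /* refill the freed slot */
            buf[best] = input[pos++];
            held[best] = buf[best] < out;       /* too small for this run */
        } else {                                /* no more input: shrink */
            buf[best] = buf[count - 1];
            held[best] = held[count - 1];
            count--;
        }
    }
    printf("\n");
}

int main(void) {
    int input[] = {9, 3, 7, 1, 8, 2, 12, 5, 4, 11, 6, 10};
    generate_runs(input, (int)(sizeof input / sizeof input[0]));
    return 0;
}
```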

20.
This paper is concerned with an external sorting algorithm that requires no additional disk space. The proposed algorithm is a hybrid one that uses Quicksort and a special merging process in two distinct phases. The algorithm excels at sorting a huge file that is many times larger than the available memory of the computer. It creates no extra backup file while manipulating the records, and thus saves the large amount of disk space that would otherwise be needed to hold a copy of the file. After the Quicksort phase, the algorithm switches to the special merging process, which reduces the time complexity and makes the algorithm faster.
