期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Fast Four‐Way Parallel Radix Sorting on GPUs

Linh Ha Jens Krüger Cláudio T. Silva 《Computer Graphics Forum》2009,28(8):2368-2378

Efficient sorting is a key requirement for many computer science algorithms. Acceleration of existing techniques as well as developing new sorting approaches is crucial for many real‐time graphics scenarios, database systems, and numerical simulations to name just a few. It is one of the most fundamental operations to organize and filter the ever growing massive amounts of data gathered on a daily basis. While optimal sorting models for serial execution on a single processor exist, efficient parallel sorting remains a challenge. In this paper, we present a hardware‐optimized parallel implementation of the radix sort algorithm that results in a significant speed up over existing sorting implementations. We outperform all known General Processing Unit (GPU) based sorting systems by about a factor of two and eliminate restrictions on the sorting key space. This makes our algorithm not only the fastest, but also the first general GPU sorting solution. 相似文献

2.

Performance characteristics of the multi-zone NAS parallel benchmarks

《Journal of Parallel and Distributed Computing》2006,66(5):674-685

We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of meshes, but had not previously been captured in benchmarks. The new suite, named NPB (NAS parallel benchmarks) multi-zone, is derived from the NPB suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the message passing interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on four different parallel computers. We also use an empirical formula to investigate the performance characteristics of the hybrid parallel codes. 相似文献

3.

Implementing regularly structured neural networks on the DREAMmachine

Shams S. Gaudiot J.-L. 《Neural Networks, IEEE Transactions on》1995,6(2):407-421

High-throughput implementations of neural network models are required to transfer the technology from small prototype research problems into large-scale "real-world" applications. The flexibility of these implementations in accommodating for modifications to the neural network computation and structure is of paramount importance. The performance of many implementation methods today is greatly dependent on the density and the interconnection structure of the neural network model being implemented. A principal contribution of this paper is to demonstrate an implementation method which exploits maximum amount of parallelism from neural computation, without enforcing stringent conditions on the neural network interconnection structure, to achieve this high implementation efficiency. We propose a new reconfigurable parallel processing architecture, the Dynamically Reconfigurable Extended Array Multiprocessor (DREAM) machine, and an associated mapping method for implementing neural networks with regular interconnection structures. Details of the system execution rate calculation as a function of the neural network structure are presented. Several example neural network structures are used to demonstrate the efficiency of our mapping method and the DREAM machine architecture on implementing diverse interconnection structures. We show that due to the reconfigurable nature of the DREAM machine, most of the available parallelism of neural networks can be efficiently exploited. 相似文献

4.

DaSH: A benchmark suite for hybrid dataflow and shared memory programming models

《Parallel Computing》2015

The current trend in development of parallel programming models is to combine different well established models into a single programming model in order to support efficient implementation of a wide range of real world applications. The dataflow model has particularly managed to recapture the interest of the research community due to its ability to express parallelism efficiently. Thus, a number of recently proposed hybrid parallel programming models combine dataflow and traditional shared memory models. Their findings have influenced the introduction of task dependency in the OpenMP 4.0 standard.This article presents DaSH – the first comprehensive benchmark suite for hybrid dataflow and shared memory programming models. DaSH features 11 benchmarks, each representing one of the Berkeley dwarfs that capture patterns of communication and computation common to a wide range of emerging applications. DaSH also includes sequential and shared-memory implementations based on OpenMP and Intel TBB to facilitate easy comparison between hybrid dataflow implementations and traditional shared memory implementations based on work-sharing and/or tasks. Finally, we use DaSH to evaluate three different hybrid dataflow models, identify their advantages and shortcomings, and motivate further research on their characteristics. 相似文献

5.

Attempting guards in parallel: A data flow approach to execute generalized guarded commands

R. Govindarajan S. Yu V. S. Lakshmanan 《International journal of parallel programming》1992,21(4):225-268

Earlier approaches to execute generalized alternative/repetitive commands of Communicating Sequential Processes (CSP) attempt the selection of guards in a sequential order. Also, these implementations are based on either shared memory or message passing multiprocessor systems. In contrast, we propose an implementation of generalized guarded commands using the data-driven model of computation. A significant feature of our implementation is that it attempts the selection of the guards of a process in parallel. We prove that our implementation is faithful to the semantics of the generalized guarded commands. Further, we have simulated the implementation using discrete-event simulation and measured various performance parameters. The measured parameters are helpful in establishing the fairness of our implementation and its superiority, in terms of efficiency and the parallelism exploited, over other implementations. The simulation study is also helpful in identifying various issues that affect the performance of our implementation. Based on this study, we have proposed an adaptive algorithm which dynamically tunes the extent of parallelism in the implementation to achieve an optimum level of performance.The first author's work was supported by a MICRONET, Network Centers of Excellence, research grant. Support for the second author is from the NSERC (Canada) Grant. The last author's work was supported by grants from NSERC (Canada) and FCAR (Quebec). 相似文献

6.

Parallel implementation of OPS5 on the encore multiprocessor: Results and analysis 总被引：1，自引：0，他引：1

Anoop Gupta Milind Tambe Dirk Kalp Charles Forgy Allen Newell 《International journal of parallel programming》1988,17(2):95-124

Until now, most results reported for parallelism in production systems (rulebased systems) have been simulation results-very few real parallel implementations exist. In this paper, we present results from our parallel implementation of OPS5 on the Encore multiprocessor. The implementation exploits very finegrained parallelism to achieve significant speed-ups. For one of the applications, we achieve 12.4 fold speed-up using 13 processes. Our implementation is also distinct from other parallel implementations in that we parallelize a highly optimized C-based implementation of OPS5. Running on a uniprocessor, our C-based implementation is 10–20 times faster than the standard lisp implementation distributed by Carnegie Mellon University. In addition to presenting the performance numbers, the paper discusses the details of the parallel implementation-the data structures used, the amount of contention observed for shared data structures, and the techniques used to reduce such contention. 相似文献

7.

Parallelizing the Cellular Potts Model on graphics processing units

José Juan Tapia Roshan M. D'Souza 《Computer Physics Communications》2011,182(4):857-865

The Cellular Potts Model (CPM) is a lattice based modeling technique used for simulating cellular structures in computational biology. The computational complexity of the model means that current serial implementations restrict the size of simulation to a level well below biological relevance. Parallelization on computing clusters enables scaling the size of the simulation but marginally addresses computational speed due to the limited memory bandwidth between nodes. In this paper we present new data-parallel algorithms and data structures for simulating the Cellular Potts Model on graphics processing units. Our implementations handle most terms in the Hamiltonian, including cell–cell adhesion constraint, cell volume constraint, cell surface area constraint, and cell haptotaxis. We use fine level checkerboards with lock mechanisms using atomic operations to enable consistent updates while maintaining a high level of parallelism. A new data-parallel memory allocation algorithm has been developed to handle cell division. Tests show that our implementation enables simulations of >¹⁰⁶ cells with lattice sizes of up to 256³ on a single graphics card. Benchmarks show that our implementation runs ∼80× faster than serial implementations, and ∼5× faster than previous parallel implementations on computing clusters consisting of 25 nodes. The wide availability and economy of graphics cards mean that our techniques will enable simulation of realistically sized models at a fraction of the time and cost of previous implementations and are expected to greatly broaden the scope of CPM applications. 相似文献

8.

An Efficient Parallel Algorithm for FFT

下载免费PDF全文

Qiao Xiangzhen 《计算机科学技术学报》1987,2(3):174-190

相似文献

9.

Parthenon: A parallel theorem prover for non-horn clauses

Soumitra Bose Edmund M. Clarke David E. Long Spiro Michaylov 《Journal of Automated Reasoning》1992,8(2):153-181

We describe a parallel resolution theorem prover, called Parthenon, that handles full first order logic. Although there has been much work on parallel implementations of logic programming languages, Parthenon is the first general purpose theorem prover to be developed for a multiprocessor. The system is based on a modification of Warren's SRI model for or-parallelism and implements a variant of Loveland's model elimination procedure. It has been evaluated on various shared memory multiprocessors including a 16-processor Encore Multimax and IBM's 64-processor RP3. We have found that many theorem proving problems exhibit a great deal of potential parallelism. Parthenon has been able to exploit much of this parallelism, producing both good absolute run times and near-linear speedup curves in many cases.This research was partially supported by NSF grant CCR-87-226-33. An earlier version of this paper appeared in the Fourth IEEE Symposium on Logic in Computer Science, Asilomar, CA, June 1989. D.E.L. was partially supported by an NSF graduate fellowship. S.M. was partially supported by an IBM graduate fellowship. 相似文献

10.

Loop Staggering,Loop Compacting:Restructuring Techniques for Thrashing Problem 总被引：1，自引：0，他引：1

下载免费PDF全文

Jin Guohua Yang Xuejun Chen Fujie 《计算机科学技术学报》1993,8(1):49-57

Parallel loops account for the greatest amount of parallelism in numerical programs.Executing nested loops in parallel wit low run-time overhead is thus very important for achieving high performance in paralleo processing systems.However,in parallel processing systems with caches of local memories in memory hierarchies,“thrashing problemmay” may arise when data move back and forth frequently between the caches or local memories in different processors.The techniques associated with parallel compiler to solve the problem are not completely developed.In this paper,we present two restructuring techniques called loopg staggering,loop staggering and compacting,with which we can not only eliminate the cache or local memory thrashing phemomena significantly,but also exploit the potential parallelism existing in outer serial loop.Loop staggering benefits the dynamic loop scheduling strategies,whereas loop staggering and compacting is good for static loop scheduling strategies,Our method especially benefits parallel programs,in which a parallel loop is enclosed by a serial loop and array elements are repeatedly used in the different iterations of the parallel loop. 相似文献

11.

Adaptive parallelism and Piranha

Carriero N. Freeman E. Gelernter D. Kaminsky D. 《Computer》1995,28(1):40-49

相似文献

12.

基于分布对象的并行程序设计方法研究

龚向坚邹腊梅马淑萍《现代计算机》2011,(21):9-11,26

研究分布式对象的并行实现及优化,提出一种基于分布式对象的并行程序设计方法,构建一个基于分布式对象的并行程序设计模型,并以此方法完成虚拟计算机网络实验系统的设计和实现实验结果表明,该虚拟计算机网络实验系统并行性较好、响应速度适中,证明基于分布式对象的并行程序设计方法在改善微机系统并行性上具有一定的作用相似文献

13.

Algorithms for asynchronous parallel processing of object-orienteddatabases

Thakore A.K. Su S.Y.W. Lam H.X. 《Knowledge and Data Engineering, IEEE Transactions on》1995,7(3):487-504

Management of large quantities of complex data is essential in many advanced application areas. Object-oriented (OO) database management system have been developed to effectively model and process the complex domain knowledge. They have been shown to outperform some existing relational systems. The existing implementations of OO database management systems attempt to improve the efficiency of OO queries by explicitly capturing the relationships among objects. However, the execution of complex queries involving the retrieval of objects from many classes and relationships among them causes the existing system to operate inefficiently. In this paper, we present parallel algorithms for the processing of queries against a large OO database. The algorithms are based on a closed model of query processing pattern-based access instead of the conventional value-based access. During processing, the algorithms avoid the execution of time-consuming join operations by making use of the explicitly stored object associations. Generation of large quantities of temporary data is avoided by marking objects using their identifiers and by employing a two-phase query processing strategy. A query is processed by concurrent multiple waves, thereby improving parallelism avoiding the complexities introduced in their sequential implementation. The correctness and the performance of the parallel algorithms have been tested and analyzed by running parallel programs on a 32-node transputer based parallel machine designed and developed at the IBM Research Center at Yorktown Heights, New York. Benchmark queries of different semantic complexities are generated, and their performance is analyzed for various data and query parameters 相似文献

14.

Stabilizing large‐scale generalized systems on parallel computers using multithreading and message‐passing

Peter Benner Maribel Castillo Rafael Mayo Enrique S. Quintana‐Ortí Gregorio Quintana‐Ortí 《Concurrency and Computation》2007,19(4):531-542

We discuss the parallelization of an efficient algorithm for the partial stabilization of large‐scale linear control systems in generalized state‐space form. The algorithm is composed of highly parallel iterative schemes that appear in the computation of certain matrix functions. Here we evaluate different approaches to exploit parallelism at two levels, based on threads and processes. Our experimental results on a cluster of symmetric multiprocessors and a CC‐NUMA platform show that the efficiency of the matrix operations underlying the iterative schemes carry over to the parallel implementation of the stabilization algorithm. Copyright © 2006 John Wiley & Sons, Ltd. 相似文献

15.

Constructive protocol specification using Cicero

Huang Y.-M. Ravishankar C.V. 《IEEE transactions on pattern analysis and machine intelligence》1998,24(4):252-267

This paper describes Cicero, a set of language constructs to allow constructive protocol specifications. Unlike other protocol specification languages, Cicero gives programmers explicit control over protocol execution, and facilitates both sequential and parallel implementations, especially for protocols above the transport-layer. It is intended to be used in conjunction with domain-specific libraries, and is quite different in philosophy and mode of use from existing protocol specification languages. A feature of Cicero is the use of event patterns to control synchrony, asynchrony, and concurrency in protocol execution, which helps programmers build robust protocol implementations. Event-pattern driven execution also enables implementers to exploit parallelism of varying grains in protocol execution. Event patterns can also be translated into other formal models, so that existing verification techniques may be used 相似文献

16.

GPU-accelerated level-set segmentation

Julián Lamas-Rodríguez Dora B. Heras Francisco Argüello Dagmar Kainmueller Stefan Zachow Montserrat Bóo 《Journal of Real-Time Image Processing》2016,12(1):15-29

The level-set method, a technique for the computation of evolving interfaces, is a solution commonly used to segment images and volumes in medical applications. GPUs have become a commodity hardware with hundreds of cores that can execute thousands of threads in parallel, and they are nowadays ideal platforms to execute computational intensive tasks, such as the 3D level-set-based segmentation, in real time. In this paper, we propose two GPU implementations of the level-set-based segmentation method called Fast Two-Cycle. Our proposals perform computations in independent domains called tiles and modify the structure of the original algorithm to better exploit the features of the GPU. The implementations were tested with real images of brain vessels and a synthetic MRI image of the brain. Results show that they execute faster than a CPU-sequential implementation of the same method, without any significant loss of the segmentation quality and without requiring distributed parallel computer infrastructures. 相似文献

17.

Parallel algorithms for model reduction of discrete-time systems

P. Benner E. S. Quintana-Orti G. Quintana-Orti 《International journal of systems science》2013,44(5):319-333

Computing reduced-order models of controlled dynamical systems is of fundamental importance in many analysis and synthesis problems in systems and control theory. Algorithmic aspects of model reduction methods based on state-space truncation for linear discrete-time systems are addressed here. In contrast to the often-used approach of applying methods for continuous-time systems to discrete-time models employing a bilinear transformation, we devise special algorithms for discrete-time systems. Usually, this is more reliable and efficient. All methods discussed require in an initial stage the computation of the Gramians of the system. Using an accelerated fixed-point iteration for computing the full-rank factors of the Gramians yields some favorable computational aspects, particularly for non-minimal systems. The computations only require efficient implementations of basic linear algebra operations readily available on modern computer architectures. We discuss aspects of the parallel implementation of these methods and show the performance and scalability on distributed memory computers. Our approach enables users to deal with very complex systems using relatively cheap infrastructure, as, for example, a local PC or workstation network. 相似文献

18.

Fifth generation and VLSI architectures

Philip C. Treleaven Apostolos N. Refenes 《Future Generation Computer Systems》1985,1(6):387-396

Most Western Governments (USA, Japan, EEC, etc.) have now launched national programmes to develop computer systems for use in the 1990s. These so-called Fifth Generation computers are viewed as “knowledge” processing systems which support the symbolic computation underlying Artificial Intelligence applications. The major driving force in Fifth Generation computer design is to efficiently support very high level programming languages (i.e. VHLL architecture).

Historycally, however, commercial VHLL architectures have been largely unsuccesful. The driving force in computer designs has principally been advances in hardware which at the present time means architectures to exploit very large scale integration (i.e. VLSI architecture).

This paper examines VHLL architectures and VLSI architectures and their probable influences on Fifth Generation computers. Interestingly the major problem for both architecture classes is parallelism; how to orchestrate a single parallel computation so that it can be distributed across an ensemble of processors. 相似文献

19.

Brain Derived Vision Algorithm on High Performance Architectures

Jayram Moorkanikara Nageswaran Andrew Felch Ashok Chandrasekhar Nikil Dutt Richard Granger Alex Nicolau Alex Veidenbaum 《International journal of parallel programming》2009,37(4):345-369

Even though computing systems have increased the number of transistors, the switching speed, and the number of processors, most programs exhibit limited speedup due to the serial dependencies of existing algorithms. Analysis of intrinsically parallel systems such as brain circuitry have led to the identification of novel architecture designs, and also new algorithms than can exploit the features of modern multiprocessor systems. In this article we describe the details of a brain derived vision (BDV) algorithm that is derived from the anatomical structure, and physiological operating principles of thalamo-cortical brain circuits. We show that many characteristics of the BDV algorithm lend themselves to implementation on IBM CELL architecture, and yield impressive speedups that equal or exceed the performance of specialized solutions such as FPGAs. Mapping this algorithm to the IBM CELL is non-trivial, and we suggest various approaches to deal with parallelism, task granularity, communication, and memory locality. We also show that a cluster of three PS3s (or more) containing IBM CELL processors provides a promising platform for brain derived algorithms, exhibiting speedup of more than 140 × over a desktop PC implementation, and thus enabling real-time object recognition for robotic systems. 相似文献

20.

A parallel execution model of logic programs

Chen A.C. Wu C.-I. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(1):79-92

A parallel-execution model that can concurrently exploit AND and OR parallelism in logic programs is presented. This model employs a combination of techniques in an approach to executing logic problems in parallel, making tradeoffs among number of processes, degree of parallelism, and combination bandwidth. For interpreting a nondeterministic logic program, this model (1) performs frame inheritance for newly created goals, (2) creates data-dependency graphs (DDGs) that represent relationships among the goals, and (3) constructs appropriate process structures based on the DDGs. (1) The use of frame inheritance serves to increase modularity. In contrast to most previous parallel models that have a large single process structure, frame inheritance facilitates the dynamic construction of multiple independent process structures, and thus permits further manipulation of each process structure. (2) The dynamic determination of data dependency serves to reduce computational complexity. In comparison to models that exploit brute-force parallelism and models that have fixed execution sequences, this model can reduce the number of unification and/or merging steps substantially. In comparison to models that exploit only AND parallelism, this model can selectively exploit demand-driven computation, according to the binding of the query and optional annotations. (3) The construction of appropriate process structures serves to reduce communication complexity. Unlike other methods that map DDGs directly onto process structures, this model can significantly reduce the number of data sent to a process and/or the number of communication channels connected to a process 相似文献