Similar Documents
20 similar documents found.
1.
2.
There is at present a worldwide effort to overcome the technological barriers to nanoelectronics. Microscopic simulation can significantly enhance our understanding of the physics of nanoscale structures, and constitutes a valuable tool for designing nanoelectronic functional devices. In nanodevices, novel physical effects are used to attain logic functionality that conventional technology cannot achieve. It is therefore necessary to develop quantum-transport simulation methods which include these novel physical effects. Moreover, simulating realistic nanodevices requires enormous computing resources, necessitating parallel supercomputing. In this paper, we present massively parallel algorithms for simulating large-scale nanoelectronic networks based on the single-electron tunneling effect, which is arguably the quantum effect of greatest significance to nanoelectronic technology. A MIMD implementation of our simulation algorithm is carried out on a 64-processor nCUBE 2, and a SIMD implementation on a 16,384-processor MasPar MP-1. By exploiting massive parallelism, both parallel implementations achieve very high parallel efficiency and nearly linear scalability. As a result, we are able to simulate large-scale nanoelectronic networks within a reasonable time, which would be impractical on conventional workstations.

3.
Earlier approaches to executing the generalized alternative/repetitive commands of Communicating Sequential Processes (CSP) attempt the selection of guards in sequential order, and these implementations are based on either shared-memory or message-passing multiprocessor systems. In contrast, we propose an implementation of generalized guarded commands using the data-driven model of computation. A significant feature of our implementation is that it attempts the selection of the guards of a process in parallel. We prove that our implementation is faithful to the semantics of the generalized guarded commands. Further, we have simulated the implementation using discrete-event simulation and measured various performance parameters. The measured parameters help establish the fairness of our implementation and its superiority, in terms of efficiency and the parallelism exploited, over other implementations. The simulation study also identifies various issues that affect the performance of our implementation. Based on this study, we propose an adaptive algorithm which dynamically tunes the extent of parallelism in the implementation to achieve an optimum level of performance. The first author's work was supported by a MICRONET (Network Centers of Excellence) research grant. Support for the second author is from an NSERC (Canada) grant. The last author's work was supported by grants from NSERC (Canada) and FCAR (Quebec).
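As a rough illustration of the core idea of attempting guard selection in parallel (a minimal sketch, not the paper's data-driven implementation — all names here are hypothetical), the following C++ fragment evaluates the boolean guards of an alternative command concurrently and commits to the first one found ready via an atomic compare-exchange:

    #include <atomic>
    #include <functional>
    #include <iostream>
    #include <thread>
    #include <vector>

    // Evaluate all guards concurrently; the first thread whose guard
    // holds wins the race, and only that branch would be executed.
    int select_parallel(const std::vector<std::function<bool()>>& guards) {
        std::atomic<int> winner{-1};
        std::vector<std::thread> workers;
        for (int i = 0; i < (int)guards.size(); ++i) {
            workers.emplace_back([&, i] {
                int expected = -1;
                if (guards[i]())                                  // guard ready?
                    winner.compare_exchange_strong(expected, i);  // commit at most once
            });
        }
        for (auto& t : workers) t.join();
        return winner.load();   // -1 if no guard was ready
    }

    int main() {
        int x = 5;
        int chosen = select_parallel({
            [&] { return x > 0; },      // guard of branch 0
            [&] { return x % 2 == 0; }  // guard of branch 1
        });
        std::cout << "selected branch: " << chosen << "\n";
    }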

4.
High-throughput implementations of neural network models are required to transfer the technology from small prototype research problems to large-scale "real-world" applications. The flexibility of these implementations in accommodating modifications to the neural network computation and structure is of paramount importance. The performance of many implementation methods today depends greatly on the density and interconnection structure of the neural network model being implemented. A principal contribution of this paper is to demonstrate an implementation method which exploits the maximum amount of parallelism in the neural computation, without enforcing stringent conditions on the neural network interconnection structure, to achieve high implementation efficiency. We propose a new reconfigurable parallel processing architecture, the Dynamically Reconfigurable Extended Array Multiprocessor (DREAM) machine, and an associated mapping method for implementing neural networks with regular interconnection structures. Details of the system execution rate calculation as a function of the neural network structure are presented. Several example neural network structures are used to demonstrate the efficiency of our mapping method and of the DREAM machine architecture on diverse interconnection structures. We show that, due to its reconfigurable nature, the DREAM machine can efficiently exploit most of the available parallelism of neural networks.

5.
The Cellular Potts Model (CPM) is a lattice-based modeling technique used for simulating cellular structures in computational biology. The computational complexity of the model means that current serial implementations restrict the size of simulations to a level well below biological relevance. Parallelization on computing clusters enables scaling up the size of the simulation but only marginally improves computational speed, due to the limited memory bandwidth between nodes. In this paper we present new data-parallel algorithms and data structures for simulating the Cellular Potts Model on graphics processing units. Our implementation handles most terms in the Hamiltonian, including the cell–cell adhesion constraint, cell volume constraint, cell surface area constraint, and cell haptotaxis. We use fine-level checkerboards with lock mechanisms based on atomic operations to enable consistent updates while maintaining a high level of parallelism. A new data-parallel memory allocation algorithm has been developed to handle cell division. Tests show that our implementation enables simulations of more than 10^6 cells on lattices of up to 256^3 sites on a single graphics card. Benchmarks show that our implementation runs about 80× faster than serial implementations, and about 5× faster than previous parallel implementations on computing clusters of 25 nodes. The wide availability and economy of graphics cards mean that our techniques will enable simulation of realistically sized models at a fraction of the time and cost of previous implementations, and are expected to greatly broaden the scope of CPM applications.
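A minimal CPU sketch of the checkerboard idea (the paper's version runs as GPU kernels with atomic lock flags): sites of one parity can be updated concurrently because their neighborhoods do not overlap. The lattice contents, energy change, and update rule below are simplified placeholders, not the CPM Hamiltonian:

    #include <cmath>
    #include <iostream>
    #include <random>
    #include <vector>

    // Toy 2D lattice of cell ids; one Monte Carlo sweep in two
    // checkerboard phases so that concurrently updated sites never
    // share a neighborhood (the GPU version assigns sites to threads).
    int main() {
        const int N = 64;
        std::vector<int> lattice(N * N, 0);
        std::mt19937 rng(42);
        std::uniform_real_distribution<double> uni(0.0, 1.0);

        for (int parity = 0; parity < 2; ++parity) {
            // In the GPU version this loop body is a kernel launch; here it
            // is sequential, but every (x, y) of a given parity is
            // independent and could be updated in parallel.
            for (int y = 0; y < N; ++y)
                for (int x = (y + parity) % 2; x < N; x += 2) {
                    int src  = lattice[y * N + x];
                    int nx   = (x + 1) % N;             // copy attempt from a neighbor
                    int cand = lattice[y * N + nx];
                    double dH = (cand != src) ? 1.0 : 0.0;  // placeholder energy change
                    if (dH <= 0 || uni(rng) < std::exp(-dH))
                        lattice[y * N + x] = cand;      // Metropolis acceptance
                }
        }
        std::cout << "sweep done\n";
    }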

6.
Task-based libraries such as Intel's Threading Building Blocks (TBB) are promising tools that help programmers develop parallel code productively, thanks to high-level constructs which simplify the chore of efficiently exploiting system resources. In this paper we focus on one type of task parallelism, pipeline parallelism, which is becoming an increasingly popular parallel programming pattern for streaming applications in domains such as digital signal processing, graphics, compression, and encryption. TBB provides a high-level template for expressing pipeline parallelism, but it is limited to simple pipeline structures. We address non-trivial parallel pipeline structures in which one or more stages have more items leaving than arriving, a case for which the current TBB pipeline template provides no support. In this work, we describe a new Multioutput filter that we have incorporated into the TBB pipeline framework to handle these multioutput stages. Using real-world streaming applications from different computational domains (dedup and scenerecog), we compare the performance of our implementation based on the Multioutput filter against more complex TBB task-based implementations that use only the standard filters. We also develop analytical models for each implementation to better understand resource utilization in each case. Performance evaluation and analysis show that the Multioutput-based implementation outperforms the other solutions because it promotes finer task parallelism, which better suits the TBB task-stealing mechanism and thus better exploits the resources, and because it reduces the overheads related to memory and task management.
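The standard TBB template chains one-in/one-out filters; a common workaround for a stage that emits more items than it receives (absent a dedicated Multioutput filter like the one proposed here) is to pass all outputs downstream batched as a single token. A minimal sketch using the oneTBB API, with placeholder splitting logic:

    #include <tbb/parallel_pipeline.h>
    #include <iostream>
    #include <vector>

    int main() {
        int next = 0;
        tbb::parallel_pipeline(
            /*max_number_of_live_tokens=*/8,
            // Input stage: produce the integers 0..9.
            tbb::make_filter<void, int>(tbb::filter_mode::serial_in_order,
                [&](tbb::flow_control& fc) -> int {
                    if (next >= 10) { fc.stop(); return 0; }
                    return next++;
                }) &
            // "Multioutput" stage emulated by batching: one input item
            // becomes a vector of output items carried as one token.
            tbb::make_filter<int, std::vector<int>>(tbb::filter_mode::parallel,
                [](int x) { return std::vector<int>{x, x * 10, x * 100}; }) &
            // Output stage: consume each batch in order.
            tbb::make_filter<std::vector<int>, void>(tbb::filter_mode::serial_in_order,
                [](const std::vector<int>& batch) {
                    for (int v : batch) std::cout << v << ' ';
                    std::cout << '\n';
                }));
    }

The batching trick keeps the one-token-per-stage invariant of the standard template, at the cost of the coarser task granularity that the paper's Multioutput filter is designed to avoid.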

7.
Biology is inherently parallel. Models of biological systems and bio-inspired algorithms share this parallelism, although most are simulated on serial computers. Previous work created the systemic computer, a new model of computation designed to exploit many natural properties observed in biological systems, including parallelism. The approach has been proven through two existing implementations and many biological models and visualizations. To date, however, the systemic computer implementations have all been sequential simulations that do not exploit the true potential of the model. In this paper the first parallel implementation of systemic computation is introduced. The GPU Systemic Computation Architecture is the first implementation that enables parallel systemic computation, by exploiting the many cores available in graphics processors. Comparisons with the serial implementation when running two programs at different scales show that, as the number of systems increases, the parallel architecture is several hundred times faster than the existing implementations, making it feasible to investigate systemic models of more complex biological systems.

8.
Large production systems (rule-based systems) continue to suffer from extremely slow execution, which limits their utility in practical applications as well as in research settings. Most investigations into speeding up these systems have focused on match parallelism. These investigations have revealed that the total speed-up available from this source is insufficient to alleviate the problem of slow execution in large-scale production system implementations. In this paper, we focus on task-level parallelism, which is obtained by a high-level decomposition of the production system. Speed-ups obtained from task-level parallelism multiply with the speed-ups obtained from match parallelism. The vehicle for our investigation of task-level parallelism is SPAM, a high-level vision system implemented as a production system. SPAM is a mature research system, with a typical run requiring between 50,000 and 400,000 production firings. We report very encouraging speed-ups from task-level parallelism in SPAM: our parallel implementation shows near-linear speed-ups of over 12-fold using 14 processors, and points the way to substantial (50- to 100-fold) speed-ups. We present a characterization of task-level parallelism in production systems and describe our methodology for selecting and applying a particular approach to parallelizing SPAM. Additionally, we report the speed-ups obtained from the use of virtual shared memory. Overall, task-level parallelism has not received much attention in the literature. Our experience illustrates that it is potentially a very important tool for speeding up large-scale production systems.

9.
This paper presents and discusses a blocked parallel implementation of two- and three-dimensional versions of the Lattice Boltzmann Method. This method is used to represent and simulate fluid flows following a mesoscopic approach. Most traditional parallel implementations use simple data distribution strategies to parallelize the operations on the regular fluid data set; however, it is well known that block partitioning is usually better. Such a parallel implementation is discussed and its communication cost is established. Simulations of fluid flow across a cavity have also been used as a real-world case study to evaluate our implementation. For some data distributions, our blocked implementation achieves performance up to 31% better than non-blocked versions. This work thus shows that blocked parallel implementations can be used efficiently to reduce the parallel execution time of the method.
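The advantage of block partitioning comes from the surface-to-volume ratio: on a square process grid, the halo each subdomain must exchange per time step shrinks relative to strip partitioning. A small self-contained illustration of that communication-cost argument (an assumed simplified model counting interior edge sites only, not the paper's exact cost formula):

    #include <iostream>

    // Per-process halo size (lattice sites exchanged per step) for an
    // N x N lattice, ignoring corner sites.
    long strip_halo(long n)         { return 2 * n; }        // 1D strips: two full rows
    long block_halo(long n, long q) { return 4 * (n / q); }  // q x q blocks: four edges

    int main() {
        long n = 1024, q = 4;   // 16 processes: 1x16 strips vs. a 4x4 block grid
        std::cout << "strip partitioning: " << strip_halo(n)    << " sites/process\n";
        std::cout << "block partitioning: " << block_halo(n, q) << " sites/process\n";
        // Prints 2048 vs 1024: blocks halve the per-process boundary traffic here.
    }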

10.
Operations on basic data structures such as queues, priority queues, stacks, and counters can dominate the execution time of a parallel program, due both to their frequency and to their coordination and contention overheads. There are considerable performance payoffs in developing highly optimized, asynchronous, distributed, cache-conscious, parallel implementations of such data structures. Such implementations may employ a variety of tricks to reduce latencies and avoid serial bottlenecks, as long as the semantics of the data structure are preserved. The complexity of the implementation and the difficulty of reasoning about asynchronous systems increase concerns about possible bugs in the implementation. In this paper we consider postmortem, black-box procedures for testing whether a parallel data structure behaved correctly. We present the first systematic study of algorithms and hardness results for such testing procedures, focusing on queues, priority queues, stacks, and counters, under various important scenarios. Our results demonstrate the importance of selecting test data such that distinct values are inserted into the data structure (as appropriate). In such cases we present an O(n) time algorithm for testing linearizable queues, an O(n log n) time algorithm for testing linearizable priority queues, and an O(np^2) time algorithm for testing sequentially consistent queues, where n is the number of data structure operations and p is the number of processors. In contrast, we show that testing such data structures on executions with arbitrary input values is NP-complete. Our results also help clarify the thresholds between scenarios that admit polynomial-time solutions and those that are NP-complete. Our algorithms are the first nontrivial algorithms for these problems.
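To give the flavor of why distinct values make testing tractable: with distinct values, every dequeue matches a unique enqueue, so FIFO legality of a serialized history can be checked in a single linear pass. The sketch below checks one sequential history only; it is far simpler than the paper's tester, which must also account for the real-time ordering of overlapping concurrent operations:

    #include <deque>
    #include <iostream>
    #include <vector>

    struct Op { char kind; int value; };   // 'E' = enqueue, 'D' = dequeue

    // O(n) check that a sequential history over distinct values is legal
    // FIFO behavior: each dequeue must return the value at the front of
    // the queue built from the preceding enqueues.
    bool is_legal_fifo(const std::vector<Op>& history) {
        std::deque<int> q;
        for (const Op& op : history) {
            if (op.kind == 'E') {
                q.push_back(op.value);
            } else {
                if (q.empty() || q.front() != op.value) return false;
                q.pop_front();
            }
        }
        return true;
    }

    int main() {
        std::vector<Op> ok  = {{'E',1},{'E',2},{'D',1},{'D',2}};
        std::vector<Op> bad = {{'E',1},{'E',2},{'D',2}};   // dequeues out of order
        std::cout << is_legal_fifo(ok) << ' ' << is_legal_fifo(bad) << '\n';  // 1 0
    }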

11.
Metric-space similarity search has proven suitable in a number of application domains, such as multimedia retrieval and computational biology, to name a few. These applications usually work on very large databases that are often indexed to speed up online searches. To achieve efficient throughput, it is essential to exploit the intrinsic parallelism of the respective search query processing algorithms. Many strategies have been proposed in the literature to parallelize these algorithms on either shared- or distributed-memory multiprocessor systems. Lately, GPUs have been used to implement brute-force parallel search strategies instead of index data structures, since indexing poses difficulties for the efficient exploitation of GPU resources. In this paper we propose single- and multi-GPU metric-space techniques that efficiently exploit GPU-tailored index data structures for parallel similarity search in large databases. The experimental results show that our proposal outperforms previous index-based sequential and OpenMP search strategies.
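As background on how any metric index prunes work, whether sequential, OpenMP, or GPU-based: a pivot table exploits the triangle inequality, so if |d(q,p) − d(x,p)| > r for some pivot p, object x cannot lie within radius r of query q and its full distance need not be computed. A generic pivot-filtering sketch over a toy one-dimensional metric (not the paper's GPU data structure):

    #include <cmath>
    #include <iostream>
    #include <vector>

    double dist(double a, double b) { return std::fabs(a - b); }  // toy 1-D metric

    int main() {
        std::vector<double> db     = {1.0, 4.0, 7.5, 9.0, 12.0};
        std::vector<double> pivots = {0.0, 10.0};

        // Precomputed pivot table: d(x, p) for every object and pivot.
        std::vector<std::vector<double>> table(db.size(),
                                               std::vector<double>(pivots.size()));
        for (size_t i = 0; i < db.size(); ++i)
            for (size_t j = 0; j < pivots.size(); ++j)
                table[i][j] = dist(db[i], pivots[j]);

        double q = 8.0, r = 1.0;   // range query
        for (size_t i = 0; i < db.size(); ++i) {
            bool pruned = false;
            for (size_t j = 0; j < pivots.size() && !pruned; ++j)
                // Triangle inequality gives a cheap lower bound on d(q, x).
                if (std::fabs(dist(q, pivots[j]) - table[i][j]) > r) pruned = true;
            if (!pruned && dist(q, db[i]) <= r)   // full distance only for survivors
                std::cout << db[i] << " is within r of q\n";
        }
    }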

12.
Particle swarm optimization (PSO), like other population-based meta-heuristics, is intrinsically parallel and can be effectively implemented on Graphics Processing Units (GPUs), which are, in fact, massively parallel processing architectures. In this paper we discuss possible approaches to parallelizing PSO on graphics hardware within the Compute Unified Device Architecture (CUDA™), a GPU programming environment by nVIDIA™ which supports the company's latest cards. In particular, two different ways of exploiting GPU parallelism are explored and evaluated. The execution speed of the two parallel algorithms is compared, on functions typically used as PSO benchmarks, with a standard sequential implementation of PSO (SPSO), as well as with recently published results of other parallel implementations. An in-depth study of the computational efficiency of our parallel algorithms is carried out by assessing speed-up and scale-up with respect to SPSO. We also report results on the optimization effectiveness of the parallel implementations with respect to SPSO, in cases where the parallel versions introduce possibly significant differences with respect to the sequential version.
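For reference, the core update that GPU parallelizations of PSO distribute over particles (or over particle–dimension pairs) is v ← wv + c1·r1·(pbest − x) + c2·r2·(gbest − x), followed by x ← x + v. A minimal sequential C++ sketch on the one-dimensional sphere function; the constants follow commonly used SPSO-style settings and are illustrative only:

    #include <iostream>
    #include <random>
    #include <vector>

    int main() {
        const int n = 32, iters = 100;
        const double w = 0.72, c1 = 1.49, c2 = 1.49;  // typical inertia/acceleration
        std::mt19937 rng(1);
        std::uniform_real_distribution<double> uni(0.0, 1.0);
        auto f = [](double x) { return x * x; };      // sphere benchmark function

        std::vector<double> x(n), v(n, 0.0), pbest(n);
        for (int i = 0; i < n; ++i) { x[i] = uni(rng) * 20 - 10; pbest[i] = x[i]; }
        double gbest = pbest[0];

        for (int t = 0; t < iters; ++t)
            for (int i = 0; i < n; ++i) {             // this loop is the parallel kernel
                v[i] = w * v[i] + c1 * uni(rng) * (pbest[i] - x[i])
                                + c2 * uni(rng) * (gbest - x[i]);
                x[i] += v[i];
                if (f(x[i]) < f(pbest[i])) pbest[i] = x[i];
                if (f(pbest[i]) < f(gbest)) gbest = pbest[i];
            }
        std::cout << "best value: " << f(gbest) << '\n';
    }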

13.
In this paper we describe a new parallel Frequent Itemset Mining algorithm called “Frontier Expansion.” This implementation is optimized to achieve high performance on a heterogeneous platform consisting of a shared-memory multiprocessor and multiple Graphics Processing Unit (GPU) coprocessors. Frontier Expansion is an improved data-parallel algorithm derived from the Equivalent Class Clustering (Eclat) method, in which a partial breadth-first search is utilized to exploit maximum parallelism while remaining within the available memory capacity. In our approach, the vertical transaction lists are represented in a “bitset” representation and operated on with wide bitwise operations across multiple threads on a GPU. We evaluate our approach using four NVIDIA Tesla GPUs and observe a 6–30× speedup relative to state-of-the-art sequential Eclat and FPGrowth implementations executed on a multicore CPU.
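The bitset representation reduces support counting to wide bitwise ANDs plus population counts, exactly the kind of regular operation GPUs execute across many threads. A CPU sketch of that core step with 64-bit words (C++20 for std::popcount; the GPU version applies the same operation with one thread per word):

    #include <bit>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Support of the itemset {A, B}: AND the two vertical transaction
    // bitsets word by word and count the surviving 1 bits.
    int support(const std::vector<uint64_t>& tidA, const std::vector<uint64_t>& tidB) {
        int count = 0;
        for (size_t w = 0; w < tidA.size(); ++w)        // each word is independent:
            count += std::popcount(tidA[w] & tidB[w]);  // one GPU thread per word
        return count;
    }

    int main() {
        // Transactions 0..127 as two words per item; bit t set means the
        // item occurs in transaction t (values here are arbitrary).
        std::vector<uint64_t> A = {0xFF00FF00FF00FF00ull, 0x0F0F0F0F0F0F0F0Full};
        std::vector<uint64_t> B = {0xFFFF0000FFFF0000ull, 0x00FF00FF00FF00FFull};
        std::cout << "support(AB) = " << support(A, B) << '\n';
    }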

14.
Parallel Computing Paradigms Supported by the p-HPF Compiler in a Cluster Environment
p-HPF is a parallel compilation system we have developed that conforms to the HPF (High Performance Fortran) specification; implementing multi-paradigm parallel computing around HPF is a foundation for developing large-scale parallel application systems. This paper first discusses parallel execution paradigms in a cluster environment, including the farm-parallel paradigm, pipeline parallelism, stream-loop parallelism, data parallelism, and combined data parallelism, and analyzes their performance in the abstract. It then presents methods for implementing several typical parallel paradigms using p-HPF's extrinsic procedure mechanism, its task-parallel mechanism, and typical parallel constructs such as FORALL and INDEPENDENT DO, gives example programs, runs them, and analyzes the results.

15.
In this paper we study the impact of the simultaneous exploitation of data- and task-parallelism, so-called mixed parallelism, on the Strassen and Winograd matrix multiplication algorithms. This work takes place in the context of Grid computing, and in particular in the Client–Agent(s)–Server(s) model, where data may already be distributed on the platform. For each of these algorithms, we propose two mixed-parallel implementations: the former follows the phases of the original algorithm, while the latter is the result of a list-scheduling algorithm. We give a theoretical comparison, in terms of memory usage and execution time, between our algorithms and classical data-parallel implementations. This analysis is corroborated by experiments. Finally, we give some hints about heterogeneous and recursive versions of our algorithms.
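To make the mixed-parallel structure concrete: Strassen replaces one n×n multiplication with seven half-size multiplications (a source of task parallelism), each of which can itself be computed data-parallel. A toy sketch with 2×2 scalars, running the seven products as independent tasks via std::async; a real implementation would recurse on n/2 × n/2 blocks instead:

    #include <future>
    #include <iostream>

    int main() {
        // 2x2 matrices of scalars; in Strassen proper these entries are
        // n/2 x n/2 blocks and each product is a recursive multiplication.
        double a11 = 1, a12 = 2, a21 = 3, a22 = 4;
        double b11 = 5, b12 = 6, b21 = 7, b22 = 8;

        // The seven half-size products, launched as independent tasks.
        auto m1 = std::async(std::launch::async, [=] { return (a11 + a22) * (b11 + b22); });
        auto m2 = std::async(std::launch::async, [=] { return (a21 + a22) * b11; });
        auto m3 = std::async(std::launch::async, [=] { return a11 * (b12 - b22); });
        auto m4 = std::async(std::launch::async, [=] { return a22 * (b21 - b11); });
        auto m5 = std::async(std::launch::async, [=] { return (a11 + a12) * b22; });
        auto m6 = std::async(std::launch::async, [=] { return (a21 - a11) * (b11 + b12); });
        auto m7 = std::async(std::launch::async, [=] { return (a12 - a22) * (b21 + b22); });

        double M1 = m1.get(), M2 = m2.get(), M3 = m3.get(), M4 = m4.get(),
               M5 = m5.get(), M6 = m6.get(), M7 = m7.get();

        // Recombination uses only additions (data-parallel in the block case).
        std::cout << M1 + M4 - M5 + M7 << ' ' << M3 + M5 << '\n'
                  << M2 + M4           << ' ' << M1 - M2 + M3 + M6 << '\n';
        // Expected output: 19 22 / 43 50.
    }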

16.
Continuing improvements in CPU and GPU performance, as well as increasing multi-core processor and cluster-based parallelism, demand flexible and scalable parallel rendering solutions that can exploit multipipe hardware-accelerated graphics. To achieve interactive visualization, scalable rendering systems are essential to cope with the rapid growth of data sets. However, parallel rendering systems are non-trivial to develop, and often only application-specific implementations have been proposed. The task of developing a scalable parallel rendering framework is even more difficult if it should be generic, supporting various types of data and visualization applications while working efficiently on a cluster with distributed graphics cards. In this paper we introduce a novel system called Equalizer, a toolkit for scalable parallel rendering based on OpenGL, which provides an application programming interface (API) for developing scalable graphics applications on a wide range of systems, from large distributed visualization clusters and multi-processor multipipe graphics systems to single-processor single-pipe desktop machines. We describe the system architecture and the basic API, discuss its advantages over previous approaches, and present example configurations, usage scenarios, and scalability results.

17.
Although studies of a number of parallel implementations of logic programming languages are now available, their results are difficult to interpret due to the multiplicity of factors involved, the effect of each of which is difficult to separate. In this paper we present the results of a high-level simulation study of or-parallelism and independent and-parallelism with a wide selection of Prolog programs, which aims to determine the intrinsic amount of parallelism independently of implementation factors, thus facilitating this separation. We expect this study to be instrumental in better understanding and comparing results from actual implementations, as shown by some examples provided in the paper. In addition, the paper examines some of the issues and tradeoffs associated with the combination of and- and or-parallelism and proposes reasonable solutions based on the simulation data obtained.

18.
Nowadays, it is common to find problems that require recognizing objects in an image, tracking them over time, or recognizing a complete real-world scene. One of the best-known and most widely used algorithms for solving these problems is the Speeded Up Robust Features (SURF) algorithm. SURF is a fast and robust detector of local features that are invariant to scale and rotation, which means it can be used for detecting and describing a set of points of interest (keypoints) in an image. Because of the importance of this algorithm and the rise of parallelism-based technologies, diverse parallel implementations of SURF have been proposed in recent years, based on very different techniques: the Compute Unified Device Architecture, OpenMP, OpenCL, and so on. We therefore consider a comparative study of all of them, highlighting the advantages and disadvantages of each parallel implementation, to be valuable; to the best of our knowledge, this article is the first attempt at such a study. For this study we have used the standard metrics and image collection in this field, as well as other metrics that are important in parallel computing, such as speedup and efficiency.
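For reference, the parallel metrics used in such comparisons are the standard definitions: speedup S = T_seq / T_par and efficiency E = S / p for p processing units. A trivial helper to compute them, with made-up timing numbers:

    #include <iostream>

    int main() {
        double t_seq = 12.0, t_par = 1.8;   // illustrative timings, in seconds
        int p = 8;                          // number of processing units
        double speedup    = t_seq / t_par;  // S = T_seq / T_par
        double efficiency = speedup / p;    // E = S / p (1.0 = perfect scaling)
        std::cout << "speedup = " << speedup
                  << ", efficiency = " << efficiency << '\n';
    }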

19.
20.
This paper describes our approach to adapting a text document similarity classifier based on the Term Frequency Inverse Document Frequency (TFIDF) metric to two massively multi-core hardware platforms. The TFIDF classifier is used to detect web attacks in HTTP data. In our parallel hardware approaches, we design streaming, real-time classifiers by simplifying the sequential algorithm and manipulating the classifier's model so that decision information can be represented compactly. Parallel implementations on the Tilera 64-core System-on-Chip and the Xilinx Virtex 5-LX FPGA are presented. For the Tilera, we employ a reduced state machine to recognize dictionary terms without requiring explicit tokenization, and achieve a throughput of 37 MB/s at slightly reduced accuracy. For the FPGA, we have developed a set of software tools to help automate the process of converting training data to synthesizable hardware and to provide a means of trading off accuracy against resource utilization. The Xilinx Virtex 5-LX implementation requires 0.2% of the memory used by the original algorithm. At 166 MB/s (80× the software version), the hardware implementation achieves Gigabit network throughput at the same accuracy as the original algorithm.
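The underlying metric: each document becomes a vector of weights w(t, d) = tf(t, d) · log(N / df(t)), and similarity is the cosine of the angle between vectors. The hardware versions compact this model; the sketch below is just the plain sequential definition, with a made-up toy corpus of HTTP-like terms:

    #include <cmath>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    using Doc = std::map<std::string, int>;   // term -> raw count

    int main() {
        std::vector<Doc> corpus = {
            {{"get", 2}, {"index", 1}},
            {{"get", 1}, {"admin", 3}, {"passwd", 2}},
            {{"post", 1}, {"admin", 1}},
        };
        // Document frequency of each term across the corpus.
        std::map<std::string, int> df;
        for (const Doc& d : corpus)
            for (const auto& [t, c] : d) if (c > 0) df[t]++;

        const double N = corpus.size();
        auto weight = [&](const Doc& d, const std::string& t) {
            auto it = d.find(t);
            double tf = (it == d.end()) ? 0.0 : it->second;
            return tf * std::log(N / df[t]);          // TFIDF weight
        };

        // Cosine similarity between documents 0 and 1.
        double dot = 0, n0 = 0, n1 = 0;
        for (const auto& [t, c] : df) {
            double w0 = weight(corpus[0], t), w1 = weight(corpus[1], t);
            dot += w0 * w1; n0 += w0 * w0; n1 += w1 * w1;
        }
        std::cout << "cosine = "
                  << dot / (std::sqrt(n0) * std::sqrt(n1) + 1e-12) << '\n';
    }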
