首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
This paper describes an object recognition algorithm both on a sequential machine and on a single instruction multiple data (SIMD) parallel processor such as the MIT connection machine. The problem, in the way it is presently formulated on a sequential machine, is essentially a propagation of constraints through a tree of possibilities in an attempt to prune the tree to a small number of leaves. The tree can become excessively large, however, and so implementations on massively parallel machines are sought in order to speed up the problem. Two fast parallel algorithms are described here, a static algorithm and a dynamic algorithm. The static algorithm reformulates the problem by assigning every leaf in the completely expanded unpruned tree to a separate processor in the connection machine. Then pruning is done in nearly constant time by broadcasting constraints to the entire SIMD array. This parallel version is shown to run three to four orders of magnitude faster than the sequential version. For large recognition problems which would exceed the capacity of the machine, a dynamic algorithm is described which performs a series of loading and pruning steps, dynamically allocating and deallocating processors through the use of the connection machine's global router communications mechanism.  相似文献   

2.
We examine two schemes for parametric parallel simulation on SIMD supercomputers. In SIMD machines, the parallel processors execute a common instruction stream using local data-under the control of a front-end processor. In contrast to most parallel simulation approaches-which simulate a single system using multiple processors-we simulate distinct parametric variants at each processor. We extract some of the common computation embedded in these simulations and perform it on the front-end, leaving the rest to the parallel processors.The first simulation approach, which we call time synchronous, is essentially Vakili's standard clock. This approach generates a uniformized event process on the front-end processor which is thinned at each back-end processor based on local state information. The second scheme, which we call event synchronous, generates a standard Poisson process on the front-end, which is time-scaled and marked on the back-end processors.We develop a framawork for comparing these methods based on their simulated event rate (number of simulated events per real time unit). We show that the time synchronous method can be tuned to optimize the event rate for a given family of systems and we solve this optimal standard clock problem for several test cases. Finally we describe implementation issues peculiar to the SIMD architecture. Our focus is primarily on the M/M/1/K queue, but the methods extend to more general Jackson networks.  相似文献   

3.
To achieve maximum efficiency, modern embedded processors for media applications exploit single instruction multiple data (SIMD) instructions. SIMD instructions provide a form of vectorization where a large machine word is viewed as a vector of subwords and the same operation is performed on all subwords in parallel. Systematic usage of SIMD instructions can significantly improve program performance. With C becoming the dominant language for programming embedded devices, there is a clear need for C compilers that use SIMD instructions whenever appropriate. However, SIMD instructions typically require each memory access to be aligned with the instruction's data access size. Therefore an important problem in designing the compiler is to determine whether a C pointer is aligned, i.e. whether it refers to the beginning of a machine word. In this paper, we describe our SIMD generation algorithm and present an analysis method which determines the alignment of pointers at compile time. The alignment information is used to reduce the number of dynamic alignment checks and the overhead incurred by them. Our method uses an interprocedural analysis which propagates pointer alignment information in function bodies and through function calls. The effectiveness of our method is supported by experimental results which show that in typical programs the alignments of about 50% of the pointers can be statically determined. Copyright © 2006 John Wiley & Sons, Ltd.  相似文献   

4.
邻接矩阵算法在科学计算与信息处理方面有着极为重要的应用,是图论的基础研究之一。针对目前邻接矩阵算法多是基于串行,或并行SIMD模型而无法解决存储冲突的问题,提出一种基于SIMD-EREW共享存储模型的并行邻接矩阵算法。算法使用O(p)个并行处理单元,在O(n2/p)的时间内完成对n个数据点邻接矩阵的计算。将提出算法与现有算法进行的性能对比分析表明:本算法明显改进了现有文献的研究结果,是一种并行无存储冲突的邻接矩阵算法。  相似文献   

5.
Control architectures based on emotions are becoming promising solutions for the implementation of future robotic systems. The basic controllers of this architecture are the emotional processes that decide which behaviors the robot must activate to fulfill the objectives. The number of emotional processes increases (hundreds of millions/s) with the complexity level of the application, limiting the processing capacity of a main processor to solve the complex problems. Fortunately, the potential parallelism of emotional processes permits their execution in parallel, hence enabling the computing power to tackle the complex dynamic problems. In this paper, Graphic Processing Unit (GPU), multicore processors and single instruction multiple data (SIMD) instructions are used to provide parallelism for the emotional processes. Different GPUs, multicore processors and SIMD instruction sets are evaluated and compared to analyze their suitability to cope with robotic applications. The applications are set-up taking into account different environmental conditions, robot dynamics and emotional states. Experimental results show that, despite the fact that GPUs have a bottleneck in the data transmission between the host and the device, the evaluated GTX 670 GPU provides a performance of more than one order of magnitude greater than the initial implementation of the architecture on a single core. Thus, all complex proposed application problems can be solved using the GPU technology in contrast to the first prototype where only 55% of them could be solved. Using AVX SIMD instructions, the performance of the architecture is increased in 3.25 times in relation to the first implementation. Thus, from the 27 proposed applications about 88.8% are solved. In the case of the SSE SIMD instructions, the performance is almost doubled and the robot could solve about 74% of the proposed application problems. The use of AVX and SSE SIMD instructions provides almost the same performance as a quad- and a dual-core, respectively, with the advantage that they do not add any additional hardware cost.  相似文献   

6.
Data-parallel implementations of the computationally intensive task of solving multiple quadratic forms (MQFs) have been examined. Coupled and uncoupled parallel methods are investigated, where coupling relates to the degree of interaction among the processors. Also, the impact of partitioning a large MQF problem into smaller non-interacting subtasks is studied. Trade-offs among the implementations for various data-size/machine-size ratios are categorized in terms of complex arithmetic operation counts, communication overhead, and memory storage requirements. Furthermore, the impact on performance of the mode of parallelism used is considered, specifically, SIMD versus MIMD versus SIMD/MIMD mixed-mode. From the complexity analyses, it is shown that none of the algorithms presented in this paper is best for all data-size/machine-size ratios. Thus, to achieve scalability (i.e., good performance as the number of processors available in a machine increases), instead of using a single algorithm, the approach discussed is to have a set of algorithms from which the most appropriate algorithm or combination of algorithms is selected based on the ratio calculated from the scaled machine size. The analytical results have been verified by experiments on the MasPar MP-1 (SIMD), nCUBE 2 (MIMD), and PASM (mixed-mode) prototype.  相似文献   

7.
This paper presents count-sort, a parallel algorithm for mesh-connected computers to sort integers where the range of inputs is known. A straightforward counting technique that has not been implemented previously in parallel sorting algorithms is presented. On a mesh-connected computer with N×N processors we are able to sort N integers in the range 1 sN in timewhere cN is very small. For practical values of N, the algorithm is extremely fast. Further, it is possible to expand the range by a factor k to 1 N so that the slowdown is less than k.

We produce two implementations of count-sort on the SIMD MasPar MP-I with 8192 processors. The first sorts 8-bit integers, one per processor, significantly faster than the manufacturer's current library routine for sorting 8-bit integers. The second implementation is a fast version that sorts several elements per processor.  相似文献   

8.
邻接矩阵算法在科学计算与信息处理方面有着极为重要的应用,是图论的基础研究之一。针对目前邻接矩阵算法多是基于串行,或并行SIMD模型而无法解决存储冲突的问题,提出一种基于SIMD—EREW共享存储模型的并行邻接矩阵算法,算法使用O(p)个并行处理单元,在O(n^2/p)的时间内完成对n个数据点邻接矩阵的计算。将提出算法与现有算法进行的性能对比分析表明:本算法明显改进了现有文献的研究结果,是一种并行无存储冲突的邻接矩阵算法。  相似文献   

9.
An important issue for the efficient use of multiprocessor systems is the assignment of parallel processors to nested parallel loops. It is desirable for a processor assignment algorithm to be fast and always generate an optimal processor assignment. The paper proposes two efficient algorithms to decide the optimal number of processors assigned to each individual loop. Efficient parallel counterparts of these two algorithms are also presented. These algorithms not only always generate an optimal processor assignment, but also are much faster than the exiting optimal algorithm in the literature. The paper discusses improving the performance of parallel execution by transforming a nested parallel loop into a semantically equivalent one. Three loop transformations are investigated. It is observed that, in most cases, the parallel execution time is improved after applying these transformations  相似文献   

10.
Considers the use of massively parallel architectures to execute a trace-driven simulation of a single cache set. A method is presented for the least-recently-used (LRU) policy, which, regardless of the set size C, runs in time O(log N) using N processors on the EREW (exclusive read, exclusive write) parallel model. A simpler LRU simulation algorithm is given that runs in O(C log N) time using N/log N processors. We present timings of this algorithm's implementation on the MasPar MP-1, a machine with 16384 processors. A broad class of reference-based line replacement policies are considered, which includes LRU as well as the least-frequently-used (LFU) and random replacement policies. A simulation method is presented for any such policy that, on any trace of length N directed to a C line set, runs in O(C log N) time with high probability using N processors on the EREW model. The algorithms are simple, have very little space overhead, and are well suited for SIMD implementation  相似文献   

11.
Li  Kun  Li  Shigang  Huang  Shan  Chen  Yifeng  Zhang  Yunquan 《The Journal of supercomputing》2020,76(7):5501-5520

In the molecular dynamics simulation, an important step is the establishment of neighbor list for each particle, which involves the distance calculation for each particle pair in the simulation space. However, the distance calculation will cause costly floating-point operations. In this paper, we propose a novel algorithm, called Fast Neighbor List, which establishes the neighbor lists mainly using the bitwise operations. Firstly, we design a data layout, which uses an integer value to represent the three-dimensional coordinates of a particle. Then, a bunch of bitwise operations and two subtraction operations are used to judge whether the distance between a pair of particles is within the cutoff radius. We demonstrate that our algorithm can deal with the periodic boundary seamlessly. We also use single instruction multiple data (SIMD) instructions to further improve the performance. We implement our algorithm on Intel Xeon E5-2670, ARM v8, and Sunway many-core processors, respectively. Compared with the traditional method, our algorithm achieves on average 1.79x speedup on Intel Xeon E5-2670 processor, 3.43x speedup on ARM v8 processor, and 4.03x speedup on Sunway many-core processor. After using SIMD instructions, our algorithm achieves on average 2.64x speedup and 14.43x speedup on Intel Xeon E5-2670 and ARM v8 processors, respectively.

  相似文献   

12.
Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both single‐instruction multiple‐data (SIMD) instructions and thread‐level parallelism. In this paper, we propose a new high‐performance sorting algorithm, called aligned‐access sort (AA‐sort), that exploits both the SIMD instructions and thread‐level parallelism available on today's multicore processors. Our algorithm consists of two phases, an in‐core sorting phase and an out‐of‐core merging phase. The in‐core sorting phase uses our new sorting algorithm that extends combsort to exploit SIMD instructions. The out‐of‐core algorithm is based on mergesort with our novel vectorized merging algorithm. Both phases can take advantage of SIMD instructions. The key to high performance is eliminating unaligned memory accesses that would reduce the effectiveness of SIMD instructions in both phases. We implemented and evaluated the AA‐sort on PowerPC 970MP and Cell Broadband Engine platforms. In summary, a sequential version of the AA‐sort using SIMD instructions outperformed IBM's optimized sequential sorting library by 1.8 times and bitonic mergesort using SIMD instructions by 3.3 times on PowerPC 970MP when sorting 32 million random 32‐bit integers. Also, a parallel version of AA‐sort demonstrated better scalability with increasing numbers of cores than a parallel version of bitonic mergesort on both platforms. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

13.
The distance calculation in an image is a basic operation in computer vision, pattern recognition, and robotics. Several parallel algorithms have been proposed for calculating the Euclidean distance transform (EDT). Recently, Chen and Chuang proposed a parallel algorithm for computing the EDT on mesh-connected SIMD computers (1995). For an nxn image, their algorithm runs in O(n) time on a two-dimensional (2-D) nxn mesh-connected processor array. In this paper, we propose a more efficient parallel algorithm for computing the EDT on a reconfigurable mesh model. For the same problem, our algorithm runs in O(log(2)n) time on a 2-D nxn reconfigurable mesh. Since a reconfigurable mesh uses the same amount of VLSI area as a plain mesh of the same size does when implemented in VLSI, our algorithm improves the result in [3] significantly.  相似文献   

14.
The performance of conjugate gradient (CG) algorithms for the solution of the system of linear equations that results from the finite-differencing of the neutron diffusion equation was analyzed on SIMD, MIMD, and mixed-mode parallel machines. A block preconditioner based on the incomplete Cholesky factorization was used to accelerate the conjugate gradient search. The issues involved in mapping both the unpreconditioned and preconditioned conjugate gradient algorithms onto the mixed-mode PASM prototype, the SIMD MasPar MP-1, and the MIMD Intel Paragon XP/S are discussed. On PASM , the mixed-mode implementation outperformed either SIMD or MIMD alone. Theoretical performance predictions were analyzed and compared with the experimental results on the MasPar MP-1 and the Paragon XP/S. Other issues addressed include the impact on execution time of the number of processors used, the effect of the interprocessor communication network on performance, and the relationship of the number of processors to the quality of the preconditioning. Applications studies such as this are necessary in the development of software tools for mapping algorithms onto either a single parallel machine or a heterogeneous suite of parallel machines.  相似文献   

15.
SIMD (single instruction stream - multiple data stream) algorithms for one- and two-dimensional discrete Fourier transforms are presented. Parallel structurings of algorithms for efficient computation for a variety of machine size/problem size combinations are presented and analyzed. Through these algorithms, techniques for exploiting relationships between problem size and machine size are demonstrated. The algorithms are evaluated in terms of the number of arithmetic operations and interprocessor data transfers required. The ability of various interconnection networks presented in the literature to perform the needed transfers is examined. It is shown that the efficiency of a particular data distribution/algorithm decomposition approach is a function of the machine-size/problem-size relationship.  相似文献   

16.
非定常Monte Carlo输运问题的并行算法   总被引:1,自引:0,他引:1  
文中给出了非定常MonteCarlo(下文简写为MC)输运问题的并行算法 ,对并行程序的加载运行模式进行了讨论和优化设计 .针对MC并行计算设计了一种理想情况下无通信的并行随机数发生器算法 .动态MC输运问题有大量的I/O操作 ,特别是读取剩余粒子数据文件需要大量的I/O时间 ,文中针对I/O问题 ,提出了三种并行I/O算法 .最后给出了并行算法的性能测试结果 ,对比串行计算时间 ,使用 6 4台处理机时的并行计算时间缩短了 30倍  相似文献   

17.
In order to program SIMD (single instruction stream-multiple data stream) parallel machines used for tasks such as speech and image processing, a language with explicit parallel constructs is often desirable. The language Ada, developed by the Department of Defense, is used here as a basis for such a language. Extensions of Ada, which allow the user to specify such operations as interprocessor communications and activation of processors, are proposed. These features are demonstrated by showing their use in a common speech processing algorithm, the parallel FFT.  相似文献   

18.
IP core implementation of a self-organizing neural network   总被引:1,自引:0,他引:1  
This paper reports on the design issues and subsequent performance of a soft intellectual property (IP) core implementation of a self-organizing neural network. The design is a development of a previous 0.65-/spl mu/m single silicon chip providing an array of 256 neurons, where each neuron stores a 16 element reference vector. Migrating the design to a soft IP core presents challenges in achieving the required performance as regards area, power, and clock speed. This same migration, however, offers opportunities for parameterizing the design in a manner which permits a single soft core to meet the requirements of many end users. Thus, the number of neurons within the single instruction multiple data (SIMD) array, the number of elements per reference vector, and the number of bits of each such element are defined by synthesis time parameters. The construction of the SIMD array of neurons is presented including performance results as regards power, area, and classifications per second . For typical parameters (256 neurons with 16 elements per reference vector) the design provides over 2 000 000 classifications per second using a mainstream 0.18-/spl mu/m digital process. A RISC processor, the array controller (AC), provides both the instruction stream and data to the SIMD array of neurons and an interface to a host processor. The design of this processor is discussed with emphasis on the control aspects which permit supply of a continuous instruction stream to the SIMD array and a flexible interface with the host processor.  相似文献   

19.
We present four polylog-time parallel algorithms for matching parentheses on an exclusive-read and exclusive-write (EREW) parallel random-access machine (PRAM) model. These algorithms provide new insights into the parentheses-matching problem. The first algorithm has a time complexity of O(log2 n) employing O(n/(log n)) processors for an input string containing n parentheses. Although this algorithm is not cost-optimal, it is extremely simple to implement. The remaining three algorithms, which are based on a different approach, achieve O(log n) time complexity in each case, and represent successive improvements. The second algorithm requires O(n) processors and working space, and it is comparable to the first algorithm in its ease of implementation. The third algorithm uses O(n/(log n)) processors and O(n log n) space. Thus, it is cost-optimal, but uses extra space compared to the standard stack-based sequential algorithm. The last algorithm reduces the space complexity to O(n) while maintaining the same processor and time complexities. Compared to other existing time-optimal algorithms for the parentheses-matching problem that either employ extensive pipelining or use linked lists and comparable data structures, and employ sorting or a linked list ranking algorithm as subroutines, the last two algorithms have two distinct advantages. First, these algorithms employ arrays as their basic data structures, and second, they do not use any pipelining, sorting, or linked list ranking algorithms  相似文献   

20.
A new parallel algorithm for transforming an arithmetic infix expression into a par se tree is presented. The technique is based on a result due to Fischer (1980) which enables the construction of the parse tree, by appropriately scanning the vector of precedence values associated with the elements of the expression. The algorithm presented here is suitable for execution on a shared memory model of an SIMD machine with no read/write conflicts permitted. It uses O(n) processors and has a time complexity of O(log2n) where n is the expression length. Parallel algorithms for generating code for an SIMD machine are also presented.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号