Similar Documents
20 similar documents found.
1.
Region growing is a general technique for image segmentation, where image characteristics are used to group adjacent pixels together to form regions. This paper presents a parallel algorithm for solving the region growing problem based on the split-and-merge approach, and uses it to test and compare various parallel architectures and programming models. The implementations were done on the Connection Machine, models CM-2 and CM-5, in the data parallel and message passing programming models. Randomization was introduced to break ties during merging, increasing the degree of parallelism, and only one- and two-dimensional arrays of data were used in the implementations.
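To make the split-and-merge structure concrete, here is a minimal serial Python sketch (our own illustration, not the paper's data-parallel CM code; the threshold, helper names, and toy image are assumptions): blocks are split while their intensity range exceeds a threshold, then edge-adjacent leaves with similar means are merged with a union-find.

    import numpy as np

    def split(img, y, x, size, thresh, leaves):
        # Recursively quarter a square block until its intensity range <= thresh.
        block = img[y:y+size, x:x+size]
        if size == 1 or block.max() - block.min() <= thresh:
            leaves.append((y, x, size))
            return
        h = size // 2
        for dy in (0, h):
            for dx in (0, h):
                split(img, y + dy, x + dx, h, thresh, leaves)

    def overlap(a0, a1, b0, b1):
        # 1-D intervals share positive length.
        return min(a1, b1) - max(a0, b0) > 0

    def merge(img, leaves, thresh):
        # Union-find merge of edge-adjacent leaves with similar mean intensity.
        parent = list(range(len(leaves)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        means = [img[y:y+s, x:x+s].mean() for (y, x, s) in leaves]
        for i, (yi, xi, si) in enumerate(leaves):
            for j in range(i + 1, len(leaves)):
                yj, xj, sj = leaves[j]
                touch_x = xi + si == xj or xj + sj == xi
                touch_y = yi + si == yj or yj + sj == yi
                adjacent = ((touch_x and overlap(yi, yi + si, yj, yj + sj)) or
                            (touch_y and overlap(xi, xi + si, xj, xj + sj)))
                if adjacent and abs(means[i] - means[j]) <= thresh:
                    parent[find(i)] = find(j)      # merge the two regions
        return [find(i) for i in range(len(leaves))]

    img = np.random.randint(0, 2, (8, 8)) * 100    # toy two-level image
    leaves = []
    split(img, 0, 0, 8, thresh=20, leaves=leaves)
    labels = merge(img, leaves, thresh=20)

The paper's contribution lies in running the merge phase in parallel, where the randomized tie-breaking keeps neighbouring merges from serializing.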

2.
A review of five distinct artificial neural network implementations on the Connection Machine is presented along with a brief discussion of the more general issues surrounding the implementation of artificial neural network models in parallel. The implementation which proves to be fastest on the Connection Machine is parallel in the training patterns and runs at more than 1300 million interconnects per second.

3.
In this paper we present two algorithms for the parallel solution of first-order linear recurrences. We show that the algorithms can be used to efficiently solve both scalar and blocked versions of the problem on vector and SIMD architectures. The first algorithm is a parallel approach whose resulting code can be explicitly vectorized, making it suitable for efficient execution on vector architectures such as the Cray 2. The second algorithm is a modified recursive approach designed to reduce the communication overhead encountered in SIMD architectures such as the Connection Machine 2 (CM-2). We present the performance exhibited by the parallel algorithm implementations on the Cray 2 and CM-2 for both scalar and blocked versions of the recurrence problem.
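For concreteness: a first-order linear recurrence has the form x[i] = a[i]*x[i-1] + b[i], and each step is an affine map t -> a*t + b. Because composing affine maps is associative, all prefixes can be computed with a scan. A short vectorized sketch of the recursive-doubling idea (our illustration, not the paper's Cray or CM-2 code):

    import numpy as np

    def solve_recurrence(a, b, x0):
        # x[i] = a[i] * x[i-1] + b[i].  Each element holds the affine map
        # composed so far; recursive doubling needs O(log n) vectorized steps.
        a = np.asarray(a, dtype=float).copy()
        b = np.asarray(b, dtype=float).copy()
        d = 1
        while d < len(a):
            na, nb = a.copy(), b.copy()
            na[d:] = a[d:] * a[:-d]              # compose with the map d back
            nb[d:] = a[d:] * b[:-d] + b[d:]
            a, b, d = na, nb, 2 * d
        return a * x0 + b                        # each x[i] as a map of x0

    print(solve_recurrence([2, 3, 0.5], [1, 0, 2], x0=1.0))   # [3.  9.  6.5]

The blocked version of the problem has the same structure, with matrices in place of the scalars a[i].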

4.
This paper develops a data-parallel implementation of the L-shaped decomposition algorithm for stochastic linear programs with recourse. The algorithm decomposes the problem into independent scenario subproblems that are solved concurrently. These subproblems are structurally identical and can be solved on SIMD machines, such as the Connection Machine CM-2. The coordinating master program is a dense linear program and is solved efficiently by spreading its non-zero coefficients among multiple processors and using dense linear algebra subroutines. The parallel solution of the master program removes the serial bottleneck of the algorithm. The resulting implementation achieves good speed-ups and is scalable. Numerical results on the Connection Machine CM-2 and comparisons with a benchmark control-parallel implementation are included.

5.
We discuss the transformation of image data from one level of representation to another using the data parallel programming model of the Connection Machine system. Emphasis is placed on maintaining locality of reference in order to take advantage of fast, local communications. Image pyramids illustrate the transformation of image-based representations. We review pointer jumping as a transformation from the image to a sequence-based representation, the primary representation of data outside of the image plane. Using communication primitives, especially segmented scans, we review utilities for representing and manipulating such sequences. We then compare several algorithms for matching and evidence accumulation. The techniques emphasize the use of sorting and sparse representations of space, in order to limit the computational requirements of high level vision.
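A minimal sketch of the segmented-scan primitive mentioned above (serial Python for clarity; on the Connection Machine this runs as a parallel primitive): flags mark the first element of each segment, and the running sum restarts at every flag.

    def segmented_scan(values, flags):
        # Inclusive plus-scan that restarts wherever flags[i] is set.
        out, total = [], 0
        for v, f in zip(values, flags):
            total = v if f else total + v
            out.append(total)
        return out

    # Three segments: [1, 2, 3], [4, 5], [6]
    print(segmented_scan([1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 1]))
    # [1, 3, 6, 4, 9, 6]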

6.
Finding the area and perimeter of the union/intersection of a set of iso-rectangles is a very important part of circuit extraction in VLSI design. We combine two techniques, the uniform grid and the vertex neighborhoods, to develop a new parallel algorithm for the area and perimeter problems which has average linear-time performance but is not worst-case optimal. The uniform grid technique is used to generate the candidate vertices of the union or intersection of the rectangles. An efficient point-in-rectangles inclusion test filters the candidate set to obtain the relevant vertices of the union or intersection. Finally, the vertex neighborhood technique is used to compute the mass properties from these vertices. The algorithm has an average time complexity of O(((n + k)/p) + log p), where n is the number of input rectangle edges, k the number of intersections, and p the number of processors, assuming a PRAM model of computation. An analysis of the algorithm on a SIMD architecture is also presented. The algorithm requires only simple data structures, which makes it easy to implement. We have implemented it on a Sun 4/280 workstation and a Connection Machine. The sequential implementation performs better than the optimal algorithm for large datasets, and the parallel implementation on a Connection Machine CM-2 with 32K processors also shows good results.
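The uniform-grid filtering step can be sketched briefly (a serial Python illustration under our own simplifications, not the CM-2 implementation): rectangles are bucketed into fixed-size cells, so a point-inclusion query only examines the candidate rectangles registered in its own cell.

    from collections import defaultdict

    def build_grid(rects, cell):
        # Bucket each rectangle (x0, y0, x1, y1) into every cell it overlaps.
        grid = defaultdict(list)
        for i, (x0, y0, x1, y1) in enumerate(rects):
            for gx in range(int(x0 // cell), int(x1 // cell) + 1):
                for gy in range(int(y0 // cell), int(y1 // cell) + 1):
                    grid[(gx, gy)].append(i)
        return grid

    def covered(p, rects, grid, cell):
        # Point-in-rectangles test that inspects only one cell's candidates.
        x, y = p
        return any(rects[i][0] <= x <= rects[i][2] and
                   rects[i][1] <= y <= rects[i][3]
                   for i in grid[(int(x // cell), int(y // cell))])

    rects = [(0, 0, 4, 3), (2, 1, 6, 5)]
    grid = build_grid(rects, cell=2.0)
    print(covered((3, 2), rects, grid, 2.0))   # True: inside both rectangles
    print(covered((7, 7), rects, grid, 2.0))   # False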

7.
A new parallel sorting algorithm, called parsort, suitable for implementation on tightly coupled multiprocessors is presented. The algorithm is based upon quicksort and two-way merging. An asynchronous parallel partitioning algorithm is used to distribute work evenly during merging to ensure a good load balance amongst processors, which is crucial for achieving high efficiency. The implementation of this parallel sorting algorithm exhibits theoretical and measured near-linear speed-up when compared to sequential quicksort. This is illustrated by the results of experiments carried out on the Sequent Balance 8000 multiprocessor.
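The overall shape of such a sort can be sketched in a few lines (our own Python approximation: sort chunks in parallel, then two-way merge; parsort's asynchronous parallel partitioning, which balances the merge work, is omitted here).

    from concurrent.futures import ProcessPoolExecutor
    from heapq import merge as merge2

    def parsort(data, p=4):
        chunk = (len(data) + p - 1) // p
        runs = [data[i:i+chunk] for i in range(0, len(data), chunk)]
        with ProcessPoolExecutor(max_workers=p) as ex:
            runs = list(ex.map(sorted, runs))      # parallel sort phase
        while len(runs) > 1:                       # two-way merge phase
            pairs = [runs[i:i+2] for i in range(0, len(runs), 2)]
            runs = [list(merge2(*pr)) if len(pr) == 2 else pr[0] for pr in pairs]
        return runs[0] if runs else []

    if __name__ == "__main__":
        import random
        xs = [random.random() for _ in range(10000)]
        assert parsort(xs) == sorted(xs)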

8.
Load balancing requirements in parallel image analysis are considered, and results on the performance of parallel implementations of two image feature extraction tasks on the Connection Machine and the iPSC/2 hypercube are reported and discussed. A load redistribution algorithm, which makes use of parallel prefix operations and one-to-one permutations among the processors, is described and used in both implementations. The expected improvement in performance resulting from load balancing is determined analytically and compared to actual performance results obtained from the above implementations. The analytical results demonstrate how the expected improvement depends on the computational and communication requirements of each task, characteristic machine parameters, a characterization of the prior load distribution in terms of parameters which can be computed dynamically at the start of task execution, and the overhead incurred by load redistribution.
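The parallel-prefix trick behind such redistribution is compact (a serial sketch with hypothetical names, not the Connection Machine or iPSC/2 code): an exclusive prefix sum over per-processor counts assigns every item a global rank, and dividing ranks by an even per-processor quota gives each item its destination processor.

    def redistribute(counts, p):
        # Exclusive prefix sum: global rank of each processor's first item.
        offsets = [0]
        for c in counts:
            offsets.append(offsets[-1] + c)
        quota = -(-offsets[-1] // p)     # ceil(total / p) items per processor
        return [[(offsets[proc] + i) // quota for i in range(c)]
                for proc, c in enumerate(counts)]

    # 4 processors with skewed loads 7, 1, 0, 4 (12 items, quota 3):
    print(redistribute([7, 1, 0, 4], 4))
    # [[0, 0, 0, 1, 1, 1, 2], [2], [], [2, 3, 3, 3]]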

9.
In this paper, we first describe a model for mapping the backpropagation artificial neural net learning algorithm onto a massively parallel computer architecture with a 2D-grid communications network. We then show how this model can be sped up by hypercube inter-processor connections that provide logarithmic-time segmented parallel prefix operations. This approach can serve as a general model for implementing algorithms for layered neural nets on any massively parallel computer that has a 2D-grid or hypercube communication network.

We have implemented this model on the Connection Machine CM-2, a general-purpose, massively parallel computer with a hypercube topology. Initial tests show that this implementation achieves about 180 million interconnections per second (IPS) for feed-forward computation and 40 million weight updates per second (WUPS) for learning. We use our model to evaluate this implementation, identifying which machine-specific features have improved performance and where further improvements can be made.


10.
One data-independent and one data-dependent algorithm for the computation of image histograms on parallel computers are presented, analysed and implemented on the Connection Machine system CM-2. The data-dependent algorithm has a lower communication bandwidth requirement because it transfers only bins with a non-zero count. Both algorithms perform all-to-all reduction, implemented as a sequence of exchanges defined by a butterfly network. The two algorithms are compared based on predicted and actual performance on the Connection Machine CM-2. With few pixels per processor, the data-dependent algorithm requires on the order of √B data transfers for B bins, compared with B data transfers for the data-independent algorithm. As the number of pixels per processor grows, this advantage of the data-dependent algorithm decreases; it increases with the number of bins in the histogram.
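The butterfly all-to-all reduction both algorithms share is easy to simulate (our sketch, ignoring the non-zero-bin compression of the data-dependent variant): with p = 2^k processors, stage s pairs processor i with processor i XOR 2^s, and after log2(p) exchange-and-add stages every processor holds the complete histogram.

    import numpy as np

    def butterfly_reduce(local):
        # local: one histogram per processor; len(local) must be a power of two.
        hist = [h.copy() for h in local]
        s = 1
        while s < len(hist):
            hist = [hist[i] + hist[i ^ s] for i in range(len(hist))]
            s *= 2
        return hist    # every "processor" now holds the full reduction

    parts = [np.random.randint(0, 5, size=8) for _ in range(4)]
    out = butterfly_reduce(parts)
    assert all((h == sum(parts)).all() for h in out)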

11.
Parallel Computing, 1997, 23(8): 1069-1088
We develop a scalable parallel implementation of the classical Benders decomposition algorithm for two-stage stochastic linear programs. Using a primal-dual, path-following algorithm to solve the scenario subproblems, we develop a parallel implementation that alleviates the difficulties of load balancing. Furthermore, the dual and primal step calculations can be implemented using a data-parallel programming paradigm. With this approach the code effectively utilizes both the multiple, independent processors and the vector units of the target architecture, the Connection Machine CM-5. The master program, usually the limiting factor, is solved very efficiently using the interior-point code LOQO on the front-end workstation. The implementation scales almost perfectly with problem and machine size. Extensive computational testing is reported with several large problems with up to 2 million constraints and 13.8 million variables.

12.
In this paper, we describe a massively parallel implementation of the Splitting Equilibration Algorithm using CM FORTRAN on the Thinking Machines CM-2 system. Numerical results using up to 32,768 (32K) processors on the Connection Machine CM-2 are presented for both input/output and social accounting matrix estimation problems, and compared with results obtained for the same problems on the IBM 3090. Our experiences with the relative ease and difficulty of the implementations on these fine-grain and coarse-grain parallel architectures are also presented and discussed.

13.
Efficient sorting is a key requirement for many computer science algorithms. Accelerating existing techniques and developing new sorting approaches is crucial for many real-time graphics scenarios, database systems, and numerical simulations, to name just a few. Sorting is one of the most fundamental operations for organizing and filtering the ever-growing massive amounts of data gathered on a daily basis. While optimal sorting models for serial execution on a single processor exist, efficient parallel sorting remains a challenge. In this paper, we present a hardware-optimized parallel implementation of the radix sort algorithm that results in a significant speedup over existing sorting implementations. We outperform all known graphics processing unit (GPU) based sorting systems by about a factor of two and eliminate restrictions on the sorting key space. This makes our algorithm not only the fastest, but also the first general GPU sorting solution.
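For reference, the digit-pass structure of least-significant-digit radix sort (a serial Python sketch; GPU implementations parallelize each pass's histogram and prefix-sum steps, and this is not the paper's kernel code):

    def radix_sort(keys, key_bits=32, radix_bits=8):
        # One stable counting pass per radix_bits-wide digit, low digit first.
        mask = (1 << radix_bits) - 1
        for shift in range(0, key_bits, radix_bits):
            buckets = [[] for _ in range(mask + 1)]
            for k in keys:
                buckets[(k >> shift) & mask].append(k)
            keys = [k for b in buckets for k in b]
        return keys

    import random
    xs = [random.randrange(1 << 32) for _ in range(1000)]
    assert radix_sort(xs) == sorted(xs)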

14.
External sorting, the process of sorting a file that is too large to fit into the computer's internal memory and must be stored externally on disks, is a fundamental subroutine in database systems [G], [IBM]. Of prime importance are techniques that use multiple disks in parallel in order to speed up the performance of external sorting. The simple randomized merging (SRM) mergesort algorithm proposed by Barve et al. [BGV] is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice. Knuth [K, Section 5.4.9] recently identified SRM (which he calls "randomized striping") as the method of choice for sorting with parallel disks. In this paper we present an efficient implementation of SRM, based upon novel and elegant data structures. We give a new implementation for SRM's lookahead forecasting technique for parallel prefetching and its forecast-and-flush technique for buffer management. Our techniques amount to a significant improvement in the way SRM carries out the parallel, independent disk accesses necessary to read blocks of input runs efficiently during external merging. Our implementation is based on synchronous parallel I/O primitives provided by the TPIE programming environment [TPI]; whenever our program issues an I/O read (write) operation, one block of data is synchronously read from (written to) each disk in parallel. We compare the performance of SRM over a wide range of input sizes with that of disk-striped mergesort (DSM), which is widely used in practice. DSM consists of a standard mergesort in conjunction with striped I/O for parallel disk access. SRM merges together significantly more runs at a time compared with DSM, and thus it requires fewer merge passes. We demonstrate in practical scenarios that even though the streaming speeds for merging with DSM are a little higher than those for SRM (since DSM merges fewer runs at a time), sorting using SRM is often significantly faster than with DSM (since SRM requires fewer passes). The techniques in this paper can be generalized to meet the load-balancing requirements of other applications using parallel disks, including distribution sort and multiway partitioning of a file into several other files. Since both parallel disk merging and multimedia processing deal with streams that get consumed at nonuniform and partially predictable rates, our techniques for lookahead based upon forecasting data may have relevance in video server applications.
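The forecasting rule itself is small; the sketch below is our reading of the classical technique, not the TPIE implementation: among runs that still have blocks on disk, the buffered block with the smallest last key is the one the merge will drain first, so that run's disk should be prefetched from next.

    def next_to_prefetch(buffers, on_disk):
        # buffers[i]: sorted keys of run i currently in memory
        # on_disk[i]: whether run i still has unread blocks on disk
        candidates = [(buf[-1], i) for i, buf in enumerate(buffers)
                      if buf and on_disk[i]]
        return min(candidates)[1] if candidates else None

    buffers = [[3, 9], [1, 4], [5, 7]]
    on_disk = [True, True, False]
    print(next_to_prefetch(buffers, on_disk))   # 1: its block (ending in 4) drains first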

15.
Graphics processing units (GPUs) offer parallel computing power that would otherwise require a cluster of networked computers or a supercomputer. While writing kernel code is fairly straightforward, achieving efficiency and performance requires very careful optimisation decisions and changes to the original serial algorithm. We introduce a parallel canonical ensemble Monte Carlo (MC) simulation that runs entirely on the GPU. In this paper, we describe two MC simulation codes of Lennard-Jones particles in the canonical ensemble: a single-CPU-core implementation and a parallel GPU implementation. Using the Compute Unified Device Architecture (CUDA), the parallel implementation enables the simulation of systems containing over 200,000 particles in a reasonable amount of time, which allows researchers to obtain more accurate simulation results. A remapping algorithm is introduced to balance the load across the device resources, and experimental results demonstrate that the efficiency of this algorithm is bounded by the available GPU resources. Our parallel implementation achieves an improvement of up to 15 times on a commodity GPU over our efficient single-core implementation for a system consisting of 256k particles, with the speedup increasing with the problem size. Furthermore, we describe our methods and strategies for optimising our implementation in detail.
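The core of a canonical-ensemble Lennard-Jones MC step fits in a short serial sketch (our NumPy illustration with made-up parameters; no periodic boundaries or cutoff, and a GPU code would parallelize the pair sums rather than recompute the total energy per move):

    import numpy as np

    def lj_energy(pos, eps=1.0, sigma=1.0):
        # Total energy: sum over pairs of 4*eps*((sigma/r)^12 - (sigma/r)^6).
        d = pos[:, None, :] - pos[None, :, :]
        r2 = (d ** 2).sum(-1)
        iu = np.triu_indices(len(pos), k=1)        # count each pair once
        inv6 = (sigma ** 2 / r2[iu]) ** 3
        return float(np.sum(4.0 * eps * (inv6 ** 2 - inv6)))

    def metropolis_step(pos, rng, beta=1.0, delta=0.1):
        # Displace one particle; accept with probability min(1, exp(-beta*dE)).
        # Recomputing the full O(N^2) energy is for clarity only; real codes
        # evaluate only the moved particle's interactions.
        trial = pos.copy()
        i = rng.integers(len(pos))
        trial[i] += rng.uniform(-delta, delta, size=pos.shape[1])
        dE = lj_energy(trial) - lj_energy(pos)
        return trial if dE <= 0 or rng.random() < np.exp(-beta * dE) else pos

    rng = np.random.default_rng(0)
    pos = rng.uniform(0.0, 5.0, size=(16, 3))
    for _ in range(100):
        pos = metropolis_step(pos, rng)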

16.
Narayanan P.J., Chen L.T., Davis L.S. Computer, 1992, 25(2): 68-73
The authors consider two examples from image understanding, focus-of-attention vision and contour image analysis, and present new parallel-processing methods that effectively support these types of computations. The research is a blend of theory and practice. On the one hand, the aim is to develop algorithms whose properties are well understood and can be formally related to key aspects of machine models. On the other hand, it is desired that the algorithms be easy to implement and practical in terms of their actual processing times on existing parallel machines. The experimental research was conducted on a 16,384-processor Connection Machine CM-2, and results of algorithm implementations on that machine are presented.

17.
In this contribution we report on a study of a very versatile neural network algorithm known as “Self-Organizing Feature Maps”, based on the earlier work of Kohonen [1,2]. In its original version, the algorithm addresses a fundamental issue of brain organization, namely how topographically ordered maps of sensory information can be formed by learning.

This algorithm is investigated for a large number of neurons (up to 16K) and for an input space of dimension d up to 900. To meet the computational demands, the algorithm was implemented on two parallel machines: a self-built Transputer systolic ring and a Connection Machine CM-2.

We present below:

(i) a simulation based on the feature map algorithm modelling part of the synaptic organization in the “hand region” of the somatosensory cortex;
(ii) a study of the influence of the dimension of the input space on the learning process;
(iii) a simulation of the extended algorithm, which explicitly includes lateral interactions; and
(iv) a comparison of the Transputer-based “coarse-grained” implementation of the model with the “fine-grained” implementation of the same system on the Connection Machine.
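A minimal sketch of one feature-map update (our serial NumPy illustration with made-up decay constants, not the Transputer or CM-2 code): find the best-matching unit, then pull every unit toward the input with a Gaussian neighbourhood that shrinks over time.

    import numpy as np

    def som_step(weights, x, t, sigma0=2.0, eta0=0.5, tau=1000.0):
        # weights: (rows, cols, dim) map; x: one input vector; t: step index.
        rows, cols, _ = weights.shape
        dist = ((weights - x) ** 2).sum(-1)
        w = np.unravel_index(dist.argmin(), dist.shape)   # best-matching unit
        gy, gx = np.mgrid[0:rows, 0:cols]
        grid_d2 = (gy - w[0]) ** 2 + (gx - w[1]) ** 2
        sigma = sigma0 * np.exp(-t / tau)              # shrinking neighbourhood
        eta = eta0 * np.exp(-t / tau)                  # decaying learning rate
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))[..., None]
        return weights + eta * h * (x - weights)

    rng = np.random.default_rng(0)
    weights = rng.uniform(size=(16, 16, 3))   # 16x16 map of 3-D feature vectors
    for t in range(2000):
        weights = som_step(weights, rng.uniform(size=3), t)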

18.
Parallel Computing, 1988, 9(1): 1-24
The Connection Machine is a massively parallel architecture with 65,536 single-bit processors and 32 Mbytes of memory, organized as a high-dimensional hypercube. A sophisticated router system provides efficient communication between remote processors. A rich software environment, including a parallel extension of COMMON LISP, provides access to the processors and network. Virtual processor capability extends the degree of fine-grained parallelism beyond 1,000,000. We describe the hardware and the parallel programming environment. We then present implementations of SOR, Multigrid and Conjugate Gradient algorithms for solving partial differential equations on the Connection Machine. Measurements of computational efficiency are provided, as well as an analysis of opportunities for achieving better performance. Despite the lack of floating-point hardware, computation rates above 100 Mflops have been achieved in PDE solution. Virtual processors prove to be a real advantage, easing the effort of software development while improving system performance significantly.
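For a flavour of the SOR kernel on such a machine, here is a tiny serial Python sketch (illustrative parameters, not the Connection Machine code): red-black ordering updates the two interleaved checkerboard colours alternately, so every point of one colour can be updated simultaneously, which is what makes the sweep data-parallel.

    import numpy as np

    def sor_redblack(u, f, h, omega=1.5, iters=100):
        # Red-black SOR for the 5-point Poisson stencil -laplace(u) = f.
        for _ in range(iters):
            for color in (0, 1):
                for i in range(1, u.shape[0] - 1):
                    for j in range(1, u.shape[1] - 1):
                        if (i + j) % 2 != color:
                            continue
                        gs = 0.25 * (u[i-1, j] + u[i+1, j] + u[i, j-1]
                                     + u[i, j+1] + h * h * f[i, j])
                        u[i, j] += omega * (gs - u[i, j])
        return u

    n = 33
    u = np.zeros((n, n))          # boundary held at zero
    f = np.ones((n, n))
    u = sor_redblack(u, f, h=1.0 / (n - 1))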

19.
We develop and experiment with a new parallel algorithm to approximate the maximum weight cut in a weighted undirected graph. Our implementation starts from the recent (serial) algorithm of Goemans and Williamson for this problem. We consider several different versions of this algorithm, varying the interior-point part of the algorithm in order to optimize the parallel efficiency of our method. Our work aims for an efficient, practical formulation of the algorithm with close-to-optimal parallelization. We analyze our parallel algorithm in the LogP model and predict linear speedup for a wide range of the parameters. We have implemented the algorithm using the Message Passing Interface (MPI) and run it on several parallel machines. In particular, we present performance measurements on the IBM SP2, the Connection Machine CM-5, and a cluster of workstations. We observe that the measured speedups are predicted well by our analysis in the LogP model. Finally, we test our implementation on several large graphs (up to 13,000 vertices), particularly on large instances of the Ising model.
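The rounding step at the heart of the Goemans-Williamson approach is easy to sketch (the SDP solve is omitted and a random unit embedding stands in for its solution, so this illustrates the rounding only):

    import numpy as np

    def hyperplane_round(V, W, trials=100, seed=0):
        # V: (n, d) unit vectors from the SDP relaxation; W: symmetric weights.
        # Cut by the sign of each vector's projection onto a random normal;
        # keep the best cut over several trials.
        rng = np.random.default_rng(seed)
        best_cut, best_side = -1.0, None
        for _ in range(trials):
            side = V @ rng.normal(size=V.shape[1]) >= 0
            cut = W[np.ix_(side, ~side)].sum()    # weight crossing the cut
            if cut > best_cut:
                best_cut, best_side = cut, side
        return best_cut, best_side

    # Toy 4-cycle with unit weights; the maximum cut has weight 4.
    W = np.zeros((4, 4))
    for i, j in [(0, 1), (1, 2), (2, 3), (3, 0)]:
        W[i, j] = W[j, i] = 1.0
    V = np.random.default_rng(1).normal(size=(4, 3))
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # stand-in for SDP vectors
    print(hyperplane_round(V, W)[0])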

20.
Split-and-merge segmentation is a popular region-based segmentation scheme, valued for its robustness and computational efficiency. It is, however, hard to run in real time on larger images or video frames because of its iterative, sequential data-flow pattern. A quad-tree data structure is quite popular for software implementations of the algorithm, but local parallelism is difficult to establish there because of the inherent data dependency between processes. In this paper, we propose a parallel split-and-merge algorithm that depends only on local operations. The algorithm is mapped onto a hierarchical cell network, which is a parallel version of the Locally Excitatory Globally Inhibitory Oscillatory Network (LEGION). Simulation results show that the proposed design is faster than standard split-and-merge implementations without compromising segmentation quality. The timing performance is demonstrated in a finite-state-machine-based VLSI implementation on VIRTEX-series FPGA platforms. We also show that, although the split-and-merge algorithm trails state-of-the-art algorithms slightly in segmentation quality, it outperforms those sophisticated and complex algorithms in computational speed. Good segmentation performance at minimal computational cost enables the proposed design to tackle real-time segmentation of live video streams. In this paper, we demonstrate live PAL video segmentation using a VIRTEX 5 FPGA. Moreover, we extend our design to HD resolution, for which the processing time is under 5 ms, a throughput of 200 frames per second.
