20 similar documents found; search took 15 ms.
1.
Amelia Fong Lochovsky, Annals of Mathematics and Artificial Intelligence, 1991, 4(1-2): 177-209
Recently a number of machine vision systems have been successfully implemented using pipeline architectures, and various new algorithms have been proposed. In this paper we propose a method of analysis of both time complexity and space complexity for algorithms using conventional general-purpose pipeline architectures. We illustrate our method by applying it to an algorithm schema for local window operations satisfying a property we define as decomposability. It is shown that the proposed algorithm schema and its analysis generalize previously published results. We further analyse algorithms implementing operators that are not decomposable. In particular, the complexities of several median-type operations are compared and the implications for algorithm choice are discussed. We conclude with discussions of space-time trade-offs and implementation issues.
This research was partially supported by a grant from the Natural Sciences and Engineering Research Council of Canada. Part of this work was done while the author was at the University of Guelph, Guelph, Ontario, Canada.
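To make decomposability concrete, here is a minimal sketch (Python with NumPy; the function names are illustrative and not from the paper). A k x k erosion decomposes into a horizontal and a vertical 1-D pass because min is associative, which is exactly the structure a pipeline stage can exploit; a k x k median admits no such split, which is why the median-type operators need separate analysis.

```python
import numpy as np

def erode_1d(row, k):
    """1-D erosion: running minimum over a window of width k."""
    n = len(row)
    out = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - k // 2), min(n, i + k // 2 + 1)
        out[i] = row[lo:hi].min()
    return out

def erode_2d_decomposed(img, k):
    """k x k erosion as two 1-D passes; valid because the min over the
    full window equals the min over row-minima (decomposability). The
    same trick fails for the median: the median of row-medians is not
    the window median."""
    rows = np.apply_along_axis(erode_1d, 1, img, k)   # horizontal pass
    return np.apply_along_axis(erode_1d, 0, rows, k)  # vertical pass
```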
2.
A taxonomy is presented that extends M.J. Flynn's (IEEE Trans. Comput., vol. C-21, no. 9, pp. 948-960, Sept. 1972), especially in the multiprocessor category. It is a two-level hierarchy in which the upper level classifies architectures based on the number of processors for data and for instructions and the interconnections between them. A lower level can be used to distinguish variants even more precisely; it is based on a state-machine view of processors. The author suggests why taxonomies are useful in studying architecture and shows how this applies to a number of modern architectures.
3.
A survey of QoS architectures
Over the past several years there has been a considerable amount of research within the field of quality-of-service (QoS) support for distributed multimedia systems. To date, most of the work has been within the context of individual architectural layers such as the distributed system platform, operating system, transport subsystem and network layers. Much less progress has been made in addressing the issue of overall end-to-end support for multimedia communications. In recognition of this, a number of research teams have proposed the development of QoS architectures which incorporate QoS-configurable interfaces and QoS-driven control and management mechanisms across all architectural layers. This paper examines the state of the art in the development of QoS architectures. The approach taken is to present QoS terminology and a generalized QoS framework for understanding and discussing QoS in the context of distributed multimedia systems. Following this, we evaluate a number of QoS architectures that have emerged in the literature.
4.
5.
We present a software library for numerically estimating first and second order partial derivatives of a function by finite differencing. Various truncation schemes are offered, resulting in corresponding formulas that are accurate to order O(h), O(h^2), and O(h^4), h being the differencing step. The derivatives are calculated via forward, backward and central differences. Care has been taken that only feasible points are used in the case where bound constraints are imposed on the variables. The Hessian may be approximated either from function or from gradient values. There are three versions of the software: a sequential version, an OpenMP version for shared memory architectures, and an MPI version for distributed systems (clusters). The parallel versions exploit the multiprocessing capability offered by computer clusters as well as modern multi-core systems; since the derivative computations are independent of one another, the speedup scales almost linearly with the number of available processors/cores.
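As an illustration of the truncation orders and step choices involved (a sketch of the standard central-difference formulas, not NDL's actual source), the O(h^2) formula pairs with a step of order eps^(1/3) and the O(h^4) five-point formula with a step of order eps^(1/5), balancing truncation against round-off error:

```python
EPS = 2.2e-16  # double-precision machine epsilon

def grad_central_o2(f, x, i):
    """O(h^2) central difference for df/dx_i; truncation ~ h^2 and
    round-off ~ EPS/h balance when h ~ EPS**(1/3)."""
    h = EPS ** (1.0 / 3.0) * max(1.0, abs(x[i]))
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    return (f(xp) - f(xm)) / (2.0 * h)

def grad_central_o4(f, x, i):
    """O(h^4) five-point formula; truncation ~ h^4 balances round-off
    when h ~ EPS**(1/5)."""
    h = EPS ** 0.2 * max(1.0, abs(x[i]))
    def fx(s):
        y = list(x)
        y[i] += s * h
        return f(y)
    return (fx(-2) - 8 * fx(-1) + 8 * fx(1) - fx(2)) / (12.0 * h)
```

Each call to f is independent of the others, which is why the gradient entries can be farmed out to OpenMP threads or MPI ranks with near-linear speedup.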
Program summary
Program title: NDL (Numerical Differentiation Library)
Catalogue identifier: AEDG_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDG_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 73 030
No. of bytes in distributed program, including test data, etc.: 630 876
Distribution format: tar.gz
Programming language: ANSI FORTRAN-77, ANSI C, MPI, OpenMP
Computer: Distributed systems (clusters), shared memory systems
Operating system: Linux, Solaris
Has the code been vectorised or parallelized?: Yes
RAM: The library uses O(N) internal storage, N being the dimension of the problem
Classification: 4.9, 4.14, 6.5
Nature of problem: The numerical estimation of derivatives at several accuracy levels is a common requirement in many computational tasks, such as optimization, solution of nonlinear systems, etc. The parallel implementation that exploits systems with multiple CPUs is very important for large scale and computationally expensive problems.
Solution method: Finite differencing is used with a carefully chosen step that minimizes the sum of the truncation and round-off errors. The parallel versions employ both OpenMP and MPI libraries.
Restrictions: The library uses only double precision arithmetic.
Unusual features: The software takes into account bound constraints, in the sense that only feasible points are used to evaluate the derivatives, and given the level of the desired accuracy, the proper formula is automatically employed.
Running time: Running time depends on the function's complexity. The test run took 15 ms for the serial distribution, 0.6 s for the OpenMP and 4.2 s for the MPI parallel distribution on 2 processors.
6.
A novel hierarchical architectural taxonomic system that appears to possess the desirable characteristics of a good taxonomic scheme is presented. The author focuses on the endoarchitecture, i.e. the logical structure, control and behavior of the integrated system of hardware components. The starting point for this system is D.B. Skillicorn's scheme (see ibid., vol.21, no.11, p.46-57, 1988). However, the system both extends and departs from Skillicorn's scheme, using formulas inspired by chemical notation to classify computer architectures in a way that provides both predictive power and explanatory capabilities.
7.
Particle filters are able to represent multi-modal beliefs but require a large number of particles in order to do so. The particle filter consists of three sequential steps: the sampling, the importance factor, and the resampling step. Each step processes every particle in order to acquire the final state estimation. A high number of particles leads to a high processing time, thus reducing the particle filter's usefulness for real-time embedded systems. Through parallelization, the processing time can be significantly reduced. However, the resampling step is not easily parallelizable since it requires the importance factor of each particle. In this work, a resampling scheme is proposed which uses virtual particles to solve the parallelization problem of the resampling component. Besides evaluating its performance against the multinomial resampling scheme, it is also implemented on a Xilinx Zynq-7000 FPGA.
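For contrast, a minimal sketch of the standard multinomial resampling scheme that the proposed method is evaluated against (Python with NumPy; names are illustrative). The cumulative sum couples every particle's weight, which is the global dependence that makes this step the parallelization bottleneck:

```python
import numpy as np

def multinomial_resample(particles, weights, rng=None):
    """Draw len(particles) indices i.i.d. with probability proportional
    to the importance weights; particles is assumed to be a NumPy array."""
    if rng is None:
        rng = np.random.default_rng()
    w = np.asarray(weights, dtype=float)
    cdf = np.cumsum(w / w.sum())              # depends on every weight
    idx = np.searchsorted(cdf, rng.random(len(particles)))
    return particles[idx]
```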
8.
With the popularity of parallel database machines based on the shared-nothing architecture, it has become important to find external sorting algorithms which lead to a load-balanced computation, i.e., balanced execution, communication and output. If during the course of the sorting algorithm each processor is equally loaded, parallelism is fully exploited. Similarly, balanced communication will not congest the network traffic. Since sorting can be used to support a number of other relational operations (joins, duplicate elimination, building indexes, etc.), data skew produced by sorting can further lead to execution skew at later stages of these operations. In this paper we present a load-balanced parallel sorting algorithm for shared-nothing architectures. It is a multiple-input multiple-output algorithm with four stages, based on a generalization of Batcher's odd-even merge. At each stage the n keys are evenly distributed among the p processors (i.e., there is no final sequential merge phase) and the distribution of keys between stages ensures against network congestion. There is no assumption made on the key distribution and the algorithm performs equally well in the presence of duplicate keys. Hence our approach always guarantees its performance, as long as n is greater than p^3, which is the case of interest for sorting large relations. In addition, processors can be added incrementally.
Recommended by: Patrick Valduriez
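A sequential sketch of the Batcher odd-even merge that the algorithm generalizes may help (Python; assumes the two inputs are sorted and of equal power-of-two length, and is not the paper's four-stage algorithm itself). Within each round all compare-exchanges are independent, which is what allows the keys to be spread evenly over the processors:

```python
def odd_even_merge(a, b):
    """Batcher's odd-even merge of two sorted lists of equal
    power-of-two length."""
    if len(a) + len(b) <= 2:
        return sorted(a + b)
    # recursively merge the even- and odd-indexed subsequences
    even = odd_even_merge(a[::2], b[::2])
    odd = odd_even_merge(a[1::2], b[1::2])
    # interleave, then one round of independent compare-exchanges
    out = [None] * (len(a) + len(b))
    out[::2], out[1::2] = even, odd
    for i in range(1, len(out) - 1, 2):
        if out[i] > out[i + 1]:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

print(odd_even_merge([1, 4, 5, 8], [2, 3, 6, 7]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```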
9.
William J. Dally, New Generation Computing, 1993, 11(3-4): 227-249
Advances in interconnection network performance and interprocessor interaction mechanisms enable the construction of fine-grain parallel computers in which the nodes are physically small and have a small amount of memory. This class of machines has a much higher ratio of processor to memory area and hence provides greater processor throughput and memory bandwidth per unit cost relative to conventional memory-dominated machines. This paper describes the technology and architecture trends motivating fine-grain architecture and the enabling technologies of high-performance interconnection networks and low-overhead interaction mechanisms. We conclude with a discussion of our experiences with the J-Machine, a prototype fine-grain concurrent computer.
10.
11.
There are two principal methods used to exploit the parallelism available on a parallel machine: the program to be executed can be optimized by hand, or the program can be automatically converted to parallel machine code by a compiler. The first method usually derives parallelism at the procedure level; a parallel program is written in a high-level language and typically has various modules executing in parallel. By contrast, the compiler methodically transforms the program into parallel code using various transformations, such as code movement. The automatic conversion of a program to parallel code is called compaction or parallelization. This paper describes the evolution of a new compaction program and presents a new algorithm for determining legal code movements. A simulator of the target architecture was used to estimate the execution times of a sample suite of programs before and after compaction. The results verify that substantial advantages arise from applying this compaction technique.
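The abstract does not spell out the legality test, but any algorithm for legal code movement must at least respect data dependences. A hedged sketch of the classical check (Python; Instr and the dependence names are illustrative, not the paper's notation):

```python
from collections import namedtuple

# an instruction, reduced to the sets of locations it reads and writes
Instr = namedtuple("Instr", ["reads", "writes"])

def can_move_above(instr, other):
    """Moving instr above other is legal only if no dependence between
    them is reversed: flow (other writes what instr reads), anti (instr
    writes what other reads), or output (both write the same location)."""
    return not ((other.writes & instr.reads) or    # flow dependence
                (instr.writes & other.reads) or    # anti dependence
                (instr.writes & other.writes))     # output dependence

# t = a + b cannot be hoisted above a = c * d (flow dependence on 'a')
print(can_move_above(Instr({"a", "b"}, {"t"}), Instr({"c", "d"}, {"a"})))
```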
12.
Parallel architectures involve communication, with the aim that each node quickly receives complete information. Several architectures have been proposed to overcome the problem of high communication and computation time complexity for transferring and receiving information. To reduce the complexity of such communication, we have implemented Linear Network Coding (LNC) in the parallel environment. To verify our approach, we have considered several parallel architectures for implementing the network coding approach and examined our results on these networks in a generic environment. We have formulated a standard approach for parallel networks, showing that with this approach the effect of faulty nodes, information size and communication complexity decreases exponentially with code length.
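As a sketch of what Linear Network Coding means operationally (Python; packets are modeled as integers treated as bit vectors, which is an assumption for illustration, not the paper's setup): each coded packet is a random GF(2) combination of the k source packets, and a receiver can decode from any k linearly independent coded packets by Gaussian elimination:

```python
import random

def encode(packets, k):
    """One coded packet: a random GF(2) combination of k source packets,
    returned with its coefficient vector. Addition over GF(2) is XOR."""
    coeffs = [random.randint(0, 1) for _ in range(k)]
    payload = 0
    for c, p in zip(coeffs, packets):
        if c:
            payload ^= p
    return coeffs, payload

def decode(coded, k):
    """Gauss-Jordan elimination over GF(2); assumes the coefficient
    vectors of the received packets span all k dimensions."""
    rows = [(c[:], p) for c, p in coded]
    for col in range(k):
        piv = next(i for i in range(col, len(rows)) if rows[i][0][col])
        rows[col], rows[piv] = rows[piv], rows[col]
        for i in range(len(rows)):
            if i != col and rows[i][0][col]:
                rows[i] = ([x ^ y for x, y in zip(rows[i][0], rows[col][0])],
                           rows[i][1] ^ rows[col][1])
    return [rows[i][1] for i in range(k)]

src = [0b10101010, 0b11110000, 0b00111100]
coded = [encode(src, 3) for _ in range(6)]   # redundancy tolerates losses
# decode(coded, 3) == src whenever 3 of the 6 coefficient vectors are independent
```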
13.
In this paper, a design methodology for synthesizing efficient parallel algorithms and VLSI architectures is presented. A design process starts with a problem definition specified in the parallel programming language Crystal and is followed by a series of program transformations in Crystal, each aiming at optimizing the target design for a specific purpose. To illustrate the design methodology, a set of design methods for deriving systolic algorithms and architectures is given and the use of these methods in the design of a dynamic programming solver is described. The design methodology, together with this particular set of design methods, can be viewed as a general theory of systolic designs (or multidimensional pipelines). The fact that Crystal is a general purpose language for parallel programming allows new design methods and synthesis techniques, properties and theorems about problems in specific application domains, and new insights into any given problem to be integrated readily within the existing design framework.
14.
We provide explicit upper bounds, optimal to within a constant factor, on the makespan of schedules for tree-structured programs on mesh arrays of processors, and provide polynomial-time algorithms to find schedules with makespan matching these bounds. In particular, we show how to find, in polynomial time, a (nonpreemptive) schedule for a binary tree dag with n unit execution time tasks and height h on a d-dimensional mesh array with m processors and links of unit bandwidth and unit propagation delay whose makespan is O(n/m + n^{1/(d+1)} + h), i.e., optimal within a constant factor. Further, we extend these schedules to bounded-degree forest dags with arbitrary positive integer execution time tasks and to meshes whose links all have an arbitrary positive integer propagation delay. Thus, we provide a polynomial-time approximation algorithm for an NP-hard problem, with a performance ratio that is a constant.
We also show how to schedule tree dags on any parallel architecture that satisfies certain natural, not very restrictive, conditions that are satisfied by most parallel architectures used in practice. Let ε be a fixed positive real number. We provide polynomial-time computable schedules for binary tree dags with n unit execution time tasks and height h, with g(n)n^{-ε} ≤ h ≤ g(n) log n, on any parallel architecture satisfying those conditions, with unit bandwidth and unit propagation delay links, with makespan O(g(n) + h), optimal up to a constant factor, where g is a function that depends only on that architecture. The number of processors used is optimal within a constant factor if h ≥ g(n)n^{-ε}, and is optimal within an O(log n) factor if h ≤ g(n) log n. As an example, for hypercube and complete binary tree architectures, we achieve makespan O(h), optimal within a constant factor, when h = Ω(log^2 n), using a number of processors that is optimal within an O(log n) factor. Further, we extend these schedules to the case of bounded-degree forest dags with tasks of arbitrary positive integer execution times and architectures whose links all have a given arbitrary positive integer propagation delay.
The second author was supported in part by the National Science Foundation under Grant CCR-9106062, and in part by the University of Maryland at College Park, Institute for Advanced Computer Studies.
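One way to read the mesh bound (an interpretive note, not from the paper's text): each of the three terms matches a simple lower bound, which is why the schedule is optimal within a constant factor.

```latex
\[
  T \;=\; O\!\left(
      \underbrace{\frac{n}{m}}_{\text{work bound}}
      \;+\; \underbrace{n^{1/(d+1)}}_{\text{communication bound}}
      \;+\; \underbrace{h}_{\text{critical path}}
  \right)
\]
% n/m: n unit-time tasks cannot finish faster on m processors;
% h:   a chain of h dependent tasks needs at least h steps;
% n^{1/(d+1)}: the region of a d-dimensional mesh reachable within t steps
%   contains O(t^d) processors, which can perform only O(t^{d+1}) work in
%   t steps, so completing n tasks forces t = \Omega(n^{1/(d+1)}).
```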
15.
Flag transformation, a new design concept for parallel associative memory and processor architectures, maps word-oriented data into flag-oriented data. A flag vector represents each word in a set. The flag position corresponds to the value of the transformed word, and all flags in a vector are processed simultaneously to obtain parallel operations. The results of complex search operations performed by modular, cascadable hardware components are also represented by flags and retransformed into word-oriented data. This transformation method allows parallel processing of associative or content-addressable data in uniprocessor architectures, expedites IC design rule checks, and accelerates complex memory tests. It can also be used to develop associative processor architectures and to emulate very fast, modular, cascadable artificial neural networks.
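A minimal sketch of the flag idea (Python; 8-bit words and the function names are assumptions for illustration): a set of words becomes one flag vector in which bit v is set iff value v is present, and a range search then touches all flags simultaneously with a single bitwise operation:

```python
def to_flags(words):
    """Flag transformation for 8-bit words: a 256-bit flag vector
    (a Python int) with bit v set iff value v occurs in the set."""
    flags = 0
    for w in words:
        flags |= 1 << w
    return flags

def range_search(flags, lo, hi):
    """Find every stored value in [lo, hi] with one AND against a
    precomputed mask; all flags are examined in parallel."""
    mask = ((1 << (hi - lo + 1)) - 1) << lo
    hits = flags & mask
    return [v for v in range(lo, hi + 1) if (hits >> v) & 1]

print(range_search(to_flags([3, 17, 200, 42]), 10, 50))  # [17, 42]
```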
16.
17.
18.
We have designed Particle-in-Cell algorithms for emerging architectures. These algorithms share a common approach, using fine-grained tiles, but different implementations depending on the architecture. On the GPU, there were two different implementations, one with atomic operations and one with no data collisions, using CUDA C and Fortran. Speedups up to about 50 compared to a single core of the Intel i7 processor have been achieved. There was also an implementation for traditional multi-core processors using OpenMP which achieved high parallel efficiency. We believe that this approach should also work for other emerging designs such as the Intel Phi coprocessor from the Intel MIC architecture.
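A hedged sketch of the tile idea in the collision-free variant (Python with NumPy, 1-D nearest-grid-point charge deposition; this is an illustration, not the paper's CUDA or OpenMP code). Because each tile writes only to its own slice of the grid, the loop over tiles has no data collisions and can run in parallel, e.g. one GPU thread block per tile:

```python
import numpy as np

def deposit_tiled(pos, grid_size, tile_size):
    """Deposit unit charges on a 1-D grid using fine-grained tiles;
    assumes 0 <= pos < grid_size and tile_size divides grid_size."""
    rho = np.zeros(grid_size)
    cell = pos.astype(int)          # grid cell containing each particle
    tile = cell // tile_size
    # iterations are independent: tile t touches only its own grid slice
    for t in range(grid_size // tile_size):
        mine = cell[tile == t] - t * tile_size
        rho[t * tile_size:(t + 1) * tile_size] += np.bincount(
            mine, minlength=tile_size)
    return rho

rho = deposit_tiled(np.random.default_rng().uniform(0, 64, 1000), 64, 8)
```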
19.
A software environment tailored to computer vision and image processing (CVIP) is presented; it focuses on how information about the CVIP problem domain can make the high-performance algorithms and sophisticated algorithmic techniques being designed by algorithm experts more readily available to CVIP researchers. The environment consists of three principal components: DISC, Cloner, and Graph Matcher. DISC (dynamic intelligent scheduling and control) supports experimentation at the CVIP task level by creating a dynamic schedule from a user's specification of the algorithms that constitute a complex task. Cloner is aimed at the algorithm development process and is an interactive system that helps a user design new parallel algorithms by building on and modifying existing library algorithms. Graph Matcher performs the critical step of mapping new algorithms onto the target parallel architecture. Initial implementations of DISC and Graph Matcher have been completed, and work on Cloner is in progress.
20.
Parallelization is necessary to cope with the high computational and communication demands of neuroapplications, but general purpose parallel machines soon reach performance limitations. The article explores two approaches: parallel simulation on general purpose computers, and simulation/emulation on neurohardware. Different parallelization methods are discussed, and the most popular techniques are explained. While the software approach looks for an optimal programming model for neural processing, the hardware approach tries to imitate the neuroparadigm using the best of silicon technology.