
1.

Real-time multi-camera video analytics system on GPU

Puren Guler Deniz Emeksiz Alptekin Temizel Mustafa Teke Tugba Taskaya Temizel 《Journal of Real-Time Image Processing》2016,11(3):457-472

In this article, a parallel implementation of a real-time intelligent video surveillance system on a Graphics Processing Unit (GPU) is described. The system is based on background subtraction and composed of motion detection, camera sabotage detection (moved camera, out-of-focus camera and covered camera detection), abandoned object detection, and object-tracking algorithms. As the algorithms have different characteristics, their GPU implementations have different speed-up rates. Test results show that when all the algorithms run concurrently, parallelization on the GPU makes the system up to 21.88 times faster than its central processing unit (CPU) counterpart, enabling real-time analysis of a higher number of cameras.

2.

High-Level Parallel Ant Colony Optimization with Algorithmic Skeletons

de Melo Menezes Breno A. Herrmann Nina Kuchen Herbert Buarque de Lima Neto Fernando 《International journal of parallel programming》2021,49(6):776-801

Parallel implementations of swarm intelligence algorithms such as the ant colony optimization (ACO) have been widely used to shorten the execution time when solving complex optimization problems. When aiming for a GPU environment, developing efficient parallel versions of such algorithms using CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of Algorithmic Skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g. map, fold and zip) that later on will be converted to efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain specific language Musket can cope with the development of a parallel implementation of ACO and how that compares to a low-level implementation. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with the parallelization aspects, Musket generates high performance code with similar execution times when compared to low-level implementations.
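As a rough illustration of the patterns named in this abstract (map, fold and zip), the following Python sketch composes them into one small computation. It is not Musket code and does not reproduce the paper's ACO implementation; the helper names (`skel_map`, `skel_zip`, `skel_fold`) and the toy tour-length example are assumptions made for illustration, and the sketch runs sequentially.

```python
from functools import reduce
from math import hypot

# Hypothetical sequential stand-ins for three common skeletons.
def skel_map(f, xs):
    """map: apply f to every element independently."""
    return [f(x) for x in xs]

def skel_zip(f, xs, ys):
    """zip: combine two equally long sequences element by element."""
    return [f(x, y) for x, y in zip(xs, ys)]

def skel_fold(f, init, xs):
    """fold: reduce a sequence to one value with a binary operator."""
    return reduce(f, xs, init)

def tour_length(cities, tour):
    """Length of a closed tour, written only in terms of the skeletons."""
    points = skel_map(lambda i: cities[i], tour)
    successors = points[1:] + points[:1]  # each point's next stop, wrapping around
    edges = skel_zip(lambda p, q: hypot(q[0] - p[0], q[1] - p[1]), points, successors)
    return skel_fold(lambda a, b: a + b, 0.0, edges)

cities = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
print(tour_length(cities, [0, 1, 2]))
```

Because the user-supplied functions have no side effects, a skeleton library is free to execute each pattern in parallel without changing the program text.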


3.

Methods for fast morphological image transforms using bitmapped binary images

《CVGIP: Graphical Models and Image Processing》1992,54(3):252-258

In this paper we present new implementations for morphological binary image processing on a general-purpose computer, using a bitmap representation of binary images instead of representing binary images as bitplanes inserted in gray value images. The bitmap data representation is a very efficient one, both in terms of memory requirements and in terms of algorithmic efficiency, because the CPU operates on 32 pixels in parallel. The algorithms described in this paper are capable of performing the basic morphological image transforms using structuring elements of arbitrary size and shape. In order to speed up morphological operations with respect to commonly used, large, convex structuring elements, the logarithmic decomposition of structuring elements is used. Experiments indicate that the new algorithms are more than 30 times faster for pixelwise operations and about an order of magnitude faster for the basic morphological transforms than the fastest known software implementations.
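The two ideas in this abstract, word-parallel bit operations and logarithmic decomposition, can be sketched for the simplest case of eroding one image row by a horizontal line. This is a minimal sketch under stated assumptions and not the paper's code: a row is stored in a Python integer (bit x is pixel x) rather than in 32-bit machine words, the structuring element's origin is its leftmost pixel, and the function names are hypothetical.

```python
def erode_row_naive(row, n):
    """Erode by a horizontal line of n pixels using n - 1 shift-and-AND steps."""
    out = row
    for k in range(1, n):
        out &= row >> k  # pixel x survives only if pixel x + k is also set
    return out

def erode_row_log(row, n):
    """Same result with about log2(n) steps by doubling the covered length."""
    out = row
    covered = 1  # `out` currently equals the erosion by a line of this length
    while covered * 2 <= n:
        out &= out >> covered  # two lines of length c combine into one of 2c
        covered *= 2
    if covered < n:
        out &= out >> (n - covered)  # overlap two lines to reach exactly n
    return out

row = 0b0111101110
print(bin(erode_row_log(row, 3)))
```

Each `&` and `>>` processes every pixel of the row at once, which is the source of the parallelism the abstract describes; the doubling loop reduces the number of such passes for long lines.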

4.

The V compiler: automatic hardware design

Berstis V. 《Design & Test of Computers, IEEE》1989,6(2):8-17


5.

Distributed shared abstractions (DSA) on multiprocessors

Clemencon C. Mukherjee B. Schwan K. 《IEEE transactions on pattern analysis and machine intelligence》1996,22(2):132-152

Any parallel program has abstractions that are shared by the program's multiple processes. Such shared abstractions can considerably affect the performance of parallel programs, on both distributed and shared memory multiprocessors. As a result, their implementation must be efficient, and such efficiency should be achieved without unduly compromising program portability and maintainability. The primary contribution of the DSA library is its representation of shared abstractions as objects that may be internally distributed across different nodes of a parallel machine. Such distributed shared abstractions (DSA) are encapsulated so that their implementations are easily changed while maintaining program portability across parallel architectures. The principal results presented are: a demonstration that the fragmentation of object state across different nodes of a multiprocessor machine can significantly improve program performance; and that such object fragmentation can be achieved without compromising portability by changing object interfaces. These results are demonstrated using implementations of the DSA library on several medium scale multiprocessors, including the BBN Butterfly, Kendall Square Research, and SGI shared memory multiprocessors. The DSA library's evaluation uses synthetic workloads and a parallel implementation of a branch and bound algorithm for solving the traveling salesperson problem (TSP).

6.

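Implementation of the digital ridgelet transform and an improved method (Total citations: 5; self-citations: 1; citations by others: 4)

Jia Jian Jiao Licheng 《Journal of Computer Research and Development (计算机研究与发展)》2006,43(1):115-119

The ridgelet transform is a new multiscale representation for functions in continuous space, but many problems with its discrete form remain unsolved. At present, the discrete ridgelet transform is mostly treated as the composition of a Radon transform and a wavelet transform, and the two steps are processed in turn. Drawing on the idea of the Bresenham algorithm from computer graphics, this work improves the efficiency of the Radon transform step. Compared with the earlier nearest-neighbor method, the approach is fast and accurate and allows perfect reconstruction. Numerical experiments show that, compared with the ridgelet transform implemented with the Z_p^2 method, image reconstruction, compression, and denoising with this method all improve markedly, laying a foundation for further research.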

7.

Introducing and Implementing the Allpairs Skeleton for Programming Multi-GPU Systems

Michel Steuwer Malte Friese Sebastian Albers Sergei Gorlatch 《International journal of parallel programming》2014,42(4):601-618

Algorithmic skeletons simplify software development: they abstract typical patterns of parallelism and provide their efficient implementations, allowing the application developer to focus on the structure of algorithms, rather than on implementation details. This becomes especially important for modern parallel systems with multiple graphics processing units (GPUs) whose programming is complex and error-prone, because state-of-the-art programming approaches like CUDA and OpenCL lack high-level abstractions. We define a new algorithmic skeleton for allpairs computations which occur in real-world applications, ranging from bioinformatics to physics. We develop the skeleton’s generic parallel implementation for multi-GPU systems in OpenCL. To enable the automatic use of the fast GPU memory, we identify and implement an optimized version of the allpairs skeleton with a customizing function that follows a certain memory access pattern. We use matrix multiplication as an application study for the allpairs skeleton and its two implementations and demonstrate that the skeleton greatly simplifies programming, saving up to 90% of lines of code as compared to OpenCL. The performance of our optimized implementation is up to 6.8 times higher than that of the generic implementation and is competitive with the performance of a manually written optimized OpenCL code.
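To make the allpairs pattern concrete, here is a sequential Python sketch in which a customizing function is applied to every pair of rows taken from two matrices, with matrix multiplication as the application. It only illustrates the pattern; it is not the authors' OpenCL implementation, and the names `allpairs`, `dot`, `transpose` and `matmul` are assumptions.

```python
def allpairs(f, A, B):
    """C[i][j] = f(A[i], B[j]) for every row of A paired with every row of B."""
    return [[f(a, b) for b in B] for a in A]

def dot(a, b):
    """Customizing function: pair up elements, multiply, then sum."""
    return sum(x * y for x, y in zip(a, b))

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """A x B is the allpairs of dot products between rows of A and columns of B."""
    return allpairs(dot, A, transpose(B))

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Every entry of the result is independent of the others, so a parallel implementation can distribute the (i, j) pairs across devices without changing the user-supplied function.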

8.

Real time FPGA implementation of a high speed and area optimized Harris corner detection algorithm

《Microprocessors and Microsystems》2021


9.

High performance FPGA-based image correlation

Almudena Lindoso Luis Entrena 《Journal of Real-Time Image Processing》2007,2(4):223-233

Image correlation is widely used for image and picture processing. Typical applications of image correlation are object location, image registration and sub-image similarity measurement. However, image correlation requires the comparison of a large number of sub-images implying a large computational effort that may prevent its use for real-time applications. On the other hand, correlation computation is very well suited for FPGA implementations. In this work we present efficient architectures for the implementation of Zero-Mean Normalized Cross-Correlation using FPGAs with application to image correlation. In particular, we compare the implementations of correlation in the spatial and spectral domains. Experimental results demonstrate that FPGAs improve performance by at least two orders of magnitude with respect to software implementations on a modern personal computer. This speed-up makes the performance of correlation computation suitable for real-time image processing. The proposed architectures have been applied to a correlation-based fingerprint-matching algorithm, demonstrating that real-time processing requirements can be well satisfied with an FPGA-based implementation.
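For readers unfamiliar with the measure, the sketch below computes Zero-Mean Normalized Cross-Correlation in the spatial domain and uses it for a brute-force template search. It is a plain software illustration of the kind of computation the abstract refers to, not the proposed FPGA architecture; the function names and the tiny test image are assumptions.

```python
from math import sqrt

def zncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    # A constant window has zero variance; report no correlation for it.
    return num / den if den else 0.0

def best_match(image, template):
    """Return ((row, col), score) of the window that best matches the template."""
    th, tw = len(template), len(template[0])
    flat_t = [v for row in template for v in row]
    best_pos, best_score = None, -2.0  # scores lie in [-1, 1]
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            window = [image[r + i][c + j] for i in range(th) for j in range(tw)]
            score = zncc(window, flat_t)
            if score > best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score

image = [[0, 0, 0, 0],
         [0, 1, 2, 0],
         [0, 3, 9, 0],
         [0, 0, 0, 0]]
print(best_match(image, [[1, 2], [3, 9]]))
```

The nested loops make the cost visible: every candidate position requires a full pass over the template, which is the large number of sub-image comparisons the abstract mentions.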

10.

On implementing motion-based Region of Interest detection on multi-core CELL

Avin Kumar Baoxin Li 《Computer Vision and Image Understanding》2010,114(11):1139-1151

Region of Interest (ROI) detection is a well-studied problem in computer vision for applications such as video surveillance and vision-based robotics. ROI detection may be done using background subtraction schemes with change detection and background estimation. When the camera is not static, these schemes will be ineffective and hence there is a need for global motion estimation (GME) to compensate for the camera motion. Robust GME algorithms often require high computation power, rendering them unsuitable for real-time, embedded vision applications. In this article, we use a multi-core processor platform – CELL, to meet the computational requirements of the ROI detection system and to explore the feasibility of using such a heterogeneous processor architecture for vision applications. In particular, we analyze the algorithmic components of a typical GME-based ROI detection system and show how to make efficient use of the parallel and vector computation capabilities in the CELL cores for maximizing the gain in speed. We have also ported our system to a Sony PS3 system and promising results have been achieved. Based on the study, various design aspects and implementation challenges are discussed which are believed to be useful for future work in porting vision algorithms to multi-core architectures for real-time embedded applications.

11.

The Area Method

Predrag Janičić Julien Narboux Pedro Quaresma 《Journal of Automated Reasoning》2012,48(4):489-532


12.

Programming with Divide-and-Conquer Skeletons: A Case Study of FFT

Gorlatch Sergei 《The Journal of supercomputing》1998,12(1-2):85-97

We demonstrate an approach to parallel programming, based on skeletons – parameterized program schemas with efficient implementations over diverse architectures. The contribution of the paper is two-fold: (1) we classify divide-and-conquer (DC) algorithms and provide a family of provably correct parallel implementations for a particular DC skeleton, called DH (distributable homomorphism); (2) we adjust the mathematical specification of the Fast Fourier Transform (FFT) to the DH skeleton and, thereby, obtain a generic SPMD program, well suited for implementation under MPI. The generic program includes the efficient FFT solutions used in practice – the binary-exchange and the 2D- and 3D-transpose implementations – as special cases.

13.

MRO-MPI: MapReduce overlapping using MPI and an optimized data exchange policy

Hisham Mohamed Stéphane Marchand-Maillet 《Parallel Computing》2013

MapReduce is a programming model proposed to simplify large-scale data processing. In contrast, the message passing interface (MPI) standard is extensively used for algorithmic parallelization, as it accommodates an efficient communication infrastructure. In the original implementation of MapReduce, the reduce function can only start processing following termination of the map function. If the map function is slow for any reason, this will affect the whole running time. In this paper, we propose MapReduce overlapping using MPI, which is an adapted structure of the MapReduce programming model for fast intensive data processing. Our implementation is based on running the map and the reduce functions concurrently in parallel by exchanging partial intermediate data between them in a pipeline fashion using MPI. At the same time, we maintain the usability and the simplicity of MapReduce. Experimental results based on three different applications (WordCount, Distributed Inverted Indexing and Distributed Approximate Similarity Search) show a good speedup compared to the earlier versions of MapReduce such as Hadoop and the available MPI-MapReduce implementations.

14.

Competing for the AC-Unification Race

Alexandre Boudet 《Journal of Automated Reasoning》1993,11(2):185-212

We describe our implementation of the unification algorithm for terms involving some associative-commutative operators plus free function symbols described by Boudet et al. The first goal of this implementation is efficiency, more precisely competing for the AC Unification Race. Although our implementation has been designed for good performance when applied to non-elementary AC-unification problems, it is also very efficient on elementary problems. Our implementation, written in C and running on Sun workstations, is to be compared with the implementations in LISP, on Symbolics LISP machines.

15.

Object and texture classification using higher order statistics (Total citations: 5; self-citations: 0; citations by others: 5)

Tsatsanis M.K. Giannakis G.B. 《IEEE transactions on pattern analysis and machine intelligence》1992,14(7):733-750

The problem of the detection and classification of deterministic objects and random textures in a noisy scene is discussed. An energy detector is developed in the cumulant domain by exploiting the noise insensitivity of higher order statistics. An efficient implementation of this detector is described, using matched filtering. Its performance is analyzed using asymptotic distributions in a binary hypothesis-testing framework. The object and texture discriminant functions are minimum distance classifiers in the cumulant domain and can be efficiently implemented using a bank of matched filters. They are immune to additive Gaussian noise and insensitive to object shifts. Important extensions, which can handle object rotation and scaling, are also discussed. An alternative texture classifier is derived from an ML viewpoint and is statistically efficient at the expense of complexity. The application of these algorithms to the texture-modeling problem is indicated, and consistent parameter estimates are obtained.

16.

Fast PRISM: Branch and Bound Hough Transform for Object Class Detection

Alain Lehmann Bastian Leibe Luc Van Gool 《International Journal of Computer Vision》2011,94(2):175-197


17.

QoS customization in distributed object systems

Jun He Matti A. Hiltunen Mohan Rajagopalan Richard D. Schlichting 《Software》2003,33(4):295-320

Applications built on networked collections of computers are increasingly using distributed object platforms such as CORBA, Java Remote Method Invocation (RMI), and DCOM to standardize object interactions. With this increased use comes the increased need for enhanced quality of service (QoS) attributes related to fault tolerance, security, and timeliness. This paper describes an architecture called CQoS (configurable QoS) for implementing such enhancements in a transparent, highly customizable, and portable manner. CQoS consists of two parts: application‐ and platform‐dependent interceptors and generic QoS components. The generic QoS components are implemented using Cactus, a system for building highly configurable protocols and services in distributed systems. The CQoS architecture and the interfaces between the different components are described, together with implementations of QoS attributes using Cactus and interceptors for CORBA and Java RMI. Experimental results are given for a test application executing on a Linux cluster using Cactus/J, the Java implementation of Cactus. Compared with other approaches, CQoS emphasizes portability across different distributed object platforms, while the use of Cactus allows custom combinations of fault‐tolerance, security, and timeliness attributes to be realized on a per‐object basis in a straightforward way. Copyright © 2003 John Wiley & Sons, Ltd.

18.

Nonblocking k-Compare-Single-Swap

Victor Luchangco Mark Moir Nir Shavit 《Theory of Computing Systems》2009,44(1):39-66

The current literature offers two extremes of nonblocking software synchronization support for concurrent data structure design: intricate designs of specific structures based on single-location operations such as compare-and-swap (CAS), and general-purpose multilocation transactional memory implementations. While the former are sometimes efficient, they are invariably hard to extend and generalize. The latter are flexible and general, but costly. This paper aims at a middle ground: reasonably efficient multilocation operations that are general enough to reduce the design difficulties of algorithms based on CAS alone. We present an obstruction-free implementation of an atomic k-location-compare single-location-swap (KCSS) operation. KCSS allows for simple nonblocking manipulation of linked data structures by overcoming the key algorithmic difficulty in their design: making sure that while a pointer is being manipulated, neighboring parts of the data structure remain unchanged. Our algorithm is efficient in the common uncontended case: A successful k-location KCSS operation requires only two CAS operations, two stores, and 2k noncached loads when there is no contention. We therefore believe our results lend themselves to efficient and flexible nonblocking manipulation of list-based data structures in today’s architectures. A preliminary version of this paper appeared in the Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 314–323, San Diego, California, USA, 2003.
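The following Python sketch shows only what a k-location-compare single-location-swap operation is meant to do, as a reference for its semantics. It deliberately uses a global lock, so it is blocking and is not the obstruction-free algorithm the paper presents; the `Cell` class, the `kcss` signature, and the convention that the first listed location is the one swapped are all assumptions of this sketch.

```python
import threading

_lock = threading.Lock()  # stands in for atomicity; the paper's algorithm uses no lock

class Cell:
    """A hypothetical shared memory location."""
    def __init__(self, value):
        self.value = value

def kcss(cells, expected, new_value):
    """If every cell holds its expected value, store new_value in cells[0].

    Returns True on success and False if any comparison fails.
    Only one location is ever written; the others are merely checked.
    """
    with _lock:
        if all(c.value == e for c, e in zip(cells, expected)):
            cells[0].value = new_value
            return True
        return False

# Update a node's next pointer only if its neighbor's state is unchanged.
next_ptr, neighbor_state = Cell("B"), Cell("live")
print(kcss([next_ptr, neighbor_state], ["B", "live"], "C"))
```

The usage line mirrors the difficulty the abstract names: a pointer is changed only if a neighboring part of the structure is still in the state the caller observed.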

19.

Parallel calculations on the CM-2: BBS test programs and spectral methods

Jean-Michel Mal 《Future Generation Computer Systems》1994,10(4):381-389

The implementation of the Because Benchmark Set programs on the Connection Machine CM-2 is presented and discussed. The characteristics of this machine have been taken into account during the BBS implementations, leading to somewhat different formulations for some of the BBS programs. These situations are detailed to outline the way the algorithms work.

In a second section, the generation of a Computational Fluid Dynamics software is described; the numerical method used (spectral method) is explained, in order to understand the algorithmic solutions that are proposed. The resulting code has then been run on an example showing mixing layer instability, and then compared with similar codes running on Convex and CRAY II vector supercomputers.

20.

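A CUDA-based conflict detection algorithm for tower simulators

Tang Kun Fei Xiangdong Ji Yulong Xu Wei 《Computer & Digital Engineering (计算机与数字工程)》2011,(10):85-88

Conflict detection for tower simulators is a time-consuming parallel algorithm. Because it overloads the CPU of the core server in a tower simulation system, this paper proposes, building on commonly used conflict detection algorithms, an implementation of tower-simulator conflict detection based on the Compute Unified Device Architecture (CUDA). The paper first introduces the architectural basis of GPU parallel computing and ports to the GPU a hierarchical conflict detection algorithm that uses Kalman-filter-based target tracking. It then compares the computational performance of a CPU and a GPU of the same price. Experimental results show that, compared with a CPU implementation of the same algorithm, the GPU implementation improves computational efficiency by a factor of 10 to 50. The scheme greatly reduces the load on the core server and qualitatively improves the performance of the tower simulator.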