1.

Decomposition of convex polygonal morphological structuring elements into neighborhood subsets

Xu J. 《IEEE transactions on pattern analysis and machine intelligence》1991,13(2):153-162

A discussion is presented of the decomposition of convex polygon-shaped structuring elements into neighborhood subsets. Such decompositions will lead to efficient implementation of corresponding morphological operations on neighborhood-processing-based parallel image computers. It is proved that all convex polygons are decomposable. Efficient decomposition algorithms are developed for different machine structures. An O(1) time algorithm, with respect to the image size, is developed for the four-neighbor-connected mesh machines; a linear time algorithm for determining the optimal decomposition is provided for the machines that can quickly perform 3×3 morphological operations.
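
The principle behind such decompositions can be illustrated with a minimal sketch. It is not the paper's algorithm; it only shows that, because binary dilation is a Minkowski sum, dilating by a large element equals a chain of dilations by small neighborhood subsets. The helper names `minkowski_sum` and `square` are assumptions made for this illustration.

```python
def minkowski_sum(a, b):
    """Dilate point set a by structuring element b (sets of (row, col))."""
    return {(ar + br, ac + bc) for (ar, ac) in a for (br, bc) in b}


def square(radius):
    """Return the (2*radius+1) x (2*radius+1) square centred at the origin."""
    span = range(-radius, radius + 1)
    return {(r, c) for r in span for c in span}


# A 5x5 square is the dilation of a 3x3 square by itself.
three = square(1)
five_by_chaining = minkowski_sum(three, three)

# Dilating a (hypothetical) image by the 5x5 element in one step ...
image = {(0, 0), (2, 3), (7, 1)}
direct = minkowski_sum(image, square(2))
# ... gives the same result as two successive 3x3 dilations.
chained = minkowski_sum(minkowski_sum(image, three), three)
```

On a machine that performs 3×3 neighborhood operations quickly, the chained form replaces one 25-point operation with two 9-point operations.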

2.

Fast recursive grayscale morphology operators: from the algorithm to the pipeline architecture

Olivier Déforges Nicolas Normand Marie Babel 《Journal of Real-Time Image Processing》2013,8(2):143-152

3.

On Computing the Exact Euclidean Distance Transform on Rectangular and Hexagonal Grids

Andrew J.H. Mehnert Paul T. Jackway 《Journal of Mathematical Imaging and Vision》1999,11(3):223-230

In this paper we prove an equivalence relation between the distance transform of a binary image, where the underlying distance is based on a positive definite quadratic form, and the erosion of its characteristic function by an elliptic poweroid structuring element. The algorithms devised by Shih and Mitchell [18] and Huang and Mitchell [7], for calculating the exact Euclidean distance transform (EDT) of a binary digital image manifested on a square grid, are particular cases of this result. The former algorithm uses erosion by a circular cone to calculate the EDT whilst the latter uses erosion by an elliptic paraboloid (which allows for pixel aspect ratio correction) to calculate the square of the EDT. Huang and Mitchell's algorithm [7] is arguably the better of the two because: (i) the structuring element can be decomposed into a sequence of dilations by 3 × 3 structuring elements (a similar decomposition is not possible for the circular cone) thus reducing the complexity of the erosion, and (ii) the algorithm only requires integer arithmetic (it produces squared distance). The algorithm is amenable to both hardware implementation using a pipeline architecture and efficient implementation on serial machines. Unfortunately the algorithm does not directly transpose to, nor has a corresponding analogue on, the hexagonal grid (the same is also true for Shih and Mitchell's algorithm [18]). In this paper, however, we show that if the hexagonal grid image is embedded in a rectangular grid then Huang and Mitchell's algorithm [7] can be applied, with aspect ratio correction, to obtain the exact EDT on the hexagonal grid.
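
The link between the squared EDT and erosion by a paraboloid can be sketched by brute force. The code below is an illustrative assumption, not Huang and Mitchell's decomposed algorithm: it evaluates min over q of f(q) + |p - q|^2 directly, where f is 0 on background pixels and infinite on foreground pixels. The function name `squared_edt_by_erosion` is hypothetical.

```python
import math

INF = math.inf


def squared_edt_by_erosion(binary):
    """Squared Euclidean distance from each pixel to the nearest 0 pixel.

    binary is a list of rows of 0/1 values. The characteristic function f
    is 0 on background (0) pixels and +inf on foreground (1) pixels; the
    result is min_q f(q) + |p - q|^2, i.e. the erosion of f by the
    paraboloid g(y) = -|y|^2, evaluated by brute force.
    """
    rows, cols = len(binary), len(binary[0])
    background = [(r, c) for r in range(rows) for c in range(cols)
                  if binary[r][c] == 0]
    out = [[INF] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for (br, bc) in background:
                d2 = (r - br) ** 2 + (c - bc) ** 2  # integer arithmetic only
                if d2 < out[r][c]:
                    out[r][c] = d2
    return out


example = [[0, 1, 1],
           [1, 1, 1],
           [1, 1, 1]]
result = squared_edt_by_erosion(example)
# result == [[0, 1, 4], [1, 2, 5], [4, 5, 8]]
```

The brute-force form costs one pass over all background pixels per output pixel; the decomposition into 3 × 3 steps described in the abstract is what makes the erosion practical.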

4.

On balancing sorting on a linear array

Lin Yen-Chun 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(5):566-571

A balanced parallel algorithm to sort a sequence of items on a linear array of processors is presented. The length of the sequence may range from small to arbitrarily large. For a short sequence, the output of the sorted sequence begins at the step following the last input of the whole sequence. For an arbitrarily long sequence, the time complexity is optimal under realistic hardware conditions. A variation of the algorithm is also introduced. Both algorithms require far less local memory than that required by a different approach of balanced computation. Any number of balanced processors can be connected to deliver more computing power without increasing the memory size of each processor.

5.

Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on Reconfigurable Computing Systems

Ling Zhuo Prasanna V.K. 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(4):433-448

The abundant hardware resources on current reconfigurable computing systems provide new opportunities for high-performance parallel implementations of scientific computations. In this paper, we study designs for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications, on reconfigurable computing systems. We first analyze design trade-offs in implementing this kernel. These trade-offs are caused by the inherent parallelism of matrix multiplication and the resource constraints, including the number of configurable slices, the size of on-chip memory, and the available memory bandwidth. We propose three parameterized algorithms which can be tuned according to the problem size and the available hardware resources. Our algorithms employ a linear array architecture with simple control logic. This architecture effectively utilizes the available resources and reduces routing complexity. The processing elements (PEs) used in our algorithms are modular so that it is easy to embed floating-point units into them. Experimental results on a Xilinx Virtex-II Pro XC2VP100 show that our algorithms achieve good scalability and high sustained GFLOPS performance. We also implement our algorithms on Cray XD1. XD1 is a high-end reconfigurable computing system that employs both general-purpose processors and reconfigurable devices. Our algorithms achieve a sustained performance of 2.06 GFLOPS on a single node of XD1.

6.

High level synthesis of integrated heterogeneous pipelined processing elements for DSP applications

Ali Jehad M.O. 《Computers & Electrical Engineering》2004,30(8):543-562

A technique for scheduling and processor allocation leading to the synthesis of integrated heterogeneous pipelined processing elements, implementing digital signal processing applications, is proposed. The proposed technique achieves efficient hardware implementations at the logic-level by minimizing the number of processing units used, without compromising the rate and delay optimality criteria.

The proposed algorithm is found to outperform algorithms resulting in homogeneous implementations, as it gives schedules with lower iteration periods, requires fewer hardware resources, and has lower time complexity at design time. In comparison with existing heterogeneous algorithms, the proposed algorithm produces schedules of lower time complexity and lower iteration period for some applications. The optimal performance of the proposed algorithm has been verified on several benchmarks.

7.

Architectural support for designing fault-tolerant open distributed systems

Hariri S. Choudhary A. Sarikaya B. 《Computer》1992,25(6):50-62

An overview of the main techniques for designing fault-tolerant software and hardware systems is provided. The important features of the building blocks (computers, memories, buses, etc.) that can support an efficient implementation of fault-tolerant open distributed systems (FTODSs) are identified. Taking into account the features of these building blocks, an organization for FTODS is proposed. A distributed voting algorithm and a two-level hierarchy for permanent memory are key elements in this scheme. The algorithms needed for transferring files and synchronizing the concurrent activities of the computing modules and for recovery are ISO standard protocols. Low-level voting and recovery algorithms that can run as a layer of software above the operating system make the open distributed system an attractive environment for applying fault-tolerant techniques.

8.

Parallel implementation of sequential morphological filters

Jan Bartovský Petr Dokládal Eva Dokládalová Vjačeslav Georgiev 《Journal of Real-Time Image Processing》2014,9(2):315-327

Many useful morphological filters are built as more or less long concatenations of erosions and dilations: openings, closings, size distributions, sequential filters, etc. An efficient implementation of these concatenations would allow all the sequentially concatenated operators to run simultaneously on the time-delayed data. A recent algorithm (see below) for the morphological dilation/erosion allows such inter-operator parallelism. This paper introduces an additional, intra-operator level of parallelism in this dilation/erosion algorithm. Realized in dedicated hardware, for rectangular structuring elements with programmable size, such an implementation achieves previously unattainable real-time performance for these traditionally costly operators. Low latency and low memory requirements are the main benefits, and performance does not deteriorate even for long concatenations or high-resolution images.

9.

A GPU-based parallel algorithm for cross-correlation extrapolation

王兴 王介君 孙宁 汪瑶 《计算机与现代化》2014,(2):213-218

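To overcome the high time complexity and long running time of the cross-correlation extrapolation algorithm, a fast GPU-based parallel algorithm is proposed and applied to extrapolation forecasting of cloud-to-ground lightning strike locations. The serial algorithm flow is analyzed first, and a parallel design is then derived from it. The algorithm is further optimized for the hardware architecture of AMD GPUs using OpenCL, with attention to data transfer between host memory and device memory and to device memory access patterns. Finally, observed cloud-to-ground lightning monitoring data are compared with the extrapolation results of the algorithm, and the computational efficiency of the serial and parallel algorithms is analyzed at different precisions. Experimental results show that the algorithm makes full use of the GPU's parallel computing capability and runs nearly 17 times faster.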

10.

DBV-Miner: A Dynamic Bit-Vector approach for fast mining frequent closed itemsets

Bay Vo Tzung-Pei Hong Bac Le 《Expert systems with applications》2012,39(8):7196-7206

Frequent closed itemsets (FCI) play an important role in quickly pruning redundant rules, so many algorithms for mining FCI have been developed. Algorithms based on vertical data formats have the advantage that they scan the database only once and compute the support of itemsets quickly. In recent years, the BitTable (Dong & Han, 2007) and IndexBitTable (Song, Yang, & Xu, 2008) approaches have been applied to mining frequent itemsets with significant results. However, they always use a fixed-size Bit-Vector for each item (equal to the number of transactions in the database), which consumes more memory for storing Bit-Vectors and more time for computing the intersection among Bit-Vectors. Moreover, they apply only to mining frequent itemsets; no algorithm for mining FCI based on BitTable has been proposed. This paper introduces a new method for mining FCI from transaction databases. First, the Dynamic Bit-Vector (DBV) approach is presented, together with algorithms for quickly computing the intersection of two DBVs. A lookup table is used to compute the support of itemsets (the number of 1 bits in a DBV) quickly. Next, the subsumption concept, which saves memory and computing time, is discussed. Finally, an algorithm based on DBV and the subsumption concept for fast mining of frequent closed itemsets is proposed. We compare our method with CHARM and find that the proposed algorithm is more efficient than CHARM in both mining time and memory usage.
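
The DBV idea described in the abstract can be sketched roughly as follows. This is a hedged illustration, not the paper's data structure: it assumes a DBV is a pair of a start byte index and a list of bytes with leading and trailing zero bytes removed, and the names `make_dbv`, `intersect`, `support`, and `BIT_COUNT` are hypothetical.

```python
# Lookup table: number of 1 bits in every possible byte value.
BIT_COUNT = [bin(value).count("1") for value in range(256)]


def trim(start, payload):
    """Drop leading and trailing zero bytes, adjusting the start index."""
    nonzero = [i for i, byte in enumerate(payload) if byte]
    if not nonzero:
        return (0, [])
    return (start + nonzero[0], payload[nonzero[0]:nonzero[-1] + 1])


def make_dbv(bits):
    """Pack a list of 0/1 transaction flags into a trimmed (start, bytes) pair."""
    padded = bits + [0] * (-len(bits) % 8)
    packed = [int("".join(map(str, padded[i:i + 8])), 2)
              for i in range(0, len(padded), 8)]
    return trim(0, packed)


def intersect(a, b):
    """AND two DBVs over their overlapping byte range only, then re-trim."""
    (sa, pa), (sb, pb) = a, b
    start = max(sa, sb)
    end = min(sa + len(pa), sb + len(pb))
    payload = [pa[i - sa] & pb[i - sb] for i in range(start, end)]
    return trim(start, payload)


def support(dbv):
    """Count the 1 bits of a DBV using the byte lookup table."""
    return sum(BIT_COUNT[byte] for byte in dbv[1])


# Hypothetical items over 24 transactions.
item_a = make_dbv([1 if t in (9, 10, 12) else 0 for t in range(24)])
item_b = make_dbv([1 if t in (10, 12, 20) else 0 for t in range(24)])
both = intersect(item_a, item_b)
```

Here `item_a` needs one stored byte instead of three, and the intersection touches only the bytes where both vectors can be non-zero.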

11.

A fast core-computation algorithm based on a compressed tree

曾德胜 《计算机工程》2011,37(10):61-63

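When the core is computed from a discernibility matrix, the many empty and duplicate discernibility elements in the matrix waste considerable storage space and computing time. To address this problem, a new data structure, the compressed tree (C_Tree), is designed by drawing on the frequent pattern tree, and a fast core-computation algorithm is built on it. Theoretical analysis and a worked example show that the time and space complexity of the algorithm is determined by that of computing the simplified decision table and constructing the C_Tree, so core computation becomes considerably more efficient.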

12.

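A two-step decomposition method for morphological structuring elements (Total citations: 3; self-citations: 0; citations by others: 3)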

朱立新 王平安 夏德深 《计算机辅助设计与图形学学报》2005,17(7):1476-1479

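In image processing based on mathematical morphology, decomposing the structuring element reduces computational complexity and makes it easier to realize complex morphological operations with general-purpose, simple morphological hardware modules. The paper discusses traditional methods that decompose a complex structuring element into simple ones without reducing its dimensionality, and proposes a method that decomposes the structuring element twice with dimensionality reduction to make morphological operations more efficient, lowering the per-pixel time complexity from O(n^2) to O(n), where n is the size of the structuring element. The method also lends itself to hardware and parallel implementation, offering a new approach to accelerating morphological transforms.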
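
A drop from O(n^2) to O(n) per pixel can be illustrated with the well-known separable decomposition of a flat n × n square into a 1 × n row pass and an n × 1 column pass. This sketch stands in for the general idea only; the paper's own two-step decomposition is not reproduced here, and all function names are assumptions.

```python
def dilate_rows(image, n):
    """Max over a 1 x n horizontal window (n odd), clipped at the borders."""
    half = n // 2
    cols = len(image[0])
    return [[max(row[max(0, c - half):min(cols, c + half + 1)])
             for c in range(cols)] for row in image]


def transpose(image):
    """Swap rows and columns of a list-of-lists image."""
    return [list(col) for col in zip(*image)]


def dilate_square_separable(image, n):
    """n x n flat dilation as a row pass then a column pass: about 2n reads per pixel."""
    return transpose(dilate_rows(transpose(dilate_rows(image, n)), n))


def dilate_square_direct(image, n):
    """Reference n x n flat dilation: about n*n reads per pixel."""
    half = n // 2
    rows, cols = len(image), len(image[0])
    return [[max(image[rr][cc]
                 for rr in range(max(0, r - half), min(rows, r + half + 1))
                 for cc in range(max(0, c - half), min(cols, c + half + 1)))
             for c in range(cols)] for r in range(rows)]


sample = [[1, 0, 0, 0],
          [0, 0, 5, 0],
          [0, 0, 0, 0],
          [0, 2, 0, 0]]
fast = dilate_square_separable(sample, 3)
slow = dilate_square_direct(sample, 3)
```

Both functions return the same image; only the number of reads per output pixel differs.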

13.

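Independent task scheduling in tree-based grid computing environments (Total citations: 17; self-citations: 1; citations by others: 17)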

林伟伟 齐德昱 李拥军 王振宇 张志立 《软件学报》2006,17(11):2352-2361

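Task scheduling is a fundamental problem in achieving high-performance grid computing, yet designing and implementing efficient scheduling algorithms is very challenging. This paper discusses the scheduling of independent tasks in a tree-based grid computing environment in which the computing power of grid resources and the network communication speeds are heterogeneous. Rather than minimizing the total execution time of the tasks (a problem proven to be NP-hard), an integer linear programming model is built for the scheduling problem, and the optimal task allocation scheme, that is, the optimal number of tasks assigned to each computing node, is obtained from the model. Based on this scheme, two dynamic, demand-driven heuristic task allocation algorithms are constructed: OPCHATA (optimization-based priority-computation heuristic algorithm for task allocation) and OPBHATA (optimization-based priority-bandwidth heuristic algorithm for task allocation). Experimental results show that, when scheduling a large number of independent tasks in a heterogeneous tree-based grid computing environment, the proposed algorithms clearly outperform other algorithms.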

14.

A QoS-based adaptive failure detector for grids (Total citations: 2; self-citations: 0; citations by others: 2)

董剑 左德承 刘宏伟 杨孝宗 《软件学报》2006,17(11):2362-2372

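A failure detector is one of the basic components needed to build a reliable grid computing environment. Because a grid hosts many distributed applications with different QoS requirements for failure detection, a grid failure detector that is to remain effective and scalable should both deliver exactly the failure-detection QoS each application requires and avoid the extra load of running multiple failure detectors to satisfy different QoS requirements. Based on the basic QoS evaluation metrics, a new failure detector, GA-FD (adaptive failure detector for grid), is implemented using a PULL-mode active detection strategy. It can simultaneously support the quantitatively specified QoS requirements of multiple applications and needs no assumptions about message behavior or clock synchronization. It is also proved that GA-FD can implement a failure detector of class ◇P under the partially synchronous model, and the corresponding experiments and data are presented.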

15.

Practical parallel Union-Find algorithms for transitive closure and clustering

G. Cybenko T. G. Allen J. E. Polito 《International journal of parallel programming》1988,17(5):403-423

Practical parallel algorithms, based on classical sequential Union-Find algorithms for computing transitive closures of binary relations, are described and implemented for both shared memory and distributed memory parallel computers. By practical algorithms, we mean algorithms that are efficient for parallel systems with bounded numbers of processors as opposed to algorithms where the number of processors grows with the problem size. Transitive closures are useful for decomposing many application problems into independent subproblems. The implementations were on an ENCORE Multimax shared memory machine and an NCUBE hypercube. Our implementations indicate that transitive closure computations are intrinsically difficult for distributed memory parallel machines because of the need for global information. By contrast, our results for shared memory machines exhibited excellent speedups. Supported in part by NSF Grant DCR-8619103, ONR contract N000-86-G-0202 and DOE Grant DE-FG02-85ER25001. Supported in part by RADC contract F30602-85-C-0303.
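
For orientation, the classical sequential building block can be sketched as below. This is a generic Union-Find with path compression and union by size, applied to a symmetric relation to obtain clusters; it is not the parallel algorithm of the paper, and the helper `clusters` is a hypothetical name.

```python
class UnionFind:
    """Sequential Union-Find with path compression and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the smaller tree under the larger
        self.size[ra] += self.size[rb]


def clusters(n, pairs):
    """Group elements 0..n-1 into the classes induced by the related pairs."""
    uf = UnionFind(n)
    for a, b in pairs:
        uf.union(a, b)
    groups = {}
    for x in range(n):
        groups.setdefault(uf.find(x), []).append(x)
    return sorted(groups.values())


result = clusters(6, [(0, 1), (1, 2), (4, 5)])
# result == [[0, 1, 2], [3], [4, 5]]
```

Every `find` may need to follow parents owned by any part of the structure, which hints at why the abstract reports difficulty on distributed memory machines.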

16.

Parallel Algorithms for Dynamic Shortest Path Problems

Ismail Chabini & Sridevi Ganugapati 《International Transactions in Operational Research》2002,9(3):279-302

The development of intelligent transportation systems (ITS) and the resulting need for the solution of a variety of dynamic traffic network models and management problems require faster‐than‐real‐time computation of shortest path problems in dynamic networks. Recently, a sequential algorithm was developed to compute shortest paths in discrete time dynamic networks from all nodes and all departure times to one destination node. The algorithm is known as algorithm DOT and has an optimal worst‐case running‐time complexity. This implies that no algorithm with a better worst‐case computational complexity can be discovered. Consequently, in order to derive faster algorithms to solve all‐to‐one shortest path problems in dynamic networks, one would need to explore avenues other than the design of sequential solution algorithms only. The use of commercially‐available high‐performance computing platforms to develop parallel implementations of sequential algorithms is an example of such an avenue. This paper reports on the design, implementation, and computational testing of parallel dynamic shortest path algorithms. We develop two shared‐memory and two message‐passing dynamic shortest path algorithm implementations, which are derived from algorithm DOT using the following parallelization strategies: decomposition by destination and decomposition by transportation network topology. The algorithms are coded using two types of parallel computing environments: a message‐passing environment based on the parallel virtual machine (PVM) library and a multi‐threading environment based on the SUN Microsystems Multi‐Threads (MT) library. We also develop a time‐based parallel version of algorithm DOT for the case of minimum time paths in FIFO networks, and a theoretical parallelization of algorithm DOT on an ‘ideal’ theoretical parallel machine.
The performance of the implementations is analyzed and evaluated using large transportation networks and two types of parallel computing platforms: a distributed network of Unix workstations and a SUN shared‐memory machine containing eight processors. Satisfactory speed‐ups in the running time of sequential algorithms are achieved, in particular for shared‐memory machines. Numerical results indicate that shared‐memory computers constitute the most appropriate type of parallel computing platforms for the computation of dynamic shortest paths for real‐time ITS applications.

17.

A family of compact genetic algorithms for intrinsic evolvable hardware (Total citations: 1; self-citations: 0; citations by others: 1)

Gallagher J.C. Vigraham S. Kramer G. 《Evolutionary Computation, IEEE Transactions on》2004,8(2):111-126

For many evolvable hardware applications, small size and power efficiency are critical design considerations. One way to realize significant memory savings, and thus power and space savings, in a hardware-based evolutionary algorithm is to represent populations of candidate solutions as probability vectors rather than as sets of bit strings. The compact genetic algorithm (CGA) is a probability vector-based evolutionary algorithm that can be efficiently and elegantly implemented in digital hardware. Unfortunately, the CGA is a very weak, first order, evolutionary algorithm that is unlikely to possess sufficient search power to enable intrinsic evolvable hardware applications. In this paper, we further develop a number of modifications to the basic CGA that significantly improve its search efficacy without substantially increasing the size and complexity of its hardware implementation. The paper provides both benchmark results demonstrating increased efficacy and a conceptual data path/microcontroller design suitable for implementation in digital hardware. It then demonstrates efficient implementation by making a head-to-head comparison of field programmable gate array implementations of both the classic CGA and a member of our family of modifications. The paper concludes with a discussion of future research, including several additional extensions that we expect will further increase search efficacy without increasing implementation cost.
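
The basic CGA that the paper starts from can be sketched in software as follows. This is a minimal version of the classic algorithm under stated assumptions (binary genomes, a simulated population size `pop_size`, a caller-supplied `fitness` function); it does not include the paper's modifications, and the parameter names are hypothetical.

```python
import random


def compact_ga(fitness, length, pop_size, steps, rng):
    """Evolve a probability vector instead of a stored population."""
    prob = [0.5] * length
    step = 1.0 / pop_size  # update size mimics a population of pop_size
    for _ in range(steps):
        # Sample two candidate bit strings from the probability vector.
        a = [1 if rng.random() < p else 0 for p in prob]
        b = [1 if rng.random() < p else 0 for p in prob]
        winner, loser = (a, b) if fitness(a) >= fitness(b) else (b, a)
        # Shift each differing position toward the winner's bit.
        for i in range(length):
            if winner[i] != loser[i]:
                prob[i] += step if winner[i] == 1 else -step
                prob[i] = min(1.0, max(0.0, prob[i]))
    return prob


# OneMax (count of 1 bits) as a toy fitness function.
vector = compact_ga(sum, length=8, pop_size=20, steps=2000,
                    rng=random.Random(0))
```

Only `length` probabilities are stored, which is the memory saving the abstract refers to. On OneMax the vector typically drifts toward all ones, though a single run does not guarantee it.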

18.

A massively parallel molecular dynamics algorithm for the MasPar supercomputer

《Computer Physics Communications》1998,108(2-3):200-210

Massively parallel computers are emerging as a valuable tool for supercomputer applications. Their processing speed and memory size make them ideal for solving large applications. An implementation of a molecular dynamics simulation using a neighbour list type algorithm is presented. By efficient use and understanding of the architecture, an extremely efficient neighbour list algorithm (without the need to store the list) has been developed. The large number of processors has allowed us to model large samples (up to one million atoms), reducing the artefacts which may be caused by having a small sample size. This implementation has provided performance results that surpass those of standard machines. The improvements are by factors of hundreds in terms of speed of calculation, and the sizes of the systems that can be modelled.

19.

Integrated video motion estimator with Retinex-like pre-processing for robust motion analysis in automotive scenarios: algorithmic and real-time architecture design

Stefano Marsi Sergio Saponara 《Journal of Real-Time Image Processing》2010,5(4):275-289

The paper presents a novel technique for robust motion analysis in real automotive scenarios, based on integrating a Retinex-like pre-processing algorithm with a block-matching video motion estimator. Both algorithmic and real-time hardware design issues are discussed. The benefits of the proposed technique are manifold: the entire system is more robust; the estimated motion vectors are more reliable and less dependent on critical ambient conditions like shadows or flashes; and the proposed algorithm may allow motion estimation to be performed using very few bits, running as a 2- or 1-bit transform, while still maintaining good performance. Real-time hardware implementation is achieved by design and synthesis in 65 nm CMOS standard-cells technology of an Application Specific Instruction-set Processor. Design optimizations for both the processing core and the memory organization are presented. With respect to the state of the art, the proposed hardware implementation ensures bounded circuit complexity, low power consumption and reprogrammability of the technique.

20.

Parallelization and scalability of a spectral element channel flow solver for incompressible Navier–Stokes equations

C. W. Hamman R. M. Kirby M. Berzins 《Concurrency and Computation》2007,19(10):1403-1422

Direct numerical simulation (DNS) of turbulent flows is widely recognized to demand fine spatial meshes, small timesteps, and very long runtimes to properly resolve the flow field. To overcome these limitations, most DNS is performed on supercomputing machines. With the rapid development of terascale (and, eventually, petascale) computing on thousands of processors, it has become imperative to consider the development of DNS algorithms and parallelization methods that are capable of fully exploiting these massively parallel machines. A highly parallelizable algorithm for the simulation of turbulent channel flow that allows for efficient scaling on several thousand processors is presented. A model that accurately predicts the performance of the algorithm is developed and compared with experimental data. The results demonstrate that the proposed numerical algorithm is capable of scaling well on petascale computing machines and thus will allow for the development and analysis of high Reynolds number channel flows. Copyright © 2007 John Wiley & Sons, Ltd.