Similar literature
1.
A software environment tailored to computer vision and image processing (CVIP) is presented. It focuses on how information about the CVIP problem domain can make the high-performance algorithms and sophisticated algorithm techniques designed by algorithm experts more readily available to CVIP researchers. The environment consists of three principal components: DISC, Cloner, and Graph Matcher. DISC (dynamic intelligent scheduling and control) supports experimentation at the CVIP task level by creating a dynamic schedule from a user's specification of the algorithms that constitute a complex task. Cloner is aimed at the algorithm development process and is an interactive system that helps a user design new parallel algorithms by building on and modifying existing library algorithms. Graph Matcher performs the critical step of mapping new algorithms onto the target parallel architecture. Initial implementations of DISC and Graph Matcher have been completed, and work on Cloner is in progress.

2.
Embedded manycore architectures are often organized as fabrics of tightly-coupled shared memory clusters. A hierarchical interconnection system is used, with a crossbar-like medium inside each cluster and a network-on-chip (NoC) at the global level, which makes memory operations nonuniform (NUMA). Due to NUMA, regular applications typically employed in the embedded domain (e.g., image processing and computer vision) ultimately behave as irregular workloads if a flat memory system is assumed at the program level. Nested parallelism represents a powerful programming abstraction for these architectures, provided that (i) streamlined middleware support is available, whose overhead does not dominate the run-time of fine-grained applications; and (ii) a mechanism to control thread binding at the cluster level is supported. We present a lightweight runtime layer for nested parallelism on cluster-based embedded manycores, integrating our primitives in the OpenMP runtime system and implementing a new directive to control NUMA-aware nested parallelism mapping. Using a set of real application use cases, we explore how NUMA makes regular parallel workloads behave as irregular ones, and how our approach makes it possible to control such effects and achieve up to 28× speedup versus flat parallelism.

3.
Efficient run-time mapping of tasks onto a Multiprocessor System-on-Chip (MPSoC) is very challenging, especially when new tasks of other applications must also be supported at run-time. In this paper, we present a number of communication-aware run-time mapping heuristics for the efficient mapping of multiple applications onto an MPSoC platform in which more than one task can be supported by each processing element (PE). The proposed mapping heuristics examine the available resources before recommending that adjacent communicating tasks be placed on the same PE. In addition, the proposed heuristics give priority to placing the tasks of an application in close proximity so as to further minimize the communication overhead. Our investigations show that the proposed heuristics are capable of alleviating Network-on-Chip (NoC) congestion bottlenecks when compared to existing alternatives. We map tasks of applications onto an 8 × 8 NoC-based MPSoC to show that our mapping heuristics consistently lead to reductions in total execution time, energy consumption, average channel load, and latency. In particular, we show that energy savings can be up to 44% and that average channel load is improved by 10% in some cases.

4.
There are many design challenges in the hardware-software co-design approach for improving the performance of data-intensive streaming applications with a general-purpose microprocessor and a hardware accelerator. The main challenges are to prevent hardware area fragmentation in order to increase resource utilization, to reduce hardware reconfiguration cost, and to partition and schedule the tasks between the microprocessor and the hardware accelerator efficiently for performance improvement and power savings. In this paper, a modular and block-based hardware configuration architecture named memory-aware run-time reconfigurable embedded system (MARTRES) is proposed for efficient resource management and performance improvement of streaming applications. Subsequently, we design a task placement algorithm named the hierarchical best fit ascending (HBFA) algorithm to show that the MARTRES configuration architecture is very efficient in increasing resource utilization and flexible in task mapping and power savings. The time complexity of the HBFA algorithm is reduced to O(n), compared to the traditional Best Fit (BF) algorithm's O(n^2), while the quality of the placement solution produced by HBFA is better than that of the BF algorithm. Finally, we design an efficient task partitioning and scheduling algorithm named balanced partitioned and placement-aware partitioning and scheduling algorithm (BPASA). In BPASA we exploit the temporal parallelism in streaming applications to reduce the reconfiguration cost of the hardware, while keeping in mind the required throughput of the output data. We balance the exploitation of spatial parallelism and temporal parallelism in streaming applications by considering the reconfiguration cost vs. the data transfer cost. The scheduler refers to the HBFA placement algorithm to check whether contiguous area on the FPGA is available before scheduling the task for HW or for SW.

5.
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to the programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains, including digital signal processing, image processing, and computer vision. The parameters of the performance for such stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. The central contribution of this research is a new algorithm to determine a processor mapping for a chain of tasks that optimizes latency in the presence of a throughput constraint. We also discuss how this algorithm can be applied to solve the converse problem of optimizing throughput with a latency constraint. The problem formulation uses a general and realistic model of intertask communication and addresses the entire problem of mapping, which includes clustering tasks into modules, assigning of processors to modules, and possible replicating of modules. The main algorithms are based on dynamic programming and their execution time complexity is polynomial in the number of processors and tasks. The entire framework is implemented as an automatic mapping tool in the Fx parallelizing compiler for a dialect of High Performance Fortran.
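
The distinction between latency and throughput drawn in entry 5 can be illustrated with a small sketch. It assumes a deliberately simplified model that is not taken from the paper: each stage has a fixed processing time per data set, communication cost is ignored, and a stage replicated r times handles r data sets concurrently. The function name and the sample numbers are hypothetical.

```python
from fractions import Fraction


def pipeline_metrics(stage_times, replicas=None):
    """Return (latency, throughput) for a chain of pipeline stages.

    Assumed toy model (not the paper's): no communication cost, and
    replicating a stage r times multiplies that stage's rate by r
    without changing the time one data set spends in it.
    """
    if replicas is None:
        replicas = [1] * len(stage_times)
    # Latency: one data set passes through every stage in order.
    latency = sum(stage_times)
    # Throughput: limited by the stage with the lowest effective rate.
    throughput = min(
        Fraction(r) / Fraction(t) for t, r in zip(stage_times, replicas)
    )
    return latency, throughput


# Three stages taking 2, 4 and 2 time units per data set.
print(pipeline_metrics([2, 4, 2]))             # no replication
print(pipeline_metrics([2, 4, 2], [1, 2, 1]))  # slowest stage replicated twice
```

In this model, replicating the slowest stage raises throughput while latency stays the same, which is why a mapping can trade one criterion against the other.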

6.
The complex nature of two-dimensional image data has presented problems for traditional information systems designed strictly for alphanumeric data. Systems aimed at effectively managing image data have generally approached the problem from two different views: They either possess a strong database component with little image understanding, or they serve as an image repository for computer vision applications, with little emphasis on the image retrieval process. A general architecture for visual information-management systems (VIMS), which combine the strengths of both approaches, is presented. The system utilizes computer vision routines for both insertion and retrieval and allows easy query-by-example specifications. The vision routines are used to segment and evaluate objects based on domain-knowledge describing the objects and their attributes. The vision system can then assign feature values to be used for similarity-measures and image retrieval. A VIMS developed for face-image retrieval is presented to demonstrate these ideas.

7.
Knowledge-based computer vision creates a large variety of different design tasks including low-level image processing tasks, components for symbol manipulation, and complex cognitive processes. The diversity of requirements of these tasks calls for radically new concepts in terms of hardware and software structures. Classical design tools, such as image processing systems, support only small portions of this large task set. In this paper we report on an interactive and homogeneous software environment termed the Vision Kernel System (VKS) that aims at supporting the entire spectrum of problems encountered in knowledge-based computer vision.

8.
Three-dimensional integrated circuits (3D ICs) are suitable alternatives to traditional two-dimensional (2D) ICs because of their advantages in performance and packaging, and they have therefore received considerable attention from researchers. At the same time, emerging network-on-chip (NoC) based many-core chips provide great potential for running multiple applications simultaneously. However, this approach increases the interference between applications, which lowers the performance of each application. Hence, mapping tasks belonging to various applications onto the nodes of an architecture is a very important issue. In this study, based on the partitioning concept, a novel methodology is presented for mapping multiple applications at run-time onto an irregular wireless 3D NoC-based multiprocessor system-on-chip (MPSoC) platform in which more than one task can be supported by each processing element (PE). In the second algorithm (enhanced irregular-partitioning best neighbor), the partitioning of the network is changed dynamically according to the number of applications running simultaneously, in order to minimize the communication overhead and congestion on the NoC, which leads to more efficient task mapping. The simulation results reveal that the second proposed algorithm (enhanced IPBN), in comparison with the NPBN (non-partitioning best neighbor) algorithm and our first proposed algorithm (basic IPBN), enhances performance by decreasing the total execution time, average hop count, average channel load, and energy consumption.

9.
Many computer vision and image processing problems can be posed as solving partial differential equations (PDEs). However, designing a PDE system usually requires high mathematical skills and good insight into the problems. In this paper, we consider designing PDEs for various problems arising in computer vision and image processing in a lazy manner: learning PDEs from training data via an optimal control approach. We first propose a general intelligent PDE system which holds the basic translational and rotational invariance rule for most vision problems. By introducing a PDE-constrained optimal control framework, it is possible to use training data obtained in multiple ways (ground truth, results from other methods, and manual results from humans) to learn PDEs for different computer vision tasks. The proposed optimal control based training framework aims at learning a PDE-based regressor to approximate the unknown (and usually nonlinear) mapping of different vision tasks. The experimental results show that the learnt PDEs can solve different vision problems reasonably well. In particular, we can obtain PDEs not only for problems for which traditional PDEs work well but also for problems for which PDE-based methods have never been tried, due to the difficulty of describing those problems in a mathematical way.

10.
Parallel loops account for the greatest amount of parallelism in numerical programs. Executing nested loops in parallel with low run-time overhead is thus very important for achieving high performance in parallel processing systems. However, in parallel processing systems with caches or local memories in memory hierarchies, a "thrashing problem" may arise whenever data move back and forth between the caches or local memories of different processors. Previous techniques can only deal with the rather simple cases with one linear function in the perfectly nested loop. In this paper, we present a parallel program optimizing technique called hybrid loop interchange (HLI) for the cases with multiple linear functions and loop-carried data dependences in the nested loop. With HLI we can easily eliminate or reduce the thrashing phenomena without reducing the program parallelism.
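
As background for entry 10, the sketch below shows plain loop interchange only; it is not the HLI technique, whose details the abstract does not give. It assumes a row-major array stored as a flat buffer and a loop body with no loop-carried dependence, so both loop orders visit the same elements and only the memory access pattern changes. The helper names are hypothetical.

```python
def access_order(n_rows, n_cols, row_outer=True):
    """List the flat row-major offsets visited by a doubly nested loop."""
    offsets = []
    if row_outer:
        for i in range(n_rows):          # original order: i outer, j inner
            for j in range(n_cols):
                offsets.append(i * n_cols + j)
    else:
        for j in range(n_cols):          # interchanged: j outer, i inner
            for i in range(n_rows):
                offsets.append(i * n_cols + j)
    return offsets


def strides(offsets):
    """Distance in elements between consecutive accesses."""
    return [b - a for a, b in zip(offsets, offsets[1:])]


row_major = access_order(2, 3, row_outer=True)
col_major = access_order(2, 3, row_outer=False)
print(row_major, strides(row_major))
print(col_major, strides(col_major))
```

The row-outer order walks memory with unit stride, while the interchanged order jumps between rows on most accesses. Which order reduces data movement between processors depends on how iterations and data are distributed, which is the kind of question the abstract's technique addresses.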

11.
Human beings can become experts in performing specific vision tasks, for example, doctors analysing medical images, or botanists studying leaves. With sufficient knowledge and experience, people can become very efficient at such tasks. When attempting to perform these tasks with a machine vision system, it would be highly beneficial to be able to replicate the process which the expert undergoes. Advances in eye-tracking technology can provide data to allow us to discover the manner in which an expert studies an image. This paper presents a first step towards utilizing these data for computer vision purposes. A growing-neural-gas algorithm is used to learn a set of Gabor filters which give high responses to image regions which a human expert fixated on. These filters can then be used to identify regions in other images which are likely to be useful for a given vision task. The algorithm is evaluated by learning filters for locating specific areas of plant leaves.

12.
《Real》1996,2(3):187-199
Many software libraries have been created to support the commonly used primitive operations needed in image processing, image analysis and image understanding. Generally, these libraries are based on the single-layered Application Program Interface (API). While a single-layered API provides a useful abstraction level for interacting with the library and hides unnecessary implementation details from the user, it does not produce an efficient program when a new algorithm is implemented by assembling selected existing library routines. The composed program suffers from inefficient data movement and additional loop control overhead. Furthermore, when a system employs a highly integrated processor such as a single-chip multiprocessor, the single-layered API prevents the user from fully utilizing the resources available in the system. In this article, we describe the University of Washington Image Computing Library (UWICL), the multi-layered high-performance parallel image computing library for Texas Instruments TMS320C80 Multimedia Video Processor (MVP)-based time-critical systems. Our goal in designing the UWICL is to provide the TMS320C80 user community with efficient and flexible image computing library routines. Under the multi-layered organization, the UWICL provides three levels of APIs to programmers: the MVP-level API, the DSP-level API, and APIs for data flow and processing cores. By optimizing the processing core functions, we have achieved high performance at the individual function level, and by allowing the composition of sub-primitive library routines, we can achieve efficient image processing application development, avoiding most problems encountered in using single-layered library routines. The performance of the multi-layered organization vs. the single-layered one is analysed and compared using Canny's edge detection algorithm as an example. The balanced composition based on the multi-layered organization outperforms the single-layered composition by 14 to 41%, depending on the system's available memory bandwidth. As an adjunct to the UWICL, we have also developed an integrated MVP performance monitor (MPM). The MPM can identify the performance bottleneck of TMS320C80 applications and can be used in optimization by enabling the user to select the most efficient library composition level in building the application with the UWICL. In order to provide an overall performance evaluation model of the MVP, a simple MVP functional model has also been defined in the MPM. For the image thresholding operation, the difference between the measured execution time and the analysis prediction is less than 2%. The design and implementation of the MPM, and the applicability and usefulness of the MPM and MVP performance model, are described in this article.

13.
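A novel processing-resource scheduling algorithm for advanced network processors (NPs) is proposed, called Duplication-based Partial Dynamic Scheduling (DPDS). It combines partial dynamic mapping with a task duplication strategy to improve NP performance. DPDS differs from existing algorithms in several respects: the processing elements are heterogeneous, fully connected, and multithreaded; applications are decomposed into DAG tasks that take a continuous stream of packets as input; and the schedule can be adjusted in both the initialization and run-time phases. Experimental results show that the algorithm achieves a maximum average throughput about 30% higher than that of algorithms without a dynamic duplication phase.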

14.
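A road recognition pipeline is implemented on the Android platform using the computer vision library OpenCV and Android NDK compilation. The paper first briefly introduces the open-source computer vision library OpenCV and a method for porting it to the Android platform; this method broadens the range of applications of the Android platform and makes it better able to handle various kinds of complex image processing. The image processing part uses OpenCV's Hough transform algorithm together with object tracking techniques. Tests on the Android platform gave good results.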
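
Entry 14 relies on the Hough transform for line detection. The following is a minimal pure-Python sketch of the voting idea behind it, using the standard parameterization rho = x·cos(theta) + y·sin(theta). It is not OpenCV's implementation and not the paper's code; the function name, the one-degree angular resolution, and the sample points are assumptions made for illustration.

```python
import math
from collections import Counter


def hough_votes(points, theta_steps=180):
    """Accumulate votes in (rho, theta-index) space for (x, y) edge points.

    theta-index k corresponds to the angle pi * k / theta_steps, and rho is
    rounded to the nearest integer (an assumed 1-pixel resolution).
    """
    votes = Counter()
    for x, y in points:
        for k in range(theta_steps):
            theta = math.pi * k / theta_steps
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(rho, k)] += 1
    return votes


# Ten edge points on the vertical line x = 5.
vertical = [(5, y) for y in range(10)]
votes = hough_votes(vertical)
print(votes[(5, 0)], max(votes.values()))
```

Every point on the line x = 5 votes for the cell (rho = 5, theta = 0), so that cell collects one vote per point. With coarse rounding, neighbouring angle cells can tie with it, which is why practical implementations apply a vote threshold and suppress near-duplicate peaks.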

15.
Multimedia Tools and Applications - Depth map estimation from a single RGB image is a fundamental computer vision and image processing task for various applications. Deep learning based depth map...

16.
Today, there is a growing demand for computer vision and image processing in different areas and applications such as military surveillance and biological and medical imaging. Edge detection is a vital image processing technique used as a pre-processing step in many computer vision algorithms. However, the presence of noise makes the edge detection task more challenging; therefore, an image restoration technique is needed to tackle this obstacle by presenting an adaptive solution. As the complexity of processing rises with recent high-definition technologies, the amount of data contained in an image is increasing dramatically. Thus, increased processing power is needed to speed up the completion of certain tasks. In this paper, we present a parallel implementation of a hybrid algorithm comprising edge detection and image restoration, along with other processes, using the Compute Unified Device Architecture (CUDA) platform and exploiting a Single Instruction Multiple Thread (SIMT) execution model on a Graphics Processing Unit (GPU). The performance of the proposed method is tested and evaluated using well-known images from various applications. We evaluated the computation time of both the parallel implementation on the GPU and sequential execution on the Central Processing Unit (CPU), natively and using Hyper-Threading (HT) implementations. The gained speedup for the naïve approach of the proposed edge detection using GPU under global memory direct access is up to 37 times, while the speedup of the native CPU implementation when using shared memory approach is up to 25 times and 1.5 times over HT implementation.

17.
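Fast color segmentation algorithms for color images are of great value for real-time applications in intelligent robotics and computer vision. The CMVision library is a typical implementation of this class of algorithms. To overcome the limitations of the CMVision library in color classification and to further improve its speed, this paper proposes an improved algorithm. The algorithm introduces a color mapping table and optimizes the data input interface and the connected-region operations. Analysis and experiments show that the algorithm achieves good results in both functionality and speed.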
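
The color mapping table mentioned in entry 17 is not described in the abstract. The sketch below shows only the general lookup-table idea, under stated assumptions: each RGB channel is quantized to a few bits, every quantized cell is labelled once in advance from per-class channel ranges, and classifying a pixel then costs a single table access. All names, the 4-bit quantization, and the sample ranges are hypothetical, and this is neither CMVision's code nor the paper's.

```python
def build_color_table(class_ranges, bits=4):
    """Precompute a class label for every quantized RGB cell.

    class_ranges maps a non-zero integer label to three (low, high) ranges
    in 0..255, one per channel. Label 0 means "unclassified".
    """
    levels = 1 << bits
    shift = 8 - bits
    table = [0] * (levels ** 3)
    for r in range(levels):
        for g in range(levels):
            for b in range(levels):
                # Representative colour of the cell: centre of each bin.
                centre = [(v << shift) + (1 << shift) // 2 for v in (r, g, b)]
                for label, ranges in class_ranges.items():
                    if all(lo <= c <= hi for c, (lo, hi) in zip(centre, ranges)):
                        table[(r * levels + g) * levels + b] = label
                        break
    return table


def classify(table, pixel, bits=4):
    """Classify one (r, g, b) pixel with a single table lookup."""
    levels = 1 << bits
    shift = 8 - bits
    r, g, b = (c >> shift for c in pixel)
    return table[(r * levels + g) * levels + b]


ranges = {
    1: ((200, 255), (0, 80), (0, 80)),   # hypothetical "red" class
    2: ((0, 80), (0, 80), (200, 255)),   # hypothetical "blue" class
}
table = build_color_table(ranges)
print(classify(table, (250, 10, 10)), classify(table, (128, 128, 128)))
```

A table of this kind can hold arbitrarily shaped color classes rather than only per-channel threshold boxes, at the cost of memory that grows with the quantization depth.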

18.
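Design and implementation of a real-time image processing operating system for multi-bus, multi-DSP systems   Cited by: 5 (self-citations: 0; citations by others: 5)
This paper designs and implements a parallel operating system for a multi-bus, multi-DSP real-time image recognition system. It consists of two parts: an operating system embedded on the DSP chips and protocol software running on a PC. The protocol software provides a user interface, receives the decomposition information of an algorithm, organizes it into a defined data structure, and then links all subtasks and their decomposition information into a job. The operating system on the DSPs supports loading jobs from the host computer or from EPROM. It supports the VXI bus standard and provides data communication, task allocation, and concurrent process management. Based on the task decomposition information, it allocates hardware resources, sets up the data flow, establishes the synchronization relationships among subtasks, communicates with the host computer, and outputs the results. Experimental results show that the hardware and operating system designed in this paper can meet the needs of different parallel structures and achieve satisfactory parallel image processing results.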

19.
The aim of GRID superscalar is to reduce the development complexity of Grid applications to the minimum, in such a way that writing an application for a computational Grid may be as easy as writing a sequential application. Our assumption is that Grid applications will in many cases be composed of tasks, most of them repetitive. The granularity of these tasks will be at the level of simulations or programs, and the data objects will be files. GRID superscalar allows application developers to write their application in a sequential fashion. The requirements to run that sequential application in a computational Grid are the specification of the interface of the tasks that should be run in the Grid and, at some points, calls to the GRID superscalar interface functions and linking with the run-time library. GRID superscalar provides an underlying run-time that is able to detect the inherent parallelism of the sequential application and performs concurrent task submission. In addition to a data-dependence analysis based on those input/output task parameters which are files, techniques such as file renaming and file locality are applied to increase the application performance. This paper presents the current GRID superscalar prototype based on Globus Toolkit 2.x, together with examples and performance evaluation of some benchmarks.

20.
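OpenCV is a lightweight and efficient computer vision library, written mainly as C functions with interfaces for multiple languages, that implements many general-purpose algorithms in image processing and computer vision. Using its Canny edge detection algorithm, edge detection of blurred images is implemented. With reference to the L2BOT vision robot example, OpenCV-based recognition of barcodes and obstacles against a uniform background is studied, and a reference model is provided for choosing the optimal movement route of the robot.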
