
1.

FPGA implementation of image processing technique for blood samples characterization

Telnaz Zarifi Mahsa Malek 《Computers & Electrical Engineering》2014

This work presents a hardware implementation of an image processing algorithm for blood type determination. The image processing technique proposed in this paper uses the appearance of agglutination to determine blood type by detecting edges and contrast within the agglutinated sample. An FPGA implementation and parallel processing algorithms are used in conjunction with image processing techniques to make this system reliable for the characterization of large numbers of blood samples. The program was developed in Matlab, then transferred to and implemented on a Xilinx Virtex-6 FPGA using ISE software. Hardware implementation of the proposed algorithm on the FPGA demonstrates a power consumption of 770 mW from a 2.5 V power supply. Blood type characterization using our FPGA implementation requires only 6.6 s, while a desktop computer-based algorithm with Matlab implementation on a Pentium 4 processor with a 3 GHz clock takes 90 s. The presented device is faster, more portable, less expensive, and consumes less power than conventional instruments. The proposed hardware solution achieved an accuracy of 99.5% when tested with over 500 different blood samples.

2.

CellJoin: a parallel stream join operator for the cell processor

Buğra Gedik Rajesh R. Bordawekar Philip S. Yu 《The VLDB Journal The International Journal on Very Large Data Bases》2009,18(2):501-519

Low-latency and high-throughput processing are key requirements of data stream management systems (DSMSs). Hence, multi-core processors that provide high aggregate processing capacity are ideal matches for executing costly DSMS operators. The recently developed Cell processor is a good example of a heterogeneous multi-core architecture and provides a powerful platform for executing data stream operators with high performance. On the downside, exploiting the full potential of a multi-core processor like Cell is often challenging, mainly due to the heterogeneous nature of the processing elements, the software-managed local memory at the co-processor side, and the unconventional programming model in general. In this paper, we study the problem of scalable execution of windowed stream join operators on multi-core processors, and specifically on the Cell processor. By examining various aspects of join execution flow, we determine the right set of techniques to apply in order to minimize the sequential segments and maximize parallelism. Concretely, we show that basic windows coupled with low-overhead pointer-shifting techniques can be used to achieve efficient join window partitioning, column-oriented join window organization can be used to minimize scattered data transfers, delay-optimized double buffering can be used for effective pipelining, rate-aware batching can be used to balance join throughput and tuple delay, and finally single-instruction multiple-data (SIMD) optimized operator code can be used to exploit data parallelism.
Our experimental results show that, following the design guidelines and implementation techniques outlined in this paper, windowed stream joins can achieve high scalability (linear in the number of co-processors) by making efficient use of the extensive hardware parallelism provided by the Cell processor (reaching data processing rates of ≈13 GB/s) and significantly surpass the performance obtained from conventional high-end processors (supporting a combined input stream rate of 2,000 tuples/s using 15 min windows and without dropping any tuples, resulting in ≈8.3 times higher output rate compared to an SSE implementation on dual 3.2 GHz Intel Xeon).

3.

Hardware/software co-design for particle swarm optimization algorithm

Shih-An Li Ching-Chang Wong 《Information Sciences》2011,181(20):4582-4596

This paper presents a hardware/software (HW/SW) co-design approach that uses the SOPC technique and a pipelined design method to improve the design flexibility and execution performance of particle swarm optimization (PSO) for embedded applications. Based on a modular design architecture, two modules are designed to work closely together to carry out the evolution process at different design stages: a Particle Updating Accelerator module, implemented in hardware, that updates the velocity and position of particles, and a Fitness Evaluation module, implemented either on a soft-core processor or on a Field Programmable Gate Array (FPGA), that evaluates the objective functions. Thanks to this design flexibility, the proposed approach can tackle various optimization problems of embedded applications without the need for hardware redesign. To further improve the execution performance of the PSO, a hardware random number generator (RNG) is also designed in this paper, in addition to a particle re-initialization scheme to promote exploration during the optimization process. Experimental results have demonstrated that the proposed HW/SW co-design approach for PSO algorithms is efficient at obtaining high-quality solutions for embedded applications.
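The split described in this abstract, a particle update step kept separate from a pluggable fitness evaluation, can be illustrated in software. The sketch below uses the textbook inertia-weight PSO update, which may differ from the variant in the paper; the function names, the parameter values (`w`, `c1`, `c2`, `bound`), and the `sphere` test function are all assumptions made for illustration, and the paper's hardware RNG and re-initialization scheme are not modeled.

```python
import random


def pso_minimize(fitness, dim, n_particles=20, iters=100,
                 w=0.7, c1=1.5, c2=1.5, bound=5.0, seed=0):
    """Textbook inertia-weight PSO (illustrative, not the paper's design).

    `fitness` is the pluggable objective function; everything else in the
    inner loop is the velocity/position update.
    """
    rng = random.Random(seed)
    pos = [[rng.uniform(-bound, bound) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    initial_best = gbest_val

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + cognitive pull + social pull.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Position update, clamped to the search box.
                pos[i][d] = max(-bound, min(bound, pos[i][d] + vel[i][d]))
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val, initial_best


def sphere(x):
    """Hypothetical objective: sum of squares, minimized at the origin."""
    return sum(v * v for v in x)


if __name__ == "__main__":
    best, best_val, start_val = pso_minimize(sphere, dim=3)
    print(best, best_val, start_val)
```

Because only `fitness` changes between problems, a different objective can be swapped in without touching the update loop, which mirrors the abstract's claim that new problems need no hardware redesign.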

4.

Efficient Hardware/Software Implementation of an Adaptive Neuro-Fuzzy System

del Campo I. Echanobe J. Bosque G. Tarela J.M. 《Fuzzy Systems, IEEE Transactions on》2008,16(3):761-778

This paper describes the development of efficient hardware/software (HW/SW) neuro-fuzzy systems. The model used in this work consists of an adaptive neuro-fuzzy inference system modified for efficient HW/SW implementation. Two different on-chip approaches are presented: a high-performance parallel architecture for offline training and a pipelined architecture suitable for online parameter adaptation. Details of important aspects concerning the design of HW/SW solutions are given. The proposed architectures have been implemented using a system-on-a-programmable-chip. The device contains an embedded-processor core and a large field programmable gate array (FPGA). The processor provides flexibility and high precision to implement the learning algorithms, while the FPGA allows the development of high-speed inference architectures for real-time embedded applications.

5.

Research on FPGA-based design techniques for aero-engine electronic controllers

刘冬冬 张天宏 黄向华 陈建 《测控技术》2012,31(1):57-61

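Exploiting the parallel operation, reconfigurability, and hardware/software co-design capabilities of FPGAs, this paper proposes an FPGA-based on-chip distributed design method for aero-engine electronic controllers. It focuses on three key technical issues: selection of the FPGA-embedded processor, design of the hardware coprocessor, and design of the synchronous data bus. On this basis, a controller prototype was designed on an Altera EP2C35 FPGA and its hardware performance was tested; the results indicate that the design method is feasible to implement with current technology. The proposed method helps overcome shortcomings of current centralized electronic controller designs, such as highly customized software, poor reusability, difficult development of parallel real-time tasks, and low development efficiency.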

6.

Retargeting sequential image-processing programs for data parallel execution

Baumstark L.B. Jr. Wills L.M. 《IEEE transactions on pattern analysis and machine intelligence》2005,31(2):116-136

New compact, low-power implementation technologies for processors and imaging arrays can enable a new generation of portable video products. However, software compatibility with large bodies of existing applications written in C prevents more efficient, higher performance data parallel architectures from being used in these embedded products. If this software could be automatically retargeted explicitly for data parallel execution, product designers could incorporate these architectures into embedded products. The key challenge is exposing the parallelism that is inherent in these applications but that is obscured by artifacts imposed by sequential programming languages. This paper presents a recognition-based approach for automatically extracting a data parallel program model from sequential image processing code and retargeting it to data parallel execution mechanisms. The explicitly parallel model presented, called multidimensional data flow (MDDF), captures a model of how operations on data regions (e.g., rows, columns, and tiled blocks) are composed and interact. To extract an MDDF model, a partial recognition technique is used that focuses on identifying array access patterns in loops, transforming only those program elements that hinder parallelization, while leaving the core algorithmic computations intact. The paper presents results of retargeting a set of production programs to a representative data parallel processor array to demonstrate the capacity to extract parallelism using this technique. The retargeted applications yield a potential execution throughput limited only by the number of processing elements, exceeding thousands of instructions per cycle in massively parallel implementations.

7.

An FPGA-based embedded multi-core processor and parallelization of the SUSAN algorithm

王洁 张淑燕 刘涛 季振洲 胡铭曾 《计算机学报》2008,31(11)

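The architecture of FPEP, a four-core embedded parallel processor, is presented and an FPGA verification platform is built. To evaluate the performance of the multi-core platform, three feasible OpenMP-based parallelizations of SUSAN, a classic image processing algorithm, are proposed: direct parallelization of SUSAN, image block processing, and multi-image parallel processing. The three parallel algorithms are benchmarked on an Intel quad-core platform and on the FPEP FPGA verification platform. Experiments show that all three achieve a speedup close to 3.0 on both quad-core platforms, and that multi-image parallel processing achieves a speedup close to 4.0 on the FPEP FPGA verification platform.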

8.

Design space exploration of an open-source, IP-reusable, scalable floating-point engine for embedded applications

Claudio Fabio Claudio Davide Tapani Juha Fabio Jari 《Journal of Systems Architecture》2008,54(12):1143-1154

This paper describes an open-source and highly scalable floating-point unit (FPU) for embedded systems. Our FPU is fast and efficient, due to the high parallelism of its architecture: the functional units inside the datapath can operate in parallel and independently from each other. A comparison between different versions of the FPU has been made to highlight how performance scales accordingly. Logic synthesis results show that our FPU requires 105 Kgates and runs at 400 MHz on a low-power 90 nm standard-cell technology, and requires 20 K logic elements of an Altera Stratix FPGA, running at 67 MHz. The proposed FPU is supported by a software tool suite which compiles programs written in C/C++. A set of DSP and 3D graphics algorithms has been benchmarked, showing that with our FPU the number of clock cycles required to perform each algorithm is one order of magnitude smaller than that required by the corresponding software implementation.

9.

A co-processed contour tracing algorithm for a smart camera

Harald Jordan Walter van Dyck Rene Smodi? 《Journal of Real-Time Image Processing》2011,6(1):23-31

This paper describes a new approach for a contour-tracing algorithm targeting a low-power smart camera for industrial inspection. This embedded system consists of three major components: a CMOS sensor, an FPGA and a microprocessor. By analysing a linear-time algorithm used for simultaneously labelling connected components and their contours, two independent tasks could be identified. By efficiently assigning these two parts to the FPGA and the microprocessor, high-speed real-time operation becomes possible. The novelty of the proposed method is the development of a sequential co-processing algorithm for the FPGA. A Contour-Neighbourhood 3 × 3 filter kernel for converting the grey-level data to an intermediate representation containing directional information was added to an FPGA image-processing design. This pre-processed data is then provided to a software component which is executed on a microprocessor. The final result of this analysis is a sorted list of contour points for each object in the image. Further increases of the data throughput and the workload of the hardware resources are achievable by pipelining the subtasks of consecutive images. The runtime behaviour of this parallel operation is sufficient for meeting the real-time requirements of an industrial 2D measurement system.

10.

Design of an embedded safety PLC based on ARM+FPGA

李明时 马跃 尹震宇 《计算机系统应用》2017,26(3):225-229

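Because of their system architecture and limited processor performance, traditional PLC systems used for industrial control often suffer an inertial shutdown after running for some time, which disrupts production. This paper proposes an embedded safety PLC architecture based on a high-performance ARM+FPGA dual-processor design, which can greatly reduce the probability of system failure and improve the reliability of industrial control. The system comprises a hardware architecture and a software system. The hardware adopts a 1oo2D dual-channel heterogeneous redundant safety architecture: both channels are equipped with safety circuits, and a safety diagnostic circuit between the two processors uses cross-checking to determine whether the system is running normally. The software consists mainly of a compilation system and an execution system: the compilation system translates the PLC program into machine-executable code, also called object code, which the execution system then runs.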

11.

Low-power parallel VLSI architecture design for wavelet filters

兰旭光 郑南宁 薛建儒 王飞 刘跃虎 《计算机研究与发展》2005,42(11):1889-1895

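This paper proposes a low-power, parallel VLSI architecture design method, based on line-based processing and the lifting algorithm, for the forward and inverse discrete wavelet transform in JPEG2000 coding systems. The resulting architecture processes two rows of data at a time and time-multiplexes the row processor, so that processing is parallel both within the row processor and between the row and column processors, while the row buffer is minimized. Symmetric extension is implemented by embedded circuitry, and the whole architecture is optimized with pipelining, which speeds up the transform, raises hardware resource utilization, and lowers power consumption; efficiency reaches nearly 100%. The forward and inverse wavelet filter architecture has been verified on an FPGA and can be used as a standalone IP core in a JPEG2000 image codec chip under development.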
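To make the lifting idea concrete, here is a minimal one-dimensional sketch of a forward and inverse lifting step with symmetric boundary extension. It assumes the reversible 5/3 integer filter, one of the two wavelet filters JPEG2000 defines; the abstract does not say which filter the hardware implements, and the function names are hypothetical. The two-row scheduling and pipelining described in the abstract are not modeled.

```python
def lift53_forward(x):
    """One level of the reversible 5/3 lifting transform on a 1-D signal.

    Returns (s, d): low-pass (approximation) and high-pass (detail)
    integer coefficients. Assumes an even-length list of integers.
    """
    n = len(x)
    if n < 2 or n % 2:
        raise ValueError("expects an even-length signal of at least 2 samples")
    half = n // 2
    # Predict step: each odd sample minus the average of its even neighbours.
    d = []
    for i in range(half):
        left = x[2 * i]
        # Symmetric extension at the right edge: x[n] mirrors to x[n - 2].
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]
        d.append(x[2 * i + 1] - (left + right) // 2)
    # Update step: each even sample plus a rounded quarter of nearby details.
    s = []
    for i in range(half):
        # Symmetric extension at the left edge: d[-1] mirrors to d[0].
        prev = d[i - 1] if i > 0 else d[0]
        s.append(x[2 * i] + (prev + d[i] + 2) // 4)
    return s, d


def lift53_inverse(s, d):
    """Undo lift53_forward exactly by running the steps in reverse order."""
    half = len(s)
    n = 2 * half
    x = [0] * n
    # Undo the update step to recover the even samples.
    for i in range(half):
        prev = d[i - 1] if i > 0 else d[0]
        x[2 * i] = s[i] - (prev + d[i] + 2) // 4
    # Undo the predict step to recover the odd samples.
    for i in range(half):
        left = x[2 * i]
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]
        x[2 * i + 1] = d[i] + (left + right) // 2
    return x


if __name__ == "__main__":
    signal = [3, 7, 1, 8, 2, 9, 4, 4]
    low, high = lift53_forward(signal)
    print(low, high, lift53_inverse(low, high) == signal)
```

Each lifting step only adds or subtracts a function of already-known samples, so the inverse is obtained by replaying the steps backwards with the signs flipped, which is what makes the structure attractive for sharing one datapath between the forward and inverse transforms.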

12.

Design of a computer module based on a PowerPC host processor

郑波祥 刘莉 王学宝 《工业控制计算机》2009,22(3):46-47

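The PowerPC host processor MPC7447A is a high-performance, low-power 32-bit embedded processor. This article presents the design of a single-board computer based on the MPC7447A, details key points of the hardware design such as the processor node and the controller, and introduces the development method for the VxWorks BSP.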

13.

The Bottom-Up Implementation of One MILC Lattice QCD Application on the Cell Blade

Guochun Shi Volodymyr Kindratenko Steven Gottlieb 《International journal of parallel programming》2009,37(5):488-507

We report the results of the bottom-up implementation of one MILC lattice quantum chromodynamics (QCD) application on the Cell Broadband Engine™ processor. In our implementation, we preserve MILC’s framework for scaling the application to run on a large number of compute nodes and accelerate computationally intensive kernels on the Cell’s synergistic processor elements. Speedups of 3.4 × for the 8 × 8 × 16 × 16 lattice and 5.7 × for the 16 × 16 × 16 × 16 lattice are obtained when comparing our implementation of the MILC application executed on a 3.2 GHz Cell processor to the standard MILC code executed on a quad-core 2.33 GHz Intel Xeon processor. We provide an empirical model to predict application performance for a given lattice size. We also show that performance of the compute-intensive part of the application on the Cell processor is limited by the bandwidth between main memory and the Cell’s synergistic processor elements, whereas performance of the application’s parallel execution framework is limited by the bandwidth between main memory and the Cell’s power processor element.

14.

A two-dimensional Fourier transform with on-chip hardware acceleration on a Zynq SoC

曹力 陈龙 《单片机与嵌入式系统应用》2018,(2):36-40

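Because the two-dimensional Fourier transform is computationally intensive, it runs too slowly in embedded applications. This paper therefore experiments with an on-chip hardware-accelerated implementation on a Xilinx Zynq chip: the on-chip programmable logic performs the bulk of the computation in the transform, while the on-chip processor system handles data transfer and scheduling for the overall algorithm. This gains the speed advantage of FPGA parallel computation while keeping the flexibility of software development on the processor system. Using the one-dimensional fast Fourier transform IP core provided by Xilinx and the bus solution provided by Xillybus, the experiment implements the two-dimensional Fourier transform through combined hardware and software; compared with computation in OpenCV, the speed is significantly higher.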
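Building a two-dimensional transform out of a one-dimensional one relies on the row-column decomposition: transform every row, then transform every column of the result. The sketch below shows that decomposition in plain Python and checks it against the direct double-sum definition. It is only an illustration under stated assumptions: `dft_1d` is a naive O(n²) transform standing in for a 1-D FFT block, all function names are hypothetical, and nothing here reflects the actual Xilinx IP core or Xillybus interfaces.

```python
import cmath


def dft_1d(seq):
    """Naive 1-D discrete Fourier transform (stand-in for a 1-D FFT block)."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def dft_2d(image):
    """2-D DFT by row-column decomposition: 1-D pass over rows, then columns."""
    # Pass 1: transform each row.
    rows = [dft_1d(list(r)) for r in image]
    # Transpose so that columns become rows, then run pass 2.
    cols = [dft_1d(list(c)) for c in zip(*rows)]
    # Transpose back to the original orientation.
    return [list(r) for r in zip(*cols)]


def dft_2d_direct(image):
    """Reference 2-D DFT straight from the double-sum definition."""
    m, n = len(image), len(image[0])
    return [[sum(image[y][x]
                 * cmath.exp(-2j * cmath.pi * (u * y / m + v * x / n))
                 for y in range(m) for x in range(n))
             for v in range(n)]
            for u in range(m)]


if __name__ == "__main__":
    img = [[1, 2, 3, 4], [0, -1, 5, 2], [7, 7, 0, 1]]
    fast = dft_2d(img)
    ref = dft_2d_direct(img)
    print(max(abs(a - b) for ra, rb in zip(fast, ref) for a, b in zip(ra, rb)))
```

In a hardware/software split of this kind, the two 1-D passes are the natural candidates for acceleration, while the transposition between them is a data-movement task, which matches the division of work the abstract describes between programmable logic and the processor system.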

15.

FASA: A software architecture and runtime framework for flexible distributed automation systems

《Journal of Systems Architecture》2015,61(2):82-111

Modern automation systems have to cope with large amounts of sensor data to be processed, stricter security requirements, heterogeneous hardware, and an increasing need for flexibility. The challenges for tomorrow’s automation systems need software architectures of today’s real-time controllers to evolve. This article presents FASA, a modern software architecture for next-generation automation systems. FASA provides concepts for scalable, flexible, and platform-independent real-time execution frameworks, which also provide advanced features such as software-based fault tolerance and high degrees of isolation and security. We show that FASA caters for robust execution of time-critical applications even in parallel execution environments such as multi-core processors. We present a reference implementation of FASA that controls a magnetic levitation device. This device is sensitive to any disturbance in its real-time control and thus provides a suitable validation scenario. Our results show that FASA can sustain its advanced features even in high-speed control scenarios at 1 kHz.

16.

An improved convolutional neural network recognition model based on an embedded SoC

孙磊 肖金球 夏禹 顾敏明 《计算机应用与软件》2020,37(3):257-260

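Implementing convolutional neural network models on FPGAs is hampered by the heavy resource consumption of convolution and the high cost of moving to a more capable FPGA chip. To address this, an improved optimization design method based on an embedded SoC is proposed. The implementation of the convolution computation and the memory access channels are optimized to improve parallel computing performance; 32-bit floating-point numbers are quantized to 16-bit fixed-point numbers to speed up data transfer during forward propagation; and high-level synthesis is used to map the convolutional neural network onto the hardware platform as a synchronous dataflow model, which accelerates computation. Experiments show that the scheme saves 89% of BRAM and 72% of LUTs compared with existing designs, and that at a 100 MHz operating frequency it is 42 times faster than a solution using the Cortex-A9 alone.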
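The float-to-fixed-point step mentioned in this abstract can be sketched as follows. The abstract states only that 32-bit floats become 16-bit fixed-point values; the Q8.8 split (8 fractional bits), the saturation behaviour, and all function names below are assumptions chosen for illustration, not the authors' format.

```python
def to_fixed16(value, frac_bits=8):
    """Quantize a float to a signed 16-bit fixed-point integer.

    frac_bits=8 gives a Q8.8 format, an assumed split for this sketch.
    Values outside the representable range saturate instead of wrapping.
    """
    scaled = round(value * (1 << frac_bits))
    return max(-32768, min(32767, scaled))


def from_fixed16(q, frac_bits=8):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)


def fixed_mac(weights_q, inputs_q, frac_bits=8):
    """Multiply-accumulate on fixed-point operands, as in one convolution tap sum.

    Each product carries 2 * frac_bits fractional bits, so the wide
    accumulator is shifted right once at the end to restore the format.
    """
    acc = 0
    for w, x in zip(weights_q, inputs_q):
        acc += w * x
    return acc >> frac_bits


if __name__ == "__main__":
    weights = [to_fixed16(v) for v in (0.5, 0.25)]
    inputs = [to_fixed16(v) for v in (2.0, 4.0)]
    print(from_fixed16(fixed_mac(weights, inputs)))
```

Halving the word width halves the bytes moved per value, which is the transfer benefit the abstract points to; the price is a quantization error of at most half a step (2^-9 for Q8.8) on values that fit within the range.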

17.

Flexible VLIW processor based on FPGA for efficient embedded real-time image processing

Vincent Brost Fan Yang Charles Meunier 《Journal of Real-Time Image Processing》2014,9(1):47-59

Modern field programmable gate array (FPGA) chips, with their larger memory capacity and reconfigurability potential, are opening new frontiers in rapid prototyping of embedded systems. With the advent of high-density FPGAs, it is now possible to implement a high-performance VLIW (very long instruction word) processor core in an FPGA. With a VLIW architecture, processor effectiveness depends on the ability of compilers to extract sufficient ILP (instruction-level parallelism) from program code. This paper describes research results on enabling the VLIW processor model for real-time processing applications by exploiting FPGA technology. Our goals are to keep the flexibility of processors to shorten the development cycle, and to use the powerful FPGA resources to increase real-time performance. We present a flexible VLIW VHDL processor model with a variable instruction set and a customizable architecture, which allows the intrinsic parallelism of a target application to be exploited using advanced compiler technology and implemented in an optimal manner on an FPGA. Some common image processing algorithms were tested and validated using the proposed development cycle. We also realized the rapid prototyping of embedded contactless palmprint extraction on a Virtex-6 FPGA-based board for a biometric application and obtained a processing time of 145.6 ms per image. Our approach applies some criteria for co-design tools: flexibility, modularity, performance, and reusability.

18.

Design and implementation of a Zynq-based image corner and edge detection system

潘青松 张怡 杨宗明 秦剑秀 《计算机科学》2017,44(Z11):530-533, 556

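The whole system is designed and implemented on a Zynq chip using hardware/software co-design. The Zynq chip has a heterogeneous ARM+FPGA architecture, combining the flexibility of an ARM processor with the parallel processing capability of an FPGA. The design takes full advantage of the chip: in the hardware/software partition, the ARM processor handles image acquisition, while corner and edge detection are done in the FPGA, so that hardware acceleration raises overall system performance. The ARM processor and the FPGA exchange data over the AXI4 bus, yielding a system-on-chip on the Zynq that integrates image acquisition, image feature extraction, and image display. Final system tests show that the hardware-accelerated feature extraction algorithms run 6 to 8 times faster than the same algorithms implemented in software on the ARM processor.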

19.

An embedded system for knowledge-based cost evaluation of molded parts

《Knowledge》2007,20(3):291-299

In this paper, we present an embedded knowledge-base system and apply it to estimate the manufacturing cost of molded parts. This system is designed on a tiny single board computer, called the “Gumstix™”, which is an inexpensive and high-performance miniaturized platform. It consists of a knowledge-base, knowledge processing units and server service unit for user interactions, all of which are implemented on the Gumstix computer. A CAD server is provided for the interaction with commercial CAD software and all operations with users are performed via a web browser. This hardware and software structure features a realization of the Plug & Play concept and provides low-cost, low-power consumption, high-performance and high portability of complex engineering software. The system is demonstrated with cost estimation of molded parts. A knowledge set for the injection molding process is formalized in XML and a knowledge-base is constructed. The system is tested with examples, which illustrate the capability of our system for engineering applications.

20.

Concurrent warp execution: improving performance of GPU-likely SIMD architecture by increasing resource utilization

Hong Jun Choi Dong Oh Son Jong Myon Kim Cheol Hong Kim 《The Journal of supercomputing》2014,69(1):330-356

Hardware parallelism should be exploited to improve the performance of computing systems. Single instruction multiple data (SIMD) architecture has been widely used to maximize the throughput of computing systems by exploiting hardware parallelism. Unfortunately, branch divergence due to branch instructions causes underutilization of computational resources, resulting in performance degradation of SIMD architecture. The graphics processing unit (GPU) is a representative parallel architecture based on SIMD architecture. In recent computing systems, GPUs can process general-purpose applications as well as graphics applications with the help of convenient APIs. However, unlike graphics applications, general-purpose applications include many branch instructions, resulting in serious performance degradation of the GPU due to branch divergence. In this paper, we propose a concurrent warp execution (CWE) technique to reduce the performance degradation of the GPU in executing general-purpose applications by increasing resource utilization. The proposed CWE enables selecting co-warps to activate more threads in the warp, leading to concurrent execution of combined warps. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85 % over PDOM, 91 % over DWF) with little hardware overhead.