1.

Hardware–software optimizations of reconfigurable multi-core processors for floating-point computations of large sparse matrices

Xiaofang Wang 《Journal of Real-Time Image Processing》2014,9(1):187-204

State-of-the-art field-programmable gate array (FPGA) technologies have provided exciting opportunities to develop more flexible, less expensive, and better-performing floating-point computing platforms for embedded systems. To better harness the full power of FPGAs and to bring FPGAs to more system designers, we investigate unique advantages and optimization opportunities in both software and hardware offered by multi-core processors on a programmable chip (MPoPCs). In this paper, we present our hardware customization and software dynamic scheduling solutions for LU factorization of large sparse matrices on in-house developed MPoPCs. Theoretical analysis is provided to guide the design. Implementation results on an Altera Stratix III FPGA for five benchmark matrices of size up to 7,917 × 7,917 are presented. Our hardware customization alone can reduce the execution time by up to 17.22%. The integrated hardware–software optimization improves the speedup by an average of 60.30%.

2.

FPGA-based detection of SIFT interest keypoints

Leonardo Chang José Hernández-Palancar L. Enrique Sucar Miguel Arias-Estrada 《Machine Vision and Applications》2013,24(2):371-392

The use of local features in images has become very popular due to their promising results. They have shown significant benefits in a variety of applications such as object recognition, image retrieval, robot navigation, panorama stitching, and others. SIFT is one of the local feature methods that have shown the best results. Among its main disadvantages is its high computational cost. In order to speed up this algorithm, this work proposes the design and implementation of an efficient hardware architecture based on FPGAs for SIFT interest point detection. In order to take full advantage of the parallelism in this algorithm and to minimize the device area occupied by its implementation in hardware, part of the algorithm was reformulated. The main contribution of the hardware architecture proposed in this paper, and the main difference from the other architectures reported in the literature, is that as the number of octaves to be processed is increased, the amount of occupied device area remains almost constant. The evaluations and experiments on the architecture support this contribution, as well as the accuracy, repeatability, and distinctiveness of the results. Experiments also showed the device area occupation and time constraints of the hardware implementation. The architecture presented in this paper is able to detect interest points in an image of 320 × 240 in 11 ms, which represents a speedup of 250× with respect to a software implementation.

3.

A real-time near infrared image acquisition system based on image quality assessment

Y. H. Lee M. Khalil-Hani Rabia Bakhteri Vishnu P. Nambiar 《Journal of Real-Time Image Processing》2017,13(1):103-120

This paper presents a real-time image acquisition system with an improved image quality assessment module to acquire high-quality near infrared (NIR) images. Thermal imaging plays a vital role in a wide range of medical and military applications. The demand for high-throughput image acquisition and image processing has continuously increased, especially for critical medical and military purposes where execution under real-time constraints is required. This work implements an NIR image quality assessment module, which utilizes improved two-dimensional entropy and mask-based edge detection algorithms. The effectiveness of the proposed image quality assessment algorithms is demonstrated through the implementation of a complete finger-vein biometric system. The proposed model is implemented as an embedded system on a field programmable gate array prototyping platform. By including the image quality assessment module, the proposed system is able to achieve a recognition accuracy of 0.87% equal error rate, and can handle real-time processing at 15 frames/s (live video rate). This is achieved through hardware acceleration of the proposed image quality assessment algorithms via a novel streaming architecture.

4.

Fast Gabor texture feature extraction with separable filters using GPU

Wai-Man Pang Kup-Sze Choi Jing Qin 《Journal of Real-Time Image Processing》2016,11(1):5-25

The Gabor wavelet transform is one of the most effective texture feature extraction techniques and has resulted in many successful practical applications. However, real-time applications cannot benefit from this technique because of the high computational cost arising from the large number of small-sized convolutions, which require over 10 min to process an image of 256 × 256 pixels on a dual-core CPU. As the computation in Gabor filtering is parallelizable, it is possible and beneficial to accelerate the feature extraction process using a GPU. Conventionally, this can be achieved simply by accelerating the 2D convolution directly, or by expediting the CPU-efficient FFT-based 2D convolution. However, the latter approach, when implemented with small-sized Gabor filters, cannot fully exploit the parallel computation power of the GPU due to the architecture of graphics hardware. This paper proposes a novel approach tailored for GPU acceleration of the texture feature extraction algorithm by using separable 1D Gabor filters to approximate the non-separable Gabor filter kernels. Experimental results show that the approach improves the timing performance significantly with minimal error introduced. The method is specifically designed and optimized for the Compute Unified Device Architecture (CUDA) and is able to achieve a speed of 16 fps on modest graphics hardware for an image of 256² pixels and a filter kernel of 32² pixels. It is potentially applicable to real-time applications in areas such as motion tracking and medical image analysis.
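The cost saving from separable filtering mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes a kernel that is exactly the outer product of two 1D kernels; the tap values are illustrative integers, not Gabor coefficients, and the paper's step of approximating a non-separable Gabor kernel is not reproduced here. The function names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Direct 2D correlation over the 'valid' region (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - kh + 1):
        row = []
        for x in range(w - kw + 1):
            acc = 0
            for j in range(kh):
                for i in range(kw):
                    acc += image[y + j][x + i] * kernel[j][i]
            row.append(acc)
        out.append(row)
    return out


def conv_separable_valid(image, col_k, row_k):
    """Same result for kernel = outer(col_k, row_k), as two 1D passes."""
    h, w = len(image), len(image[0])
    kh, kw = len(col_k), len(row_k)
    # Vertical pass: filter each column with col_k.
    tmp = [[sum(image[y + j][x] * col_k[j] for j in range(kh))
            for x in range(w)]
           for y in range(h - kh + 1)]
    # Horizontal pass: filter each row of the intermediate with row_k.
    return [[sum(tmp[y][x + i] * row_k[i] for i in range(kw))
             for x in range(w - kw + 1)]
            for y in range(len(tmp))]


col_k = [1, 2, 1]    # illustrative taps (assumption), not Gabor values
row_k = [1, 0, -1]
kernel = [[c * r for r in row_k] for c in col_k]   # outer product
image = [[(3 * y + 5 * x) % 7 for x in range(6)] for y in range(5)]

full = conv2d_valid(image, kernel)
fast = conv_separable_valid(image, col_k, row_k)
```

For a k × k kernel the direct form needs k² multiplications per output pixel, while the two-pass form needs 2k, which is where the saving for a 32 × 32 kernel would come from.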

5.

Hardware real-time face detection based on the AdaBoost algorithm

唐奇 苏光大 《计算机工程》2008,34(7):248-250

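A hardware face detection system is implemented. The system runs at 70 MHz and can detect faces at any position, of any size, and in any number within a 256×256 image, at a detection speed of 35 frames per second. The detection accuracy is 85%, with a false detection rate of 1.5×10⁻⁶. To achieve high-speed face detection, three innovations are made in the hardware architecture: real-time hardware computation of the integral image and the integral squared image, a deeply pipelined implementation of the weak-classifier feature value computation, and a parallel multi-memory organization.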
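The integral image and integral squared image named in this abstract can be sketched in software as follows. This is a minimal Python illustration of the general summed-area-table technique, not the authors' hardware design; the function names and the zero-padded table layout are assumptions made for the example.

```python
def integral_images(img):
    """Return (ii, ii2): summed-area tables of pixel values and of their squares.

    Each table has one extra zero row and column, so ii[y][x] holds the
    sum of img over rows 0..y-1 and columns 0..x-1.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    ii2 = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = row_sq = 0
        for x in range(w):
            p = img[y][x]
            row_sum += p
            row_sq += p * p
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
            ii2[y + 1][x + 1] = ii2[y][x + 1] + row_sq
    return ii, ii2


def box_sum(table, top, left, height, width):
    """Sum over a rectangle using four table lookups."""
    b, r = top + height, left + width
    return table[b][r] - table[top][r] - table[b][left] + table[top][left]


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii, ii2 = integral_images(img)
```

Any rectangular feature sum then costs four lookups regardless of its size, and the squared table gives the sum of squares needed to normalize a detection window by its variance.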

6.

Implementing public-key cryptography on passive RFID tags is practical

Alex Arbit Yoel Livne Yossef Oren Avishai Wool 《International Journal of Information Security》2015,14(1):85-99

Passive radio-frequency identification (RFID) tags have long been thought to be too weak to implement public-key cryptography: it is commonly assumed that the power consumption, gate count and computation time of full-strength encryption exceed the capabilities of RFID tags. In this paper, we demonstrate that these assumptions are incorrect. We present two low-resource implementations of a 1,024-bit Rabin encryption variant called WIPR, one in embedded software and one in hardware. Our experiments with the software implementation show that the main performance bottleneck of the system is not the encryption time but rather the air interface, and that the reader’s implementation of the Electronic Product Code Class-1 Generation-2 RFID standard has a crucial effect on the system’s overall performance. Next, using a highly optimized hardware implementation, we investigate the trade-offs between speed, area and power consumption to derive a practical working point for a hardware implementation of WIPR. Our recommended implementation has a data-path area of 4,184 gate equivalents, an encryption time of 180 ms and an average power consumption of 11 μW, well within the established operating envelope for passive RFID tags.

7.

Fast implementation of dense stereo vision algorithms on a highly parallel SIMD architecture

Fouzhan Hosseini Amir Fijany Saeed Safari Jean-Guy Fontaine 《Journal of Real-Time Image Processing》2013,8(4):421-435

In this paper, we present a faster-than-real-time implementation of a class of dense stereo vision algorithms on a low-power massively parallel SIMD architecture, the CSX700. With two cores, each with 96 Processing Elements, this SIMD architecture provides a peak computation power of 96 GFLOPS while consuming only 9 Watts, making it an excellent candidate for embedded computing applications. Exploiting the full features of this architecture, we have developed schemes for an efficient parallel implementation with minimal overhead. For the sum of squared differences (SSD) algorithm and for VGA (640 × 480) images with disparity ranges of 16 and 32, we achieve a performance of 179 and 94 frames per second (fps), respectively. For HDTV (1,280 × 720) images with disparity ranges of 16 and 32, we achieve a performance of 67 and 35 fps, respectively. We have also implemented more accurate, and hence more computationally expensive, variants of the SSD, and for most cases, particularly for VGA images, we have achieved faster-than-real-time performance. Our results clearly demonstrate that, by developing careful parallelization schemes, the CSX architecture can provide excellent performance and flexibility for various embedded vision applications.
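For readers unfamiliar with SSD stereo matching, the following minimal Python sketch shows the basic winner-takes-all form on rectified grayscale images. It is a sequential reference under stated assumptions (a square window, disparities searched from 0 to max_disp - 1, borders left at 0) and does not reflect the paper's SIMD parallelization; the function name and the synthetic test images are hypothetical.

```python
def ssd_disparity(left, right, max_disp, half=1):
    """Winner-takes-all disparity map from the sum of squared differences.

    left/right: rectified grayscale images as lists of rows.
    A pixel at column x in the left image is compared with column x - d
    in the right image. Border pixels where the window does not fit stay 0.
    """
    h, w = len(left), len(left[0])
    disp = [[0] * w for _ in range(h)]
    for y in range(half, h - half):
        for x in range(half, w - half):
            best_cost, best_d = None, 0
            for d in range(max_disp):
                if x - d - half < 0:
                    break  # window would leave the right image
                cost = 0
                for dy in range(-half, half + 1):
                    for dx in range(-half, half + 1):
                        diff = left[y + dy][x + dx] - right[y + dy][x + dx - d]
                        cost += diff * diff
                if best_cost is None or cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y][x] = best_d
    return disp


# Synthetic pair (assumption): the left view is the right view shifted by 2.
h, w = 4, 10
right = [[x * x + 3 * y for x in range(w)] for y in range(h)]
left = [[right[y][x - 2] if x >= 2 else 0 for x in range(w)] for y in range(h)]
disp = ssd_disparity(left, right, max_disp=4)
```

The inner loops are independent per pixel and per disparity, which is the property that makes the algorithm a natural fit for wide SIMD hardware.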

8.

A soft multi-core architecture for edge detection and data analysis of microarray images

George Kornaros 《Journal of Systems Architecture》2010,56(1):48-62

As configurable processing advances, elements from the traditional approaches of both hardware and software development can be combined by incorporating customized, application-specific computational resources into the processor’s architecture, especially in the case of field-programmable-gate-array-based systems with soft-processors, so as to enhance the performance of embedded applications. This paper explores the use of several different microarchitectural alternatives to increase the performance of edge detection algorithms, which are of fundamental importance for the analysis of DNA microarray images. Optimized application-specific hardware modules are combined with efficient parallelized software in an embedded soft-core-based multi-processor. It is demonstrated that the performance of one common edge detection algorithm, namely Sobel, can be boosted remarkably. By exploiting the architectural extensions offered by the soft-processor, in conjunction with the execution of carefully selected application-specific instruction-set extensions on a custom-made accelerating co-processor connected to the processor core, we introduce a new approach that makes this methodology noticeably more efficient across various applications from the same domain, which are often similar in structure. With flexibility to update the processing algorithms, an improvement reaching one order of magnitude over all-software solutions could be obtained. In support of this flexibility, an effective adaptation of this approach is demonstrated which performs real-time analysis of extracted microarray data; the proposed reconfigurable multi-core prototype has been exploited with minor changes to achieve almost 5× speedup.

9.

FPGA implementation of an improved fast video dehazing algorithm

庞宇 吴天次 王元发 贾美平 周前能 《计算机应用研究》2024,41(6)

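Endoscope dehazing algorithms are widely used in the medical field, providing clinicians with clear, real-time images. Although dehazing technology has made considerable progress, dehazing algorithms are highly complex and difficult to implement in hardware in demanding settings such as endoscopy. To achieve real-time endoscope dehazing in hardware, the dark channel prior algorithm is improved to reduce hardware resource consumption and time complexity. The improved algorithm selects a hardware-friendly atmospheric light estimate and transmission compensation value, and uses a pipelined structure to process hazy images. The hardware circuit is implemented on a Xilinx ZYNQ7020 and processes 640×480 video in real time at up to 260 fps, while consuming only 1.28K LUTs and 619 registers. Experimental results show that, compared with the traditional algorithm, the improved algorithm offers fast processing, low power consumption, and good portability, meeting the endoscope requirement for real-time video processing.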
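The dark channel at the core of the dark channel prior can be sketched as below. This is a minimal Python illustration of the standard definition (minimum over the color channels, then minimum over a local window), assuming a square window clipped at the image borders; it does not include the paper's modified atmospheric light estimate or transmission compensation, and the function name and sample pixels are hypothetical.

```python
def dark_channel(img, half=1):
    """Per-pixel minimum over RGB and over a (2*half+1)^2 window.

    img: rows of (r, g, b) tuples. The window is clipped at the borders.
    """
    h, w = len(img), len(img[0])
    # Step 1: minimum across the three color channels.
    min_rgb = [[min(px) for px in row] for row in img]
    # Step 2: minimum filter over the local window.
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - half), min(h, y + half + 1))
            xs = range(max(0, x - half), min(w, x + half + 1))
            row.append(min(min_rgb[j][i] for j in ys for i in xs))
        out.append(row)
    return out


img = [[(200, 180, 190), (90, 120, 100), (210, 220, 230)],
       [(150, 160, 170), (60, 80, 70), (240, 250, 245)]]
point_wise = dark_channel(img, half=0)
windowed = dark_channel(img, half=1)
```

In haze-free regions the dark channel tends toward zero, so large values indicate haze; the window minimum is the part that maps naturally onto a line-buffered hardware pipeline.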

10.

Hardware accelerator for solving 0–1 knapsack problems using binary harmony search

Mohammed El-Shafei Imtiaz Ahmad 《International Journal of Parallel, Emergent and Distributed Systems》2018,33(1):87-102

The 0–1 knapsack problem (KP) is a well-known intractable optimization problem with a wide range of applications. Harmony Search (HS) is one of the most popular metaheuristic algorithms used to successfully solve 0–1 KPs. Nevertheless, metaheuristic algorithms are generally compute-intensive and slow when implemented in software. In this paper, we present an FPGA-based pipelined hardware accelerator to reduce computation time for solving large-dimension 0–1 KPs using the Binary Harmony Search algorithm. The proposed architecture exploits the intrinsic parallelism of population-based metaheuristic algorithms and the flexibility and parallel processing capabilities of FPGAs to perform the computation concurrently, thus enhancing performance. To validate the efficiency of the proposed hardware accelerator, experiments were conducted using a large number of 0–1 KPs. Comparative analysis of the experimental results reveals that the proposed approach offers promising speedups of 51× – 111× as compared with a software implementation and 2× – 5× as compared with a hardware implementation of the Binary Particle Swarm Optimization algorithm.

11.

A pipelined array architecture for Euclidean distance transformation and its FPGA implementation

《Microprocessors and Microsystems》2005,29(8-9):405-410

The Euclidean Distance Transform (EDT) is an important tool in image analysis and machine vision. This paper provides an area-efficient hardware solution to the computation of the EDT on a binary image. An O(n) hardware algorithm for computing the EDT of an n×n image is presented. A pipelined 2D array architecture for hardware implementation is designed. The architecture has a regular structure with locally connected identical processing elements. Further, pipelining reduces hardware resources. Such an array architecture is easily scalable to handle images of different sizes and is suitable for implementation on reconfigurable devices like FPGAs. Results of the FPGA-based implementation show that the hardware can process about 6000 images of size 512×512 per second, which is much higher than the video rate of 30 frames per second.

12.

High performance FPGA-based image correlation

Almudena Lindoso Luis Entrena 《Journal of Real-Time Image Processing》2007,2(4):223-233

Image correlation is widely used for image and picture processing. Typical applications of image correlation are object location, image registration and sub-image similarity measurement. However, image correlation requires the comparison of a large number of sub-images, implying a large computational effort that may prevent its use for real-time applications. On the other hand, correlation computation is very well suited for FPGA implementations. In this work we present efficient architectures for the implementation of Zero-Mean Normalized Cross-Correlation using FPGAs with application to image correlation. In particular, we compare the implementations of correlation in the spatial and spectral domains. Experimental results demonstrate that FPGAs improve performance by at least two orders of magnitude with respect to software implementations on a modern personal computer. This speed-up makes the performance of correlation computation suitable for real-time image processing. The proposed architectures have been applied to a correlation-based fingerprint-matching algorithm, demonstrating that real-time processing requirements can be well satisfied with an FPGA-based implementation.
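The Zero-Mean Normalized Cross-Correlation (ZNCC) score referred to in this abstract can be written down directly. The Python sketch below is a spatial-domain reference for a single pair of equally sized patches; returning 0.0 for a flat patch is a convention chosen for this example, and the function name and sample patches are hypothetical.

```python
import math


def zncc(a, b):
    """Zero-mean normalized cross-correlation of two equally sized patches."""
    fa = [p for row in a for p in row]
    fb = [p for row in b for p in row]
    n = len(fa)
    ma, mb = sum(fa) / n, sum(fb) / n
    # Correlate the mean-removed patches.
    num = sum((x - ma) * (y - mb) for x, y in zip(fa, fb))
    da = sum((x - ma) ** 2 for x in fa)
    db = sum((y - mb) ** 2 for y in fb)
    if da == 0 or db == 0:
        return 0.0  # flat patch: score undefined, report 0 by convention
    return num / math.sqrt(da * db)


patch = [[1, 2], [3, 5]]
brighter = [[2 * p + 10 for p in row] for row in patch]   # gain and offset change
inverted = [[-p for p in row] for row in patch]
```

Because the mean is removed and the result is normalized, the score is unchanged by a brightness offset or a positive gain, which is why it suits matching under varying illumination. Sliding this computation over every candidate position is what creates the large workload the abstract describes.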

13.

Genetic Algorithm for Boolean minimization in an FPGA cluster

César Pedraza Javier Castillo José I. Martínez Pablo Huerta Jose L. Bosque Javier Cano 《The Journal of supercomputing》2011,58(2):244-252

Evolutionary algorithms are an alternative for Boolean synthesis because they allow one to create hardware structures that could not be obtained with other techniques. This paper presents a parallel genetic programming (PGP) Boolean synthesis implementation based on a cluster of FPGAs that takes full advantage of parallel programming and hardware/software co-design techniques. The performance of our FPGA cluster implementation has been compared with an HPC implementation. The experimental results show excellent behavior in terms of speedup (up to 500×) and in terms of solving the scalability problems that these algorithms exhibited in previous works.

14.

Design and implementation of a realtime co-processor for denoising Fiber Optic Gyroscope signal

Rangababu Peesapati Samrat L. Sabat Kiran Kumar Anumandla Palani Karthik Kandyala Jagannath Nayak 《Digital Signal Processing》2013,23(5):1813-1825

The amount of noise present in the Fiber Optic Gyroscope (FOG) signal limits its applications and has a negative impact on navigation systems. Existing algorithms such as the Discrete Wavelet Transform (DWT) and the Kalman Filter (KF) denoise the FOG signal in a static environment; however, they fail in a dynamic environment. Therefore, in this paper an Adaptive Moving Average Dual Mode Kalman Filter (AMADMKF) is developed for denoising the FOG signal in both static and dynamic environments. The performance of the proposed algorithm is compared with the DWT and KF techniques. Further, a hardware Intellectual Property (IP) core of the algorithm is developed for System on Chip (SoC) implementation using a Xilinx Virtex-5 Field Programmable Gate Array (Virtex-5FX70T-1136). The developed IP is interfaced as a co-processor/Auxiliary Processing Unit (APU) with the PowerPC (PPC440) embedded processor of the FPGA. It is shown that the proposed system is an efficient solution for denoising the FOG signal in a real-time environment. The hardware acceleration of the developed co-processor is 65× with respect to the equivalent software implementation of the AMADMKF algorithm on the PPC440 embedded processor.

15.

A system on chip for melanoma detection using FPGA-based SVM classifier

《Microprocessors and Microsystems》2019

Support Vector Machine (SVM) is a robust machine learning model that shows high accuracy on different classification problems, and is widely used for various embedded applications. However, implementation of embedded SVM classifiers is challenging, due to the inherently complicated computations required. This motivates implementing the SVM on hardware platforms to achieve high-performance computing at low cost and power consumption. Melanoma is the most aggressive form of skin cancer and increases the mortality rate. We aim to develop an optimized embedded SVM classifier dedicated to a low-cost handheld device for early detection of melanoma in primary healthcare. In this paper, we propose a hardware/software co-design for implementing the SVM classifier on an FPGA to realize melanoma detection on a chip. The SVM implemented on a recent hybrid FPGA (Zynq) platform using the modern UltraFast High-Level Synthesis design methodology achieves efficient melanoma classification on chip. The hardware implementation results demonstrate a classification accuracy of 97.9% and a significant hardware acceleration rate of 21, with only 3% resource utilization and 1.69 W of power consumption. These results show that the implemented system on chip meets the crucial embedded system constraints of high performance and low resource utilization, power consumption, and cost, while achieving efficient classification with high classification accuracy.

16.

FPGA implementation of semi-fragile reversible watermarking by histogram bin shifting in real time

Sambaran Hazra Sudip Ghosh Sayandip De Hafizur Rahaman 《Journal of Real-Time Image Processing》2018,14(1):193-221

In this paper, a field programmable gate array (FPGA) implementation of a reversible watermarking (RW) algorithm by histogram bin shifting (HBS) is presented, which can be used for real-time applications involving medical and military images. As the tolerance level of distortion has to be minimal, an RW scheme is necessary here. The reversible mode of the process helps in extracting both the original image and the watermark at the receiver end after the embedding and decoding procedures. The embedded watermark contains the underlying security information of the host images in case of any infringement. Although software implementations of RW schemes are available, very few attempts have been made at hardware realizations of such schemes. The inherent delay associated with software implementations can be minimized by using on-chip hardware that performs the watermarking process immediately after capturing the image in real time. In this paper, the embedding and decoding procedures involved in the watermarking scheme are implemented using Xilinx System Generator, and a detailed design and analysis of the hardware architectures required for the embedding and extraction processes is carried out. The device utilization results for both the embedding and decoding processes are low and practically viable. The maximum operating frequencies for the embedding and extraction processes are 445.330 and 201.824 MHz, respectively, which is an improvement over similar existing research work in the literature. The power consumption of the embedding and extraction processes is found to be 1.215 and 0.104 W, respectively. To the best of our knowledge, this is the first FPGA-based hardware implementation of RW by HBS.
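The histogram bin shifting idea behind this abstract can be sketched in a few lines of Python. This is the generic textbook scheme (one peak bin, one empty bin above it), shown on a flat list of pixel values; it is an assumption-laden illustration rather than the semi-fragile variant or the hardware architecture of the paper. The function names are hypothetical, and the peak and zero bins are assumed to reach the receiver as side information.

```python
def hbs_embed(pixels, bits, levels=256):
    """Embed bits by shifting the histogram between the peak bin and an empty bin."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    peak = max(range(levels), key=lambda v: hist[v])
    zero = next((v for v in range(peak + 1, levels) if hist[v] == 0), None)
    if zero is None:
        raise ValueError("no empty bin above the peak")
    if len(bits) > hist[peak]:
        raise ValueError("payload exceeds capacity")
    payload = iter(bits)
    out = []
    for p in pixels:
        if peak < p < zero:
            out.append(p + 1)                  # shift right to free bin peak+1
        elif p == peak:
            out.append(p + next(payload, 0))   # carry one bit, pad with 0
        else:
            out.append(p)
    return out, peak, zero


def hbs_extract(marked, peak, zero, n_bits):
    """Recover the payload and restore the original pixels exactly."""
    bits, restored = [], []
    for p in marked:
        if p == peak or p == peak + 1:
            bits.append(p - peak)
            restored.append(peak)
        elif peak + 1 < p <= zero:
            restored.append(p - 1)             # undo the shift
        else:
            restored.append(p)
    return bits[:n_bits], restored


pixels = [3, 3, 4, 3, 5, 2, 3, 7]
marked, peak, zero = hbs_embed(pixels, [1, 0, 1], levels=8)
bits_out, restored = hbs_extract(marked, peak, zero, n_bits=3)
```

Each pixel changes by at most one gray level, and the capacity equals the number of pixels in the peak bin. Both passes are simple per-pixel comparisons and increments, which is consistent with the low device utilization reported above.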

17.

CA-MPSoC: An automated design flow for predictable multi-processor architectures for multiple applications

A. Shabbir A. Kumar S. Stuijk B. Mesman H. Corporaal 《Journal of Systems Architecture》2010,56(7):265-277

Future embedded systems demand multi-processor designs to meet real-time deadlines. The large number of applications in these systems generates an exponential number of use-cases. The key design automation challenges are designing systems for these use-cases and fast exploration of software and hardware implementation alternatives with accurate performance evaluation of these use-cases. These challenges cannot be overcome by current design methodologies, which are semi-automated, time consuming and error prone. In this paper, we present a fully automated design flow to generate communication assist (CA) based multi-processor systems (CA-MPSoC). A worst-case performance model of our CA is proposed so that the performance of the CA-based platform can be analyzed before its implementation. The design flow provides performance estimates and timing guarantees for both hard real-time and soft real-time applications, provided the task-to-processor mappings are given by the user. The flow automatically generates a super-set hardware that can be used in all use-cases of the applications. The software for each of these use-cases is also generated, including the configuration of the communication architecture and interfacing with application tasks. CA-MPSoC has been implemented on Xilinx FPGAs for evaluation. Further, it is made available on-line for the benefit of the research community, and in this paper it is used for performance analysis of two real-life applications, Sobel and a JPEG encoder, executing concurrently. The CA-based platform generated by our design flow records a maximum error of 3.4% between analyzed and measured periods. Our tool can also merge use-cases to generate a super-set hardware which accelerates the evaluation of these use-cases. In a case study with six applications, the use-case merging results in a speedup of 18 when compared to the case where each use-case is evaluated individually.

18.

Performance‐steered design of software architectures for embedded multicore systems

Alessio Bechini Cosimo Antonio Prete 《Software》2002,32(12):1155-1173

19.

Design of a ZYNQ-based high-definition image display and detection system

林振钰 张志杰 刘佳琪 《计算机测量与控制》2021,29(2):30-34

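To address the slow front-end acquisition and the difficulty of accelerating image processing algorithms in current ARM- and DSP-based embedded image processing systems, a full-HD (1920×1080) real-time video acquisition and image processing system based on an HDMI interface is designed. A 5-megapixel-class CMOS camera serves as the front-end data source, and the main chip uses a heterogeneous ARM+FPGA architecture that combines the parallel processing capability of the FPGA with the task scheduling of the ARM processor. A custom IP core for data storage and transfer is designed on the AXI protocol to maximize processing speed and bandwidth. An HLS tool is used to quickly package the image preprocessing algorithms into IP cores, implementing hardware acceleration of the image algorithms on the FPGA and completing the design of a prototype image processing platform. Compared with a conventional machine vision platform built from a PC and a camera, the system's average run time is within 10 ms and its real-time detection performance is satisfactory. It effectively resolves the conflict between low power consumption on the one hand and high data bandwidth and processing speed on the other, and provides good support for back-end result analysis and edge acceleration.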

20.

Research on a parallel computing structure for Siamese network tracking algorithms

卢金仪 唐维伟 徐文辉 颜露新 钟胜 邹旭 《测控技术》2021,40(3):39-45

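Tracking targets against complex backgrounds on embedded platforms plays an important role in intelligent video surveillance equipment, UAV tracking, and other fields. Convolutional neural networks offer high accuracy and strong robustness for tracking, but algorithms based on convolutional features have high computational complexity, and under the area and power constraints of embedded platforms their real-time performance struggles to meet the needs of embedded application scenarios. To address the high computational complexity and large parameter storage of convolutional-feature-based tracking algorithms, the authors are the first to propose using FP...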