首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
State-of-the-art field-programmable gate array (FPGA) technologies have provided exciting opportunities to develop more flexible, less expensive, and better performance floating-point computing platforms for embedded systems. To better harness the full power of FPGAs and to bring FPGAs to more system designers, we investigate unique advantages and optimization opportunities in both software and hardware offered by multi-core processors on a programmable chip (MPoPCs). In this paper, we present our hardware customization and software dynamic scheduling solutions for LU factorization of large sparse matrices on in-house developed MPoPCs. Theoretical analysis is provided to guide the design. Implementation results on an Altera Stratix III FPGA for five benchmark matrices of size up to 7,917 × 7,917 are presented. Our hardware customization alone can reduce the execution time by up to 17.22 %. The integrated hardware–software optimization improves the speedup by an average of 60.30 %.  相似文献   

2.
The use of local features in images has become very popular due to its promising results. They have shown significant benefits in a variety of applications such as object recognition, image retrieval, robot navigation, panorama stitching, and others. SIFT is one of the local features methods that have shown better results. Among its main disadvantages is its high computational cost. In order to speedup this algorithm, this work proposes the design and implementation of an efficient hardware architecture based on FPGAs for SIFT interest point detection In order to take full advantage of the parallelism in this algorithm and to minimize the device area occupied by its implementation in hardware, part of the algorithm was reformulated. The main contribution of the hardware architecture proposed in this paper and the main difference with the rest of the architectures reported in the literature is that as the number of octaves to be processed is increased, the amount of occupied device area remains almost constant. The evaluations and experiments to the architecture support this contribution, as well as accuracy, repeatability, and distinctiveness of the results. Experiments also showed device area occupation and time constraints of the hardware implementation. The architecture presented in this paper is able to detect interest points in an image of 320 × 240 in 11 ms, which represents a speedup of 250 × with respect to a software implementation.  相似文献   

3.
This paper presents a real-time image acquisition system with an improved image quality assessment module to acquire high-quality near infrared (NIR) images. Thermal imaging plays a vital role in a wide range of medical and military applications. The demand for high-throughput image acquisition and image processing has continuously increased especially for critical medical and military purposes where executions under real-time constraints are required. This work implements an NIR image quality assessment module, which utilizes improved two-dimensional entropy and mask-based edge detection algorithms. The effectiveness of the proposed image quality assessment algorithms is demonstrated through the implementation of a complete finger-vein biometric system. The proposed model is implemented as an embedded system on a field programmable gate array prototyping platform. By including the image quality assessment module, the proposed system is able to achieve a recognition accuracy of 0.87 % equal error rate, and can handle real-time processing at 15 frames/s (live video rate). This is achieved through hardware acceleration of the proposed image quality assessment algorithms via a novel streaming architecture.  相似文献   

4.
Gabor wavelet transform is one of the most effective texture feature extraction techniques and has resulted in many successful practical applications. However, real-time applications cannot benefit from this technique because of the high computational cost arising from the large number of small-sized convolutions which require over 10 min to process an image of 256 × 256 pixels on a dual core CPU. As the computation in Gabor filtering is parallelizable, it is possible and beneficial to accelerate the feature extraction process using GPU. Conventionally, this can be achieved simply by accelerating the 2D convolution directly, or by expediting the CPU-efficient FFT-based 2D convolution. Indeed, the latter approach, when implemented with small-sized Gabor filters, cannot fully exploit the parallel computation power of GPU due to the architecture of graphics hardware. This paper proposes a novel approach tailored for GPU acceleration of the texture feature extraction algorithm by using separable 1D Gabor filters to approximate the non-separable Gabor filter kernels. Experimental results show that the approach improves the timing performance significantly with minimal error introduced. The method is specifically designed and optimized for computing unified device architecture and is able to achieve a speed of 16 fps on modest graphics hardware for an image of 2562 pixels and a filter kernel of 322 pixels. It is potentially applicable for real-time applications in areas such as motion tracking and medical image analysis.  相似文献   

5.
唐奇  苏光大 《计算机工程》2008,34(7):248-250
实现一个硬件人脸检测系统,该系统工作频率为70 MHz,能够检测一幅256×256的图像中任意位置、任意大小和任意数目的人脸,检测速度为每秒35帧。系统的检测精度为85%,误检率为1.5×10-6。为实现高速人脸检测,在硬件系统架构上做出如下3点创新:实现积分图像和积分平方图像的硬件实时计算,弱分类器特征值计算的深流水线实现,采用并行多内存组织结构。  相似文献   

6.
Passive radio-frequency identification (RFID) tags have long been thought to be too weak to implement public-key cryptography: It is commonly assumed that the power consumption, gate count and computation time of full-strength encryption exceed the capabilities of RFID tags. In this paper, we demonstrate that these assumptions are incorrect. We present two low-resource implementations of a 1,024-bit Rabin encryption variant called WIPR—in embedded software and in hardware. Our experiments with the software implementation show that the main performance bottleneck of the system is not the encryption time but rather the air interface and that the reader’s implementation of the electronic product code Class-1 Generation-2 RFID standard has a crucial effect on the system’s overall performance. Next, using a highly optimized hardware implementation, we investigate the trade-offs between speed, area and power consumption to derive a practical working point for a hardware implementation of WIPR. Our recommended implementation has a data-path area of 4,184 gate equivalents, an encryption time of 180  ms and an average power consumption of 11 \(\upmu \)W, well within the established operating envelope for passive RFID tags.  相似文献   

7.
In this paper, we present faster than real-time implementation of a class of dense stereo vision algorithms on a low-power massively parallel SIMD architecture, the CSX700. With two cores, each with 96 Processing Elements, this SIMD architecture provides a peak computation power of 96 GFLOPS while consuming only 9 Watts, making it an excellent candidate for embedded computing applications. Exploiting full features of this architecture, we have developed schemes for an efficient parallel implementation with minimum of overhead. For the sum of squared differences (SSD) algorithm and for VGA (640 × 480) images with disparity ranges of 16 and 32, we achieve a performance of 179 and 94 frames per second (fps), respectively. For the HDTV (1,280 × 720) images with disparity ranges of 16 and 32, we achieve a performance of 67 and 35 fps, respectively. We have also implemented more accurate, and hence more computationally expensive variants of the SSD, and for most cases, particularly for VGA images, we have achieved faster than real-time performance. Our results clearly demonstrate that, by developing careful parallelization schemes, the CSX architecture can provide excellent performance and flexibility for various embedded vision applications.  相似文献   

8.
As configurable processing advances, elements from the traditional approaches of both hardware and software development can be combined by incorporating customized, application-specific computational resources into the processor’s architecture, especially in the case of field-programmable-gate-array-based systems with soft-processors, so as to enhance the performance of embedded applications. This paper explores the use of several different microarchitectural alternatives to increase the performance of edge detection algorithms, which are of fundamental importance for the analysis of DNA microarray images. Optimized application-specific hardware modules are combined with efficient parallelized software in an embedded soft-core-based multi-processor. It is demonstrated that the performance of one common edge detection algorithm, namely Sobel, can be boosted remarkably. By exploiting the architectural extensions offered by the soft-processor, in conjunction with the execution of carefully selected application-specific instruction-set extensions on a custom-made accelerating co-processor connected to the processor core, we introduce a new approach that makes this methodology noticeably more efficient across various applications from the same domain, which are often similar in structure. With flexibility to update the processing algorithms, an improvement reaching one order of magnitude over all-software solutions could be obtained. In support of this flexibility, an effective adaptation of this approach is demonstrated which performs real-time analysis of extracted microarray data; the proposed reconfigurable multi-core prototype has been exploited with minor changes to achieve almost 5× speedup.  相似文献   

9.
内窥镜去雾算法在医疗领域具有广泛应用,为临床医生提供清晰、实时的图像。去雾技术虽然已经取得较大的进步,但去雾算法的复杂度较高,在内窥镜等复杂情况下硬件实现较为困难。为了在硬件上实现内窥镜实时去雾效果,对暗通道先验算进行改进,降低硬件资源消耗和时间复杂度。该改进算法选择适合硬件的大气光照强度估计值、透射率补偿值以及采用流水线结构实现有雾图像的处理。采用Xilinx的ZYNQ7020实现该算法硬件电路,实时处理分辨率为640×480的视频图像,速度可达到260 fps,消耗LUT仅为1.28K,寄存器619个单元。实验结果表明,相比于传统算法,改进算法具有处理速度快、功耗低、可移植性强的特点,满足内窥镜需要实时处理视频的要求。  相似文献   

10.
The 0–1 knapsack problem (KP) is a well-known intractable optimization problem with wide range of applications. Harmony Search (HS) is one of the most popular metaheuristic algorithms to successfully solve 0–1 KPs. Nevertheless, metaheuristic algorithms are generally compute intensive and slow when implemented in software. In this paper, we present an FPGA-based pipelined hardware accelerator to reduce computation time for solving large dimension 0–1 KPs using Binary Harmony Search algorithm. The proposed architecture exploits the intrinsic parallelism of population based metaheuristic algorithm and the flexibility and parallel processing capabilities of FPGAs to perform the computation concurrently thus enhancing performance. To validate the efficiency of the proposed hardware accelerator, experiments were conducted using a large number of 0–1 KPs. Comparative analysis on experimental results reveals that the proposed approach offers promising speedups of 51× – 111× as compared with a software implementation and 2× – 5× as compared with a hardware implementation of Binary Particle Swarm Optimization algorithm.  相似文献   

11.
The Euclidean Distance Transform (EDT) is an important tool in image analysis and machine vision. This paper provides an area-efficient hardware solution to the computation of EDT on a binary image. An O(n) hardware algorithm for computing EDT of an n×n image is presented. A pipelined 2D array architecture for harware implementation is designed. The architecture has a regular structure with locally connected identical processing elements. Further, pipelining reduces hardware resources. Such an array architecture is easily scalable to handle images of different sizes and is suitable for implementation on reconfigurable devices like FPGAs. Results of FPGA-based implementation shows that the hardware can process about 6000 images of size 512×512 per second which is much higher than the video rate of 30 frames per second.  相似文献   

12.
Image correlation is widely used for image and picture processing. Typical applications of image correlation are object location, image registration and sub-image similarity measurement. However, image correlation requires the comparison of a large number of sub-images implying a large computational effort that may prevent its use for real-time applications. On the other hand, correlation computation is very well suited for FPGA implementations. In this work we present efficient architectures for the implementation of Zero-Mean Normalized Cross-Correlation using FPGAs with application to image correlation. In particular, we compare the implementations of correlation in the spatial and spectral domains. Experimental results demonstrate that FPGAs improve performance by at least two orders of magnitude with respect to software implementations on a modern personal computer. This speed-up makes the performance of correlation computation suitable for real-time image processing. The proposed architectures have been applied to a correlation-based fingerprint-matching algorithm, demonstrating that real-time processing requirements can be well satisfied with an FPGA-based implementation.  相似文献   

13.
Genetic Algorithm for Boolean minimization in an FPGA cluster   总被引:1,自引:0,他引:1  
Evolutionary algorithms are an alternative option to the Boolean synthesis due to that they allow one to create hardware structures that would not be able to be obtained with other techniques. This paper shows a parallel genetic programming (PGP) Boolean synthesis implementation based on a cluster of FPGAs that takes full advantage of parallel programming and hardware/software co-design techniques. The performance of our cluster of FPGAs implementation has been compared with an HPC implementation. The experimental results have shown an excellent behavior in terms of speed up (up to ×500) and in terms of solving the scalability problems of this algorithms present in previous works.  相似文献   

14.
The amount of noise present in the Fiber Optic Gyroscope (FOG) signal limits its applications and has a negative impact on navigation system. Existing algorithms such as Discrete Wavelet Transform (DWT), Kalman Filter (KF) denoise the FOG signal under static environment, however denoising fails in dynamic environment. Therefore in this paper an Adaptive Moving Average Dual Mode Kalman Filter (AMADMKF) is developed for denoising the FOG signal under both the static and dynamic environments. Performance of the proposed algorithm is compared with DWT and KF techniques. Further, a hardware Intellectual Property (IP) of the algorithm is developed for System on Chip (SoC) implementation using Xilinx Virtex-5 Field Programmable Gate Array (Virtex-5FX70T-1136). The developed IP is interfaced as a Co-processor/ Auxiliary Processing Unit (APU) with the PowerPC (PPC440) embedded processor of the FPGA. It is proved that the proposed system is an efficient solution for denoising the FOG signal in real-time environment. Hardware acceleration of developed Co-processor is 65× with respect to its equivalent software implementation of AMADMKF algorithm in the PPC440 embedded processor.  相似文献   

15.
Support Vector Machine (SVM) is a robust machine learning model that shows high accuracy with different classification problems, and is widely used for various embedded applications. However, implementation of embedded SVM classifiers is challenging, due to the inherent complicated computations required. This motivates implementing the SVM on hardware platforms for achieving high performance computing at low cost and power consumption. Melanoma is the most aggressive form of skin cancer that increases the mortality rate. We aim to develop an optimized embedded SVM classifier dedicated for a low-cost handheld device for early detection of melanoma at the primary healthcare. In this paper, we propose a hardware/software co-design for implementing the SVM classifier onto FPGA to realize melanoma detection on a chip. The implemented SVM on a recent hybrid FPGA (Zynq) platform utilizing the modern UltraFast High-Level Synthesis design methodology achieves efficient melanoma classification on chip. The hardware implementation results demonstrate classification accuracy of 97.9%, and a significant hardware acceleration rate of 21 with only 3% resources utilization and 1.69 W for power consumption. These results show that the implemented system on chip meets crucial embedded system constraints of high performance and low resources utilization, power consumption, and cost, while achieving efficient classification with high classification accuracy.  相似文献   

16.
In this paper, field programmable gate array (FPGA) implementation of reversible watermarking (RW) algorithm by histogram bin shifting (HBS) that can be used for real-time applications of medical and military images has been presented. As the tolerance level of distortion has to be minimal, RW scheme is necessary here. The reversible mode of the process helps in extracting both the original image and the watermark at the receiver end after undergoing through embedding and decoding procedure. The embedded watermark contains the underlying security information of the host images in case of any infringement. Although software implementations of RW schemes are available, very few attempts have been made for hardware realizations of such schemes. The inherent delay associated with software implementations can be minimized by using an on-chip hardware that performs the watermarking process immediately after capturing the image in real time. In this paper, the embedding and decoding procedures involved in the watermarking scheme are implemented using Xilinx System Generator and carries out a detailed design and analysis of the hardware architectures required for the embedding and extraction processes. The device utilization results for both the embedding and decoding process is low and practically viable. The maximum operating frequency for embedding and extraction processes are 445.330 and 201.824 MHz, respectively, which shows improved performance results over similar existing research work in the literature. The power consumptions for embedding and extraction processes are found to be 1.215 and 0.104 W, respectively. To the best of our knowledge, this is the first FPGA-based hardware implementation of RW by HBS.  相似文献   

17.
Future embedded systems demand multi-processor designs to meet real-time deadlines. The large number of applications in these systems generates an exponential number of use-cases. The key design automation challenges are designing systems for these use-cases and fast exploration of software and hardware implementation alternatives with accurate performance evaluation of these use-cases. These challenges cannot be overcome by current design methodologies which are semi-automated, time consuming and error prone.In this paper, we present a fully automated design flow to generate communication assist (CA) based multi-processor systems (CA-MPSoC). A worst-case performance model of our CA is proposed so that the performance of the CA-based platform can be analyzed before its implementation. The design flow provides performance estimates and timing guarantees for both hard real-time and soft real-time applications, provided the task to processor mappings are given by the user. The flow automatically generates a super-set hardware that can be used in all use-cases of the applications. The software for each of these use-cases is also generated including the configuration of communication architecture and interfacing with application tasks.CA-MPSoC has been implemented on Xilinx FPGAs for evaluation. Further, it is made available on-line for the benefit of the research community and in this paper, it is used for performance analysis of two real life applications, Sobel and JPEG encoder executing concurrently. The CA-based platform generated by our design flow records a maximum error of 3.4% between analyzed and measured periods. Our tool can also merge use-cases to generate a super-set hardware which accelerates the evaluation of these use-cases. In a case study with six applications, the use-case merging results in a speed up of 18 when compared to the case where each use-case is evaluated individually.  相似文献   

18.
19.
针对当前基于ARM和DSP的嵌入式图像处理系统前端采集速度慢和图像处理算法不易加速的缺点,设计了一种基于HDMI接口的全高清(分辨率1920×1080)实时视频采集与图像处理系统;采用500万像素级别CMOS摄像头作为前端数据源,主芯片内部采用ARM+FPGA的异构架构,兼备FPGA的并行处理能力与ARM处理器任务调度功能;基于AXI协议设计了自定义数据存储传输的IP核,实现了处理速度与带宽最大化;利用HLS工具将图像预处理算法快速打包生成IP核,在FPGA中实现图像算法的硬件加速,完成图像处理系统平台原型机的设计;与传统的PC机和相机的机器视觉平台相比,该系统运行平均耗时在10 ms以内,实时检测效果令人满意,有效解决了低功耗与高数据带宽和处理速度之间的矛盾,为后端结果分析和边缘加速提供了良好支持。  相似文献   

20.
基于嵌入式平台的复杂背景目标跟踪技术在智能视频监控设备、无人机跟踪等领域有重要作用.卷积神经网络在跟踪问题上有准确率高、鲁棒性强的优点,但基于卷积特征的算法计算复杂度高,受嵌入式平台面积和功耗的限制,实时性难以满足嵌入式平台应用场景的需求.针对基于卷积特征的跟踪算法计算复杂度高、存储参数量大的难题,率先提出一种利用FP...  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号