Similar documents
1.
In this paper we examine the usefulness of a simple memory array architecture for several image processing tasks. This architecture, called the Access Constrained Memory Array Architecture (ACMAA), has a linear array of processors that concurrently access distinct rows or columns of an array of memory modules. We have developed several parallel image processing algorithms for this architecture. All the algorithms presented in this paper achieve a linear speed-up over the corresponding fast sequential algorithms. This was made possible by exploiting the efficient local as well as global communication capabilities of the ACMAA.

2.
In modern multimedia applications, the memory bottleneck can be alleviated with special stride data accesses. Data elements in a stride access can be retrieved in parallel with parallel memories, where the idea is to increase memory bandwidth by having several memory modules work in parallel and to feed the processor with only the necessary data. Arbitrary stride access capability with interleaved memories is described in previous research, where the skewing scheme is changed at run time according to the currently used stride. This paper presents improved schemes that are adapted to parallel memories. The proposed novel parallel memory implementation allows conflict-free accesses with all constant strides, which has not been possible in prior application-specific parallel memories. Moreover, the possible access locations are unrestricted and the number of accessed data elements equals the number of memory modules. Timing and area estimates are given for an Altera Stratix FPGA and a 0.18 micrometer CMOS process with memory module counts from 2 to 32. The FPGA results show a 129 MHz clock frequency for a system with 16 memory modules when the read and write latencies are 3 and 2 clock cycles, respectively. The complexity of the proposed system is shown to be a trade-off between application-specific and highly configurable parallel memory systems.
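The idea of changing the address-to-module mapping at run time according to the stride can be illustrated with a minimal sketch. This is not the scheme proposed in the paper; it is a simple shift-based mapping, assuming a power-of-two number of modules and a positive constant stride. The names `module_index` and `is_conflict_free` are hypothetical.

```python
def module_index(addr, stride, n_modules):
    """Map an address to a memory module for the currently used stride.

    Assumption: n_modules is a power of two and stride > 0.
    The stride is written as sigma * 2**s with sigma odd; dropping the
    low s address bits before taking the modulus spreads the accesses.
    """
    s = (stride & -stride).bit_length() - 1  # number of trailing zero bits
    return (addr >> s) % n_modules


def is_conflict_free(base, stride, n_modules):
    """True if n_modules strided accesses all land in distinct modules."""
    modules = {
        module_index(base + i * stride, stride, n_modules)
        for i in range(n_modules)
    }
    return len(modules) == n_modules


if __name__ == "__main__":
    # Check a range of base addresses and strides for 8 modules.
    ok = all(
        is_conflict_free(base, stride, 8)
        for base in range(40)
        for stride in range(1, 33)
    )
    print("conflict free for all tested strides:", ok)
```

The mapping works because adding a multiple of 2**s leaves the low s bits unchanged, so the shifted addresses advance by the odd factor sigma, and multiplication by an odd number is a bijection modulo a power of two. Plain low-order interleaving (`addr % n_modules`) would instead conflict for every even stride.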

3.
GPUs can provide powerful computing ability, especially for data-parallel applications such as video/image processing. However, the complexity of GPU systems makes the optimization of even a simple algorithm difficult, and different optimization methods on a GPU often lead to different performance. The matrix–vector multiplication routine for general dense matrices (GEMV) is an important kernel in video/image processing applications. We find that the implementations of GEMV in CUBLAS or MAGMA are not efficient, especially for small or fat matrices. In this paper, we propose a novel register blocking method to optimize GEMV on the GPU architecture. This new method has three advantages. First, instead of using only one thread, we use a warp to compute an element of vector y so that the method can exploit the highly parallel GPU architecture. Second, the register blocking method is used to reduce the demand for off-chip memory bandwidth. Third, the memory access order is carefully arranged for the threads in one warp so that coalesced memory access is ensured. The proposed optimization methods for GEMV are comprehensively evaluated on different matrix sizes. The performance of the register blocking method with different block sizes is also evaluated in the experiments. Experimental results show that the new method achieves very high speedup for small square matrices and fat matrices compared to CUBLAS or MAGMA, and also achieves higher performance for large square matrices.
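The warp-per-element idea can be sketched on the CPU as a plain Python simulation. This is only an illustration of the access pattern, not the authors' kernel: `WARP_SIZE`, `warp_dot`, and `gemv` are hypothetical names, each list slot in `partial` stands in for one lane's register, and the register blocking itself is not modeled.

```python
WARP_SIZE = 32  # assumed warp width


def warp_dot(row, x):
    """Simulate one warp computing a single element of y = A x.

    In each step, lane l reads element start + l, so consecutive lanes
    touch consecutive addresses (the coalesced pattern). Each lane keeps
    its partial sum in its own accumulator.
    """
    partial = [0] * WARP_SIZE
    for start in range(0, len(row), WARP_SIZE):
        for lane in range(WARP_SIZE):
            k = start + lane
            if k < len(row):
                partial[lane] += row[k] * x[k]

    # Tree reduction across lanes, halving the active lanes each step.
    offset = WARP_SIZE // 2
    while offset:
        for lane in range(offset):
            partial[lane] += partial[lane + offset]
        offset //= 2
    return partial[0]


def gemv(a, x):
    """y = A x with one simulated warp per output element."""
    return [warp_dot(row, x) for row in a]


if __name__ == "__main__":
    a = [[(i + 1) * (j % 5) for j in range(70)] for i in range(3)]
    x = [j - 30 for j in range(70)]
    print(gemv(a, x))
```

A fat matrix has few rows and many columns, so assigning a whole warp to each row keeps more lanes busy than a one-thread-per-row layout would.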

4.
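A novel neighborhood functional pipeline structure in neighborhood image processors   Cited by: 5 (self-citations: 0, citations by others: 5)
Su Guangda (苏光大). Acta Electronica Sinica (《电子学报》), 2000, 28(8): 120-123
This paper introduces the principle of the neighborhood image processor and proposes two novel neighborhood functional pipeline structures for neighborhood image processing: a shrinking type (收缩型) and a cascade type (级联型). In both, pipelining operates at the level of independent image processing algorithms, so several independent algorithms can be completed in real time (or even faster than real time). The structures embody the data-parallel and processing-parallel principles of parallel processors and integrate multiple algorithms organically, which makes them particularly suited to practical problems that call for combined algorithms. The neighborhood functional pipeline structure not only greatly increases image processing speed but also makes the system more flexible. The paper describes the shrinking-type and cascade-type pipeline structures and gives combinations of multiple image processing functions.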

5.
FAIR: a hardware architecture for real-time 3-D image registration   Cited by: 2 (self-citations: 0, citations by others: 2)
Mutual information-based image registration, shown to be effective in registering a range of medical images, is a computationally expensive process, with a typical execution time on the order of minutes on a modern single-processor computer. Accelerated execution of this process promises to enhance efficiency and therefore promote routine use of image registration clinically. This paper presents details of a hardware architecture for real-time three-dimensional (3-D) image registration. Real-time performance can be achieved by setting up a network of processing units, each with three independent memory buses: one each for the two image memories and one for the mutual histogram memory. Memory access parallelization and pipelining, by design, allow each processing unit to be 25 times faster than a processor with the same bus speed, when calculating mutual information using partial volume interpolation. Our architecture provides superior per-processor performance at a lower cost compared to a parallel supercomputer.
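For orientation, the quantity being accelerated can be written as a short sketch. It computes mutual information from a mutual (joint) histogram of two equally sized, already aligned images given as flat lists of intensities. Partial volume interpolation and the hardware pipelining are not modeled, and `mutual_information` is a hypothetical name.

```python
import math
from collections import Counter


def mutual_information(img_a, img_b):
    """Mutual information (in bits) of two flattened images.

    The joint Counter plays the role of the mutual histogram; the two
    marginal histograms are built from each image separately.
    """
    if len(img_a) != len(img_b) or not img_a:
        raise ValueError("images must be non-empty and the same size")

    n = len(img_a)
    joint = Counter(zip(img_a, img_b))
    hist_a = Counter(img_a)
    hist_b = Counter(img_b)

    mi = 0.0
    for (a, b), count in joint.items():
        # p(a,b) / (p(a) p(b)) simplifies to count * n / (hist_a * hist_b)
        ratio = count * n / (hist_a[a] * hist_b[b])
        mi += (count / n) * math.log2(ratio)
    return mi


if __name__ == "__main__":
    print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))
    print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

Each voxel pair triggers one read from each image and one update of the joint histogram, which is consistent with the abstract giving every processing unit separate buses for the two image memories and the histogram memory.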

6.
Image registration is a ubiquitous task occurring in countless image analysis applications. A dedicated implementation of image registration algorithms is the best approach to meeting the intensive computation requirements of real-time image registration schemes. This paper presents an efficient VLSI architecture for real-time implementation of image registration algorithms using an exhaustive search method. Normalized cross correlation, mean square error, and blue screen technique algorithms are implemented for image registration. The architecture is based on a data flow design that accepts sequential inputs but performs parallel processing. Based on the architecture, a programmable chip can be designed for image registration. Chips can be cascaded to achieve better performance, and the sizes of both the search image and the reference image can vary over time from very small to very large.
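A software sketch of exhaustive-search registration with normalized cross correlation is given below. It is a sequential illustration only, not the VLSI data flow design; `ncc` and `exhaustive_search` are hypothetical names, and images are assumed to be rectangular lists of lists.

```python
import math


def ncc(patch, template):
    """Normalized cross correlation of two equally long flat lists."""
    n = len(template)
    mean_p = sum(patch) / n
    mean_t = sum(template) / n
    num = sum((p - mean_p) * (t - mean_t) for p, t in zip(patch, template))
    den = math.sqrt(
        sum((p - mean_p) ** 2 for p in patch)
        * sum((t - mean_t) ** 2 for t in template)
    )
    # A flat patch or template has zero variance; treat it as no match.
    return num / den if den else 0.0


def exhaustive_search(image, template):
    """Try every placement of the template and keep the best NCC score."""
    th, tw = len(template), len(template[0])
    flat_t = [v for row in template for v in row]
    best_score, best_pos = -2.0, None
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            patch = [image[y + dy][x + dx]
                     for dy in range(th) for dx in range(tw)]
            score = ncc(patch, flat_t)
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score


if __name__ == "__main__":
    image = [[0] * 5 for _ in range(5)]
    image[2][1], image[2][2] = 1, 2
    image[3][1], image[3][2] = 3, 4
    print(exhaustive_search(image, [[1, 2], [3, 4]]))
```

Every candidate position is scored independently, which is why the search parallelizes well even when pixels arrive sequentially.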

7.
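This paper describes the design of a compact, efficient floating-point butterfly unit whose internal modules are utilized at close to 100%. The unit uses a serial, fully pipelined structure, which saves 75% of the hardware resources compared with a parallel structure. Using the decimation-in-time (DIT) fast Fourier transform (FFT) algorithm, a 1,024-point floating-point FFT processor built on this butterfly unit was implemented in VHDL. Simulation results in QUARTUS II confirm that the design is correct. The design has been applied successfully in the signal processing section of an audio signal analyzer.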
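As a software reference for the butterfly-based structure, here is a minimal radix-2 DIT FFT sketch in which every stage is built from the same butterfly operation. It is not the VHDL design; `butterfly` and `fft_dit` are hypothetical names, and the input length is assumed to be a power of two.

```python
import cmath


def butterfly(a, b, w):
    """Radix-2 DIT butterfly: one complex multiply, one add, one subtract."""
    t = w * b
    return a + t, a - t


def fft_dit(x):
    """Iterative radix-2 decimation-in-time FFT of a power-of-two length list."""
    n = len(x)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1

    # Reorder the input into bit-reversed index order.
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, "0{}b".format(bits))[::-1], 2) if bits else 0
        out[rev] = complex(x[i])

    # log2(n) stages, each applying n/2 butterflies in place.
    size = 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)  # twiddle factor
                i, j = start + k, start + k + half
                out[i], out[j] = butterfly(out[i], out[j], w)
        size *= 2
    return out


if __name__ == "__main__":
    spectrum = fft_dit([1.0] + [0.0] * 1023)  # 1,024-point impulse
    print(len(spectrum), spectrum[0])
```

A 1,024-point transform needs 10 stages of 512 butterflies each, which is why a single well-utilized butterfly unit fed serially can cover the whole computation.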

8.
This paper presents a digital signal processing system that produces SEASAT synthetic-aperture radar (SAR) imagery. The system consists of a SEL 32/77 host minicomputer and three AP-120B array processors. The partitioning of the SAR processing functions and the design of the software modules are described, along with the rationale for selecting the parallel array processor architecture and the methodology for developing the parallel processing scheme on this system. The system attains a SEASAT SAR data reduction speed of 2.5 h per 100 km × 100 km image frame at 25-m resolution with 4 looks. A preliminary performance evaluation of this parallel processing system and potential future applications for remote sensing data reduction are described.

9.
Maximum intensity projections (MIPs) are an important visualization technique for angiographic data sets. Efficient data inspection requires frame rates of at least five frames per second at preserved image quality. Despite the advances in computer technology, this task remains a challenge. On the one hand, the sizes of computed tomography and magnetic resonance images are increasing rapidly. On the other hand, rendering algorithms do not automatically benefit from the advances in processor technology, especially for large data sets. This is due to the gap between faster evolving processing power and slower evolving memory access speed, which is bridged by hierarchical cache memory architectures. In this paper, we investigate memory access optimization methods and use them for generating MIPs on general-purpose central processing units (CPUs) and graphics processing units (GPUs). These methods can work on any level of the memory hierarchy, and we show that properly combined methods can optimize memory access on multiple levels of the hierarchy at the same time. We present performance measurements to compare different algorithm variants and illustrate the influence of the respective techniques. On current hardware, efficient handling of the memory hierarchy for CPUs improves the rendering performance by a factor of 3 to 4. On GPUs, we observed that the effect is even larger, especially for large data sets. The methods can easily be adjusted to different hardware specifics, although their impact can vary considerably. They can also be used for rendering techniques other than MIPs, and their use for more general image processing tasks could be investigated in the future.
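A minimal sketch of an axis-aligned MIP follows, written so that the volume is traversed in storage order (slice by slice, row by row) rather than ray by ray. This shows only the basic traversal-order idea, not the specific optimization methods the paper evaluates; `mip_along_z` is a hypothetical name and the volume is assumed to be indexed as `volume[z][y][x]`.

```python
def mip_along_z(volume):
    """Maximum intensity projection of volume[z][y][x] along the z axis.

    The loops follow the storage order of the nested lists, so each slice
    is read sequentially while a running-maximum image is updated.
    """
    depth = len(volume)
    height = len(volume[0])
    width = len(volume[0][0])

    # Start from a copy of the first slice.
    image = [row[:] for row in volume[0]]
    for z in range(1, depth):
        current = volume[z]
        for y in range(height):
            src, dst = current[y], image[y]
            for x in range(width):
                if src[x] > dst[x]:
                    dst[x] = src[x]
    return image


if __name__ == "__main__":
    volume = [
        [[1, 5], [3, 0]],
        [[4, 2], [3, 9]],
    ]
    print(mip_along_z(volume))
```

A per-ray loop would jump a whole slice ahead in memory at every step along z, whereas this ordering keeps consecutive accesses adjacent, which is the kind of behaviour cache hierarchies reward.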

10.
Local processing, the dominant type of processing in image and video applications, requires huge computational power to run in real time. However, locality in space and/or time makes it possible to exploit data parallelism and data reuse. Although these properties can be exploited to achieve high-performance image and video processing on multi-core processors, suitable models and parallel algorithms must be developed, in particular for non-shared memory architectures. This paper proposes an efficient and simple model for local image and video processing on non-shared memory multi-core architectures. The model adopts a single program multiple data approach, in which data is distributed, processed and reused in an optimal way with respect to the data size, the number of cores and the local memory capacity. The model was evaluated experimentally by developing local video processing algorithms and programming the Cell Broadband Engine multi-core processor, namely for advanced video motion estimation and in-loop deblocking filtering. Based on this experience, the main challenges of vectorization and of reducing branch mispredictions and computational load imbalances are also addressed. The limits and advantages of the regular and adaptive algorithms are also discussed. Experimental results show that the proposed model is adequate for local video processing and that real-time operation is achieved even for the most demanding parts of advanced video coding. Full-pixel motion estimation is performed on high resolution video (720×576 pixels) at 30 frames per second with large search areas and five reference frames.

11.
The key to designing a real-time video coding system is efficient motion estimation, which reduces temporal redundancies. The motion estimation of the H.264/AVC coding standard can use multiple references and multiple block sizes to improve rate-distortion performance. With a full exhaustive search, the computational complexity of H.264 is linearly dependent on the number of allowed reference frames and block sizes. Many fast block-matching algorithms reduce the computational complexity of motion estimation by carefully designing search patterns with different shapes or sizes, which have a significant impact on the search speed and distortion performance. However, the search speed and the distortion performance often conflict with each other in these methods, and their high computational complexity incurs a large amount of memory access. This paper presents a novel block-matching scheme with image indexing, which sets a proper priority list of search points, to encode an H.264 video sequence. This study also proposes a computation-aware motion estimation method for H.264/AVC. Experimental results show that the proposed method achieves good performance and offers a new way to design a cost-effective real-time video coding system.

12.
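A missile-borne infrared image processor is a complex real-time infrared image processing system. Because its computational load is heavy and its real-time requirements are strict, a single CPU can no longer handle the task. This paper analyzes several methods of communication among the multiple CPUs in a missile-borne infrared image processor. After examining the principles, advantages, and disadvantages of each, it proposes that inter-CPU communication should use a parallel communication method with dual-port RAM as shared memory, which is the method used throughout the systems we designed. Experiments show that data communication with this method is highly flexible and reliable and fully meets the inter-CPU communication requirements of the missile-borne infrared image processor.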

13.

This paper presents a Field Programmable Gate Array (FPGA) based embedded system for high-speed segmentation of 3D images. Segmentation is performed using the Expectation-Maximization (EM) with Maximization of Posterior Marginals (MPM) Bayesian algorithm. This algorithm segments the 3D image using neighboring pixels based on a Markov Random Field (MRF) model. In this system, the embedded processor controls a custom circuit that performs the MPM and portions of the EM algorithm. The embedded processor completes the EM algorithm and also controls image data transmission between the host computer and on-board memory. The whole system has been implemented on a Xilinx Virtex 6 FPGA and achieves a speedup of over 100 times compared with a standard desktop computer. Three new techniques were key to achieving this speed: pipelined computational cores, sixteen parallel data paths, and a novel memory interface that maximizes the external memory bandwidth.


14.
FFT algorithms have memory access patterns that prevent many architectures from achieving high computational utilization, particularly when parallel processing is required to achieve the desired levels of performance. Starting with a highly efficient hybrid linear algebra/FFT core, we co-design the on-chip memory hierarchy, on-chip interconnect, and FFT algorithms for a multicore FFT processor. We show that it is possible to achieve excellent parallel scaling while maintaining power and area efficiency comparable to that of the single-core solution. The result is an architecture that can effectively use up to 16 hybrid cores for transform sizes that can be contained in on-chip SRAM. When configured with 12 MiB of on-chip SRAM, our technology evaluation shows that the proposed 16-core FFT accelerator should sustain 388 GFLOPS of nominal double-precision performance, with power and area efficiencies of 30 GFLOPS/W and 2.66 GFLOPS/mm², respectively.

15.
This paper presents some massively parallel computers, especially those dedicated to image processing and artificial vision. After discussing the information processing principle from an architectural point of view, we analyse three representative machines: MPP (processor array), Connection Machine (hypercube machine) and Sphinx (pyramid machine). A comparative study of the implementation of image analysis is carried out. This paper also emphasizes the influence of integration technologies on the realization of massively parallel computers.

16.
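Zhu Yufei (朱玉飞), Dai Zibin (戴紫彬), Xu Jinhui (徐进辉), Li Gongli (李功丽). Acta Electronica Sinica (《电子学报》), 2017, 45(12): 2957-2964
Based on the cryptographic requirements of information security devices and the basic framework of stream-architecture processors, a stream-architecture cryptographic processor is designed. The paper focuses on the stream memory system, the bottleneck that limits the processor's performance. The memory system is tailored to the storage characteristics of dedicated cryptographic processors and uses a configurable design, meeting the need of cryptographic applications for a flexible and efficient processor memory system. The design also combines hierarchical, distributed, multi-bank storage, pipelined parallel memory access over multiple data channels, and a stream memory access scheduling strategy to optimize memory access efficiency and thereby improve overall processor performance. The results show that, compared with the memory design of a typical cryptographic processor, the memory access efficiency of this design can be improved by up to about 6 times.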

17.
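Two programming models for parallel processing on the multiple microengines of a network processor are analyzed, and general strategies are discussed for distributing packet processing tasks across the microengines to achieve high processing performance. Both models were combined in developing a NAT-PT system based on the IXP2400, Intel's second-generation network processor, which resolved the mapping of packet processing function modules onto the microengines well.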

18.
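An image processing platform is designed with a DSP at its core and an FPGA assisting with data acquisition and transfer. The DSP runs complex image processing algorithms and exchanges data with the FPGA at high speed over the EMIF interface. The FPGA extends the EMIF interface, unifying the image sensor, SDRAM, USB interface, and other peripherals behind it; this raises the system's integration and flexibility and provides accurate, efficient, and reliable data transfer between the DSP, the FPGA, and the peripherals. Experiments show that the system meets its real-time design requirements, is easy to extend and upgrade, and is broadly applicable.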

19.
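A new two-dimensional VLSI architecture for full-search motion estimation is proposed. Built on a two-dimensional systolic array, it achieves complete two-dimensional data reuse, reduces the volume of data fetched from external memory, and offers 100% hardware efficiency and high throughput. The architecture is also general: it can easily be applied to full-search block-matching motion estimation with different block sizes and search ranges.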
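The computation that such an array carries out can be stated as a short sequential sketch of full-search block matching with the sum of absolute differences (SAD). The abstract does not name the matching criterion, so SAD is an assumption here, as are the names `sad` and `full_search`; the systolic data reuse itself is not modeled.

```python
def sad(cur, ref, by, bx, ry, rx, n):
    """Sum of absolute differences between two n x n blocks."""
    return sum(
        abs(cur[by + i][bx + j] - ref[ry + i][rx + j])
        for i in range(n)
        for j in range(n)
    )


def full_search(cur, ref, by, bx, n, search_range):
    """Test every displacement within +/- search_range and keep the best.

    Returns ((dy, dx), cost) for the n x n block of `cur` at (by, bx).
    Candidates that fall outside the reference frame are skipped.
    """
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ry, rx = by + dy, bx + dx
            if ry < 0 or rx < 0 or ry + n > height or rx + n > width:
                continue
            cost = sad(cur, ref, by, bx, ry, rx, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost


if __name__ == "__main__":
    ref = [[0] * 6 for _ in range(6)]
    cur = [[0] * 6 for _ in range(6)]
    block = [[9, 7], [5, 3]]
    for i in range(2):
        for j in range(2):
            ref[3 + i][2 + j] = block[i][j]  # block position in reference
            cur[2 + i][1 + j] = block[i][j]  # block position in current frame
    print(full_search(cur, ref, 2, 1, 2, 2))
```

Neighbouring candidates share all but one row or column of reference pixels, and that overlap is the data reuse a two-dimensional array can exploit to cut external memory traffic.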

20.
This paper presents the philosophy and design of a fault-tolerant dynamically-reconfigurable random access memory (RAM) system with a built-in Self-Testing-And-Repairing “STAR” engine. The STAR engine, supported by SEC–DED capability, provides on-line fault detection, correction, analysis and repair without destroying useful data stored in the memory. Reliability analysis of the presented system has been accomplished using a SMART simulation approach[1], and results show significant reliability enhancement over SEC–DED RAM designs. The memory system employs a hardware parallel address-comparison mechanism for rapid processing of incoming addresses during normal read/write operations to minimize memory access delay. The flexible STAR architecture and the low hardware overhead enable utilization of the proposed approach in VLSI memory chips as well as in WSI and large memory modules.
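SEC–DED (single error correction, double error detection) can be illustrated with an extended Hamming(8,4) code. This is a textbook-sized sketch, not the code or word width used in the paper; `encode` and `decode` are hypothetical names.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the usual
    Hamming positions, with check bits at 1, 2 and 4.
    """
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of all 8 bits even
    return code


def decode(code):
    """Return (nibble, status); nibble is None for an uncorrectable word."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    parity_odd = sum(code) % 2 == 1

    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Single-bit error; syndrome 0 means the overall parity bit itself.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "double_error"

    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1  # inject a single-bit fault
    print(decode(word))
```

A memory controller can return corrected data on a single-bit fault and flag double faults for further handling, which is the SEC–DED capability the STAR engine builds on.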

