Similar literature
1.
Programming tools have recently become available that let researchers and scientists use video cards for general-purpose calculations in computational electromagnetics applications. Over the past few years, developments in graphics processing units (GPUs) for video cards have vastly outpaced their general-purpose central processing unit (CPU) counterparts. For vector math operations in particular, the newest generation of GPUs can generally outperform current CPU architectures by a wide margin. With the addition of large onboard memory with significantly higher bandwidth than that of main system memory, graphics cards can be used as highly efficient vector math coprocessors. In the past, this power was harnessed by writing low-level assembly code for the video cards. New tools now make programming possible in high-level languages. By formulating proper procedures to realize general vector computations on the GPU, it will be possible to increase the available processing power by at least an order of magnitude compared to the current generation of CPUs.

2.
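To analyze and predict the performance of CUDA parallel program kernels, and thereby guide parallel program design and performance optimization, a performance prediction framework is proposed. (1) Starting from the GPU programming model and the details of the device architecture, and taking the warp as the unit of study, the framework integrates the basic software and hardware features closely related to GPU program run time and defines high-level performance-related features such as parallel space idleness, streaming-processor warp load, and the parallel effect factor. (2) Based on these features, the framework estimates, for GPU programs with balanced thread load, the execution time of a kernel function under different problem sizes and execution configurations. (3) Based on the performance evaluation principle, an optimization strategy for kernel execution configuration parameters is proposed. Validation experiments show that the framework achieves average prediction accuracies of 89% and 94% for existing programs in two typical scenarios, objectively summarizes the correlation between the high-level features and program performance, and can qualitatively analyze the performance level of parallel algorithms.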

3.
The programmable video signal processor (VSP) is an important category of processors for multimedia systems. Programmable video processors combine the flexibility of programmability with special architectural features that improve performance on video processing applications. VSPs are typically multiprocessors with several processing elements (PEs) and a parallel memory system. This paper focuses on the architectural design of the PEs in a video processor and shows how technology and circuit parameters influence the structure of the datapath and, hence, the overall architecture of a programmable VSP. We emphasize the need to consider technological and circuit-level issues during the design of a system architecture and present a method whereby the conceptual organization of the PEs (the number of PEs, pipelining of the datapath, size of the register file, and number of register ports) can be evaluated in terms of a target set of applications before a detailed design is undertaken. We use motion estimation and the discrete cosine transform as example applications to illustrate how various technology parameters affect the architectural design choices. We show that the design of the register file and the depth of the datapath pipeline can drastically affect PE utilization and, therefore, the number of PEs required for different applications. Our results demonstrate that pursuing the fastest cycle time can greatly increase the silicon area that must be devoted to PEs, due to both increased pipeline latency and reduced register file bandwidth.

4.
张乃燃  侯立刚  吴武臣   《电子器件》2008,31(1):268-272
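A new memory system architecture for video display applications is proposed. To ensure smooth video playback, access to off-chip memory becomes the key factor affecting system performance. First, the motion vectors of the blocks into which the image is divided are analyzed in detail to reduce memory accesses, which cuts the cycle count by 60%. Second, an efficient memory access method provides sufficient bandwidth for video data transfer. To ensure that different types of signals do not interfere with one another, a dedicated memory system structure is arranged. Accordingly, a local AMBA bus dedicated to reference frame access achieves high-speed access, and dual memory controllers also make the bus more efficient. With these methods, experimental results show that at 60 MHz the system can perform real-time high-definition (720p) decoding at 30 frames per second.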
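As a rough sanity check on the figures quoted in abstract 4 (60 MHz, 30 frames per second, 720p), the sketch below works out the average cycle budget per macroblock. The 1280×720 frame size and the 16×16 macroblock size are assumptions not stated in the abstract, and the function name `cycles_per_macroblock` is hypothetical.

```python
def cycles_per_macroblock(clock_hz, fps, width=1280, height=720, mb_size=16):
    """Return the average number of clock cycles available per macroblock.

    width, height and mb_size are assumed values (720p frame, 16x16 blocks).
    """
    mbs_per_row = -(-width // mb_size)    # ceiling division
    mbs_per_col = -(-height // mb_size)
    mbs_per_frame = mbs_per_row * mbs_per_col
    return clock_hz / (fps * mbs_per_frame)


budget = cycles_per_macroblock(60_000_000, 30)
print(f"{budget:.1f} cycles per macroblock")
```

Under these assumptions a frame holds 3600 macroblocks, so each one gets only a few hundred cycles on average, which is consistent with the abstract's emphasis on reducing memory accesses.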

5.
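To address the memory access bandwidth bottleneck in high-performance computing applications such as radar and electronic countermeasures, this paper designs a multi-channel interleaved memory architecture. Through interleaved address mapping across memory channels and the splitting and reassembly of requests by a centralized scheduler, it achieves concurrent access to multiple physical memory channels, multiplies the memory access bandwidth, and is highly configurable and scalable. The design makes full use of mature single-channel controller technology already on the market, and is economical and efficient. To evaluate performance, a cycle-accurate RTL model and its simulation and verification environment were built, using a 4-channel memory system as an example. Test results show that the system achieves its best performance with an interleaving granularity between 64 B and 512 B, about 4 times the performance of the widely used independent multi-channel memory architecture.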
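The interleaved address mapping and request splitting described in abstract 5 can be illustrated with a minimal sketch. This is a generic block-interleaving scheme, not the paper's actual mapping, which the abstract does not specify; the function names and the default values (4 channels, 64-byte granularity) are assumptions chosen to match the example configuration mentioned there.

```python
def map_address(addr, num_channels=4, granularity=64):
    """Map a flat byte address to (channel, address within that channel)."""
    block = addr // granularity          # index of the interleave block
    offset = addr % granularity          # byte offset inside the block
    channel = block % num_channels       # consecutive blocks rotate over channels
    local_block = block // num_channels  # block index inside the chosen channel
    return channel, local_block * granularity + offset


def split_request(addr, length, num_channels=4, granularity=64):
    """Split one contiguous request into (channel, local_addr, size) pieces."""
    pieces = []
    end = addr + length
    while addr < end:
        # A piece never crosses an interleave-block boundary.
        size = min(granularity - addr % granularity, end - addr)
        channel, local = map_address(addr, num_channels, granularity)
        pieces.append((channel, local, size))
        addr += size
    return pieces
```

With this mapping, a long sequential request touches every channel in turn, so the pieces can in principle be served concurrently; a real scheduler would also have to reassemble the responses in order.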

6.
This paper proposes an architecture for a single-pass connected component labeler with no label merging period, intended for real-time multi-channel video analytics; the structure can handle multiple video channels from incoming video signals or from memory. During the line scan period, after determining an event based on the 8-connectivity check of incoming pixels, an 8-connectivity checker notifies all the label registers inside a label shift register of the event and of the control signals for label handling. Each label register takes its label input either, in the normal case, from the shifted label in the label shift register or, on a label merging event, from the merging label, without additional cycles. A label stack is designed for efficient label allocation with label reuse. An information extractor calculates bounding boxes and centers of gravity for labeled objects using the information from the 8-connectivity checker and the pixel counter. The proposed architecture is implemented on a field-programmable gate array (FPGA) device, integrated into a video analytics system, and verified for function and performance. For the same maximum number of labels in video surveillance environments, the proposed architecture reduces memory usage by more than 28% compared with conventional architectures.

7.
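This paper describes the algorithm design and optimization methods for implementing an AVS video encoder on the TMS320DM6446 DSP platform. On top of optimizing the overall software design, algorithms such as motion estimation are optimized and improved; structural optimization methods tailored to the platform's characteristics are also given, mainly improving code parallelism and optimizing memory use and data movement. Test results show that, with the optimizations, the encoder's encoding speed improves significantly while image quality loss remains small.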

8.
《Microelectronics Journal》2015,46(7):637-655
This paper proposes a new processor architecture called VVSHP for accelerating data-parallel applications, which are growing in importance and demanding increased performance from hardware. VVSHP merges VLIW and vector processing techniques into a simple, high-performance processor architecture. One key point of VVSHP is the execution of multiple scalar instructions within VLIW and vector instructions on unified parallel execution datapaths. Another key point is to reduce the complexity of VVSHP by designing a two-part register file: (1) a shared scalar–vector part with eight read and four write ports and 64×32-bit registers (64 scalar or 16×4 vector registers) for storing scalar/vector data, and (2) a vector part with two read ports and one write port and 48 vector registers, each storing 4×32-bit vector data. Moreover, processing vector data with lengths varying from 1 to 256 is a key point for reducing loop overheads. VVSHP can issue up to four scalar/vector operations in each cycle to process a set of operands in parallel and produce up to four results to be written back into the VVSHP register file. However, it cannot issue more than one memory operation at a time, which loads/stores 128-bit scalar/vector data from/to data memory. The proposed VVSHP processor is implemented in VHDL targeting the Xilinx Virtex-5 FPGA, and its performance is evaluated.

9.
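Memory access optimization of a video encoder based on the TMS320DM642   Total citations: 1 (self-citations: 1, citations by others: 0)
The implementation principle of an MPEG-4 video encoder is briefly introduced, and the idea and concrete implementation of memory access optimization for video encoding based on the TMS320DM642 are described. Experimental results show that these optimizations improve video encoding efficiency while maintaining high image quality.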

10.
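This paper describes the algorithm design and optimization methods for implementing an MPEG-4 Simple Profile video encoder on the TI C64x DSP platform. At the algorithm level, motion estimation is improved and optimized, greatly reducing computational complexity with only a small loss in image quality (PSNR). Based on the characteristics of the C64x DSP, the program structure of the whole encoder and its main computation modules are optimized at the structural level, mainly by improving memory access efficiency and code parallelism. Experimental results show that for CIF-size video sequences the encoder achieves an encoding speed above 100 fps, so multi-channel video encoding can be implemented on the C64x DSP.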

11.
Multimedia applications such as video and image processing are often characterized by a huge number of data accesses. In many digital signal processing applications, array access patterns are regular and periodic. In these cases, optimized architectures using pipelined memory access controllers can be generated. In this paper, we focus on implementing memory interfacing modules that can be automatically generated by a high-level synthesis tool and that can efficiently handle predictable address patterns as well as random ones (i.e., dynamic address computations). The benefits of moving dynamic address computations from the datapath to dedicated computation units in the memory controller are also analyzed, as are operator bitwidth optimization and data locality, to save power and reduce latency.

12.
AVS1-P2 is the newest video standard of the Audio Video coding Standard (AVS) workgroup of China, and provides performance close to that of the H.264/AVC main profile with lower complexity. In this paper, a platform-independent software package with a macroblock-based (MB-based) architecture is proposed to facilitate implementation of the AVS video standard on embedded systems. Compared with the frame-based architecture commonly used for PC-oriented video applications, the MB-based decoder performs all of the decoding processes, except high-level syntax parsing, in a set of MB-based buffers sized to hold the information of the current MB and the neighboring reference MBs, in order to minimize on-chip memory and save the time consumed in on-chip/off-chip data transfer. By modifying the data flow and decoding hierarchy, simulating the data transfer between on-chip and off-chip memory, and modularizing the buffer definition and management for low-level decoding kernels, the MB-based system architecture reduces on-chip memory by over 80% compared to the frame-based architecture when decoding 720p sequences. The storage complexity is also analyzed with reference to the performance evaluation of the MB-based decoder. The MB-based decoder implementation provides an efficient reference for developing AVS applications on embedded systems. The complexity analysis provides rough storage complexity requirements for implementing and optimizing the AVS video standard.

13.
温淑鸿  崔慧娟  唐昆 《电子学报》2005,33(12):2246-2249
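To improve video encoder performance and reduce power consumption, this paper proposes a method that shortens data lifetimes to reduce memory requirements. Computing intermediate results in stages and storing fewer temporary results reduces both memory requirements and the number of memory accesses. A method is also proposed that makes effective use of the CPU register word width and the barrel shifter to reduce the computational complexity of variable-length coding output. Experimental results show that the method significantly shortens the time spent on variable-length coding in video encoding.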

14.
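By analyzing the structure and complexity of an H.264 software decoder, the key points and difficulties in optimizing the decoder are identified. Taking into account the performance characteristics of the TMS320DM642 DSP, the optimization methods used for an H.264 decoder on the TMS320DM642 DSP platform are discussed in detail. These methods mainly involve improving the parallelism of the program code and the efficiency of memory access, with a focus on optimizing key modules such as motion compensation and the IDCT. Experimental results show that the decoder can decode CIF-format video streams in real time.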

15.
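To address the high CPU usage and slow image data copying that occur when high-definition video is decoded and played on the client, a fast copy method for GPU-decoded data is proposed. The application of DXVA hardware decoding to video decoding is studied. To eliminate the high CPU usage that arises when decoded data is copied from video memory, a fast copy scheme for video frame data is designed and implemented, exploiting the characteristics of video memory and new features of the SSE4.1 multimedia instructions. Experimental results show that the method effectively reduces CPU usage while meeting the requirements of real-time high-definition playback, and that it is practical.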

16.
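To solve the problems that traditional radar signal processors face during development, namely difficult debugging, computing power limited by hardware, and poor program reusability, this paper proposes using a GPU as the radar's computing core. When radar signal processing algorithms were implemented on the GPU, the optimization gains for moving target detection (MTD) were far lower than those for pulse compression and constant false alarm rate detection. Analysis showed that matrix transposition and vector dot products account for much of the algorithm's time in MTD. This paper starts from the GPU's ...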

17.
18.
Optimization of true three-dimensional moving video data   Total citations: 1 (self-citations: 0, citations by others: 1)
江寅川  袁杰 《现代电子技术》2012,35(8):116-119,126
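A true three-dimensional video display technique based on a dot matrix is proposed. The system uses LEDs as unit nodes to form a three-dimensional spatial array for displaying true 3D moving images. Because the data volume is huge, the CUDA programming model is used to speed up processing, with the parts of the computation that can run in parallel handed to the GPU. The video data to be processed is first transferred to main memory and preprocessed by the CPU, then transferred to video memory, where the GPU processes the video motion and related steps; the results are transferred back to main memory for post-processing by the CPU, and the processed data is finally output for display or storage. Comparing computation times for CPU-only processing and the GPU-optimized version shows that the optimized computation is tens to hundreds of times faster, that the gain grows with data volume, and that GPUs with more cores achieve larger speedups. The experimental section gives results simulated with OpenGL.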

19.
This article presents an architecture that allows efficient ASIC implementations of high-throughput applications. Examples of these applications can be found in real-time video applications such as EDTV, IDTV and HDTV. A key issue in the architecture is to provide a balance between memory resources and processing resources. Special attention is paid to the communication between these two types of resources. Architectural techniques are proposed to solve bottlenecks in the memory bandwidth and conflicts between memory accesses. Architectures for address generation in combination with location assignment are presented. The flexibility of the architectural model allows an efficient hardware realization on an ASIC that exploits the inherent parallelism of a particular application. This is illustrated in the article using a complex video algorithm for Progressive Scan Conversion. The proposed architecture is used as a target architecture which drives the high-level synthesis approach of the PHIDEO compiler.

20.
Single-GPU scaling is unable to keep pace with the soaring demand for high-throughput computing. As such, executing an application on multiple GPUs connected through an off-chip interconnect will become an attractive option to explore. However, much of the current code is written for a single-GPU system. Porting such code for execution on multiple GPUs is a difficult task. In particular, it requires programmer effort to determine how data is partitioned across multiple GPU cards and then to launch the appropriate thread blocks so that they mostly access data local to their card. Otherwise, cross-card data movement is an expensive operation. In this work we explore hardware support to efficiently parallelize single-GPU code for execution on multiple GPUs. In particular, our approach focuses on minimizing the number of remote memory accesses across the off-chip network without burdening the programmer with data partitioning and workload assignment. We propose a data-location-aware thread block scheduler that schedules each thread block on the GPU that holds most of its input data. The scheduler exploits the well-known observation that GPU workloads tend to launch a kernel multiple times iteratively to process large volumes of data. The memory accesses of a thread block across different iterations of a kernel launch exhibit correlated behavior. Our data-location-aware scheduler exploits this predictability to track the memory access affinity of each thread block to a specific GPU card and stores this information to make scheduling decisions for future iterations. To further reduce the number of remote accesses, we propose a hybrid mechanism that migrates or copies pages between the memories of multiple GPUs based on their access behavior. Hence, most memory accesses go to local GPU memory.
On an architecture consisting of two GPUs, our proposed schemes improve performance by 1.55× compared to single-GPU execution across the widely used Rodinia [17], Parboil [18], and Graph [23] benchmarks.
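The affinity-tracking idea in abstract 20 can be sketched in software as follows. This is a toy model, not the hardware mechanism the paper proposes: the class `AffinityScheduler`, its methods, and the round-robin fallback for blocks with no history are all assumptions made for illustration, and block IDs are assumed to be non-negative integers.

```python
from collections import defaultdict


class AffinityScheduler:
    """Toy data-location-aware scheduler: one access counter per (block, GPU)."""

    def __init__(self, num_gpus):
        self.num_gpus = num_gpus
        # counts[block_id][gpu] = accesses by block_id to pages resident on gpu
        self.counts = defaultdict(lambda: [0] * num_gpus)

    def record_access(self, block_id, home_gpu):
        """Note that block_id touched a page resident on home_gpu."""
        self.counts[block_id][home_gpu] += 1

    def assign(self, block_id):
        """Pick the GPU holding most of the block's data.

        Blocks with no recorded history fall back to round-robin (assumed policy).
        Ties go to the lowest-numbered GPU.
        """
        history = self.counts.get(block_id)
        if history is None or sum(history) == 0:
            return block_id % self.num_gpus
        return max(range(self.num_gpus), key=lambda g: history[g])
```

In use, accesses observed during one kernel iteration would be fed to `record_access`, and `assign` would be consulted when the same block ID is launched in the next iteration, relying on the correlation across iterations that the abstract describes.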
