首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 129 毫秒
1.
为了提高访存效率,提供可以与计算流水线并行执行的多个独立的访存流水线,魂芯DSP片上存储器设计时采用分块内存结构,并在核内提供多个独立的地址生成单元用于访存操作.针对分块内存的结构特点,编译器对程序中的存储访问构建关于变量的冲突图,对分块内存进行存储块分配,优化数据在分块内存的分布.以数据在分块内存的优化分布为基础,指导程序中访存操作在地址生成单元的优化分配,使得编译器生成的代码可以最大程度地挖掘程序中数据访问的并行性.实验表明,基于分块内存的数据分配分布优化为其它优化如地址寄存器的分簇、访存向量化、软件流水等经典优化提供了良好基础,保证了编译器生成的代码可以充分发挥魂芯DSP提供的指令级并行能力.  相似文献   

2.
访存交易的处理顺序对内存访问的性能有重要影响.同一个SoC设备发出的多个未决交易往往地址连续且读写类型相同.然而,传统的总线仲裁方法导致各个设备发出的未决交易序列交错地发送至内存控制器,而内存控制器访存调度的范围有限,最终导致此类序列通常无法连续地访问内存.为解决此问题,提出一种新型的总线仲裁方法CGH,该方法利用SoC设备通信行为的特征,通过识别同一个SoC设备发出的、行地址和读写类型相同的未决交易序列并让其连续获得仲裁授权,减少内存切换行地址和读写类型的次数;同时,在选择将要授权的未决交易序列时,优先考虑行地址和读写类型与最近授权交易相同的申请,进一步提高访存效率.将CGH仲裁方法应用至北大众志-SKSoC后,系统访存性能提高了21.37%,而总线面积仅增加2.83%.此外,由于行地址切换次数减少,内存的能耗也降低了15.15%.  相似文献   

3.
随着多媒体So C中具备密集访存能力的设备数量增加,设备之间频繁争抢存储体资源,严重影响访存性能.为此提出一种面向多媒体So C的存储体访存负载均衡划分方法.通过操作系统对物理内存的管理,将设备所访问的数据映射到独立的存储体中,避免争抢频繁的设备共享存储体,减少设备间的访存冲突;划分过程基于数据量、延迟分析设备访存行为与访存冲突之间的关系,并以此来均衡各存储体的访问负载,同时提升多个设备的访存性能.该方法不依赖特殊硬件也无需修改上层应用,提供了一种透明的纯软件优化手段.将文中方法应用于真实的多媒体So C的实验结果表明,与基于带宽优先的划分方法相比,该方法在提高带宽利用率的同时降低访存延迟,将解码帧率提升8.4%~12.3%;并且在保证服务质量的情况下,可以通过进一步降低内存工作频率来减少系统功耗.  相似文献   

4.
双精度普通矩阵乘法DGEMM是BLAS库中最核心的函数之一,大部分三级BLAS库函数的核心计算都是通过调用DGEM M来实现的.该文针对龙芯3A具有128位访存指令的特点,通过理论分析,找到了最佳的循环展开方式;针对龙芯3A的Cache替换策略(随机替换),通过使用地址交错技术,减少了Cache的冲突失效;针对龙芯3A访存带宽有限的问题,通过使用共享数据的任务划分方式,减少了数据访存量.优化后的DGEMM单核和多核运算速度均是性能最高的开源BLAS库(Goto-BLAS)的2倍多.  相似文献   

5.
邱杰凡  华宗汉  范菁  刘磊 《软件学报》2022,33(2):751-769
在多核计算机时代,多道程序在整个共享内存体系上的“访存干扰”是制约系统总体性能和服务质量的重要因素.即使当前内存资源已相对丰富,但如何优化内存体系的性能、降低访存干扰并高效地管理内存资源,仍是计算机体系结构领域的研究热点.为深入研究该问题,详述将“页着色(pagecoloring)”内存划分技术应用于整个内存体系(包括Cache、内存通道以及内存DRAM Bank),进而消除了并行多道程序在共享内存体系上的访存干扰的一系列先进方法.从DRAM Bank、Channel与Cache以及非易失性内存(non-volatile memory, NVM)等内存体系中介质为切入点,层次分明地展开论述:首先,详述将页着色应用于多道程序在DRAM Bank与通道的划分,消除多道程序间的访存冲突;随后是将页着色应用于在内存体系中Cache和DRAM的“垂直”协同划分,可同时消除多级内存介质上的访存干扰;最后是将页着色应用于包含NVM的混合内存体系,以提高程序运行效率和系统整体效能.实验结果表明,所提内存划分方法提高了系统整体性能(平均5%-15%)、服务质量(QoS),并有效地降低了系统能耗.通过梳理...  相似文献   

6.
随着存储系统的访问速度与处理器运算速度的差距越来越显著,访存性能已成为提高处理器性能的瓶颈.通过对程序的访存行为进行分析,提出快速地址计算的自适应栈高速缓存方案.该方案将栈访问从数据高速缓存的访问中分离出来,充分利用栈空间数据访问的特点,提高指令级并行度,减少数据高速缓存污染,降低数据高速缓存失效率,并采用快速地址计算策略,减少栈访问的命中时间.该栈高速缓存在发生栈溢出时能够自适应地关闭,以避免栈切换对处理器性能的影响.栈高速缓存标志中增加进程标识,进程切换时不需要将数据写到低层存储系统中,适用于多进程环境.SPEC CPU2000程序运行结果表明,采用快速地址计算的自适应栈高速缓存方案,25.8%的访存指令可以并行执行,数据高速缓存失效率平均降低9.4%,IPC值平均提高6.9%.  相似文献   

7.
随着多核/众核成为处理器结构发展的主流,并行任务间共享地使用Cache而导致的冲突越来越成为性能提升的瓶颈.利用页着色可以实现对Cache的分区管理,减少共享Cache导致的冲突.页着色的原理是利用内存与Cache之间的组相联映射关系,通过控制分配固定区域的内存而达到分配固定区域Cache的目的,这一方面限制了任务能够请求的物理内存范围,另一方面调整程序使用的Cache空间需要做大量的内存拷贝,带来了不可忽视的开销.为了克服页着色的缺点,文中通过动态内存分配的方式,只对动态分配的页进行着色,在不修改内核和程序源码的前提下实现了动态Cache分区.文中提出的动态内存分配策略(CachePM)会根据运行时环境为任务分配内存,避免不同任务间共享Cache的冲突和同一任务内出现Cache的访问热点,通过合理划分程序运行时动态分配的内存达到Cache分区的目的.当任务的运行环境改变时,CachePM自适应地改变已经分配的堆中数据在物理内存中的布局,以实现Cache分区的动态调节.为进一步降低动态页着色的开销,作者采用了减少和延迟内存拷贝的策略.实验表明,该方法能够有效实现动态Cache分区,从而提高并行运行的任务的性能;同时由于动态内存分配策略避免了同一任务内出现Cache访问热点,单独运行的任务的性能也较在libc下运行有所提升.  相似文献   

8.
高性能处理器普遍采用片上集成大容量复杂结构的一级Cache提高处理器性能,但随着Cache容量和复杂度的增加,访问Cache所产生的访存延迟和功耗明显增加;基于存储队列,提出了一种通过减少Cache访问次数来降低功耗和延迟的方法,利用存储队列来缓存Load/Store指令的数据,并且当存储队列不满时,通过空闲入口暂存已经完成的仿存数据,提高了连续访存数据的复用率,减少了Cache的访问次数;仿真结果显示,该方法在增加少量的控制逻辑基础上,显著减少了Cache的访问次数,降低了Cache的功耗,减少了访存延迟,加快了执行速度。  相似文献   

9.
多线程和向量技术相结合是当前微处理器设计的一个重要趋势.提出一种多线程向量处理器中向量数据存储结构,利用多线程切换来隐藏访存延迟,并让向量数据直接访问二级cache来提高带宽.模拟实验表明在所提出的存储结构下,访存带宽随线程数线性增长,向量数据访问带宽明显高于标量数据访问带宽.  相似文献   

10.
针对可重构阵列处理器访存数据量大、数据并行性要求高且数据全局重用少、局部性明显的特点,提出了一种分布式Cache结构的簇内局部优先高效互连访问结构,该结构实现了簇内4×4个PE对4×4个Cache的并行访问,选用Xilinx公司的ZYNQ系列芯片XC7Z045 FFG900-2进行FPGA综合。在无冲突情况下,该互连结构支持簇内16个PE的同时读/写访问,最高频率可达221 MHz,访存峰值带宽为7.6 GB/s。在此结构上实现了灰度共生矩阵提取纹理图像特征算法,数据访存带宽达到478.125 MB/s,运行时间为0.24 ms。  相似文献   

11.
提出一种同时基于预知信息和预测机制的SDRAM新型动态页策略。该策略可充分利用待处理访存请求的地址信息,能对后续页命中情况进行精确判断;而当没有待处理访存请求可预知时,则利用所记录的历史信息对后续页命中情况进行预测,以最大程度地选择最合适的页策略。分析证明该策略的硬件实现代价很小。实验证实三类主要的基于预知信息的动态页策略之间的性能差异较小,均能获得较理想的访存带宽,最好情况下,实际访存带宽可提升42%。其中,对于绝大多数测试激励,同时基于预知信息和预测机制的新型动态页策略的性能均为最优或接近最优,适应范围最广。  相似文献   

12.
在分析H.264/AVC编码过程中存储器带宽需求的基础上,提出一种DRAM控制器结构,并实现了几种不同调度策略的DRAM控制器结构设计。实现了令牌环、固定优先级和抢占式等三种结构,结合已有的存储空间映射方法,通过减少换行及Bank切换过程中的冗余周期,进一步提高存储器的带宽利用率。实验结果表明,提出的三种存储器结构中抢占式调度具有最高的宽利用率,可满足150 MHz时钟频率条件下HDTV1080P实时编码的应用。  相似文献   

13.
为支持主流视频编码标准对可变块运动估计(VBSME)的要求,提出一种基于二维脉动阵列的可变块运动估计结构,该结构具有数据重传次数少,存储器带宽要求低,性能高的特性。针对该结构中的大量延迟寄存器开销问题,使用了基于数据流的AON网络模型和动态规划算法,对结构进行了分析,得到了优化的结果。  相似文献   

14.
Accessing pixels in memory is a well-known bottleneck of SIMD (single instruction multiple data) processors in video/imaging. To tackle it, we propose new block and row access modes of parallel on-chip memory subsystem, which enable a higher processing throughput and lower energy consumption than the access modes of the state-of-the-art subsystems. The new access modes significantly reduce the number of on-chip memory accesses, and thereby accelerate one of key video/imaging kernels: sub-pixel block-matching motion estimation. The main idea is to exploit spatial overlaps of blocks/rows accessed for pixel interpolation, which are known at the subsystem design-time, and merge multiple accesses into a single one by accessing somewhat more pixels at a time than with other parallel memories. To avoid the need for a wider, and, therefore, more costly SIMD datapath, we propose new memory read operations that split all pixels accessed at a time into multiple SIMD-wide blocks/rows, in a convenient way for further processing. As a proof of concept, we describe a parametric, scalable, and cost-efficient architecture that supports the new access modes. The architecture is based on a previously proposed set of memory banks with multiple pixels per bank word, and a previously proposed shifted scheme for arranging pixels in the banks. We analytically and experimentally demonstrate advantages of this work on a case study of sub-pixel motion estimation for video frame-rate conversion. The implemented motion estimator processes 2160p video at 60 fps in real time, while clocked at 600 MHz. Compared to the implementations based on the state-of-the-art subsystems, this work enables 40–70 % higher throughput, consumes 17–44 % less energy and has similar silicon area and off-chip memory bandwidth costs. That is 1.8–2.9 times more efficient than the prior art, considering the throughput and all costs, i.e., consumption, area, and off-chip bandwidth. Such a higher efficiency is the result of the new access modes, which reduced the number of on-chip memory accesses by 1.6–2.1 times, and the cost-efficient architecture.  相似文献   

15.
In this paper, a low-cost compatible motion compensator is implemented and integrated into a macroblock-level three-stage-pipelined HDTV decoder, in which an embedded compression (EC) engine is realized as well. The decoder with EC engine is designed to reduce the power consumption and memory bandwidth requirement since memory accesses are reduced. In the motion compensator, a boundary judgment scheme for reference pixel fetching is proposed to provide seamless integration in HDTV video decoder for the block-based EC engines. Furthermore, a buffer sharing mechanism is adopted to reduce extra memory requirement involved by EC. The reference pixel fetching unit costs only 17.3 K logic gates when the working frequency is set to 166.7 MHz. On average, when decoding HD1080 video sequence, 30% memory access reduction and 24% memory power consumption saving are achieved when a near lossless EC algorithm is integrated in the video decoder. In other words, the proposed motion compensator makes the EC engine an integral part of a memory reduced decoder without extra cost. Additionally, since the work in this paper is based on EC schemes, the EC design criterion are discussed, and several useful rules on the selection of EC algorithm are addressed for the video decoder of corresponding VLSI architecture.  相似文献   

16.
Video texts are closely related to the video content. The video text information can facilitate content based video analysis, indexing and retrieval. Video sequences are usually compressed before storage and transmission. A basic step of text-based applications is text detection and localization. In this paper, an overlaid text detection and localization method is proposed for H.264/AVC compressed videos by using the integer discrete cosine transform (DCT) coefficients of intra-frames. The main contributions of this paper are in the following two aspects: 1) coarse text blocks detection using block sizes and quantization parameters adaptive thresholds; 2) text line localization according to the characteristics of text in intra frames of H.264/AVC compressed domain. Comparisons are made with the pixel domain based text detection method for the H.264/AVC compressed video. Text detection results on five H.264/AVC video sequences under various qualities show the effectiveness of the proposed method.  相似文献   

17.
H.264 AVC video compression standard achieves high compression rates at the cost of a high encoder complexity. The encoder performances are greatly linked to the motion estimation operation which requires high computation power and memory bandwidth. High definition context magnifies the difficulty of a real-time implementation. EPZS and HME are two well-known motion estimation algorithms. Both EPZS and HME are implemented in a DSP and their performances are compared in terms of both quality and complexity. Based on these results, a new algorithm called HDS for Hierarchical Diamond Search is proposed. HDS motion estimation is integrated in a AVC encoder to extract timings and resulting video qualities reached. A real-time DSP implementation of H.264 quarter-pixel accuracy motion estimation is proposed for SD and HD video format. Furthermore HDS characteristics make this algorithm well suited for H.264 SVC real-time encoding applications.  相似文献   

18.
With the technological advancement, entertainment has become revolutionized and the high-definition (HD) video has become an integral feature of our modern amusement system. The demand for wireless transmission of HD video is rapidly rising for its ubiquitous nature, easy installation and relocation. Such wireless transmissions of HD video streams require very high bandwidth. The ultra-wideband (UWB) offers a large bandwidth, and short-range high-speed data transmission at low cost and low power consumption. In this paper, we present the feasibility study to transmit HD video wirelessly using H.264/AVC compression over the UWB communication channel. Simulations are carried out by controlling key H.264/AVC encoder parameters such as, in-loop deblocking filter, group of pictures, and quantization parameter. Based on the analysis, an optimum setting of these parameters is proposed for different bandwidth requirements, as well as acceptable video quality. The bandwidth achieved is restricted between 1.5 and 20?Mbps with a minimum reconstruction quality of 34?dB. The HD bit stream is then transmitted over the UWB communication channel and the demonstration shows that the overall encoder performance is satisfactory with the transmission bit-error-rate (BER) in the range of 10?5?C10?8.  相似文献   

19.
Recent developments have given birth to H.264/AVC: a video coding standard offering better bandwidth to video quality ratios than MPEG-2. It is expected that the H.264/AVC will take over the digital video market, replacing the use of MPEG-2 in most digital video applications. The complete migration to the new video-coding algorithm will take several years given the wide scale use of MPEG-2 in the market place today. This creates an important need for MPEG-2/H264 transcoding technologies. However, given the significant differences between both encoding algorithms, the transcoding process of such systems is much more complex to other heterogeneous video transcoding processes. In this work, we start by analyzing the methods defined in the H.264 video coding standard for the intra prediction: a central element of every H.264 encoder. We then introduce and evaluate six fast intra mode decision algorithms which should enable the development of MPEG-2 to H.264 transcoders. Having evaluated all the proposed methods, we have come out with a high-efficient method, namely DC-ABS pixel. Our results show that our algorithm considerable reduces the complexity involved in the intra prediction with respect the mode decision algorithms used in H.264 JM reference software, while exhibiting a slight degradation on the RD function.. Finally, we analyze a comparative study with two of the most prominent fast intra prediction methods presented in the literature. The results show that the proposed DC-ABS pixel method achieves the best results for video transcoding applications.
Hari KalvaEmail:
  相似文献   

20.
为支持H.264/AVC可变块运动估计(VBSME)要求,提出一个高性能的二维脉动阵列结构,该结构具有数据重传次数少,存储器带宽要求低的特性.对结构中存在大量延迟寄存器开销的问题,用基于数据流的AON网络模型对阵列结构的计算依赖关系进行分析,并使用动态规划算法,得到了优化的结构设计.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号