Similar literature
1.
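Irregular problems are common in large-scale parallel applications and are a key factor limiting program efficiency. A software cache is a widely used way to address them on the Cell processor. Conventional software caches ignore the memory access patterns of irregular references and fix the cache line at a constant length, which increases the load on memory bandwidth and limits cache utilization. To address this, the paper proposes an adaptive cache-line algorithm which, according to irregular memory accesses...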

2.
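The performance of a host accessing remote memory over a high-speed network now matches or far exceeds that of accessing the local disk, and various optimizations can improve network memory systems further. Building on a Linux network memory system (LNMS), this paper proposes a new client-side prefetching algorithm, m-ppm, which extends the multi-Markov-chain prefetching model to make it better suited to LNMS. Two other commonly used prefetching algorithms were also implemented on LNMS for comparison. Experimental data show that m-ppm is more effective in multi-user mode.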
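The idea of keeping a separate Markov chain per user can be sketched as follows. This is only a minimal illustration under stated assumptions, not the m-ppm algorithm itself: the class name `MarkovPrefetcher`, its methods, and the choice of a first-order chain are all hypothetical, since the abstract gives no algorithmic details.

```python
from collections import Counter, defaultdict


class MarkovPrefetcher:
    """First-order Markov predictor with one chain per client (illustrative only)."""

    def __init__(self):
        # chains[client][page] counts which page followed `page` for that client.
        self.chains = defaultdict(lambda: defaultdict(Counter))
        # Most recent page seen for each client.
        self.last_page = {}

    def record(self, client, page):
        """Update the client's chain with the transition last_page -> page."""
        prev = self.last_page.get(client)
        if prev is not None:
            self.chains[client][prev][page] += 1
        self.last_page[client] = page

    def predict(self, client):
        """Return the most frequent successor of the client's last page, or None."""
        prev = self.last_page.get(client)
        if prev is None:
            return None
        followers = self.chains[client][prev]
        if not followers:
            return None
        return followers.most_common(1)[0][0]


prefetcher = MarkovPrefetcher()
for page in [1, 2, 1, 2, 1]:
    prefetcher.record("client-a", page)
for page in [1, 3, 1]:
    prefetcher.record("client-b", page)
```

Because each client has its own chain, the interleaved requests of several users do not pollute one another's transition counts, which is one plausible reason a multi-chain model would suit multi-user workloads.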

3.
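In recent years, with the development of SaaS technology, accessing software over the network as a service has become a new usage model. On-demand dynamic deployment of software is an important foundation for this model, and supporting it requires an execution environment that can load and run software in a streaming fashion. During on-demand streaming execution, however, a program blocks whenever it requests a missing data block until that block has been downloaded, which severely degrades execution performance and user experience. To address this performance problem, a prefetching mechanism based on an N-Gram prediction model and incremental data mining is proposed to support streaming-loaded software execution. The mechanism collects the historical access logs produced as users run the software, mines them to dynamically update and refine the prefetching rules, and then prefetches software according to the most reasonable rule. It supports prefetching at both the file level and the software-block level. Experimental results show that, for various kinds of software, the prefetching-capable file system reduces software startup loading time by 10% to 50%, with a prefetch hit rate of 81% to 97%.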
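A minimal sketch of N-gram prediction over an access log is given below. It assumes a plain list of block identifiers as the log and a default order of n = 3; the function names `build_ngram_rules` and `predict_next` are hypothetical, and the incremental rule mining described in the abstract is not reproduced here.

```python
from collections import Counter, defaultdict


def build_ngram_rules(log, n=3):
    """Map each (n-1)-block context in the access log to counts of the next block."""
    if n < 2:
        raise ValueError("n must be at least 2")
    rules = defaultdict(Counter)
    for i in range(len(log) - n + 1):
        context = tuple(log[i:i + n - 1])
        rules[context][log[i + n - 1]] += 1
    return rules


def predict_next(rules, recent, n=3):
    """Return the block most often seen after the last n-1 accesses, or None."""
    context = tuple(recent[-(n - 1):])
    followers = rules.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical access log of block identifiers.
log = ["a", "b", "c", "a", "b", "c", "a", "b", "d"]
rules = build_ngram_rules(log, n=3)
```

With this log, the context ("a", "b") was followed by "c" twice and by "d" once, so "c" is the block a prefetcher would fetch next after seeing "a", "b".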

4.
Stride- and Pointer-Based Prefetching in Chip Multiprocessors   Total citations: 1, self-citations: 1, citations by others: 0
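Based on an analysis of the memory access behavior of a large number of programs, a prefetching method based on strides and pointers is proposed. It captures both regular data access patterns and pointer access patterns. The method is implemented with a global history buffer between the L2 cache and memory. Full-system simulation results show that it improves the performance of commercial application benchmarks by 14% on average and of scientific computing benchmarks by 34.5% on average.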
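The stride half of such a scheme can be illustrated with a short sketch. It assumes a simple per-stream address history and a confirmation threshold of two equal deltas; the function names and parameters are hypothetical, and the global history buffer and pointer prefetching of the paper are not modeled.

```python
def detect_stride(addresses, confirmations=2):
    """Return the stride if the last `confirmations` deltas are equal and non-zero."""
    if len(addresses) < confirmations + 1:
        return None
    recent = addresses[-(confirmations + 1):]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    stride = deltas[0]
    if stride != 0 and all(d == stride for d in deltas):
        return stride
    return None


def next_prefetch_addresses(addresses, degree=2, confirmations=2):
    """Return up to `degree` addresses ahead of the last access along the stride."""
    stride = detect_stride(addresses, confirmations)
    if stride is None:
        return []
    last = addresses[-1]
    return [last + stride * k for k in range(1, degree + 1)]
```

For the access sequence 100, 164, 228 the detected stride is 64, so a degree-2 prefetcher would request addresses 292 and 356.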

5.
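陈娟, 易会战, 董勇, 杨学军. 《软件学报》 (Journal of Software), 2006, 17(7): 1650-1660
In mobile and embedded devices the energy supply is very limited, constrained by the capacity of the energy source and by how much power can be saved. In such an energy-constrained environment, the battery cannot supply enough energy for the system to reach its optimal performance target. A method is therefore proposed for optimizing prefetching performance under an energy constraint. Using software control, the method achieves the best performance attainable with the limited energy supply. It relies on a CPU and memory whose frequencies can be adjusted dynamically: according to how busy or idle the CPU and memory are, frequency-adjustment instructions are inserted to guide their frequencies, so that the two performance metrics of prefetching optimization (execution time and processor benefit) are optimal under a given energy constraint. A detailed model and simulation environment were built for the problem, and the method was validated on a set of benchmark programs dominated by array accesses. Simulation results show that the method is effective for energy-constrained prefetching optimization.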

6.
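In persistent object frameworks built on relational databases and object-relational mapping, objects are usually linked to one another through object references and collection attributes, forming more complex composite objects. Applications access these composite objects by following those attributes and visiting the member objects one by one. Such navigation among member objects greatly increases the number of fetch operations between the client and the back-end database system and causes serious performance problems. Object prefetching loads the objects an application is likely to access from the database into the client in advance, in groups or batches according to some policy, and thereby reduces the number of queries the application issues to the back-end database. This paper analyzes and classifies existing object prefetching techniques and, on that basis, proposes an object prefetching technique based on multi-level access patterns. Finally, it describes the implementation of the algorithm in the persistent object framework of the software component platform StarC-CM.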

7.
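褚瑞, 卢锡城, 肖侬. 《软件学报》 (Journal of Software), 2006, 17(11): 2234-2244
The RAM (random access memory) grid is a new kind of grid system for sharing memory resources across a wide-area network. Its main goal is to improve the performance of memory-intensive or I/O-intensive applications when physical memory is insufficient. How effective a RAM grid is depends on network communication overhead, and its performance can be improved further if that overhead is reduced or hidden. Based on an analysis of the RAM grid, a prefetching mechanism built on "pushing" data is designed. A corresponding prefetching algorithm is proposed using sequential pattern mining from the data mining field. The algorithm is evaluated and validated through simulation based on real running states.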

8.
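Data prefetching is a technique for hiding memory access latency that emerged to ease the speed gap between microprocessors and DRAM. GCC, a widely used open-source compiler, implements a prefetching optimization for loop-level array accesses on tree-ssa. This work analyzes in depth the basic mechanism of loop-level array prefetching in GCC 4.9 and dissects the three cost models, based on prefetch benefit and analysis time, under which prefetches are not issued. From this it derives several factors that affect the effectiveness of loop array prefetching, and it measures the effect of GCC's loop array prefetching on typical test cases. The work offers guidance for improving GCC's existing loop-level array prefetching optimization.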

9.
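陈彬, 肖侬, 蔡志平, 王志英. 《软件学报》 (Journal of Software), 2010, 21(12): 3186-3198
For on-demand software deployment in large-scale virtual machine environments, a prefetching-based optimization mechanism is proposed that reduces the startup latency of client-side virtual machines and gives users better local running performance. Based on the behavioral characteristics of how users use software and on a fine-grained partitioning of virtual disk images, the mechanism prefetches the virtual disk images stored on the server in the background. Using AFPTR (access frequency and priority-based prefetch target recognition), a prefetch target recognition algorithm based on access frequency and priority, together with a mechanism for dynamically adjusting the prefetch volume, it concentrates prefetching on the few small virtual disk images the user actually uses and adapts the prefetch volume dynamically during prefetching. This raises the local hit rate of virtual disk accesses and, in turn, the running performance of the client-side virtual machine. A prototype of prefetching-based on-demand software deployment was implemented on the QEMU virtual machine and the Linux platform. Experimental results show that the prefetching mechanism effectively reduces virtual machine startup latency, improves local running performance, and supports fast, on-demand software deployment in virtual machine environments.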

10.
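To improve the access performance of network memory, a variable-stride banded stream detection algorithm, VSS (variable stride stream), and a clock-stride-based stream prefetching optimization are proposed on top of a page-level stream cache and prefetching structure. The banded stream detection algorithm solves the problem of virtual page addresses jumping during loop accesses under fixed-stride stream detection; it eliminates broken streams and effectively raises stream detection coverage. The clock-stride-based optimization adjusts the prefetch length dynamically, which addresses prefetches that cannot be fetched back in time and improves prefetching performance further. A comparison with sequential prefetching shows that VSS achieves prefetching with high accuracy and low communication overhead. The application of this stream cache and prefetching mechanism in a network memory system is analyzed through simulation, confirming the feasibility of trading a small performance loss for flexible remote memory expansion.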

11.
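Hardware data prefetching can effectively improve a processor's memory access performance and is a technique that the performance optimization of the Sunway (申威) processor urgently needs to master. Hardware cost and the constraints of the processor architecture are the main difficulties in implementing it. Drawing on academic research on hardware prefetching and on current industrial practice, and taking the structural characteristics of the Sunway processor closely into account, this work studies how to implement hardware prefetching on the Sunway processor. Taking stream prefetching as an example, with a 0.97% increase in processor core area, hardware prefetching improves the integer performance of the current Sunway processor by 5.17% on average and by up to 28.88%, and its floating-point performance by 6.39% on average and by up to 30.11%.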

12.
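To address the memory wall problem in modern computer systems, a data prefetching method suited to linked data structures is proposed: the pure traversal push method. Using multithreading on a chip multiprocessor (CMP) platform with a shared cache, a push thread is split off while the main program runs; it prefetches the data the main thread will need into the processor's shared cache ahead of time, hiding the main thread's memory latency. Experimental results show that, on a CMP architecture, the method yields some performance improvement for memory-bound programs dominated by linked structures.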

13.
Dynamic Data Prefetching in Home-Based Software DSMs   Total citations: 1, self-citations: 0, citations by others: 1
1 Introduction. Software Distributed Shared Memory (DSM) provides the illusion of shared memory on top of distributed memory hardware. Most software DSM systems are page-based, using virtual memory protection to trap accesses to shared memory. These systems suffer from the high communication and coherence-induced overheads caused by the high level of implementation and large granularity of coherence. Many techniques, such as multiple writer protocol[1], lazy release consistency[2], and data …

14.
Helper threaded prefetching based on chip multiprocessors is a well-known approach to reducing memory latency and has been explored for accesses to linked data structures. However, conventional helper threaded prefetching often suffers from useless prefetches and cache thrashing, which reduce its effectiveness. In this paper, we first analyze the shortcomings of conventional helper threaded prefetching for linked data structures. We then propose an improved scheme, Skip Helper Threaded Prefetching, for hotspots with two-level data traversals. Our solution is to profile the applications and balance delinquent loads between the main thread and the prefetching thread based on the characteristics of the operations in their hotspots. Evaluations show that the proposed solution improves average performance by 8.9% (-O2) and 8.5% (-O3) over conventional helper threaded prefetching that greedily prefetches all delinquent loads. We also compare our proposal with active threaded prefetching, which synchronizes with the main thread by semaphore, and find that our proposal provides better performance for the targeted applications.

15.
This paper proposes using a user-level memory thread (ULMT) for correlation prefetching. In this approach, a user thread runs on a general-purpose processor in main memory, either in the memory controller chip or in a DRAM chip. The thread performs correlation prefetching in software, sending the prefetched data into the L2 cache of the main processor. This approach requires minimal hardware beyond the memory processor: the correlation table is a software data structure that resides in main memory, while the main processor only needs a few modifications to its L2 cache so that it can accept incoming prefetches. In addition, the approach has wide applicability, as it can effectively prefetch even for irregular applications. Finally, it is very flexible, as the prefetching algorithm can be customized by the user on an application basis. Our simulation results show that, through a new design of the correlation table and prefetching algorithm, our scheme delivers good results. Specifically, nine mostly-irregular applications show an average speedup of 1.32. Furthermore, our scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases to 1.46. Finally, by exploiting the customization of the prefetching algorithm, we increase the average speedup to 1.53.
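A software correlation table of the general kind described can be sketched as below. This is a generic pair-based design written for illustration, not the new table design the paper reports: the class `CorrelationTable`, the `max_successors` limit, and the most-recent-first ordering are all assumptions.

```python
from collections import OrderedDict


class CorrelationTable:
    """Maps each miss address to the addresses that recently missed right after it."""

    def __init__(self, max_successors=2):
        self.max_successors = max_successors
        # miss address -> OrderedDict of successor addresses, most recent last.
        self.table = {}
        self.last_miss = None

    def on_miss(self, addr):
        """Learn the pair (last_miss, addr) and return addresses to prefetch for addr."""
        if self.last_miss is not None:
            successors = self.table.setdefault(self.last_miss, OrderedDict())
            successors.pop(addr, None)      # re-inserting moves addr to most recent
            successors[addr] = True
            while len(successors) > self.max_successors:
                successors.popitem(last=False)  # evict the oldest successor
        self.last_miss = addr
        # Most recently observed successors come first.
        return list(reversed(self.table.get(addr, ())))


table = CorrelationTable(max_successors=2)
prefetches = [table.on_miss(a) for a in [1, 2, 3, 1, 4, 1]]
```

Because the table is an ordinary data structure, the replacement policy and the number of successors per entry can be changed per application, which is the kind of flexibility the abstract attributes to doing correlation prefetching in software.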

16.
In this paper, we propose a compiler-directed cache coherence scheme which makes use of data prefetching to enforce cache coherence in large-scale distributed shared-memory (DSM) systems. The Cache Coherence With Data Prefetching (CCDP) scheme uses compiler analyses to identify potentially stale and nonstale data references in a parallel program and enforces cache coherence by prefetching the potentially stale references. In this manner, the CCDP scheme brings up-to-date data into the caches to avoid stale references and also hides the latency of these memory accesses. Furthermore, the scheme also prefetches the nonstale references to hide their memory latencies. To evaluate the performance impact of the CCDP scheme on a real system, we applied the scheme to five applications from the SPEC CFP95 and CFP92 benchmark suites, and executed the resulting codes on the Cray T3D. The experimental results indicate that for all of the applications studied, our scheme provides significant performance improvements by caching shared data and using data prefetching to enforce cache coherence and to hide memory latency.

17.
Data prefetching is an effective data access latency hiding technique to mask the CPU stall caused by cache misses and to bridge the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to a processor before it is actually needed. Many prefetching techniques have been developed for single-core processors. Recent developments in processor technology have brought multicore processors into mainstream. While some of the single-core prefetching t...

18.
The on-chip memory performance of embedded systems directly affects the system designers' decision about how to allocate expensive silicon area. A novel memory architecture, flexible sequential and random access memory (FSRAM), is investigated for embedded systems. To realize sequential accesses, small "links" are added to each row in the RAM array to point to the next row to be prefetched. The potential cache pollution is ameliorated by a small sequential access buffer (SAB). To evaluate the architecture-level performance of FSRAM, we ran the Mediabench benchmark programs on a modified version of the SimpleScalar simulator. Our results show that the FSRAM improves the performance of a baseline processor with a 16KB data cache by up to 55%, with an average of 9%; furthermore, the FSRAM reduces the data cache miss count by 53.1% on average due to its prefetching effect. We also designed RTL and SPICE models of the FSRAM, which show that the FSRAM significantly improves memory access time, while reducing power consumption, with negligible area overhead.

19.
Cache coherence enforcement and memory latency reduction and hiding are very important and challenging problems in the design of large-scale distributed shared-memory (DSM) multiprocessors. We propose an integrated approach to solve these problems through a compiler-directed cache coherence scheme called the Cache Coherence with Data Prefetching (CCDP) scheme. The CCDP scheme enforces cache coherence by prefetching the potentially stale references in a parallel program. It also prefetches the non-stale references to hide their memory latencies. To optimize the performance of the CCDP scheme, some prefetch hardware support is provided to efficiently handle these two forms of data prefetching operations. We also developed the compiler techniques utilized by the CCDP scheme for stale reference detection, prefetch target analysis, and prefetch scheduling. We evaluated the performance of the CCDP scheme via execution-driven simulations of several numerical applications from the SPEC CFP95 and the Perfect benchmark suites. The simulation results show that the CCDP scheme provides significant performance improvements for the applications studied, comparable to that obtained with a full-map hardware cache coherence scheme.

20.
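The synchronous data triggered architecture (SDTA) refines traditional instruction-level parallelism into micro-operation-level parallelism and offers high data processing capability, but its special instruction format and instruction characteristics pose challenges for instruction cache access. Instruction prefetching can effectively lower the instruction cache miss rate, strengthen the processor's instruction fetch capability, and improve performance. This paper analyzes the characteristics of the SDTA instruction set and proposes a hybrid hardware/software instruction prefetching mechanism suited to them, combining a hardware prefetch engine with software hints. The method effectively raises the instruction cache hit rate, is simple to implement, has a low rate of useless prefetches, and does not increase code size.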
