1.

刘利李文龙郭振宇李胜梅汤志忠《软件学报》2005,16(10):1842-1852

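Software pipelining speeds up loop execution, and modulo scheduling is a widely used heuristic for software pipelining. Caches use a hierarchical organization to improve the memory system, but this also introduces additional memory latency: the cache penalty. The paper proves that modulo scheduling can incur cache penalties and proposes PCPMS (prevent cache penalty in modulo scheduling), an algorithm that avoids them. Experimental results show that PCPMS avoids cache penalties in modulo scheduling and improves program performance.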

2.

A Data Block Replacement Optimization Algorithm for Multi-level Caches

兰丽《计算机工程》2013,39(4)

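Most processors use a multi-level inclusive cache hierarchy, and existing last-level cache (LLC) block replacement algorithms incur a large performance overhead. To address this, the paper proposes PLI, an optimized LLC block replacement algorithm that takes a block's access frequency in the upper-level cache into account when choosing a victim, so that the best LLC replacement block is selected at low cost. Evaluation on a cycle-accurate simulator shows an average performance improvement of 7% over the original algorithm.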

3.

A Cache-Aware Scheduling Algorithm for Multi-core Processor Systems

徐远超沈岩谭旭万虎张志敏《小型微型计算机系统》2013,34(2):365-369

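Unfair use of and contention for cache space directly affect overall system performance. The default scheduler in current Linux is unaware of program behavior, including the number of cache misses, and does not know how threads may differ in memory access pattern and frequency, so it cannot make better scheduling decisions. This paper proposes CAS, a cache-aware scheduling algorithm, and implements it on Linux. CAS monitors each task's shared-cache misses per thousand instructions and groups tasks with similar miss counts onto the same core, so that tasks with very different miss counts run on different cores. This avoids running tasks that all have high miss counts on different cores at the same time, which reduces unfair use of and contention for cache space. Experiments show that in most cases CAS reduces the shared-cache misses of the whole workload and raises average system throughput by about 5%.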
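The grouping step described in this abstract can be illustrated with a minimal sketch. It is not the CAS implementation: the function name `group_by_mpki`, the input dictionary, and the even split of the sorted task list are all assumptions made for illustration, and a real scheduler would read miss counts from hardware performance counters.

```python
def group_by_mpki(task_mpki, num_cores):
    """Assign tasks to cores so that tasks with similar miss rates share a core.

    task_mpki: dict mapping a task name to its shared-cache misses per
               thousand instructions (hypothetical, pre-measured values).
    num_cores: number of cores to fill.
    Returns a list with one task list per core.
    """
    # Sorting by miss rate places tasks with similar rates next to each other.
    ordered = sorted(task_mpki, key=task_mpki.get)
    base, extra = divmod(len(ordered), num_cores)
    cores = []
    start = 0
    for core in range(num_cores):
        # Spread any remainder over the first `extra` cores.
        size = base + (1 if core < extra else 0)
        cores.append(ordered[start:start + size])
        start += size
    return cores


if __name__ == "__main__":
    measured = {"a": 0.5, "b": 12.0, "c": 0.7, "d": 11.0}
    print(group_by_mpki(measured, 2))
```

Because the two high-miss tasks land on the same core, they are time-sliced rather than run simultaneously on different cores, which is the effect the abstract attributes to CAS.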

4.

Cache Optimization of Inter-loop Data Reuse in Multiprocessor Systems (total citations: 2; self-citations: 0; citations by others: 2)

丁永华原庆能臧斌宇朱传琪《软件学报》1998,9(8):580-585

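Caches ease the large speed gap between the CPU and main memory, and they also make the cache hit rate an important factor in the performance of multiprocessor systems. Researchers have actively explored how to strengthen data locality and raise the cache hit rate so that multiprocessor systems perform better. Past work, however, has focused mainly on strengthening data locality within parallel loops and on reducing or eliminating the cache thrashing caused by true and false sharing of cache lines within parallel loops; exploiting inter-loop data reuse in multiprocessor systems has rarely been discussed. This paper analyzes and discusses how to exploit such inter-loop data reuse and proposes several practical, easy-to-implement methods. These methods...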

5.

An Improved Energy-Optimal OpenMP Static Scheduling Algorithm

董勇陈娟杨学军《软件学报》2011,22(9):2235-2247

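Building on the EOSS algorithm from earlier work, the paper presents an energy optimization algorithm for OpenMP static scheduling under extended conditions: improved energy-optimal static scheduling (IEOSS). On top of the original EOSS algorithm, IEOSS models the effect of memory access latency caused by data cache misses on the performance and energy of parallel loops...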

6.

A Low-Power Dynamically Reconfigurable Cache Design (total citations: 1; self-citations: 0; citations by others: 1)

何勇肖斌陈章龙涂时亮《计算机应用与软件》2009,26(8):247-250

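In embedded microprocessor design, the cache improves performance but also becomes a major source of power consumption. The paper proposes a non-uniform, dynamically reconfigurable low-power cache architecture and a dynamic reconfiguration algorithm, DAS (Dynamic Associativity Selection), which lowers power consumption by reconfiguring the cache at run time. Simulation results on MiBench show that the reconfigurable cache outperforms a conventional cache while consuming less energy: instruction and data cache hit rates rise by 2.1% and 1.4% on average, respectively, and memory system energy consumption falls by 8.1% on average.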

7.

Star-Join Optimization for Databases on Multi-core CPU and GPU Platforms

刘专韩瑞琛张延松陈跃国张宇《计算机应用》2021,41(3):611-617

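To address the high execution cost of star joins between a fact table and multiple dimension tables in online analytical processing (OLAP), the paper proposes a star-join optimization method for advanced multi-core central processing units (CPUs) and graphics processing units (GPUs). First, to address the materialization cost of star joins on multi-core CPU and GPU platforms, it proposes a vectorized star-join algorithm based on vector indexes for both platforms. Next, it proposes star-join operations at vector granularity by partitioning vectors to fit the sizes of the CPU cache and GPU shared memory, which reduces the materialization cost of the vector index in star joins. Finally, it proposes a star-join algorithm based on compressed vectors, which compresses the fixed-length vector index into a variable-length binary vector index and thereby improves the storage access efficiency of the vector index within the cache at low selectivity. Experimental results show that on the CPU platform the vectorized star-join algorithm improves performance by more than 40% over conventional row-wise or column-wise joins, and on the GPU platform by more than 15% over the conventional star-join algorithm. Compared with current mainstream in-memory and GPU databases, the optimized star-join algorithm improves performance by 130% over the best in-memory database, Hyper, and by 80% over the best GPU database, OmniSci. Vector-index-based vectorized star-join optimization thus effectively improves multi-table join performance: compared with traditional optimization techniques, vectorized processing based on vector indexes improves data access efficiency on smaller caches, and compressed vectors further improve the access efficiency of the vector index within the cache.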

8.

Software Pipelining Guided by Cache Profiling Information

周谦冯晓兵张兆庆《计算机研究与发展》2008,45(5):834-840

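Software pipelining is an important instruction scheduling technique that speeds up loops by executing instructions from different loop iterations simultaneously. As the gap between processor speed and memory access speed keeps widening, memory access instructions, especially those that miss in the cache, increasingly become the bottleneck to system performance. Because the latency of these instructions is not fixed, predicting and hiding it in software pipelining is very important. Unlike earlier methods for predicting memory latency, this work introduces cache profiling: it predicts memory latency from dynamically collected profile information and schedules accordingly. When the latency of memory access instructions in a modulo-scheduled loop is increased, the initiation interval grows as well, so performance does not improve. The CSMS and FLMS algorithms change the latency of memory access instructions while avoiding, as far as possible, any increase in the initiation interval. The paper improves CSMS and FLMS so that they change memory latency according to cache profiling information, which makes them more accurate than earlier methods. Experiments show that the new method effectively improves program performance: average performance on the SPEC2000 benchmarks improves by about 1%, and individual cases improve by as much as 11%.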

9.

A Fast WLAN Handoff Scheme Based on the Mobile Terminal

徐伟杨怡陶军《计算机工程》2009,35(14):135-137

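The delay and jitter caused by a mobile terminal handing off between APs seriously degrade the quality of real-time services. After analyzing the handoff process and existing improvement schemes, the paper proposes a scan-trigger mechanism based on a dynamic threshold, which effectively avoids cache updates when the mobile terminal is stationary or the AP signal is good. A partitioned-cache update algorithm triggered by the dynamic-threshold scan is implemented on the STA; it keeps the cache up to date while lowering the cost of each cache update, and it effectively reduces handoff delay.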

10.

Power Management of Idle Nodes in Cluster Systems

刘勇鹏卢凯迟万庆《计算机科学》2013,40(4):59-63

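To address the severe energy waste caused by the large number of idle active nodes in cluster systems, the paper proposes a cache-style dynamic power management model for idle nodes. Using the nodes' multi-level sleep mechanism, it divides idle nodes into sets at different sleep levels, with each sleep state corresponding to one level of node reserve cache, aiming for a system response rate close to that of the active state and energy savings close to those of the deepest sleep state. Based on this model and weighing both energy consumption and response rate, the paper designs a dynamic promotion and demotion algorithm that moves idle nodes between sleep states, a reserve-pool-based algorithm for allocating and reclaiming resource nodes, and an adaptive algorithm for the reserve threshold, so as to lower system energy consumption while preserving the system response rate. Experiments show that, at the cost of only a 0.99% increase in relative job delay, the proposed technique reduces idle-node power consumption by 69.51%.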

11.

Scalarization Using Loop Alignment and Loop Skewing

Zhao Yuan Kennedy Ken 《The Journal of supercomputing》2005,31(1):5-46

Array syntax, which is supported in many technical programming languages, adds expressive power by allowing operations on and assignments to whole arrays and array sections. To compile an array assignment statement to a uniprocessor, the language processor must convert the statement into a loop that has the same meaning. This process is called scalarization. Scalarization presents a significant technical problem because an array assignment needs to be implemented as if all inputs are fetched before any outputs are stored. Since a loop intermixes loads and stores, the compiler typically allocates a temporary array to hold the intermediate result. Because these extra temporary arrays can cause performance problems in cache, many techniques have been developed to avoid their use or minimize their size. In this paper, we present a novel application of two compiler strategies, loop alignment and loop skewing, to address this problem. We show that these strategies can achieve the asymptotically minimal memory allocation for stencil computations. Our experiments with loop alignment and loop skewing demonstrate that it is extremely effective in improving memory hierarchy performance of Fortran 90 array code on standard uniprocessors. The result should be applicable to other array languages, such as MATLAB.
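A small sketch can show why scalarization needs care. It only illustrates the fetch-before-store problem and the temporary-array fix that the abstract mentions; it does not implement loop alignment or loop skewing. The array statement `a[1:] = a[:-1] + 1` and all function names are illustrative assumptions.

```python
def array_semantics(a):
    """Reference result: every input is fetched before any output is stored."""
    out = list(a)
    out[1:] = [x + 1 for x in a[:-1]]
    return out


def naive_scalarized(a):
    """Direct forward loop: reads elements that earlier iterations overwrote."""
    out = list(a)
    for i in range(1, len(out)):
        out[i] = out[i - 1] + 1  # out[i - 1] may already hold a new value
    return out


def temp_scalarized(a):
    """Loop with a temporary array: correct, but costs extra memory traffic."""
    out = list(a)
    tmp = [out[i - 1] + 1 for i in range(1, len(out))]  # fetch all inputs first
    for i in range(1, len(out)):
        out[i] = tmp[i - 1]
    return out


if __name__ == "__main__":
    data = [10, 20, 30, 40]
    print(array_semantics(data))   # [10, 11, 21, 31]
    print(naive_scalarized(data))  # [10, 11, 12, 13]
    print(temp_scalarized(data))   # [10, 11, 21, 31]
```

The temporary array in `temp_scalarized` is the kind of extra allocation that the paper's techniques aim to avoid or shrink.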

12.

Loop Staggering, Loop Compacting: Restructuring Techniques for Thrashing Problem (total citations: 1; self-citations: 0; citations by others: 1)

Jin Guohua Yang Xuejun Chen Fujie 《计算机科学技术学报》1993,8(1):49-57

Parallel loops account for the greatest amount of parallelism in numerical programs. Executing nested loops in parallel with low run-time overhead is thus very important for achieving high performance in parallel processing systems. However, in parallel processing systems with caches or local memories in memory hierarchies, the "thrashing problem" may arise when data move back and forth frequently between the caches or local memories in different processors. The techniques associated with parallel compilers to solve the problem are not completely developed. In this paper, we present two restructuring techniques, called loop staggering and loop staggering and compacting, with which we can not only eliminate the cache or local memory thrashing phenomena significantly, but also exploit the potential parallelism existing in the outer serial loop. Loop staggering benefits the dynamic loop scheduling strategies, whereas loop staggering and compacting is good for static loop scheduling strategies. Our method especially benefits parallel programs in which a parallel loop is enclosed by a serial loop and array elements are repeatedly used in the different iterations of the parallel loop.

13.

A Loop-Based Instruction Cache Access Prediction Method (total citations: 1; self-citations: 0; citations by others: 1)

梁静陈志坚孟建熠《计算机应用研究》2012,29(7):2491-2493

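To reduce cache access power, the paper proposes a loop-oriented instruction cache access prediction method based on historical access paths. The method enables cache way prediction only for loops and trains the predictor with the instruction cache's historical access paths. When a loop body is entered again, the corresponding access-path predictor is selected to obtain the way of the target instruction cache to access, which lowers access power. The paper further proposes a multi-path way prediction method to achieve higher prediction accuracy. Experimental results on the Powerstone benchmarks show that the method reaches 99% prediction accuracy. Compared with a conventional instruction cache, a cache using this method lowers access power by 65% on average while increasing average instruction cache access cycles by only about 0.2%.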

14.

Oscillatory Firing and Loop Selection in Spiking Neuron Loops

陈贤富姚海东金燕晖路烽《计算机工程》2008,34(6):219-220

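A randomly varying input current is introduced into the spiking neuron model proposed by Izhikevich so that neuron spike firing becomes stochastic, and different numbers of neurons are connected by weights into networks that fire spikes. Experimental results show that sustained oscillatory firing of a loop can be obtained by choosing suitable connection weights. Spike firing can be used to select neural loops in the network and complete the loop's memory association process, which suggests a new approach to studying spiking neural intelligence.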

15.

An Effective Scheduling Algorithm for Cluster OpenMP Systems

吴少刚章隆兵蔡飞胡伟武《计算机研究与发展》2004,41(7):1298-1305

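As a standard for shared-memory parallel programming, OpenMP has become one of the mainstream models for parallel program design thanks to its ease of use and support for incremental parallelization. The OpenMP standard was designed for UMA shared-memory architectures, so its loop scheduling mechanism considers only load balance and not data distribution. In cluster OpenMP systems, however, data locality is the key factor affecting performance. Because the static scheduling strategy in the OpenMP standard is ill-suited to cluster computing, the paper proposes LBS, a scheduling algorithm that fully embodies the owner-computes rule, and implements it on a cluster OpenMP system (OpenMP/JIAJIA) through extended directives. Test results show that LBS is effective for cluster OpenMP systems.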

16.

Transforming Complex Loop Nests for Locality

Qing Yi Ken Kennedy Vikram Adve 《The Journal of supercomputing》2004,27(3):219-264

Over the past 20 years, increases in processor speed have dramatically outstripped performance increases for standard memory chips. To bridge this gap, compilers must optimize applications so that data fetched into caches are reused before being displaced. Existing compiler techniques can efficiently optimize simple loop structures such as sequences of perfectly nested loops. However, on more complicated structures, existing techniques are either ineffective or require too much computation time to be practical for a commercial compiler. To optimize complex loop structures both effectively and inexpensively, we present a novel loop transformation, dependence hoisting, for optimizing arbitrarily nested loops, and an efficient framework that applies the new technique to aggressively optimize benchmarks for better locality. Our technique is as inexpensive as the traditional unimodular loop transformation techniques and thus can be incorporated into commercial compilers. In addition, it is highly effective and is able to block several linear algebra kernels containing highly challenging loop structures, in particular, Cholesky, QR, LU factorization without pivoting, and LU with partial pivoting. The automatic blocking of QR and pivoting LU is a notable achievement: to our knowledge, few previous compiler techniques, including theoretically more general loop transformation frameworks [1, 21, 23, 27, 31], were able to completely automate the blocking of these kernels, and none has produced the same blocking as produced by our technique. These results indicate that with low compilation cost, our technique can in practice match the effectiveness of much more expensive frameworks that are theoretically more powerful.

17.

A New Loop Tiling Algorithm

舒辉康绯《计算机研究与发展》2002,39(10):1303-1306

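Loop tiling is a loop transformation that improves the cache hit rate of loops, and the tile size is the key factor determining its effectiveness. CME (cache miss equations) is a mathematical model for precisely analyzing the cache hit rate of loops in a program. Starting from the CME model, a loop tiling algorithm is derived by comparing the CMEs before and after tiling and combining the result with padding. Experiments show that, compared with the classic LRW loop tiling algorithm, the tile sizes computed by this algorithm are larger while still completely eliminating cache self-conflicts among the loop's array references, which improves tiling efficiency.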
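For readers unfamiliar with loop tiling, the transformation itself can be sketched as below. This is only a generic tiled matrix multiplication; it does not compute the tile size from cache miss equations as the paper does. The function name `matmul_tiled` and the `block` parameter are illustrative assumptions, and in practice the tile size would be chosen to fit the target cache.

```python
def matmul_tiled(a, b, block):
    """Multiply matrices a (n x m) and b (m x p) using square tiles.

    block: tile edge length (an assumed, caller-chosen value).
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    # The three outer loops walk over tiles; the three inner loops stay
    # inside one tile so its data can be reused while it is still cached.
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += aik * b[k][j]
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_tiled(a, b, 2))  # [[58, 64], [139, 154]]
```

The result is independent of `block`; only the order of memory accesses changes, which is why the choice of tile size is purely a performance question.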

18.

Loop scheduling with memory access reduction subject to register constraints for DSP applications

Yi Wang Zhiping Jia Renhai Chen Meng Wang Duo Liu Zili Shao 《Software》2014,44(8):999-1026

Memory accesses introduce substantial time overhead and power consumption because of the performance gap between processors and main memory. This paper describes and evaluates a technique, loop scheduling with memory access reduction (LSMAR), that replaces hidden redundant load operations with register operations in loop kernels and performs partial scheduling for newly generated register operations subject to register constraints. By exploiting data dependence of memory access operations, the LSMAR technique can effectively reduce the number of memory accesses of loop kernels, thereby improving timing performance. The technique has been implemented in the Trimaran compiler and evaluated using a set of benchmarks from DSPstone and MiBench on the cycle-accurate simulator of the Trimaran infrastructure. The experimental results show that applying the LSMAR technique reduces the number of memory accesses by 18.47% on average across the benchmarks compared with not applying it. The measurements also indicate that the optimizations lead to an average increase in code size of only 1.41%. With such small code size expansion, the technique is more suitable for embedded systems than prior work. Copyright © 2013 John Wiley & Sons, Ltd.
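The idea of replacing a redundant load with a register operation can be sketched on a toy loop. This is not the LSMAR algorithm and involves no scheduling or register constraints; the function names and the explicit `loads` counter are assumptions used only to make the saved memory accesses visible, with a local variable standing in for a register.

```python
def sum_pairs_baseline(x):
    """Compute y[i] = x[i] + x[i + 1], loading both operands every iteration."""
    loads = 0
    y = []
    for i in range(len(x) - 1):
        left = x[i]       # load
        right = x[i + 1]  # load; this value is loaded again next iteration
        loads += 2
        y.append(left + right)
    return y, loads


def sum_pairs_reuse(x):
    """Same result, but the value loaded as x[i + 1] is kept for the next step."""
    if len(x) < 2:
        return [], 0
    prev = x[0]  # one load before the loop
    loads = 1
    y = []
    for i in range(1, len(x)):
        cur = x[i]  # the only load in the loop body
        loads += 1
        y.append(prev + cur)
        prev = cur  # register-to-register move replaces the redundant load
    return y, loads


if __name__ == "__main__":
    print(sum_pairs_baseline([1, 2, 3, 4]))  # ([3, 5, 7], 6)
    print(sum_pairs_reuse([1, 2, 3, 4]))     # ([3, 5, 7], 4)
```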