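A latency-hiding data prefetching model is proposed that overlaps computation with memory access so as to achieve zero misses in the shared L2 cache. The concept of a basic block is introduced to simplify the algorithm's data structures and reduce storage overhead. Matrix elements are stored contiguously by basic block, which optimizes the algorithm with respect to the memory hierarchy and significantly reduces TLB misses. A non-recursive strategy for scheduling basic blocks makes full use of the shared L2 cache of a multi-core computer to reduce the number of main-memory accesses; it is not tied to any particular memory organization, which makes the algorithm cache-oblivious. Experimental results on multi-core computers show that the proposed non-recursive, thread-level parallel algorithm for matrix multiplication is efficient and scalable.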
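The block-contiguous storage and non-recursive block scheduling described in this abstract can be illustrated with a small single-threaded sketch. It is not the authors' algorithm: the function names, the explicit block size `b` and the loop order are assumptions made for illustration, and the prefetching model and thread-level parallelism are omitted.

```python
def to_block_layout(m, b):
    """Copy a square row-major matrix (list of lists) into block-contiguous storage.

    Block (bi, bj) occupies b*b consecutive slots starting at
    (bi * nb + bj) * b * b, where nb is the number of blocks per side.
    """
    n = len(m)
    assert n % b == 0, "this sketch assumes b divides n"
    nb = n // b
    out = [0] * (n * n)
    for bi in range(nb):
        for bj in range(nb):
            base = (bi * nb + bj) * b * b
            for i in range(b):
                for j in range(b):
                    out[base + i * b + j] = m[bi * b + i][bj * b + j]
    return out


def from_block_layout(blk, n, b):
    """Inverse of to_block_layout: rebuild an n x n list of lists."""
    nb = n // b
    m = [[0] * n for _ in range(n)]
    for bi in range(nb):
        for bj in range(nb):
            base = (bi * nb + bj) * b * b
            for i in range(b):
                for j in range(b):
                    m[bi * b + i][bj * b + j] = blk[base + i * b + j]
    return m


def block_matmul(a_blk, b_blk, n, b):
    """Multiply two n x n matrices held in block-contiguous storage.

    The three outer loops visit blocks iteratively (no recursion); the three
    inner loops multiply one pair of b x b blocks, each of which is a
    contiguous run of b*b elements.
    """
    nb = n // b
    bb = b * b
    c_blk = [0] * (n * n)
    for bi in range(nb):
        for bj in range(nb):
            c_base = (bi * nb + bj) * bb
            for bk in range(nb):
                a_base = (bi * nb + bk) * bb
                b_base = (bk * nb + bj) * bb
                for i in range(b):
                    for k in range(b):
                        a_ik = a_blk[a_base + i * b + k]
                        for j in range(b):
                            c_blk[c_base + i * b + j] += a_ik * b_blk[b_base + k * b + j]
    return c_blk
```

Because each block is a contiguous run of memory, a block product touches far fewer pages than the same product over row-major storage, which is the effect the abstract credits for the reduced TLB misses.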

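Array many-core processors are widely used in high-performance computing because of their high computing performance and energy efficiency. To build processors for future high-performance computing systems, the severe "memory wall" challenge and the problem of core coordination must be solved. In typical array processors the cores are mostly single-threaded to reduce overhead, but this places high demands on memory access. This work introduces hardware simultaneous multithreading into each core of an array many-core processor. To address the problem, observed in experiments, that the L1 instruction cache hit rate drops markedly as the number of threads grows, a redundant instruction cache storage structure for array many-core processors is proposed, and FIFO and LRU-like replacement policies are adopted on top of it. Simulation of this optimized cache design shows that, with two threads, the overall instruction cache miss rate falls by 25.2% and overall CPI performance improves by 30.2%.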

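Most processors adopt a multi-level inclusive cache hierarchy, and existing last-level cache (LLC) block replacement algorithms incur considerable performance overhead. To address this problem, an optimized LLC block replacement algorithm, PLI, is proposed. When choosing a block to evict, it takes into account the block's access frequency in the upper-level cache, and so selects the best LLC victim block at low cost. Evaluation on a cycle-accurate simulator shows that the algorithm improves performance by 7% on average over the original algorithm.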

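Existing SLP optimization algorithms cannot handle dependence cycles and reductions in inner loops, and they generate many redundant unpacking and assignment statements at basic-block boundaries, which leads to low vectorization efficiency. To address this problem, an SLP optimization algorithm based on cross-basic-block transformation and loop distribution is proposed. Working from the control flow graph, the algorithm performs cross-basic-block vectorization transformations according to the define-use relations of array variables between basic blocks and the data dependences that cross basic blocks. It applies cross-basic-block transformation and loop distribution in order, exploiting as much statement-level parallelism in the basic blocks of the innermost loop as possible, so that the SLP auto-vectorizing compiler generates vectorized code with more SIMD instructions. Experimental results show that the algorithm hides more of the overhead of redundant cross-basic-block operations and uses cross-basic-block data dependences to generate better SIMD instructions, effectively improving the speedup of vectorized programs.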

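Soft errors are caused by strikes of high-energy particles and seriously harm processor reliability. As processor design goals shift toward low power, high performance and low supply voltage, soft errors occur ever more frequently, and processor reliability is receiving increasing research attention. To overcome the inefficiency of traditional reliability evaluation based on fault-injection simulation, a systematic cache reliability evaluation method is proposed that takes one reliability metric, the architectural vulnerability factor (AVF), as its object of study. On the one hand, based on instruction behavior, it identifies the instructions that do not affect the final result during program execution, and thereby determines the instructions that contribute to the cache AVF. On the other hand, based on the cache's storage type and write policy, together with the characteristics of the data/instruction array and the address tag array, it studies how the various combinations of adjacent operations on the cache affect AVF, which completes the information analysis required for AVF evaluation. In the experiments, AVF is evaluated for the instruction array of the instruction cache of the PISA architecture, demonstrating the effectiveness of the method.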

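In global instruction scheduling based on nonlinear control flow graphs, the complexity of the control flow graph makes it hard to compute the priority of an instruction within its graph, and hence hard to rank instructions that come from different basic blocks. As a result, decisions about when to move an instruction out of its basic block, and where to move it, tend to be conservative and arbitrary. For example, D. Bernstein's heuristic for global instruction scheduling gives priority to instructions from the basic block currently being scheduled and from basic blocks that are control-equivalent to it. However, this heuristic often delays instructions on the critical path. The iterative global instruction scheduling algorithm proposed in this paper is based on D. Bernstein's global scheduling algorithm. It uses the same heuristic but selectively schedules a basic block multiple times, so that instructions on the critical path are scheduled as early as possible. Experimental results show that the algorithm improves scheduler performance by 8% at the cost of a 10% increase in scheduling time.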

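In the space radiation environment, abundant cosmic rays frequently cause transient faults in on-board computers, and one of the main effects of these faults is control flow errors in programs. This paper proposes a software-implemented control flow checking method, CFCAF. CFCAF classifies basic blocks using the control flow graph obtained after inserting virtual basic blocks, designs formatted signatures for the basic blocks, and then instruments the basic blocks with signature update and comparison instructions, thereby checking control flow between basic blocks, within basic blocks and across inter-procedural calls. A distinguishing feature of CFCAF is that it can be configured flexibly according to reliability and performance requirements. Fault injection experiments on CFCAF and two representative algorithms of the same kind show that CFCAF reduces the average program failure rate to 5.2% at an average performance cost of 41.7% and an average space cost of 34%. Among the three algorithms, CFCAF has relatively low time and space overhead and the highest reliability.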

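For cluster systems whose nodes differ in computing, communication and storage capability, and in which each node consists of several multi-core processors (chip multiprocessors) sharing an L3 cache, a divisible load scheduling model is proposed in which computation overlaps with data transfer and the master node sends data to the slave nodes concurrently using multiple processes. The model adapts to the nodes' differing computing, communication and storage capabilities. It dynamically computes the number of scheduling rounds and the size of the load block assigned to each slave node in each round, so as to balance the computational load across nodes, reduce inter-node communication overhead and shorten the schedule length. Based on the available capacity of the L3, L2 and L1 caches in each node, a data distribution method is proposed that partitions the load block received in a node's main memory across the cache levels, to ensure load balance among the multi-core processors and cores of the node. Building on the proposed inter-node divisible load scheduling model and intra-node multi-level storage data distribution method, a communication-efficient and storage-efficient parallel k-selection algorithm is designed and implemented for heterogeneous clusters whose nodes have multiple multi-core processors. On the Dawning TC5000A multi-core cluster, tests measured how the algorithm's running performance is affected by parallel versus serial data sending from the master node, by the utilization of each cache level, and by the number of threads run on each core. The experimental results show that the parallel k-selection algorithm built on the concurrent-sending scheduling model outperforms the one built on the serial-sending model; that the utilization of the L3 and L2 caches strongly affects performance; and that the running time is shortest when the L3, L2 and L1 cache utilizations take their optimized combination of values and each core runs 3 threads.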

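Executable file comparison is widely used in software copyright detection, malware family detection, pattern updating for anomaly detection, and patch analysis. Traditional methods cannot meet the speed and accuracy requirements of these applications. At the function, basic block and instruction levels, this work designs unary instruction signatures, unary structural signatures of functions based on the adjacency matrix of the function's control flow graph, and strong/medium/weak unary signatures of instructions. It also proposes function matching and basic block matching algorithms that combine signatures and attributes; these simplify existing instruction comparison, resist instruction reordering, and outperform SPP. Matching-weight statistics, a strict maximum unique matching strategy and hashing further reduce false positives and improve efficiency. Finally, a prototype tool, PEDiff, is implemented, and experiments confirm that the comparison method performs well in both speed and accuracy.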

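Traditional data management mechanisms are unaware of the non-uniform access latency of distributed cache layouts, which intensifies the conflict between miss rate and hit latency in the large-capacity caches of multi-core processors. Moreover, data migration alone or blind replication can hardly solve the problems of contended access to shared data blocks and long-latency hits. Based on a virtual shared domain partitioning mechanism for the distributed cache of tiled multi-core processors, an adaptive inter-domain data migration and replication mechanism is proposed and implemented. It jointly considers the state of candidate victim blocks in the local target bank and the local activity of remotely hit blocks, makes dynamic migration and replication decisions across multiple virtual shared domains for shared data contended by multiple cores, and balances on-chip long-latency hits against effective cache capacity utilization, thereby reducing average memory access latency. Finally, virtual shared domain partitioning and the adaptive inter-domain migration-replication mechanism are implemented in a full-system simulator, and the performance gains are evaluated with the SPLASH-2 benchmark suite. Experiments show that, compared with the traditional fixed shared domain partitioning mechanism and comparable optimization mechanisms, the adaptive migration and replication mechanism improves performance at all degrees of sharing, with negligible area overhead.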

A coarse-grain parallel solver for systems of linear algebraic equations with general sparse matrices by Gaussian elimination is discussed. Before the factorization two other steps are performed. A reordering algorithm is used during the first step in order to obtain a permuted matrix with as many zero elements under the main diagonal as possible. During the second step the reordered matrix is partitioned into blocks for asynchronous parallel processing (normally the number of blocks is equal to the number of processors). It is possible to obtain blocks with nearly the same number of rows, because there is no requirement to produce square diagonal blocks. The first step is much more important than the second one and has a significant influence on the performance of the solver. A straightforward implementation of the reordering algorithm will result in $O(n^2)$ operations. By using binary trees this cost can be reduced to $O(NZ \log n)$, where $NZ$ is the number of non-zero elements in the matrix and $n$ is its order (normally $NZ$ is much smaller than $n^2$). Some experiments on parallel computers with shared memory have been performed. The results show that a solver based on the proposed reordering performs better than another solver based on a cheaper (but at the same time rather crude) reordering whose cost is only $O(NZ)$ operations.

In this paper we study the area-minimization problem for hierarchical floorplans. We settle an open problem on the complexity of the area-minimization problem for hierarchical floorplans by showing it to be NP-complete (even for balanced hierarchical floorplans). We then present a new algorithm for determining the nonredundant realizations of a wheel. The algorithm has time cost $O(k^2 \log k)$ and space cost $O(k^2)$ if each block in a wheel has at most $k$ realizations. Based on the new algorithm for a wheel, we design a new pseudopolynomial area-minimization algorithm for hierarchical floorplans of order 5. The time and space costs of the algorithm are $O((nM)^2 \log(nM))$ and $O(n^2 M)$, respectively, where $n$ is the number of basic blocks and $M$ is an upper bound on the dimensions of the realizations of the basic blocks. The area-minimization algorithm was implemented. Experimental results show that it is very fast. The research of Peichen Pan and C. L. Liu was partially supported by the NSF under Grant MIP-9222408. The research of Weiping Shi was partially supported by the NSF under Grant MIP-9309120.

In this paper we describe an elegant and efficient approach to coupling reordering and decoding in statistical machine translation, where the n-gram translation model is also employed as the distortion model. The reordering search problem is tackled through a set of linguistically motivated rewrite rules, which are used to extend a monotonic search graph with reordering hypotheses. The extended graph is traversed in the global search, when a fully informed decision can be taken. Further experiments show that the n-gram translation model can be successfully used as a reordering model when estimated with reordered source words. Experiments are reported on the Europarl task (Spanish–English and English–Spanish). Results are presented regarding translation accuracy and computational efficiency, showing significant improvements in translation quality with respect to monotonic search for both translation directions at a very low computational cost.

In this paper, we consider the problem of finding fill-preserving sparse matrix orderings for parallel factorization. That is, given a large sparse symmetric positive definite matrix A that has been ordered by some fill-reducing ordering, we want to determine a reordering that preserves the sparsity and minimizes the cost of performing the Cholesky factorization in parallel. Previous research on this problem is all based on the elimination tree model, in which each node represents the task of factoring a column, and which can thus be seen as a coarse-grained task dependence model. To exploit more parallelism, Joseph Liu proposed a medium-grained task model, called the column task graph, and showed that it is well suited to shared-memory supercomputers. Based on the column task graph, we devise a greedy reordering algorithm and show that it finds the optimal ordering among the class of all fill-preserving orderings of the given sparse matrix A.

In this paper, a parallel algorithm is presented to find all cut-vertices and blocks of an interval graph. If the sorted list of end points of the intervals of the interval graph is given, the proposed algorithm takes $O(\log n)$ time and $O(n/\log n)$ processors on an EREW PRAM. If the sorted list is not given, the time and processor complexities are $O(\log n)$ and $O(n)$, respectively.

Variable Order Panel Clustering
Stefan Sauter. Computing, 2000, 64(3): 223-261
We present a new version of the panel clustering method for a sparse representation of boundary integral equations. Instead of applying the algorithm separately for each matrix row (as in the classical version of the algorithm) we employ more general block partitionings. Furthermore, a variable order of approximation is used depending on the size of blocks. We apply this algorithm to a second kind Fredholm integral equation and show that the complexity of the method only depends linearly on the number, say $n$, of unknowns. The complexity of the classical matrix oriented approach is $O(n^2)$ while, for the classical panel clustering algorithm, it is $O(n \log^7 n)$. Received July 28, 1999; revised September 21, 1999

The paper addresses the problem of multi-slot just-in-time scheduling. Unlike the existing literature on this subject, it studies a more general criterion: the minimization of the schedule makespan rather than the minimization of the number of slots used by the schedule. It gives an $O(n \log^2 n)$-time optimization algorithm for the single machine problem. For an arbitrary number $m > 1$ of identical parallel machines, it presents an $O(n \log n)$-time optimization algorithm for the case when the processing time of each job does not exceed its due date. For the general case on $m > 1$ machines, it proposes a polynomial-time constant-factor approximation algorithm.

New Model and Algorithm for Hardware/Software Partitioning
This paper focuses on the algorithmic aspects of hardware/software (HW/SW) partitioning, which searches for a reasonable composition of hardware and software components that not only satisfies the constraint on hardware area but also optimizes the execution time. The computational model is extended so that all possible types of communication can be taken into account in the HW/SW partitioning. Also, a new dynamic programming algorithm is proposed on the basis of the computational model, in which the source data of the basic scheduling blocks, rather than the speedup used in previous work, are directly utilized to calculate the optimal solution. The proposed algorithm runs in $O(n \cdot A)$ time for $n$ code fragments and available hardware area $A$. Simulation results show that the proposed algorithm solves the HW/SW partitioning problem without an increase in running time, compared with the algorithm cited in the literature.
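The $O(n \cdot A)$ bound has the shape of a knapsack-style dynamic program over code fragments and hardware area. The sketch below shows only that shape under simplifying assumptions: each fragment has a fixed software time, hardware time and integer hardware area, and communication costs, which the paper's extended model does account for, are ignored. The function name `partition` and its parameters are hypothetical.

```python
def partition(sw_time, hw_time, hw_area, area_limit):
    """Choose which fragments go to hardware to minimize total execution time.

    best[a] holds the minimum total time of the fragments processed so far
    when at most a units of hardware area may be used. Each of the n
    fragments updates all area_limit + 1 entries once, giving O(n * A) time.
    Returns (minimum total time, list of booleans: True = fragment in hardware).
    """
    n = len(sw_time)
    best = [0] * (area_limit + 1)
    # took_hw[i][a] records whether fragment i is placed in hardware
    # in the optimal solution for area budget a.
    took_hw = [[False] * (area_limit + 1) for _ in range(n)]
    for i in range(n):
        nxt = [0] * (area_limit + 1)
        for a in range(area_limit + 1):
            t = best[a] + sw_time[i]            # keep fragment i in software
            if hw_area[i] <= a:
                t_hw = best[a - hw_area[i]] + hw_time[i]
                if t_hw < t:                     # moving it to hardware is faster
                    t = t_hw
                    took_hw[i][a] = True
            nxt[a] = t
        best = nxt
    # Walk back through the recorded decisions to recover the assignment.
    in_hw = [False] * n
    a = area_limit
    for i in range(n - 1, -1, -1):
        if took_hw[i][a]:
            in_hw[i] = True
            a -= hw_area[i]
    return best[area_limit], in_hw
```

For example, with software times `[10, 8, 6]`, hardware times `[2, 3, 5]`, areas `[4, 3, 2]` and an area limit of 5, placing only the first fragment in hardware gives the minimum total time of 16.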

SimRank has become an important similarity measure to rank web documents based on a graph model on hyperlinks. The existing approaches for conducting SimRank computation adopt an iteration paradigm. The most efficient deterministic technique yields $O(n^3)$ worst-case time per iteration with the space requirement $O(n^2)$, where $n$ is the number of nodes (web documents). In this paper, we propose novel optimization techniques such that each iteration takes $O(\min\{n \cdot m, n^r\})$ time and $O(n + m)$ space, where $m$ is the number of edges in a web-graph model and $r \le \log_2 7$. In addition, we extend the similarity transition matrix to prevent random surfers getting stuck, and devise a pruning technique to eliminate impractical similarities for each iteration. Moreover, we also develop a reordering technique combined with an over-relaxation method, not only speeding up the convergence rate of the existing techniques, but achieving I/O efficiency as well. We conduct extensive experiments on both synthetic and real data sets to demonstrate the efficiency and effectiveness of our iteration techniques.
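For orientation, the iteration paradigm that this abstract refers to can be sketched with the plain, unoptimized SimRank update. This is the baseline recurrence only, not the optimized techniques the paper proposes; the function name, the adjacency representation and the decay factor `c = 0.8` are assumptions chosen for illustration.

```python
def simrank(in_nbrs, c=0.8, iters=5):
    """Naive iterative SimRank.

    in_nbrs maps each node to the list of nodes that link to it.
    s(a, a) = 1; for a != b, s(a, b) is c times the average similarity
    over all pairs of in-neighbours of a and b, or 0 if either node has
    no in-neighbours. Every iteration recomputes all node pairs.
    """
    nodes = list(in_nbrs)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iters):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                elif not in_nbrs[a] or not in_nbrs[b]:
                    new[a][b] = 0.0
                else:
                    total = sum(sim[u][v] for u in in_nbrs[a] for v in in_nbrs[b])
                    new[a][b] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        sim = new
    return sim
```

Two pages that are both linked only from the same third page end up with similarity `c`, since their single in-neighbour is identical.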

Optimization and Efficient Parallelization of Dynamic Programming in RNA Secondary Structure Prediction
Tan Guangming, Feng Shengzhong, Sun Ninghui. Journal of Software (软件学报), 2006, 17(7): 1501-1509
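Methods based on the minimum free energy model are the main approach to RNA secondary structure prediction in computational biology, but the dynamic programming algorithm that computes the minimum free energy takes $O(n^4)$ time, where $n$ is the length of the RNA sequence. Two strategies currently exist for reducing the time complexity. The first limits the size of internal loops in the secondary structure to at most $k$, which yields an $O(n^2 \times k^2)$ algorithm. The second, Lyngso's method, uses the loop energy rules, places no limit on loop size, and obtains a near-optimal solution in $O(n^3)$ time. By using an additional $O(n)$ space, the redundant computation in evaluating internal loops is greatly reduced, so that the optimal solution can be obtained in $O(n^3)$ time, again without limiting loop size. The optimized algorithm is still very time-consuming, however, so a parallel program is implemented on a cluster system using an effective load balancing method. Experimental results show that the parallel program achieves good speedup.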
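The structure of an interval dynamic program over an RNA sequence can be seen in a much simpler model than the one this abstract treats. The sketch below swaps in Nussinov-style base-pair maximization, which counts pairs instead of minimizing free energy and has no internal-loop terms, so it does not reproduce the paper's optimization. The function name, the allowed pair set and the minimum hairpin loop length of 3 are illustrative assumptions.

```python
def max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style recurrence).

    dp[i][j] is the best pair count for the subsequence seq[i..j].
    Position j is either unpaired or paired with some k in [i, j) that
    leaves at least min_loop unpaired bases between k and j.
    Runs in O(n^3) time and O(n^2) space.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # shorter spans cannot hold a pair
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                  # case 1: j stays unpaired
            for k in range(i, j - min_loop):     # case 2: j pairs with k
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]
```

The table is filled by increasing span, so every entry on one anti-diagonal depends only on shorter spans; that independence is what makes this family of algorithms a natural target for the kind of load-balanced parallelization the abstract describes.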

