Similar Documents
20 similar documents found (search time: 15 ms)
1.
The performance of scientific programs on modern processors can be significantly degraded by memory references that frequently arise due to load and store operations associated with array references. We have developed techniques for optimally allocating registers to array elements whose values are repeatedly referenced over one or more loop iterations. The resulting placement of loads and stores is optimal in that the number of loads and stores encountered along each path through the loop is minimal for the given program branching structure. To place load, store, and register-to-register shift operations without introducing fully/partially redundant and dead memory operations, a detailed value flow analysis of array references is required. We present an analysis framework to efficiently solve various data flow problems required by array load-store optimizations. The framework determines the collective behavior of recurrent references spread over multiple loop iterations. We also demonstrate how our algorithms can be adapted for various fine-grain architectures.
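As a rough illustration of the underlying idea (not the paper's algorithm; the function names and loop are invented for this sketch), the following Python fragment shows scalar replacement of a recurrent array reference: the value of a[i-1] is carried across iterations in a temporary, so each array element is loaded only once.

```python
# Illustrative sketch: "scalar replacement" of recurrent array references so
# each element is loaded from memory only once and then carried across
# iterations in a register-like temporary.

def smooth_naive(a):
    """Each iteration issues two loads: a[i-1] and a[i]."""
    out = [0.0] * len(a)
    for i in range(1, len(a)):
        out[i] = 0.5 * (a[i - 1] + a[i])
    return out

def smooth_scalar_replaced(a):
    """One load per iteration: the a[i-1] value is last iteration's a[i],
    kept in the temporary `prev` (the 'register') and shifted each iteration."""
    out = [0.0] * len(a)
    if len(a) < 2:
        return out
    prev = a[0]                 # load once outside the loop
    for i in range(1, len(a)):
        cur = a[i]              # the only load in the loop body
        out[i] = 0.5 * (prev + cur)
        prev = cur              # register-to-register "shift"
    return out

assert smooth_naive([1, 2, 4, 8]) == smooth_scalar_replaced([1, 2, 4, 8])
```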

2.
Dependences among loads and stores whose addresses are unknown hinder the extraction of instruction-level parallelism during the execution of a sequential program. Such ambiguous memory dependences can be overcome by memory dependence speculation, which enables a load or store to be speculatively executed before the addresses of all preceding loads and stores are known. Furthermore, multiple speculative stores to a memory location create multiple speculative versions of the location. Program order among the speculative versions must be tracked to maintain sequential semantics. A previously proposed approach, the Address Resolution Buffer (ARB), uses a centralized buffer to support speculative versions. Our proposal, called the Speculative Versioning Cache (SVC), uses distributed caches to eliminate the latency and bandwidth problems of the ARB. The SVC conceptually unifies cache coherence and speculative versioning by using an organization similar to snooping bus-based coherent caches. Our evaluation for the Multiscalar architecture shows that hit latency is an important factor affecting performance and that private-cache solutions trade off hit rate for hit latency.
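The bookkeeping that speculative versioning requires can be sketched in a few lines. The toy model below (my own simplification, not the SVC hardware organization) keeps one speculative version per task for each address; a load from task t must observe the newest version written by any task with sequence number at most t.

```python
# Toy model of speculative versioning bookkeeping (not the SVC design itself).

from bisect import insort

class SpeculativeVersions:
    def __init__(self, memory):
        self.memory = memory                 # committed (architectural) state
        self.versions = {}                   # addr -> sorted list of (task, value)

    def spec_store(self, task, addr, value):
        insort(self.versions.setdefault(addr, []), (task, value))

    def spec_load(self, task, addr):
        best = None
        for t, v in self.versions.get(addr, []):
            if t <= task:
                best = v                     # newest earlier-or-same task wins
        return best if best is not None else self.memory.get(addr, 0)

    def commit(self, task, addr):
        """Task became non-speculative: fold its version into memory."""
        for t, v in self.versions.get(addr, []):
            if t == task:
                self.memory[addr] = v
        self.versions[addr] = [(t, v) for t, v in self.versions.get(addr, [])
                               if t != task]

mem = {0x100: 7}
sv = SpeculativeVersions(mem)
sv.spec_store(task=2, addr=0x100, value=9)
assert sv.spec_load(task=1, addr=0x100) == 7   # earlier task: sees committed value
assert sv.spec_load(task=3, addr=0x100) == 9   # later task: sees speculative version
```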

3.
Traditional cache replacement policies are largely heuristic. In recent years, researchers have applied prediction techniques to infer memory-access behavior and improve replacement accuracy, and such prediction has become a focus of current cache-replacement research. Because memory-access behavior is inherently complex, predicting it directly inside the cache system is difficult and involves considerable uncertainty. To address this, existing work uses increasingly complex prediction algorithms to analyze correlations among accesses, but this does not truly reduce the uncertainty, and existing replacement policies also find it hard to shield the analysis from the interference of out-of-order execution and cache prefetching. To solve these problems, this paper proposes IFAPP (instruction flow access pattern prediction), a new method for predicting the cache access sequence: branch prediction is used to infer the program's instruction flow, the memory instructions in that flow are located, and the behavior of each memory instruction is then predicted individually. The reuse distance of every replacement candidate is computed from the predicted access sequence, and the candidate with the largest reuse distance is evicted. The method avoids interference from out-of-order execution and cache prefetching, and its prediction targets are individual memory instructions with simple behavior, which reduces the uncertainty faced during prediction. Experimental results show that the algorithm reduces misses in the L1 data cache by 3.2% on average compared with LRU, and by 12.3% and 14.4% respectively compared with the classic prediction-based BRRIP and BIP algorithms.
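The eviction rule itself is easy to state. The sketch below (illustrative only; the IFAPP predictor that produces the future access sequence is not reproduced here) evicts the resident line whose next reuse in a predicted access sequence is farthest away.

```python
# Minimal sketch of reuse-distance-based eviction over a *predicted* future
# access sequence: evict the line reused farthest in the future (or never).

def pick_victim(resident_lines, predicted_accesses):
    """Return the resident line with the largest reuse distance."""
    def reuse_distance(line):
        try:
            return predicted_accesses.index(line)
        except ValueError:
            return float("inf")              # never reused: best victim
    return max(resident_lines, key=reuse_distance)

# Line 0xC0 is predicted to be reused last, so it is chosen as the victim.
print(pick_victim([0xA0, 0xB0, 0xC0], [0xB0, 0xA0, 0xD0, 0xC0]))  # prints 192 (0xC0)
```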

4.
One of the main challenges of modern processor design is the implementation of a scalable and efficient mechanism to detect memory access order violations that result from out-of-order execution. Traditional age-ordered associative load and store queues are complex, inefficient, and power-hungry. In this paper, we introduce two new LSQ filtering mechanisms with different design tradeoffs; both explicitly rely on timing information as the primary instrument to rule out dependence violations and enforce memory dependences. Our timing-centric design operates at a fraction of the energy cost of an associative LQ and SQ with no performance degradation.

5.
Conventional dynamically scheduled processors often use a fully associative structure named the load/store queue (LSQ) to implement value communication between loads and older in-flight stores and to detect store-load order violations. But this in-flight forwarding accounts for only about 15% of all store-load communication, which makes the CAM-based microarchitecture the major bottleneck to scaling store-load communication further. This paper presents a new microarchitecture named ASW (short for active store window). It provides a new structure named the speculative active store window to implement more aggressive speculative store-load forwarding than a conventional LSQ. This structure can forward the data of committed stores to executing loads without accessing the L1 data cache, which is referred to as far forwarding in this paper. At the back end of the pipeline, it uses in-order load re-execution filtered by the tagged SSBF (short for store sequence bloom filter) to verify the correctness of the store-load forwarding. The speculative active store window and the tagged store sequence bloom filter are both set-associative structures, which are more efficient and scalable than fully associative ones. Experiments show that this simpler and faster design outperforms a conventional load/store queue based design and the NoSQ design on most benchmarks, by 10.22% and 8.71% respectively.
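The filtering role of a store-address Bloom filter can be illustrated briefly. The sketch below is my own simplification (hash functions, sizes, and the absence of tags are not from the paper): committed-store addresses are hashed into a bit vector, and a load whose address misses in the filter can safely skip re-execution, while hits (possibly false positives) re-execute.

```python
# Toy sketch of Bloom-filter-based filtering of load re-execution.

class StoreBloomFilter:
    def __init__(self, bits=256):
        self.bits = bits
        self.vec = [False] * bits

    def _hashes(self, addr):
        return (addr % self.bits, (addr * 2654435761 >> 8) % self.bits)

    def record_store(self, addr):
        for h in self._hashes(addr):
            self.vec[h] = True

    def load_may_need_reexecution(self, addr):
        # A miss proves no recent store touched this address; a hit may be a
        # false positive, so the load conservatively re-executes.
        return all(self.vec[h] for h in self._hashes(addr))

ssbf = StoreBloomFilter()
ssbf.record_store(0x1000)
assert ssbf.load_may_need_reexecution(0x1000)   # true dependence: re-execute
print(ssbf.load_may_need_reexecution(0x5A54))   # False here: re-execution skipped
```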

6.
Pointer swizzling is the conversion of database objects between an external form (object identifiers) and an internal form (direct memory pointers). Swizzling is used in some object-oriented databases, persistent object stores, and persistent and database programming language implementations to speed manipulation of memory-resident data. The authors describe a simplifying model of application behavior, revealing those aspects where swizzling is most relevant in both benefits and costs. The model has a number of parameters, which the authors have measured for a particular instance of the Mneme persistent object store, varying the swizzling technique used. The results confirm most of the intuitive, qualitative tradeoffs, with the quantitative data showing that some performance differences between schemes are smaller than might be expected. However, there are some interesting effects that run counter to naive intuition, most of which are explained using deeper analysis of the algorithms and data structures.
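A very small sketch of eager swizzling on object fault-in is shown below. The structure and names (PersistentStore, fault) are illustrative and not Mneme's API: references that are stored on disk as OIDs are rewritten into direct in-memory references when an object is loaded.

```python
# Illustrative sketch of eager pointer swizzling at object fault time.

class PersistentStore:
    def __init__(self, disk):
        self.disk = disk        # oid -> {"fields": {...}, "refs": {name: oid}}
        self.cache = {}         # oid -> swizzled in-memory object (a dict)

    def fault(self, oid):
        """Load an object, swizzling its reference fields from OIDs to objects."""
        if oid in self.cache:
            return self.cache[oid]
        raw = self.disk[oid]
        obj = dict(raw["fields"])
        self.cache[oid] = obj                       # insert before recursing (handles cycles)
        for name, target_oid in raw["refs"].items():
            obj[name] = self.fault(target_oid)      # OID -> direct pointer
        return obj

disk = {
    1: {"fields": {"name": "dept"}, "refs": {"head": 2}},
    2: {"fields": {"name": "alice"}, "refs": {}},
}
store = PersistentStore(disk)
print(store.fault(1)["head"]["name"])   # follows a direct pointer, not an OID lookup
```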

7.
Providing effective learning algorithms for neural networks is a key problem in neural network research. Using the adjoint implication operator of a t-norm, this paper presents a new learning algorithm for Max-TesFAM, a fuzzy associative memory network based on Max and Tes composition, where Tes is the t-norm operator proposed by Einstein. It is rigorously proved that, as long as Max-TesFAM can completely and reliably store the given pattern pairs, the new learning algorithm is guaranteed to find the greatest of all connection weight matrices that allow the network to store those pattern pairs completely and reliably. Finally, experiments demonstrate the effectiveness of the proposed learning algorithm.
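The ingredients named above can be written out concretely. The sketch below is a plausible reconstruction under stated assumptions, not the paper's exact formulation: it uses the Einstein t-norm, its residuum (adjoint implication), the standard implication-based weight learning for a max-t fuzzy associative memory, and max-t recall.

```python
# Hedged illustration: Einstein t-norm, its residuum, and implicative learning
# for a max-t fuzzy associative memory. Not the paper's exact Max-TesFAM rule.

def t_einstein(a, x):
    return (a * x) / (2.0 - (a + x - a * x))

def residuum(a, b):
    """Largest x with t_einstein(a, x) <= b (the adjoint implication)."""
    if a <= b:
        return 1.0
    return b * (2.0 - a) / (a + b - a * b)

def learn_weights(pattern_pairs):
    """W[i][j] = min over pairs (A, B) of residuum(A[i], B[j])."""
    n, m = len(pattern_pairs[0][0]), len(pattern_pairs[0][1])
    return [[min(residuum(A[i], B[j]) for A, B in pattern_pairs)
             for j in range(m)] for i in range(n)]

def recall(W, A):
    """Max-T composition: B[j] = max_i t_einstein(A[i], W[i][j])."""
    m = len(W[0])
    return [max(t_einstein(A[i], W[i][j]) for i in range(len(A))) for j in range(m)]

W = learn_weights([([1.0, 0.3], [0.8, 0.2])])
print([round(v, 3) for v in recall(W, [1.0, 0.3])])   # [0.8, 0.2]: the stored pair is recalled
```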

8.
Since a static work distribution does not allow for satisfactory speed-ups of parallel irregular algorithms, there is a need for a dynamic distribution of work and data that can be adapted to the runtime behavior of the algorithm. Task pools are data structures which can distribute tasks dynamically to different processors where each task specifies computations to be performed and provides the data for these computations. This paper discusses the characteristics of task-based algorithms and describes the implementation of selected types of task pools for shared-memory multiprocessors. Several task pools have been implemented in C with POSIX threads and in Java. The task pools differ in the data structures to store the tasks, the mechanism to achieve load balance, and the memory manager used to store the tasks. Runtime experiments have been performed on three different shared-memory systems using a synthetic algorithm, the hierarchical radiosity method, and a volume rendering algorithm. Copyright © 2004 John Wiley & Sons, Ltd.
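In the spirit of (but much simpler than) the pools compared in the paper, the sketch below shows a central task pool with worker threads: each worker repeatedly takes a task, executes it, records its result, and may push newly created child tasks back into the pool. All names here are illustrative.

```python
# Minimal central task pool with dynamic load balancing.

import queue
import threading

def run_task_pool(initial_tasks, num_workers=4):
    pool = queue.Queue()
    results = []
    lock = threading.Lock()
    for t in initial_tasks:
        pool.put(t)

    def worker():
        while True:
            try:
                task = pool.get_nowait()
            except queue.Empty:
                return
            value, new_tasks = task()          # a task returns a result and child tasks
            with lock:
                results.append(value)
            for child in new_tasks:
                pool.put(child)                # dynamically created work goes back in

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# Example: each task squares a number and spawns a child task for n-1.
def make_task(n):
    return lambda: (n * n, [make_task(n - 1)] if n > 1 else [])

print(sorted(run_task_pool([make_task(4)])))   # [1, 4, 9, 16]
```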

9.
This paper introduces the concept of problem stores: static stores whose dependent loads often miss in the cache. Accurately identifying problem stores allows the early determination of addresses likely to cause later misses, potentially allowing for the development of novel, proactive prefetching and memory hierarchy management schemes. We present a detailed empirical characterization of problem stores using the SPEC2000 CPU benchmarks. The data suggests several key observations about problem stores. First, we find that the number of important problem stores is typically quite small; the worst 100 problem stores write the values that will lead to about 90% of non-cold misses for a variety of cache configurations. We also find that problem stores only account for 1 in 8 dynamic stores, though they result in 9 of 10 misses. Additionally, the problem stores’ dependent loads miss in the L2 cache a larger fraction of the time than loads not dependent on problem stores. We also observe the set of problem stores is stable across a variety of cache configurations. Finally, we found that the instruction distance from problem store to miss and problem store to evict is often greater than one million instructions, but the value is often needed within 100,000 instructions of the eviction.
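A back-of-the-envelope way to identify problem stores from a trace is sketched below (the paper's methodology is more involved; the record format here is invented): attribute each load miss to the static store PC that last wrote the loaded address, then rank store PCs by how many dependent-load misses they cause.

```python
# Rough sketch: rank static store PCs by the dependent-load misses they cause.

from collections import Counter

def find_problem_stores(trace, top_n=100):
    """trace: iterable of ('store', pc, addr) / ('load', pc, addr, hit) records."""
    last_writer = {}                  # addr -> store PC that last wrote it
    misses_caused = Counter()
    for rec in trace:
        if rec[0] == 'store':
            _, pc, addr = rec
            last_writer[addr] = pc
        else:
            _, pc, addr, hit = rec
            if not hit and addr in last_writer:
                misses_caused[last_writer[addr]] += 1
    return misses_caused.most_common(top_n)

trace = [('store', 0x40, 0x1000), ('store', 0x44, 0x2000),
         ('load', 0x90, 0x1000, False), ('load', 0x94, 0x1000, False),
         ('load', 0x98, 0x2000, True)]
print(find_problem_stores(trace))   # [(64, 2)]: the store at PC 0x40 is the problem store
```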

10.
In this paper, a new methodology for speeding up edge and line detection algorithms is presented, achieving improved performance over the state-of-the-art software library OpenCV (speedups from 1.35 up to 2.22) and other conventional implementations, on both general-purpose and embedded processors, by reducing the number of load/store and arithmetic instructions, the number of data cache accesses and data cache misses in the memory hierarchy, and the algorithm's memory size. This is achieved by fully exploiting the combination of the software and hardware parameters, which are considered simultaneously as one problem and not separately. Furthermore, the edge and line detection algorithms have been simplified for a computer vision application in a Virtex-5 Xilinx FPGA using the Microblaze soft processor (detection and measurement of flow fronts in a microfluidic device); this achieves speedups of up to 660 times in comparison with conventional software implementations.

11.
This paper concentrates on the grid implementation of software tools for the comparison of protein structures. We have developed comparison algorithms based on indexing techniques that store transformation invariant properties of the 3D protein structures into tables. The method has large memory requirements and is computationally intensive. Furthermore, the dataset needs frequent updates as new proteins are added to the Protein Data Bank. Thus a significant advantage is obtained from a computational framework such as a grid. We report on a distributed implementation of the matching procedures on a grid using Globus MPI-CH, focusing on the data partition strategy to achieve good load balancing and to minimize the number of secondary memory accesses of the out-of-core computation.

12.
For a given input, the global trace generated by a parallel program in a shared memory multiprocessing environment may change as the memory architecture and management policies change. A method is proposed for ensuring that a correct global trace is generated in the new environment. This method involves a new characterization of a parallel program that identifies its address change points and address affecting points. An extension of traditional process traces, called the intrinsic trace of each process, is developed. The intrinsic traces maximize the decoupling of program execution from simulation by describing the address flow graph and path expressions of each process program. At each point where an address is issued, the trace-driven simulator uses the intrinsic traces and the sequence of loads and stores before the current cycle to determine the next address. The mapping between load and store sequences and next addresses to issue sometimes requires partial program reexecution. Programs that do not require partial program reexecution are called graph-traceable.

13.
Sha, T.; Martin, M.M.K.; Roth, A. IEEE Micro, 2007, 27(1): 106-113
The NoSQ microarchitecture performs store-load communication without a store queue and without executing stores in the out-of-order engine. It uses speculative memory bypassing for all in-flight store-load communication, enabled by a 99.8 percent accurate store-load communication predictor. The result is a simple, fast core data path containing no dedicated store-load forwarding structures.
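A loose sketch of a store-load communication predictor in the spirit of the idea described above follows; the table organization, training rule, and threshold are my guesses, not NoSQ's actual design. A load PC that repeatedly received its value from the same store is predicted to bypass directly from that store next time.

```python
# Illustrative PC-indexed store-load communication predictor with a small
# confidence counter; bypass is only predicted once confidence is high enough.

class StoreLoadPredictor:
    def __init__(self, confidence_threshold=2):
        self.table = {}              # load PC -> {"store_pc": ..., "conf": ...}
        self.threshold = confidence_threshold

    def predict_bypass(self, load_pc):
        e = self.table.get(load_pc)
        return e["store_pc"] if e and e["conf"] >= self.threshold else None

    def train(self, load_pc, forwarding_store_pc):
        """Called after the load verifies; records which store (if any) fed it."""
        e = self.table.setdefault(load_pc, {"store_pc": forwarding_store_pc, "conf": 0})
        if e["store_pc"] == forwarding_store_pc:
            e["conf"] = min(e["conf"] + 1, 3)
        else:
            e["store_pc"], e["conf"] = forwarding_store_pc, 0

pred = StoreLoadPredictor()
for _ in range(3):
    pred.train(load_pc=0x200, forwarding_store_pc=0x1f0)
print(hex(pred.predict_bypass(0x200)))    # 0x1f0: speculatively bypass from that store
```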

14.
An elastic and highly available data store is a key component of many cloud applications. Existing data stores with strong consistency guarantees are designed and optimized for small updates, key-value access, and (if supported) small range queries over a predefined key column. This raises performance and availability problems for applications which inherently require large updates, non-key access, and large range queries. This paper presents a solution to these problems: Crescando/RB, a distributed, scan-based, main-memory, relational data store (single table) with robust performance and high availability. The system addresses a real, large-scale industry use case: the Amadeus travel management system. This paper focuses on the distribution layer of Crescando/RB, the problem and theory behind it, the rationale underlying key design decisions, and the novel multicast protocol and replication framework it is composed of. Highlighting the key features of the distribution layer, we present experimental results showing that even under permanent node failures and large-scale data repartitioning, Crescando/RB remains fully available and capable of sustaining a heavy query and update load.

15.
Advances in hardware have opened up new topics for database research. With the advent of large-capacity memory, it has become possible to keep ever larger databases in memory, making main-memory database systems and techniques a reality. This paper describes the system architecture, the main internal data structures, and the implementation algorithms of the GKD-MMDB main-memory database management system.

16.
臧继昆  喻剑 《计算机科学》2015,42(5):221-224, 229
Using HDFS to store and process large-scale traffic surveillance video is a reliable, efficient, and scalable data storage solution. To address the storage hot spots that HDFS's default rack-aware placement policy may create, this paper proposes an event-density-based placement strategy for traffic surveillance video. Exploiting the fact that traffic video can be classified by event type, the strategy treats the load that the event videos of each type already stored on a data node may impose on it as one of the main evaluation factors at placement time, combines this with factors such as the node's real-time load and disk capacity into an overall score, and selects the best data node for placement, thereby balancing the load across data nodes. Experiments show that the event-density-based placement strategy improves data-node throughput and raises storage system performance.
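A rough sketch of such a composite score is shown below; the weights, normalization, and field names are illustrative, not the paper's exact model. Candidate data nodes are scored from the event-type videos they already hold, their current load, and their remaining disk capacity, and the new block goes to the least-burdened node.

```python
# Illustrative composite scoring for event-density-aware block placement.

def choose_data_node(nodes, event_type, weights=(0.5, 0.3, 0.2)):
    """nodes: list of dicts with 'name', 'event_counts', 'cpu_load', 'disk_free_ratio'."""
    w_event, w_load, w_disk = weights

    def score(node):
        event_load = node["event_counts"].get(event_type, 0) / 100.0  # normalized
        return (w_event * event_load
                + w_load * node["cpu_load"]
                + w_disk * (1.0 - node["disk_free_ratio"]))

    return min(nodes, key=score)["name"]          # lowest score = least burdened

nodes = [
    {"name": "dn1", "event_counts": {"accident": 80}, "cpu_load": 0.7, "disk_free_ratio": 0.4},
    {"name": "dn2", "event_counts": {"accident": 10}, "cpu_load": 0.3, "disk_free_ratio": 0.8},
]
print(choose_data_node(nodes, "accident"))   # -> "dn2"
```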

17.
Advances in hardware have opened up new topics for database research. With the advent of large-capacity memory, it has become possible to keep ever larger databases in memory, making main-memory database systems and techniques a reality. This paper describes the system architecture, the main internal data structures, and the implementation algorithms of the GKD-MMDB main-memory database management system.

18.
T-Tree Index Optimization for Main-Memory Databases in Mission-Critical Applications   Cited by: 3 (self-citations: 0, others: 3)
林鹏  李航  徐学洲 《计算机工程》2004,30(17):75-76,97
In mission-critical applications, one way to improve DBMS performance is to keep the database in main memory rather than on disk, which makes it possible to design new data structures and algorithms that raise the efficiency of the main-memory database (MMDB). This paper surveys existing results on index structures in current MMDB research and designs a new index structure, the T-tail tree. It then presents the main algorithms of the T-tail tree together with a performance analysis of these algorithms. The results show that the T-tail tree performs very well in a main-memory database.
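For orientation, the sketch below shows the basic T-tree search that T-tree variants build on (my own simplified version; the paper's T-tail specifics are not reproduced here): each node holds a small sorted array of keys, so most of the work is a search inside one memory-resident node.

```python
# Simplified T-tree node and search.

import bisect

class TTreeNode:
    def __init__(self, keys, left=None, right=None):
        self.keys = sorted(keys)     # bounded sorted array held by the node
        self.left, self.right = left, right

def t_tree_search(node, key):
    while node is not None:
        if key < node.keys[0]:
            node = node.left
        elif key > node.keys[-1]:
            node = node.right
        else:                        # key falls in this node's range: search the array
            i = bisect.bisect_left(node.keys, key)
            return i < len(node.keys) and node.keys[i] == key
    return False

root = TTreeNode([40, 50, 60],
                 left=TTreeNode([10, 20, 30]),
                 right=TTreeNode([70, 80, 90]))
print(t_tree_search(root, 20), t_tree_search(root, 55))   # True False
```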

19.
Membrane algorithms are a class of distributed and parallel algorithms inspired by the structure and behavior of living cells. Many attractive features of living cells have already been abstracted as operators to improve the performance of algorithms. In this work, inspired by the function of biological neuron cells storing information, we consider a memory mechanism by introducing memory modules into a membrane algorithm. The framework of the algorithm consists of two kinds of modules (computation modules and memory modules), both of which are arranged in a ring neighborhood topology. They can store and process information, and exchange information with each other. We test our method on a knapsack problem to demonstrate its feasibility and effectiveness. During the process of approaching the optimum solution, feasible solutions are evolved by rewriting rules in each module, and the information transfers according to directions defined by communication rules. Simulation results showed that the performance of membrane algorithms with memory cells is superior to that of algorithms without memory cells for solving a knapsack problem. Furthermore, the memory mechanism can prevent premature convergence and increase the possibility of finding a global solution.
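A toy reconstruction of the framework described above is sketched below; the module count, rewriting rule, and communication rule are my own choices, not the paper's. Computation modules evolve knapsack solutions, memory modules simply retain the best solution passed to them, and information circulates around the ring.

```python
# Toy ring of computation/memory modules solving a small knapsack instance.

import random

def knapsack_value(solution, values, weights, capacity):
    w = sum(wi for wi, s in zip(weights, solution) if s)
    return sum(vi for vi, s in zip(values, solution) if s) if w <= capacity else 0

def membrane_knapsack(values, weights, capacity, modules=6, generations=200):
    n = len(values)
    ring = [{"kind": "memory" if i % 2 else "compute",
             "sol": [random.randint(0, 1) for _ in range(n)]} for i in range(modules)]
    fitness = lambda s: knapsack_value(s, values, weights, capacity)

    for _ in range(generations):
        for m in ring:
            if m["kind"] == "compute":                  # rewriting rule: flip one bit
                cand = m["sol"][:]
                cand[random.randrange(n)] ^= 1
                if fitness(cand) >= fitness(m["sol"]):
                    m["sol"] = cand
        for i, m in enumerate(ring):                    # communication rule: pass along ring
            nxt = ring[(i + 1) % modules]
            if fitness(m["sol"]) > fitness(nxt["sol"]):
                nxt["sol"] = m["sol"][:]                # memory modules keep the better one
    return max(ring, key=lambda m: fitness(m["sol"]))["sol"]

best = membrane_knapsack(values=[6, 10, 12], weights=[1, 2, 3], capacity=5)
print(best)   # usually [0, 1, 1]: value 22 within capacity 5
```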

20.
This paper presents a new microarchitecture technique named DYNAMEM, in which memory reference instructions are dynamically scheduled and can be executed out of order. Load instructions can bypass store instructions speculatively, even if the store instructions' addresses are unknown. DYNAMEM can greatly alleviate the restrictions of ambiguous memory dependences. Simulation results show that the frequency of false loads is low. A mechanism is provided to repair false loads with low penalty and to achieve precise interrupts. Discussions and experimental results show that DYNAMEM can dramatically raise instruction-level parallelism in programs without recompilation.
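The violation check such speculative bypassing requires can be illustrated simply; the sketch below is my own, not DYNAMEM's hardware. When a store's address finally resolves, any younger load to the same address that already executed speculatively is a "false load" and must be repaired.

```python
# Illustrative detection of false (prematurely executed) loads when a store
# address resolves late.

def find_false_loads(executed_loads, store_seq, store_addr):
    """executed_loads: list of (seq, addr) for loads that already ran speculatively."""
    return [seq for seq, addr in executed_loads
            if seq > store_seq and addr == store_addr]

# A store at sequence number 10 to 0x2000 resolves late; the load at seq 12
# already read 0x2000 with a stale value, so it is flagged for repair.
executed = [(8, 0x2000), (12, 0x2000), (14, 0x3000)]
print(find_false_loads(executed, store_seq=10, store_addr=0x2000))   # [12]
```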
