期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Path and cache conscious prefetching (PCCP)

Zhen He Alonso Marquez 《The VLDB Journal The International Journal on Very Large Data Bases》2007,16(2):235-249

Main memory cache performance continues to play an important role in determining the overall performance of object-oriented, object-relational and XML databases. An effective method of improving main memory cache performance is to prefetch or pre-load pages in advance to their usage, in anticipation of main memory cache misses. In this paper we describe a framework for creating prefetching algorithms with the novel features of path and cache consciousness. Path consciousness refers to the use of short sequences of object references at key points in the reference trace to identify paths of navigation. Cache consciousness refers to the use of historical page access knowledge to guess which pages are likely to be main memory cache resident most of the time and then assumes these pages do not exist in the context of prefetching. We have conducted a number of experiments comparing our approach against four highly competitive prefetching algorithms. The results shows our approach outperforms existing prefetching techniques in some situations while performing worse in others. We provide guidelines as to when our algorithm should be used and when others maybe more desirable. 相似文献

2.

使用内存缓存的迭代应用编程框架

连文波汪美玲陶秋铭赵琛《计算机系统应用》2015,24(3):44-49

迭代式计算是一类重要的大数据分析应用.在分布式计算框架MapReduce上实现迭代计算时,计算会被分解成多个作业并按作业依存关系顺序运行,这使得程序与分布式文件系统(DFS)有多次交互而影响程序执行时间.对这些交互相关数据的缓存会降低与DFS的交互时间,进而提升程序总体的性能.考虑到集群中的大量内存在多数情况下会处于空闲状态,提出了一种使用内存缓存的迭代式应用编程框架MemLoop.该系统从作业提交API、调度算法、缓存管理模块实现缓存管理以充分利用内存缓存迭代间可驻留数据与迭代内依存数据.我们将此框架与已有相关框架进行了比较,实验结果表明该框架能够提升迭代程序的性能. 相似文献

3.

Data cache prefetching using a global history buffer

Nesbit K.J. Smith J.E. 《Micro, IEEE》2005,25(1):90-97

Over the past couple of decades, trends in both microarchitecture and underlying semiconductor technology have significantly reduced microprocessor clock periods. These trends have significantly increased relative main-memory latencies as measured in processor clock cycles. To avoid large performance losses caused by long memory access delays, microprocessors rely heavily on a hierarchy of cache memories. But cache memories are not always effective, either because they are not large enough to hold a program's working set, or because memory access patterns don't exhibit behavior that matches a cache memory's demand-driven, line-structured organization. To partially overcome cache memories' limitations, we organize data cache prefetch information in a new way, a GHB (global history buffer) supports existing prefetch algorithms more effectively than conventional prefetch tables. It reduces stale table data, improving accuracy and reducing memory traffic. It contains a more complete picture of cache miss history and is smaller than conventional tables. 相似文献

4.

Efficient Integration of Compiler-Directed Cache Coherence and Data Prefetching

Hock-Beng Lim Pen-Chung Yew 《Journal of Parallel and Distributed Computing》2001,61(12):1775

Cache coherence enforcement and memory latency reduction and hiding are very important and challenging problems in the design of large-scale distributed shared-memory (DSM) multiprocessors. We propose an integrated approach to solve these problems through a compiler-directed cache coherence scheme called the Cache Coherence with Data Prefetching (CCDP) scheme. The CCDP scheme enforces cache coherence by prefetching the potentially stale references in a parallel program. It also prefetches the non-stale references to hide their memory latencies. To optimize the performance of the CCDP scheme, some prefetch hardware support is provided to efficiently handle these two forms of data prefetching operations. We also developed the compiler techniques utilized by the CCDP scheme for stale reference detection, prefetch target analysis, and prefetch scheduling. We evaluated the performance of the CCDP scheme via execution-driven simulations of several numerical applications from the SPEC CFP95 and the Perfect benchmark suites. The simulation results show that the CCDP scheme provides significant performance improvements for the applications studied, comparable to that obtained with a full-map hardware cache coherence scheme. 相似文献

5.

Reducing Data Cache Susceptibility to Soft Errors 总被引：1，自引：0，他引：1

Vilas Sridharan Hossein Asadi Mehdi B. Tahoori David Kaeli 《Dependable and Secure Computing, IEEE Transactions on》2006,3(4):353-364

Data caches are a fundamental component of most modern microprocessors. They provide for efficient read/write access to data memory. Errors occurring in the data cache can corrupt data values or state, and can easily propagate throughout the memory hierarchy. One of the main threats to data cache reliability is soft (transient, nonreproducible) errors. These errors can occur more often than hard (permanent) errors, and most often arise from single event upsets (SEUs) caused by strikes from energetic particles such as neutrons and alpha particles. Many protection techniques exist for data caches; the most common are ECC (error correcting codes) and parity. These protection techniques detect all single bit errors and, in the case of ECC, correct them. To make proper design decisions about which protection technique to use, accurate design-time modeling of cache reliability is crucial. In addition, as caches increase in storage capacity, another important goal is to reduce the failure rate of a cache, to limit disruption to normal system operation. In this paper, we present our modeling approach for assessing the impact of soft errors using architectural simulators. We also describe a new technique for reducing the vulnerability of data caches: refetching. By selectively refetching cache lines from the ECC-protected L2 cache, we can significantly reduce the vulnerability of the L1 data cache. We discuss and present results for two different algorithms that perform selective refetch. Experimental results show that we can obtain an 85 percent decrease in vulnerability when running the SPEC2K benchmark suite while only experiencing a slight decrease in performance. Our results demonstrate that selective refetch can cost-effectivety decrease the error rate of an L1 data cache 相似文献

6.

Parallel Multigrid for Anisotropic Elliptic Equations

《Journal of Parallel and Distributed Computing》2001,61(1):96-114

In this paper two well-known robust multigrid solvers for anisotropic operators on structured grids are compared: alternating-plane smoothers combined with full coarsening and plane smoothers combined with semi-coarsening. The study has taken into account not only numerical properties but also architectural ones, focusing on cache memory exploitation and parallel characteristics. Experimental results for the sequential algorithms have been obtained on two different systems based on the MIPS R10000 processor, but with different L2 cache sizes and memory bandwidths (an SGI O2 workstation and an SGI Origin 2000 system). Although the alternating-plane approach is the best choice for sequential implementations, experimental estimations show poor parallel efficiencies. For the semicoarsening alternative two different parallel implementations have been considered. The first one has optimal parallel characteristics but due to deterioration of the convergence properties its realistic efficiency is not satisfactory. In the second one, some processors remain idle during a short period of time on every multigrid cycle. However, the second parallel algorithm is more efficient since it preserves the numerical properties of the sequential version. Parallel experiments have also been taken on a Cray T3E system. 相似文献

7.

面向CFD应用的Intel持久内存性能评估

文敏华陈江胡广超韦建文王一超林新华《计算机工程与科学》2022,44(9):1550-1556

在科学计算领域,数据规模随着数值模拟精度要求的提高而快速增长,以DRAM为主存的传统方案由于成本高而难以扩展容量,近年来越来越被关注的持久内存技术有望解决这一问题。持久内存是在DRAM和SSD之间的补充,相比DRAM,持久内存具有容量大、性价比高的优点,但是性能也相对较低。为测试持久内存的应用性能,面向科学计算的一个重要领域——计算流体力学(CFD),对Intel持久内存进行性能评估。实验中,持久内存采用了最易于使用的内存模式,源码不需要任何修改,测试程序涵盖内存基准测试和3种常见的CFD算法,实验结果表明,在内存模式下,对不同CFD算法,相比纯DRAM的配置,持久内存的引入会带来一定的性能损失,且该损失随数据规模的增加而增大;另一方面,持久内存的部署使单服务器能支撑超大数据规模的数值模拟。相似文献

8.

Performance-aware cache management for energy-harvesting nonvolatile processors

Wang Yan Li Kenli Deng Xia Li Keqin 《The Journal of supercomputing》2022,78(3):3425-3447

With the increasing popularity of wearable, implantable, and Internet of Things devices, energy-harvesting nonvolatile processors (NVPs) have become promising alternative platforms due to their durability when running on an intermittent power supply. To address the problem of an intermittent power supply, backing up of volatile data into a nonvolatile cache has been proposed to avoid the frequent need to restart the program from the beginning. However, the penalties incurred by frequent backup and recovery operations significantly degrade the system performance and waste considerable energy resources. Moreover, the increasing amounts of data to be processed pose critical challenges in energy-harvesting NVP platforms with tight energy and latency budgets. To further improve the performance of NVPs, this article adopts a retention state that can enable a system to retain data in a volatile cache to wait for power recovery instead of backing up data immediately. Based on the retention time, we propose a performance-aware cache management scheme and a pre-backup method to improve the system performance and energy utilization while guaranteeing successful backup. The pre-backup method is also optimized by retaining data in a volatile cache when receiving a high voltage warning. In particular, the nonvolatile memory (NVM) compression technique is introduced to achieve the goal of minimizing power failures and maximizing system performance. Moreover, the security problems in the sleep state are discussed with regard to the NVM compression technique to guarantee the NVP’s security. We evaluate the performance and energy consumption of our proposed algorithms in comparison with the dual-threshold scheme. The experimental results show that compared with the dual-threshold scheme, the proposed algorithms together can achieve a 52.6% energy reduction and a 13.72% performance improvement on average.

相似文献

9.

基于Fermi架构的Join算法

李观钊陈思桐甄真陈虎《计算机科学》2013,40(3):62-67

在列数据库中,连接操作依然是最核心和最耗时的操作,GPU强大的计算能力可为此提供新的优化手段。基于Fermi架构,提出了新的Hash Join算法和Sort merge Join算法,其基本思想是充分利用该架构新增的缓存结构来减少连接操作的cache缺失率。与CUDA stream技术相结合,新算法在输出结果较多时可以有效地隐藏主存与显存间数据传输带来的延迟,进一步提升其执行效率。实验结果证实了基于Fcrmi架构的Hash Join算法处理偏抖数据的高效性及Sort merge Join算法的稳定性,并且通过比较表明,这两种算法的性能全面优于基于多核CPU充分优化的Join算法,最大加速2.4倍,在外键分布高偏抖时新的Hash Join算法的执行速度甚至达到每秒217M元组。相似文献

10.

Access region cache with register guided memory reference partitioning

Gyungho Le Yixin Shi 《Journal of Systems Architecture》2009,55(10-12):434-445

Wide-issue and high-frequency processors require not only a low-latency but also high-bandwidth memory system to achieve high performance. Previous studies have shown that using multiple small single-ported caches instead of a monolithic large multi-ported one for L1 data cache can be a scalable and inexpensive way to provide higher bandwidth. Various schemes on how to direct the memory references have been proposed in order to achieve a close match to the performance of an ideal multi-ported cache. However, most existing designs seldom take dynamic data access patterns into consideration, thus suffer from access conflicts within one cache and unbalanced loads between the caches. It is observed in this paper that if one can group data references defined in a program into several regions (access regions) to allow parallel accesses, providing separate small caches – access region cache for these regions may prove to have better performance. A register-guided memory reference partitioning approach is proposed and it effectively identifies these semantic regions and organizes them into multiple caches adaptively to maximize concurrent accesses. The base register name, not its content, in the memory reference instruction is used as a basic guide for instruction steering. With the initial assignment to a specific access region cache per the base register name, a reassignment mechanism is applied to capture the access pattern when program is moving across its access regions. In addition, a distribution mechanism is introduced to adaptively enable access regions to extend or shrink among the physical caches to reduce potential conflicts further. The simulations of SPEC CPU2000 benchmarks have shown that the semantic-based scheme can reduce the conflicts effectively, and obtain considerable performance improvement in terms of IPC; with 8 access region caches, 25–33% higher IPC is achieved for integer benchmark programs than a comparable 8-banked cache, while the benefit is less for floating-point benchmark programs, 19% at most. 相似文献

11.

Performance and reliability improvement by using asynchronous algorithms in disk buffer cache memory

Anna Haé 《Acta Informatica》1993,30(2):131-146

This paper proposes performance and reliability improvement by using new algorithms for asynchronous operations in disk buffer cache memory. These algorithms allow for writing the files into the buffer cache by the processes and consider the number of active processes in the system and the length of the queue to the disk buffer cache. Writing the contents of the buffer cache to the disk depends on the system load and the write activity. Performance and reliability measures including the elapsed time of writing a file into the buffer cache, the waiting time to start writing a file, and the mean number of blocks written to the disk between system failures are used to show performance and reliability improvement by using the algorithms. Sensitivity analysis is used to influence the algorithms' design. Examples of real systems are used to show the numerical results of performance and reliability improvement in different systems with various disk cache parameters and file sizes. 相似文献

12.

大规模视频点播磁盘cache替换算法 总被引：7，自引：0，他引：7

李勇彭宇行陈福接《计算机研究与发展》2000,37(2):207-212

在规划视频播（ＫＳＶＯＤ）中ｃａｃｈｅ机制是提高系统效率的有效手段,是支持ＶＯＤ实用化的关键技术之一,由于连续媒体的数据量大,使用周期长等特点,传统的ｃａｃｈｅ替换算法不能直接应用于ＳＶＯＤ。文中根据ＶＯＤ的特点开发了两种基于访问频率的替换算法,ＬＦＲＵ（ｌｅａｓｔｆｒｅｑｕｅｎｃｙａｎｄｒｅｃｅｎｔｌｙｕｓｅｄ）和ＰＬＦＵ（ｐｅｒｉｏｄｌｅａｓｔｆｒｅｑｕｅｎｃｙｕｓｅｄ）算法,它相似文献

13.

Performance of One''s Complement Caches

Qing Yang Sridar Adina T. Sun 《Journal of Parallel and Distributed Computing》1998,48(2):143

On-chip caches to reduce average memory access latency are commonplace in today's commercial microprocessors. These on-chip caches generally have low associativity and small cache sizes. Cache line conflicts are the main source of cache misses, which are critical for overall system performance. This paper introduces an innovative design for on-chip data caches of microprocessors, called one's complement cache. While binary complement numbers have been successfully used in designing arithmetic units, to the best of our knowledge, no one has ever considered using such complement numbers in cache memory designs. This paper will show that such complement numbers help greatly in reducing cache misses in a data cache, thereby improving data cache performance. By parallel computation of cache addresses and memory addresses, the new design does not increase the critical hit time of cache accesses. Cache misses caused by line interference are reduced by evenly distributing data items referenced by program loops across all sets in a cache. Even distribution of data in the cache is achieved by making the number of sets in the cache a prime or an odd number, so that the chance of related data being mapped to a same set is small. Trace-driven simulations are used to evaluate the performance of the new design. Performance results on benchmarks show that the new design improves cache performance significantly with negligible additional hardware cost. 相似文献

14.

Ordering Techniques for Two- and Three-Dimensional Convection-Dominated Elliptic Boundary Value Problems

Sabine Le Borne 《Computing》2000,64(2):123-155

Multigrid methods with simple smoothers have been proven to be very successful for elliptic problems with no or only moderate convection. In the presence of dominant convection or anisotropies as it might appear in equations of computational fluid dynamics (e.g. in the Navier-Stokes equations), the convergence rate typically decreases. This is due to a weakened smoothing property as well as to problems in the coarse grid correction. In order to obtain a multigrid method that is robust for convection-dominated problems, we construct efficient smoothers that obtain their favorable properties through an appropriate ordering of the unknowns. We propose several ordering techniques that work on the graph associated with the (convective part of the) stiffness matrix. The ordering algorithms provide a numbering together with a block structure which can be used for block iterative methods. We provide numerical results for the Stokes equations with a convective term illustrating the improved convergence properties of the multigrid algorithm when applied with an appropriate ordering of the unknowns. Received July 12, 1999; revised October 1, 1999 相似文献

15.

Data-Centric Transformations for Locality Enhancement

Kodukula Induprakas Pingali Keshav 《International journal of parallel programming》2001,29(3):319-364

On modern computers, the performance of programs is often limited by memory latency rather than by processor cycle time. To reduce the impact of memory latency, the restructuring compiler community has developed locality-enhancing program transformations such as loop permutation and tiling. These transformations work well for perfectly nested loops (loops in which all assignment statements are contained in the innermost loop), but their performance on codes such as matrix factorizations that contain imperfectly nested loops leaves much to be desired. In this paper, we propose an alternative approach called data-centric transformation. Instead of reasoning directly about the control structure of the program, a compiler using the data-centric approach chooses an order for the arrival of data elements in the cache, determines what computations should be performed when that data arrives, and generates the appropriate code. At runtime, program execution will automatically pull data into the cache in an order that corresponds approximately to the order chosen by the compiler; since statements that touch a data structure element are scheduled close together, locality is improved. The idea of data-centric transformation is very general, and in this paper, we discuss a particular transformation called data-shackling. We have implemented shackling in the SGI MIPSPro compiler which already has a sophisticated implementation of control-centric transformations for locality enhancement. We present experimental results on the SGI Octane comparing the performance of the two approaches, and show that for dense numerical linear algebra codes, data-shackling does better by factors of two to five. 相似文献

16.

A new software cache structure on Sunway TaihuLight

Li Jianjiang Deng Zhaochu Du Panpan Lin Jie 《The Journal of supercomputing》2022,78(4):4779-4798

The Sunway TaihuLight is the first supercomputer built entirely with domestic processors in China. On Sunway Taihulight, the local data memory (LDM) of the slave core is limited, so data transmission with the main memory is frequent during calculation, and the memory access efficiency is low. On the other hand, for many scientific computing programs, how to solve the storage problem of irregular access data is the key to program optimization. Software cache (SWC) is one of the effective means to solve these problems. Based on the characteristics of Sunway TaihuLight structure and irregular access, this paper designs and implements a new software cache structure by using part of the space in LDM to simulate the cache function, which uses new cache address mapping and conflicts solution to solve high data access overhead and storage overhead in a traditional cache. At the same time, the SWC uses the register communication between the slave cores to share on the different slave core LDMs, increasing the capacity of the software cache and improving the hit rate. In addition, we adopt a double buffer strategy to access regular data in batches, which hides the communication overhead between the slave core and the main memory. The test results on the Sunway TaihuLight platform show that the software cache structure in this paper can effectively reduce the program running time, improve the software cache hit rate, and achieve a better optimization effect.

相似文献

17.

Characterization and Parallelization of Decision-Tree Induction

《Journal of Parallel and Distributed Computing》2001,61(3):322-349

This paper examines the performance and memory-access behavior of the C4.5 decision-tree induction program, a representative example of data mining applications, for both uniprocessor and parallel implementations. The goals of this paper are to characterize C4.5, in particular its memory hierarchy usage, and to decrease the run-time of C4.5 via algorithmic improvement and parallelization. Performance is studied via RSIM, an execution driven simulator, for three uniprocessor models that exploit instruction level parallelism to varying degrees. This paper makes the following four contributions. The first contribution is presenting a complete characterization of the C4.5 decision-tree induction program. The results show that with the exception of the input data set, the working set fits into an 8-Kbyte data cache; the instruction working set also fits into an 8-Kbyte instruction cache. For data sets larger than the L2 cache, performance is limited by accesses to main memory. The results further establish that four-way issue can provide up to a factor of two performance improvement over single-issue for larger L2 caches; for smaller L2 caches, out-of-order dispatch provides a large performance improvement over in-order dispatch. The second contribution made by this paper is examining the effect on the memory hierarchy of changing the layout of the input dataset in memory, showing again that the performance is limited by memory accesses. One proposed data layout decreases the dynamic instruction count by up to 24%, but usually results in lower performance due to worse cache behavior. Another proposed data layout does not improve the dynamic instruction count over the original layout, but has better cache behavior and decreases the run-time by up to a factor of two. Third, this paper presents the first decision-tree induction program parallelized for a ccNUMA architecture. A method for splitting the decision tree hash table is discussed that allows the hash table to be updated and accessed simultaneously without the use of locks. The performance of the parallel version is compared to the original version of C4.5 and a uniprocessor version of C4.5 using the best data layout found. Speedup curves from a six-processor Sun E4000 SMP system show a speedup on the induction step of 3.99, and simulation results show that the performance is mostly unaffected by increasing the remote memory access time until it is over a factor of ten greater than the local memory access time. Last, this paper characterizes the parallelized decision-tree induction program. Compared to the uniprocessor version, the parallel version exerts significantly less pressure on the memory hierarchy, with the exception of having a much larger first level data working set. 相似文献

18.

龙芯2F上的访存优化 总被引：1，自引：1，他引：0

苏波李凯徐志广何颂颂《计算机系统应用》2010,19(1):171-175

一般的数据处理程序中,计算时间在其中往往只起次要作用,因此访存方式是否有效对程序的性能影响很大。在基于龙芯2F处理器研制的高性能计算机系统KD-50-I上安装ATLAS,经测试其性能只达到龙芯2F理论峰值的30%。通过循环展开减少函数存储访问次数,增大计算访存比;采用数据分块、部分拷贝以增强访存局部性,减少cache失效;利用非阻塞cache加快内存访问速度等访存优化技术,将ATLAS性能提高50%以上。相似文献

19.

A real-time multigrid finite hexahedra method for elasticity simulation using CUDA

Christian Dick Joachim Georgii Rüdiger Westermann 《Simulation Modelling Practice and Theory》2011,19(2):801-816

We present a multigrid approach for simulating elastic deformable objects in real time on recent NVIDIA GPU architectures. To accurately simulate large deformations we consider the co-rotated strain formulation. Our method is based on a finite element discretization of the deformable object using hexahedra. It draws upon recent work on multigrid schemes for the efficient numerical solution of partial differential equations on such discretizations. Due to the regular shape of the numerical stencil induced by the hexahedral regime, and since we use matrix-free formulations of all multigrid steps, computations and data layout can be restructured to avoid execution divergence of parallel running threads and to enable coalescing of memory accesses into single memory transactions. This enables to effectively exploit the GPU’s parallel processing units and high memory bandwidth via the CUDA parallel programming API. We demonstrate performance gains of up to a factor of 27 and 4 compared to a highly optimized CPU implementation on a single CPU core and 8 CPU cores, respectively. For hexahedral models consisting of as many as 269,000 elements our approach achieves physics-based simulation at 11 time steps per second. 相似文献

20.

Cache-based Computer Systems

《Computer》1973,6(3):30-36

A cache-based computer system employs a fast, small memory -the " cache" - interposed between the usual processor and main memory. At any given time the cache contains as much as possible the instructions and data the processor needs; as new information is needed it is brought from main memory to cache, displacing old information. The processor tends to operate with a memory of cache speed but with main memory cost-per-bit. This configuration has analogies with other systems employing memory hierarchies, such as "paging" or "virtual memory" systems. In contrast with these latter, a cache is managed by hardware algorithms, deals with smaller blocks of data (32 bytes, for example, rather than 4096), provides a smaller ratio of memory access times (5:1 rather than 1000: 1), and, because of the last factor, holds the processor idle while blocks of data are being transferred from main memory to cache rather than switching to another task. These are important differences, and may suffice to make the cache-based system cost effective in many situations where paging is not. 相似文献