期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

郇丹丹李祖松胡伟武刘志勇《计算机学报》2007,30(7):1104-1114

随着存储系统的访问速度与处理器的运算速度的差距越来越显著,访存性能已成为提高计算机系统性能的瓶颈.通过对指令Cache和数据Cache失效行为的分析,提出一种预取策略--结合访存失效队列状态的预取策略.该预取策略保持了指令和数据访问的次序,有利于预取流的提取.并将指令流和数据流的预取相分离,避免相互替换.在预取发起时机的选择上,不但考虑当前总线是否空闲,而且结合访存失效队列的状态,减小对处理器正常访存请求的影响.通过流过滤机制提高预取准确性,降低预取对访存带宽的需求.结果表明,采用结合访存失效队列状态的预取策略,处理器的平均访存延时减少30%,SPEC CPU2000程序的IPC值平均提高8.3%. 相似文献

2.

LAPT: A locality-aware page table for thread and data mapping

《Parallel Computing》2016

The performance and energy efficiency of current systems is influenced by accesses to the memory hierarchy. One important aspect of memory hierarchies is the introduction of different memory access times, depending on the core that requested the transaction, and which cache or main memory bank responded to it. In this context, the locality of the memory accesses plays a key role for the performance and energy efficiency of parallel applications. Accesses to remote caches and NUMA nodes are more expensive than accesses to local ones. With information about the memory access pattern, pages can be migrated to the NUMA nodes that access them (data mapping), and threads that communicate can be migrated to the same node (thread mapping).In this paper, we present LAPT, a hardware-based mechanism to store the memory access pattern of parallel applications in the page table. The operating system uses the detected memory access pattern to perform an optimized thread and data mapping during the execution of the parallel application. Experiments with a wide range of parallel applications (from the NAS and PARSEC Benchmark Suites) on a NUMA machine showed significant performance and energy efficiency improvements of up to 19.2% and 15.7%, respectively, (6.7% and 5.3% on average). 相似文献

3.

Software Controlled Adaptive Pre-Execution for Data Prefetching

ákos Dudás Sándor Juhász Tamás Schrádi 《International journal of parallel programming》2012,40(4):381-396

Data prefetching mechanisms are widely used for hiding memory latency in data intensive applications. They mask the speed gap between CPUs and their memory systems by preloading data into the CPU caches, where accessing them is by at least one order of magnitude faster. Pre-execution is a combined prefetching method, which executes a slice of the original code preloading the code and its data at the same time. Pre-execution is often mentioned in the literature, but according to our knowledge, it has not been formally defined yet. We fill this void by presenting the formal definition of speculative and non-speculative pre-execution, and derive a lightweight software-based strategy which accelerates the main working thread by introducing an adaptive, non-speculative pre-execution helper thread. This helper thread acts as a perfect predictor, calculates memory addresses, prefetches the data and consumes cache misses early. The adaptive automatic control allows the helper thread to configure itself in run-time for best performance. The method is directly applicable to any data intensive application without requiring hardware modifications. Our method was able to achieve an average speedup of 10–30% in a real-life application. 相似文献

4.

基于CMP的指针数据预取方法

下载免费PDF全文

朱会东黄永雨宋宝卫《计算机工程》2011,37(6):71-73

针对现代计算机系统中的存储墙问题,提出一种适合于链式数据结构的数据预取方法——纯遍历推送方法。采用基于共享高速缓存的多核处理器平台CMP上的多线程技术,在主程序运行时分离出一个推送线程,由其将主线程需要的数据提前预取至处理器共享高速缓存中以隐藏主线程的存储器延迟。实验结果证明该方法在CMP架构下对以链式结构为主的内存受限程序的性能有一定的改进。相似文献

5.

Architectural support for thread communications in multi-core processors

Sevin Varoglu Stephen Jenks 《Parallel Computing》2011,37(1):26-41

In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communications depend on cache coherence, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communications support for multithreaded applications. Prepushing is a software controlled data forwarding technique that sends data to destination’s cache before it is needed, eliminating cache misses in the destination’s cache as well as reducing the coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement with the addition of these architecture optimizations to multi-core processors. 相似文献

6.

OpenMP数据分布子句自动生成算法

黄品丰赵荣彩韩林刘晓娴《计算机工程》2013,39(3):295-299

将OpenMP程序扩展到异构多核结构时,非本地存储访问会导致访存开销增加,影响程序性能。针对该问题,引入带数组划分信息的数据分布子句,对数据在异构多核存储系统的布局进行管理,提出一种基于并行循环识别和数组引用模式分析的算法,实现该类子句的自动生成。实验结果表明,自动生成的OpenMP程序包含数据分布子句,具有较好的数据局部性,可降低访存开销,在异构多核系统上获得明显的性能提升。相似文献

7.

The Performance Optimization of Threaded Prefetching for Linked Data Structures

Yan Huang Jie Tang Zhi-min Gu Min Cai Jianxun Zhang Ninghan Zheng 《International journal of parallel programming》2012,40(2):141-163

Helper threaded prefetching based on Chip Multiprocessor is a well known approach to reducing memory latency and has been explored in linked data structures accesses. However, conventional helper threaded prefetching often suffers from useless prefetches and cache thrashing, which affect its effectiveness. In this paper, we first analyzed the shortcomings of conventional helper threaded prefetching for linked data structures. Then we proposed an improved helper threaded prefetching, Skip Helper Threaded Prefetching, for hotspots with two level data traversals. Our solution is to profile the applications and balance delinquent loads between main thread and prefetching thread based on the characteristic of operations in their hotspots. Evaluations show that the proposed solution improves average performance by 8.9% (-O2) and 8.5% (-O3) over the conventional helper threaded prefetching that greedily prefetches all delinquent loads. We also compare our proposal with the active threaded prefetching which synchronizes with main thread by semaphore, and find that our proposal provides better performance for the targeted applications. 相似文献

8.

Correlation prefetching with a user-level memory thread

Solihin Y. Lee J. Torrellas J. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(6):563-580

This paper proposes using a user-level memory thread (ULMT) for correlation prefetching. In this approach, a user thread runs on a general-purpose processor in main memory, either in the memory controller chip or in a DRAM chip. The thread performs correlation prefetching in software, sending the prefetched data into the L2 cache of the main processor. This approach requires minimal hardware beyond the memory processor: The correlation table is a software data structure that resides in main memory, while the main processor only needs a few modifications to its L2 cache so that it can accept incoming prefetches. In addition, the approach has wide applicability, as it can effectively prefetch even for irregular applications. Finally, it is very flexible, as the prefetching algorithm can be customized by the user on an application basis. Our simulation results show that, through a new design of the correlation table and prefetching algorithm, our scheme delivers good results. Specifically, nine mostly-irregular applications show an average speedup of 1.32. Furthermore, our scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases to 1.46. Finally, by exploiting the customization of the prefetching algorithm, we increase the average speedup to 1.53. 相似文献

9.

多线程环境的高效内存分配技术

马明理陈刚董金祥《计算机测量与控制》2006,14(11):1551-1553,1556

介绍了一种新的多线程内存分配技术（NIXMalloc）的设计和实现，提出了两种高效的分配策略及其自适应调优方法，有效地提高多线程应用程序的内存管理性能；其中Local分配策略对超级块对象Span进行了线程私有化，基于超级块对象为单位的垃圾回收和内存布局调整使多线程性能更优越；Global分配策略采用了自适应调优方法，在动态检测应用程序内存使用情况的基础上进行内存预取和线程缓存限值的动态调整；实验证明NIXMalloc可改善内存管理性能，提高吞吐量，同时降低内存使用量；在多线程应用系统中能获得较好的时空效率。相似文献

10.

片上多处理器中基于步长和指针的预取 总被引：1，自引：1，他引：0

下载免费PDF全文

肖俊华冯子军章隆兵《计算机工程》2009,35(4):58-60

在对大量程序访存行为进行分析的基础上,提出基于步长和指针的预取方法。能捕获规整的数据访问模式和指针访问模式。在L2cache和内存之间采用全局历史缓存实现该预取方法。全系统模拟结果表明,该预取方法对商业应用测试程序的性能平均提高14％,对科学计算测试程序的性能平均提高34．5％。相似文献

11.

Combining data prefetching with non-blocking loads to alleviate cache pollution effects

《Journal of Systems Architecture》1999,45(9):681-685

Data prefetching is a useful approach for reduction of memory access stalls in many scientific applications. However, it suffers from cache pollution severly in some applications. In this paper, we study the effectiveness of combining data prefetching with non-blocking loads on cache pollution and explain why it shows good result in our simulation. 相似文献

12.

基于时空局部性的层次化查询结果缓存机制

朱亚东郭嘉丰兰艳艳程学旗《中文信息学报》2016,30(1):63-71

查询结果缓存可以对查询结果的文档标识符集合或者实际的返回页面进行缓存,以提高用户查询的响应速度,相应的缓存形式可以分别称之为标识符缓存或页面缓存。对于固定大小的内存,标识符缓存可以获得更高的命中率,而页面缓存可以达到更高的响应速度。该文根据用户查询访问的时间局部性和空间局部性,提出了一种新颖的基于时空局部性的层次化结果缓存机制。首先,该机制将固定大小的结果缓存划分为两层:页面缓存和标识符缓存。对于用户提交的查询,该机制会首先使用第一层的页面缓存进行应答,如果未能命中,则继续尝试使用第二层的标识符缓存。实验显示这种层次化的缓存机制较传统的仅依赖于单一缓存形式的机制,在平均查询响应时间上,取得了可观的性能提升:例如,相对单纯的页面缓存,平均达到9%,最好情况下达到11%。其次,该机制在标识符缓存的基础上,设计了一种启发式的预取策略,对用户查询检索的空间局部性进行挖掘。实验显示,这种预取策略的融合,能进一步促进检索系统性能的有效提升,从而最终建立起一套时空完备的、有效的结果缓存机制。相似文献

13.

Path and cache conscious prefetching (PCCP)

Zhen He Alonso Marquez 《The VLDB Journal The International Journal on Very Large Data Bases》2007,16(2):235-249

Main memory cache performance continues to play an important role in determining the overall performance of object-oriented, object-relational and XML databases. An effective method of improving main memory cache performance is to prefetch or pre-load pages in advance to their usage, in anticipation of main memory cache misses. In this paper we describe a framework for creating prefetching algorithms with the novel features of path and cache consciousness. Path consciousness refers to the use of short sequences of object references at key points in the reference trace to identify paths of navigation. Cache consciousness refers to the use of historical page access knowledge to guess which pages are likely to be main memory cache resident most of the time and then assumes these pages do not exist in the context of prefetching. We have conducted a number of experiments comparing our approach against four highly competitive prefetching algorithms. The results shows our approach outperforms existing prefetching techniques in some situations while performing worse in others. We provide guidelines as to when our algorithm should be used and when others maybe more desirable. 相似文献

14.

一种结合页分配和组调度的内存功耗优化方法

贾刚勇万健李曦蒋从锋代栋《软件学报》2014,25(7):1403-1415

多核系统中,内存子系统消耗大量的能耗并且比例还会继续增大.因此,解决内存的功耗问题成为系统功耗优化的关键.根据线程的内存地址空间和负载均衡策略将系统中的线程划分成不同的线程组,根据线程所属的组,给同一组内的线程分配相同内存rank中的物理页,然后,根据划分的线程组以组为单位进行调度.提出了结合页分配和组调度的内存功耗优化方法（CAS）.CAS周期性地激活当前需要的内存rank,从而可以将暂时不使用的内存rank置为低功耗状态,同时延长低功耗内存rank的空闲时间.仿真实验结果显示：与其他同类方法相比,CAS方法能够平均降低10%的内存功耗,同时提高8%的性能. 相似文献

15.

Prefetching with Helper Threads for Loosely Coupled Multiprocessor Systems

Lee Jaejin Jung Changhee Lim Daeseob Solihin Yan 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(9):1309-1324

This paper presents a helper thread prefetching scheme that is designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have an advantage in that resources such as processor and L1 cache resources are not contended by the application and helper threads, hence preserving the speed of the application. However, interprocessor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important in ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely coupled system can be done effectively, we evaluate our prefetching by simulating a standard unmodified CMP system and an intelligent memory system where a simple processor in memory executes the helper thread. Evaluating our scheme with nine memory-intensive applications with the memory processor in DRAM achieves an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33. Using a real CMP system with a shared L2 cache between two cores, our helper thread prefetching plus hardware L2 prefetching achieves an average speedup of 1.15 over the hardware L2 prefetching for the subset of applications with high L2 cache misses per cycle. 相似文献

16.

Moving address translation closer to memory in distributed shared-memory multiprocessors

Qiu X. Dubois M. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(7):612-623

To support a global virtual memory space, an architecture must translate virtual addresses dynamically. In current processors, the translation is done in a TLB (translation lookaside buffer), before or in parallel with the first-level cache access. As processor technology improves at a rapid pace and the working sets of new applications grow insatiably, the latency and bandwidth demands on the TLB are difficult to meet, especially in multiprocessor systems, which run larger applications and are plagued by the TLB consistency problem. We describe and compare five options for virtual address translation in the context of distributed shared memory (DSM) multiprocessors, including CC-NUMAs (cache-coherent non-uniform memory access architectures) and COMAs (cache only memory access architectures). In CC-NUMAs, moving the TLB to shared memory is a bad idea because page placement, migration, and replication are all constrained by the virtual page address, which greatly affects processor node access locality. In the context of COMAs, the allocation of pages to processor nodes is not as critical because memory blocks can dynamically migrate and replicate freely among nodes. As the address translation is done deeper in the memory hierarchy, the frequency of translations drops because of the filtering effect. We also observe that the TLB is very effective when it is merged with the shared-memory, because of the sharing and prefetching effects and because there is no need to maintain TLB consistency. Even if the effectiveness of the TLB merged with the shared memory is very high, we also show that the TLB can be removed in a system with address translation done in memory because the frequency of translations is very low. 相似文献

17.

StreamTMC: Stream compilation for tiled multi-core architectures

Haitao Wei Mingkang Qin Weiwei Zhang Junqing Yu Dongrui Fan Guang R. Gao 《Journal of Parallel and Distributed Computing》2013

Tiled multi-core architectures have become an important kind of multi-core design for its good scalability and low power consumption. Stream programming has been productively applied to a number of important application domains. It provides an attractive way to exploit the parallelism. However, the architecture characteristics of large amounts of cores, memory hierarchy and exposed communication between tiles have presented a performance challenge for stream programs running on tiled multi-cores. In this paper, we present StreamTMC, an efficient stream compilation framework that optimizes the execution of stream applications for the tiled multi-core. This framework is composed of three optimization phases. First, a software pipelining schedule is constructed to exploit the parallelism. Second, an efficient hybrid of SPM and cache buffer allocation algorithm and data copy elimination mechanism is proposed to improve the efficiency of the data access. Last, a communication aware mapping is proposed to reduce the network communication and synchronization overhead. We implement the StreamTMC compiler on Godson-T, a 64-core tiled architecture and conduct an experimental study to verify the effectiveness. The experimental results indicate that StreamTMC can achieve an average of 58% improvement over the performance before optimization. 相似文献

18.

Web object-based storage management in proxy caches

《Future Generation Computer Systems》2006,22(1-2):16-31

Proxy caches are essential to improve the performance of the World Wide Web and to enhance user perceived latency. Appropriate cache management strategies are crucial to achieve these goals. In our previous work, we have introduced Web object-based caching policies. A Web object consists of the main HTML page and all of its constituent embedded files. Our studies have shown that these policies improve proxy cache performance substantially.In this paper, we propose a new Web object-based policy to manage the storage system of a proxy cache. We propose two techniques to improve the storage system performance. The first technique is concerned with prefetching the related files belonging to a Web object, from the disk to main memory. This prefetching improves performance as most of the files can be provided from the main memory rather than from the proxy disk. The second technique stores the Web object members in contiguous disk blocks in order to reduce the disk access time. We used trace-driven simulations to study the performance improvements one can obtain with these two techniques. Our results show that the first technique by itself provides up to 50% reduction in hit latency, which is the delay involved in providing a hit document by the proxy. An additional 5% improvement can be obtained by incorporating the second technique. 相似文献

19.

深度学习在多核缓存预取中的应用研究综述

张建勋《计算机应用研究》2024,41(2)

当前人工智能技术应用于系统结构领域的研究前景广阔,特别是将深度学习应用于多核架构的数据预取研究已经成为国内外的研究热点。针对基于深度学习的缓存预取任务进行了研究,形式化地定义了深度学习缓存预取模型。在介绍当前常见的多核缓存架构和预取技术的基础上,全面分析了现有基于深度学习的典型缓存预取器的设计思路。深度学习神经网络在多核缓存预取领域的应用主要采用了深度神经网络、循环神经网络、长短期记忆网络和注意力机制等机器学习方法,综合对比分析现有基于深度学习的数据预取神经网络模型后发现,基于深度学习的多核缓存预取技术在计算成本、模型优化和实用性等方面还存在着局限性,未来在自适应预取模型以及神经网络预取模型的实用性方面还有很大的研究探索空间和发展前景。相似文献

20.

基于存取模式的Cache预取自适应策略研究

周可张江陵冯丹万志坤《计算机工程与科学》2003,25(1):80-84

不同的Cache预取策略适用于不同的存取模式。本文介绍了存储系统Cache预取技术的研究现状，从分析存取模式出发，构造了存取模式三元组模型，并在磁盘阵列上测试了适用于复杂环境下的Cache预取自适应策略，结果证明，自适应策略能够在不同环境上获得磁盘阵列的最优性能。相似文献