Similar Documents
Found 20 similar documents (search time: 62 ms)
1.
Data prefetching mechanisms are widely used to hide memory latency in data-intensive applications. They mask the speed gap between CPUs and their memory systems by preloading data into the CPU caches, where access is at least an order of magnitude faster. Pre-execution is a combined prefetching method that executes a slice of the original code, preloading both the code and its data at the same time. Pre-execution is often mentioned in the literature but, to our knowledge, has not been formally defined. We fill this void by presenting a formal definition of speculative and non-speculative pre-execution, and derive a lightweight software-based strategy that accelerates the main working thread by introducing an adaptive, non-speculative pre-execution helper thread. This helper thread acts as a perfect predictor: it calculates memory addresses, prefetches the data, and takes cache misses early. The adaptive automatic control allows the helper thread to configure itself at run time for best performance. The method is directly applicable to any data-intensive application without requiring hardware modifications, and achieved average speedups of 10–30% in a real-life application.
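As a rough illustration of the idea (not the authors' implementation), the C sketch below runs a non-speculative helper thread that recomputes the main loop's address stream and prefetches a fixed distance ahead. The workload, the DIST constant, and the progress-publishing scheme are assumptions made for the example.

```c
/* Minimal sketch of a non-speculative pre-execution helper thread:
 * the helper replays the main loop's address computation and issues
 * prefetches DIST iterations ahead of the published progress. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define N    (1 << 22)
#define DIST 64               /* assumed lookahead distance */

static double data[N];
static int    idx[N];         /* irregular access pattern */
static atomic_long progress;  /* main thread's current iteration */
static atomic_int  done;

static void *helper(void *arg) {
    (void)arg;
    long i = 0;
    while (!atomic_load(&done)) {
        long p = atomic_load(&progress);
        if (i < p + DIST) {                       /* stay DIST ahead */
            __builtin_prefetch(&data[idx[i % N]], 0, 1);
            i++;
        }
    }
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) { idx[i] = rand() % N; data[i] = (double)i; }

    pthread_t h;
    pthread_create(&h, NULL, helper, NULL);

    double sum = 0.0;
    for (long i = 0; i < N; i++) {    /* main working thread */
        sum += data[idx[i]];
        atomic_store(&progress, i);   /* publish progress to the helper */
    }
    atomic_store(&done, 1);
    pthread_join(h, NULL);
    printf("sum = %f\n", sum);
    return 0;
}
```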

2.
Speculative pre-execution achieves efficient data prefetching by running additional prefetching threads on spare hardware contexts. Various implementations of speculative pre-execution have been proposed, including compiler-based static approaches and hardware-based dynamic approaches. A static approach defines the p-thread at compile time and executes it as a stand-alone thread; it therefore cannot efficiently take dynamic events into account and requires higher fetch bandwidth. Conversely, a hardware approach can, by its nature, make use of run-time information dynamically, but it requires more complex hardware and lacks global information on data and control flow. This paper proposes Speculative Pre-Execution Assisted by compileR (SPEAR), a pre-execution model that is a hybrid of the two approaches. It relies on a post-compiler to extract the p-thread code from program binaries and uses custom-designed hardware to execute the p-thread.
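A source-level sketch of what a p-thread amounts to may help: the slice keeps only the address-generating instructions of the main loop and replaces the work with prefetches. SPEAR itself extracts such slices from binaries and runs them on custom hardware, which this C fragment does not model; the linked-list workload is an illustrative assumption.

```c
/* Illustration of the relationship between a main loop and its p-thread:
 * the p-thread is the backward slice of the address computation only. */
typedef struct node { struct node *next; double payload[15]; } node_t;

/* main thread: full computation over the list */
double walk(node_t *head) {
    double sum = 0.0;
    for (node_t *n = head; n; n = n->next)
        for (int i = 0; i < 15; i++) sum += n->payload[i];
    return sum;
}

/* p-thread: keeps only the pointer chase, turning loads into prefetches */
void p_thread(node_t *head) {
    for (node_t *n = head; n; n = n->next)
        __builtin_prefetch(n->next, 0, 1);
}
```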

3.
IEEE Micro, 2004, 24(6): 74-82
Memory latency dominates the performance of many applications on modern processors, despite advances in caches and prefetching techniques. Numerous prefetching techniques, both in hardware and software, try to alleviate the memory bottleneck. One such technique, known as helper threading, improves single-thread performance on a simultaneous multithreading (SMT) architecture, which shares processor resources, including caches, among logical threads. It uses otherwise idle hardware thread contexts to execute speculative threads on behalf of the main thread. Helper threading accelerates a program by exploiting the processor's multithreading capability to run assist threads. Based on the helper threading usage model, virtual multithreading (VMT), a form of switch-on-event user-level multithreading, can improve performance for real-world workloads with wall-clock speedups of 5.0 to 38.5 percent.

4.
Helper threaded prefetching based on chip multiprocessors is a well-known approach to reducing memory latency and has been explored for linked data structure accesses. However, conventional helper threaded prefetching often suffers from useless prefetches and cache thrashing, which reduce its effectiveness. In this paper, we first analyze the shortcomings of conventional helper threaded prefetching for linked data structures. We then propose an improved scheme, Skip Helper Threaded Prefetching, for hotspots with two-level data traversals. Our solution profiles the application and balances delinquent loads between the main thread and the prefetching thread based on the characteristics of the operations in its hotspots. Evaluations show that the proposed solution improves average performance by 8.9% (-O2) and 8.5% (-O3) over conventional helper threaded prefetching that greedily prefetches all delinquent loads. We also compare our proposal with active threaded prefetching, which synchronizes with the main thread via semaphores, and find that ours provides better performance for the targeted applications.
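A minimal sketch of the balancing idea, under assumed data structures: the helper covers only every other first-level node of a two-level traversal, so the delinquent loads are split between helper and main thread instead of the helper greedily prefetching everything. The alternating split and the bucket layout are illustrative, not the paper's profile-driven policy.

```c
/* Skip-style division of prefetch work on a two-level structure:
 * the helper prefetches odd-numbered first-level nodes and their
 * second-level data, leaving the rest to the main thread's accesses. */
typedef struct bucket {
    struct bucket *next;
    double        *items;   /* second-level data */
    int            n;
} bucket_t;

void helper_skip_prefetch(bucket_t *head) {
    int k = 0;
    for (bucket_t *b = head; b; b = b->next, k++) {
        if (k & 1) {                        /* skip every other node */
            __builtin_prefetch(b->next, 0, 1);
            __builtin_prefetch(b->items, 0, 1);
        }
    }
}
```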

5.
Several I/O Optimization Methods for Out-of-Core Computation   Cited by: 1 (self-citations: 0, by others: 1)
Applications with very large data sets adopt the out-of-core computation model. Because disk access is relatively slow, I/O becomes a major limiting factor for out-of-core performance. This paper proposes an I/O optimization method based on a runtime library and presents three effective optimization strategies: regular region filtering, data prefetching, and edge reuse. Programmers can apply the corresponding optimization APIs to different application problems to shorten program execution time. Experimental results show that by reducing the number of I/O operations and the volume of data exchanged between memory and disk, and by hiding part of the I/O latency, the method effectively improves out-of-core performance.
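A sketch of the data-prefetching strategy in an out-of-core loop, assuming POSIX AIO rather than the paper's runtime-library API (which the abstract does not detail): block i+1 is read asynchronously while block i is processed, hiding part of the I/O latency.

```c
/* Double-buffered out-of-core loop: overlap compute on the current
 * block with an asynchronous read of the next block. */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK (1 << 20)

static double process(const char *buf, size_t n) {
    double s = 0; for (size_t i = 0; i < n; i++) s += buf[i]; return s;
}

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char *cur = malloc(BLOCK), *next = malloc(BLOCK);
    ssize_t n = read(fd, cur, BLOCK);     /* first block, synchronous */
    off_t off = n;
    double sum = 0;

    while (n > 0) {
        struct aiocb cb = {0};            /* prefetch block i+1 */
        cb.aio_fildes = fd; cb.aio_buf = next;
        cb.aio_nbytes = BLOCK; cb.aio_offset = off;
        aio_read(&cb);

        sum += process(cur, (size_t)n);   /* compute overlaps the I/O */

        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, NULL);       /* wait for the prefetch */
        n = aio_return(&cb);
        if (n > 0) off += n;
        char *t = cur; cur = next; next = t;  /* swap buffers */
    }
    printf("checksum %f\n", sum);
    free(cur); free(next); close(fd);
    return 0;
}
```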

6.
This paper presents a helper thread prefetching scheme designed to work on loosely coupled processors, such as a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have the advantage that resources such as the processor and L1 cache are not contended for by the application and helper threads, preserving the speed of the application. However, interprocessor communication is expensive in such a system, and we present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads that precisely controls how far ahead of the application thread the helper thread can run. We found that this is important for ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely coupled system can be done effectively, we evaluate our scheme by simulating a standard unmodified CMP system and an intelligent memory system in which a simple processor in memory executes the helper thread. Across nine memory-intensive applications, with the memory processor in DRAM we achieve an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31; in a standard CMP, it achieves an average speedup of 1.33. Using a real CMP system with an L2 cache shared between two cores, our helper thread prefetching plus hardware L2 prefetching achieves an average speedup of 1.15 over hardware L2 prefetching alone for the subset of applications with high L2 cache misses per cycle.
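The core mechanism, a counter that bounds the helper's lead, can be sketched as follows. MAX_AHEAD, the yield-based throttle, and the example workload are illustrative assumptions rather than the paper's exact design.

```c
/* Distance-bounded helper thread: prefetch iteration i's data, but
 * never run more than MAX_AHEAD iterations ahead of the application,
 * which keeps prefetches timely and avoids cache pollution. */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define MAX_AHEAD 128            /* assumed upper bound on the lead */
#define N (1 << 20)

static double data[N];
static long   idx[N];            /* irregular index stream */
static atomic_long app_iter;     /* progress published by the application */

static void *bounded_helper(void *arg) {
    (void)arg;
    for (long i = 0; i < N; i++) {
        while (i - atomic_load(&app_iter) > MAX_AHEAD)
            sched_yield();                     /* throttle: too far ahead */
        __builtin_prefetch(&data[idx[i]], 0, 1);
    }
    return NULL;
}
```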

7.
Helper threaded prefetching based on chip multiprocessors has been shown to reduce memory latency and improve overall system performance, and has been explored for linked data structure accesses. In our earlier work, we proposed an effective threaded prefetching technique that balances delinquent loads between the main thread and the helper thread to improve prefetching effectiveness. In this paper, we analyze the memory access characteristics of specific applications to estimate the effective prefetch distance range for our proposed technique; the effect of hardware prefetchers on the estimation is also taken into account. We discuss the key design issues of our method and present preliminary experimental results. Our evaluations indicate that the bounded range of effective prefetch distances can be determined using our method, and that optimal prefetch distances can then be chosen from the estimated range with a few trial runs.
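A back-of-the-envelope version of such a range estimate, using a common rule of thumb rather than the paper's model: the lower bound must hide the miss latency, and the upper bound must not overflow the cache. All four input values are assumptions for the example.

```c
/* Effective prefetch distance range: d_min hides memory latency,
 * d_max avoids evicting prefetched lines before they are used. */
#include <stdio.h>

int main(void) {
    double mem_latency_cycles = 300.0;     /* assumed miss latency      */
    double cycles_per_iter    = 40.0;      /* assumed loop body cost    */
    long   cache_bytes        = 32 * 1024; /* assumed usable cache      */
    long   bytes_per_iter     = 64;        /* one line per iteration    */

    long d_min = (long)(mem_latency_cycles / cycles_per_iter + 0.999);
    long d_max = cache_bytes / bytes_per_iter;   /* capacity limit */

    printf("effective prefetch distance range: [%ld, %ld]\n", d_min, d_max);
    return 0;
}
```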

8.
Data-parallel memory systems must maintain a large number of outstanding memory references to fully use increasing DRAM bandwidth in the presence of increasing latency. At the same time, the throughput of modern DRAMs is very sensitive to access patterns, due to the time required to precharge and activate banks and to switch between read and write access. To achieve memory reference parallelism, a system may simultaneously issue references from multiple reference threads; alternatively, multiple references from a single thread can be issued in parallel. In this paper, we examine this tradeoff and show that allowing only a single thread to access DRAM at any given time significantly improves performance by increasing the locality of the reference stream, and hence reducing precharge/activate operations and read/write turnaround. Simulations of scientific and multimedia applications show that generating multiple references from a single thread gives, on average, 17% better performance than generating references from two parallel threads.

9.
A Moderately Greedy Integrated Cache-Prefetching Algorithm for Parallel File Systems   Cited by: 3 (self-citations: 0, by others: 3)
卢凯, 金士尧, 卢锡城. 《计算机学报》, 1999, 22(11): 1172-1177
Caching and prefetching in traditional file systems are two effective techniques for reducing access latency. Under the I/O access patterns of parallel scientific computing applications, however, simple caching and prefetching can no longer deliver high cache hit ratios. Based on an analysis of these I/O patterns, this paper proposes a moderately greedy integrated cache-prefetching algorithm (PGI). The algorithm fully exploits the characteristics of the parallel file system environment and adopts a moderately greedy dynamic sliding-window technique, which effectively eliminates prefetch thrashing and reduces system processing overhead, while integrating caching and prefetching into a single mechanism.

10.
Much research has focused on reducing and/or tolerating remote memory access latencies on distributed-memory parallel computers. Caching remote data is intended to reduce average access latency by handling as many remote memory accesses as possible using local copies of the data in the cache. Data-flow and multithreaded approaches help programs tolerate the latency of remote memory accesses by allowing processors to do other work while remote operations take place. The thread migration technique described here is a multithreaded architecture in which threads migrate to remote processors that contain the data they need. By exploiting access locality, a thread often uses several data items from that processor before migrating to another for more data. Because the threads migrate in search of data, the approach is called Nomadic Threads. A prototype runtime system has been implemented on the CM-5 and is portable to other distributed-memory parallel computers.

11.
We present the design and implementation of Arachne, a threads system that can be interfaced with a communications library for multithreaded distributed computations. In particular, Arachne supports thread migration between heterogeneous platforms, dynamic stack size management, and recursive thread functions. Arachne is efficient, flexible, and portable: it is based entirely on C and C++. To facilitate heterogeneous thread operations, we have added three keywords to the C++ language. The Arachne preprocessor takes as input code written in that language and outputs C++ code suitable for compilation with a conventional C++ compiler. The Arachne runtime system manages all threads during program execution. We present performance measurements of the costs of basic thread operations and thread migration in Arachne and compare them to the costs in other threads systems.

12.
Binary translation and dynamic optimization are widely used to provide compatibility between legacy and promising upcoming architectures at the level of executable binary code. Dynamic optimization is one of the key contributors to the performance of a dynamic binary translation system. At the same time, it can be a major source of overhead, both in CPU cycles and in whole-system latency, as long as optimization time is included in the execution time of the application under translation. One solution that eliminates dynamic optimization overhead is to perform optimization simultaneously with execution, in a separate thread. In this paper we present an implementation of this technique in a full-system dynamic binary translator. For this purpose, an infrastructure for multithreaded execution was added to the binary translation system, allowing dynamic optimization to run in a separate thread independently of, and concurrently with, the main thread that executes the binary code under translation. Depending on the computational resources available, this is achieved either by interleaving the two threads on a single processor core or by moving the optimization thread to an underutilized core. In the first case, the latency that a computationally intensive dynamic optimization introduces to the system is reduced; in the second, overlapping the execution and optimization threads also eliminates optimization time from the total execution time of the original binary code.
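A C sketch of the decoupling, with an assumed work queue and region type (the translator's real data structures are not described in the abstract): the execution thread enqueues hot regions, and a separate optimizer thread consumes them concurrently.

```c
/* Execution thread pushes hot translated regions; optimizer thread
 * recompiles them concurrently, off the critical path. */
#include <pthread.h>
#include <stdio.h>

#define QLEN 64
typedef struct { int region_id; } hot_region_t;

static hot_region_t queue[QLEN];
static unsigned head, tail;
static int stop;                       /* set by shutdown code (not shown) */
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

void enqueue_hot_region(int id) {      /* called from the execution thread */
    pthread_mutex_lock(&mu);
    queue[tail++ % QLEN].region_id = id;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mu);
}

void *optimizer_thread(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mu);
        while (head == tail && !stop) pthread_cond_wait(&cv, &mu);
        if (head == tail && stop) { pthread_mutex_unlock(&mu); break; }
        hot_region_t r = queue[head++ % QLEN];
        pthread_mutex_unlock(&mu);
        printf("optimizing region %d concurrently\n", r.region_id);
    }
    return NULL;
}
```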

13.
The Scalable I/O (SIO) Initiative's Low-Level Application Programming Interface (SIO LLAPI) provides file system implementers with a simple low-level interface for supporting high-level parallel I/O interfaces efficiently and effectively. This paper describes a reference implementation and evaluation of the SIO LLAPI on the Intel Paragon multicomputer. The implementation provides a file system structure and striping algorithm compatible with the Parallel File System (PFS) of the Intel Paragon, and runs either inside the kernel or as a user-level library. Scatter-gather addressing for read/write, asynchronous I/O, client caching and prefetching, a file access hint mechanism, collective I/O, and highly efficient file copy have been implemented. Preliminary experience shows that the SIO LLAPI offers opportunities for significant performance improvement and is easy to implement. Several high-level file system interfaces and applications, such as PFS, ADIO, and a Hartree-Fock application, have also been implemented on top of SIO. The performance of PFS is at least that of Intel's native PFS, and in many cases, such as small sequential file accesses, huge I/O requests, and collective I/O, it is stable and much better. The SIO features help support high-level interfaces easily, quickly, and more efficiently, and the caching, prefetching, and hint mechanisms are useful for achieving better performance under different access models. The scalability and performance of SIO are limited by network latency, scalable network bandwidth, memory copy bandwidth, memory size, and the pattern of I/O requests. The tradeoff between generality and efficiency should be considered in implementations.
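For one of the listed features, scatter-gather addressing, a small POSIX readv() analogue shows the flavor of the interface; the SIO LLAPI's own calls are not reproduced here.

```c
/* Scatter-gather read: one request fills several discontiguous buffers. */
#include <stdio.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void) {
    char header[16], body[64];
    struct iovec iov[2] = {
        { .iov_base = header, .iov_len = sizeof header },
        { .iov_base = body,   .iov_len = sizeof body   },
    };
    int fd = open("/etc/hostname", O_RDONLY);   /* any readable file */
    if (fd < 0) { perror("open"); return 1; }
    ssize_t n = readv(fd, iov, 2);  /* one call scatters into two buffers */
    printf("read %zd bytes into 2 buffers\n", n);
    close(fd);
    return 0;
}
```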

14.
This paper proposes using a user-level memory thread (ULMT) for correlation prefetching. In this approach, a user thread runs on a general-purpose processor in main memory, either in the memory controller chip or in a DRAM chip. The thread performs correlation prefetching in software, sending the prefetched data into the L2 cache of the main processor. This approach requires minimal hardware beyond the memory processor: The correlation table is a software data structure that resides in main memory, while the main processor only needs a few modifications to its L2 cache so that it can accept incoming prefetches. In addition, the approach has wide applicability, as it can effectively prefetch even for irregular applications. Finally, it is very flexible, as the prefetching algorithm can be customized by the user on an application basis. Our simulation results show that, through a new design of the correlation table and prefetching algorithm, our scheme delivers good results. Specifically, nine mostly-irregular applications show an average speedup of 1.32. Furthermore, our scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases to 1.46. Finally, by exploiting the customization of the prefetching algorithm, we increase the average speedup to 1.53.
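A compact sketch of a software correlation table, assuming a single-successor variant (the paper's design keeps multiple successors per entry and runs the thread on the memory processor): on each miss, record the successor that followed the previous miss, and prefetch the recorded successor of the current one.

```c
/* Software correlation table: miss address -> predicted next miss. */
#include <stdint.h>

#define TBL 4096
static uintptr_t last_miss;
static uintptr_t succ[TBL];              /* learned successor per entry */

static unsigned hash(uintptr_t a) { return (unsigned)((a >> 6) & (TBL - 1)); }

void on_cache_miss(uintptr_t addr) {
    succ[hash(last_miss)] = addr;        /* learn: last miss -> current */
    uintptr_t predicted = succ[hash(addr)];
    if (predicted)                       /* prefetch predicted successor */
        __builtin_prefetch((void *)predicted, 0, 1);
    last_miss = addr;
}
```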

15.
To address the tedious and time-consuming process of setting helper thread control parameter values by traditional static enumeration, this paper proposes a real-time online method for evaluating helper thread prefetching quality. First, it defines the quality-of-service (QoS) goals of helper thread prefetching. Second, it analyzes dynamic metrics for evaluating helper thread prefetching performance and models helper thread prefetching QoS. Finally, it proposes a dynamic adaptive tuning algorithm for helper thread prefetching: based on information such as changes in the program's phase behavior and in the dynamic prefetching benefit, the algorithm judges how suitable the current parameter values are and whether feedback optimization is needed, thereby adaptively adjusting prefetch control. Experimental results show that with the adaptive prefetch evaluation algorithm, the Mst hotspot module achieves a speedup of 1.496, and that the proposed method can adaptively control and adjust helper thread control parameter values according to the program's dynamic phase behavior.
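A sketch of such a feedback loop, with an assumed benefit metric and a hill-climbing rule standing in for the paper's QoS model: at each phase boundary, the measured prefetch benefit decides whether to keep or reverse the direction in which a control parameter (here, prefetch distance) is nudged.

```c
/* Per-phase adaptive tuning of a helper thread control parameter. */
typedef struct {
    long useful_prefetches;   /* prefetched lines actually referenced */
    long issued_prefetches;
} phase_stats_t;

static int    distance     = 32;   /* helper thread control parameter */
static double prev_benefit = 0.0;
static int    prev_step    = +8;

void end_of_phase(phase_stats_t *s) {
    double benefit = s->issued_prefetches
                   ? (double)s->useful_prefetches / (double)s->issued_prefetches
                   : 0.0;
    /* hill climbing: keep direction if benefit improved, else reverse */
    int step = (benefit >= prev_benefit) ? prev_step : -prev_step;
    distance += step;
    if (distance < 4)   distance = 4;     /* clamp to a sane range */
    if (distance > 256) distance = 256;
    prev_benefit = benefit;
    prev_step = step;
}
```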

16.
Branch misprediction is a considerable drag on processor performance. Multithreading technology, with its capacity to hide latency among threads, brings new opportunities to diminish the effect of misprediction, but previous multithreading approaches either introduce additional pipeline loss or lack the flexibility to be implemented easily. In this paper, an approach is proposed that exploits instruction extension and an instruction prefetch mechanism together with a parallel decoder. Not only is the impact of misprediction delay diminished, but the thread switch mechanism also introduces no additional waste. Experiments show that IPC is improved by 10.2%-25% compared with the Gshare branch prediction method for most of the benchmarks.

17.
Memory affinity has become a key element in achieving scalable performance on multi-core platforms. Mechanisms such as thread scheduling, page allocation, and cache prefetching are commonly employed to enhance memory affinity, which keeps data close to the cores that access it. Software transactional memory (STM) applications in particular exhibit irregular memory access behavior that makes it harder to determine which data will be needed by each core, and when. Additionally, existing STM runtime systems are decoupled from issues such as thread and memory management. In this paper, we therefore propose a skeleton-driven mechanism to improve memory affinity for STM applications that fit the worklist pattern, employing a two-level approach. First, it addresses memory affinity at the DRAM level by automatically selecting page allocation policies; then it employs data prefetching helper threads to improve affinity at the cache level. It relies on a skeleton framework that exploits the application pattern to provide automatic memory page allocation and cache prefetching. Our experimental results on the STAMP benchmark suite show that the proposed mechanism achieves performance improvements of up to 46%, with an average of 11%, over a baseline version on two NUMA multi-core machines.

18.
Lee Minsuk, Min Sang Lyul, Shin Heonshik, Kim Chong Sang, Park Chang Yun. Real-Time Systems, 1997, 13(1): 47-65
Cache memories have been used extensively to bridge the speed gap between high-speed processors and relatively slow main memory. However, they are not widely used in real-time systems because of their unpredictable performance. This paper proposes an instruction prefetching scheme called threaded prefetching as an alternative to instruction caching in real-time systems. In the proposed scheme, an instruction block pointer called a thread is assigned to each instruction memory block and made to point to the next block on the worst-case execution path, as determined by compile-time analysis; the thread is never updated during program execution, which guarantees predictability. The paper also compares the worst-case performance of various previous instruction prefetching schemes with that of the proposed threaded prefetching. By analyzing several benchmark programs, we show that the worst-case performance of the proposed scheme is significantly better than that of previous instruction prefetching schemes. The results also show that when the block size is large enough, the worst-case performance of threaded prefetching is almost as good as that of an instruction cache with a 100% hit ratio.
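The scheme's key data structure can be sketched as a per-block pointer table fixed at compile time; the block count and the worst-case path below are illustrative values, not from the paper.

```c
/* Threaded prefetching: each instruction block carries one
 * compile-time-fixed "thread" pointer to its successor on the
 * worst-case execution path; it is never updated at run time,
 * which is what makes the behavior predictable. */
#define NBLOCKS 8
typedef struct { const void *code; int thread; } iblock_t;

/* filled by compile-time WCET-path analysis (values illustrative) */
static iblock_t blocks[NBLOCKS] = {
    {0, 1}, {0, 2}, {0, 3}, {0, 7}, {0, 5}, {0, 6}, {0, 7}, {0, -1},
};

void prefetch_next(int current_block) {
    int next = blocks[current_block].thread;
    if (next >= 0)
        __builtin_prefetch(blocks[next].code, 0, 0);  /* next code block */
}
```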

19.
To effectively reduce the latency caused by VCR operations in P2P VoD systems, this paper proposes a two-layer relation prefetching strategy for P2P VoD. It discovers potential relationships between video segments from nodes' playback records, prefetches on that basis, and optimizes each node's neighbor list so that resources can be located and prefetched faster. Simulation results show that the two-layer relation strategy prefetches video content accurately, reduces the latency incurred by VCR operations, and improves the users' viewing experience.

20.
Data today grows exponentially, and more organizations store data across multiple data centers and distributed systems. Alluxio, a memory-centric virtual distributed storage system, integrates the underlying big-data ecosystem. In remote scenarios where Alluxio is combined with underlying storage, network latency makes I/O speed one of the key factors affecting the service provided. This paper proposes CPR, a caching strategy for Alluxio in remote scenarios, which uses associations between data blocks in the storage system to guide data prefetching and replacement, adopts a grouping scheme to improve the utilization of association rules, and runs a background thread to update the rule set in real time. The strategy's effectiveness is verified through simulation: I/O performance under CPR outperforms Alluxio's existing caching strategies as well as several caching strategies based on inter-block association rules.
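A sketch of rule-guided prefetching in the spirit of CPR, with a hard-coded rule table standing in for the mined, background-updated rule set: a hit on a rule's antecedent block group triggers prefetches of its consequents.

```c
/* Association-rule-guided block prefetching (illustrative rule table). */
#include <stdio.h>

#define MAX_CONSEQ 3
typedef struct {
    int antecedent_group;            /* block group just accessed     */
    int consequents[MAX_CONSEQ];     /* groups to prefetch (-1 ends)  */
} rule_t;

static rule_t rules[] = {            /* would be mined and refreshed  */
    { 10, { 11, 12, -1 } },          /* by a background thread        */
    { 20, { 23, -1, -1 } },
};

static void prefetch_group(int g) { printf("prefetch block group %d\n", g); }

void on_block_access(int group) {
    for (size_t r = 0; r < sizeof rules / sizeof rules[0]; r++) {
        if (rules[r].antecedent_group != group) continue;
        for (int i = 0; i < MAX_CONSEQ && rules[r].consequents[i] >= 0; i++)
            prefetch_group(rules[r].consequents[i]);
    }
}
```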
