1.

Pre-execution data prefetching with I/O scheduling

Yue Zhao Kenji Yoshigoe Mengjun Xie 《The Journal of supercomputing》2014,68(2):733-752

Parallel applications suffer from I/O latency. Pre-execution I/O prefetching hides this latency effectively: a dedicated pre-execution prefetching thread is created to fetch data for the main thread in advance. However, existing pre-execution prefetching approaches pay no attention to the relationship between the main thread and the pre-execution prefetching thread. They simply pre-execute the I/O accesses in the prefetching thread as soon as possible, without carefully coordinating them with the operations of the main thread. This drawback has a series of adverse effects on pre-execution prefetching, such as diminishing the degree of parallelism between computation and I/O, delaying the I/O accesses of the main threads, and aggravating I/O resource competition across the whole system. In this paper, we propose a new method that overcomes this drawback by scheduling the I/O operations among the main threads and the pre-execution prefetching threads. The results of extensive experiments on four popular parallel I/O benchmarks demonstrate the benefits of the proposed approach.
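
The coordination problem described in this abstract can be illustrated with a minimal sketch. It is not the paper's scheduling method: it only shows one simple way to keep a prefetching thread from running arbitrarily far ahead of the main thread, using a bounded buffer. The names `read_block`, `prefetcher`, `run`, and the `lookahead` parameter are hypothetical.

```python
import queue
import threading


def read_block(index):
    """Stand-in for a slow I/O read (hypothetical); returns the block contents."""
    return f"block-{index}"


def prefetcher(num_blocks, buffer):
    """Prefetching thread: read blocks ahead of the main thread.

    buffer.put() blocks once `lookahead` blocks are waiting, so the
    prefetcher is throttled by the main thread's progress.
    """
    for i in range(num_blocks):
        buffer.put((i, read_block(i)))
    buffer.put(None)  # sentinel: no more blocks


def run(num_blocks, lookahead=4):
    """Main thread: consume prefetched blocks and 'compute' on them."""
    buffer = queue.Queue(maxsize=lookahead)
    worker = threading.Thread(target=prefetcher, args=(num_blocks, buffer))
    worker.start()
    results = []
    while True:
        item = buffer.get()
        if item is None:
            break
        _index, data = item
        results.append(data.upper())  # placeholder computation
    worker.join()
    return results
```

A small `lookahead` limits how much I/O bandwidth and buffer space the prefetcher can claim ahead of need; a large one approaches the "as soon as possible" behavior the abstract criticizes.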

2.

Software Controlled Adaptive Pre-Execution for Data Prefetching

Ákos Dudás Sándor Juhász Tamás Schrádi 《International journal of parallel programming》2012,40(4):381-396

Data prefetching mechanisms are widely used for hiding memory latency in data-intensive applications. They mask the speed gap between CPUs and their memory systems by preloading data into the CPU caches, where access is at least an order of magnitude faster. Pre-execution is a combined prefetching method that executes a slice of the original code, preloading the code and its data at the same time. Pre-execution is often mentioned in the literature but, to our knowledge, has not yet been formally defined. We fill this void by presenting a formal definition of speculative and non-speculative pre-execution, and we derive a lightweight software-based strategy that accelerates the main working thread by introducing an adaptive, non-speculative pre-execution helper thread. This helper thread acts as a perfect predictor: it calculates memory addresses, prefetches the data, and absorbs cache misses early. The adaptive automatic control allows the helper thread to configure itself at run time for best performance. The method is directly applicable to any data-intensive application without requiring hardware modifications. Our method achieved an average speedup of 10–30% in a real-life application.

3.

Correlation prefetching with a user-level memory thread

Solihin Y. Lee J. Torrellas J. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(6):563-580

This paper proposes using a user-level memory thread (ULMT) for correlation prefetching. In this approach, a user thread runs on a general-purpose processor in main memory, either in the memory controller chip or in a DRAM chip. The thread performs correlation prefetching in software, sending the prefetched data into the L2 cache of the main processor. This approach requires minimal hardware beyond the memory processor: The correlation table is a software data structure that resides in main memory, while the main processor only needs a few modifications to its L2 cache so that it can accept incoming prefetches. In addition, the approach has wide applicability, as it can effectively prefetch even for irregular applications. Finally, it is very flexible, as the prefetching algorithm can be customized by the user on an application basis. Our simulation results show that, through a new design of the correlation table and prefetching algorithm, our scheme delivers good results. Specifically, nine mostly-irregular applications show an average speedup of 1.32. Furthermore, our scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases to 1.46. Finally, by exploiting the customization of the prefetching algorithm, we increase the average speedup to 1.53.
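
To make the idea of a software correlation table concrete, here is a minimal sketch of a basic pair-based table. It is an illustration only and not the new table design the paper proposes: each miss address remembers the addresses that most recently followed it, and those are returned as prefetch candidates the next time the address misses. The class name, method name, and `num_successors` parameter are hypothetical.

```python
from collections import OrderedDict, defaultdict


class CorrelationPrefetcher:
    """Pair-based correlation table kept as an ordinary software structure."""

    def __init__(self, num_successors=2):
        self.num_successors = num_successors
        # miss address -> successors seen right after it (most recent last)
        self.table = defaultdict(OrderedDict)
        self.last_miss = None

    def on_miss(self, address):
        """Record the miss and return prefetch candidates, most recent first."""
        # Learning step: `address` followed `last_miss`.
        if self.last_miss is not None:
            successors = self.table[self.last_miss]
            successors.pop(address, None)   # move to most-recent position
            successors[address] = True
            while len(successors) > self.num_successors:
                successors.popitem(last=False)  # evict the oldest successor
        self.last_miss = address
        # Prediction step: what followed this address before?
        if address in self.table:
            return list(reversed(self.table[address]))
        return []
```

Because the table is plain data, the replacement policy or the number of successors can be changed per application, which is the kind of customization the abstract refers to.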

4.

SUPPLE: An efficient run-time support for non-uniform parallel loops

Salvatore Orlando Raffaele Perego 《Journal of Systems Architecture》1999,45(15):1323-1343

This paper presents SUPPLE (SUPport for Parallel Loop Execution), an innovative run-time support for the execution of parallel loops with regular stencil data references and non-uniform iteration costs. SUPPLE relies upon a static block data distribution to exploit locality, and combines static and dynamic policies for scheduling non-uniform iterations. It adopts, as far as possible, a static scheduling policy derived from the owner computes rule, and moves data and iterations among processors only if a load imbalance actually occurs. SUPPLE always tries to overlap communications with useful computations by reordering loop iterations and prefetching remote ones in the case of workload imbalance. The SUPPLE approach has been validated by many experimental results obtained by running a multi-dimensional flame simulation kernel on a 64-node Cray T3D. We have fed the benchmark code with several synthetic input data sets built on the basis of a load imbalance model. We have compared our results with those obtained with a CRAFT Fortran implementation of the benchmark.

5.

Improving Data Prefetching Efficacy in Multimedia Applications

Cucchiara Rita Prati Andrea Piccardi Massimo 《Multimedia Tools and Applications》2003,20(2):159-178

The workload of multimedia applications has a strong impact on cache memory performance, since the locality of memory references embedded in multimedia programs differs from that of traditional programs. In many cases, standard cache memory organization achieves poorer performance when used for multimedia. A widely-explored approach to improve cache performance is hardware prefetching, which allows the pre-loading of data in the cache before they are referenced. However, existing hardware prefetching approaches are unable to exploit the potential improvement in performance, since they are not tailored to multimedia locality. In this paper we propose novel effective approaches to hardware prefetching to be used in image processing programs for multimedia. Experimental results are reported for a suite of multimedia image processing programs including MPEG-2 decoding and encoding, convolution, thresholding, and edge chain coding.

6.

An Integrated Hardware/Software Data Prefetching Scheme for Shared-Memory Multiprocessors

Edward H. Gornish Alexander Veidenbaum 《International journal of parallel programming》1999,27(1):35-70

Both hardware and software prefetching have been shown to be effective in tolerating the large memory latencies inherent in shared-memory multiprocessors; however, both types of prefetching have their shortcomings. While software schemes require less hardware support than hardware schemes, they must generate address calculation instructions and a prefetch instruction for each datum that needs to be prefetched. Hardware schemes, however, must become progressively more complex to be able to compute data access strides and to increase the prefetching lookahead. In this paper, we propose an integrated hardware/software prefetching method that uses simple hardware that can handle most data accesses and software prefetching for the few remaining accesses. A compile time algorithm analyzes the access streams formed by array references and determines sequences of consecutive memory accesses to an access stream that can be prefetched by the hardware mechanism. This analysis is based on the relative memory locations of consecutive accesses to an access stream and the number of intervening data references between consecutive accesses to an access stream. In addition, the prefetching lookahead can be set separately for each access stream. Our approach yields an effective scheme that minimizes both CPU overhead and hardware costs. Execution-driven simulations show our method to be very effective.

7.

Research on Data Prefetching Optimization in Dynamic Binary Translation

罗琼程吴强《计算机应用研究》2009,26(12):4572-4576

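Dynamic optimization is an important topic in dynamic binary translation research, and data prefetching optimization can improve application performance on modern processor architectures. Superblock-based dynamic data prefetching optimization uses software instrumentation to collect the memory access latency of an application's load instructions and to construct superblocks. It then applies prefetching optimization to the high-latency load instructions, based on the latency information and on the register def-use relations obtained from data-flow analysis of the superblocks. Experiments on the Loongson DigitalBridge dynamic binary translation system show that data prefetching optimization improves the average performance of the translated SPEC2000 floating-point benchmark code by 3.3%, with an overhead far below 0.5%.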

8.

Dynamic Tracing Analysis of Memory Leaks

吴民涂奉生《计算机工程与应用》2005,41(14):18-20

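Memory leaks are serious errors in software development that are hard to locate and fix. In most cases, the valid scope of dynamically allocated memory, although not stated explicitly, is still local to a part of the program; moreover, the trace of a program's dynamic execution reflects, to some extent, the program's static properties. Based on these observations, we developed a new memory leak detection method that embeds dynamic analysis in a function-oriented localization framework. The method first builds the program's dynamic function call tree, which records the program's memory allocations and deallocations, and then summarizes the program's static properties on the call tree to provide valuable information for locating leaks. Two examples demonstrate the effectiveness of the method.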
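
A dynamic call tree annotated with allocation and release information can be sketched as follows. This is only an illustration of the general idea under assumed inputs, not the authors' tool: the event format (`call`, `return`, `alloc`, `free` tuples) and all names are hypothetical, and the per-function summary is a much simpler stand-in for the static properties the abstract mentions.

```python
class CallNode:
    """One dynamic function invocation in the call tree."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.allocated = {}  # block id -> size, allocated here and still live


def build_call_tree(events):
    """Build the call tree from a trace.

    events: ('call', fn), ('return',), ('alloc', block_id, size),
            ('free', block_id)   (hypothetical trace format)
    """
    root = CallNode("<root>")
    current = root
    owner = {}  # block id -> node whose invocation allocated it
    for event in events:
        kind = event[0]
        if kind == "call":
            node = CallNode(event[1], current)
            current.children.append(node)
            current = node
        elif kind == "return":
            if current.parent is not None:
                current = current.parent
        elif kind == "alloc":
            current.allocated[event[1]] = event[2]
            owner[event[1]] = current
        elif kind == "free":
            node = owner.pop(event[1], None)
            if node is not None:
                del node.allocated[event[1]]
    return root


def leaks_by_function(root):
    """Sum the bytes never freed, grouped by the allocating function."""
    totals = {}
    stack = [root]
    while stack:
        node = stack.pop()
        leaked = sum(node.allocated.values())
        if leaked:
            totals[node.name] = totals.get(node.name, 0) + leaked
        stack.extend(node.children)
    return totals
```

Grouping leaked bytes by function points the developer at the call sites worth inspecting first, which is the kind of localization hint the abstract describes.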

9.

A hybrid intelligent system to improve predictive accuracy for cache prefetching

Sohail Sarwar Zia Ul-Qayyum Owais Ahmed Malik 《Expert systems with applications》2012,39(2):1626-1636

Cache, being the fastest medium in the memory hierarchy, has a vital role to play in fully exploiting available resources, concealing latencies in I/O operations, lessening the impact of these latencies, and hence improving system response time. Despite considerable effort, caches alone cannot accommodate larger storage requirements without prefetching. Cache prefetching speculatively fetches data in order to limit these delays. However, effective prefetching requires a strong prediction mechanism to load relevant data with a high degree of accuracy. To improve the predictive performance of cache prefetching, we applied a hybrid of two AI approaches, case-based reasoning (CBR) and artificial neural networks (ANN). CBR maintains past experience, and ANNs are used in the adaptation phase of CBR instead of a static rule base. The novelty of the technique in this domain lies in the hybrid of the two approaches as well as in the use of a suffix tree to populate the CBR case base. Suffix trees provide rich data patterns for populating the case base and greatly enhanced the overall performance. A number of evaluations from different aspects with varying parameters are presented (along with some findings), in which the efficacy of our technique is affirmed by improved predictive accuracy and reduced associated costs.

10.

Implementation of Hardware Data Prefetching for the Sunway (申威) Processor

贾迅胡向东尹飞《计算机工程与科学》2015,37(11):2013-2017

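Hardware data prefetching can effectively improve a processor's memory access performance, and it is a technique on which the Sunway (申威) processor urgently needs a breakthrough in its performance optimization. Hardware overhead and the constraints of the processor architecture are the main difficulties in implementing hardware prefetching. Drawing on academic research on hardware prefetching and on its current use in industry, and taking the structural characteristics of the Sunway processor closely into account, this work studies how to implement hardware prefetching for the Sunway processor. Taking stream prefetching as an example, with a 0.97% increase in processor core area, hardware prefetching improves the integer performance of the current Sunway processor by 5.17% on average and by up to 28.88%, and its floating-point performance by 6.39% on average and by up to 30.11%.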

11.

Static scheduling techniques for dependent tasks on dynamically reconfigurable devices

《Journal of Systems Architecture》2007,53(11):861-876

Dynamically reconfigurable hardware not only has high silicon reusability, but it can also deliver high performance for computation-intensive tasks. Advanced features such as run-time reconfiguration allow multiple tasks to be mapped onto the same device either simultaneously or multiplexed in the time domain. These tasks need to be scheduled optimally or near-optimally in order to utilize the device efficiently. This is an NP-hard problem, because task scheduling, allocation, and configuration prefetching all need to be considered. In this paper, we target dependent task models and propose three static schedulers that use different problem-solving strategies. The first is a heuristic approach developed from traditional list-based schedulers. It offers high efficiency but the least accuracy. The second is based on a full-domain search using constraint programming. It is guaranteed to produce optimal solutions but requires significant search effort. The last is a guided random search technique based on a genetic algorithm, which shows reasonable efficiency and much better accuracy than the heuristic approach.
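
For readers unfamiliar with the traditional list-based schedulers that the first approach builds on, here is a minimal sketch of classic list scheduling for dependent tasks. It is a textbook-style illustration under simplifying assumptions, not the paper's heuristic: reconfiguration overhead and configuration prefetching are ignored, all execution units are identical, and the function and parameter names are hypothetical.

```python
from functools import lru_cache


def list_schedule(durations, deps, num_units):
    """Classic list scheduling of a task graph on identical units.

    durations: {task: positive execution time}
    deps:      {task: set of predecessor tasks}
    Returns ({task: (unit, start_time)}, makespan).
    """
    successors = {task: [] for task in durations}
    for task, preds in deps.items():
        for pred in preds:
            successors[pred].append(task)

    @lru_cache(maxsize=None)
    def bottom_level(task):
        # Length of the longest path from this task to the end of the graph.
        return durations[task] + max(
            (bottom_level(s) for s in successors[task]), default=0
        )

    # Priority list: longest remaining path first. With positive durations
    # a predecessor always ranks ahead of its successors.
    order = sorted(durations, key=lambda t: (-bottom_level(t), t))

    unit_free = [0] * num_units  # time at which each unit becomes idle
    finish = {}
    schedule = {}
    for task in order:
        ready = max((finish[p] for p in deps.get(task, ())), default=0)
        # Choose the unit on which the task can start earliest.
        unit = min(range(num_units), key=lambda u: max(unit_free[u], ready))
        start = max(unit_free[unit], ready)
        finish[task] = start + durations[task]
        unit_free[unit] = finish[task]
        schedule[task] = (unit, start)
    return schedule, max(finish.values())
```

A single greedy pass like this is fast but can miss the optimum, which matches the efficiency versus accuracy trade-off the abstract attributes to the heuristic approach.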

12.

Sequential hardware prefetching in shared-memory multiprocessors

Dahlgren F. Dubois M. Stenstrom P. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(7):733-746

To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show reductions in the number of read misses, the read penalty, and the execution time by up to 78%, 58%, and 25%, respectively.
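
The adaptive idea can be illustrated with a small software model. This is a sketch under stated assumptions rather than the hardware mechanism from the paper: the cache is modeled as an unbounded set, prefetches are issued only on misses, and the window size, the thresholds `high` and `low`, the doubling and halving policy, and the minimum degree of 1 are all hypothetical choices.

```python
class AdaptiveSequentialPrefetcher:
    """Sequential prefetcher whose degree follows measured usefulness."""

    def __init__(self, degree=1, max_degree=8, window=16, high=0.75, low=0.40):
        self.degree = degree          # blocks prefetched after each miss
        self.max_degree = max_degree
        self.window = window          # prefetches issued between adaptations
        self.high = high              # grow the degree above this useful ratio
        self.low = low                # shrink the degree below this ratio
        self.cache = set()            # simplification: unbounded cache
        self.prefetched = set()       # prefetched blocks not yet referenced
        self.issued = 0
        self.useful = 0

    def access(self, block):
        """Reference a block number; return True on a cache hit."""
        hit = block in self.cache
        if block in self.prefetched:
            self.prefetched.discard(block)
            self.useful += 1          # a prefetch turned out to be useful
        if not hit:
            self.cache.add(block)
            for nxt in range(block + 1, block + 1 + self.degree):
                if nxt not in self.cache:
                    self.cache.add(nxt)
                    self.prefetched.add(nxt)
                    self.issued += 1
            self._adapt()
        return hit

    def _adapt(self):
        """Adjust the degree once enough prefetches have been observed."""
        if self.issued < self.window:
            return
        ratio = self.useful / self.issued
        if ratio > self.high:
            self.degree = min(self.degree * 2, self.max_degree)
        elif ratio < self.low:
            self.degree = max(self.degree // 2, 1)
        self.issued = 0
        self.useful = 0
```

On a purely sequential stream the degree climbs to `max_degree`, while on a widely strided stream it falls back to 1, so little bandwidth is wasted on blocks that are never referenced.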

13.

Optimal Filters from Calibration Data for Image Deconvolution with Data Acquisition Error

Julianne Chung Matthias Chung Dianne P. O’Leary 《Journal of Mathematical Imaging and Vision》2012,44(3):366-374

Data acquisition errors due to dead pixels or other hardware defects can cause undesirable artifacts in imaging applications. Compensating for these defects typically requires knowledge such as a defective pixel map, which can be difficult or costly to obtain and which is not necessarily static. However, recent calibration data is readily available in many applications. In this paper, we compute optimal filters for image deconvolution with denoising using only this calibration data, by minimizing the empirical Bayes risk. We derive a bound on how the reconstruction changes as the number of dead pixels grows. We show that our approach is able to reconstruct missing information better than standard filtering approaches and is robust even in the presence of a large number of defects and to defects that arise after calibration.

14.

Feedback-Directed Prefetching Optimization for Linked Data Structures

漆锋滨王飞李中升《软件学报》2009,20(Z1):34-39

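Traditional data prefetching based on static compilation mostly targets array accesses. Today's applications make heavy use of linked data structures built from pointers, which are hard to optimize for prefetching with traditional compilation methods. Feedback-directed data prefetching optimization is a cutting-edge compiler optimization technique in current high-performance computing, and it handles prefetching for linked structures well. Building on a study of the feedback-directed compilation optimizations in the ORC compiler, and targeting the characteristics of the Alpha architecture, this work optimizes feedback-directed data prefetching for linked structures. SPEC2000 tests show an average performance improvement of 4.1%.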

15.

Performance optimization problem in speculative prefetching

Tuah N.J. Kumar M. Venkatesh S. Das S.K. 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(5):471-484

Speculative prefetching has been proposed to improve the response time of network access. Previous studies in speculative prefetching focus on building and evaluating access models for the purpose of access prediction. This paper investigates a complementary area which has been largely ignored, that of performance modeling. We analyze the performance of a prefetcher that has uncertain knowledge about future accesses. Our performance metric is the improvement in access time, for which we derive a formula in terms of resource parameters (time available and time required for prefetching) and speculative parameters (probabilities for next access). We develop a prefetch algorithm to maximize the improvement in access time. The algorithm is based on finding the best solution to a stretch knapsack problem, using theoretically proven apparatus to reduce the search space. An integration between speculative prefetching and caching is also investigated.
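
The flavor of the optimization can be shown with a deliberately simplified model. This sketch is not the paper's formula or its stretch knapsack formulation: it assumes that an item is only worth prefetching if it can be retrieved completely within the available time, that the expected saving of item i is its access probability times its retrieval time, and that times are positive integers. Under those assumptions the choice reduces to an ordinary 0/1 knapsack. All names are hypothetical.

```python
def choose_prefetches(candidates, time_available):
    """Pick items to prefetch under a simplified 0/1 knapsack model.

    candidates: list of (name, access_probability, retrieval_time),
                with positive integer retrieval times.
    Returns (expected_saving, tuple_of_chosen_names).
    """
    # best[t] = best (expected saving, chosen items) using at most t time units
    best = [(0.0, ())] * (time_available + 1)
    for name, probability, cost in candidates:
        gain = probability * cost  # expected access time saved by this item
        # Iterate downwards so each item is used at most once.
        for t in range(time_available, cost - 1, -1):
            value = best[t - cost][0] + gain
            if value > best[t][0]:
                best[t] = (value, best[t - cost][1] + (name,))
    return best[time_available]
```

For example, with candidates `("a", 0.5, 4)`, `("b", 0.3, 3)`, and `("c", 0.2, 2)` and five time units available, prefetching only `a` gives the largest expected saving in this model, because `b` and `c` together save less than `a` alone.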

16.

Dynamic real-time scheduling strategies for interactive continuous media servers

Tsun-Ping J. To Babak Hamidzadeh 《Multimedia Systems》1999,7(2):91-106

In this paper, we propose and study a dynamic approach to schedule real-time requests in a video-on-demand (VOD) server. Providing quality of service in such servers requires uninterrupted and on-time retrieval of motion video data. VOD services and multimedia applications further require access to the storage devices to be shared among multiple concurrent streams. Most of the previous VOD scheduling approaches use limited run-time information and thus cannot exploit the potential capacity of the system fully. Our approach improves throughput by making use of run-time information to relax admission control. It maintains excellent quality of service under varying playout rates by observing deadlines and by reallocating resources to guarantee continuous service. It also reduces start-up latency by beginning service as soon as it is detected that deadlines of all real-time requests will be met. We establish safe conditions for greedy admission, dynamic control of disk read sizes, fast initial service, and sporadic services. We conduct thorough simulations over a wide range of buffer capacities, load settings, and varying playout rates to demonstrate the significant improvements in quality of service, throughput, and start-up latency of our approach relative to a static approach.

17.

A low-complexity microprocessor design with speculative pre-execution

Won W. Jean-Luc 《Journal of Systems Architecture》2008,54(12):1101-1112

Current superscalar architectures strongly depend on an instruction issue queue to achieve multiple instruction issue and out-of-order execution. However, the issue queue requires a centralized structure and relies on global broadcast operations to wake up and select instructions. Therefore, a large issue queue ultimately results in a low clock rate along with high circuit complexity. In other words, the increasing demand for a larger issue queue imposes a significant burden on achieving a higher clock speed. This paper discusses our Speculative Pre-Execution Assisted by compileR (SPEAR), a low-complexity issue queue design. SPEAR is designed to manage the small-window superscalar architecture more efficiently without increasing the window size. To this end, we first recognize that long memory latency is one of the factors that demand a large window, and we aim at achieving early execution of the miss-causing load instructions using another hierarchy of an issue queue. We pre-execute those miss-causing instructions speculatively as an additional prefetching thread. Simulation results show that the SPEAR design achieves performance comparable to or even better than what would be obtained in superscalar architectures with a large issue queue. However, SPEAR is designed with smaller issue queues, which can consequently be implemented with low hardware complexity and high clock speed.

18.

New robot navigation algorithm for arbitrary unknown dynamic environments based on future prediction and priority behavior

《Expert systems with applications》2017

This study focuses on existing drawbacks and inefficiencies of the available path planning approaches within unknown dynamic environments. The drawbacks are the inability to plan under uncertain dynamic environments, non-optimality, failure in crowded complex situations, and difficulty in predicting the velocity vector of obstacles. This study aims (1) to develop a new predictive method to avoid static and dynamic obstacles in planning the path of a mobile robot in unknown dynamic environments in which the obstacles are moving and their speed profiles are not pre-identified, to find a safe path and to react rapidly and (2) to integrate a decision-making process with the predictive behavior of the velocity vector of obstacles by using the sensory system information of the robot. Information on the locations, shapes, and velocities of static and dynamic obstacles is presumed to be unavailable. Such information is determined online using rangefinder sensors. Thus, the robot recognizes free directions that lead it toward its destination while keeping it safe and preventing collisions with obstacles. Extensive simulations confirm the efficiency of the suggested approach and its success in handling complex and extremely dynamic environments that contain various obstacle shapes. Findings indicate that the proposed method exhibits attractive features, such as high optimality, high stability, low running time, and zero failure rates. The failure rate is zero for all test problems. The average path length for all test environments is 16.51 with a standard deviation of 0.49, which provides an average optimality rate of 89.79%. The average running time is 4.74 s (the standard deviation is 0.26).

19.

Taxonomy of Data Prefetching for Multicore Processors

Surendra Byna (Member, IEEE) Yong Chen 《计算机科学技术学报》2009,24(3):405-417

Data prefetching is an effective data access latency hiding technique to mask the CPU stall caused by cache misses and to bridge the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to a processor before it is actually needed. Many prefetching techniques have been developed for single-core processors. Recent developments in processor technology have brought multicore processors into mainstream. While some of the single-core prefetching t...

20.

A Real-Time Online Evaluation Method for Helper-Thread Prefetching Quality

张建勋古志民《计算机应用》2017,37(1):114-119

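To address the tedious and time-consuming process of setting helper-thread control parameter values by traditional static enumeration, this paper proposes a real-time online method for evaluating helper-thread prefetching quality. First, the goals of helper-thread prefetching quality of service (QoS) are defined. Second, dynamic metrics for evaluating helper-thread prefetching performance are analyzed, and helper-thread prefetching QoS is modeled. Finally, a dynamic adaptive adjustment algorithm for helper-thread prefetching is proposed; it uses information such as changes in the program's phase behavior and in the dynamic prefetching benefit to judge how suitable the parameter values are and whether feedback optimization is needed, thereby adapting the prefetching control. Experimental results show that, with the adaptive prefetching evaluation algorithm applied, the hot module of Mst achieves a speedup of 1.496, and that the proposed method can adaptively control and adjust the helper-thread control parameter values according to the program's dynamic phase behavior.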