Similar Documents
20 similar documents retrieved (search time: 155 ms)
1.
Sequential prefetching schemes are widely employed in storage servers to mask disk latency and improve system throughput. However, existing schemes cannot benefit parallel disk systems as expected because they ignore the distinct internal characteristics of the parallel disk system, in particular data striping. Moreover, their aggressive prefetching pattern suffers from premature evictions and prolonged request latencies. In this paper, we propose a strip-oriented asynchronous prefetching (SoAP) technique dedicated to the parallel disk system. It settles the above-mentioned problems by providing multiple novel features, e.g., enhanced prediction accuracy, adaptive prefetching strength, physical data layout awareness, and timely prefetching. To validate SoAP, we implement a prototype by modifying the software redundant array of inexpensive disks (RAID) under Linux. Experimental results demonstrate that SoAP consistently offers improved average response time and throughput to the parallel disk system under non-random workloads compared with STEP, SP, ASP, and Linux-like SEQPs.
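The abstract does not spell out SoAP's internals, but the core idea of aligning prefetch decisions with the striping layout rather than with a flat logical address space can be illustrated with a minimal sketch. Everything below (names, strip size, the strip-sequentiality heuristic) is an assumption for illustration, not the authors' design:

```python
# Toy sketch: strip-aware sequential prefetching for a striped (RAID-0-like) array.
# All names and thresholds are illustrative assumptions, not the SoAP design.

STRIP_BLOCKS = 16   # blocks per strip (stripe unit)
NUM_DISKS = 4

def locate(block):
    """Map a logical block to (disk, strip index on that disk, offset in strip)."""
    stripe, off = divmod(block, STRIP_BLOCKS)
    return stripe % NUM_DISKS, stripe // NUM_DISKS, off

class StripPrefetcher:
    def __init__(self):
        self.last = {}        # disk -> (strip, offset) of last access on that disk
        self.prefetched = set()

    def access(self, block):
        disk, strip, off = locate(block)
        prev = self.last.get(disk)
        self.last[disk] = (strip, off)
        # Sequential access within the same strip on the same disk:
        # prefetch the following strip on that disk asynchronously.
        if prev and prev[0] == strip and off == prev[1] + 1:
            nxt = (disk, strip + 1)
            if nxt not in self.prefetched:
                self.prefetched.add(nxt)
                return nxt    # issue asynchronous read of the whole next strip
        return None

p = StripPrefetcher()
for b in [0, 1, 2, 3]:        # a run of sequential blocks on disk 0, strip 0
    hint = p.access(b)
    if hint:
        print("prefetch disk %d strip %d" % hint)
```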

2.
We study the performance of various run placement policies on disks for the merge phase of concurrent mergesorts using parallel prefetching. The initial sorted runs (input) of a merge and its final sorted run (output) are stored on multiple disks, but each run resides only on a single disk. In this paper, we examine through detailed simulations three different run placement policies and the impact of buffer thrashing. The results show that, with buffer thrashing avoidance, the best performance can be achieved by a run placement policy that dedicates a proper subset of the disks to writing the output runs while the rest of the disks are used for prefetching the input runs in parallel. However, the proper number of write disks is workload dependent and, if not carefully chosen, can adversely affect system performance. In practice, reasonably good performance can be achieved by a run placement policy that does not place the output run of a merge on any of the disks that store its own input runs but allows the output run to share the same disk with some of the input runs of other merges.
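As a rough illustration of the practical policy recommended above, here is a hedged sketch of an output-run placement rule: never place a merge's output on a disk holding one of its own input runs, and prefer the least loaded of the remaining disks. The function and its parameters are invented for the example:

```python
# Toy sketch of the practical run-placement rule described above: never put a
# merge's output run on a disk holding one of its own input runs; among the
# remaining disks, pick the least loaded. Names are illustrative assumptions.

def place_output(num_disks, input_disks, load):
    """input_disks: set of disks holding this merge's input runs.
       load: list of current run counts per disk."""
    candidates = [d for d in range(num_disks) if d not in input_disks]
    if not candidates:                 # degenerate case: inputs cover all disks
        candidates = list(range(num_disks))
    return min(candidates, key=lambda d: load[d])

load = [3, 1, 2, 0]
print(place_output(4, {0, 2}, load))   # -> 3 (free of this merge's inputs, least loaded)
```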

3.
External sorting—the process of sorting a file that is too large to fit into the computer's internal memory and must be stored externally on disks—is a fundamental subroutine in database systems [G], [IBM]. Of prime importance are techniques that use multiple disks in parallel in order to speed up the performance of external sorting. The simple randomized merging (SRM) mergesort algorithm proposed by Barve et al. [BGV] is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice. Knuth [K, Section 5.4.9] recently identified SRM (which he calls "randomized striping") as the method of choice for sorting with parallel disks. In this paper we present an efficient implementation of SRM, based upon novel and elegant data structures. We give a new implementation for SRM's lookahead forecasting technique for parallel prefetching and its forecast and flush technique for buffer management. Our techniques amount to a significant improvement in the way SRM carries out the parallel, independent disk accesses necessary to read blocks of input runs efficiently during external merging. Our implementation is based on synchronous parallel I/O primitives provided by the TPIE programming environment [TPI]; whenever our program issues an I/O read (write) operation, one block of data is synchronously read from (written to) each disk in parallel. We compare the performance of SRM over a wide range of input sizes with that of disk-striped mergesort (DSM), which is widely used in practice. DSM consists of a standard mergesort in conjunction with striped I/O for parallel disk access. SRM merges together significantly more runs at a time compared with DSM, and thus it requires fewer merge passes. We demonstrate in practical scenarios that even though the streaming speeds for merging with DSM are a little higher than those for SRM (since DSM merges fewer runs at a time), sorting using SRM is often significantly faster than with DSM (since SRM requires fewer passes). The techniques in this paper can be generalized to meet the load-balancing requirements of other applications using parallel disks, including distribution sort and multiway partitioning of a file into several other files. Since both parallel disk merging and multimedia processing deal with streams that get "consumed" at nonuniform and partially predictable rates, our techniques for lookahead based upon forecasting data may have relevance in video server applications.
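SRM's forecasting exploits a classic merge property: the next input block to be exhausted is the one whose last key is smallest, so the merger can predict which disk it will need next and prefetch from it. A minimal sketch of that ordering logic (data layout and names are assumptions, not the paper's implementation):

```python
# Minimal sketch of merge forecasting: the next input block the merge will
# exhaust is the one whose *last* key is smallest, so we can predict which
# disk to read from next. Data layout and names are illustrative assumptions.
import heapq

def forecast_order(runs, run_disk, block_size):
    """runs: list of sorted key lists; run_disk: disk holding each run.
       Yields (disk, run, block_index) in the order blocks will be consumed."""
    heap = []
    pos = [0] * len(runs)
    for r, keys in enumerate(runs):
        if keys:
            last = keys[min(block_size, len(keys)) - 1]   # last key of first block
            heapq.heappush(heap, (last, r))
    while heap:
        _, r = heapq.heappop(heap)
        yield run_disk[r], r, pos[r] // block_size
        pos[r] += block_size
        if pos[r] < len(runs[r]):                         # forecast the run's next block
            end = min(pos[r] + block_size, len(runs[r])) - 1
            heapq.heappush(heap, (runs[r][end], r))

runs = [[1, 4, 9, 12], [2, 3, 5, 6], [7, 8, 10, 11]]
for disk, run, blk in forecast_order(runs, run_disk=[0, 1, 0], block_size=2):
    print("read block %d of run %d from disk %d" % (blk, run, disk))
```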

4.
In a multimedia server, multiple media streams are generally serviced in a cyclic fashion. Due to non-uniform playback rates and asynchronous arrivals of queries, there tends to be spare disk bandwidth in each service cycle. In this paper, we study the issue of dynamically using spare disk bandwidth and buffer to maximize the system throughput of a multimedia server. We introduce the concept of minimizing buffer consumption as the criterion to select an appropriate media stream to utilize the spare system resources. Buffer consumption measures not only the amount of buffer but also the amount of time such buffer space is occupied (i.e., the space-time product). Different alternatives to utilizing spare disk bandwidth are examined, including different rate-adjustable retrievals of an already activated stream and prefetching the next waiting stream. For rate-adjustable retrievals, we study buffer consumption-based and remaining-time-based criteria for selecting an active stream to increase retrievals. Simulations are conducted to evaluate and compare different cases. The results show that (1) minimizing buffer consumption is the right criterion for maximizing the system throughput with spare disk bandwidth; (2) in general, prefetching a waiting stream incurs more buffer consumption, and thus is less effective than rate-adjustable retrieval of active streams in maximizing the system throughput; and (3) the advantage of rate-adjustable retrieval over prefetching is especially significant when service cycle time is small.
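Buffer consumption, as defined above, is a space-time product: the amount of buffer occupied integrated over how long it stays occupied. A toy worked example of using it to choose among candidate uses of spare bandwidth (all numbers invented) appears below; note how prefetching a waiting stream scores worst, matching result (2):

```python
# Toy illustration of the buffer-consumption (space-time product) criterion:
# among candidate uses of spare bandwidth, pick the one occupying the least
# buffer space for the least time. Numbers and names are made-up assumptions.

def buffer_consumption(extra_bytes, playback_rate):
    """Extra data of size extra_bytes drains at playback_rate (bytes/cycle);
       consumption is the space-time sum of the remaining buffered data:
       extra + (extra - rate) + (extra - 2*rate) + ... until drained."""
    total, remaining = 0, extra_bytes
    while remaining > 0:
        total += remaining
        remaining -= playback_rate
    return total

candidates = {
    "boost stream A (fast drain)": buffer_consumption(8, 4),   # 8 + 4 = 12
    "boost stream B (slow drain)": buffer_consumption(8, 2),   # 8+6+4+2 = 20
    "prefetch waiting stream C":   buffer_consumption(12, 3),  # 12+9+6+3 = 30
}
print(min(candidates, key=candidates.get))   # the fast-draining active stream wins
```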

5.
Proxy caches are essential to improve the performance of the World Wide Web and to reduce user-perceived latency. Appropriate cache management strategies are crucial to achieve these goals. In our previous work, we introduced Web object-based caching policies. A Web object consists of the main HTML page and all of its constituent embedded files. Our studies have shown that these policies improve proxy cache performance substantially. In this paper, we propose a new Web object-based policy to manage the storage system of a proxy cache. We propose two techniques to improve storage system performance. The first technique prefetches the related files belonging to a Web object from the disk to main memory; this improves performance because most of the files can then be served from main memory rather than from the proxy's disk. The second technique stores the members of a Web object in contiguous disk blocks in order to reduce disk access time. We used trace-driven simulations to study the performance improvements obtainable with these two techniques. Our results show that the first technique by itself provides up to a 50% reduction in hit latency, which is the delay involved in serving a hit document from the proxy. An additional 5% improvement can be obtained by incorporating the second technique.

6.
The I/O performance of applications in multiple-disk systems can be improved by overlapping disk accesses. This requires the use of appropriate prefetching and buffer management algorithms that ensure the most useful blocks are accessed and retained in the buffer. In this paper, we answer several fundamental questions on prefetching and buffer management for distributed-buffer parallel I/O systems. First, we derive and prove the optimality of an algorithm, P-min, that minimizes the number of parallel I/Os. Second, we analyze P-con, an algorithm that always matches its replacement decisions with those of the well-known demand-paged MIN algorithm. We show that P-con can become fully sequential in the worst case. Third, we investigate the behavior of online algorithms for multiple-disk prefetching and buffer management. We define and analyze P-LRU, a parallel version of the traditional LRU buffer management algorithm. Unexpectedly, we find that the competitive ratio of P-LRU is independent of the number of disks. Finally, we present the practical performance of these algorithms on randomly generated reference strings. These results confirm the conclusions derived from the analysis of worst-case inputs.
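As a rough sketch of what a parallel LRU policy can look like, the toy below keeps one LRU-managed buffer partition per disk and counts a parallel I/O step per miss. This is an illustrative assumption about the model, not the paper's exact formulation:

```python
# Rough sketch of a P-LRU-style policy for a distributed-buffer parallel I/O
# system: each disk has its own buffer partition managed by LRU, and every
# miss costs one parallel I/O step. Illustrative assumptions throughout.
from collections import OrderedDict

class PLRU:
    def __init__(self, num_disks, per_disk_buf):
        self.buf = [OrderedDict() for _ in range(num_disks)]  # per-disk LRU lists
        self.cap = per_disk_buf
        self.ios = 0

    def reference(self, disk, block):
        part = self.buf[disk]
        if block in part:
            part.move_to_end(block)          # hit: refresh LRU position
            return
        self.ios += 1                        # miss: one parallel I/O step
        if len(part) >= self.cap:
            part.popitem(last=False)         # evict LRU block of this partition
        part[block] = True

plru = PLRU(num_disks=2, per_disk_buf=2)
for disk, block in [(0, "a"), (1, "x"), (0, "b"), (0, "c"), (0, "a")]:
    plru.reference(disk, block)
print("parallel I/O steps:", plru.ios)       # "a" is evicted by "c", then refetched -> 5
```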

7.
Liu Jin, Hu Chuang, Hu Ming, Gong Yili. Journal of Computer Applications (计算机应用), 2012, 32(6): 1713-1716
To address the mistaken prefetch decisions made by the current Linux kernel readahead algorithm under multithreaded workloads, this paper proposes a prefetching algorithm based on multiple prefetch points, derived from the characteristics of how processes access disk files in multithreaded environments. Building on the kernel's original readahead algorithm and the data access patterns of multithreaded applications, the algorithm is implemented in the page cache layer of the Linux kernel. Experiments and analysis show that in single-threaded IOzone tests the new algorithm performs on par with the original kernel readahead algorithm, while in multithreaded tests it reads a file of the same size in at least one third less time than the original algorithm. The new algorithm helps raise I/O parallelism and thereby the degree of parallelism of the whole computer system.
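A hedged sketch of the multi-prefetch-point idea follows: instead of one readahead window per open file, the prefetcher tracks several expected next pages, so interleaved sequential streams from different threads are each recognized. The structure and thresholds below are assumptions, not the paper's kernel patch:

```python
# Illustrative sketch of prefetching with multiple prefetch points per file:
# interleaved sequential streams (e.g., several threads reading different
# regions) each get their own readahead window instead of confusing a single
# one. Structure and thresholds are assumptions, not the paper's kernel code.

class MultiPointReadahead:
    def __init__(self, max_points=8, window=32):
        self.points = {}            # expected next page -> stream id
        self.max_points = max_points
        self.window = window
        self.next_id = 0

    def on_read(self, page):
        if page in self.points:     # continues a known stream: readahead fires
            sid = self.points.pop(page)
            self.points[page + 1] = sid
            return list(range(page + 1, page + 1 + self.window))
        if len(self.points) < self.max_points:   # start tracking a new stream
            self.points[page + 1] = self.next_id
            self.next_id += 1
        return []                   # random-looking access: no prefetch

ra = MultiPointReadahead(window=4)
trace = [100, 500, 101, 501, 102]   # two interleaved sequential readers
for pg in trace:
    print(pg, "->", ra.on_read(pg))
```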

8.
In this paper, we propose a compiler-directed cache coherence scheme which makes use of data prefetching to enforce cache coherence in large-scale distributed shared-memory (DSM) systems. The Cache Coherence With Data Prefetching (CCDP) scheme uses compiler analyses to identify potentially stale and nonstale data references in a parallel program and enforces cache coherence by prefetching the potentially stale references. In this manner, the CCDP scheme brings up-to-date data into the caches to avoid stale references and also hides the latency of these memory accesses. Furthermore, the scheme also prefetches the nonstale references to hide their memory latencies. To evaluate the performance impact of the CCDP scheme on a real system, we applied the scheme to five applications from the SPEC CFP95 and CFP92 benchmark suites and executed the resulting codes on the Cray T3D. The experimental results indicate that for all of the applications studied, our scheme provides significant performance improvements by caching shared data and using data prefetching to enforce cache coherence and to hide memory latency.

9.
Lee Minsuk, Min Sang Lyul, Shin Heonshik, Kim Chong Sang, Park Chang Yun. Real-Time Systems, 1997, 13(1): 47-65
Cache memories have been extensively used to bridge the speed gap between high-speed processors and relatively slow main memory. However, they are not widely used in real-time systems due to their unpredictable performance. This paper proposes an instruction prefetching scheme called threaded prefetching as an alternative to instruction caching in real-time systems. In the proposed threaded prefetching, an instruction block pointer called a thread is assigned to each instruction memory block and is made to point to the next block on the worst-case execution path, as determined by compile-time analysis. Also, the thread is not updated throughout the entire program execution, to guarantee predictability. This paper also compares the worst-case performance of various previous instruction prefetching schemes with that of the proposed threaded prefetching. By analyzing several benchmark programs, we show that the worst-case performance of the proposed scheme is significantly better than those of previous instruction prefetching schemes. The results also show that when the block size is large enough, the worst-case performance of the proposed threaded prefetching scheme is almost as good as that of an instruction cache with a 100% hit ratio.
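A minimal sketch of threaded prefetching: the compile-time thread table maps each instruction block to its successor on the worst-case path and is never updated at run time, so prefetch behavior is fully predictable. The control-flow graph below is an invented example; real thread assignment would come from WCET analysis:

```python
# Minimal sketch of threaded prefetching: each instruction memory block has a
# compile-time "thread" pointing to the next block on the worst-case execution
# path, and the prefetcher always follows it. The CFG below is an invented
# example; the real thread assignment comes from WCET analysis.

# thread[b] = next block on the worst-case path from block b (never updated).
thread = {0: 1, 1: 3, 2: 3, 3: 4}   # e.g., block 1 (the longer branch) chosen over 2

def run(execution_path):
    prefetched, hits, misses = set(), 0, 0
    for blk in execution_path:
        if blk in prefetched:
            hits += 1
        else:
            misses += 1              # demand fetch of the current block
        prefetched.add(blk)
        nxt = thread.get(blk)
        if nxt is not None:
            prefetched.add(nxt)      # prefetch along the static thread
    return hits, misses

print(run([0, 1, 3, 4]))  # worst-case path: every block after the first is a hit
print(run([0, 2, 3, 4]))  # other branch: block 2 misses, thread rejoins at 3
```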

10.
Research on Prefetching in a High-Bandwidth Remote Memory Architecture
Advances in high-speed circuits and optical interconnect technology have greatly increased network speed and bandwidth. It has therefore become possible to break away from the traditional architecture of high-performance computers in which the CPU and memory are tightly coupled: their coupling is no longer limited by distance, which is bound to bring architectural change. Reference [1] proposed the DSAG architecture, in which CPU and memory are physically separated. Each CPU node keeps only a small amount of local memory, while the bulk of memory is placed remotely under unified management as a memory server; CPU nodes and memory servers are interconnected by a high-speed network. This new architecture brings better sharing and scalability, but it also challenges us to address the imbalance between CPU and memory. To reduce the additional memory access latency introduced by the DSAG remote memory structure, we observe that normal CPU memory accesses do not fully use the network's high bandwidth, so the remaining bandwidth can be used to prefetch data from remote memory. This paper records the addresses of local misses (relative to remote memory) during application execution, analyzes, with page alignment, the statistical characteristics of the page frame streams they contain, and argues that a prefetching mechanism based on page frame streams can reduce memory access latency and improve system performance. Finally, we validate the feasibility and correctness of this idea through simulation, propose three prefetching policies, and compare and analyze the factors that affect prefetching effectiveness.
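The page-frame-stream analysis can be illustrated with a small sketch that groups page-aligned miss addresses into runs of consecutive page frames; sufficiently long runs become prefetch candidates for the spare network bandwidth. Page size and threshold are assumptions for the example:

```python
# Illustrative sketch of page frame stream detection in a miss trace: misses
# are page-aligned and consecutive page frames are grouped into streams; long
# streams are candidates for remote-memory prefetch over spare bandwidth.
# Page size and threshold are assumptions for the example.

PAGE = 4096

def find_streams(miss_addrs, min_len=3):
    frames = [a // PAGE for a in miss_addrs]
    streams, start = [], 0
    for i in range(1, len(frames) + 1):
        # close the current run when it stops being consecutive (or at the end)
        if i == len(frames) or frames[i] != frames[i - 1] + 1:
            if i - start >= min_len:                  # long enough to exploit
                streams.append((frames[start], frames[i - 1]))
            start = i
    return streams

trace = [0x1000, 0x2000, 0x3000, 0x4000, 0x9000, 0x20000, 0x21000, 0x22000]
print(find_streams(trace))   # -> [(1, 4), (32, 34)]
```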

11.
Data Prefetching and Optimization Techniques for Instruction-Level Parallel Compilers
The processing power of microprocessor chips keeps growing, but memory speed lags far behind, leaving overall system performance unsatisfactory. To solve this problem, compilers have developed techniques such as locality optimization and data prefetching. This paper presents a data prefetching technique for ILP (Instruction-Level Parallelism) optimizing compilers, together with a method that uses the register file to reduce the number of main memory accesses and thereby optimize programs. Applying them improves average memory performance and is quite effective for scientific and engineering computing applications.

12.
In this paper, we study the problem of how to maximize the throughput of a continuous-media system, given fixed amounts of buffer space and disk bandwidth both pre-determined at design time. Our approach is to maximize the utilization of disk and buffers. We propose doing so in two ways. First, we analyze a scheme that allows multiple streams to share buffers. Our analysis and preliminary simulation results indicate that buffer sharing could lead to as much as a 50% reduction in total buffer requirement. Second, we develop three prefetching strategies: SP, IP1 and IP2. As demonstrated by SP, straightforward prefetching is not effective at all. In contrast, IP1 and IP2, which prefetch more intelligently than SP, could be valuable in maximizing the effective use of buffers and disk. Our preliminary simulation results show that IP1 and IP2 could lead to a 40% improvement in throughput.

13.
Blockwise access to data is a central theme in the design of efficient external memory (EM) algorithms. A second important issue, when more than one disk is present, is fully parallel disk I/O. In this paper we present a simple, deterministic simulation technique which transforms certain Bulk Synchronous Parallel (BSP) algorithms into efficient parallel EM algorithms. It optimizes blockwise data access and parallel disk I/O and, at the same time, utilizes multiple processors connected via a communication network or shared memory. We obtain new improved parallel EM algorithms for a large number of problems including sorting, permutation, matrix transpose, several geometric and GIS problems including three-dimensional convex hulls (two-dimensional Voronoi diagrams), and various graph problems. We show that certain parallel algorithms known for the BSP model can be used to obtain EM algorithms that meet well-known I/O complexity lower bounds for various problems, including sorting.

14.
Web prefetching is an attractive solution to reduce the network resources consumed by Web services as well as the access latencies perceived by Web users. Unlike Web caching, which exploits temporal locality, Web prefetching utilizes the spatial locality of Web objects. Specifically, Web prefetching fetches objects that are likely to be accessed in the near future and stores them in advance. In this context, a sophisticated combination of these two techniques may bring significant improvements to the performance of the Web infrastructure. Considering that several caching policies have been proposed in the past, the challenge is to extend them by using data mining techniques. In this paper, we present a clustering-based prefetching scheme where a graph-based clustering algorithm identifies clusters of "correlated" Web pages based on the users' access patterns. This scheme can be integrated easily into a Web proxy server, improving its performance. Through a simulation environment, using a real data set, we show that the proposed integrated framework is robust and effective in improving the performance of the Web caching environment.
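As a rough sketch of the clustering-based scheme, the toy below builds a co-access graph whose edge weights count how often two pages share a user session, keeps edges above a threshold, and treats connected components as clusters whose members are prefetched together. The paper's graph-based clustering algorithm is more sophisticated; this is only an assumed stand-in:

```python
# Rough sketch of clustering-based Web prefetching: build a graph whose edge
# weights count how often two pages appear in the same user session, keep
# edges above a threshold, and treat connected components as clusters. On a
# request, prefetch the requested page's cluster-mates.
from collections import defaultdict
from itertools import combinations

def build_clusters(sessions, min_weight=2):
    weight = defaultdict(int)
    for s in sessions:
        for a, b in combinations(sorted(set(s)), 2):
            weight[(a, b)] += 1
    # union-find over edges strong enough to keep
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for (a, b), w in weight.items():
        if w >= min_weight:
            parent[find(a)] = find(b)
    clusters = defaultdict(set)
    for page in parent:
        clusters[find(page)].add(page)
    return {p: c for c in clusters.values() for p in c}

sessions = [["/a", "/b", "/c"], ["/a", "/b"], ["/x", "/y"], ["/x", "/y"]]
cluster_of = build_clusters(sessions)
print(sorted(cluster_of["/a"] - {"/a"}))   # prefetch candidates for /a -> ['/b']
```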

15.
The performance goal of a network-attached storage (NAS) device is to optimize networked data access and the management of the storage subsystem. This paper shows that, as disk rotational speed increases, NAS design should start from the optimal operating point of the disk hardware and be tuned as a whole to improve I/O performance. To explore the best operating performance of NAS, we build an accurate disk scheduling model based on both arm position and rotational position, and propose a method for measuring the performance of scheduling algorithms. Theoretical analysis and simulation tests are carried out using the performance parameters of the HP 97560 disk. The results show that the faster the disk spins, the higher the cost of each disk access. Therefore, the design of NAS devices must consider as a whole the choice of disk scheduling policy, the choice of parallel multi-disk organization, file cache allocation, and file system layout, in order to improve the performance of networked data access.
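The paper's point that scheduling must account for both arm position and rotational position can be illustrated with a shortest-positioning-time-first style sketch: the cost of a request is its seek time plus the rotational delay remaining once the seek completes. The seek model and constants below are invented, not HP 97560 parameters:

```python
# Toy shortest-positioning-time-first (SPTF-style) scheduler capturing the
# point that both arm position and rotational position matter. The seek model
# and timing constants are invented for illustration.
import math

RPM = 10000
REV_MS = 60000.0 / RPM            # one revolution in ms

def seek_ms(cyl_dist):
    return 0.0 if cyl_dist == 0 else 0.5 + 0.1 * math.sqrt(cyl_dist)

def positioning_ms(cur_cyl, cur_angle_ms, req):
    cyl, angle_ms = req
    s = seek_ms(abs(cyl - cur_cyl))
    arrive = (cur_angle_ms + s) % REV_MS          # head angle when the seek ends
    rot = (angle_ms - arrive) % REV_MS            # wait for the sector to come around
    return s + rot

def pick_next(cur_cyl, cur_angle_ms, pending):
    return min(pending, key=lambda r: positioning_ms(cur_cyl, cur_angle_ms, r))

pending = [(120, 1.0), (125, 5.5), (400, 0.2)]    # (cylinder, target angle in ms)
best = pick_next(100, 0.0, pending)
print("serve", best, "cost %.2f ms" % positioning_ms(100, 0.0, best))
```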

16.
High-speed networks and rapidly improving microprocessor performance make networks of workstations an extremely important tool for parallel computing, speeding up the execution of scientific applications. Shared memory is an attractive programming model for designing parallel and distributed applications, where the programmer can focus on algorithmic development rather than data partitioning and communication. Based on this important characteristic, systems have been developed to provide the shared memory abstraction on physically distributed memory machines, known as Distributed Shared Memory (DSM). DSM is built using specific software to combine a number of computer hardware resources into one computing environment. Such an environment not only provides an easy way to execute parallel applications, but also combines available computational resources with the purpose of speeding up execution of these applications. DSM systems need to maintain data consistency in memory, which usually leads to communication overhead. Therefore, there exist a number of strategies that can be used to overcome this overhead and improve overall performance. Strategies such as prefetching have been shown to deliver good performance in DSM systems, since they can reduce data access communication latencies from remote nodes. On the other hand, these strategies also transfer unnecessarily prefetched pages to remote nodes. In this paper, we focus on the access patterns of a parallel application during execution, and then analyze the data types and behavior of parallel applications. We propose an adaptive data classification scheme to improve the prefetching strategy, with the goal of improving overall performance. The adaptive data classification scheme classifies data according to the accessing sequence of pages, so that the home node uses the past access patterns of remote nodes to decide whether it needs to transfer related pages to them. From experimental results, we observe that our proposed method increases the accuracy of data access in an effective prefetch strategy by reducing the number of page faults and mis-prefetches. Experimental results using our proposed classification scheme show a performance improvement of about 9–25% over the same benchmark applications running on top of the original JIAJIA DSM system.

17.
Different cache prefetching policies suit different access patterns. This paper surveys the state of research on cache prefetching techniques for storage systems and, starting from an analysis of access patterns, constructs a triple model of access patterns. An adaptive cache prefetching policy suited to complex environments was then tested on a disk array. The results show that the adaptive policy achieves the disk array's optimal performance across different environments.

18.
With the advent of new computing paradigms, parallel file systems serve not only traditional scientific computing applications but also non-scientific computing applications, such as financial computing, business, and public administration. Parallel file systems provide storage services for multiple applications. As a result, various requirements need to be met. However, parallel file systems usually provide a unified storage solution, which cannot meet specific application needs. In this paper, an extended file handle scheme is proposed to deal with this problem. The original file handle is extended to record I/O optimization information, which allows file systems to specify optimizations for a file or directory based on workload characteristics. Therefore, fine-grained management of I/O optimizations can be achieved. On the basis of the extended file handle scheme, data prefetching and small file optimization mechanisms are proposed for parallel file systems. The experimental results show that the proposed approach improves the aggregate throughput of the overall system by up to 189.75%.
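A hedged sketch of what an extended file handle might look like: the usual identifying fields plus a set of per-file optimization hints that the server consults on each operation. Field names and hint values are assumptions, not the paper's format:

```python
# Illustrative sketch of an extended file handle that carries per-file I/O
# optimization hints alongside the usual identifying fields, so servers can
# pick a prefetching or small-file strategy per file or directory. Field
# names and hint values are assumptions, not the paper's on-wire format.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ExtFileHandle:
    fs_id: int
    object_id: int
    # --- extension: optimization info recorded in the handle ---
    hints: frozenset = field(default_factory=frozenset)

def server_read(handle, offset, size):
    if "seq_prefetch" in handle.hints:
        print("readahead enabled for object", handle.object_id)
    if "small_file_inline" in handle.hints and size <= 4096:
        print("serving small object", handle.object_id, "from packed store")
    # ... actual data path would go here ...

log_file = ExtFileHandle(1, 42, frozenset({"seq_prefetch"}))
thumb = ExtFileHandle(1, 43, frozenset({"small_file_inline"}))
server_read(log_file, 0, 65536)
server_read(thumb, 0, 1024)
```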

19.
Drawing on the efficiency of ASIC-style systems and the programmability of software, the reconfigurable computing concept gives computer performance further room for improvement. In cloud computing scenarios, however, a file system is needed to process the massive numbers of small files on the Internet efficiently. To this end, this paper reviews and analyzes existing small-file systems and designs a small-file system based on a Field Programmable Gate Array (FPGA), called FPGASmallFS. The system improves file system speed and disk space utilization by simplifying the file system structure and dynamically partitioning disk volumes, and it uses the FPGA's parallelism to accelerate the file system mount process. Test results show that FPGASmallFS achieves better system performance than the ReiserFS and Ext2 file systems.

20.
It is well known that dedicating one disk's worth of space in a disk array to parity check information can allow the array to tolerate a single failure. More recently, two possible ways of increasing that benefit through the use of additional redundant information have been demonstrated: the additional redundancy can be used to allow the array to tolerate more than one disk failure without the loss of information (multiple-fault tolerance), or it can be used to speed up the reconstruction of a single failed disk and thus reduce the impact of on-line reconstruction on array performance (using a technique known as parity declustering). However, these two advantages could not be obtained simultaneously. In this paper we demonstrate for the first time how to divide the benefits of extra space for redundant information arbitrarily between increased fault tolerance and accelerated reconstruction of failed disks. In addition, we give general lower bounds on the space overhead required to protect information in a disk array by storing redundant information. These bounds widen the optimality of some known multiple-fault-tolerant data layouts.
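For context, the single-failure baseline the paper builds on is XOR parity: storing the bitwise XOR of the data disks lets the array reconstruct any one failed disk from the survivors. A short worked example follows (the paper's actual contribution, dividing extra redundancy between multi-fault tolerance and declustered reconstruction, goes beyond this toy):

```python
# Worked example of the single-failure baseline: one disk's worth of XOR
# parity lets an array reconstruct any single failed disk from the survivors.
from functools import reduce

def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]   # three data disks, one stripe
parity = xor_blocks(data)                        # stored on the parity disk

failed = 1                                       # disk 1 is lost
survivors = [blk for i, blk in enumerate(data) if i != failed] + [parity]
recovered = xor_blocks(survivors)
assert recovered == data[failed]
print("recovered:", recovered.hex())             # -> 3344
```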
