Similar Literature
20 similar documents found.
1.
Data Grid integrates geographically distributed resources for solving data-intensive scientific applications. Effective scheduling in the Grid can reduce the amount of data transferred among nodes by submitting a job to a node where most of the requested data files are available. Scheduling is a traditional problem in parallel and distributed systems; however, because of the special characteristics and goals of the Grid, traditional approaches are no longer effective in this environment, so methods specialized for this kind of parallel and distributed system are needed. A complementary solution is a data replication strategy that creates multiple copies of files and stores them in convenient locations to shorten file access times. Combining these two ideas, this paper develops a job scheduling policy, called the hierarchical job scheduling strategy (HJSS), and a dynamic data replication strategy, called the advanced dynamic hierarchical replication strategy (ADHRS), to improve data access efficiency in a hierarchical Data Grid. HJSS uses hierarchical scheduling to reduce the search time for an appropriate computing node; it considers network characteristics, the number of jobs waiting in queue, file locations, and the disk read speed of the storage drives at data sources. Moreover, because storage capacity is limited, a good replica replacement algorithm is needed. We present a novel replacement strategy that deletes files in two steps when free space is insufficient for a new replica: first, it deletes the files with the minimum transfer time; second, if space is still insufficient, it considers the time each replica was last requested, its number of accesses, its size, and its file transfer time. Simulation results show that the proposed algorithm outperforms other algorithms in terms of job execution time, number of intercommunications, number of replications, hit ratio, computing resource usage, and storage usage.
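As a rough illustration of the two-step replacement just described, the Python sketch below evicts cheap-to-refetch files first and then falls back to a combined score over recency, access count, size, and transfer time. The `cheap_refetch` cutoff and the scoring formula are assumptions for illustration, not values taken from the paper.

```python
import time
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    size: float           # MB occupied on the storage element
    transfer_time: float  # seconds to re-fetch the file if evicted
    last_request: float   # unix timestamp of the most recent request
    accesses: int         # how many times this replica was requested

def choose_victims(replicas, free_space, needed, cheap_refetch=1.0):
    """Two-step eviction in the spirit of ADHRS (cutoff and weights assumed)."""
    victims = []

    def freed():
        return free_space + sum(v.size for v in victims)

    # Step 1: evict files that would be cheap to transfer back later.
    for r in sorted(replicas, key=lambda r: r.transfer_time):
        if freed() >= needed:
            return victims
        if r.transfer_time <= cheap_refetch:
            victims.append(r)

    # Step 2: rank the remainder by recency, access count, size, transfer time.
    now = time.time()
    def keep_value(r):  # lower value -> evicted earlier
        recency = 1.0 / (1.0 + now - r.last_request)
        return recency * r.accesses * r.transfer_time / r.size
    for r in sorted((r for r in replicas if r not in victims), key=keep_value):
        if freed() >= needed:
            break
        victims.append(r)
    return victims
```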

2.
In recent years, grid technology has grown so quickly that it is now used in many scientific experiments and research centers. A large number of storage elements and computational resources are combined into a grid that provides shared access to extra computing power. In particular, the data grid serves data-intensive applications and provides intensive resources across widely distributed communities. Data replication is an efficient way of distributing replicas among data grid sites, making it possible to access the same data at different locations in the grid; replication reduces data access time and improves system performance. This paper proposes a new dynamic data replication algorithm named PDDRA that improves on traditional algorithms. The algorithm rests on one assumption: members of a VO (Virtual Organization) have similar interests in files. Based on this assumption and on file access history, PDDRA predicts the future needs of grid sites and pre-fetches a sequence of files to the requesting site, so that the next time the site needs one of those files it is available locally. This considerably reduces access latency, response time, and bandwidth consumption. PDDRA consists of three phases: storing file access patterns; requesting a file and performing replication and pre-fetching; and replacement. The algorithm was tested with OptorSim, a grid simulator developed by the European Data Grid project. Simulation results show that the proposed algorithm outperforms other algorithms in terms of job execution time, effective network usage, total number of replications, hit ratio, and percentage of storage filled.
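The abstract does not give PDDRA's exact prediction rule, so the sketch below substitutes a simple successor-frequency model over the file access history to show the pre-fetching idea; the class and its methods are hypothetical stand-ins.

```python
from collections import defaultdict

class PrefetchPredictor:
    """Successor-frequency model over a site's file access history
    (hypothetical stand-in for PDDRA's predictor)."""
    def __init__(self):
        self.successors = defaultdict(lambda: defaultdict(int))
        self.last_file = None

    def record_access(self, file_name):
        # Remember which file tends to follow which in this site's history.
        if self.last_file is not None:
            self.successors[self.last_file][file_name] += 1
        self.last_file = file_name

    def predict_next(self, file_name, k=3):
        """Up to k files most often requested right after file_name."""
        follow = self.successors.get(file_name, {})
        return sorted(follow, key=follow.get, reverse=True)[:k]
```

A site would call `record_access` for every local request and pre-fetch the files returned by `predict_next` whenever it replicates a file, so likely follow-up requests hit locally.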

3.
Data grids support access to widely distributed storage for large numbers of users accessing potentially many large files. Efficient access is hindered by the high latency of the Internet. To improve access time, replication at nearby sites may be used. Replication also provides high availability, decreased bandwidth use, enhanced fault tolerance, and improved scalability. Resource availability, network latency, and user requests in a grid environment may vary with time, so any replica placement strategy must be able to adapt to such dynamic behavior. In this paper, we describe a new dynamic replica placement algorithm, Popularity Based Replica Placement (PBRP), for hierarchical data grids, guided by file "popularity". Our goal is to place replicas close to clients to reduce data access time while still using network and storage resources efficiently. The effectiveness of PBRP depends on the selection of a threshold value related to file popularity. We also present Adaptive-PBRP (APBRP), which determines this threshold dynamically based on data request arrival rates. We evaluate both algorithms using simulation. Results for a range of data access patterns show that our algorithms can shorten job execution time significantly and reduce bandwidth consumption compared to other dynamic replication methods.
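A minimal sketch of the threshold idea: PBRP replicates files whose popularity exceeds a cutoff, and APBRP adapts that cutoff to the request arrival rate. The linear scaling used below is an assumption; the paper states only that the threshold is determined dynamically from arrival rates.

```python
def popularity_threshold(base, arrival_rate, reference_rate):
    """APBRP-style adaptive cutoff: scale the base threshold with the
    request arrival rate (the linear form is an assumption)."""
    return base * (arrival_rate / reference_rate)

def files_to_replicate(access_counts, threshold):
    """PBRP replicates files whose popularity reaches the cutoff."""
    return [f for f, n in access_counts.items() if n >= threshold]

# Example: a burst of requests raises the cutoff, keeping replication selective.
counts = {"a.dat": 40, "b.dat": 12, "c.dat": 3}
cutoff = popularity_threshold(10, arrival_rate=60, reference_rate=30)
print(files_to_replicate(counts, cutoff))   # -> ['a.dat']
```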

4.
Data Grid is a geographically distributed environment for large-scale data-intensive applications. Effective scheduling in the Grid can reduce the amount of data transferred among nodes by submitting a job to a node where most of the requested data files are available. Data replication is another key optimization technique for reducing access latency and managing large data by storing it judiciously. This paper proposes two algorithms: first, a novel job scheduling algorithm called the Combined Scheduling Strategy (CSS) that considers the number of jobs waiting in queue, the location of the data required by the job, and computational capability; second, a dynamic data replication strategy called the Dynamic Hierarchical Replication Algorithm (DHRA) that improves file access time. DHRA stores each replica at an appropriate site, i.e., the site in the requesting region with the highest number of accesses for that particular replica. It also minimizes access latency by selecting the best replica when several sites hold copies of a dataset. Simulation results demonstrate that the proposed replication and scheduling strategies outperform the other algorithms.

5.
In recent years data grids have been deployed and have grown in many scientific experiments and data centers, giving grid users access to large amounts of distributed data. Data replication is a key issue in a data grid and should be applied intelligently, because it reduces data access time and bandwidth consumption for each grid site; the area is therefore both challenging and rich in scope for improvement. This paper introduces a new dynamic data replication algorithm named Popular File Group Replication (PFGR), which rests on three assumptions: first, users in a grid site (Virtual Organization) have similar interests in files; second, file accesses exhibit temporal locality; third, all files are read-only. Based on file access history and the first assumption, PFGR builds a connectivity graph for each group of dependent files at each grid site and replicates the most popular group of files to the requesting site, so that when a user of that site later needs those files, they are available locally. Simulation results show that the algorithm increases performance by minimizing mean job execution time and bandwidth consumption while avoiding unnecessary replication.
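The sketch below shows one way to realize the connectivity-graph idea: files accessed in the same session are linked with weighted edges, and the group of files strongly connected to a requested file is replicated together. The session representation and the `min_weight` cutoff are illustrative assumptions, not details from the paper.

```python
from collections import defaultdict
from itertools import combinations

def build_file_graph(sessions):
    """Edge weight = how often two files appear in the same access session
    (a stand-in for the per-site access history PFGR mines)."""
    weight = defaultdict(int)
    for files in sessions:
        for a, b in combinations(sorted(set(files)), 2):
            weight[a, b] += 1
    return weight

def popular_group(weight, seed, min_weight=2):
    """Grow the group of files connected to `seed` by edges of weight >=
    min_weight; the whole group is replicated together (cutoff assumed)."""
    group, frontier = {seed}, [seed]
    while frontier:
        f = frontier.pop()
        for (a, b), w in weight.items():
            if w >= min_weight and f in (a, b):
                other = b if f == a else a
                if other not in group:
                    group.add(other)
                    frontier.append(other)
    return group

graph = build_file_graph([["x", "y", "z"], ["x", "y"], ["x", "y", "w"]])
print(popular_group(graph, "x"))   # -> {'x', 'y'}
```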

6.
Data Grid is a geographically distributed environment for large-scale data-intensive applications. Effective scheduling in the Grid can reduce the amount of data transferred among nodes by submitting a job to a node where most of the requested data files are available. Data replication is another key optimization technique for reducing access latency and managing large data by storing it judiciously. This paper proposes two algorithms. First, a novel job scheduling algorithm called the Combined Scheduling Strategy (CSS) uses hierarchical scheduling to reduce the search time for an appropriate computing node; it considers the number of jobs waiting in queue, the location of the data required by the job, and the computing capacity of sites. Second, a dynamic data replication strategy called the Modified Dynamic Hierarchical Replication Algorithm (MDHRA) improves file access time; it is an enhanced version of the Dynamic Hierarchical Replication (DHR) strategy. Because the storage capacity of each Grid site is limited, replication must be used wisely, and an effective replica replacement strategy is essential: MDHRA replaces replicas based on the time a replica was last requested, its number of accesses, and its size. It selects the best replica location from among the many replicas based on response time, determined by the data transfer time, the storage access latency, the replica requests waiting in the storage queue, and the distance between nodes. Simulation results demonstrate that the proposed replication and scheduling strategies outperform the other algorithms.
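A minimal sketch of MDHRA-style replica selection, combining the four factors named above into an estimated response time. The additive form and the per-unit costs are assumptions, since the abstract does not give the exact formula.

```python
def response_time(transfer_time, storage_latency,
                  queued_requests, per_request_cost,
                  hops, per_hop_delay):
    """Estimated response time of one candidate replica site; the additive
    combination of the four factors is an assumption."""
    return (transfer_time + storage_latency
            + queued_requests * per_request_cost
            + hops * per_hop_delay)

# Hypothetical candidates: site name -> factor tuple.
candidates = {
    "site_a": (2.0, 0.01, 3, 0.5, 2, 0.2),
    "site_b": (3.5, 0.02, 0, 0.5, 1, 0.2),
}
best = min(candidates, key=lambda s: response_time(*candidates[s]))
print(best)   # the site with the smallest estimated response time
```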

7.
The Lustre file system delivers good I/O performance for large files but poor performance for small files. To address this problem, we propose a small-file cache pool on the MDS node that caches frequently accessed small files. In this mechanism, the cache pool maps to the OSTs in a fully associative manner and uses read-through and write-through policies to keep files consistent; the cache-pool update policy takes both file access time and access count into account and replaces entries with an improved Least Recently Used (LRU) algorithm. Experimental results show that the modified Lustre file system reduces the network transfer overhead and access time of small files and noticeably improves small-file I/O performance. Although it somewhat degrades large-file I/O performance, the degradation stays within an acceptable range, so the approach retains practical value.
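A toy version of the improved LRU described above: eviction weighs access count against recency instead of recency alone. The scoring formula and capacity accounting are simplified assumptions; a real MDS cache would also implement the read-through and write-through paths against the OSTs.

```python
import time

class WeightedLRUCache:
    """Eviction considers both recency and access count (weights assumed)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}   # name -> (last_access, hits, data)

    def get(self, name):
        if name in self.entries:
            _, hits, data = self.entries[name]
            self.entries[name] = (time.time(), hits + 1, data)
            return data
        return None          # miss: caller would read through to the OST

    def put(self, name, data):
        if len(self.entries) >= self.capacity:
            now = time.time()
            # Evict the entry with the lowest frequency/recency score.
            victim = min(self.entries,
                         key=lambda n: self.entries[n][1]
                                       / (1.0 + now - self.entries[n][0]))
            del self.entries[victim]
        self.entries[name] = (time.time(), 1, data)
```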

8.
To balance the access load, distributed storage systems typically rely on access-popularity information for file placement. However, a file's popularity is unknown at the moment it is stored and keeps changing over time, so popularity-based placement algorithms must continually migrate files, incurring high migration costs. This paper proposes a new fine-grained balanced distributed file placement algorithm. Exploiting the correlation between a file's access popularity and its age, the algorithm achieves good access-load balance by keeping the amount of data stored on each node fine-grained-similar along the creation-time dimension. The algorithm relies only on a file's creation time, which is known when the file is stored and never changes. Experimental results show that, compared with the random placement algorithm of HDFS, the proposed algorithm balances the access load better and improves access performance.
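A minimal sketch of the creation-time placement idea, assuming hourly buckets: each incoming file goes to the node currently storing the least data in the file's creation-time bucket, so every age band (and hence every popularity level) stays evenly spread across nodes. Class and parameter names are hypothetical.

```python
import time
from collections import defaultdict

class CreationTimePlacer:
    """Keep every node's stored volume similar within each creation-time
    bucket (bucket width is an assumption)."""
    def __init__(self, nodes, bucket_seconds=3600):
        self.nodes = list(nodes)
        self.bucket_seconds = bucket_seconds
        # bytes stored per (bucket, node)
        self.load = defaultdict(lambda: {n: 0 for n in self.nodes})

    def place(self, size, created=None):
        bucket = int((created or time.time()) // self.bucket_seconds)
        node = min(self.nodes, key=lambda n: self.load[bucket][n])
        self.load[bucket][node] += size
        return node

placer = CreationTimePlacer(["n1", "n2", "n3"])
print(placer.place(64 * 1024))   # least-loaded node for this hour's bucket
```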

9.
In out-of-core computation, disk I/O operations have a high startup overhead, so file access accounts for a large fraction of the total time; reducing the number of file read operations can therefore greatly improve efficiency. Data reuse is an effective technique for reducing the number of I/O operations. This paper divides the data into several files and keeps the file whose Cholesky factorization has just completed in the in-memory buffer; when the next file is factorized, the just-factorized file is used to update its data. This reduces the number of read I/O operations and thereby improves factorization efficiency.

10.
The Hadoop Distributed File System (HDFS) is typically used to store and manage large files; when storing and processing massive numbers of small files, it consumes large amounts of NameNode memory and access time, which becomes a key factor limiting HDFS performance. For the massive-small-file problem in multimodal medical data, this paper proposes a small-file storage optimization method based on double-layer hash coding and HBase. During small-file merging, an extendible hash function builds the index-file buckets, so the index file can grow dynamically as needed and supports file appends. Within each bucket, an MWHC hash function records where each file's index entry is stored in the index file; when a file is accessed, only the index information in the corresponding bucket, rather than all index entries, needs to be read, so a file can be located in O(1) time, improving lookup efficiency. To meet the storage needs of multimodal medical data, HBase stores the file index information, with a marker column identifying the modality of each medical record, which eases storage management of the different modalities and speeds up file reads. To further optimize storage performance, an LRU-based metadata prefetching mechanism is established, and the LZ4 compression algorithm compresses the merged files. Comparisons of file access performance and NameNode memory usage show that the proposed algorithm, compared with the original HDFS, HAR, MapFile, TypeStorage, and ...
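A toy model of the bucketed lookup described above: hash the file name to one bucket and read only that bucket's entries. Plain Python dicts stand in for the extendible-hash directory and the MWHC perfect-hash function, which are what give the real design its O(1) guarantee and dynamic growth.

```python
import hashlib

class BucketedIndex:
    """Locate a small file inside a merged file by reading one bucket only
    (dicts stand in for the on-disk extendible-hash/MWHC structures)."""
    def __init__(self, n_buckets=16):
        self.buckets = [dict() for _ in range(n_buckets)]

    def _bucket(self, name):
        h = int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")
        return self.buckets[h % len(self.buckets)]

    def add(self, name, merged_file, offset, length):
        self._bucket(name)[name] = (merged_file, offset, length)

    def locate(self, name):
        # Only this bucket is consulted, never the whole index.
        return self._bucket(name).get(name)

idx = BucketedIndex()
idx.add("ct_scan_001.dcm", "merged_0007.bin", 4096, 812)
print(idx.locate("ct_scan_001.dcm"))   # -> ('merged_0007.bin', 4096, 812)
```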

11.
Distributed storage clusters currently make wide use of erasure coding to ensure data reliability, but under update-intensive workloads the cluster's disk I/O overhead becomes a performance bottleneck. In common erasure-code update methods, the disk I/O overhead comes mainly from: 1) read-after-write operations on data nodes when updating data blocks; 2) disk seek overhead for reading and writing logs when updating parity blocks. To address these problems, this paper proposes PARD (parity logging with reserved space and data delta), whose main idea is to exploit the linearity of erasure-code arithmetic to reduce read-after-write operations and to exploit disk characteristics to reduce seek overhead. PARD has three design elements: 1) in-place data-block updates combined with log-based parity-block updates; 2) data-delta-based logging that uses erasure-code linearity to largely eliminate read-after-write operations on data nodes; 3) log space reserved at the end of each data file, matching disk characteristics, to reduce seek overhead when reading and writing logs. Experimental results show that with a 4 MB block size, PARD improves update throughput by at least 30.4%, 47.0%, and 82.0% over PLR (parity logging with reserved space), PARIX (speculative partial write), and FO (full overwrite), respectively.
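The linearity argument is easiest to see with XOR parity, the simplest linear code: the parity node only ever needs the data delta, never the old data block, so the read-after-write on the data path disappears and parity updates can be replayed later from a log. The sketch below demonstrates that property; PARD itself targets general erasure codes, and its reserved-space log layout is not modeled here.

```python
def xor_bytes(a, b):
    # Blocks are assumed equal length.
    return bytes(x ^ y for x, y in zip(a, b))

def make_update(old_block, new_block):
    """In-place data update: only this delta is shipped to the parity node."""
    return xor_bytes(old_block, new_block)

def replay_parity_log(parity_block, delta_log):
    """Because the code is linear, replaying logged data deltas against the
    stale parity reproduces the up-to-date parity block."""
    for delta in delta_log:
        parity_block = xor_bytes(parity_block, delta)
    return parity_block

# Two updates to data block 0, folded into the parity only at replay time.
d0, d1 = b"\x01\x02", b"\xff\x00"
parity = xor_bytes(d0, d1)                         # initial parity
log = [make_update(d0, b"\x03\x02")]               # first update
log.append(make_update(b"\x03\x02", b"\x04\x05"))  # second update
assert replay_parity_log(parity, log) == xor_bytes(b"\x04\x05", d1)
```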

12.
A MapReduce Intermediate-Result Cache for Highly Concurrent Data-Stream Processing
To meet the demand for highly concurrent data-stream processing over large-scale historical data and to improve the real-time processing capability of MapReduce, this paper proposes a key-value intermediate-result cache built from an in-memory hash B-tree and on-disk SSTable files; the structure is partitionable, scalable, and efficient. On this basis, the balance property of B-trees is used to devise a probability-based B-tree construction algorithm and a multi-way query algorithm, and read/write cost estimation together with buffer information is used to redesign the external-file read/write policy and the memory-disk replacement algorithm, further optimizing highly concurrent reads and writes of intermediate results. Algorithmic analysis and experiments demonstrate the effectiveness of the cache.
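A toy two-tier cache showing the memory/disk split described above: a small in-memory tier absorbs writes (standing in for the hash B-tree) and is flushed as immutable sorted runs (standing in for SSTable files). The flush policy, sizes, and lookup order are illustrative assumptions.

```python
class TwoTierCache:
    """In-memory write tier flushed to immutable sorted runs (a sketch)."""
    def __init__(self, mem_limit=4):
        self.mem_limit = mem_limit
        self.mem = {}        # key -> value, the write-absorbing tier
        self.runs = []       # sorted (key, value) lists standing in for disk

    def put(self, key, value):
        self.mem[key] = value
        if len(self.mem) >= self.mem_limit:
            self.runs.append(sorted(self.mem.items()))   # flush one run
            self.mem = {}

    def get(self, key):
        if key in self.mem:
            return self.mem[key]
        for run in reversed(self.runs):   # newest run first
            for k, v in run:
                if k == key:
                    return v
        return None
```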

13.
Parallel file systems employ data declustering to increase I/O throughput. However, because a single read or write operation can generate data accesses on multiple independent storage devices, a concurrency control mechanism must be employed to retain familiar file access semantics. Concurrency control negates some of the performance benefits of data declustering by introducing additional file access overhead. This paper examines the performance characteristics of the transaction-based concurrency control mechanism implemented in the PIOUS parallel file system. Results demonstrate that linearizability of file access operations is provided without loss of scalability or stability.

14.
The page is the basic unit of data exchange between disk and memory, and it occupies a central place in the data organization of operating systems, database management systems, and inverted files. To reduce the disk I/O overhead of inverted indexes, this paper proposes a page-based construction method for inverted files that reads and writes files by page. The method consists of three parts: the disk I/O layer, the page manager, and the heap-file manager. It implements block-based data-file management with variable page sizes, and supports assembling fixed-length and variable-length records within a page as well as cross-page storage of oversized records. Experimental results show that the method is effective and can be applied in practical vertical search engines.

15.
The Data Grid provides massive aggregated computing resources and distributed storage space for data-intensive applications. Because grid resources are limited while large volumes of data are produced, efficient use of Grid resources becomes an important challenge. Data replication is a key optimization technique for reducing access latency and managing large data by storing it judiciously. Effective scheduling in the Grid can reduce the amount of data transferred among nodes by submitting a job to a node where most of the requested data files are available. This paper proposes two strategies. First, a novel job scheduling strategy called the Weighted Scheduling Strategy (WSS) uses hierarchical scheduling to reduce the search time for an appropriate computing node; it considers the number of jobs waiting in a queue, the location of the data required by the job, and the computing capacity of the sites. Second, a dynamic data replication strategy called Enhanced Dynamic Hierarchical Replication (EDHR) improves file access time. EDHR is an enhanced version of the Dynamic Hierarchical Replication strategy: when there is not enough space for a new replica, it chooses which file to delete using an economic model based on the predicted future value of each data file. Because good replica placement is important for obtaining maximum benefit from replication and for reducing storage cost and mean job execution time, placement is also considered in this paper. The proposed strategies are implemented in OptorSim, the European Data Grid simulator. Experimental results show that the proposed strategies achieve better performance by minimizing data access time and avoiding unnecessary replication.
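The abstract says only that the economic model deletes files by their predicted future value; the sketch below substitutes an exponentially decayed access count as that value. The half-life and the decay form are assumptions for illustration.

```python
import math

def future_value(access_times, now, half_life=3600.0):
    """Illustrative stand-in for EDHR's economic model: value a file by an
    exponentially decayed count of its past accesses (half-life assumed)."""
    decay = math.log(2) / half_life
    return sum(math.exp(-decay * (now - t)) for t in access_times)

def pick_deletion_victim(files, now):
    """files: name -> list of past access timestamps; when space runs out,
    delete the file predicted to be worth the least in the future."""
    return min(files, key=lambda name: future_value(files[name], now))

history = {"hot.dat": [995.0, 999.0], "cold.dat": [100.0]}
print(pick_deletion_victim(history, now=1000.0))   # -> 'cold.dat'
```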

16.
Cloud computing is attracting growing interest as a new trend in data management. Data replication has been widely applied to improve data access in distributed systems such as Grids and Clouds. However, because each site has finite storage capacity, copies that would be useful for future jobs can be wastefully deleted and replaced with less valuable ones. It is therefore important to have a replication strategy that can dynamically store replicas while satisfying quality-of-service (QoS) requirements and storage capacity constraints. This paper presents a dynamic replication algorithm named the hierarchical data replication strategy (HDRS). HDRS comprises replica creation, which adaptively increases the number of replicas based on an exponential growth or decay rate; replica placement, based on access load and a labeling technique; and replica replacement, based on a file's predicted future value. We evaluate different dynamic data replication methods using CloudSim simulation. Experiments demonstrate that HDRS reduces response time and bandwidth usage compared with other algorithms: it identifies popular files and replicates them to the best sites, avoiding useless replication and decreasing access latency by balancing the load across sites.
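A one-function sketch of the exponential growth-or-decay rule for replica counts named above; the rate constant and the functional form are assumptions, since the abstract does not specify them.

```python
import math

def target_replica_count(current, access_rate, reference_rate, k=0.5):
    """Grow the replica count exponentially while the file is hotter than
    the reference rate, shrink it when colder (k and the form assumed)."""
    factor = math.exp(k * (access_rate - reference_rate) / reference_rate)
    return max(1, round(current * factor))

print(target_replica_count(2, access_rate=90, reference_rate=30))  # grows
print(target_replica_count(2, access_rate=10, reference_rate=30))  # decays
```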

17.
The author presents a distributed algorithm that considers the number of read and write accesses to files for every process type, the number of processes and their demands on system resources, the utilization of bottlenecks on all machines, and file sizes. The performance improvement obtained with the algorithm is discussed and proved, and a number of experiments executed in a distributed system to predict the performance impact of various algorithm strategies are examined. The experiments show changes in system performance due to file and process placement, file replication, and file and process migration.

18.
The Dawning Nebula (曙光星云) Distributed File System: Massive Small-File Access
With the development of Internet applications and the rise of cloud computing, services such as online photos, audio, video, and microblogs have grown rapidly and widely, exhibiting data access and storage patterns quite different from those of traditional applications. Every second, data centers generate, analyze, and return large numbers of small files, and these applications pose new challenges for high-throughput, low-latency reads and writes of massive numbers of files under high concurrency. This paper proposes HVFS, a new distributed file system based on distributed table storage, to manage billions of files while supporting both high-throughput and low-latency file access. HVFS manages metadata with an improved distributed extendible hash, and exploits temporal and spatial locality through a log-structured format and column storage. The paper describes the design and implementation of HVFS and reports medium-scale experiments: HVFS's table storage scales linearly, delivering more than 240,000 writes/s and 100,000 reads/s for small data items (<1 KB) on 82 nodes, and the FUSE-based implementation creates more than 180,000 files/s on 32 nodes.

19.
Parallel Algorithms for Time-Dependent Monte Carlo Transport Problems
This paper presents a parallel algorithm for time-dependent Monte Carlo (MC) transport problems and discusses and optimizes the loading and execution model of the parallel program. A parallel random-number generator is designed for MC parallel computation that, in the ideal case, requires no communication. Time-dependent MC transport involves a large number of I/O operations; in particular, reading the residual-particle data file takes substantial I/O time. To address this, three parallel I/O algorithms are proposed. Finally, performance measurements of the parallel algorithm are reported: compared with the serial computation time, the parallel computation time on 64 processors is 30 times shorter.
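The communication-free random-number idea can be illustrated with leapfrog stream splitting: every processor draws from one global sequence but takes only every n-th value, so the streams never overlap and need no coordination. A toy LCG stands in for the paper's actual generator; the constants are the classic ANSI C parameters, chosen here only for illustration.

```python
def leapfrog_rng(rank, n_procs, seed=12345):
    """Rank r consumes values r, r+n, r+2n, ... of one global LCG sequence,
    so the per-rank streams stay disjoint with no communication at all."""
    a, c, m = 1103515245, 12345, 2**31
    x = seed % m
    for _ in range(rank):          # advance to this rank's first value
        x = (a * x + c) % m
    while True:
        yield x / m
        for _ in range(n_procs):   # leap over the other ranks' values
            x = (a * x + c) % m

# Each of 4 processes builds its own stream independently:
streams = [leapfrog_rng(r, 4) for r in range(4)]
print([next(s) for s in streams])  # four distinct, non-overlapping draws
```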

20.
A disk cache is typically used in file systems to reduce the average access time for data storage and retrieval. The "periodic update" write policy, widely used in existing computer systems, is one in which dirty cache blocks are written to disk on a periodic basis. The average response time for disk read requests under the periodic update write policy is determined. Read and write load, cache-hit ratio, and the disk scheduler's ability to reduce service time under load are incorporated in the analysis, leading to design criteria that can be used to decide among competing cache write policies. The main conclusion is that the bulk arrivals generated by the periodic update policy cause a traffic-jam effect that results in severely degraded service. Effective use of the disk cache and disk scheduling can alleviate this problem, but only under a narrow range of operating conditions. Based on this conclusion, alternative write policies that retain the periodic update policy's advantages and provide uniformly better service are proposed.
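For concreteness, a minimal model of the policy under analysis: writes are absorbed in memory, and all dirty blocks are flushed together every interval, which is precisely the bulk-arrival pattern the paper identifies as the cause of the traffic-jam effect. Class and parameter names, including the 30-second interval, are assumptions.

```python
import threading
import time

class PeriodicUpdateCache:
    """Sketch of the 'periodic update' write policy: dirty blocks accumulate
    and are written to disk in bulk once per interval."""
    def __init__(self, flush, interval=30.0):
        self.flush = flush          # callable taking a dict of dirty blocks
        self.interval = interval
        self.dirty = {}
        self.lock = threading.Lock()

    def write(self, block_no, data):
        with self.lock:
            self.dirty[block_no] = data   # absorbed by the cache, no disk I/O

    def run(self):
        # Intended to run in a daemon thread alongside the workload.
        while True:
            time.sleep(self.interval)
            with self.lock:
                batch, self.dirty = self.dirty, {}
            self.flush(batch)             # the bulk arrival at the disk queue
```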
