期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A File Group Data Replication Algorithm for Data Grids

Amir Masoud Rahmani Leila Azari Helder A. Daniel 《Journal of Grid Computing》2017,15(3):379-393

In recent years data grids have been deployed and grown in many scientific experiments and data centers. The deployment of such environments has allowed grid users to gain access to a large number of distributed data. Data replication is a key issue in a data grid and should be applied intelligently because it reduces data access time and bandwidth consumption for each grid site. Therefore this area will be very challenging as well as providing much scope for improvement. In this paper, we introduce a new dynamic data replication algorithm named Popular File Group Replication, PFGR which is based on three assumptions: first, users in a grid site (Virtual Organization) have similar interests in files and second, they have the temporal locality of file accesses and third, all files are read-only. Based on file access history and first assumption, PFGR builds a connectivity graph for a group of dependent files in each grid site and replicates the most popular group files to the requester grid site. After that, when a user of that grid site needs some files, they are available locally. The simulation results show that our algorithm increases performance by minimizing the mean job execution time and bandwidth consumption and avoids unnecessary replication. 相似文献

2.

Network and data location aware approach for simultaneous job scheduling and data replication in large-scale data grid environments

Najme MANSOURI 《Frontiers of Computer Science》2014,8(3):391-408

Data Grid integrates graphically distributed resources for solving data intensive scientific applications. Effective scheduling in Grid can reduce the amount of data transferred among nodes by submitting a job to a node, where most of the requested data files are available. Scheduling is a traditional problem in parallel and distributed system. However, due to special issues and goals of Grid, traditional approach is not effective in this environment any more. Therefore, it is necessary to propose methods specialized for this kind of parallel and distributed system. Another solution is to use a data replication strategy to create multiple copies of files and store them in convenient locations to shorten file access times. To utilize the above two concepts, in this paper we develop a job scheduling policy, called hierarchical job scheduling strategy (HJSS), and a dynamic data replication strategy, called advanced dynamic hierarchical replication strategy (ADHRS), to improve the data access efficiencies in a hierarchical Data Grid. HJSS uses hierarchical scheduling to reduce the search time for an appropriate computing node. It considers network characteristics, number of jobs waiting in queue, file locations, and disk read speed of storage drive at data sources. Moreover, due to the limited storage capacity, a good replica replacement algorithm is needed. We present a novel replacement strategy which deletes files in two steps when free space is not enough for the new replica: first, it deletes those files with minimum time for transferring. Second, if space is still insufficient then it considers the last time the replica was requested, number of access, size of replica and file transfer time. The simulation results show that our proposed algorithm has better performance in comparison with other algorithms in terms of job execution time, number of intercommunications, number of replications, hit ratio, computing resource usage and storage usage. 相似文献

3.

Enhanced Dynamic Hierarchical Replication and Weighted Scheduling Strategy in Data Grid

Najme Mansouri Gholam Hosein Dastghaibyfard 《Journal of Parallel and Distributed Computing》2013

The Data Grid provides massive aggregated computing resources and distributed storage space to deal with data-intensive applications. Due to the limitation of available resources in the grid as well as production of large volumes of data, efficient use of the Grid resources becomes an important challenge. Data replication is a key optimization technique for reducing access latency and managing large data by storing data in a wise manner. Effective scheduling in the Grid can reduce the amount of data transferred among nodes by submitting a job to a node where most of the requested data files are available. In this paper two strategies are proposed, first a novel job scheduling strategy called Weighted Scheduling Strategy (WSS) that uses hierarchical scheduling to reduce the search time for an appropriate computing node. It considers the number of jobs waiting in a queue, the location of the required data for the job and the computing capacity of the sites Second, a dynamic data replication strategy, called Enhanced Dynamic Hierarchical Replication (EDHR) that improves file access time. This strategy is an enhanced version of the Dynamic Hierarchical Replication strategy. It uses an economic model for file deletion when there is not enough space for the replica. The economic model is based on the future value of a data file. Best replica placement plays an important role for obtaining maximum benefit from replication as well as reducing storage cost and mean job execution time. So, it is considered in this paper. The proposed strategies are implemented by OptorSim, the European Data Grid simulator. Experiment results show that the proposed strategies achieve better performance by minimizing the data access time and avoiding unnecessary replication. 相似文献

4.

A novel dynamic network data replication scheme based on historical access record and proactive deletion

Zhe Wang Tao Li Naixue Xiong Yi Pan 《The Journal of supercomputing》2012,62(1):227-250

Data replication is becoming a popular technology in many fields such as cloud storage, Data grids and P2P systems. By replicating files to other servers/nodes, we can reduce network traffic and file access time and increase data availability to react natural and man-made disasters. However, it does not mean that more replicas can always have a better system performance. Replicas indeed decrease read access time and provide better fault-tolerance, but if we consider write access, maintaining a large number of replications will result in a huge update overhead. Hence, a trade-off between read access time and write updating cost is needed. File popularity is an important factor in making decisions about data replication. To avoid data access fluctuations, historical file popularity can be used for selecting really popular files. In this research, a dynamic data replication strategy is proposed based on two ideas. The first one employs historical access records which are useful for picking up a file to replicate. The second one is a proactive deletion method, which is applied to control the replica number to reach an optimal balance between the read access time and the write update overhead. A unified cost model is used as a means to measure and compare the performance of our data replication algorithm and other existing algorithms. The results indicate that our new algorithm performs much better than those algorithms. 相似文献

5.

Combination of data replication and scheduling algorithm for improving data availability in Data Grids

Najme Mansouri Gholam Hosein Dastghaibyfard Ehsan Mansouri 《Journal of Network and Computer Applications》2013,36(2):711-722

Data Grid is a geographically distributed environment that deals with large-scale data-intensive applications. Effective scheduling in Grid can reduce the amount of data transferred among nodes by submitting a job to a node, where most of the requested data files are available. Data replication is another key optimization technique for reducing access latency and managing large data by storing data in a wisely manner. In this paper two algorithms are proposed, first a novel job scheduling algorithm called Combined Scheduling Strategy (CSS) that uses hierarchical scheduling to reduce the search time for an appropriate computing node. It considers the number of jobs waiting in queue, the location of required data for the job and the computing capacity of sites. Second a dynamic data replication strategy, called the Modified Dynamic Hierarchical Replication Algorithm (MDHRA) that improves file access time. This strategy is an enhanced version of Dynamic Hierarchical Replication (DHR) strategy. Data replication should be used wisely because the storage capacity of each Grid site is limited. Thus, it is important to design an effective strategy for the replication replacement. MDHRA replaces replicas based on the last time the replica was requested, number of access, and size of replica. It selects the best replica location from among the many replicas based on response time that can be determined by considering the data transfer time, the storage access latency, the replica requests that waiting in the storage queue and the distance between nodes. The simulation results demonstrate the proposed replication and scheduling strategies give better performance compared to the other algorithms. 相似文献

6.

A dynamic replica management strategy in data grid

《Journal of Network and Computer Applications》2012,35(4):1297-1303

Data Grid provides scalable infrastructure for storage resource and data files management, which supports several large scale applications. Due to limitation of available resources in grid, efficient use of the grid resources becomes an important challenge. Replication is a technique used in data grid to improve fault tolerance and to reduce the bandwidth consumption. This paper proposes a Dynamic Hierarchical Replication (DHR) algorithm that places replicas in appropriate sites i.e. best site that has the highest number of access for that particular replica. It also minimizes access latency by selecting the best replica when various sites hold replicas. The proposed replica selection strategy selects the best replica location for the users' running jobs by considering the replica requests that waiting in the storage and data transfer time. The simulated results with OptorSim, i.e. European Data Grid simulator show that DHR strategy gives better performance compared to the other algorithms and prevents unnecessary creation of replica which leads to efficient storage usage. 相似文献

7.

Job scheduling and dynamic data replication in data grid environment

Najme Mansouri Gholam Hosein Dastghaibyfard 《The Journal of supercomputing》2013,64(1):204-225

Data Grid is a geographically distributed environment that deals with large-scale data-intensive applications. Effective scheduling in Grid can reduce the amount of data transferred among nodes by submitting a job to a node, where most of the requested data files are available. Data replication is another key optimization technique for reducing access latency and managing large data by storing data in a wisely manner. In this paper, two algorithms are proposed: first, a novel job scheduling algorithm called Combined Scheduling Strategy (CSS) that considers the number of jobs waiting in queue, the location of required data for the job, and computational capability; second, a dynamic data replication strategy called Dynamic Hierarchical Replication Algorithm (DHRA) that improves file access time. DHRA stores each replica in an appropriate site, i.e., appropriate site in the requested region that has the highest number of access for that particular replica. Also, it can minimize access latency by selecting the best replica when various sites hold replicas of datasets. The simulation results demonstrate the proposed replication and scheduling strategies give better performance compared to the other algorithms. 相似文献

8.

Mohammad Shorfuzzaman Peter Graham Rasit Eskicioglu 《The Journal of supercomputing》2010,51(3):374-392

Data grids support access to widely distributed storage for large numbers of users accessing potentially many large files. Efficient access is hindered by the high latency of the Internet. To improve access time, replication at nearby sites may be used. Replication also provides high availability, decreased bandwidth use, enhanced fault tolerance, and improved scalability. Resource availability, network latency, and user requests in a grid environment may vary with time. Any replica placement strategy must be able to adapt to such dynamic behavior. In this paper, we describe a new dynamic replica placement algorithm, Popularity Based Replica Placement (PBRP), for hierarchical data grids which is guided by file “popularity”. Our goal is to place replicas close to clients to reduce data access time while still using network and storage resources efficiently. The effectiveness of PBRP depends on the selection of a threshold value related to file popularity. We also present Adaptive-PBRP (APBRP) that determines this threshold dynamically based on data request arrival rates. We evaluate both algorithms using simulation. Results for a range of data access patterns show that our algorithms can shorten job execution time significantly and reduce bandwidth consumption compared to other dynamic replication methods. 相似文献

9.

Hierarchical data replication strategy to improve performance in cloud computing

Najme MANSOURI Mohammad Masoud JAVIDI Behnam Mohammad Hasani ZADE 《Frontiers of Computer Science》2021,15(2):152501

Cloud computing environment is getting more interesting as a new trend of data management. Data replication has been widely applied to improve data access in distributed systems such as Grid and Cloud. However, due to the finite storage capacity of each site, copies that are useful for future jobs can be wastefully deleted and replaced with less valuable ones. Therefore, it is considerable to have appropriate replication strategy that can dynamically store the replicas while satisfying quality of service (QoS) requirements and storage capacity constraints. In this paper, we present a dynamic replication algorithm, named hierarchical data replication strategy (HDRS). HDRS consists of the replica creation that can adaptively increase replicas based on exponential growth or decay rate, the replica placement according to the access load and labeling technique, and finally the replica replacement based on the value of file in the future. We evaluate different dynamic data replication methods using CloudSim simulation. Experiments demonstrate that HDRS can reduce response time and bandwidth usage compared with other algorithms. It means that the HDRS can determine a popular file and replicates it to the best site. This method avoids useless replications and decreases access latency by balancing the load of sites. 相似文献

10.

A survey of dynamic replication strategies for improving data availability in data grids

Tehmina Amjad^{Author Vitae} Muhammad Sher Author VitaeAli Daud Author Vitae 《Future Generation Computer Systems》2012,28(2):337-349

Data grid is a distributed collection of storage and computational resources that are not bounded within a geophysical location. It is a fast growing area of research and providing efficient data access and maximum data availability is a challenging task. To achieve this task, data is replicated to different sites. A number of data replication techniques have been presented for data grids. All replication techniques address some attributes like fault tolerance, scalability, improved bandwidth consumption, performance, storage consumption, data access time etc. In this paper, different issues involved in data replication are identified and different replication techniques are studied to find out which attributes are addressed in a given technique and which are ignored. A tabular representation of all those parameters is presented to facilitate the future comparison of dynamic replication techniques. The paper also includes some discussion about future work in this direction by identifying some open research problems. 相似文献

11.

A study on performance of dynamic file replication algorithms for real-time file access in Data Grids 总被引：1，自引：0，他引：1

Atakan 《Future Generation Computer Systems》2009,25(8):829-839

Real-time Grid applications are emerging in many disciplines of science and engineering. In order to run these applications while meeting the associated real-time constraints with them, the Grid infrastructure should be designed to respect these constraints and allocate its computing, networking, storage, and the other resources accordingly. Furthermore, these applications involve a large number of data intensive jobs and require to access terabytes of data in real-time. On the other hand, a variety of dynamic file replication algorithms were proposed for the best-effort Data Grid environments in an attempt to decrease job completion times and save network bandwidth. Until now, there is no study in the literature which tries to elaborate on the real-time performance of these dynamic file replication algorithms. Based on this motivation, in this study, the performance of eight dynamic replication algorithms are evaluated under various Data Grid settings. For this evaluation, a process oriented and discrete-event driven simulator called DGridSim is developed. A detailed set of simulation studies are conducted using DGridSim and the results obtained are presented to reveal the real-time performance of the dynamic file replication algorithms. 相似文献

12.

Dynamic replica placement and selection strategies in data grids— A comprehensive survey

R. Kingsy Grace R. Manimegalai 《Journal of Parallel and Distributed Computing》2014

Data replication techniques are used in data grid to reduce makespan, storage consumption, access latency and network bandwidth. Data replication enhances data availability and thereby increases the system reliability. There are two steps involved in data replication, namely, replica placement and replica selection. Replica placement involves identifying the best possible node to duplicate data based on network latency and user request. Replica selection involves selecting the best replica location to access the data for job execution in the data grid. Various replica placement and selection algorithms are available in the literature. These algorithms measure and analyze different parameters such as bandwidth consumption, access cost, scalability, execution time, storage consumption and makespan. In this paper, various replica placement and selection strategies along with their merits and demerits are discussed. This paper also analyses the performance of various strategies with respect to the parameters mentioned above. In particular, this paper focuses on the dynamic replica placement and selection strategies in the data grid environment. 相似文献

13.

A PTS-PGATS based approach for data-intensive scheduling in data grids 总被引：1，自引：0，他引：1

Kenli Li Zhao Tong Dan Liu Teklay Tesfazghi Xiangke Liao 《Frontiers of Computer Science in China》2011,5(4):513-525

Grid computing is the combination of computer resources in a loosely coupled, heterogeneous, and geographically dispersed environment. Grid data are the data used in grid computing, which consists of large-scale data-intensive applications, producing and consuming huge amounts of data, distributed across a large number of machines. Data grid computing composes sets of independent tasks each of which require massive distributed data sets that may each be replicated on different resources. To reduce the completion time of the application and improve the performance of the grid, appropriate computing resources should be selected to execute the tasks and appropriate storage resources selected to serve the files required by the tasks. So the problem can be broken into two sub-problems: selection of storage resources and assignment of tasks to computing resources. This paper proposes a scheduler, which is broken into three parts that can run in parallel and uses both parallel tabu search and a parallel genetic algorithm. Finally, the proposed algorithm is evaluated by comparing it with other related algorithms, which target minimizing makespan. Simulation results show that the proposed approach can be a good choice for scheduling large data grid applications. 相似文献

14.

基于层次化调度策略和动态数据复制的网格调度方法 总被引：2，自引：0，他引：2

赖锦辉梁松《计算机应用研究》2014,31(2):412-416

针对在网格中如何有效地进行任务调度和数据复制, 以便减少任务执行时间等问题, 提出了任务调度算法（ISS）和优化动态数据复制算法（ODHRA）, 并构建一个方案将两种算法进行了有效结合。该方案采用ISS算法综合考虑任务等待队列的数量、任务需求数据的位置和站点的计算容量, 采用网络结构分级调度的方式, 配以适当的权重系数计算综合任务成本, 搜索出最佳计算节点区域; 采用ODHRA算法分析数据传输时间、存储访问延迟、等待在存储队列中的副本请求和节点间的距离, 在众多的副本中选取出最佳副本位置, 再结合副本放置和副本管理, 从而降低了文件访问时间。仿真结果表明, 提出的方案在平均任务执行时间方面, 与其他算法相比表现出了更好的性能。相似文献

15.

A multiple parallel download scheme with server throughput and client bandwidth considerations for data grids 总被引：1，自引：0，他引：1

Ruay-Shiung Ming-Huang Hau-Chin 《Future Generation Computer Systems》2008,24(8):798-805

Though computer technology advances quickly, the computing speed and storage capacity of a single computer still cannot satisfy the requirements of many applications. As a result, grids have emerged to utilize the collective power of many computers. One of them, the data grid, provides a mechanism for handling a large amount of data. One of the characteristics of a data grid is to replicate files to many different computers such that a popular file would be more available. If a grid site does not have a file, it will have to download it from other grid sites. Thus, the parallel download method, which allows a user to download different parts of a file from various computers simultaneously, is used to decrease download time.However, multiple parallel downloads will affect one another. Thus if all jobs in the grid system use parallel download, the problem of resource competition and conflict will happen. In this paper, we propose a parallel download scheme considering the server output throughput limits and client input bandwidth constraints. Experimental results show that the proposed download scheme outperforms static and dynamic parallel download schemes. 相似文献

16.

SDN-aware federation of distributed data

《Future Generation Computer Systems》2016

The introduction of software defined networking (SDN) has created an opportunity for file access services to get a view of the underlying network and to further optimize large data transfers. This opportunity is still unexplored while the amount of data that needs to be transferred is growing. Data transfers are also becoming more frequent as a result of interdisciplinary collaborations and the nature of research infrastructures. To address the needs for larger and more frequent data transfers, we propose an approach which enables file access services to use SDN. We extend the file access services developed in our earlier work by including network resources in the provisioning for large data transfers. A novel SDN-aware file transfer mechanism is prototyped for improving the performance and reliability of large data transfers on research infrastructure equipped with programmable network switches. Our results show that I/O and data-intensive scientific workflows benefit from SDN-aware file access services. 相似文献

17.

多模态医疗数据中海量小文件存储优化方法

曾梦邹北骥张文生杨雪冰朱承璋《软件学报》2023,34(3):1451-1469

Hadoop分布式文件系统(HDFS)通常用于大文件的存储和管理,当进行海量小文件的存储和计算时,会消耗大量的NameNode内存和访问时间,成为制约HDFS性能的一个重要因素.针对多模态医疗数据中海量小文件问题,提出一种基于双层哈希编码和HBase的海量小文件存储优化方法.在小文件合并时,使用可扩展哈希函数构建索引文件存储桶,使索引文件可以根据需要进行动态扩展,实现文件追加功能.在每个存储桶中,使用MWHC哈希函数存储每个文件索引信息在索引文件中的位置,当访问文件时,无须读取所有文件的索引信息,只需读取相应存储桶中的索引信息即可,从而能够在O(1)的时间复杂度内读取文件,提高文件查找效率.为了满足多模态医疗数据的存储需求,使用HBase存储文件索引信息,并设置标识列用于标识不同模态的医疗数据,便于对不同模态数据的存储管理,并提高文件的读取速度.为了进一步优化存储性能,建立了基于LRU的元数据预取机制,并采用LZ4压缩算法对合并文件进行压缩存储.通过对比文件存取性能、NameNode内存使用率,实验结果表明,所提出的算法与原始HDFS、HAR、MapFile、TypeStorage以及... 相似文献

18.

基于hybrid拓扑的数据网格副本创建策略* 总被引：1，自引：1，他引：0

卢炎生胡辉《计算机应用研究》2007,24(11):286-288

数据复制技术被广泛应用于数据网格中,以缩短数据访问时间和传输时间、降低网络带宽消耗.针对包含树型拓扑和环型拓扑的混合式网格拓扑结构,提出了一种考虑网络带宽、网络传输延迟、用户请求频率和站点可用存储空间大小等因素的副本创建策略,并引入评估函数衡量各因素的影响大小,具有良好的可靠性、可扩展性和自适应性.模拟实验的结果显示此副本创建策略可以有效降低数据平均访问时间. 相似文献

19.

A dynamic data replication strategy using access-weights in data grids 总被引：2，自引：0，他引：2

Ruay-Shiung Chang Hui-Ping Chang 《The Journal of supercomputing》2008,45(3):277-295

Data grids deal with a huge amount of data regularly. It is a fundamental challenge to ensure efficient accesses to such widely distributed data sets. Creating replicas to a suitable site by data replication strategy can increase the system performance. It shortens the data access time and reduces bandwidth consumption. In this paper, a dynamic data replication mechanism called Latest Access Largest Weight (LALW) is proposed. LALW selects a popular file for replication and calculates a suitable number of copies and grid sites for replication. By associating a different weight to each historical data access record, the importance of each record is differentiated. A more recent data access record has a larger weight. It indicates that the record is more pertinent to the current situation of data access. A Grid simulator, OptorSim, is used to evaluate the performance of this dynamic replication strategy. The simulation results show that LALW successfully increases the effective network usage. It means that the LALW replication strategy can find out a popular file and replicates it to a suitable site without increasing the network burden too much.

Ruay-Shiung ChangEmail:

相似文献

20.

Adaptive data replication strategy in cloud computing for performance improvement

Najme MANSOURI 《Frontiers of Computer Science》2016,10(5):925-935

Cloud computing is becoming a very popular word in industry and is receiving a large amount of attention from the research community. Replica management is one of the most important issues in the cloud, which can offer fast data access time, high data availability and reliability. By keeping all replicas active, the replicas may enhance system task successful execution rate if the replicas and requests are reasonably distributed. However, appropriate replica placement in a large-scale, dynamically scalable and totally virtualized data centers is much more complicated. To provide cost-effective availability, minimize the response time of applications and make load balancing for cloud storage, a new replica placement is proposed. The replica placement is based on five important parameters: mean service time, failure probability, load variance, latency and storage usage. However, replication should be used wisely because the storage size of each site is limited. Thus, the site must keep only the important replicas.We also present a new replica replacement strategy based on the availability of the file, the last time the replica was requested, number of access, and size of replica. We evaluate our algorithm using the CloudSim simulator and find that it offers better performance in comparison with other algorithms in terms of mean response time, effective network usage, load balancing, replication frequency, and storage usage. 相似文献