Similar Articles
20 similar articles found (search time: 31 ms)
1.
Evolutionary Algorithms for Allocating Data in Distributed Database Systems   (Cited by: 2, self-citations: 0, external: 2)
A major cost in executing queries in a distributed database system is the data transfer cost incurred in transferring relations (fragments) accessed by a query from different sites to the site where the query is initiated. The objective of a data allocation algorithm is to determine an assignment of fragments at different sites so as to minimize the total data transfer cost incurred in executing a set of queries. This is equivalent to minimizing the average query execution time, which is of primary importance in a wide class of distributed conventional as well as multimedia database systems. The data allocation problem, however, is NP-complete, and thus requires fast heuristics to generate efficient solutions. Furthermore, the optimal allocation of database objects highly depends on the query execution strategy employed by a distributed database system, and the given query execution strategy usually assumes an allocation of the fragments. We develop a site-independent fragment dependency graph representation to model the dependencies among the fragments accessed by a query, and use it to formulate and tackle data allocation problems for distributed database systems based on query-site and move-small query execution strategies. We have designed and evaluated evolutionary algorithms for data allocation for distributed database systems.
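The abstract above does not spell out the evolutionary operators or the cost model; as a rough, self-contained sketch of the kind of genetic algorithm it describes, the following assumes invented query frequencies and fragment sizes and a simple ship-to-query-site transfer cost:

```python
import random

# Toy problem: assign FRAGS fragments to SITES sites to minimize data transfer.
# freq[q][s] = how often query q is issued at site s (invented numbers)
# use[q][f]  = bytes of fragment f that query q must ship to its issuing site
SITES, FRAGS = 4, 6
random.seed(7)
freq = [[random.randint(0, 5) for _ in range(SITES)] for _ in range(3)]
use  = [[random.randint(1, 10) for _ in range(FRAGS)] for _ in range(3)]

def cost(alloc):  # alloc[f] = site holding fragment f
    total = 0
    for q, row in enumerate(freq):
        for s, n in enumerate(row):
            # each execution ships every fragment not stored at the query site
            total += n * sum(use[q][f] for f in range(FRAGS) if alloc[f] != s)
    return total

def evolve(pop_size=30, gens=200, mut=0.1):
    pop = [[random.randrange(SITES) for _ in range(FRAGS)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]          # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, FRAGS)      # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < mut:             # point mutation
                child[random.randrange(FRAGS)] = random.randrange(SITES)
            children.append(child)
        pop = survivors + children
    return min(pop, key=cost)

best = evolve()
print(best, cost(best))
```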

2.
The performance of a distributed database system (DDBS) can be enhanced by speeding up the computation of the data allocation, leading to faster allocation decisions, smaller data redundancy, and shorter processing time. This paper presents an integrated method for grouping the distributed sites into clusters and customizing the allocation of database fragments to the clusters and their sites. We design a high-speed clustering and allocation method to determine which fragments should be allocated to which cluster and site so as to maintain data availability and consistent system reliability, evaluate the performance achieved by this method, and demonstrate its efficiency by means of tabular and graphical representation. We tested our method over different network sites and found that it reduces the data transferred between the sites during execution, minimizes the communication cost needed for processing applications, and handles database queries while meeting their future needs.
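As an illustration of the cluster-then-allocate idea in this entry (the paper's actual clustering criteria are not given in the abstract), here is a minimal sketch with made-up communication costs and access frequencies:

```python
# Hypothetical sketch: (1) greedily cluster sites whose pairwise communication
# cost is under a threshold, (2) allocate each fragment to the cluster that
# accesses it most. All costs and frequencies are invented.
comm = {  # symmetric communication cost between sites
    ("s1", "s2"): 2, ("s1", "s3"): 9, ("s1", "s4"): 8,
    ("s2", "s3"): 9, ("s2", "s4"): 7, ("s3", "s4"): 1,
}
def cost(a, b):
    return 0 if a == b else comm.get((a, b), comm.get((b, a)))

def cluster(sites, threshold):
    clusters = []
    for s in sites:
        for c in clusters:
            if all(cost(s, m) <= threshold for m in c):  # cheap to talk to all members
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

access = {"f1": {"s1": 10, "s3": 2}, "f2": {"s4": 8, "s2": 1}}  # fragment -> site -> freq

clusters = cluster(["s1", "s2", "s3", "s4"], threshold=3)
for frag, freqs in access.items():
    best = max(clusters, key=lambda c: sum(freqs.get(s, 0) for s in c))
    print(frag, "->", best)
```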

3.
With the growth of information technology and computer networks, there is a vital need for optimal design of distributed databases with the aim of performance improvement in terms of minimizing the round-trip response time and query transmission and processing costs. To address this issue, new fragmentation, data allocation, and replication techniques are required. In this paper, we propose enhanced vertical fragmentation, allocation, and replication schemes to improve the performance of distributed database systems. The proposed fragmentation scheme clusters highly bonded attributes (i.e., attributes normally accessed together) into a single fragment in order to minimize the query processing cost. The allocation scheme is designed to find an optimized allocation that minimizes the round-trip response time. The replication scheme partially replicates the fragments to increase the local execution of queries in a way that minimizes the cost of transmitting replicas to the sites. Experimental results show that, on average, the proposed schemes reduce the round-trip response time of queries by 23% and the query processing cost by 15%, compared to related work.
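A common way to cluster "highly bonded" attributes is an attribute affinity measure; the sketch below is a hypothetical illustration of that idea, not the paper's scheme, using invented query frequencies:

```python
# Attributes that queries access together get a high affinity and are grouped
# into the same vertical fragment.
queries = [                      # attributes used by each query, with its frequency
    ({"name", "email"}, 20),
    ({"name", "email", "phone"}, 5),
    ({"salary", "dept"}, 15),
]
attrs = sorted({a for used, _ in queries for a in used})

def affinity(a, b):
    # total frequency of queries that use both attributes
    return sum(f for used, f in queries if a in used and b in used)

def fragment(threshold):
    frags = []
    for a in attrs:
        for f in frags:
            if all(affinity(a, b) >= threshold for b in f):
                f.append(a)
                break
        else:
            frags.append([a])
    return frags

print(fragment(threshold=10))   # [['dept', 'salary'], ['email', 'name'], ['phone']]
```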

4.
Parallel Computing, 2014, 40(10): 697-709
In order to run tasks in a parallel and load-balanced fashion, existing scientific parallel applications such as mpiBLAST introduce a data-initializing stage to move database fragments from shared storage to local cluster nodes. Unfortunately, with the exponentially increasing size of sequence databases in today's big data era, such an approach is inefficient. In this paper, we develop SDAFT, a scalable data access framework that solves the data movement problem for scientific applications dominated by read operations for data analysis. SDAFT employs a distributed file system (DFS) to provide scalable data access for parallel sequence searches. SDAFT consists of two interlocked components: (1) a data-centric load-balanced scheduler (DC-scheduler) to enforce data-process locality and (2) a translation layer to translate conventional parallel I/O operations into HDFS I/O. By experimenting with our SDAFT prototype system on real-world databases and queries across a wide variety of computing platforms, we found that SDAFT can reduce I/O cost by a factor of 4–10 and double the overall execution performance as compared with existing schemes.
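A minimal sketch of what a data-centric, locality-first scheduler in the spirit of the DC-scheduler might look like (the replica map and load numbers are invented; the real system works against HDFS):

```python
# Run a task on a node that already holds its fragment when possible,
# otherwise fall back to the least-loaded node.
replicas = {"frag-1": {"node-a", "node-b"}, "frag-2": {"node-c"}}
load = {"node-a": 2, "node-b": 0, "node-c": 1}

def schedule(fragment):
    local = replicas.get(fragment, set())
    candidates = local if local else load.keys()   # locality first
    node = min(candidates, key=lambda n: load[n])  # then balance load
    load[node] += 1
    return node

for task in ["frag-1", "frag-2", "frag-1", "frag-9"]:
    print(task, "->", schedule(task))
```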

5.
Due to the gradual expansion in the volume of data used in social networks and cloud computing, the term "big data" has appeared, with its attendant challenges of storing immense datasets. Many tools and algorithms have appeared to handle the challenges of storing big data. NoSQL databases, such as Cassandra and MongoDB, are designed with a novel data management system that can handle and process huge volumes of data. Partitioning data in NoSQL databases is considered one of the critical challenges in database design. In this paper, a MapReduce Rendezvous Hashing-Based Virtual Hierarchies (MR-RHVH) framework is proposed for scalable partitioning of the Cassandra NoSQL database. The MapReduce framework is used to implement MR-RHVH on Cassandra to enhance its performance in highly distributed environments. MR-RHVH distributes the nodes to rendezvous regions based on a proposed Adopted Virtual Hierarchies strategy, with each region responsible for a set of nodes. In addition, a proposed Bloom filter evaluator is used to ensure the accurate allocation of keys to nodes in each region. Moreover, a number of experiments were performed to evaluate the performance of the MR-RHVH framework, using YCSB for database benchmarking. The results show a high scalability rate and lower time consumption for the MR-RHVH framework compared with different recent systems.
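The core primitive named in this entry, rendezvous (highest-random-weight) hashing, is standard and easy to show; the sketch below covers only plain HRW, not the virtual hierarchies or Bloom filter evaluator that MR-RHVH adds on top:

```python
import hashlib

# Every key is placed on the node with the highest hash score, so adding or
# removing a node only moves the keys that scored highest on that node.
def score(node, key):
    return int.from_bytes(hashlib.sha256(f"{node}:{key}".encode()).digest(), "big")

def owner(nodes, key):
    return max(nodes, key=lambda n: score(n, key))

nodes = ["cass-1", "cass-2", "cass-3"]
for k in ["user:42", "user:43", "user:44"]:
    print(k, "->", owner(nodes, k))
```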

6.
A repartitioning hypergraph model for dynamic load balancing   (Cited by: 1, self-citations: 0, external: 1)
In parallel adaptive applications, the computational structure of the applications changes over time, leading to load imbalances even though the initial load distributions were balanced. To restore balance and to keep communication volume low in further iterations of the applications, dynamic load balancing (repartitioning) of the changed computational structure is required. Repartitioning differs from static load balancing (partitioning) due to the additional requirement of minimizing migration cost to move data from an existing partition to a new partition. In this paper, we present a novel repartitioning hypergraph model for dynamic load balancing that accounts for both communication volume in the application and migration cost to move data, in order to minimize the overall cost. The use of a hypergraph-based model allows us to accurately model communication costs rather than approximate them with graph-based models. We show that the new model can be realized using hypergraph partitioning with fixed vertices and describe our parallel multilevel implementation within the Zoltan load balancing toolkit. To the best of our knowledge, this is the first implementation for dynamic load balancing based on hypergraph partitioning. To demonstrate the effectiveness of our approach, we conducted experiments on a Linux cluster with 1024 processors. The results show that, in terms of reducing total cost, our new model compares favorably to the graph-based dynamic load balancing approaches, and multilevel approaches improve the repartitioning quality significantly.
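The objective described here, trading off communication volume against migration cost, can be illustrated without the hypergraph machinery; the toy sketch below scores candidate partitions with invented sizes and communication pairs:

```python
# total cost = communication cost of the new partition + cost of migrating
# data from the old partition. All numbers are invented.
sizes = {"a": 4, "b": 4, "c": 2, "d": 2}          # data size per task
old   = {"a": 0, "b": 0, "c": 1, "d": 1}          # current part per task

def migration(new):
    return sum(sizes[t] for t in new if new[t] != old[t])

def communication(new):   # stand-in: these task pairs exchange messages
    pairs = [("a", "b"), ("c", "d"), ("b", "c")]
    return sum(3 for x, y in pairs if new[x] != new[y])

candidates = [
    {"a": 0, "b": 0, "c": 1, "d": 1},   # keep everything (no migration)
    {"a": 0, "b": 1, "c": 1, "d": 1},   # rebalance by moving b
]
best = min(candidates, key=lambda p: communication(p) + migration(p))
print(best, communication(best) + migration(best))
```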

7.
Chen Xiaonian, Computer Engineering and Design (计算机工程与设计), 2012, 33(8): 3069-3073, 3116
Existing consistency maintenance methods for collaborative Web applications incur heavy server overhead. To address this problem, a consistency maintenance model based on document partitioning is proposed. The model introduces the idea of document partitioning on top of the operational transformation algorithm SLOT (symmetric linear operational transformation). With the goal of reducing the server's communication and memory costs, and taking into account changes in the number of users and in operation frequency, a dynamic document partitioning strategy and its implementation algorithm are given. Simulation results show that the model can effectively reduce the server's communication and memory costs in large-scale collaborative applications.
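Operational transformation, which SLOT builds on, can be shown in its simplest insert-vs-insert case; this sketch is a generic OT illustration, not the SLOT algorithm itself:

```python
# Transform a remote insert against a concurrent local one so both replicas
# converge to the same document. (Ties at equal positions need a site-id
# tie-break, omitted here.)
def transform(remote_pos, local_pos, local_len):
    # shift the remote insert right if the local insert landed before it
    return remote_pos + local_len if local_pos <= remote_pos else remote_pos

doc = "hello world"
local  = (5, ",")        # we inserted "," at index 5
remote = (6, "!")        # peer concurrently inserted "!" at index 6 of the old doc

doc = doc[:local[0]] + local[1] + doc[local[0]:]        # apply local
pos = transform(remote[0], local[0], len(local[1]))     # transform remote
doc = doc[:pos] + remote[1] + doc[pos:]
print(doc)   # "hello, !world" on every replica
```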

8.
The growing popularity of massively accessed Web applications that store and analyze large amounts of data, with Facebook, Twitter, and Google Search as prominent examples, has posed new requirements that greatly challenge traditional RDBMS. In response to this reality, a new way of creating and manipulating data stores, known as NoSQL databases, has arisen. This paper reviews implementations of NoSQL databases in order to provide an understanding of current tools and their uses. First, NoSQL databases are compared with traditional RDBMS and important concepts are explained. Only databases that allow persisting data and distributing them along different computing nodes are within the scope of this review. Moreover, NoSQL databases are divided into different types: key-value, wide-column, document-oriented, and graph-oriented. In each case, a comparison of available databases is carried out based on their most important features.

9.
Horizontal partitioning is a logical database design technique which facilitates efficient execution of queries by reducing the irrelevant objects accessed. Given a set of most frequently executed queries on a class, horizontal partitioning generates horizontal class fragments (each of which is a subset of the object instances of the class) that meet the queries' requirements. There are two types of horizontal class partitioning, namely, primary and derived. Primary horizontal partitioning of a class is performed using predicates of queries accessing the class. Derived horizontal partitioning of a class is the partitioning of a class based on the horizontal partitioning of another class. We present algorithms for both primary and derived horizontal partitioning, discuss some issues in derived horizontal partitioning, and present their solutions. There are two important aspects of supporting database operations on a partitioned database, namely, fragment localization for queries and object migration for updates. Fragment localization deals with identifying the horizontal fragments that contribute to the result of a query, and object migration deals with migrating objects from one class fragment to another due to updates. We provide novel solutions to these two problems, and finally we show the utility of horizontal partitioning for query processing.
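Primary horizontal partitioning by query predicates is typically formalized with minterm predicates; the following is a generic sketch of that idea with illustrative predicates and objects, not the paper's algorithm:

```python
from itertools import product

# The simple predicates that queries use generate minterms, and each object
# instance goes to the fragment whose minterm it satisfies.
predicates = [
    lambda o: o["city"] == "Oslo",
    lambda o: o["salary"] > 50_000,
]

def minterm_of(obj):
    # one fragment per truth-assignment of the simple predicates
    return tuple(p(obj) for p in predicates)

objects = [
    {"city": "Oslo", "salary": 60_000},
    {"city": "Pune", "salary": 40_000},
    {"city": "Oslo", "salary": 40_000},
]
fragments = {m: [] for m in product([True, False], repeat=len(predicates))}
for o in objects:
    fragments[minterm_of(o)].append(o)

for m, objs in fragments.items():
    print(m, objs)
```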

10.
Logic flaws within web applications allow malicious operations to be triggered against the back-end database. Existing approaches to identifying logic flaws in database accesses are strongly tied to structured query language (SQL) statement construction and cannot be applied to the new generation of web applications that use Not-only-SQL (NoSQL) databases as the storage tier. In this paper, we present Lom, a black-box approach for discovering many categories of logic flaws within MongoDB-based web applications. Our approach introduces a MongoDB operation model to support the new features of MongoDB and models the application logic as a Mealy finite state machine. During the testing phase, test inputs which emulate state violation attacks are constructed to identify logic flaws at each application state. We apply Lom to several MongoDB-based web applications and demonstrate its effectiveness.
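Modelling application logic as a Mealy machine, as this entry describes, can be sketched directly; the states, requests, and outputs below are hypothetical, not Lom's actual model:

```python
# Transitions are (state, request) -> (next_state, output); a request with no
# transition from the current state is a potential state-violation logic flaw.
transitions = {
    ("anon", "login"):    ("user", "session-cookie"),
    ("user", "add-item"): ("user", "cart-updated"),
    ("user", "checkout"): ("paid", "order-created"),
}

def run(requests):
    state = "anon"
    for req in requests:
        nxt = transitions.get((state, req))
        if nxt is None:
            print(f"VIOLATION: {req!r} not allowed in state {state!r}")
            return
        state, output = nxt
        print(f"{req} -> {output} (state={state})")

run(["login", "add-item", "checkout"])   # legal flow
run(["checkout"])                        # skips login: flagged
```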

11.
The requirements for storage space and computational power of large-scale applications are increasing rapidly. Clusters seem to be the most attractive architecture for such applications, due to their low cost and high scalability. On the other hand, smart disk systems, with their large storage capacities and growing computational power, are becoming increasingly popular. In this work, we compare the performance of these architectures with a single host-based system using representative queries from Decision Support System (DSS) databases. We show how to implement individual database operations in the smart disk system and also show how to optimize the execution of a whole query by bundling frequently occurring operations together and executing the bundle in a single invocation. Besides decreasing the overall execution time, operation bundling also offers an easy-to-program and easy-to-use interface to access the data on smart disks. We also present a protocol for minimizing the communication time in the smart-disk-based system. To measure the response times, we have developed DBsim, an accurate simulator of database operations for single host-based, cluster-based, and smart-disk-based systems. Using this simulator, we illustrate that the smart disk architecture offers substantial benefits in terms of overall query execution times of the TPC-D benchmark suite. In particular, the average response time of the smart disk architecture for the representative queries from the TPC-D benchmark in our base configuration is 71% smaller than the response time on the single host-based system and 4.2% smaller than the response time on the fastest cluster architecture. We also demonstrate the effectiveness of operation bundling and compare the scalability of the cluster-based and smart-disk-based systems.

12.
With the advent of the Internet era and the rapid development of the IT industry, NoSQL databases have been increasingly widely adopted for their excellent processing capability in big data environments. However, because each NoSQL database has its own data model, they differ in how data is organized. When exchanging data between NoSQL databases, these model differences create an impedance mismatch in data transfer: business data packaged in the source database's data model may not be directly parsable by the target database, requiring an additional model-adaptation step that reorganizes the business data according to the target database's data model before filtering and storage. To this end, this paper defines a data description model that captures the data-model characteristics of NoSQL databases and describes how they organize data, together with an algorithm for evaluating the distance between NoSQL data models. Based on the description model and the distance evaluation algorithm, a generic data model can be designed and implemented that converts between the data models of the NoSQL databases involved in an exchange; application code need only be written against this generic data model, independently of the specific data models of the NoSQL databases participating in the exchange.
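The abstract does not define the description model or the distance algorithm; as one plausible reading, the sketch below encodes each data model as a weighted feature vector and measures distance as weighted feature mismatch (all features and weights are invented):

```python
# Describe each NoSQL data model as a set of boolean features and measure
# model distance as a weighted count of differing features.
features = ["nesting", "schema", "typed_values", "secondary_index"]
weights  = {"nesting": 3, "schema": 2, "typed_values": 1, "secondary_index": 1}

models = {
    "document":  {"nesting": True,  "schema": False, "typed_values": True,  "secondary_index": True},
    "key-value": {"nesting": False, "schema": False, "typed_values": False, "secondary_index": False},
    "wide-col":  {"nesting": False, "schema": True,  "typed_values": True,  "secondary_index": True},
}

def distance(a, b):
    return sum(weights[f] for f in features if models[a][f] != models[b][f])

for pair in [("document", "key-value"), ("document", "wide-col"), ("key-value", "wide-col")]:
    print(pair, distance(*pair))
```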

13.
In the last decade, we have observed an unprecedented development in molecular biology. An extremely high number of organisms have been sequenced in genome projects and included in genomic databases for further analysis. These databases present an exponential growth rate and are intensively accessed daily, all over the world. Once a sequence is obtained, its function and/or structure must be determined. Direct experimentation is considered the most reliable method to do that. However, the experiments that must be conducted are very complex and time consuming. For this reason, it is far more productive to use computational methods to infer biological information from a sequence. This is usually done by comparing the new sequence with sequences that have already had their characteristics determined. BLAST is the most widely used heuristic tool for sequence comparison. Thousands of BLAST searches are made daily, all over the world. In order to further reduce BLAST execution time, cluster and grid environments can be effectively used. This paper proposes and evaluates an adaptive task allocation framework to perform BLAST searches in a grid environment. The framework, called PackageBLAST, provides an infrastructure that executes distributed BLAST genomic database comparisons. In addition, it is flexible, since the user can choose or incorporate new task allocation strategies. Furthermore, we propose a mechanism to compute grid nodes' execution weight, adapting the chosen allocation policy to the observed computational power and local load of the nodes. Our results show very good speedups. For instance, in a 16-machine heterogeneous grid testbed, a speedup of 14.59 was achieved, reducing the BLAST execution time from 30.88 min to 2.11 min. Also, we show that the adaptive task allocation strategy was able to successfully handle the complexity of a grid environment.
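The abstract describes weighting grid nodes by computational power and local load; the formula below is an invented stand-in for whatever PackageBLAST actually computes, shown only to make the idea concrete:

```python
# A node's share of the remaining work grows with its benchmarked power and
# shrinks with its observed load. Numbers are illustrative.
nodes = {  # name: (relative_power, current_load in [0, 1])
    "n1": (1.0, 0.10),
    "n2": (2.0, 0.50),
    "n3": (0.5, 0.00),
}

def weight(power, load):
    return power * (1.0 - load)

total = sum(weight(*v) for v in nodes.values())
packages = 100    # database fragments ("packages") left to distribute
for name, v in nodes.items():
    share = round(packages * weight(*v) / total)
    print(f"{name}: {share} packages")   # rounded shares may not sum exactly to 100
```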

14.
Current information technologies generate large amounts of data for management or further analysis, storing it in NoSQL databases which provide horizontal scaling and high performance, supporting many read/write operations per second. NoSQL column-oriented databases, such as Cassandra and HBase, are usually modelled following a query-driven approach, resulting in denormalized databases where the same data can be repeated in several tables. Maintaining data integrity therefore relies on client applications to ensure that, for data changes that occur, the affected tables will be appropriately updated. We devise a method called MDICA that, given a data insertion at a conceptual level, determines the required actions to maintain database integrity in column-oriented databases. The method is implemented for Cassandra database applications. MDICA is based on the definition of (1) rules to determine the tables that will be impacted by the insertion, (2) procedures to generate the statements to ensure data integrity, and (3) messages to warn the user about errors or potential problems. This method helps developers in two ways: generating the statements needed to maintain data integrity, and producing messages to avoid problems such as loss of information, redundant data, or gaps of information in tables.
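A sketch of the fan-out rule at the heart of this idea: under a hypothetical denormalized schema, one conceptual insert must produce one statement per table that carries the entity's columns (table and column names are made up; statements are only generated as strings here, not executed):

```python
# A denormalized schema repeats the same logical entity in several query
# tables, so one conceptual insert fans out to every table holding its columns.
schema = {  # table -> columns it stores for the "user" entity (invented)
    "users_by_id":    ["id", "name", "email"],
    "users_by_email": ["email", "id", "name"],
}

def statements_for_insert(entity):
    stmts = []
    for table, cols in schema.items():
        missing = [c for c in cols if c not in entity]
        if missing:  # warn instead of writing an incomplete row
            print(f"warning: {table} not updated, missing {missing}")
            continue
        placeholders = ", ".join("%s" for _ in cols)
        stmts.append((f"INSERT INTO {table} ({', '.join(cols)}) "
                      f"VALUES ({placeholders})",
                      [entity[c] for c in cols]))
    return stmts

for sql, params in statements_for_insert({"id": 7, "name": "Ada", "email": "ada@x.io"}):
    print(sql, params)
```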

15.
In distributed database systems, tables are frequently fragmented and replicated over a number of sites in order to reduce network communication costs. How to fragment, when to replicate, and how to allocate the fragments to the sites are challenging problems that have previously been solved either by static fragmentation, replication, and allocation, or based on a priori query analysis. Many emerging applications of distributed database systems generate very dynamic workloads with frequent changes in access patterns from different sites. In such contexts, continuous refragmentation and reallocation can significantly improve performance. In this paper we present DYFRAM, a decentralized approach for dynamic table fragmentation and allocation in distributed database systems based on observation of the access patterns of sites to tables. The approach performs fragmentation, replication, and reallocation based on recent access history, aiming at maximizing the number of local accesses compared to accesses from remote sites. We show through simulations and experiments on the DASCOSA distributed database system that the approach significantly reduces communication costs for typical access patterns, thus demonstrating the feasibility of our approach.
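The access-history heuristic described here can be sketched with a sliding window; the window size, minimum sample, and migration threshold below are invented, not DYFRAM's actual parameters:

```python
from collections import Counter, deque

# Track recent accesses per fragment and migrate toward the site that
# dominates them.
WINDOW, MIN_SAMPLE, THRESHOLD = 100, 20, 0.6
history = deque(maxlen=WINDOW)   # recent (fragment, site) accesses
placement = {"frag-1": "site-A"}

def record(fragment, site):
    history.append((fragment, site))
    counts = Counter(s for f, s in history if f == fragment)
    total = sum(counts.values())
    site_best, n = counts.most_common(1)[0]
    if (total >= MIN_SAMPLE and site_best != placement[fragment]
            and n / total > THRESHOLD):
        print(f"migrate {fragment}: {placement[fragment]} -> {site_best}")
        placement[fragment] = site_best

for _ in range(70):
    record("frag-1", "site-B")   # site-B becomes the dominant accessor
print(placement)
```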

16.
Main-memory databases are one approach to supporting high-performance information processing; for a main-memory database, the storage structure and access methods are key. This paper presents an improved graph-theoretic method for organizing and accessing main-memory databases, and discusses the advantages this method brings to several basic query-processing operations. Finally, performance is analyzed in terms of both storage space and operation execution time; the results show that the improved graph-theoretic method delivers good performance.

17.
Wide-column NoSQL databases are an important class of NoSQL (Not only SQL) databases which scale horizontally and feature high access performance on sparse tables. With current trends towards big Data Warehouses (DWs), it is attractive to run existing business intelligence/data warehousing applications on higher volumes of data in wide-column NoSQL databases for low latency, by mapping multidimensional models to wide-column NoSQL models or using additional SQL add-ons. For example, applications like retail management can run over integrated data sets stored in big DWs or in the cloud to capture current item-selling trends. Many of these systems also employ Snapshot Isolation (SI) as a concurrency control mechanism to achieve high throughput for read-heavy workloads. SI works well in a DW environment, as analytical queries can work on (consistent) snapshots and are not impacted by concurrent update jobs performed by online incremental Extract-Transform-Load (ETL) flows that refresh fact/dimension tables. However, the snapshot made available in the DW is often stale: at the moment an analytical query is issued, the source updates (e.g. in a remote retail store) may not yet have been extracted and processed by the ETL process, due to high input data volume or slow processing speed. This staleness may cause incorrect results for time-critical decision support queries. To address this problem, snapshots which are supposed to be accessed by analytical queries need first to be maintained by the corresponding ETL flows to reflect source updates, based on given freshness needs. Snapshot maintenance in this work means maintaining the distributed data partitions that are required by a query. Since most NoSQL databases are not ACID compliant and do not provide full-fledged distributed transaction support, a snapshot may be derived inconsistently when its data partitions are updated by different ETL maintenance jobs. This paper describes an extended version of the HBelt system [1], which tightly integrates the wide-column NoSQL database HBase with a clustered, pipelined ETL engine. Our objective is to efficiently refresh HBase tables with remote source updates while guaranteeing a consistent snapshot across distributed partitions for each scan request in analytical queries. A consistency model is defined and implemented to address this so-called distributed snapshot maintenance. To achieve this, ETL jobs and analytical queries are scheduled in a distributed processing environment. In addition, a partitioned, incremental ETL pipeline is introduced to increase the performance of ETL (update) jobs. We validate the efficiency gains in terms of data pipelining and data partitioning using the TPC-DS benchmark, which simulates a modern decision support system for a retail product supplier. Experimental results show that high query throughput can be achieved in HBelt when distributed, refreshed snapshots are demanded.

18.
Data Partitioning for Parallel Spatial Join Processing   (Cited by: 1, self-citations: 0, external: 1)
The cost of spatial join processing can be very high because of the large sizes of spatial objects and the computation-intensive spatial operations. While parallel processing seems a natural solution to this problem, it is not clear how spatial data can be partitioned for this purpose. Various spatial data partitioning methods are examined in this paper. A framework combining the data-partitioning techniques used by most parallel join algorithms in relational databases and the filter-and-refine strategy for spatial operation processing is proposed for parallel spatial join processing. Object duplication caused by multi-assignment in spatial data partitioning can result in extra CPU cost as well as extra communication cost. We find that the key to overcome this problem is to preserve spatial locality in task decomposition. In this paper we show that a near-optimal speedup can be achieved for parallel spatial join processing using our new algorithms.
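Multi-assignment and its duplicate problem are easy to see in a uniform-grid sketch of the filter step (the refine step on exact geometry is omitted); the boxes and grid size are illustrative:

```python
# An object's bounding box is assigned to every grid cell it overlaps (hence
# duplicates), the join is filtered per cell, and pairs are de-duplicated.
GRID = 10  # square cells of side 10

def cells(box):  # box = (xmin, ymin, xmax, ymax)
    x0, y0, x1, y1 = box
    return {(cx, cy)
            for cx in range(int(x0) // GRID, int(x1) // GRID + 1)
            for cy in range(int(y0) // GRID, int(y1) // GRID + 1)}

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

R = {"r1": (2, 2, 14, 8), "r2": (30, 30, 34, 34)}
S = {"s1": (12, 4, 18, 9), "s2": (31, 28, 33, 36)}

buckets = {}
for name, box in list(R.items()) + list(S.items()):
    for c in cells(box):
        buckets.setdefault(c, []).append((name, box))

pairs = set()
for items in buckets.values():
    for rn, rb in [i for i in items if i[0] in R]:
        for sn, sb in [i for i in items if i[0] in S]:
            if overlaps(rb, sb):
                pairs.add((rn, sn))   # set() removes duplicates from multi-assignment
print(pairs)
```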

19.
Changqing Li, Jianhua Gu, Software, 2019, 49(3): 401-422
As applications with big data in cloud computing environments grow, many existing systems expect to expand their services to support the dramatic increase of data. Modern software development for services computing and cloud computing software systems is no longer based on a single database but on existing multidatabases, and this convergence requires new software architecture design. This paper proposes an integration approach to support a hybrid database architecture, including MySQL, MongoDB, and Redis, making it possible for users to query data simultaneously from both relational SQL systems and NoSQL systems in a single SQL query. Two mechanisms are provided, for constructing Redis's indexes and for semantic transformation between SQL and the MongoDB API, to add the SQL feature to these NoSQL databases. With the proposed approach, hybrid database systems can be used in a flexible manner, i.e., access can target either the relational database or NoSQL, depending on the size of the data. The approach can effectively reduce development complexity and improve development efficiency of software systems with multidatabases. This work fills a gap previously overlooked in the field and contributes to the further development of NoSQL technology.
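One way to picture the single-query-over-many-stores idea is a table-to-backend registry behind a uniform adapter interface; the sketch below uses in-memory stubs rather than real MySQL/MongoDB/Redis drivers, and is not the paper's architecture:

```python
# A registry maps each "table" to the store that owns it, and a thin adapter
# per store answers a uniform find() call.
class MemoryAdapter:            # stand-in for a concrete driver wrapper
    def __init__(self, rows):
        self.rows = rows
    def find(self, **filters):
        return [r for r in self.rows
                if all(r.get(k) == v for k, v in filters.items())]

backends = {
    "orders":   MemoryAdapter([{"id": 1, "user": "ada"}]),          # "MySQL"
    "sessions": MemoryAdapter([{"token": "xyz", "user": "ada"}]),   # "Redis"
    "events":   MemoryAdapter([{"user": "ada", "kind": "click"}]),  # "MongoDB"
}

def query(table, **filters):
    return backends[table].find(**filters)   # one call, any store

print(query("orders", user="ada"))
print(query("events", kind="click"))
```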

20.
A Distribution Design Methodology for Object DBMS   (Cited by: 1, self-citations: 0, external: 1)
The design of distributed databases involves making decisions on the fragmentation and placement of data and programs across the sites of a computer network. The first phase of the distribution design in a top-down approach is the fragmentation phase, which clusters in fragments the information accessed simultaneously by applications. Most distribution design algorithms propose a horizontal or vertical class fragmentation. However, the user has no assistance in the choice between these techniques. In this work we present a detailed methodology for the design of distributed object databases that includes: (i) an analysis phase, to indicate the most adequate fragmentation technique to be applied in each class of the database schema; (ii) a horizontal class fragmentation algorithm, and (iii) a vertical class fragmentation algorithm. Basically, the analysis phase is responsible for driving the choice between the horizontal and the vertical partitioning techniques, or even the combination of both, in order to assist distribution designers in the fragmentation phase of object databases. Experiments using our methodology have resulted in fragmentation schemas offering a high degree of parallelism together with an important reduction of irrelevant data.
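The analysis phase's choice between horizontal, vertical, or hybrid fragmentation can be caricatured as a simple decision rule; the heuristics below (predicates suggest horizontal, narrow attribute sets suggest vertical) are invented stand-ins for the paper's actual criteria:

```python
# Queries that filter instances by predicates suggest horizontal fragmentation,
# queries touching narrow attribute sets suggest vertical, both suggest hybrid.
def recommend(queries, all_attrs):
    selective = any(q["predicates"] for q in queries)
    narrow = all(len(q["attrs"]) <= len(all_attrs) // 2 for q in queries)
    if selective and narrow:
        return "hybrid (horizontal + vertical)"
    if selective:
        return "horizontal"
    if narrow:
        return "vertical"
    return "no fragmentation"

queries = [
    {"attrs": {"name", "city"}, "predicates": ["city = 'Oslo'"]},
    {"attrs": {"salary"},       "predicates": []},
]
print(recommend(queries, all_attrs={"name", "city", "salary", "dept", "email"}))
```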
