Indexing and querying XML using extended Dewey labeling scheme   (Total citations: 1; self-citations: 0; citations by others: 1)
Finding all the occurrences of a tree pattern in an XML database is a core operation for efficient evaluation of XML queries. The Dewey labeling scheme is commonly used to label an XML document to facilitate XML query processing by recording information on the path of an element. In order to improve the efficiency of XML tree pattern matching, we introduce a novel labeling scheme, called extended Dewey, which effectively extends the existing Dewey labeling scheme to combine the types and identifiers of elements in a label, and to avoid the scan of labels for internal query nodes to accelerate query processing (in I/O cost). Based on extended Dewey, we propose a series of holistic XML tree pattern matching algorithms. We first present TJFast to answer an XML twig pattern query. To efficiently answer a generalized XML tree pattern, we then propose GTJFast, an optimization that exploits the non-output nodes. In addition, we propose TJFastTL and GTJFastTL based on the tag + level data partition scheme to further reduce I/O costs by level pruning. Finally, we report our comprehensive experimental results to show that our set of XML tree pattern matching algorithms is superior to existing approaches in terms of the number of elements scanned, the size of intermediate results and query performance.
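As a rough illustration of why path-recording labels help query processing, the sketch below uses plain Dewey labels, not the extended Dewey scheme of the abstract, whose encoding of element types is not described here. The tuple representation and the function names are assumptions made for this example only.

```python
from typing import Tuple

# A plain Dewey label records the child positions on the path from the root,
# e.g. (1, 2, 3) is the 3rd child of the 2nd child of the root (1,).
Dewey = Tuple[int, ...]


def is_ancestor(a: Dewey, d: Dewey) -> bool:
    """True if the node labeled `a` is a proper ancestor of the node labeled `d`."""
    return len(a) < len(d) and d[:len(a)] == a


def is_parent(a: Dewey, d: Dewey) -> bool:
    """True if the node labeled `a` is the parent of the node labeled `d`."""
    return len(d) == len(a) + 1 and d[:len(a)] == a


def lowest_common_ancestor(a: Dewey, b: Dewey) -> Dewey:
    """The longest common prefix of two labels is the label of their LCA."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]
```

Both structural tests look only at the two labels, so no access to the document itself is needed.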

MapReduce in MPI for Large-scale graph algorithms   (Total citations: 1; self-citations: 0; citations by others: 1)

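Existing matrix multiplication algorithms cannot handle large-scale and very-large-scale matrices. With the introduction of the MapReduce programming framework, parallel matrix multiplication has become the main approach to large-matrix computation. This paper summarizes parallel implementations of matrix multiplication on the MapReduce programming model and proposes a strategy for high-performance large-matrix multiplication: balancing the computation performed by each worker node against the amount of data that must be transferred over the network. Experiments show that the parallel implementation clearly outperforms the traditional single-machine algorithm on large matrices, and that its performance improves further as the number of nodes in the cluster grows.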
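The trade-off described above can be seen in a minimal in-memory sketch of the textbook one-pass MapReduce matrix multiplication. This is not the paper's implementation: the function names are hypothetical and the shuffle is simulated with a dictionary rather than a real cluster. Each element is replicated once per output cell it contributes to, which keeps every reducer's work small but inflates the data sent over the network.

```python
from collections import defaultdict


def map_phase(A, B):
    """Emit ((i, k), (tag, j, value)) pairs keyed by output cell C[i][k]."""
    n_rows, n_cols = len(A), len(B[0])
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            for k in range(n_cols):      # A[i][j] is needed by every C[i][k]
                yield (i, k), ("A", j, a)
    for j, row in enumerate(B):
        for k, b in enumerate(row):
            for i in range(n_rows):      # B[j][k] is needed by every C[i][k]
                yield (i, k), ("B", j, b)


def reduce_phase(values):
    """Join A and B entries on the inner index j and sum the products."""
    a_vals, b_vals = {}, {}
    for tag, j, v in values:
        (a_vals if tag == "A" else b_vals)[j] = v
    return sum(a_vals[j] * b_vals[j] for j in a_vals if j in b_vals)


def mapreduce_matmul(A, B):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    groups = defaultdict(list)
    for key, value in map_phase(A, B):
        groups[key].append(value)
    C = [[0] * len(B[0]) for _ in A]
    for (i, k), values in groups.items():
        C[i][k] = reduce_phase(values)
    return C
```

Grouping rows and columns into blocks would shift the balance the other way: fewer replicated records cross the network, at the price of more computation per worker.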

XML documents are often viewed as trees (basically the parse tree of the document), and queries over such documents typically test for ancestor relationships among tree nodes. Search engines process such queries using an index structure summarizing the ancestor relations. In the index, each document item (tree node) is identified using some logical id (node label), such that, given two labels, the engine can determine the ancestor relationship between the corresponding nodes. The length of the labels is a main factor of the index size. Therefore, reducing this length, even by a constant factor, is a critical issue. In this work we consider the following problem. Given a rooted XML tree T, label the nodes of T in the most compact way such that given the labels of two nodes, one can determine in constant time, by looking at the labels only, whether one node is an ancestor of the other. Labelings currently being used are all variants of the following interval scheme. Number the leaves, say from left to right, and label each node with a pair consisting of the numbers of its smallest and largest leaf descendants. An ancestor query then amounts to an interval containment test on the labels. The maximum label length using this scheme is 2 log n, where n is the number of nodes in the tree. (All logarithms in this paper are to base 2.) The focus of this work is finding a scheme that works best in practice on real XML data. We suggest an orthogonal prefix-based approach, where the labeling is such that an ancestor query roughly amounts to testing whether one label is a prefix of the other. We present several new labeling schemes based on this approach and analyze their performance both theoretically and empirically.
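The interval scheme described above can be sketched as follows. The dictionary-based tree representation and the function names are assumptions for illustration. In this simplified form the containment test is non-strict, so it answers "ancestor or self", and a node with exactly one child receives the same interval as that child.

```python
def label_tree(tree, root):
    """Label each node with (smallest leaf number, largest leaf number).

    `tree` maps a node to the ordered list of its children; nodes that do
    not appear as keys (or map to an empty list) are leaves.
    """
    labels = {}
    next_leaf = 0

    def visit(node):
        nonlocal next_leaf
        children = tree.get(node, [])
        if not children:
            next_leaf += 1               # number leaves from left to right
            labels[node] = (next_leaf, next_leaf)
            return
        for child in children:
            visit(child)
        labels[node] = (labels[children[0]][0], labels[children[-1]][1])

    visit(root)
    return labels


def contains(a, d):
    """Interval containment: True if label `a` covers label `d`."""
    return a[0] <= d[0] and d[1] <= a[1]
```

For example, with `tree = {"r": ["a", "b"], "a": ["c", "d"]}` the leaves `c`, `d`, `b` are numbered in that order, and `contains(labels["a"], labels["d"])` holds while `contains(labels["b"], labels["c"])` does not.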

In recent years, the MapReduce framework has become one of the most popular parallel computing platforms for processing big data. MapReduce is used by companies such as Facebook, IBM, and Google to process or analyze massive data sets. Since the approach is frequently used for industrial solutions, algorithms based on the MapReduce framework have gained significant attention within the scientific community. Subgraph isomorphism is a fundamental graph theory problem. Finding small patterns in large graphs is a core challenge in the analysis of applications with big data sets. This paper introduces two novel algorithms, which are capable of finding matching patterns in arbitrarily large graphs. The algorithms are designed to exploit the easy parallelization offered by the MapReduce framework. The approaches are evaluated regarding their space and memory requirements. The paper also provides the applied data structure and presents a formal analysis of the algorithms.

To facilitate XML query processing, several labeling schemes have been proposed to directly determine the structural relationships between two arbitrary XML nodes without accessing the original XML documents. However, the existing XML labeling schemes have to re-label the pre-existing nodes or re-calculate the label values when a new node is inserted into the XML document during an update process. In this paper, we devise a novel encoding scheme based on fractional numbers to encode the labels of the XML nodes. Moreover, we propose a mapping method to convert our proposed fractional-number-based encoding scheme to a bit-string-based encoding scheme with the intention of minimizing the label size and saving storage space. By applying our proposed bit string encoding scheme to the range-based labeling scheme and the prefix labeling scheme, the process of re-labeling the pre-existing nodes can be avoided when nodes are inserted as leaf nodes and sibling nodes without affecting the order of XML nodes. In addition, we propose an algorithm to control the increment of label size when new nodes are inserted frequently at a fixed place of an XML tree. Experimental results show that our proposed bit string encoding scheme provides efficient support for XML updating without sacrificing query performance when it is applied to the range-based labeling schemes.
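The abstract does not spell out its fractional encoding, so the following is only a generic sketch of the underlying idea: if sibling order values are fractions, a new node can always receive a value strictly between its neighbors without touching existing labels. The midpoint rule and the name `label_between` are assumptions, not the paper's method.

```python
from fractions import Fraction
from typing import Optional


def label_between(left: Optional[Fraction], right: Optional[Fraction]) -> Fraction:
    """Return an order value strictly between two neighboring labels.

    `None` means there is no neighbor on that side.
    """
    if left is None and right is None:
        return Fraction(1)
    if left is None:
        return right - 1
    if right is None:
        return left + 1
    return (left + right) / 2            # midpoint; existing labels are unchanged


# Repeated insertion at the same place never relabels existing nodes,
# but the denominator doubles each time, so the label keeps growing.
a, b = Fraction(1), Fraction(2)
inserted = []
for _ in range(4):
    b = label_between(a, b)
    inserted.append(b)
```

The growth shown by the loop is the kind of label-size increase under frequent insertion at a fixed place that the abstract says its algorithm is designed to control.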

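Research on MapReduce-based massive data mining techniques   (Total citations: 2; self-citations: 0; citations by others: 2)
MapReduce is a programming model for parallel computation over large data sets. It runs in heterogeneous environments, is simple to program, and hides low-level implementation details. MapReduce is applied to three data mining algorithms: the naive Bayes classification algorithm, the K-modes clustering algorithm, and the ECLAT frequent itemset mining algorithm. Experimental results show that MapReduce effectively improves the efficiency of massive data mining while preserving the accuracy of the algorithms.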

In prior work it has been shown that the design of scientific workflows can benefit from a collection-oriented modeling paradigm which views scientific workflows as pipelines of XML stream processors. In this paper, we present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the MapReduce framework. Pipelines in our approach consist of sequences of processing steps that receive XML-structured data and produce, often through calls to “black-box” (scientific) functions, modified (i.e., updated) XML structures. Our main contributions are (i) the development of a set of strategies for compiling scientific workflows, modeled as XML processing pipelines, into parallel MapReduce networks, and (ii) a discussion of their advantages and trade-offs, based on a thorough experimental evaluation of the various translation strategies. Our evaluation uses the Hadoop MapReduce system as an implementation platform. Our results show that execution times of XML workflow pipelines can be significantly reduced using our compilation strategies. These efficiency gains, together with the benefits of MapReduce (e.g., fault tolerance), make our approach ideal for executing large-scale, compute-intensive XML-based scientific workflows.

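When a laboratory system processes massive raw data, real application scenarios involve the special case of a high sampling rate and high skewness, which the two-stage partitioning algorithm cannot handle effectively when balancing the load of Reducer nodes in a homogeneous environment. To address this, MapReduce parallel processing is introduced to improve the utilization of sampled data in the laboratory system, and the ICSC (Improved Cluster Split Combination) partition scheduling algorithm is adopted to deal with high data skewness and a high sampling rate. Experiments show that the MapReduce load balancing algorithm based on two-stage partitioning effectively reduces the idle time of Mapper and Reducer nodes. As data skewness increases, the execution time of the algorithm remains essentially unchanged; that is, data skewness has little effect on its execution time. In addition, as the data sampling rate increases, the ICSC partition scheduling algorithm still has the lowest time overhead among the compared models. The MapReduce load balancing algorithm based on two-stage partitioning therefore weakens the dependence among Reducer nodes and improves the execution efficiency and fault tolerance of MapReduce tasks, efficiently achieving load balancing for data processing in laboratory systems under the MapReduce framework.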

Clustering is a useful data mining technique which groups data points such that the points within a single group have similar characteristics, while the points in different groups are dissimilar. Density-based clustering algorithms such as DBSCAN and OPTICS are one widely used kind of clustering algorithm. As there is an increasing trend of applications to deal with vast amounts of data, clustering such big data is a challenging problem. Recently, parallelizing clustering algorithms on a large cluster of commodity machines using the MapReduce framework has received a lot of attention. In this paper, we first propose a new density-based clustering algorithm, called DBCURE, which is robust in finding clusters with varying densities and suitable for parallelization with MapReduce. We next develop DBCURE-MR, which is a parallelized DBCURE using MapReduce. While traditional density-based algorithms find each cluster one by one, our DBCURE-MR finds several clusters together in parallel. We prove that both DBCURE and DBCURE-MR find the clusters correctly based on the definition of density-based clusters. Our experimental results with various data sets confirm that DBCURE-MR finds clusters efficiently without being sensitive to the clusters with varying densities and scales up well with the MapReduce framework.

Data cube construction is a commonly used operation in data warehouses. Because of the volume of data that is stored and analyzed in a data warehouse and the amount of computation involved in data cube construction, it is natural to consider parallel machines for this operation. This paper addresses a number of algorithmic issues in parallel data cube construction. First, we present an aggregation tree for sequential (and parallel) data cube construction, which has minimally bounded memory requirements. An aggregation tree is parameterized by the ordering of dimensions. We present a parallel algorithm based upon the aggregation tree. We analyze the interprocessor communication volume and construct a closed form expression for it. We prove that the same ordering of the dimensions in the aggregation tree minimizes both the computational and communication requirements. We also describe a method for partitioning the initial array and prove that it minimizes the communication volume. Finally, in the cases when memory may be a bottleneck, we describe how tiling can help scale sequential and parallel data cube construction. Experimental results from implementation of our algorithms on a cluster of workstations show the effectiveness of our algorithms and validate our theoretical results.

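For rectangular spatial data objects, this work builds on the traditional CIF quadtree indexing technique and uses the Hadoop platform and the MapReduce parallel programming model. Following a divide-and-conquer approach, it partitions the data space and designs parallel algorithms for index construction, intersection queries, and region deletion suited to distributed environments. On this basis, experiments vary the number of rectangular objects in the data set and the number of map tasks to analyze the efficiency of parallel index construction and intersection queries. The results show that parallel construction and querying improve processing efficiency for large data sets and for multiple data sets.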

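In many classification tasks there are large numbers of unlabeled samples, and obtaining sample labels is time-consuming and expensive. Using an active learning algorithm to identify the key samples most worth labeling, and building a high-accuracy classifier from them, minimizes the labeling cost. This paper proposes a PageRank-based active learning algorithm (PAL) that makes full use of data distribution information for effective sample selection. PageRank is used to compute, in turn, the neighborhoods, the score matrix, and the ranking vector from the similarity relations among samples. Representative samples are selected and a binary tree is built from their similarity relations; the binary tree is used to cluster, label, and predict the representative samples. The representative samples then serve as the training set for classifying the remaining samples. Experiments on 8 public data sets, comparing against 5 traditional classification algorithms and 3 popular active learning algorithms, show that PAL achieves better classification results.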
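The ranking step can be pictured with a standard PageRank power iteration over a neighbor graph. This is a generic sketch, not the PAL algorithm itself: how PAL derives neighborhoods and its score matrix from sample similarities is not given in the abstract, and the adjacency-list input, damping factor, and iteration count are assumptions.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Standard PageRank by power iteration.

    `adj[u]` lists the nodes that node `u` points to (for example, its
    nearest neighbors under some similarity measure).
    """
    n = len(adj)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for u, nbrs in enumerate(adj):
            if nbrs:
                share = damping * rank[u] / len(nbrs)
                for v in nbrs:
                    new[v] += share
            else:
                # A node without outgoing links spreads its rank uniformly.
                for v in range(n):
                    new[v] += damping * rank[u] / n
        rank = new
    return rank


# Star-shaped neighborhood: samples 1..3 all point to sample 0.
ranks = pagerank([[1, 2, 3], [0], [0], [0]])
most_central = max(range(len(ranks)), key=ranks.__getitem__)
```

In an active learning setting, highly ranked samples such as `most_central` would be natural candidates for representative samples to label first.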

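The decision tree algorithm is one of the classic classification mining algorithms and has broad practical value. The classic ID3 decision tree algorithm is memory-resident and can only handle small data sets; it cannot cope with massive data sets. To address this, the parallelizability of the classic ID3 decision tree generation algorithm is analyzed in depth and, using the MapReduce programming technique of cloud computing, a parallel ID3 decision tree classification algorithm for massive data is proposed and implemented. Experimental results show that the algorithm is effective and feasible.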

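With the development of XML technology, how to store and query XML documents using existing database technology has become a hot research topic in XML data management. This paper introduces a new document encoding method and, based on this encoding, proposes a new storage method for XML documents. The method decomposes the tree structure of an XML document into nodes according to node type and stores them in the corresponding relational tables, so that documents of arbitrary structure can be stored in a fixed relational schema. To facilitate querying, the simple path patterns that occur in the document are also stored in a table. The new storage method effectively supports query operations on documents and can correctly reconstruct the original XML document from the node encoding information. Finally, the proposed storage method and reconstruction algorithm are validated experimentally.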

Finding the occurrences of structural patterns in XML data is a key operation in XML query processing. Existing algorithms for this operation focus almost exclusively on path patterns or tree patterns. Current applications of XML require querying of data whose structure is complex or is not fully known to the user, or integrating XML data sources with different structures. These applications have recently motivated the introduction of query languages that allow a partial specification of path patterns in a query. In this paper, we consider partial path queries, a generalization of path pattern queries, and we focus on their efficient evaluation under the indexed streaming evaluation model. Our approach explicitly deals with repeated labels (that is, multiple occurrences of the same label in a query). We show that partial path queries can be represented as rooted dags for which a topological ordering of the nodes exists. We present three algorithms for the efficient evaluation of these queries. The first one exploits a structural summary of data to generate a set of path patterns that together are equivalent to a partial path query. To evaluate these path patterns, we extend a previous algorithm for path-pattern queries so that it can work on path patterns with repeated labels. The second one extracts a spanning tree from the query dag, uses a stack-based algorithm to find the matches of the root-to-leaf paths in the tree, and merge-joins the matches to compute the answer. Finally, the third one exploits multiple pointers of stack entries and a topological ordering of the query dag to apply a stack-based holistic technique. We analyze our algorithms and perform extensive experimental evaluations. Our experimental results show that the holistic algorithm outperforms the other ones. Our approaches are the first ones to efficiently evaluate this class of queries in the indexed streaming model.

Extensible Markup Language (XML) is commonly employed to represent and transmit information over the Internet. Therefore, how to effectively search for keywords in massive XML data becomes a new issue. In this paper, we first present four properties to improve the classical ILE algorithm. Then, a parallel XML keyword search algorithm, based on intelligent grouping to calculate the SLCA (smallest lowest common ancestor), is proposed and realized under the MapReduce programming model. Finally, a series of experiments is conducted on 7 datasets of different sizes. The obtained results indicate that the proposed algorithm has high execution efficiency and is applicable to keyword search of massive XML data.
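To make the SLCA notion concrete, here is a brute-force sketch over Dewey-style labels. It is neither the ILE algorithm nor the paper's parallel method: it enumerates every combination of keyword matches, which is only practical for tiny inputs. The label representation and function names are assumptions for illustration.

```python
from itertools import product


def common_prefix(labels):
    """Longest common prefix of several Dewey labels, i.e. their LCA."""
    first = labels[0]
    shortest = min(len(label) for label in labels)
    i = 0
    while i < shortest and all(label[i] == first[i] for label in labels):
        i += 1
    return first[:i]


def slca(keyword_lists):
    """Smallest LCAs: nodes covering every keyword with no such descendant.

    `keyword_lists[i]` holds the Dewey labels of the nodes that contain
    the i-th keyword.
    """
    candidates = {common_prefix(combo) for combo in product(*keyword_lists)}
    return sorted(
        c for c in candidates
        if not any(len(o) > len(c) and o[:len(c)] == c for o in candidates)
    )
```

For instance, if one keyword occurs at `(1, 1, 1)` and `(1, 2, 1)` and another at `(1, 1, 2)` and `(1, 3)`, the candidates are `(1, 1)` and the root `(1,)`, and only `(1, 1)` survives because the root has a descendant that already covers both keywords.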

XML data can be represented by a tree or graph structure, and XML query processing requires information on the structural relationships among nodes. The basic structural relationships are parent-child and ancestor-descendant, and finding all occurrences of these basic structural relationships in XML data is clearly a core operation in XML query processing. Several node labeling schemes have been suggested to support the determination of ancestor-descendant or parent-child structural relationships simply by comparing the labels of nodes. However, the previous node labeling schemes have some disadvantages, such as a large number of nodes that need to be relabeled in the case of an insertion of XML data, huge space requirements for node labels, and inefficient processing of structural joins. In this paper, we propose the nested tree structure that eliminates the disadvantages and takes advantage of the previous node labeling schemes. The nested tree structure makes it possible to use the dynamic interval-based labeling scheme, which supports XML data updates with almost no node relabeling as well as efficient structural join processing. Experimental results show that our approach is efficient in handling updates with the interval-based labeling scheme and also significantly improves the performance of the structural join processing compared with recent methods.

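To address the low efficiency of traditional single-machine algorithms in computing topological characteristic parameters of the large-scale Internet, this work studies algorithms for network topology characteristic parameters based on the MapReduce distributed computing framework. By analyzing the problems that arise when porting single-machine graph algorithms to a parallel setting, it proposes design principles and a message passing mechanism for parallelizing graph algorithms; following these, parallel algorithms are designed for four network topology parameters. Experiments show that the parallel topology parameter algorithms effectively improve computational efficiency and scale well.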

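Improvement of an XML data type validation algorithm   (Total citations: 2; self-citations: 1; citations by others: 1)
This paper introduces several XML Schema type validation algorithms and studies the one based on tree automata. To address the problems it encounters when validating XML documents with nested complex types, an improved algorithm is proposed that adds a processing-state flag to each terminal symbol, so that data types represented as XML documents can be validated. Its performance is tested experimentally, and the results show that the algorithm is effective.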
