首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
XML has recently become very popular as a means of representing semistructured data and as a standard for data exchange over the Web, because of its varied applicability in numerous applications. Therefore, XML documents constitute an important data mining domain. In this paper, we propose a new method of XML document clustering by a global criterion function, considering the weight of common structures. Our approach initially extracts representative structures of frequent patterns from schemaless XML documents using a sequential pattern mining algorithm. Then, we perform clustering of an XML document by the weight of common structures, without a measure of pairwise similarity, assuming that an XML document is a transaction and frequent structures extracted from documents are items of the transaction. We conducted experiments to compare our method with previous methods. The experimental results show the effectiveness of our approach.  相似文献   

2.
We introduce a new methodology for coupling language-induced partitions and index  -induced partitions on XML documents that is aimed for the benefit of efficient evaluation of XPath queries. In particular, we identify XPath fragments which are ideally coupled with the newly introduced P(k)P(k)-partition which has its definition grounded in the well-known A(k)A(k) structural index and its associated partition. We then utilize these couplings to investigate fundamental questions about the use of structural indexes in XPath query evaluation.  相似文献   

3.
Clustering XML documents is extensively used to organize large collections of XML documents in groups that are coherent according to structure and/or content features. The growing availability of distributed XML sources and the variety of high-demand environments raise the need for clustering approaches that can exploit distributed processing techniques. Nevertheless, existing methods for clustering XML documents are designed to work in a centralized way. In this paper, we address the problem of clustering XML documents in a collaborative distributed framework. XML documents are first decomposed based on semantically cohesive subtrees, then modeled as transactional data that embed both XML structure and content information. The proposed clustering framework employs a centroid-based partitional clustering method that has been developed for a peer-to-peer network. Each peer in the network is allowed to compute a local clustering solution over its own data, and to exchange its cluster representatives with other peers. The exchanged representatives are used to compute representatives for the global clustering solution in a collaborative way. We evaluated effectiveness and efficiency of our approach on real XML document collections varying the number of peers. Results have shown that major advantages with respect to the corresponding centralized clustering setting are obtained in terms of runtime behavior, although clustering solutions can still be accurate with a moderately low number of nodes in the network. Moreover, the collaborativeness characteristic of our approach has revealed to be a convenient feature in distributed clustering as found in a comparative evaluation with a distributed non-collaborative clustering method.  相似文献   

4.
An efficient and scalable algorithm for clustering XML documents by structure   总被引:11,自引:0,他引:11  
With the standardization of XML as an information exchange language over the Internet, a huge amount of information is formatted in XML documents. In order to analyze this information efficiently, decomposing the XML documents and storing them in relational tables is a popular practice. However, query processing becomes expensive since, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, they come from different DTDs), mining clusters in the documents could alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. The notion of structure graph (s-graph) is proposed, supporting a computationally efficient distance metric defined between documents and sets of documents. This simple metric yields our new clustering algorithm which is efficient and effective, compared to other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters not easily identified by manual inspection.  相似文献   

5.
XML tree structures can conveniently be represented using ordered unranked trees. Due to the repetitiveness of XML markup these trees can be compressed effectively using dictionary-based methods, such as minimal directed acyclic graphs (DAGs) or straight-line context-free (SLCF) tree grammars. While minimal SLCF tree grammars are in general smaller than minimal DAGs, they cannot be computed in polynomial time unless P=NPP=NP. Here, we present a new linear time algorithm for computing small SLCF tree grammars, called TreeRePair, and show that it greatly outperforms the best known previous algorithm BPLEX. TreeRePair is a generalization to trees of Larsson and Moffat's RePair string compression algorithm. SLCF tree grammars can be used as efficient memory representations of trees. Using TreeRePair, we are able to produce the smallest queryable memory representation of ordered trees that we are aware of. Our investigations over a large corpus of commonly used XML documents show that tree traversals over TreeRePair grammars are 14 times slower than over pointer structures and 5 times slower than over succinct trees, while memory consumption is only 1/43 and 1/6, respectively. With respect to file compression we are able to show that a Huffman-based coding of TreeRePair grammars gives compression ratios comparable to the best known XML file compressors.  相似文献   

6.
针对当前XML文档结构聚类算法的一些不足,指出XML文档树中节点的重复和嵌套影响聚类的质量和效率.利用重复剪枝和嵌套剪枝简化XML文档树的表示,然后根据化简后的结构计算两棵XML文档树中的编辑距离,在此基础上得出两棵树整体的结构相似度量,按照层次聚类方法得到聚类结果.实验证明该算法有比较高的查全率和查准率,有效降低了时间复杂性,具有改进效果.  相似文献   

7.
Efficient extraction of schemas for XML documents   总被引:3,自引:0,他引:3  
In this paper, we present a technique for efficient extraction of concise and accurate schemas for XML documents. By restricting the schema form and applying some heuristic rules, we achieve the efficiency and conciseness. The result of an experiment with real-life DTDs shows that our approach attains high accuracy and is 20 to 200 times faster than existing approaches.  相似文献   

8.
How to structure and access XML documents with ontologies   总被引:16,自引:0,他引:16  
The current hype on Extensible Mark-up Language (XML) produced hundreds of XML-based applications. Many of them offer document type definitions (DTDs) to structure actual XML documents. Access to these documents relies on special purpose applications or on query languages that are closely tied to the document structure. Our approach uses ontologies to derive canonical structures, i.e., DTDs, to access sets of distributed XML documents on a conceptual level. We will show how the combination of conceptual modeling, inheritance, and inference mechanisms with the popularity, simplicity, and flexibility of XML leads to applications providing a broad range of high-quality information.  相似文献   

9.
适合于关系数据库存储的XML文档分解   总被引:5,自引:0,他引:5  
作为未来主宰Web种结构化表达应用的XML语言正得到越来越广泛的应用。但如何分解XML文档并将其数据信息存入主流的关系数据库中,却是阻止XML应用展开的一个首要问题,在考察国内外各个方面对XML文档分解的基础上,提出了适合于关系数据库存储分解内容信息7个数据类的方法。  相似文献   

10.
Push-based systems for distributing information through Internet are today becoming more and more popular and widely used. The widespread use of such systems raises non trivial security concerns. In particular, confidentiality, integrity and authenticity of the distributed data must be ensured. To cope with such issues, we describe here a system for securing push distribution of XML documents, which adopts digital signature and encryption techniques to ensure the above mentioned properties and allows the specification of both signature and access control policies. We also describe the implementation of the proposed system and present an extensive performance evaluation of its main components.  相似文献   

11.
Cluster analysis is a primary tool for detecting anomalous behavior in real-world data such as web documents, medical records of patients or other personal data. Most existing methods for document clustering are based on the classical vector-space model, which represents each document by a fixed-size vector of weighted key terms often referred to as key phrases. Since vector representations of documents are frequently very sparse, inverted files are used to prevent a tremendous computational overload which may be caused in large and diverse document collections such as pages downloaded from the World Wide Web. In order to reduce computation costs and space complexity, many popular methods for clustering web documents, including those using inverted files, usually assume a relatively small prefixed number of clusters.We propose several new crisp and fuzzy approaches based on the cosine similarity principle for clustering documents that are represented by variable-size vectors of key phrases, without limiting the final number of clusters. Each entry in a vector consists of two fields. The first field refers to a key phrase in the document and the second denotes an importance weight associated with this key phrase within the particular document. Removing the restriction on the total number of clusters, may moderately increase computing costs but on the other hand improves the method’s performance in classifying incoming vectors as normal or abnormal, based on their similarity to the existing clusters. All the procedures represented in this work are characterized by two features: (a) the number of clusters is not restricted by some relatively prefixed small number, i.e., an arbitrary new incoming vector which is not similar to any of the existing cluster centers necessarily starts a new cluster and (b) a vector with multiple appearance n in the training set is counted as n distinct vectors rather than a single vector. These features are the main reasons for the high quality performance of the proposed algorithms. We later describe them in detail and show their implementation in a real-world application from the area of web activity monitoring, in particular, by detecting anomalous documents downloaded from the internet by users with abnormal information interests.  相似文献   

12.
基于BFS树的XML文档图结构相似性计算   总被引:1,自引:1,他引:1  
可扩展链接语言将XML文档从树状结构扩展到图状结构,其结构相似性比较对文档查询、聚类意义重大.现存的比较XML树状结构相似性以及比较图结构相似性的方法忽视了文档结构特点,比较的结果与实际存在较大差异.基于BFS树的XML文档图结构相似性计算方法运用广度优先搜索算法找到最小代码树,重新定义了编辑距离的概念.比较结果表明,该方法更符合实际文档相似程度,因此在比较XML文档图结构相似性上有很大的可行性.  相似文献   

13.
XQuery is a query and functional programming language that is designed for querying the data in XML documents. This paper addresses how to efficiently query encrypted XML documents using XQuery, with the key point being how to eliminate redundant decryption so as to accelerate the querying process. We propose a processing model that can automatically translate the XQuery statements for encrypted XML documents. The implementation and experimental results demonstrate the practicality of the proposed model.  相似文献   

14.
XML documents are becoming popular for business process integration. To achieve interoperability between applications, XML documents must also conform to various commonly used data type definitions (DTDs). However, most business data are not maintained as XML documents. They are stored in various native formats, such as database tables or LDAP directories. Hence, a middleware is needed to dynamically generate XML documents conforming to predefined DTDs from various data sources. As industrial consortia and large corporations have created various DTDs, it is both challenging and time-consuming to design the necessary middleware to conform to so many different DTDs. This problem is particularly acute for a small- or medium-sized enterprise because it lacks the IT skills to quickly develop such a middleware. In this paper, we present XLE, an XML Lightweight Extractor, as a practical approach to dynamically extracting DTD-conforming XML documents from heterogeneous data sources. XLE is based on a framework called DTD source annotation (DTDSA). It treats a DTD as the control structure of a program. The annotations become the program statements, such as functions and assignments. DTD-conforming XML documents are generated by parsing annotated DTDs. Basically, DTD annotations describe declaratively the mappings between target XML documents and the source data. The XLE engine implements a few basic annotations, providing a practical solution for many small- and medium-sized enterprises. However, XLE is designed to be versatile. It allows sophisticated users to plug in their own implementations to access new types of data or to achieve better performance. Heterogeneous data sources can be simply specified in the annotations. A GUI tool is provided to highlight the places where annotations are needed.  相似文献   

15.
Ed-Sjoin:一种优化的字符串相似连接算法   总被引:1,自引:0,他引:1  
相似连接(similarity join)在数据清洗、生物信息、模式识别等应用领域中有着广泛应用,其中基于编辑距离的字符串相似连接是一种重要的相似连接.尽管当前有一些基于编辑距离的字符串连接算法提出,然而,当前的算法存在着大量的多余计算,影响了算法的效率.为了高效计算基于编辑距离的字符串连接,提出了一种优化的算法Ed-sjoin,分别从优化筛选算法和基于前缀的重复消减策略两方面对算法进行优化,这些优化策略可以实现更加有效的剪枝,并且避免了部分重复计算,从而加速算法的执行.实验结果表明,提出的方法优于现有方法.  相似文献   

16.
一种XML文档索引及查询处理方式   总被引:3,自引:0,他引:3  
本文首先论述了传统XML路径模式索引方式,在此基础上提出面向元素的XML文档索引方式和相关算法,以及使用扩展的后序遍历序号进行元素节点标识的方案,并给出了该索引方式和元素节点标识方案下规则路径表达式查询和树型模式查询处理的方法,最后说明该方式在效率上优于传统索引方式下规则路径表达式查询和树型模式查询处理。  相似文献   

17.
首先探讨了利用XML文件存储树型结构数据的直观表示方式。在此基础上用Java编程实现了解析XML文件生成DOM树的方法,最终通过深度优先遍历算法将DOM树转换成JTree树。从而实现了用JTree树直观地显示DOM树,为树型结构数据的图形化表示提供了便利。  相似文献   

18.
We are interested in specifying functional dependencies (FDs) for data-centric XML documents (XML documents that are used mainly for data storage). FDs are a natural constraint. Specifying FDs for XML documents is more difficult because unlike relational databases, XML documents do not have uniform structures. This paper introduces XML Template Functional Dependencies (XTFDs), which are able to specify FDs for XML documents. This paper also presents a necessary and sufficient condition for an XTFD to cause data redundancy in XML documents. Further, we propose Attribute Rule and Text String Rule as two procedures that can be repeatedly applied to remove redundancy caused by XTFDs. In addition, we prove that if an XML document has data redundancy with respect to an FD specified by using the tree tuple approach, it would have data redundancy with respect to an XTFD and show by example that XTFDs can specify some FDs for XML documents that the tree tuple approach cannot.  相似文献   

19.
在XML的树模型基础上,提出查询是一个有序的带标记树、数据库是一个有序的带标记树集合的思想,对于查询的回答是一个或几个从查询树结点到数据库结点的同态映射;对一般意义下的XML树模型进行了形式化改造,并且基于改造后的XML树模型构造了查询;最后,阐述了这一工作的意义。  相似文献   

20.
Twig query pattern matching is a core operation in XML query processing. Indexing XML documents for twig query processing is of fundamental importance to supporting effective information retrieval. In practice, many XML documents on the web are heterogeneous and have their own formats; documents describing relevant information can possess different structures. Therefore some “user-interesting” documents having similar but non-exact structures against a user query are often missed out. In this paper, we propose the RRSi, a novel structural index designed for structure-based query lookup on heterogeneous sources of XML documents supporting proximate query answers. The index avoids the unnecessary processing of structurally irrelevant candidates that might show good content relevance. An optimized version of the index, oRRSi, is also developed to further reduce both space requirements and computational complexity. To our knowledge, these structural indexes are the first to support proximity twig queries on XML documents. The results of our preliminary experiments show that RRSi and oRRSi based query processing significantly outperform previously proposed techniques in XML repositories with structural heterogeneity.
Vincent T. Y. NgEmail:
  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号