首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
How to structure and access XML documents with ontologies   总被引:16,自引:0,他引:16  
The current hype on Extensible Mark-up Language (XML) produced hundreds of XML-based applications. Many of them offer document type definitions (DTDs) to structure actual XML documents. Access to these documents relies on special purpose applications or on query languages that are closely tied to the document structure. Our approach uses ontologies to derive canonical structures, i.e., DTDs, to access sets of distributed XML documents on a conceptual level. We will show how the combination of conceptual modeling, inheritance, and inference mechanisms with the popularity, simplicity, and flexibility of XML leads to applications providing a broad range of high-quality information.  相似文献   

2.
曹峰  陶世群  张剑妹 《计算机应用》2006,26(11):2674-2677
节点索引是XML索引中可支持正则路径表达式的最具代表性的一种,但是对于长的查询路径表达式,尤其是在中间结果很多的时候,节点索引的连接操作代价很高。对节点索引的索引结构进行了改进,通过减少中间结果的连接次数,使得查询时间与路径的长度无关,而只与路径的复杂度有关,并且提出了一种利用该索引结构输出查询结果的算法。  相似文献   

3.
The processing and management of XML data are popular research issues. However, operations based on the structure of XML data have not received strong attention. These operations involve, among others, the grouping of structurally similar XML documents. Such grouping results from the application of clustering methods with distances that estimate the similarity between tree structures. This paper presents a framework for clustering XML documents by structure. Modeling the XML documents as rooted ordered labeled trees, we study the usage of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We suggest the usage of structural summaries for trees to improve the performance of the distance calculation and at the same time to maintain or even improve its quality. Our approach is tested using a prototype testbed.  相似文献   

4.
Author-X is a Java-based system that addresses the security issues of access control and policy design for XML document administration. Author-X supports the specification of policies at varying granularity levels and the specification of user credentials as a way to enforce access control. Access control is available according to both push and pull document distribution policies, and document updates are distributed through a combination of hash functions and digital signature techniques. The Author-X approach to distributed updates allows a user to verify a document's integrity without contacting the document server  相似文献   

5.
As XML documents contain both content and structure information, taking advantage of the document structure in the retrieval process can lead to better identify relevant information units. In this paper, we describe an information retrieval (IR) approach dealing with queries composed of content and structure conditions. The XFIRM model we propose is designed to be as flexible as possible to process such queries. It is based on a complete query language, derived from XPath and on a relevance values propagation method. This paper aims at evaluating functions used in the propagation process, and particularly the use of distance between nodes as a parameter. The proposed method is evaluated, thanks to the INEX evaluation initiative. Results show a relative high precision of our proposal.  相似文献   

6.
We investigate the typechecking problem for XML queries: statically verifying that every answer to a query conforms to a given output DTD, for inputs satisfying a given input DTD. This problem had been studied by a subset of the authors in a simplified framework that captured the structure of XML documents but ignored data values. We revisit here the typechecking problem in the more realistic case when data values are present in documents and tested by queries. In this extended framework, typechecking quickly becomes undecidable. However, it remains decidable for large classes of queries and DTDs of practical interest. The main contribution of the present paper is to trace a fairly tight boundary of decidability for typechecking with data values. The complexity of typechecking in the decidable cases is also considered.  相似文献   

7.
An efficient and scalable algorithm for clustering XML documents by structure   总被引:11,自引:0,他引:11  
With the standardization of XML as an information exchange language over the Internet, a huge amount of information is formatted in XML documents. In order to analyze this information efficiently, decomposing the XML documents and storing them in relational tables is a popular practice. However, query processing becomes expensive since, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, they come from different DTDs), mining clusters in the documents could alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. The notion of structure graph (s-graph) is proposed, supporting a computationally efficient distance metric defined between documents and sets of documents. This simple metric yields our new clustering algorithm which is efficient and effective, compared to other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters not easily identified by manual inspection.  相似文献   

8.
XML has recently become very popular as a means of representing semistructured data and as a standard for data exchange over the Web, because of its varied applicability in numerous applications. Therefore, XML documents constitute an important data mining domain. In this paper, we propose a new method of XML document clustering by a global criterion function, considering the weight of common structures. Our approach initially extracts representative structures of frequent patterns from schemaless XML documents using a sequential pattern mining algorithm. Then, we perform clustering of an XML document by the weight of common structures, without a measure of pairwise similarity, assuming that an XML document is a transaction and frequent structures extracted from documents are items of the transaction. We conducted experiments to compare our method with previous methods. The experimental results show the effectiveness of our approach.  相似文献   

9.
Active XML (AXML) documents combine extensional XML data with intentional data defined through Web service calls. The dynamic properties of these documents pose challenges to both storage and data materialization techniques. In this paper, we present ARAXA, a non-intrusive approach to store and manage AXML documents. We also define a methodology to materialize AXML documents at query time. The storage approach of ARAXA is based on plain relational tables and user-defined functions of Object-Relational DBMS to trigger the service calls. By using a DBMS we benefit from efficient storage tools and query optimization. Approaches without DBMS support have to process XML in main memory or provide for virtual memory solutions. One of the main advantages of ARAXA is that AXML documents do not need to be loaded into main memory at query processing time. This is crucial when dealing with large documents. The experimental results with ARAXA prototype show that our approach is scalable and capable of dealing with large AXML documents.  相似文献   

10.
XML关系保护模型   总被引:1,自引:1,他引:0  
为了更好的保护节点关系所携带的信息以及XML的敏感信息.提出了利用"块"来组织的访问控制策略,并使用虚节点来指定与关系相关的机密限制.对具有块概念的XML-BB访问控制模型进行了研究,该模型能够隐藏不同块间节点的关系,以一种细粒度访问控制加强了XML敏感信息的保护.实验结果表明了该模型的可行性.  相似文献   

11.
The revolution of XML is recognized as the trend of technology on the Internet to researchers as well as practitioners. Companies need to adopt XML technology. With investment in the current relational database systems, they want to develop new XML documents while running existing relational databases on production. They need to reengineer the relational databases into XML documents with constraints preservation. In the process, schema translation must be done before data conversion. Since the existing relational databases are usually normalized, they have to be reconstructed into XML document tree structures. This can be accomplished through denormalization by joining the normalized relations into tables according to their data dependencies constraints. The joined tables are mapped into DOMs, which are then integrated into XML document trees. The user specifies an XML document root with its relevant nodes to form a partitioned XML document tree to meet their requirements. The selected XML document tree is mapped into an XML schema in the form of DTD. We then load joined tables into DOMs, integrate them into a DOM, and transform it into an XML document.  相似文献   

12.
13.
提出了一种基于TreeMiner算法挖掘频繁子树的文档结构相似度量方法,解决了传统的距离编辑法计算代价高而路径匹配法无法处理重复标签的问题。该方法架构了一个新的检索模型—频繁结构向量模型,给出了文档的结构向量表示和权重函数,构造了XML文档结构相似度量计算公式;同时从数据结构和挖掘程序上对TreeMiner 算法进行了改进,使其更适合大文档数据集的结构挖掘。实验结果表明,该方法具有很高的计算精度和准确率。  相似文献   

14.
The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported. Received June 15, 2000 / Revised November 7, 2000  相似文献   

15.
XML documents are extensively used in several applications and evolve over time. Identifying the semantics of these changes becomes a fundamental process to understand their evolution. Existing approaches related to understanding changes (diff) in XML documents focus only on syntactic changes. These approaches compare XML documents based on their structure, without considering the associated semantics. However, for large XML documents, which have undergone many changes from a version to the next, a large number of syntactic changes in the document may correspond to fewer semantic changes, which are then easier to analyze and understand. For instance, increasing the annual salary and the gross pay, and changing the job title of an employee (three syntactic changes) may mean that this employee was promoted (one semantic change). In this paper, we explore this idea and present the XChange approach. XChange considers the semantics of the changes to calculate the diff of different versions of XML documents. For such, our approach analyzes the granular syntactic changes in XML attributes and elements using inference rules to combine them into semantic changes. Thus, differently from existing approaches, XChange proposes the use of syntactic changes in versions of an XML document to infer the real reason for the change and support the process of semantic diff. Results of an experimental study indicate that XChange can provide higher effectiveness and efficiency when used to understand changes between versions of XML documents when compared with the (syntactic) state-of-the-art approaches.  相似文献   

16.
目前国际上对变化检测算法的研究主要集中于在效率或空间上的优化,变化检测的精确程度不能令人满意,比如不能准确定位改变的文字等.通过将XML文档的树型结构和文本之间相似度相结合,提出了一种新颖的面向文本内容的变化检测算法DML-Diff,重点突出了文本内容的变化,使得变化检测结果更精确.  相似文献   

17.
目前国际上对变化检测算法的研究主要集中于在效率或空间上的优化,变化检测的精确程度不能令人满意,比如不能准确定位改变的文字等。通过将XML文档的树型结构和文本之间相似度相结合,提出了一种新颖的面向文本内容的变化检测算法DML-Diff,重点突出了文本内容的变化,使得变化检测结果更精确。  相似文献   

18.
In view of the efficiency requirements for query and update processing in XML databases, implementation of the robust node labeling (numbering) scheme becomes an increasingly important research issue. In order to process XML queries efficiently, it is necessary to detect the ancestor-descendant relationship between the nodes and restore the sequence order of nodes in the document. To solve this problem, the technique of labeling the document nodes is used. As a result, the so-called numbering scheme is created. The nodes of the documents are labeled with certain unique identifiers. Comparing these identifiers, one can restore the sequence order of the nodes and to establish the hierarchical relationships. In this paper, we give a survey of the most efficient numbering schemes and introduce a numbering scheme proposed by the authors and employed in the Sedna DBMS [1].  相似文献   

19.
This paper presents a schema matching method for the transformation of XML documents. The proposed method consists of two steps: computing preliminary matching relationships between leaf nodes in the two XML schemas based on proposed ontology and leaf node similarity, and extracting final matchings based on a proposed path similarity. Particularly, for a sophisticated schema matching, the proposed ontology is incrementally updated by users' feedback. Furthermore, since the ontology can describe various relationships between concepts, the proposed method can compute complex matchings as well as simple matchings. Experimental results with schemas used in various domains show that the proposed method performs better than previous methodologies, resulting in a precision of 97% and a recall of 83% on the average.  相似文献   

20.
Selective and authentic third-party distribution of XML documents   总被引:4,自引:0,他引:4  
Third-party architectures for data publishing over the Internet today are receiving growing attention, due to their scalability properties and to the ability of efficiently managing large number of subjects and great amount of data. In a third-party architecture, there is a distinction between the Owner and the Publisher of information. The Owner is the producer of information, whereas Publishers are responsible for managing (a portion of) the Owner information and for answering subject queries. A relevant issue in this architecture is how the Owner can ensure a secure and selective publishing of its data, even if the data are managed by a third-party, which can prune some of the nodes of the original document on the basis of subject queries and access control policies. An approach can be that of requiring the Publisher to be trusted with regard to the considered security properties. However, the serious drawback of this solution is that large Web-based systems cannot be easily verified to be secure and can be easily penetrated. For these reasons, we propose an alternative approach, based on the use of digital signature techniques, which does not require the Publisher to be trusted. The security properties we consider are authenticity and completeness of a query response, where completeness is intended with regard to the access control policies stated by the information Owner. In particular, we show that, by embedding in the query response one digital signature generated by the Owner and some hash values, a subject is able to locally verify the authenticity of a query response. Moreover, we present an approach that, for a wide range of queries, allows a subject to verify the completeness of query results.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号