A total of 20 similar documents were found.
1.
How to structure and access XML documents with ontologies
The current hype around the Extensible Markup Language (XML) has produced hundreds of XML-based applications. Many of them offer document type definitions (DTDs) to structure actual XML documents. Access to these documents relies on special-purpose applications or on query languages that are closely tied to the document structure. Our approach uses ontologies to derive canonical structures, i.e., DTDs, to access sets of distributed XML documents on a conceptual level. We show how the combination of conceptual modeling, inheritance, and inference mechanisms with the popularity, simplicity, and flexibility of XML leads to applications providing a broad range of high-quality information.
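To make the idea concrete, here is a minimal sketch of deriving a DTD fragment from a toy conceptual model; the dictionary encoding of the ontology and the `concept_to_dtd` helper are illustrative assumptions, not the authors' system:

```python
# Minimal sketch (not the paper's system): derive a DTD fragment from a toy
# ontology in which each concept lists its subconcepts and attributes.
ontology = {
    "Publication": {"subconcepts": ["Article", "Book"], "attributes": ["title", "year"]},
    "Article":     {"subconcepts": [], "attributes": ["journal"]},
    "Book":        {"subconcepts": [], "attributes": ["publisher"]},
}

def concept_to_dtd(name, onto):
    """Emit <!ELEMENT ...> declarations for one concept: attributes become
    child elements, subconcepts become an alternative content model."""
    attrs = onto[name]["attributes"]
    subs = onto[name]["subconcepts"]
    children = ", ".join(attrs) if attrs else ""
    if subs:
        alt = " | ".join(subs)
        children = f"{children}, ({alt})" if children else f"({alt})"
    lines = [f"<!ELEMENT {name} ({children or '#PCDATA'})>"]
    for a in attrs:
        lines.append(f"<!ELEMENT {a} (#PCDATA)>")
    return lines

dtd = []
for concept in ontology:
    dtd.extend(concept_to_dtd(concept, ontology))
print("\n".join(dict.fromkeys(dtd)))  # deduplicate repeated leaf declarations
```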
2.
3.
The processing and management of XML data are popular research issues. However, operations based on the structure of XML data have not received much attention. These operations include, among others, the grouping of structurally similar XML documents. Such grouping results from the application of clustering methods with distances that estimate the similarity between tree structures. This paper presents a framework for clustering XML documents by structure. Modeling XML documents as rooted ordered labeled trees, we study the use of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We suggest the use of structural summaries for trees to improve the performance of the distance calculation while maintaining or even improving its quality. Our approach is tested using a prototype testbed.
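A minimal sketch of the general approach, assuming a simple path-set summary and a Jaccard-style distance (simplifications; the paper's structural summaries and distance metrics are more elaborate):

```python
# Sketch of structure-based clustering (assumptions: the path-set distance and
# the summary step below are simplifications, not the paper's exact metrics).
import xml.etree.ElementTree as ET
from itertools import combinations

def label_paths(xml_text):
    """Structural summary: the set of distinct root-to-node label paths,
    which collapses repeated sibling subtrees with the same structure."""
    root = ET.fromstring(xml_text)
    paths = set()
    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)
    walk(root, "")
    return paths

def structural_distance(a, b):
    """Jaccard-style distance between two path sets (0 = identical structure)."""
    return 1.0 - len(a & b) / len(a | b)

docs = [
    "<lib><book><title/><year/></book></lib>",
    "<lib><book><title/></book><book><title/></book></lib>",
    "<store><cd><artist/></cd></store>",
]
summaries = [label_paths(d) for d in docs]
for (i, si), (j, sj) in combinations(enumerate(summaries), 2):
    print(i, j, round(structural_distance(si, sj), 2))
# Documents 0 and 1 come out much closer than either is to document 2,
# so a hierarchical clustering on this distance matrix would group them first.
```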
4.
Author-X is a Java-based system that addresses the security issues of access control and policy design for XML document administration. Author-X supports the specification of policies at varying granularity levels and the specification of user credentials as a way to enforce access control. Access control is available according to both push and pull document distribution policies, and document updates are distributed through a combination of hash functions and digital signature techniques. The Author-X approach to distributed updates allows a user to verify a document's integrity without contacting the document server.
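A sketch of the general integrity-check idea only, not Author-X's actual protocol; an HMAC over a shared demo key stands in for a real public-key signature:

```python
# Sketch of the general idea (not Author-X's protocol): the owner publishes a
# signed digest of the document; a recipient recomputes the digest locally and
# checks it against the signed value, so no round trip to the server is needed.
# An HMAC with a shared key stands in for a real public-key signature here.
import hashlib, hmac

OWNER_KEY = b"demo-shared-secret"          # stand-in for the owner's signing key

def sign_document(xml_bytes: bytes) -> bytes:
    digest = hashlib.sha256(xml_bytes).digest()
    return hmac.new(OWNER_KEY, digest, hashlib.sha256).digest()

def verify_document(xml_bytes: bytes, signature: bytes) -> bool:
    digest = hashlib.sha256(xml_bytes).digest()
    expected = hmac.new(OWNER_KEY, digest, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)

doc = b"<report><total>42</total></report>"
sig = sign_document(doc)
print(verify_document(doc, sig))                       # True
print(verify_document(doc.replace(b"42", b"43"), sig)) # False: tampered content
```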
5.
Since XML documents contain both content and structure information, taking advantage of the document structure in the retrieval process can lead to better identification of relevant information units. In this paper, we describe an information retrieval (IR) approach dealing with queries composed of content and structure conditions. The XFIRM model we propose is designed to be as flexible as possible in processing such queries. It is based on a complete query language derived from XPath and on a relevance-value propagation method. This paper aims to evaluate the functions used in the propagation process, and in particular the use of the distance between nodes as a parameter. The proposed method is evaluated within the INEX evaluation initiative. Results show a relatively high precision for our proposal.
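A rough sketch of distance-weighted relevance propagation, with an assumed decay factor; the decay function and scores are illustrative and not XFIRM's actual formulas:

```python
# Sketch of distance-weighted relevance propagation (illustrative only; the
# decay function and scores below are not XFIRM's actual formulas).
import xml.etree.ElementTree as ET

ALPHA = 0.6   # assumed decay factor per edge between a node and its parent

def propagate(xml_text, leaf_scores):
    """Propagate leaf relevance scores upward, attenuated by node distance."""
    root = ET.fromstring(xml_text)
    scores = {}
    def walk(node, path):
        tag_path = path + "/" + node.tag
        total = leaf_scores.get(tag_path, 0.0)
        for child in node:
            child_score = walk(child, tag_path)
            total += ALPHA * child_score     # one edge away: decay once more
        scores[tag_path] = total
        return total
    walk(root, "")
    return scores

doc = "<article><sec><p/><p/></sec><title/></article>"
leaf_scores = {"/article/sec/p": 0.8, "/article/title": 0.5}
print(propagate(doc, leaf_scores))
```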
6.
Noga Alon. Journal of Computer and System Sciences, 2003, 66(4): 688-727
We investigate the typechecking problem for XML queries: statically verifying that every answer to a query conforms to a given output DTD, for inputs satisfying a given input DTD. This problem had been studied by a subset of the authors in a simplified framework that captured the structure of XML documents but ignored data values. We revisit here the typechecking problem in the more realistic case when data values are present in documents and tested by queries. In this extended framework, typechecking quickly becomes undecidable. However, it remains decidable for large classes of queries and DTDs of practical interest. The main contribution of the present paper is to trace a fairly tight boundary of decidability for typechecking with data values. The complexity of typechecking in the decidable cases is also considered.
7.
Wang Lian, Cheung D.W.-l., Mamoulis N., Siu-Ming Yiu. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(1): 82-96
With the standardization of XML as an information exchange language over the Internet, a huge amount of information is formatted in XML documents. In order to analyze this information efficiently, decomposing the XML documents and storing them in relational tables is a popular practice. However, query processing becomes expensive since, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, they come from different DTDs), mining clusters in the documents can alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. The notion of the structure graph (s-graph) is proposed, supporting a computationally efficient distance metric defined between documents and sets of documents. This simple metric yields a new clustering algorithm that is efficient and effective compared with other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters that are not easily identified by manual inspection.
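A sketch of the s-graph intuition, summarizing a document by its set of parent-child tag edges and comparing documents by edge overlap; the exact definitions in the paper may differ:

```python
# Sketch of the s-graph idea: summarize each document by its set of
# parent-child tag edges and compare documents via edge overlap. This is a
# simplification for illustration; the paper's exact definitions may differ.
import xml.etree.ElementTree as ET

def s_graph(xml_text):
    """Return the set of (parent_tag, child_tag) edges in the document."""
    root = ET.fromstring(xml_text)
    edges = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            edges.add((node.tag, child.tag))
            stack.append(child)
    return edges

def sgraph_distance(g1, g2):
    """0 when the edge sets coincide, 1 when they share no edge."""
    if not g1 and not g2:
        return 0.0
    return 1.0 - len(g1 & g2) / max(len(g1), len(g2))

a = s_graph("<order><item><price/></item><item><price/></item></order>")
b = s_graph("<order><item><price/><qty/></item></order>")
print(sgraph_distance(a, b))   # small: the two structures largely overlap
```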
8.
Jeong Hee Hwang. Journal of Systems and Software, 2010, 83(7): 1267-1274
XML has recently become very popular as a means of representing semistructured data and as a standard for data exchange over the Web, because of its applicability in numerous domains. XML documents therefore constitute an important data mining domain. In this paper, we propose a new method of XML document clustering based on a global criterion function that considers the weight of common structures. Our approach first extracts representative structures of frequent patterns from schemaless XML documents using a sequential pattern mining algorithm. We then cluster the XML documents by the weight of common structures, without a measure of pairwise similarity, treating each XML document as a transaction and the frequent structures extracted from the documents as the items of that transaction. We conducted experiments to compare our method with previous methods. The experimental results show the effectiveness of our approach.
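A simplified sketch of the transaction view, in which each document becomes a transaction of structural paths and documents are grouped by the frequent paths they share; the support threshold and the grouping step are illustrative, not the paper's algorithm:

```python
# Sketch of the transaction view (a simplification, not the paper's algorithm):
# each document becomes a "transaction" of structural paths, frequent paths are
# mined with a support threshold, and documents are grouped by the frequent
# paths they share.
from collections import Counter
import xml.etree.ElementTree as ET

def paths(xml_text):
    root = ET.fromstring(xml_text)
    out, stack = set(), [(root, "/" + root.tag)]
    while stack:
        node, p = stack.pop()
        out.add(p)
        stack.extend((c, p + "/" + c.tag) for c in node)
    return out

docs = [
    "<movie><title/><actor/></movie>",
    "<movie><title/><director/></movie>",
    "<recipe><ingredient/><step/></recipe>",
]
transactions = [paths(d) for d in docs]

MIN_SUPPORT = 2   # assumed threshold: a path is "frequent" if it occurs in >= 2 docs
counts = Counter(p for t in transactions for p in t)
frequent = {p for p, c in counts.items() if c >= MIN_SUPPORT}

# Group documents by the frequent structures they contain.
clusters = {}
for i, t in enumerate(transactions):
    key = frozenset(t & frequent)
    clusters.setdefault(key, []).append(i)
print(clusters)   # the two movie documents share frequent paths; the recipe stands alone
```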
9.
Cláudio Ananias Ferraz, Vanessa Braganholo, Marta Mattoso. Journal of Web Semantics, 2010, 8(2-3): 209-224
Active XML (AXML) documents combine extensional XML data with intensional data defined through Web service calls. The dynamic properties of these documents pose challenges to both storage and data materialization techniques. In this paper, we present ARAXA, a non-intrusive approach to store and manage AXML documents. We also define a methodology to materialize AXML documents at query time. The storage approach of ARAXA is based on plain relational tables and user-defined functions of an object-relational DBMS to trigger the service calls. By using a DBMS, we benefit from efficient storage tools and query optimization. Approaches without DBMS support have to process XML in main memory or provide virtual memory solutions. One of the main advantages of ARAXA is that AXML documents do not need to be loaded into main memory at query processing time, which is crucial when dealing with large documents. The experimental results with the ARAXA prototype show that our approach is scalable and capable of dealing with large AXML documents.
10.
11.
Information and Software Technology, 2003, 45(6): 335-355
The XML revolution is recognized by researchers and practitioners alike as a major trend in Internet technology, and companies need to adopt XML technology. Having invested in relational database systems, they want to develop new XML documents while keeping their existing relational databases in production. They therefore need to reengineer the relational databases into XML documents with constraint preservation. In this process, schema translation must be done before data conversion. Since the existing relational databases are usually normalized, they have to be restructured into XML document tree structures. This can be accomplished through denormalization, by joining the normalized relations into tables according to their data dependency constraints. The joined tables are mapped into DOMs, which are then integrated into XML document trees. The user specifies an XML document root with its relevant nodes to form a partitioned XML document tree that meets their requirements. The selected XML document tree is mapped into an XML schema in the form of a DTD. We then load the joined tables into DOMs, integrate them into a single DOM, and transform it into an XML document.
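A sketch of the denormalization-to-XML step with made-up tables and column names; it is not the paper's full schema-translation method:

```python
# Sketch of the relational-to-XML step (illustrative tables and column names;
# not the paper's full schema-translation method): denormalize by joining
# normalized relations, then nest the joined rows into an XML document tree.
import xml.etree.ElementTree as ET

customers = [{"cust_id": 1, "name": "Ada"}]
orders = [
    {"order_id": 10, "cust_id": 1, "total": "25.00"},
    {"order_id": 11, "cust_id": 1, "total": "40.00"},
]

root = ET.Element("customers")
for c in customers:
    cust_el = ET.SubElement(root, "customer", id=str(c["cust_id"]))
    ET.SubElement(cust_el, "name").text = c["name"]
    # The join on cust_id realizes the one-to-many dependency as nesting.
    for o in (o for o in orders if o["cust_id"] == c["cust_id"]):
        order_el = ET.SubElement(cust_el, "order", id=str(o["order_id"]))
        ET.SubElement(order_el, "total").text = o["total"]

print(ET.tostring(root, encoding="unicode"))
```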
12.
13.
14.
Oronzo Altamura, Floriana Esposito, Donato Malerba. International Journal on Document Analysis and Recognition, 2001, 4(1): 2-17
The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.
15.
XML documents are extensively used in several applications and evolve over time. Identifying the semantics of these changes becomes a fundamental process to understand their evolution. Existing approaches related to understanding changes (diff) in XML documents focus only on syntactic changes. These approaches compare XML documents based on their structure, without considering the associated semantics. However, for large XML documents that have undergone many changes from one version to the next, a large number of syntactic changes may correspond to far fewer semantic changes, which are easier to analyze and understand. For instance, increasing the annual salary and the gross pay, and changing the job title of an employee (three syntactic changes) may mean that this employee was promoted (one semantic change). In this paper, we explore this idea and present the XChange approach. XChange considers the semantics of the changes to calculate the diff of different versions of XML documents. To this end, our approach analyzes the granular syntactic changes in XML attributes and elements, using inference rules to combine them into semantic changes. Thus, differently from existing approaches, XChange uses syntactic changes in versions of an XML document to infer the real reason for the change and to support the process of semantic diff. Results of an experimental study indicate that XChange provides higher effectiveness and efficiency than the (syntactic) state-of-the-art approaches when used to understand changes between versions of XML documents.
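A sketch of the rule idea using the abstract's own promotion example; the change encoding and field names are illustrative, not XChange's rule language:

```python
# Sketch of the rule idea using the abstract's own example (the rule encoding
# and field names are illustrative, not XChange's actual rule language):
# several syntactic attribute changes are collapsed into one semantic change.
syntactic_changes = [
    {"op": "update", "path": "/employee/salary",   "old": "50000",    "new": "65000"},
    {"op": "update", "path": "/employee/grossPay", "old": "4100",     "new": "5300"},
    {"op": "update", "path": "/employee/jobTitle", "old": "Engineer", "new": "Senior Engineer"},
]

def infer_semantic_changes(changes):
    """Apply a single hand-written rule: salary + gross pay + job title all
    updated for the same employee  =>  'promotion'."""
    touched = {c["path"].rsplit("/", 1)[-1] for c in changes if c["op"] == "update"}
    if {"salary", "grossPay", "jobTitle"} <= touched:
        return [{"semantic": "promotion",
                 "evidence": sorted(touched & {"salary", "grossPay", "jobTitle"})}]
    return [{"semantic": "unclassified", "evidence": sorted(touched)}]

print(infer_semantic_changes(syntactic_changes))
```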
16.
Current research on change detection algorithms focuses mainly on optimizing efficiency or space, and the precision of change detection remains unsatisfactory; for example, the changed text cannot be located accurately. By combining the tree structure of XML documents with the similarity between texts, this paper proposes DML-Diff, a novel change detection algorithm oriented toward textual content. It emphasizes changes in textual content and makes the change detection results more precise.
17.
18.
N. A. Aznauryan, S. D. Kuznetsov, L. G. Novak, M. N. Grinev. Programming and Computer Software, 2006, 32(1): 8-18
In view of the efficiency requirements for query and update processing in XML databases, the implementation of a robust node labeling (numbering) scheme becomes an increasingly important research issue. In order to process XML queries efficiently, it is necessary to detect the ancestor-descendant relationship between nodes and to restore the document order of nodes. To solve this problem, the technique of labeling the document nodes is used, resulting in a so-called numbering scheme: the nodes of the documents are labeled with unique identifiers, and by comparing these identifiers one can restore the document order of the nodes and establish the hierarchical relationships between them. In this paper, we give a survey of the most efficient numbering schemes and introduce a numbering scheme proposed by the authors and employed in the Sedna DBMS [1].
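A sketch of one classic interval-based labeling scheme (not necessarily the Sedna scheme): each node gets a (start, end) range, so ancestor-descendant tests and document order reduce to integer comparisons:

```python
# Sketch of one classic labeling scheme (pre/post-style intervals), not the
# Sedna scheme itself: label every node with a (start, end) range so that
# ancestor-descendant tests and document order reduce to integer comparisons.
import xml.etree.ElementTree as ET

def label(xml_text):
    root = ET.fromstring(xml_text)
    labels, counter = {}, [0]
    def visit(node):
        counter[0] += 1
        start = counter[0]
        for child in node:
            visit(child)
        counter[0] += 1
        labels[id(node)] = (node.tag, start, counter[0])
    visit(root)
    return list(labels.values())

def is_ancestor(a, b):
    """a is an ancestor of b iff a's interval strictly contains b's."""
    return a[1] < b[1] and b[2] < a[2]

nodes = label("<a><b><c/></b><d/></a>")
by_tag = {tag: (tag, s, e) for tag, s, e in nodes}
print(is_ancestor(by_tag["a"], by_tag["c"]))   # True
print(is_ancestor(by_tag["b"], by_tag["d"]))   # False
print(sorted(nodes, key=lambda n: n[1]))       # document order by start value
```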
19.
Information and Software Technology, 2006, 48(9): 937-946
This paper presents a schema matching method for the transformation of XML documents. The proposed method consists of two steps: computing preliminary matching relationships between leaf nodes in the two XML schemas, based on a proposed ontology and leaf-node similarity, and extracting final matchings based on a proposed path similarity. In particular, for sophisticated schema matching, the proposed ontology is incrementally updated based on user feedback. Furthermore, since the ontology can describe various relationships between concepts, the proposed method can compute complex matchings as well as simple matchings. Experimental results with schemas used in various domains show that the proposed method performs better than previous methodologies, achieving a precision of 97% and a recall of 83% on average.
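A sketch of a path-similarity score between leaf paths of two schemas; the token-overlap formula and the tiny synonym table standing in for the ontology are illustrative assumptions:

```python
# Sketch of a path-similarity score between leaf paths of two schemas (the
# token-overlap formula and the synonym table are illustrative assumptions,
# not the paper's measures).
SYNONYMS = {"writer": "author", "cost": "price"}   # tiny stand-in for the ontology

def normalize(token):
    return SYNONYMS.get(token.lower(), token.lower())

def path_similarity(path_a, path_b):
    """Compare two schema paths like 'book/writer/name' by token overlap,
    after mapping tokens through the synonym table."""
    a = {normalize(t) for t in path_a.split("/")}
    b = {normalize(t) for t in path_b.split("/")}
    return 2.0 * len(a & b) / (len(a) + len(b))

print(path_similarity("book/writer/name", "book/author/name"))   # 1.0 via synonym
print(path_similarity("book/writer/name", "cd/artist/name"))     # lower overlap
```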
20.
Bertino E., Carminati B., Ferrari E., Thuraisingham B., Amar Gupta. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(10): 1263-1278
Third-party architectures for data publishing over the Internet are today receiving growing attention, due to their scalability and their ability to efficiently manage large numbers of subjects and large amounts of data. In a third-party architecture, there is a distinction between the Owner and the Publisher of information. The Owner is the producer of the information, whereas Publishers are responsible for managing (a portion of) the Owner's information and for answering subject queries. A relevant issue in this architecture is how the Owner can ensure secure and selective publishing of its data, even though the data are managed by a third party, which may prune some of the nodes of the original document on the basis of subject queries and access control policies. One approach is to require the Publisher to be trusted with regard to the considered security properties. However, the serious drawback of this solution is that large Web-based systems cannot easily be verified to be secure and can be easily penetrated. For these reasons, we propose an alternative approach, based on the use of digital signature techniques, which does not require the Publisher to be trusted. The security properties we consider are authenticity and completeness of a query response, where completeness is defined with respect to the access control policies stated by the information Owner. In particular, we show that, by embedding in the query response a single digital signature generated by the Owner together with some hash values, a subject is able to locally verify the authenticity of a query response. Moreover, we present an approach that, for a wide range of queries, allows a subject to verify the completeness of query results.
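A sketch of the single-signature-plus-hashes idea (not the paper's exact construction): the Owner signs one root hash computed bottom-up; a subject receiving a pruned answer plus the hashes of the pruned subtrees can recompute the root locally:

```python
# Sketch of the general idea (hash values plus one owner signature), not the
# paper's exact construction: the Owner signs a single root hash computed
# bottom-up over the document; a subject who receives a pruned answer plus the
# hashes of the pruned subtrees can recompute the root and check it against the
# signed value without trusting the Publisher.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def node_hash(tag, children_hashes, text=b""):
    """Hash of a node = hash(tag, own text, concatenated child hashes)."""
    return h(tag.encode() + b"|" + text + b"|" + b"".join(children_hashes))

# Owner side: document <doc><public>42</public><secret>x</secret></doc>
public_h = node_hash("public", [], b"42")
secret_h = node_hash("secret", [], b"x")
root_h = node_hash("doc", [public_h, secret_h])
owner_signed_root = root_h            # stand-in: in practice, a signature over root_h

# Publisher prunes <secret> from the answer but supplies its hash.
answer = {"tag": "public", "text": b"42"}
pruned_sibling_hash = secret_h

# Subject side: recompute the root from the answer plus the supplied hashes.
recomputed = node_hash("doc", [node_hash(answer["tag"], [], answer["text"]),
                               pruned_sibling_hash])
print(recomputed == owner_signed_root)   # True: authentic and complete w.r.t. policy
```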