首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Schema integration aims to create a mediated schema as a unified representation of existing heterogeneous sources sharing a common application domain. These sources have been increasingly written in XML due to its versatility and expressive power. Unfortunately, these sources often use different elements and structures to express the same concepts and relations, thus causing substantial semantic and structural conflicts. Such a challenge impedes the creation of high-quality mediated schemas and has not been adequately addressed by existing integration methods. In this paper, we propose a novel method, named XINTOR, for automating the integration of heterogeneous schemas. Given a set of XML sources and a set of correspondences between the source schemas, our method aims to create a complete and minimal mediated schema: it completely captures all of the concepts and relations in the sources without duplication, provided that the concepts do not overlap. Our contributions are fourfold. First, we resolve structural conflicts inherent in the source schemas. Second, we introduce a new statistics-based measure, called path cohesion, for selecting concepts and relations to be a part of the mediated schema. The path cohesion is statistically computed based on multiple path quality dimensions such as average path length and path frequency. Third, we resolve semantic conflicts by augmenting the semantics of similar concepts with context-dependent information. Finally, we propose a novel double-layered mediated schema to retain a wider range of concepts and relations than existing mediated schemas, which are at best either complete or minimal, but not both. Performed on both real and synthetic datasets, our experimental results show that XINTOR outperforms existing methods with respect to (i) the mediated-schema quality using precision, recall, F-measure, and schema minimality; and (ii) the execution performance based on execution time and scale-up performance.  相似文献   

2.
Automatic annotation is an essential technique for effectively handling and organizing Web objects (e.g., Web pages), which have experienced an unprecedented growth over the last few years. Automatic annotation is usually formulated as a multi-label classification problem. Unfortunately, labeled data are often time-consuming and expensive to obtain. Web data also accommodate much richer feature space. This calls for new semi-supervised approaches that are less demanding on labeled data to be effective in classification. In this paper, we propose a graph-based semi-supervised learning approach that leverages random walks and ? 1 sparse reconstruction on a mixed object-label graph with both attribute and structure information for effective multi-label classification. The mixed graph contains an object-affinity subgraph, a label-correlation subgraph, and object-label edges with adaptive weight assignments indicating the assignment relationships. The object-affinity subgraph is constructed using ? 1 sparse graph reconstruction with extracted structural meta-text, while the label-correlation subgraph captures pairwise correlations among labels via linear combination of their co-occurrence similarity and kernel-based similarity. A random walk with adaptive weight assignment is then performed on the constructed mixed graph to infer probabilistic assignment relationships between labels and objects. Extensive experiments on real Yahoo! Web datasets demonstrate the effectiveness of our approach.  相似文献   

3.
The schema matching problem can be defined as the task of finding semantic relationships between schema elements existing in different data repositories. Despite the existence of elaborated graphic tools for helping to find such matches, this task is usually manually done. In this paper, we propose a novel evolutionary approach to addressing the problem of automatically finding complex matches between schemas of semantically related data repositories. To the best of our knowledge, this is the first approach that is capable of discovering complex schema matches using only the data instances. Since we only exploit the data stored in the repositories for this task, we rely on matching strategies that are based on record deduplication (aka, entity-oriented strategy) and information retrieval (aka, value-oriented strategy) techniques to find complex schema matches during the evolutionary process. To demonstrate the effectiveness of our approach, we conducted an experimental evaluation using real-world and synthetic datasets. The results show that our approach is able to find complex matches with high accuracy, similar to that obtained by more elaborated (hybrid) approaches, despite using only evidence based on the data instances.  相似文献   

4.
On Graph Features of Semantic Web Schemas   总被引:1,自引:0,他引:1  
In this paper, we measure and analyze the graph features of semantic Web (SW) schemas with focus on power-law degree distributions. Our main finding is that the majority of SW schemas with a significant number of properties (respectively, classes) approximate a power law for total-degree (respectively, the number of subsumed classes) distribution. Moreover, our analysis revealed some emerging conceptual modeling practices of SW schema developers: (1) each schema has a few focal classes that have been analyzed in detail (that is, they have numerous properties and subclasses), which are further connected with focal classes defined in other schemas, (2) class subsumption hierarchies are mostly unbalanced (that is, some branches are deep and heavy, while others are shallow and light), (3) most properties have as domain/range classes that are located high at the class subsumption hierarchies, and (4) the number of recursive/multiple properties is significant. The knowledge of these features is essential for guiding synthetic SW schema generation, which is an important step toward benchmarking SW repositories and query language implementations.  相似文献   

5.
A graph G is the k-leaf power of a tree T if its vertices are leaves of T such that two vertices are adjacent in G if and only if their distance in T is at most k. Then T is the k-leaf root of G. This notion was introduced and studied by Nishimura, Ragde, and Thilikos motivated by the search for underlying phylogenetic trees. Their results imply a O(n3) time recognition algorithm for 3-leaf powers. Later, Dom, Guo, Hüffner, and Niedermeier characterized 3-leaf powers as the (bull, dart, gem)-free chordal graphs. We show that a connected graph is a 3-leaf power if and only if it results from substituting cliques into the vertices of a tree. This characterization is much simpler than the previous characterizations via critical cliques and forbidden induced subgraphs and also leads to linear time recognition of these graphs.  相似文献   

6.
Provenance has become increasingly important in scientific workflows to understand, verify, and reproduce the result of scientific data analysis. Most existing systems store provenance data in provenance stores with proprietary provenance data models and conduct query processing over the physical provenance storages using query languages, such as SQL, SPARQL, and XQuery, which are closely coupled to the underlying storage strategies. Querying provenance at such low level leads to poor usability of the system: a user needs to know the underlying schema to formulate queries; if the schema changes, queries need to be reformulated; and queries formulated for one system will not run in another system. In this paper, we present OPQL, a provenance query language that enables the querying of provenance directly at the graph level. An OPQL query takes a provenance graph as input and produces another provenance graph as output. Therefore, OPQL queries are not tightly coupled to the underlying provenance storage strategies. Our main contributions are: (i) we design OPQL, including six types of graph patterns, a provenance graph algebra, and OPQL syntax and semantics, that supports querying provenance at the graph level; (ii) we implement OPQL using a Web service via our OPMProv system; therefore, users can invoke the Web service to execute OPQL queries in a provenance browser, called OPMProVis. The result of OPQL queries is displayed as a provenance graph in OPMProVis. An experimental study is conducted to evaluate the feasibility and performance of OPMProv on OPQL provenance querying.  相似文献   

7.
Web超链挖掘:中国境内Web图结构研究   总被引:4,自引:0,他引:4  
丁国栋  王斌  白硕 《计算机工程》2005,31(14):24-26
以网站作为Web图的顶点,以网站之间链接为有向边,研究了中国境内Web图的拓扑特点和宏观结构。试验表明:网站的入度和出度分布同样服从幂级数定律(Power Law);境内Web图的连通性明显高于全球的Web图,其最大的强连通分量中的网站数超过50%;在境内Web中,如果两个网站之间存在一条有向路径,则从一个网站漫游到另外一个网站,平均只需点击71次,最多只需点击29次。  相似文献   

8.
9.
10.
Despite advances in machine learning technologies a schema matching result between two database schemas (e.g., those derived from COMA++) is likely to be imprecise. In particular, numerous instances of ??possible mappings?? between the schemas may be derived from the matching result. In this paper, we study problems related to managing possible mappings between two heterogeneous XML schemas. First, we study how to efficiently generate possible mappings for a given schema matching task. While this problem can be solved by existing algorithms, we show how to improve the performance of the solution by using a divide-and-conquer approach. Second, storing and querying a large set of possible mappings can incur large storage and evaluation overhead. For XML schemas, we observe that their possible mappings often exhibit a high degree of overlap. We hence propose a novel data structure, called the block tree, to capture the commonalities among possible mappings. The block tree is useful for representing the possible mappings in a compact manner and can be efficiently generated. Moreover, it facilitates the evaluation of a probabilistic twig query (PTQ), which returns the non-zero probability that a fragment of an XML document matches a given query. For users who are interested only in answers with k-highest probabilities, we also propose the top-k PTQ and present an efficient solution for it. An extensive evaluation on real-world data sets shows that our approaches significantly improve the efficiency of generating, storing, and querying possible mappings.  相似文献   

11.
The capabilities of XSLT processing are widely used to transform XML documents into target XML documents. These target XML documents conform to output schemas of the used XSLT stylesheet. Output schemas of XSLT stylesheets can be used for a static analysis of the used XSLT stylesheet, to automatically detect the XSLT stylesheet of target XML documents or to reason on the output schema without access to the target XML documents. In this paper, we develop an approach to automatically determining the output schema of an XSLT stylesheet. We also describe several application scenarios of output schemas. The experimental evaluation shows that our prototype can determine the output schemas of nearly all typical XSLT stylesheets and the improvements in preciseness in several application scenarios when using output schemas in comparison to when not using output schemas.  相似文献   

12.
Matching query interfaces is a crucial step in data integration across multiple Web databases. The problem is closely related to schema matching that typically exploits different features of schemas. Relying on a particular feature of schemas is not sufficient. We propose an evidential approach to combining multiple matchers using Dempster–Shafer theory of evidence. First, our approach views the match results of an individual matcher as a source of evidence that provides a level of confidence on the validity of each candidate attribute correspondence. Second, it combines multiple sources of evidence to get a combined mass function that represents the overall level of confidence, taking into account the match results of different matchers. Our combination mechanism does not require the use of weighing parameters, hence no setting and tuning of them is needed. Third, it selects the top k attribute correspondences of each source attribute from the target schema based on the combined mass function. Finally it uses some heuristics to resolve any conflicts between the attribute correspondences of different source attributes. Our experimental results show that our approach is highly accurate and effective.  相似文献   

13.
The PiDuce project comprises a programming language and a distributed runtime environment devised for experimenting Web services technologies by relying on solid theories about process calculi and formal languages for XML documents and schemas.The language features values and datatypes that extend XML documents and schemas with channels, an expressive type system with subtyping, a pattern matching mechanism for deconstructing XML values, and control constructs that are based on Milner’s asynchronous pi calculus. The runtime environment supports the execution of PiDuce processes over networks by relying on state-of-the-art technologies, such as XML schema and WSDL, thus enabling interoperability with existing Web services.We thoroughly describe the PiDuce project: the programming language and its semantics, the architecture of the distributed runtime and its implementation.  相似文献   

14.
Dynamic composition and optimization of Web services   总被引:1,自引:0,他引:1  
Process-based composition of Web services has recently gained significant momentum for the implementation of inter-organizational business collaborations. In this approach, individual Web services are choreographed into composite Web services whose integration logics are expressed as composition schema. In this paper, we present a goal-directed composition framework to support on-demand business processes. Composition schemas are generated incrementally by a rule inference mechanism based on a set of domain-specific business rules enriched with contextual information. In situations where multiple composition schemas can achieve the same goal, we must first select the best composition schema, wherein the best schema is selected based on the combination of its estimated execution quality and schema quality. By coupling the dynamic schema creation and quality-driven selection strategy in one single framework, we ensure that the generated composite service comply with business rules when being adapted and optimized.  相似文献   

15.
16.
在给定关系模式的属性集及其函数依赖最小覆盖集的基础上,提出一种基于模式图的规范化XML模式设计方法。定义了模式图,在模式图中增加了Keys的描述信息,给出由函数依赖集构造模式图的算法。该模式图独立于具体的XML模式语言,经分析证明,所设计的模式满足XNF。  相似文献   

17.
在光怪陆离的都市夜生活里,有一群人仅仅需要平静的夜,美剧就是他们的精神慰藉。而为他们默默工作的,是暖黄灯光下的魅酷。  相似文献   

18.
The advancement of World Wide Web has revolutionized the way the manufacturers can do business. The manufacturers can collect customer preferences for products and product features from their sales and other product-related Web sites to enter and sustain in the global market. For example, the manufactures can make intelligent use of these customer preference data to decide on which products should be selected for targeted marketing. However, the selected products must attract as many customers as possible to increase the possibility of selling more than their respective competitors. This paper addresses this kind of product selection problem. That is, given a database of existing products P from the competitors, a set of company’s own products Q, a dataset C of customer preferences and a positive integer k, we want to find k-most promising products (k-MPP) from Q with maximum expected number of total customers for targeted marketing. We model k-MPP query and propose an algorithmic framework for processing such query and its variants. Our framework utilizes grid-based data partitioning scheme and parallel computing techniques to realize k-MPP query. The effectiveness and efficiency of the framework are demonstrated by conducting extensive experiments with real and synthetic datasets.  相似文献   

19.
一种有效的贪婪模式匹配算法   总被引:2,自引:0,他引:2  
模式匹配问题是意图获得两个模式中所包含个体对象之间的语义匹配和映射,其结果表示源模式的个体对象与目标模式的个体对象之间存在特定的语义关联.它在数据库应用领域起到关键性的作用,例如数据集成、电子商务、数据仓库、XML消息交换等,特别地,它已成为元数据管理的基本问题.然而,模式匹配很大程度上依赖人工的操作,是一个费时费力的过程.模式匹配问题可以归约为一个组合优化问题:多标记图匹配问题.首先,将模式表示为多标记图,将模式匹配转换为多标记图匹配问题.其次,提出多标记图的相似性度量方法,进而提出基于多标记图相似性的模式匹配目标优化函数.最后,在这个目标函数基础上设计实现了一个贪婪匹配算法,其最显著的特点是综合多种可用的标记信息,灵活准确地获得最优的匹配结果.  相似文献   

20.
Suppose the vertices of a graph G were labeled arbitrarily by positive integers, and let S(v) denote the sum of labels over all neighbors of vertex v. A labeling is lucky if the function S is a proper coloring of G, that is, if we have S(u)≠S(v) whenever u and v are adjacent. The least integer k for which a graph G has a lucky labeling from the set {1,2,…,k} is the lucky number of G, denoted by η(G).Using algebraic methods we prove that η(G)?k+1 for every bipartite graph G whose edges can be oriented so that the maximum out-degree of a vertex is at most k. In particular, we get that η(T)?2 for every tree T, and η(G)?3 for every bipartite planar graph G. By another technique we get a bound for the lucky number in terms of the acyclic chromatic number. This gives in particular that for every planar graph G. Nevertheless we offer a provocative conjecture that η(G)?χ(G) for every graph G.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号