1.
《Theoretical computer science》2005,336(1):89-124
Data exchange is the problem of taking data structured under a source schema and creating an instance of a target schema that reflects the source data as accurately as possible. In this paper, we address foundational and algorithmic issues related to the semantics of data exchange and to the query answering problem in the context of data exchange. These issues arise because, given a source instance, there may be many target instances that satisfy the constraints of the data exchange problem. We give an algebraic specification that selects, among all solutions to the data exchange problem, a special class of solutions that we call universal. We show that a universal solution has no more and no less data than required for data exchange and that it represents the entire space of possible solutions. We then identify fairly general, yet practical, conditions that guarantee the existence of a universal solution and yield algorithms to compute a canonical universal solution efficiently. We adopt the notion of the “certain answers” in indefinite databases for the semantics for query answering in data exchange. We investigate the computational complexity of computing the certain answers in this context and also address other algorithmic issues that arise in data exchange. In particular, we study the problem of computing the certain answers of target queries by simply evaluating them on a canonical universal solution, and we explore the boundary of what queries can and cannot be answered this way, in a data exchange setting.
2.
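In database research, schema matching and entity resolution are two widely studied directions. As demand for Web data integration grows, studying both problems, at the schema level and at the entity level, is of practical significance. Most current work targets only one of the two tasks. Based on the new idea that schema matching can promote entity resolution, this paper proposes a method that solves both tasks simultaneously and realizes a mechanism by which each reinforces the other. Applied to real-world heterogeneous Web data sources, the method achieves high precision and recall, demonstrating its correctness and effectiveness.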
3.
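Integrity constraints are commonly used to define the semantics of data in a database. A database instance that violates the constraints is an inconsistent database, and a query that returns inconsistent results is called an inconsistent query. Consistent query answering aims to obtain constraint-satisfying query results from an inconsistent database without modifying the database instance; existing methods have limited applicability because they support few constraint types or have high computational complexity. This paper proposes a consistent query answering method based on null-value repair: the original integrity constraints are first converted into unified constraints relevant to the query, and the original SQL query is then rewritten according to these unified constraints. The rewritten query treats inconsistent attribute values as nulls so as to obtain results that satisfy the integrity constraints. The system implementation and experiments show that the method, across multiple types of integrity constraints and SQL...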
4.
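XML functional dependencies (XFDs) cannot adequately detect semantic data inconsistencies in local XML data sources. Drawing on the concept of conditional functional dependencies (CFDs) from relational databases, and taking into account the structure and constraint characteristics of XML itself, this paper proposes XML conditional functional dependencies (XCFDs) based on content-aware discovery (CAD). CAD uses the content hidden in data values to discover the XCFDs of local XML documents, checks data consistency across heterogeneous data sources, and improves data quality. A detailed algorithm is given, and a set of pruning rules is introduced to reduce the search lattice and the number of candidate XCFDs, improving the algorithm's efficiency and making the discovered XCFDs non-redundant and minimal. A case study shows that the XCFDs discovered by the CAD method capture more functional dependencies and semantic constraints than existing XFD discovery.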
5.
6.
7.
In the present article, some special semantic integrity constraints, so-called nondeterministic dependencies, are proposed. These dependencies can be regarded as stochastic extensions of functional dependencies. After some basic definitions, the concept of nondeterministic dependency is introduced. Examples are given and an implementation for a statistical analysis system is described. Some properties are discussed.
8.
Automatic speech recognition (ASR) has reached a very high level of performance in controlled situations. However, the performance degrades drastically when environmental noise occurs during recognition. Nowadays, the major challenge is to achieve good robustness to adverse conditions. Missing data recognition has been developed to deal with this challenge. Unlike other denoising methods, missing data recognition does not match the whole data with the acoustic models, but instead considers part of the signal as missing, i.e. corrupted by noise. The main challenge of this approach is to accurately identify the missing parts (also called masks). The work reported here focuses on this issue. We start by developing Bayesian models of the masks, where every spectral feature is classified as reliable or masked, and is assumed independent of the rest of the signal. This classification strategy results in sparse and isolated masked features, like the squares of a chess-board, while oracle reliable and unreliable features tend to be clustered into consistent time–frequency blocks. We then propose to take into account frequency and temporal dependencies in order to improve the masks’ estimation accuracy. Integrating such dependencies leads to a new architecture of a missing data mask estimator. The proposed classifier has been evaluated on the noisy Aurora2 (digits recognition) and Aurora4 (continuous speech) databases. Experimental results show a significant improvement of recognition accuracy when these dependencies are considered.
9.
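A Survey of Data Quality and Data Cleaning Research  Total citations: 75, self-citations: 1, citations by others: 75
This paper surveys research on data quality, and on data cleaning in particular. It first explains the importance of data quality and the metrics used to measure it, and defines the data cleaning problem. It then classifies data cleaning problems and analyzes approaches to solving them. Next, it describes how data cleaning research has been combined with other techniques and analyzes several data cleaning frameworks. Finally, it outlines future research problems in the field.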
10.
Omar Benjelloun Hector Garcia-Molina David Menestrina Qi Su Steven Euijong Whang Jennifer Widom 《The VLDB Journal The International Journal on Very Large Data Bases》2009,18(1):255-276
We consider the entity resolution (ER) problem (also known as deduplication, or merge–purge), in which records determined
to represent the same real-world entity are successively located and merged. We formalize the generic ER problem, treating
the functions for comparing and merging records as black-boxes, which permits expressive and extensible ER solutions. We identify
four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms. We
develop three efficient ER algorithms: G-Swoosh for the case where the four properties do not hold, and R-Swoosh and F-Swoosh
that exploit the four properties. F-Swoosh in addition assumes knowledge of the “features” (e.g., attributes) used by the
match function. We experimentally evaluate the algorithms using comparison shopping data from Yahoo! Shopping and hotel information
data from Yahoo! Travel. We also show that R-Swoosh (and F-Swoosh) can be used even when the four match and merge properties
do not hold, if an “approximate” result is acceptable.
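
The abstract above treats the match and merge functions as black boxes. A minimal Python sketch of an R-Swoosh-style loop under that view is given here for illustration only. The record representation (frozen sets of attribute/value pairs) and the helpers `share_email` and `union_merge` are hypothetical choices made for this sketch, not taken from the paper.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

# Hypothetical record type: a frozen set of (attribute, value) pairs.
Record = FrozenSet[Tuple[str, str]]


def r_swoosh(
    records: Iterable[Record],
    match: Callable[[Record, Record], bool],
    merge: Callable[[Record, Record], Record],
) -> List[Record]:
    """Resolve entities by repeatedly matching and merging records."""
    pending = list(records)      # records still to be processed
    resolved: List[Record] = []  # records with no match among themselves
    while pending:
        current = pending.pop()
        # Look for any already-resolved record that matches the current one.
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # Replace the pair by its merge and reprocess the merged record.
            resolved.remove(partner)
            pending.append(merge(current, partner))
    return resolved


def share_email(a: Record, b: Record) -> bool:
    """Illustrative match function: two records share an email value."""
    emails_a = {v for k, v in a if k == "email"}
    emails_b = {v for k, v in b if k == "email"}
    return bool(emails_a & emails_b)


def union_merge(a: Record, b: Record) -> Record:
    """Illustrative merge function: keep every attribute value seen."""
    return a | b
```

The loop terminates for this pair of functions because every merge reduces the total number of records by one. Whether other match and merge functions give a well-defined result depends on the properties the paper identifies, which this sketch does not check.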
11.
Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem  Total citations: 29, self-citations: 0, citations by others: 29
The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent equational theory that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting on each successive pass. Combining the results of individual passes using transitive closure over the independent results produces far more accurate results at lower cost. The system provides a rule programming module that is easy to program and quite good at finding duplicates, especially in an environment with massive amounts of data. This paper details improvements in our system, and reports on the successful implementation for a real-world database that conclusively validates our results previously achieved for statistically generated data.
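
The multi-pass idea in this abstract (sort on a different key in each pass, then combine the passes by transitive closure) can be sketched as follows. This is a rough illustration, not the authors' system: the fixed-size comparison window, the union-find structure used for the transitive closure, and the names `multi_pass_merge_purge`, `key_funcs`, and `is_match` are all assumptions of this sketch.

```python
from typing import Any, Callable, Dict, List, Sequence


def multi_pass_merge_purge(
    records: Sequence[Dict[str, Any]],
    key_funcs: Sequence[Callable[[Dict[str, Any]], Any]],
    is_match: Callable[[Dict[str, Any], Dict[str, Any]], bool],
    window: int = 3,
) -> List[List[int]]:
    """Return clusters of record indices judged to be the same entity."""
    parent = list(range(len(records)))  # union-find forest over indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for key in key_funcs:
        # One pass: sort on this key, compare only records that are close.
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):
            for j in order[pos + 1 : pos + window]:
                if is_match(records[i], records[j]):
                    union(i, j)  # union-find accumulates the closure

    clusters: Dict[int, List[int]] = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Each pass compares only records that sort near each other, so a duplicate whose sort key is corrupted can be missed in one pass and still be caught in another; the union-find structure then links pairs found in different passes.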
12.
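Existing query analysis methods usually treat entity resolution as an offline preprocessing step that cleans the entire data set. As data volumes keep growing, however, this computationally expensive offline cleaning mode can hardly meet the needs of real-time analytical applications. For aggregate queries over duplicated charging operation records, this paper proposes a method that combines approximate aggregate query processing with entity resolution. First, samples are collected using a block-based sampling strategy; then, an entity resolution method identifies the duplicate entities in the collected samples; finally, an unbiased estimate of the aggregate result is reconstructed from the entity resolution results. The method avoids the time cost of resolving all entities and returns query results that meet user requirements by resolving only a small amount of sample data. Experiments on real and synthetic data sets confirm its efficiency and reliability.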
13.
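Inconsistent data cannot correctly reflect the real world, and query results over it contain errors or contradictions; many existing studies on query processing over inconsistent data suffer from information loss. To address this, AQA (annotation based query answer) uses trust annotations to distinguish consistent from inconsistent data at the attribute level, avoiding information loss. However, AQA assumes that a record's components on the left-hand-side attributes of a dependency are trustworthy and handles only one kind of constraint, functional dependencies, which limits its applicability. This paper extends AQA to a combined set of constraints (functional dependencies, inclusion dependencies, and domain constraints) with arbitrary uncertain attributes, revisits AQA's data model and the query algebra over it, and discusses how to compute the constraints implied on query results by arbitrary constraints. Experimental results show that the extended AQA performs about the same as ordinary SQL on non-join queries, and that optimized join queries perform close to ordinary SQL queries, while AQA loses no information, a significant advantage over some comparable work.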
14.
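Inconsistent data cannot correctly reflect the real world, and query results over it contain errors or contradictions; many existing studies on query processing over inconsistent data suffer from information loss. To address this, AQA (annotation based query answer) uses trust annotations to distinguish consistent from inconsistent data at the attribute level, avoiding information loss. However, AQA assumes that a record's components on the left-hand-side attributes of a dependency are trustworthy and handles only one kind of constraint, functional dependencies, which limits its applicability. This paper extends AQA to a combined set of constraints (functional dependencies, inclusion dependencies, and domain constraints) with arbitrary uncertain attributes, revisits AQA's data model and the query algebra over it, and discusses how to compute the constraints implied on query results by arbitrary constraints. Experimental results show that the extended AQA performs about the same as ordinary SQL on non-join queries, and that optimized join queries perform close to ordinary SQL queries, while AQA loses no information, a significant advantage over some comparable work.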
15.
Leopoldo Bertossi Camilla Schwind 《Annals of Mathematics and Artificial Intelligence》2004,40(1-2):5-35
In this article, we characterize in terms of analytic tableaux the repairs of inconsistent relational databases, that is, databases that do not satisfy a given set of integrity constraints. For this purpose we provide closing and opening criteria for branches in tableaux that are built for database instances and their integrity constraints. We use the tableaux-based characterization as a basis for consistent query answering, that is, for retrieving from the database answers to queries that are consistent with respect to the integrity constraints.
16.
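In recent years, many researchers have proposed solutions to the entity matching problem for multi-source heterogeneous data. However, almost all of these methods perform entity matching within semantic frameworks such as RDFS or OWL and are therefore not general. Moreover, the mainstream approach to multi-source entity matching converts it into multiple pairwise matching problems between data sources; direct pairwise matching has excessive computational complexity and does not analyze the problem from a global, multi-source perspective. Motivated by these problems, this paper proposes an entity matching method that uses the name, attribute, and context information commonly present in entities to build multiple indexes, shrinking the computation space while generating a high-quality candidate set. It also defines a measure of entity similarity that effectively determines whether an entity pair matches. Based on the edge weights between entities and their mutual-exclusion relations, it further proposes a graph-partitioning-based optimization algorithm that partitions the entities into sets of equivalent entities. Experiments on real data crawled from the Internet, covering brand and person categories in the business domain, show that the method achieves good results.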
17.
In this paper, we introduce a concept of Annotation Based Query Answer, and a method for its computation, which can answer queries on relational databases that may violate a set of functional dependencies. In this approach, inconsistency is viewed as a property of data and described with annotations. To be more precise, every piece of data in a relation can have zero or more annotations with it and annotations are propagated along with queries from the source to the output. With annotations, inconsistent da...
18.
Temporal XML: modeling, indexing, and query processing  Total citations: 1, self-citations: 1, citations by others: 1
Flavio Rizzolo Alejandro A. Vaisman 《The VLDB Journal The International Journal on Very Large Data Bases》2008,17(5):1179-1212
In this paper we address the problem of modeling and implementing temporal data in XML. We propose a data model for tracking
historical information in an XML document and for recovering the state of the document as of any given time. We study the
temporal constraints imposed by the data model, and present algorithms for validating a temporal XML document against these
constraints, along with methods for fixing inconsistent documents. In addition, we discuss different ways of mapping the abstract
representation into a temporal XML document, and introduce TXPath, a temporal XML query language that extends XPath 2.0. In
the second part of the paper, we present our approach for summarizing and indexing temporal XML documents. In particular we
show that by indexing continuous paths, i.e., paths that are valid continuously during a certain interval in a temporal XML graph, we can dramatically increase
query performance. To achieve this, we introduce a new class of summaries, denoted TSummary, that adds the time dimension to the well-known path summarization schemes. Within this framework, we present two new summaries:
LCP and Interval summaries. The indexing scheme, denoted TempIndex, integrates these summaries with additional data structures. We give a
query processing strategy based on TempIndex and a type of ancestor-descendant encoding, denoted temporal interval encoding.
We present a persistent implementation of TempIndex, and a comparison against a system based on a non-temporal path index,
and one based on DOM. Finally, we sketch a language for updates, and show that the cost of updating the index is compatible
with real-world requirements.
19.
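An Adaptive Online Algorithm Based on Independent Component Analysis  Total citations: 1, self-citations: 1, citations by others: 1
Independent component analysis (ICA) is an efficient signal processing method that has emerged in recent years, and optimizing the learning step size is an important aspect of adaptive ICA. Based on the variable-step-size idea, this paper defines a similarity measure that describes the state of signal separation by measuring the degree of similarity between output components, and from it proposes an improved adaptive online algorithm. The algorithm adapts the step size according to the separation state reflected by the degree of similarity and establishes a nonlinear relationship between the learning step size and the change in the similarity measure, overcoming the inadequate step-size adaptation of traditional algorithms when the channel matrix changes. Performance-index analysis and simulation experiments demonstrate the algorithm's convergence and steady-state performance.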