1.

Discovering context-aware conditional functional dependencies

Yuefeng Du Derong Shen Tiezheng Nie Yue Kou Ge Yu 《Frontiers of Computer Science》2017,11(4):688-701

Conditional functional dependencies (CFDs) are an important technique for data consistency. However, CFDs are limited in their ability to 1) provide reasonable values for consistency repairing and 2) detect potential errors. This paper presents context-aware conditional functional dependencies (CCFDs), which help provide reasonable values and detect potential errors. In particular, we focus on automatically discovering minimal CCFDs. We present context relativity to measure the relationship between CFDs. The overlap of related CFDs can provide reasonable values, which results in more accurate consistency repairing, and some related CFDs are combined into CCFDs. Moreover, we prove that discovering minimal CCFDs is NP-complete, and we design a precise method and a heuristic method. We also present the dominating value to facilitate the process in both methods. Additionally, the context relativity of the CFDs affects the cleaning results, so we suggest an approximate threshold of context relativity according to the data distribution. Our empirical evaluation shows that the repairing results are more accurate.

2.

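A Fast Method for Determining Repair Sequences Based on CFD Rules

Huan Wang Yunfeng Zhang Yan Zhang 《计算机科学》 (Computer Science) 2018,45(3):311-316

Data consistency is an important topic in big-data quality management. Conditional functional dependencies (CFDs) are an effective technique for maintaining data consistency. However, the order in which CFDs are applied during repair affects both the accuracy and the efficiency of the repair, so choosing a correct and reasonable repair order is critical. To address this problem, this paper proposes a method for quickly determining the repair sequence based on CFD rules. First, a data repair framework is designed. Then, using the relationships among CFDs, the concept of a repair sequence graph is introduced to compute the CFD repair order. On the one hand, this avoids some erroneous or unnecessary repairs and improves repair accuracy. On the other hand, determining the repair order from the rules is faster than determining it from the actual data. In addition, repair deadlocks are detected while the repair sequence is being determined, which guarantees that the repair process terminates. Finally, comparative experiments against existing methods on real datasets show that the proposed method is more accurate and more efficient.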
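
The idea of ordering CFD repairs from the rules alone, and flagging deadlocks along the way, can be illustrated with a small sketch. The abstract does not define the repair sequence graph, so the edge rule below (CFD A precedes CFD B when A's right-hand attribute appears in B's left-hand side) is an assumption made for illustration, as are the function name `repair_order` and the example rules. A cycle in this assumed graph is treated as a repair deadlock.

```python
from collections import deque


def repair_order(cfds):
    """Order CFDs for repair and report deadlocks.

    cfds maps a rule name to (set of LHS attributes, RHS attribute).
    Assumed edge rule (not taken from the paper): A -> B when the
    attribute repaired by A is read by the left-hand side of B.
    Returns (order, deadlocked), where deadlocked lists the rules that
    sit on or behind a cycle and therefore cannot be ordered.
    """
    names = sorted(cfds)
    successors = {n: [] for n in names}
    indegree = {n: 0 for n in names}
    for a in names:
        for b in names:
            if a != b and cfds[a][1] in cfds[b][0]:
                successors[a].append(b)
                indegree[b] += 1

    # Kahn's algorithm: repeatedly emit rules with no pending predecessor.
    queue = deque(n for n in names if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)

    deadlocked = [n for n in names if n not in order]
    return order, deadlocked


# Hypothetical rules: zip -> city, city -> state, state -> country.
chain = {
    "phi1": ({"zip"}, "city"),
    "phi2": ({"city"}, "state"),
    "phi3": ({"state"}, "country"),
}
# Hypothetical rules that feed each other: x -> y and y -> x.
loop = {"a": ({"x"}, "y"), "b": ({"y"}, "x")}

chain_result = repair_order(chain)
loop_result = repair_order(loop)
```

Because the order is computed from the rules only, its cost depends on the number of CFDs rather than on the number of tuples, which matches the abstract's point that rule-based ordering is faster than data-based ordering.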

3.

Sampling from repairs of conditional functional dependency violations

George Beskales Ihab F. Ilyas Lukasz Golab Artur Galiullin 《The VLDB Journal The International Journal on Very Large Data Bases》2014,23(1):103-128

Violations of functional dependencies (FDs) and conditional functional dependencies (CFDs) are common in practice, often indicating deviations from the intended data semantics. These violations arise in many contexts such as data integration and Web data extraction. Resolving these violations is challenging for a variety of reasons, one of them being the exponential number of possible repairs. Most of the previous work has tackled this problem by producing a single repair that is nearly optimal with respect to some metric. In this paper, we propose a novel data cleaning approach that is not limited to finding a single repair, namely sampling from the space of possible repairs. We give several motivating scenarios where sampling from the space of CFD repairs is desirable, we propose a new class of useful repairs, and we present an algorithm that randomly samples from this space in an efficient way. We also show how to restrict the space of repairs based on constraints that reflect the accuracy of different parts of the database. We experimentally evaluate our algorithms against previous approaches to show the utility and efficiency of our approach.

4.

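Research on Database Consistency Detection Based on Conditional Functional Dependencies

Yinrong Geng Bo Liu 《计算机工程与应用》 (Computer Engineering and Applications) 2012,48(3):122-125

Conditional functional dependencies are a semantic extension of functional dependencies. They can be applied to data cleaning and are widely used in repairing database consistency. This paper discusses the semantic rules of conditional functional dependencies, focuses on detecting tuples that violate database consistency based on conditional functional dependencies, and introduces a confidence evaluation mechanism to improve the detection rules. The improved detection method shows advantages when detection is based on multiple functional dependencies: the detection work is more streamlined and the detection criteria are clearer.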

5.

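Conditional Functional Dependencies and Their Application in Domain-Independent Data Cleaning

Jianchang Zhou Yuanyuan Bu 《微型电脑应用》 (Microcomputer Applications) 2012,28(9):23-26,30

A conditional functional dependency (CFD) extends a functional dependency (FD) with semantic constraints, and it outperforms the FD in database consistency detection and data cleaning. This paper discusses the concepts and basic properties of CFDs and how to apply them to data cleaning, proposes improvements to existing CFD-based data cleaning schemes, and demonstrates the feasibility of the improvements through experiments.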

6.

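Conditional Dependency Theory and Its Application Prospects

Yanli Hu Weiming Zhang 《计算机科学》 (Computer Science) 2009,36(12):115-118

This paper introduces the theory of conditional functional dependencies and how it is used to detect inconsistent data. It first introduces the concept of conditional functional dependencies and their inference system, and how view normalization is achieved through dependency propagation. It then describes the consistency and implication decision problems for conditional functional dependencies and, on that basis, introduces techniques for detecting data consistency in relational databases using conditional functional dependencies. Finally, it discusses extensions and applications of conditional functional dependencies.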
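
As an illustration of how a CFD can flag inconsistent tuples, here is a minimal sketch. It assumes a single CFD written as (lhs -> rhs, pattern), where the pattern holds either a constant or the wildcard `_` for each attribute. The function name `cfd_violations` and the sample customer rows are hypothetical and do not come from the surveyed paper.

```python
from collections import defaultdict

WILDCARD = "_"


def matches(row, attrs, pattern):
    """True if the row agrees with every constant of the pattern on attrs."""
    return all(pattern[a] == WILDCARD or row[a] == pattern[a] for a in attrs)


def cfd_violations(rows, lhs, rhs, pattern):
    """Return the sorted indices of rows that violate (lhs -> rhs, pattern).

    Two kinds of violation are checked among rows matching the LHS pattern:
    - a single row whose RHS value differs from a constant RHS pattern;
    - a group of rows that agree on lhs but disagree on rhs.
    """
    violating = set()
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        if not matches(row, lhs, pattern):
            continue  # the CFD only constrains rows selected by the pattern
        if pattern[rhs] != WILDCARD and row[rhs] != pattern[rhs]:
            violating.add(i)
        groups[tuple(row[a] for a in lhs)].append(i)
    for members in groups.values():
        if len({rows[i][rhs] for i in members}) > 1:
            violating.update(members)
    return sorted(violating)


# Hypothetical data: within country code 44, the zip code determines the street.
rows = [
    {"CC": "44", "ZIP": "EH4 1DT", "STR": "Mayfield"},
    {"CC": "44", "ZIP": "EH4 1DT", "STR": "Crichton"},
    {"CC": "01", "ZIP": "07974", "STR": "Tree Ave."},
    {"CC": "01", "ZIP": "07974", "STR": "Elm Str."},
]
pattern = {"CC": "44", "ZIP": WILDCARD, "STR": WILDCARD}
violations = cfd_violations(rows, ["CC", "ZIP"], "STR", pattern)
```

Only the first two rows are reported. The last two rows also disagree on the street, but they fall outside the condition CC = 44, which is the difference between a CFD and a plain functional dependency on the same attributes.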

7.

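A Functional Dependency Algorithm Based on Water Conservancy Census Data

Zhenxing Qian Dingsheng Wan Shijin Li Xifeng Cheng 《计算机与现代化》 (Computer and Modernization) 2014,(8):96-100

Conditional functional dependencies (CFDs) are widely used in database consistency detection. To check the consistency of water conservancy census data, this paper, based on the characteristics of that data, divides the census data into two parts, measures and dimensions, clusters the measure data, introduces the concept of conditional functional dependencies and redefines it, and improves the algorithm for discovering conditional functional dependencies (the CTANE algorithm). Using reservoir engineering data as an example, the paper verifies that the improved algorithm can accurately and efficiently discover the conditional functional dependencies in water conservancy census data, preparing the ground for data consistency detection.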

8.

Consistent data for inconsistent XML document

《Information and Software Technology》2007,49(9-10):947-959

An XML document may contain inconsistencies that violate predefined integrity constraints, which causes the data inconsistency problem. In this paper, we consider how to get consistent data from an inconsistent XML document. There are two basic concepts for this problem: a repair is data that is consistent with the integrity constraints and minimally differs from the original document, and consistent data is the data common to every possible repair. First we give a general constraint model for XML, which can express the commonly discussed integrity constraints, including functional dependencies, keys and multivalued dependencies. Next we provide a repair framework for inconsistent XML documents with three basic update operations: node insertion, node deletion and node value modification. Following this approach, we introduce the concept of repair for inconsistent XML documents, discuss the chase method to generate repairs, and prove some important properties of the chase. Finally we give a method to obtain the greatest lower bound of all possible repairs, which is sufficient for consistent data. We also implement prototypes of our method, and evaluate our framework and algorithms experimentally.

9.

Incorporating cardinality constraints and synonym rules into conditional functional dependencies

Wenguang Chen Shuai Ma 《Information Processing Letters》2009,109(14):783-789

We propose an extension of conditional functional dependencies (CFDs), denoted by CFDcs, to express cardinality constraints, domain-specific conventions, and patterns of semantically related constants in a uniform constraint formalism. We show that despite the increased expressive power, the satisfiability and implication problems for CFDcs remain NP-complete and coNP-complete, respectively, the same as their counterparts for CFDs. We also identify tractable special cases.

10.

EntityManager: Managing Dirty Data Based on Entity Resolution

Xue-Li Liu Hong-Zhi Wang Jian-Zhong Li Hong Gao 《计算机科学技术学报》 (Journal of Computer Science and Technology) 2017,32(3):644-662

Data quality is important in many data-driven applications, such as decision making, data analysis, and data mining. Recent studies focus on data cleaning techniques that delete or repair the dirty data, which may cause information loss and introduce new inconsistencies. To avoid these problems, we propose EntityManager, a general system to manage dirty data without data cleaning. This system takes the real-world entity as the basic storage unit and retrieves query results according to the quality requirements of users. The system is able to handle all kinds of inconsistencies recognized by entity resolution. We elaborate on the EntityManager system, covering its architecture, data model, and query processing techniques. To process queries efficiently, our system adopts novel indices, a similarity operator, and query optimization techniques. Finally, we verify the efficiency and effectiveness of this system and present future research challenges.

11.

Approximate dependencies in database systems

Aditya N. Saharia Terence M. Barron 《Decision Support Systems》1995,13(3-4)

Functional dependencies are the most commonly used approach for capturing real-world integrity constraints which are to be reflected in a database. There are, however, many useful kinds of constraints, especially approximate ones, that cannot be represented correctly by functional dependencies and therefore are enforced via programs which update the database, if they are enforced at all. This tends to make such constraints invisible since they are not an explicit part of the database, increasing maintenance problems and the likelihood of inconsistencies. We propose a new approach, cluster dependencies, as a way to enforce approximate dependencies. By treating equality as a fuzzy concept and defining appropriate similarity measures, it is possible to represent a broad range of approximate constraints directly in the database by storing and accessing cluster definitions. We discuss different interpretations of cluster dependencies and describe the additional data structures needed to enforce them. We also contrast them with an existing approach, fuzzy functional dependencies, which are much more limited in the kind of approximate constraints they can represent.

12.

Disjunctive databases for representing repairs

Cristian Molinaro Jan Chomicki Jerzy Marcinkowski 《Annals of Mathematics and Artificial Intelligence》2009,57(2):103-124

This paper addresses the problem of representing the set of repairs of a possibly inconsistent database by means of a disjunctive database. Specifically, the class of denial constraints is considered. We show that, given a database and a set of denial constraints, there exists a (unique) disjunctive database, called canonical, which represents the repairs of the database w.r.t. the constraints and is contained in any other disjunctive database with the same set of minimal models. We propose an algorithm for computing the canonical disjunctive database. Finally, we study the size of the canonical disjunctive database in the presence of functional dependencies for both subset-based repairs and cardinality-based repairs.

13.

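Repairing Inconsistency in Relational Data Based on the Possible World Model

Yaoli Xu Zhanhuai Li Qun Chen Ping Zhong 《软件学报》 (Journal of Software) 2016,27(7):1685-1699

Although various methods have been proposed for repairing inconsistency in relational data, these repair strategies analyze only the attributes involved in the functional dependencies (that is, part of the information in the dataset) when building the final repair, and they favor the repair with the minimum cost, ignoring the other attributes of the dataset and the correlation between those attributes and the attributes involved in the functional dependencies. To address this, this paper proposes an inconsistency repair method based on the possible world model. It first constructs the possible repairs, then quantifies the credibility of each candidate repair in terms of both repair cost and attribute-value correlation, and finally finds the optimal repair. Experimental results show that the proposed method achieves better repair quality than existing cost-based repair methods. We also analyze how the error rate and different types of probability quantification affect the proposed method.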

14.

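A Probabilistic Coverage Enhancement Algorithm for Wireless Sensor Networks

Xinggang Fan Jingjing Yang Heng Wang 《软件学报》 (Journal of Software) 2016,27(2):418-431

Coverage and connectivity are fundamental problems in wireless sensor networks. This paper studies a probabilistic coverage enhancement algorithm that takes connectivity into account, constructs the patching radius of a coverage hole, and proposes a model relating moving distance to patching radius. Using this model, a mobile node selects a patching position on the patching circle that preserves connectivity. Based on the moving distance and the hole area, the mobile node further assigns priorities to the holes and patches the hole with the highest priority, achieving coverage enhancement in an energy-saving and efficient way. Simulation results show that the proposed algorithm achieves a high coverage rate while guaranteeing the connectivity of the whole network.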

15.

Consistent query answers from virtually integrated XML data

Zijing Tan Chengfei Liu 《Journal of Systems and Software》2010,83(12):2566-2578

When data sources are virtually integrated, there is no common and centralized method to maintain global consistency, so inconsistencies with regard to global integrity constraints are very likely to occur. In this paper, we consider the problem of defining and computing consistent query answers when queries are posed to virtual XML data integration systems, which are specified following the local-as-view approach. We propose a powerful XML constraint model to define global constraints, which can express keys and functional dependencies, and which also extends the newly introduced conditional functional dependencies to XML. We provide an approach to defining XML views, which supports not only edge-path mappings but also data-value bindings to express the join operator. We give formal definitions of repair and consistent query answers in the XML data integration setting. Given a query on the global system, we present a two-step method to compute consistent query answers. First, the given query is transformed using the global constraints, such that running the transformed query on the original global system generates exactly the consistent query answers. Because the global instance is not materialized, the query on the global instance is then rewritten in the form of queries on the underlying data sources by reversing rules in view definitions. We illustrate that the XPath query transformations can be implemented in XQuery. Finally, we implement prototypes of our method and evaluate our algorithms in the experiments.

16.

Repairing XML functional dependency violations

Zijing Tan Liyong Zhang 《Information Sciences》2011,181(23):5304-5320

We study the problem of repairing XML functional dependency violations by making the smallest modifications in terms of repair cost. Our cost model assigns a weight to each leaf node in the XML document, and the cost of a repair is measured by the total weight of the modified nodes. We define an optimum repair as the repair with the minimum cost among all of the repairs. We prove lower and upper bounds for the optimum XML repair problem. We show that, in practice, it is beyond reach to find the optimum repairs; this problem is already NP-complete for a setting with a fixed DTD, a fixed set of functional dependencies, and equal weights for all of the nodes in the XML document. Instead, we provide an efficient two-step heuristic method to repair XML functional dependency violations. First, the initial violations are captured and fixed by leveraging the conflict hypergraph. Second, the remaining conflicts are resolved by modifying the violating nodes and their related nodes called determinants in a way that guarantees no new violations. We implement our method and evaluate it on synthetic and real-life data. The experimental results demonstrate that our algorithm scales well and is effective at improving data quality.

17.

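Discovering Functional Dependencies in University Data Based on Affinity Propagation Clustering and the TANE Algorithm

Yongxin Huang Xuefei Tang 《计算机应用》 (Journal of Computer Applications) 2020,40(1):90-95

To address the problems that datasets contain missing values and that few, inaccurate functional dependencies are discovered during quality checks of real university data, a functional dependency discovery method for university data (APTANE) that combines the affinity propagation (AP) clustering algorithm with the TANE algorithm is proposed. First, column profiling is performed on the Chinese-language fields in the dataset, and the Chinese field values are represented by corresponding numeric values. Second, the AP clustering algorithm is used to fill in the missing values in the dataset. Finally, the TANE algorithm is used to automatically discover non-trivial, minimal functional dependencies from the processed dataset. Experimental results show that, after the AP clustering algorithm is used to repair a real university dataset, the number of discovered functional dependencies increases to 80 compared with applying the automatic functional dependency discovery algorithm directly. The functional dependencies discovered after missing-value imputation also express the relationships between fields more accurately, which reduces the workload of domain experts and improves the quality of the university's data.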
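
The final step relies on TANE, which tests a candidate dependency X -> A by comparing partitions: the dependency holds exactly when grouping the rows by X yields as many groups as grouping them by X together with A. A minimal sketch of that test follows. It assumes missing values have already been filled in, it leaves out TANE's level-wise search and pruning, and the names `fd_holds` and the sample records are hypothetical.

```python
def partition_size(rows, attrs):
    """Number of distinct value combinations over attrs (size of the partition)."""
    return len({tuple(row[a] for a in attrs) for row in rows})


def fd_holds(rows, lhs, rhs):
    """True if lhs -> rhs holds in rows.

    Adding rhs to the grouping attributes can only split groups further,
    so equal partition sizes mean every lhs group has a single rhs value.
    """
    return partition_size(rows, lhs) == partition_size(rows, lhs + [rhs])


# Hypothetical university records.
records = [
    {"dept": "CS", "building": "A", "student": "s1"},
    {"dept": "CS", "building": "A", "student": "s2"},
    {"dept": "EE", "building": "B", "student": "s3"},
]

dept_to_building = fd_holds(records, ["dept"], "building")
building_to_student = fd_holds(records, ["building"], "student")
```

A missing value treated as its own distinct value would split a group and make a true dependency appear to fail, which is one way to read the abstract's motivation for imputing values before running TANE.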

18.

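An Editing Rule Mining Algorithm Based on Input Samples and Master Data

Hui Yang Shoujian Yu Shaozong Chen 《计算机系统应用》 (Computer Systems & Applications) 2017,26(4):162-168

Data repair techniques based on editing rules and master data can repair inconsistent data automatically and with certainty, but at present editing rules are obtained mainly through definition by specialists. To fully automate data cleaning, techniques for mining data rules have become a research focus in recent years; the main mining algorithms proposed for conditional functional dependencies are CFDMiner, CTANE, and FastCFD. Building on these, this paper extends the definition of conditional functional dependencies (CFDs) and, under the definition of editing rules, proposes an editing rule mining algorithm based on input samples and master data. The main idea is to mine CFDs from the input samples and then, using the domain similarity between attributes of the input samples and of the master data, find the attributes in the master data that correspond to those of the input samples, thereby forming editing rules with pattern tuples. The algorithm mines editing rules effectively, and the mined rules can be used to repair data effectively according to the semantics of editing rules.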

19.

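Research on Conflict Detection Techniques in Distributed Databases

Zhiping Zhong Xiaohui Zhong 《微机发展》 (Microcomputer Development) 2012,(1):217-220,224

Data conflicts are one of the central data quality problems in databases. In a centralized database, SQL-based techniques can effectively detect the tuples that violate a given set of conditional functional dependencies. However, when the data in the database is partitioned horizontally or vertically and distributed across different sites, detecting data conflicts becomes more challenging and often requires shipping data from one site to another. This paper proposes an algorithm for detecting conditional functional dependency conflicts in distributed databases. The algorithm not only detects conditional functional dependency conflicts in horizontally partitioned data effectively but also reduces data transmission. Experimental results confirm that the algorithm is effective.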
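
For the centralized case, SQL-based detection can be sketched as a grouping query run through Python's built-in `sqlite3` module. The table `cust`, its columns, and the CFD used here ([cc, zip] -> street, with cc fixed to '44') are assumptions chosen for illustration; the distributed algorithm proposed in the paper is not reproduced.

```python
import sqlite3

# In-memory database holding a hypothetical customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust (cc TEXT, zip TEXT, street TEXT)")
conn.executemany(
    "INSERT INTO cust VALUES (?, ?, ?)",
    [
        ("44", "EH4 1DT", "Mayfield"),
        ("44", "EH4 1DT", "Crichton"),
        ("44", "W1B 1JA", "Regent St"),
        ("01", "07974", "Tree Ave."),
        ("01", "07974", "Elm Str."),
    ],
)

# The WHERE clause applies the CFD's condition (cc = '44'); the HAVING
# clause keeps zip codes that map to more than one street.
query = """
    SELECT zip
    FROM cust
    WHERE cc = '44'
    GROUP BY zip
    HAVING COUNT(DISTINCT street) > 1
    ORDER BY zip
"""
conflicting_zips = [row[0] for row in conn.execute(query)]
conn.close()
```

The query needs all rows that share a zip code to be visible in one place. When `cust` is split across sites, that grouping is what forces data to be shipped, which is the cost the abstract says the distributed algorithm tries to reduce.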

20.

An Efficient Method for Cleaning Dirty-Events over Uncertain Data in WSNs

Mo Chen Ge Yu Yu Gu Zixi Jia Yanqiu Wang 《计算机科学技术学报》 (Journal of Computer Science and Technology) 2011,26(6):942-953

Event detection in wireless sensor networks (WSNs) has attracted much attention due to its importance in many applications. The erroneous abnormal data generated during event detection are prone to lead to false detection results. Therefore, in order to improve the reliability of event detection, we propose a dirty-event cleaning method based on spatio-temporal correlations among sensor data. Unlike traditional fault-tolerant approaches, our method takes into account the inherent uncertainty of sensor measurements and focuses on directional events. A probability-based mapping scheme is introduced, which maps uncertain sensor data into binary data. Moreover, we give formal definitions of transient dirty-event (TDE) and permanent dirty-event (PDE), which are cleaned by a novel fuzzy method and a collaborative cleaning scheme, respectively. Extensive experimental results show the effectiveness of our dirty-event cleaning method.