Similar Articles
 20 similar articles found.
1.
In this paper, we consider functional dependencies among Boolean dependencies (BDs, for short). Armstrong relations are defined for BDs (called BD-Armstrong relations). For BDs, two necessary and sufficient conditions for the existence of BD-Armstrong relations are given. A necessary and sufficient condition for the existence of Armstrong relations for functional dependencies (FDs, for short) is given, which in some sense is more convenient than the condition given in [3]. We give an algorithm that decides whether two BDs imply the same set of functional dependencies. If the BDs are given in perfect disjunctive normal form, the algorithm requires only polynomial time. Although Mannila and Räihä have shown that for some relations exponential time is needed to compute any cover of the set of FDs holding in the relation, we show, as a consequence of our algorithm, that the problem of deciding whether two relations satisfy the same set of FDs can be solved in polynomial time. Another consequence is a new correspondence between families of functional dependencies and families of Sperner systems. By this correspondence, the estimate of the number of databases given previously in [6] is improved. It is shown that there is a one-to-one correspondence between the closure of the FDs that hold in a BD and its so-called basic cover. As applications of basic covers, we obtain representations of keys, of the family of minimal keys, and of canonical covers. This research was supported by the Hungarian Foundation for Scientific Research, Grant Nos. OTKA 2575, 2149.
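The polynomial-time equivalence test above rests on a simpler primitive: checking whether a single relation satisfies a given FD. A minimal sketch of that check in Python (the relation and attribute names are illustrative, not from the paper):

```python
def satisfies_fd(rows, lhs, rhs):
    """Check whether a relation (list of dicts) satisfies the FD lhs -> rhs:
    tuples that agree on all lhs attributes must agree on all rhs attributes."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

r = [{"A": 1, "B": "x", "C": 10},
     {"A": 1, "B": "x", "C": 10},
     {"A": 2, "B": "y", "C": 10}]
print(satisfies_fd(r, ["A"], ["B"]))  # True: equal A values imply equal B values
print(satisfies_fd(r, ["C"], ["A"]))  # False: C=10 occurs with both A=1 and A=2
```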

2.
Functional dependencies (FDs) and inclusion dependencies (INDs) convey most of the data semantics in relational databases and are very useful in practice, since they generalize keys and foreign keys. Nevertheless, FDs and INDs are often unavailable, obsolete, or lost in real-life databases. Several algorithms have been proposed for mining these dependencies, but the output is always in the same format: a plain list of dependencies that is hard for the user to understand. In this paper, we define informative Armstrong databases (IADBs) as small subsets of an existing database that satisfy exactly the same FDs and INDs. They extend the classical notion of Armstrong databases but are better suited to understanding dependencies, since their tuples are real-world tuples. The main result of this paper is a bound on the size of an IADB in the case of non-circular INDs. A constructive proof of this result is given, from which an algorithm has been devised. An implementation and experiments against a real-life database were performed; the obtained database contains only 0.6% of the initial database's tuples. More importantly, such semantic sampling of databases appears to be a key feature for understanding existing databases at the logical level.
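The dependencies an IADB must preserve include inclusion dependencies, which reduce to projection-containment tests. A rough sketch of that test, with made-up table and attribute names:

```python
def satisfies_ind(r, s, x_attrs, y_attrs):
    """Check the inclusion dependency R[X] in S[Y]: every projection of a
    tuple of r onto X must occur as a projection of some tuple of s onto Y."""
    targets = {tuple(t[a] for a in y_attrs) for t in s}
    return all(tuple(t[a] for a in x_attrs) in targets for t in r)

orders    = [{"cust_id": 1}, {"cust_id": 2}]
customers = [{"id": 1}, {"id": 2}, {"id": 3}]
# The foreign key orders.cust_id -> customers.id holds:
print(satisfies_ind(orders, customers, ["cust_id"], ["id"]))  # True
```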

3.
Inferring dependencies from relations: a conceptual clustering approach
In this paper we consider two related types of data dependencies that can hold in a relation: conjunctive implication rules between attribute-value pairs, and functional dependencies. We present a conceptual clustering approach that can be used, with small modifications, to infer a cover for both types of dependencies. The approach consists of two steps. First, a particular clustered representation of the relation, called a concept (or Galois) lattice, is built. Then, a cover is extracted from the lattice built in the first step. Our main emphasis is on the second step. We study the computational complexity of the proposed approach and present an experimental comparison with other methods that confirms its validity. The results of the experiments show that our algorithm for extracting implication rules from concept lattices clearly outperforms an earlier algorithm, and suggest that the overall lattice-based approach to inferring functional dependencies from relations can be seen as an alternative to traditional methods.
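The Galois connection underlying the concept lattice yields, for any attribute set X, the set of attributes that X functionally determines in the relation instance. A minimal sketch of that closure operator (data and names are illustrative):

```python
def closure_in_relation(rows, attrs, x):
    """Attributes functionally determined by x in this relation instance:
    A is in the closure iff every pair of tuples that agrees on x also
    agrees on A (the Galois closure of x over the relation's agree sets)."""
    closed = set(attrs)
    for i, t in enumerate(rows):
        for u in rows[i + 1:]:
            if all(t[a] == u[a] for a in x):                   # pair agrees on x...
                closed &= {a for a in attrs if t[a] == u[a]}   # ...keep shared attrs
    return closed

r = [{"A": 1, "B": 2, "C": 3},
     {"A": 1, "B": 2, "C": 4},
     {"A": 5, "B": 6, "C": 3}]
print(closure_in_relation(r, {"A", "B", "C"}, {"A"}))  # {'A', 'B'}: A -> B holds, A -> C does not
```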

4.
张守志  施伯乐 《软件学报》2003,14(10):1692-1696
This paper introduces a method for discovering a minimal set of functional dependencies. The method is based on the concept of agree sets: maximal sets and their complements are derived from the agree sets, and the minimal set of non-trivial functional dependencies is then generated. Partitioning the database into stripes reduces the number of operations needed to compute the agree sets, and a level-wise refinement algorithm computes the left-hand sides of the minimal non-trivial functional dependencies. The results can be applied to knowledge-discovery tasks such as database reorganization and redesign, attribute reduction, clustering, and association-rule extraction.
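The agree set of a tuple pair is the set of attributes on which the two tuples take equal values; the maximal sets and FD left-hand sides are then derived from this collection. A minimal sketch of that first step (data illustrative):

```python
from itertools import combinations

def agree_sets(rows, attrs):
    """Agree set of a tuple pair = attributes on which the pair takes equal
    values; the collection over all pairs is the input to max-set derivation."""
    return {frozenset(a for a in attrs if t[a] == u[a])
            for t, u in combinations(rows, 2)}

r = [{"A": 1, "B": "x", "C": 0},
     {"A": 1, "B": "y", "C": 0},
     {"A": 2, "B": "y", "C": 0}]
for s in agree_sets(r, ["A", "B", "C"]):
    print(sorted(s))  # ['A', 'C'], ['B', 'C'], ['C'] in some order
```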

5.
聂培尧 《软件学报》1994,5(3):37-42
Data dependencies play a very important role in database design. Since Codd proposed functional dependencies (FDs) and Fagin introduced multivalued dependencies (MVDs), a variety of new dependencies have been introduced in recent years to meet design needs, such as the transitive closure dependencies (CDs) introduced in engineering database design. These dependencies are generally divided into two classes according to whether they possess a complete axiom system, since a complete axiom system is usually a prerequisite for an effective decision algorithm. This paper studies the existence of k-ary complete axiom systems for CDs and FDs, and proves that CDs and FDs do not have a common k-ary complete axiom system.

6.
We are interested in specifying functional dependencies (FDs) for data-centric XML documents (XML documents used mainly for data storage). FDs are a natural constraint, but specifying them for XML documents is more difficult because, unlike relational databases, XML documents do not have uniform structures. This paper introduces XML Template Functional Dependencies (XTFDs), which can specify FDs for XML documents. It also presents a necessary and sufficient condition for an XTFD to cause data redundancy in XML documents. Further, we propose the Attribute Rule and the Text String Rule, two procedures that can be applied repeatedly to remove redundancy caused by XTFDs. In addition, we prove that if an XML document has data redundancy with respect to an FD specified using the tree tuple approach, it also has data redundancy with respect to an XTFD, and we show by example that XTFDs can specify some FDs for XML documents that the tree tuple approach cannot.
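The underlying check for an FD over a data-centric document pairs the values selected by a template's paths. The sketch below illustrates the idea only and does not follow the paper's XTFD syntax; the element and path names are made up:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<courses>
  <course><no>CS101</no><title>Databases</title></course>
  <course><no>CS101</no><title>Databases</title></course>
  <course><no>CS102</no><title>Logic</title></course>
</courses>""")

def xml_fd_holds(root, target, lhs_path, rhs_path):
    """For every element matching `target`, the text under lhs_path must
    determine the text under rhs_path across the whole document."""
    seen = {}
    for node in root.findall(target):
        lhs = node.findtext(lhs_path)
        rhs = node.findtext(rhs_path)
        if seen.setdefault(lhs, rhs) != rhs:
            return False
    return True

print(xml_fd_holds(doc, "course", "no", "title"))  # True: no -> title holds
```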

7.
A finite attribute closure algorithm for temporal functional dependencies under multiple time granularities
To support effective temporal database design, temporal functional dependencies (TFDs) with multiple time granularities are used to normalize temporal schemes. A key problem that temporal scheme normalization must solve is computing the finite closure of a set of attributes under TFDs, and the use of multiple time granularities makes this finite attribute closure problem very complex. In fact, TFDs are closely related to traditional functional dependencies (FDs). By analyzing these relationships and the properties of closed sets of temporal types, and by drawing on the related algorithms for traditional FDs, an effective algorithm for computing the finite attribute closure is proposed. Analysis and experimental comparison with related algorithms show that the proposed algorithm is more efficient.
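The classical attribute closure algorithm for ordinary FDs that this work builds on is shown below as a minimal Python sketch (the TFD extension with time granularities is not reproduced):

```python
def attribute_closure(attrs, fds):
    """Classical closure of an attribute set under a list of FDs (lhs, rhs):
    repeatedly fire every FD whose left-hand side is already in the closure."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

fds = [({"A"}, {"B"}), ({"B", "C"}, {"D"})]
print(attribute_closure({"A", "C"}, fds))  # {'A', 'B', 'C', 'D'}
```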

8.
A formal system for reasoning about functional dependencies (FDs) and subset dependencies (SDs) defined over relational expressions is described. An FD e:X → Y indicates that Y is functionally dependent on X in the relation denoted by expression e; an SD e ⊆ f indicates that the relation denoted by e is a subset of that denoted by f. The system is shown to be sound and complete by resorting to the analytic tableaux method. Applications of the system include the problem of determining whether a constraint of a subschema is implied by the constraints of the base schema, and the development of database design methodologies similar to normalization.

9.
Rough functional dependencies in rough relational databases (RRDBs) are studied using granular computing. Two types of rough functional dependencies and the concept of a rough similarity relation are first proposed, and the use of bit patterns to represent the attribute values of rough relations is analyzed. On this basis, a granular-computing algorithm is given for deciding the dependencies between attributes of a rough relation; experiments verify that the algorithm is effective and feasible.
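Representing each block of tuples that share an attribute value as a bit pattern turns dependency tests into cheap bitwise operations. A sketch of this idea over an ordinary relation (the rough-similarity aspect of the paper is omitted):

```python
from collections import defaultdict

def granules(rows, attrs):
    """Bit pattern per value combination: bit i is set iff tuple i takes it."""
    masks = defaultdict(int)
    for i, row in enumerate(rows):
        masks[tuple(row[a] for a in attrs)] |= 1 << i
    return list(masks.values())

def fd_holds(rows, lhs, rhs):
    """lhs -> rhs holds iff every lhs-granule is contained in some rhs-granule
    (the lhs-partition refines the rhs-partition); containment is a mask test."""
    rhs_granules = granules(rows, rhs)
    return all(any(g & ~h == 0 for h in rhs_granules)
               for g in granules(rows, lhs))

r = [{"A": 1, "B": "x"}, {"A": 1, "B": "x"}, {"A": 2, "B": "x"}]
print(fd_holds(r, ["A"], ["B"]))  # True
print(fd_holds(r, ["B"], ["A"]))  # False: one B-granule spans two A values
```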

10.
Formal Concept Analysis (FCA), in which data is represented as a formal context, offers a framework for Association Rule Mining (ARM) by handling functional dependencies in the data. However, the number of rules grows exponentially with the size of the formal context. In this article, we apply Fuzzy K-Means clustering to the data set to reduce the formal context, and then apply FCA on the reduced data set to mine association rules. With experiments on two real-world healthcare data sets, we offer evidence of the performance of FKM-based FCA in mining association rules.

11.
《Knowledge》2002,15(7):399-405
We define an optimal class association rule set to be the minimum rule set with the same predictive power as the complete class association rule set. Using this rule set instead of the complete class association rule set, we can avoid redundant computation that would otherwise be required for mining predictive association rules, and hence improve the efficiency of the mining process significantly. We present an efficient algorithm for mining the optimal class association rule set, which uses an upward closure property to prune weak rules before they are actually generated. We have implemented the algorithm, and our experimental results show that it generates the optimal class association rule set, whose size is on average smaller than 1/17 of the complete class association rule set, in significantly less time than generating the complete class association rule set. Our proposed criterion has been shown to be very effective for pruning weak rules in dense databases.

12.
We consider the problem of defining a normalized approximation measure for multi-valued dependencies in relational database theory. An approximation measure is a function mapping relation instances to real numbers; the number to which an instance is mapped intuitively describes the strength of the dependency in that instance. A normalized approximation measure for functional dependencies has been proposed previously: the minimum number of tuples that must be removed for the functional dependency to hold, divided by the total number of tuples. This leads naturally to a normalized measure for multi-valued dependencies: the minimum number of tuples that must be removed for the multi-valued dependency to hold, divided by the total number of tuples. The measure for functional dependencies can be computed efficiently, in O(|r|log(|r|)) time, where |r| is the size of the relation instance. However, we show that an efficient algorithm for computing the analogous measure for multi-valued dependencies is unlikely to exist: a polynomial-time algorithm for computing the measure would yield a polynomial-time algorithm for an NP-complete problem (proven by a reduction from the maximum edge biclique problem in graph theory). Hence, we argue that it is not a good measure. We propose an alternate measure based on the lossless join characterization of multi-valued dependencies. This measure is efficiently computable, in O(|r|²) time.
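The FD measure can be computed in one grouping pass: group the tuples by the left-hand side and, within each group, keep the largest class that agrees on the right-hand side. A minimal sketch (relation and names illustrative):

```python
from collections import Counter, defaultdict

def fd_approximation(rows, lhs, rhs):
    """Normalized FD approximation measure: (minimum number of tuples whose
    removal makes lhs -> rhs hold) / (total number of tuples)."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[a] for a in lhs)][tuple(row[a] for a in rhs)] += 1
    kept = sum(max(c.values()) for c in groups.values())  # majority rhs per group
    return (len(rows) - kept) / len(rows)

r = [{"A": 1, "B": "x"}, {"A": 1, "B": "x"},
     {"A": 1, "B": "y"}, {"A": 2, "B": "z"}]
print(fd_approximation(r, ["A"], ["B"]))  # 0.25: remove one tuple so A -> B holds
```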

13.
Statistical dependency analysis is the basis of all empirical science. A commonly occurring problem is to find the most significant dependency rules, which describe either positive or negative dependencies between categorical attributes. In medical science, for example, one is interested in genetic factors that can either predispose to or prevent diseases. The requirement of statistical significance is essential, because the discoveries should also hold in future data. Typically, the significance is estimated either by Fisher's exact test or the χ²-measure. The problem is computationally very difficult, because the number of all possible dependency rules increases exponentially with the number of attributes. As a solution, different kinds of restrictions and heuristics have been applied, but a general, scalable search method has been missing. In this paper, we introduce an efficient algorithm, called Kingfisher, for searching for the best non-redundant dependency rules with statistical significance measures. The rules can express either positive or negative dependencies between a set of positive attributes and a single consequent attribute. The algorithm itself is independent of the goodness measure used, but we concentrate on Fisher's exact test and the χ²-measure. The algorithm is based on an application of the branch-and-bound search strategy, supplemented by several pruning properties. In particular, we prove a new lower bound for Fisher's p and introduce a new, effective pruning principle. According to our experiments on classical benchmark data, the algorithm scales well and can efficiently handle even dense and high-dimensional data sets. An interesting observation was that Fisher's exact test not only produced more reliable rules than the χ²-measure, but also performed the search much faster.
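Both significance measures used by Kingfisher are available off the shelf for a single rule's 2×2 contingency table; a quick illustration with made-up counts:

```python
from scipy.stats import chi2_contingency, fisher_exact

# 2x2 contingency table for a candidate rule X -> A:
#                A    not A
table = [[30,  5],   # X holds
         [10, 55]]   # X does not hold

_, fisher_p = fisher_exact(table, alternative="greater")  # positive dependency
chi2, chi2_p, _, _ = chi2_contingency(table)

print(f"Fisher's exact p = {fisher_p:.3g}, chi-squared = {chi2:.1f} (p = {chi2_p:.3g})")
```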

14.
《Artificial Intelligence》2001,125(1-2):155-207
Although neural networks have shown very good performance in many application domains, one of their main drawbacks lies in the incapacity to provide an explanation for the underlying reasoning mechanisms. The "explanation capability" of neural networks can be achieved by the extraction of symbolic knowledge. In this paper, we present a new method of extraction that captures nonmonotonic rules encoded in the network, and prove that such a method is sound. We start by discussing some of the main problems of knowledge extraction methods. We then discuss how these problems may be ameliorated. To this end, a partial ordering on the set of input vectors of a network is defined, as well as a number of pruning and simplification rules. The pruning rules are then used to reduce the search space of the extraction algorithm during a pedagogical extraction, whereas the simplification rules are used to reduce the size of the extracted set of rules. We show that, in the case of regular networks, the extraction algorithm is sound and complete. We proceed to extend the extraction algorithm to the class of non-regular networks, the general case. We show that non-regular networks always contain regularities in their subnetworks. As a result, the underlying extraction method for regular networks can be applied, but now in a decompositional fashion. In order to combine the sets of rules extracted from each subnetwork into the final set of rules, we use a method whereby we are able to keep the soundness of the extraction algorithm. Finally, we present the results of an empirical analysis of the extraction system, using traditional examples and real-world application problems. The results have shown that a very high fidelity between the extracted set of rules and the network can be achieved.

15.
Inter-sequence pattern mining can find associations across several sequences in a sequence database, discovering both sequential patterns within a transaction and sequential patterns across several different transactions. However, inter-sequence pattern mining algorithms usually generate a large number of redundant frequent patterns. We have observed that mining closed inter-sequence patterns instead of frequent ones leads to a more compact yet complete result set. Therefore, in this paper, we propose a model of closed inter-sequence pattern mining and an efficient algorithm called CISP-Miner for mining such patterns, which enumerates closed inter-sequence patterns recursively along a search tree in a depth-first manner. In addition, several effective pruning strategies and closure-checking schemes are designed to reduce the search space and thus accelerate the algorithm. Our experiment results demonstrate that the proposed CISP-Miner algorithm is very efficient and outperforms the EISP-Miner algorithm it is compared against in most cases.

16.
Feature selection is viewed as an important preprocessing step for pattern recognition, machine learning, and data mining. Traditional hill-climbing search approaches to feature selection have difficulty finding optimal reducts, and current stochastic search strategies, such as GA, ACO, and PSO, provide a more robust solution but at the expense of increased computational effort. It is therefore necessary to investigate fast and effective search algorithms. Rough set theory provides a mathematical tool to discover data dependencies and reduce the number of features contained in a dataset by purely structural methods. In this paper, we define a structure called the power set tree (PS-tree), an ordered tree representing the power set, in which each possible reduct is mapped to a node of the tree. We then present a rough set approach to feature selection based on the PS-tree. Two kinds of pruning rules for the PS-tree are given, along with two novel feature selection algorithms based on it. Experimental results demonstrate that our algorithms are effective and efficient.
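In rough-set feature selection, a candidate attribute subset is typically scored by the dependency degree γ: the fraction of objects whose decision class is fully determined by the subset. A minimal sketch of that score (data illustrative; this is not the PS-tree search itself):

```python
from collections import defaultdict

def dependency_degree(rows, attrs, decision):
    """Rough-set dependency degree gamma(attrs -> decision): fraction of
    objects in the positive region, i.e. whose attrs-equivalence class
    contains only one decision value."""
    classes = defaultdict(set)
    for row in rows:
        classes[tuple(row[a] for a in attrs)].add(row[decision])
    pos = sum(1 for row in rows
              if len(classes[tuple(row[a] for a in attrs)]) == 1)
    return pos / len(rows)

data = [{"a": 0, "b": 0, "d": "yes"},
        {"a": 0, "b": 1, "d": "no"},
        {"a": 1, "b": 1, "d": "no"},
        {"a": 0, "b": 0, "d": "yes"}]
print(dependency_degree(data, ["a", "b"], "d"))  # 1.0: {a, b} determines d
print(dependency_degree(data, ["a"], "d"))       # 0.25: only one pure class
```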

17.
MAFIA: a maximal frequent itemset algorithm
We present a new algorithm for mining maximal frequent itemsets from a transactional database. The search strategy of the algorithm integrates a depth-first traversal of the itemset lattice with effective pruning mechanisms that significantly improve mining performance. Our implementation of support counting combines a vertical bitmap representation of the data with an efficient bitmap compression scheme. In a thorough experimental analysis, we isolate the effects of individual components of MAFIA, including search-space pruning techniques and adaptive compression. We also compare our performance with previous work by running tests on very different types of data sets. Our experiments show that MAFIA performs best when mining long itemsets and outperforms other algorithms on dense data by a factor of 3 to 30.
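With a vertical bitmap representation, support counting reduces to a bitwise AND followed by a popcount. A minimal sketch of that core operation (the transactions are made up):

```python
def vertical_bitmaps(transactions, items):
    """One bitmap per item: bit i is set iff transaction i contains the item."""
    return {item: sum(1 << i for i, t in enumerate(transactions) if item in t)
            for item in items}

def support(itemset, bitmaps, n_trans):
    """Support of an itemset = popcount of the AND of its item bitmaps."""
    mask = (1 << n_trans) - 1          # start with all transactions
    for item in itemset:
        mask &= bitmaps[item]
    return bin(mask).count("1")

transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
bm = vertical_bitmaps(transactions, {"a", "b", "c"})
print(support({"a", "c"}, bm, len(transactions)))  # 3
```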

18.
刘慧婷  沈盛霞  赵鹏  姚晟 《计算机应用》2015,35(10):2911-2914
Because of the downward closure property of uncertain data, methods that mine all frequent itemsets produce results of exponential size. To obtain a smaller, more suitable result set, mining frequent closed itemsets over uncertain data is studied and a new frequent closed itemset mining algorithm, NA-PFCIM, is proposed. The algorithm treats the itemset mining process as a probability distribution function and adopts a normal-distribution model to extract frequent itemsets, since normal-distribution-based methods extract frequent itemsets with high accuracy and support large databases. Meanwhile, to reduce the search space and avoid redundant computation, a depth-first search strategy is used to obtain all probabilistic frequent closed itemsets. The algorithm also employs two pruning strategies: superset pruning and subset pruning. Finally, the proposed NA-PFCIM algorithm is compared with the Poisson-distribution-based A-PFCIM algorithm on common datasets (T10I4D100K, Accidents, Mushroom, Chess). Experimental results show that NA-PFCIM reduces both the number of itemsets to be extended and the computation of itemset frequentness probabilities, and that it outperforms the compared algorithm.
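Under the normal-distribution model, an itemset's support in an uncertain database is a sum of independent Bernoulli variables, so its frequentness probability can be approximated from the mean and variance of the per-transaction probabilities. A minimal sketch of that approximation (not the NA-PFCIM algorithm itself):

```python
import math

def frequentness_probability(probs, minsup):
    """P(support >= minsup) under a normal approximation of the
    Poisson-binomial support distribution: mean = sum(p),
    variance = sum(p * (1 - p)), with a 0.5 continuity correction."""
    mu = sum(probs)
    var = sum(p * (1 - p) for p in probs)
    if var == 0:
        return 1.0 if mu >= minsup else 0.0
    z = (minsup - 0.5 - mu) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))  # 1 - Phi(z)

# Itemset appearing with these probabilities in 6 uncertain transactions:
probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
print(frequentness_probability(probs, minsup=3))  # about 0.90: likely frequent
```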

19.
Conditional functional dependencies are a semantic extension of functional dependencies; they can be applied to data cleaning and are widely used to repair database consistency. This paper discusses the semantic rules of conditional functional dependencies, focuses on detecting tuples that violate database consistency based on conditional functional dependencies, and introduces a confidence evaluation mechanism to improve the detection rules. The improved detection method shows its advantage in detection based on multiple functional dependencies, making the detection work more streamlined and the detection criteria clearer.
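A conditional functional dependency adds a pattern tableau to an ordinary FD, so it constrains only tuples matching the pattern; a confidence score can then be taken as the fraction of matching tuples left unflagged. A rough sketch for a single tableau row ('_' as wildcard; names and syntax are illustrative):

```python
def cfd_violations(rows, lhs, rhs, pattern):
    """Tuples violating the CFD (lhs -> rhs, pattern): among tuples whose lhs
    values match the pattern ('_' matches anything), equal lhs values must
    imply equal rhs values. Returns (violating tuples, confidence)."""
    matches = [r for r in rows
               if all(p == "_" or r[a] == p for a, p in zip(lhs, pattern))]
    seen, bad = {}, []
    for r in matches:
        key = tuple(r[a] for a in lhs)
        val = tuple(r[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            bad.append(r)
    conf = 1.0 if not matches else 1 - len(bad) / len(matches)
    return bad, conf

rows = [{"cc": "44", "zip": "EH4", "city": "Edinburgh"},
        {"cc": "44", "zip": "EH4", "city": "Glasgow"},   # same zip, other city
        {"cc": "01", "zip": "EH4", "city": "Chicago"}]   # filtered out by cc = '44'
bad, conf = cfd_violations(rows, ["cc", "zip"], ["city"], ["44", "_"])
print(len(bad), round(conf, 2))  # 1 0.5
```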

20.
High-utility itemset mining (HUIM) is a popular data mining task with applications in numerous domains. However, traditional HUIM algorithms often produce a very large set of high-utility itemsets (HUIs). As a result, analyzing HUIs can be very time-consuming for users. Moreover, a large set of HUIs also makes HUIM algorithms less efficient in terms of execution time and memory consumption. To address this problem, closed high-utility itemsets (CHUIs), concise and lossless representations of all HUIs, were proposed recently. Although mining CHUIs is useful and desirable, it remains a computationally expensive task, because current algorithms often generate a huge number of candidate itemsets and are unable to prune the search space effectively. In this paper, we address these issues by proposing a novel algorithm called CLS-Miner. The proposed algorithm utilizes the utility-list structure to directly compute the utilities of itemsets without producing candidates. It also introduces three novel strategies to reduce the search space, namely chain-estimated utility co-occurrence pruning, lower-branch pruning, and pruning by coverage. Moreover, an effective method for checking whether an itemset is a subset of another itemset is introduced to further reduce the time required for discovering CHUIs. To evaluate the performance of the proposed algorithm and its novel strategies, extensive experiments have been conducted on six benchmark datasets having various characteristics. Results show that the proposed strategies are highly efficient and effective, that the proposed CLS-Miner algorithm outperforms the current state-of-the-art CHUD and CHUI-Miner algorithms, and that CLS-Miner scales linearly.
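High-utility itemset mining weighs each item occurrence by quantity times unit profit, and an itemset is high-utility when its total utility over the transactions containing it reaches a threshold. A small sketch of the utility computation itself (not CLS-Miner's utility-list structure; data made up):

```python
def itemset_utility(transactions, profits, itemset):
    """Utility of an itemset = over all transactions containing every item of
    the itemset, the sum of quantity * unit profit of the itemset's items."""
    return sum(sum(qty * profits[item] for item, qty in t.items() if item in itemset)
               for t in transactions
               if itemset <= t.keys())

# Transactions map item -> purchase quantity; profits give unit profit per item.
transactions = [{"a": 2, "b": 1}, {"a": 1, "c": 3}, {"a": 2, "b": 2, "c": 1}]
profits = {"a": 5, "b": 2, "c": 1}
print(itemset_utility(transactions, profits, {"a", "b"}))  # 26 = 12 + 14
```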
