Similar literature
 20 similar documents found (search time: 31 ms)
1.
Relational learning can be described as the task of learning first-order logic rules from examples. It has enabled a number of new machine learning applications, e.g. graph mining and link analysis. Inductive Logic Programming (ILP) performs relational learning either directly, by manipulating first-order rules, or through propositionalization, which translates the relational task into an attribute-value learning task by representing subsets of relations as features. In this paper, we introduce a fast method and system for relational learning based on a novel propositionalization called Bottom Clause Propositionalization (BCP). Bottom clauses are boundaries in the hypothesis search space used by the ILP systems Progol and Aleph. Bottom clauses carry semantic meaning and can be mapped directly onto numerical vectors, simplifying the feature extraction process. We have integrated BCP with a well-known neural-symbolic system, C-IL2P, to perform learning from numerical vectors. C-IL2P uses background knowledge in the form of propositional logic programs to build a neural network. The integrated system, which we call CILP++, handles first-order logic knowledge and is available for download from Sourceforge. We have evaluated CILP++ on seven ILP datasets, comparing results with Aleph and a well-known propositionalization method, RSD. The results show that CILP++ can achieve accuracy comparable to Aleph while being generally faster. BCP achieved a statistically significant improvement in accuracy over RSD when running with a neural network, but BCP and RSD perform similarly when running with C4.5. We have also extended CILP++ to include a statistical feature selection method, mRMR, with preliminary results indicating that a reduction of more than 90% of features can be achieved with a small loss of accuracy.

2.
This work experimentally analyzes discretization algorithms for handling continuous attributes in evolutionary learning. We consider a learning system that induces a set of rules in a fragment of first-order logic (evolutionary inductive logic programming), and introduce a method in which a given discretization algorithm is used to generate initial inequalities, which describe subranges of attributes' values. Mutation operators that exploit information on the class label of the examples (supervised discretization) are used during the learning process to refine the inequalities. The evolutionary learning system is used as a platform for experimentally testing four algorithms: two variants of the proposed method, a popular supervised discretization algorithm applied prior to induction, and a discretization method that does not use information on the class labels of the examples (unsupervised discretization). Results of experiments conducted on artificial and real-life datasets suggest that the proposed method provides an effective and robust technique for handling continuous attributes by means of inequalities.

3.
We consider data sets that consist of n-dimensional binary vectors representing positive and negative examples for some (possibly unknown) phenomenon. A subset S of the attributes (or variables) of such a data set is called a support set if the positive and negative examples can be distinguished by using only the attributes in S. In this paper we study the problem of finding small support sets, a frequently arising task in various fields, including knowledge discovery, data mining, learning theory, and logical analysis of data. We study the distribution of support sets in randomly generated data, and discuss why finding small support sets is important. We propose several measures of separation (real-valued set functions over the subsets of attributes), formulate optimization models for finding the smallest subsets maximizing these measures, and devise efficient heuristic algorithms to solve these (typically NP-hard) optimization problems. We prove that several of the proposed heuristics have a guaranteed constant approximation ratio, and we report on computational experience comparing these heuristics with some others from the literature, both on randomly generated and on real-world data sets.
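
The support-set definition in this abstract can be illustrated with a small sketch. It is not the paper's heuristics: the function names, the toy data, and the brute-force search are assumptions made here for illustration, and the search is exponential in the number of attributes.

```python
from itertools import combinations

def is_support_set(positives, negatives, attrs):
    """True if no negative example equals any positive example on the attributes in attrs."""
    def project(vector):
        return tuple(vector[i] for i in attrs)
    positive_projections = {project(v) for v in positives}
    return all(project(v) not in positive_projections for v in negatives)

def smallest_support_set(positives, negatives):
    """Brute-force search by increasing size (illustration only, exponential cost)."""
    n = len(positives[0])
    for size in range(n + 1):
        for attrs in combinations(range(n), size):
            if is_support_set(positives, negatives, attrs):
                return attrs
    return None  # the examples cannot be separated at all

# Hypothetical toy data: attribute 0 alone separates the two classes.
positives = [(1, 0, 1), (1, 1, 1)]
negatives = [(0, 0, 1), (0, 1, 0)]
print(smallest_support_set(positives, negatives))
```

Because the exact problem is typically NP-hard, as the abstract notes, a practical method would replace the exhaustive loop with a heuristic.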

4.
Searching the hypothesis space bounded below by a bottom clause is the basis of several state-of-the-art ILP systems (e.g. Progol, Aleph). These systems use refinement operators together with search heuristics to explore a bounded hypothesis space. It is known that the search space of these systems is limited to a sub-graph of the general subsumption lattice. However, the structure and properties of this sub-graph have not been properly characterised. In this paper, we first characterise the hypothesis space considered by the ILP systems that use a bottom clause to constrain the search. In particular, we discuss refinement in Progol as a representative of these ILP systems. Second, we study the lattice structure of this bounded hypothesis space. Third, we give a new analysis of refinement operators, least generalisation and greatest specialisation in the subsumption order relative to a bottom clause. The results of this study are important for a better understanding of the constrained refinement space of ILP systems such as Progol and Aleph, which have proved successful in solving real-world problems (despite being incomplete with respect to the general subsumption order). Moreover, characterising this refinement sub-lattice can lead to more efficient ILP algorithms and operators for searching this particular sub-lattice. For example, it is shown that, unlike for the general subsumption order, efficient least generalisation operators can be designed for the subsumption order relative to a bottom clause.

5.
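ReCA: A Reduction Method for Continuous-Valued Attributes   Cited by: 1 (self-citations: 1; citations by others: 0)
Attribute reduction is one of the main applications and research topics of rough set theory. Most existing attribute reduction methods are suited to discrete-valued attributes. To handle data with continuous-valued attributes, the usual practice is to discretize them first. Such preprocessing loses some information and can easily lead to erroneous reductions. For continuous-valued information systems, a new attribute reduction method, ReCA, is proposed. It integrates the discretization of continuous-valued attributes with the attribute reduction process, uses an information-entropy-based uncertainty measure as the fitness function, and obtains the reduced attribute set and the set of discretization cut points simultaneously through evolutionary computation. Experiments show that the method not only performs attribute reduction effectively but also, compared with the Rough set and C4.5 methods, yields fewer attributes and higher test accuracy.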

6.
Association rules are one of the most frequently used tools for finding relationships between different attributes in a database. There are various techniques for obtaining these rules, the most common of which yield categorical association rules. However, when we need to relate attributes that are numeric and discrete, we turn to methods that generate quantitative association rules, which are far less studied. In addition, when the database is extremely large, many of these tools cannot be used. In this paper, we present an evolutionary tool for finding association rules in databases (both small and large) comprising quantitative and categorical attributes without the need for an a priori discretization of the domain of the numeric attributes. Finally, we evaluate the tool using both real and synthetic databases.

7.
Neighborhood rough set based heterogeneous feature subset selection   Cited by: 6 (self-citations: 0; citations by others: 6)
Feature subset selection is viewed as an important preprocessing step for pattern recognition, machine learning and data mining. Most research focuses on homogeneous feature selection, that is, on features that are all numerical or all categorical. In this paper, we introduce a neighborhood rough set model to deal with the problem of heterogeneous feature subset selection. As the classical rough set model can only be used to evaluate categorical features, we generalize it with neighborhood relations to obtain a neighborhood rough set model. The proposed model degenerates to the classical one if the neighborhood size is set to zero. The neighborhood model is used to reduce numerical and categorical features by assigning different thresholds to different kinds of attributes. In this model the sizes of the neighborhood lower and upper approximations of decisions reflect the discriminating capability of feature subsets. The size of the lower approximation is computed as the dependency between decision and condition attributes. We use the neighborhood dependency to evaluate the significance of a subset of heterogeneous features and construct forward feature subset selection algorithms. The proposed algorithms are compared with some classical techniques. Experimental results show that the neighborhood-model-based method is more flexible in dealing with heterogeneous data.
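
A rough sketch of the neighborhood dependency described in this abstract follows. It makes several assumptions not stated in the source: a single radius `delta` is used for all features (the paper assigns different thresholds to different kinds of attributes), the neighborhood is measured with the Chebyshev (maximum) distance, and the function name and toy data are hypothetical. A sample counts toward the lower approximation when every sample in its neighborhood shares its decision.

```python
def neighborhood_dependency(samples, decisions, delta):
    """Fraction of samples whose delta-neighborhood is pure in the decision attribute."""
    n = len(samples)
    consistent = 0
    for i in range(n):
        pure = all(
            decisions[j] == decisions[i]
            for j in range(n)
            # Chebyshev distance is an assumption of this sketch.
            if max(abs(a - b) for a, b in zip(samples[i], samples[j])) <= delta
        )
        consistent += int(pure)
    return consistent / n

# Hypothetical one-feature data set with two well-separated classes.
samples = [(1,), (2,), (8,), (9,)]
decisions = [0, 0, 1, 1]
print(neighborhood_dependency(samples, decisions, 1.5))  # small radius
print(neighborhood_dependency(samples, decisions, 6.5))  # radius large enough to mix classes
```

With `delta = 0` each neighborhood contains only samples with identical feature values, which matches the abstract's remark that the model reduces to the classical rough set model at neighborhood size zero.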

8.
Attribute-value based representations, standard in today's data mining systems, have limited expressiveness. Inductive Logic Programming provides an interesting alternative, particularly for learning from structured examples whose parts, each with its own attributes, are related to each other by means of first-order predicates. Several subsets of first-order logic (FOL) with different expressive power have been proposed in Inductive Logic Programming (ILP). The challenge lies in the fact that the more expressive the subset of FOL the learner works with, the more critical the dimensionality of the learning task. The Datalog language is expressive enough to represent realistic learning problems when data is given directly in a relational database, making it a suitable tool for data mining. Consequently, it is important to develop techniques that dynamically decrease the dimensionality of learning tasks expressed in Datalog, just as Feature Subset Selection (FSS) techniques do in attribute-value learning. The idea of re-using these techniques in ILP runs immediately into a problem, as ILP examples have variable size and do not share the same set of literals. We propose here the first paradigm that brings Feature Subset Selection to the level of ILP, in languages at least as expressive as Datalog. The main idea is to first perform a change of representation, which approximates the original relational problem by a multi-instance problem. The resulting representation is suitable for FSS techniques, which we adapted from attribute-value learning by taking into account some of the characteristics of the data that arise from the change of representation. We present the simple FSS proposed for the task, the requisite change of representation, and the entire method combining those two algorithms. The method acts as a filter that preprocesses the relational data prior to model building and outputs relational examples with empirically relevant literals. We discuss experiments in which the method was successfully applied to two real-world domains.

9.
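Yu Ze. 《计算机系统应用》 (Computer Systems & Applications), 2014, 23(12): 125-130
Mixed-attribute clustering has been a research hotspot in recent years. Clustering algorithms for mixed-attribute data must handle both numerical and categorical attributes well, yet many existing algorithms do not balance the two kinds of attributes properly and therefore fail to produce satisfactory clustering results. For mixed attributes, an intersection-based cluster ensemble algorithm is proposed: it processes the numerical attributes separately with a relative-density-based algorithm and the categorical attributes with an information-entropy-based algorithm, then merges the two clustering members with an intersection-based fusion algorithm to obtain the final clustering result. The algorithm is validated on the UCI dataset Zoo and compared with the existing k-prototypes and EM algorithms; it outperforms both in clustering accuracy. The effect of the intersection element ratio used in the fusion algorithm on the result is also discussed.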

10.
Inductive learning systems can be effectively used to acquire classification knowledge from examples. Many existing symbolic learning algorithms can be applied in domains with continuous attributes when integrated with a discretization algorithm that transforms the continuous attributes into ordered discrete ones. In this paper, a new information-theoretic discretization method optimized for supervised learning is proposed and described. This approach seeks to maximize the mutual dependence, as measured by the interdependence redundancy between the discrete intervals and the class labels, and can automatically determine the most preferred number of intervals for an inductive learning application. The method has been tested on a number of inductive learning examples to show that the class-dependent discretizer can significantly improve the classification performance of many existing learning algorithms in domains containing numeric attributes.

11.
We present a data mining method which integrates discretization, generalization and rough set feature selection. Our method reduces the data horizontally and vertically. In the first phase, discretization and generalization are integrated. Numeric attributes are discretized into a few intervals. The primitive values of symbolic attributes are replaced by high-level concepts, and some obviously superfluous or irrelevant symbolic attributes are eliminated. The horizontal reduction is done by merging identical tuples after substituting an attribute value by its higher-level value in a predefined concept hierarchy for symbolic attributes, or after the discretization of continuous (or numeric) attributes. This phase greatly decreases the number of tuples we consider further in the database(s). In the second phase, a novel context-sensitive feature merit measure is used to rank features, and a subset of relevant attributes is chosen based on rough set theory and the merit values of the features. A reduced table is obtained by removing those attributes which are not in the relevant attribute subset, so the data set is further reduced vertically without changing the interdependence relationships between the classes and the attributes. Finally, the tuples in the reduced relation are transformed into different knowledge rules based on different knowledge discovery algorithms. Based on these principles, a prototype knowledge discovery system, DBROUGH-II, has been constructed by integrating discretization, generalization, rough set feature selection and a variety of data mining algorithms. Tests on a telecommunication customer data warehouse demonstrate that different kinds of knowledge rules, such as characteristic rules, discriminant rules, maximal generalized classification rules, and data evolution regularities, can be discovered efficiently and effectively.

12.
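Discretization of continuous attributes is an important step in knowledge systems; a good discretization method simplifies the description of knowledge and makes the knowledge system easier to process. Finding the optimal set of cut points for continuous attribute values, however, is NP-hard. A fuzzy discretization method for continuous attributes, Norm-FD, is proposed: a normal discretization algorithm (the Norm-D algorithm), based on the characteristics of the normal distribution, discretizes the data into the required number of intervals, and then the F-Inter algorithm, based on the relationship between an attribute value and its adjacent intervals, converts each concrete attribute value into three parameters: membership degree, interval number, and bias coefficient.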

13.
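For mixed data in which nominal and numerical attributes coexist, multi-granulation neighborhood rough sets and intuitionistic fuzzy sets are combined to define the fuzzy covering rough membership degree and non-membership degree, respectively. Based on different attribute set sequences and different neighborhood radii, a multi-granulation neighborhood rough intuitionistic fuzzy set model is constructed and its properties are proved. Approximation sets of the optimistic and pessimistic multi-granulation neighborhood rough intuitionistic fuzzy sets are then proposed, and the properties of the model are discussed. Finally, a worked example computed with the model shows that it handles mixed data with nominal and numerical attributes well.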

14.
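To solve the problem of discretizing continuous attributes in data mining and machine learning, an improved adaptive discrete particle swarm optimization algorithm is proposed. The sets of cut points of the continuous attributes are treated as the discrete particle swarm, and the cut-point subset is minimized through interactions among particles; simulated annealing is introduced as a local search strategy, improving the diversity of the swarm and its ability to find the global optimum. The degree of dependency of the decision attribute on the condition attributes, from rough set theory, is used to measure the consistency of the decision table, thereby achieving the discretization of continuous attributes. Finally, the performance of the algorithm is tested on several data sets and compared experimentally with other algorithms; the results show that the algorithm is effective.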
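
The consistency measure mentioned in this abstract (the rough-set dependency of the decision attribute on the condition attributes) can be sketched as follows for a given set of cut points. This shows only the evaluation step that a search such as particle swarm optimization might call; it is not the paper's algorithm, and the function names, data layout, and toy table are assumptions.

```python
from bisect import bisect_right
from collections import defaultdict

def discretize(value, cuts):
    """Map a continuous value to the index of the interval defined by the cut points."""
    return bisect_right(sorted(cuts), value)

def dependency_degree(rows, labels, cuts_per_attribute):
    """Fraction of rows whose discretized condition values determine a single decision."""
    label_sets = defaultdict(set)
    counts = defaultdict(int)
    for row, label in zip(rows, labels):
        key = tuple(discretize(v, cuts) for v, cuts in zip(row, cuts_per_attribute))
        label_sets[key].add(label)
        counts[key] += 1
    consistent = sum(counts[k] for k, seen in label_sets.items() if len(seen) == 1)
    return consistent / len(rows)

# Hypothetical decision table with two continuous condition attributes.
rows = [(1.0, 5.0), (2.0, 6.0), (3.0, 1.0), (4.0, 2.0)]
labels = ["a", "a", "b", "b"]
print(dependency_degree(rows, labels, [[2.5], []]))  # one cut point on attribute 0
print(dependency_degree(rows, labels, [[], []]))     # no cut points at all
```

A dependency of 1.0 means the discretized table is still consistent, so a search over cut-point subsets can look for the smallest subset that keeps this value at 1.0.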

15.
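Wang Wei, Gao Liang, Wu Tao. 《微机发展》 (Microcomputer Development), 2008, 18(3): 53-55
Because rough sets can only handle discrete attributes, the discretization of continuous attributes has become one of the main problems in rough set theory. A discretization method based on fuzzy clustering is proposed, together with a discriminant function that selects the best solution from the clustering results. The method is thus a self-optimizing solution process that avoids the subjective influence of manually choosing the number of classes. Finally, comparative experiments confirm the effectiveness and soundness of the method.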

16.
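A Clustering Algorithm for Mixed-Attribute Data Streams   Cited by: 5 (self-citations: 0; citations by others: 5)
Yang Chunyu, Zhou Jie. 《计算机学报》 (Chinese Journal of Computers), 2007, 30(8): 1364-1371
Data stream clustering is an important problem in data stream mining. Real-world data streams often have both continuous and nominal attributes, but existing algorithms are limited to handling only one kind of attribute and simply discard the other. No existing algorithm clusters mixed-attribute data streams at the algorithmic level. This paper proposes a clustering algorithm for mixed-attribute data streams: it builds a Poisson process model of data stream arrival, describes discrete attributes with frequency histograms, and gives algorithms for generating, updating, merging, and deleting micro-clusters under mixed attributes. Experiments on public data sets show that the proposed algorithm performs robustly.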

17.
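A new rough set discretization algorithm is proposed for incomplete information systems. By analyzing the influence relationship between candidate cut points and decision classes, the ability of a candidate cut point to discriminate decision classes is defined and used as the measure of cut-point importance, achieving the discretization of continuous attributes in incomplete information systems. Simulation experiments verify the effectiveness of the algorithm.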

18.
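An Effective Dynamic Conceptual Clustering Algorithm for Data Mining   Cited by: 11 (self-citations: 0; citations by others: 11)
Guo Jiansheng, Zhao Yi, Shi Pengfei. 《软件学报》 (Journal of Software), 2001, 12(4): 582-591
Conceptual clustering is suited to data mining tasks in which domain knowledge is incomplete or lacking. A semantics-based distance function is defined, and continuous attribute values are conceptualized using domain knowledge. For data objects described by a mix of categorical and numerical attributes, a dynamic conceptual clustering algorithm, DDCA (domain-based dynamic clustering algorithm), is proposed. The algorithm determines the number of clusters automatically, revises cluster centers according to the frequency of attribute values within clusters, and, through concept induction, explains the clustering output with conjunctive concept expressions. The study shows that DDCA, which is based on the semantic distance function and on domain-knowledge-based dynamic conceptual clustering, is effective.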

19.
Many domains in the field of Inductive Logic Programming (ILP) involve highly unbalanced data. A common way to measure performance in these domains is to use precision and recall instead of simply using accuracy. The goal of our research is to find new approaches within ILP particularly suited for large, highly skewed domains. We propose Gleaner, a randomized search method that collects good clauses from a broad spectrum of points along the recall dimension in recall-precision curves and employs an “at least L of these K clauses” thresholding method to combine sets of selected clauses. Our research focuses on Multi-Slot Information Extraction (IE), a task that typically involves many more negative examples than positive examples. We formulate this problem as a relational domain, using two large testbeds involving the extraction of important relations from the abstracts of biomedical journal articles. We compare Gleaner to ensembles of standard theories learned by Aleph, finding that Gleaner produces comparable test-set results in a fraction of the training time. Editor: Rui Camacho
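
The “at least L of these K clauses” combination rule named in this abstract can be sketched in a few lines. The sketch assumes each clause is represented as a Python predicate over an example; the real system works with learned first-order clauses, and the example features and names below are hypothetical.

```python
def at_least_l_of_k(clauses, example, l):
    """Label the example positive when at least l of the given clauses cover it."""
    covered = sum(1 for clause in clauses if clause(example))
    return covered >= l

# Hypothetical stand-ins for K = 3 learned clauses.
clauses = [
    lambda e: e["capitalized"],
    lambda e: e["length"] > 3,
    lambda e: e["suffix"] == "ase",
]

example = {"capitalized": False, "length": 6, "suffix": "ase"}
print(at_least_l_of_k(clauses, example, 2))  # two clauses cover the example
print(at_least_l_of_k(clauses, example, 3))  # the first clause does not
```

Raising L makes the combined classifier stricter and lowering it makes it more permissive, which is how a single set of K clauses can be moved along the recall-precision curve.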

20.
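A Discretization Method for Continuous Attributes Based on an Improved Genetic Algorithm   Cited by: 1 (self-citations: 0; citations by others: 1)
Discretization in rough sets requires using as few cut points as possible while preserving the indiscernibility relation of the original decision system, and finding the optimal set of cut points for continuous attribute values is NP-hard. The discretization of continuous attribute values is treated as a constrained optimization problem, an improved genetic algorithm is used to obtain the optimal solution, and a corresponding encoding scheme and crossover method are designed for the discretization problem. Experimental results show that using the improved genetic algorithm to find the optimal cut-point set for continuous attribute values is feasible.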
