首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Attribute-value based representations, standard in today's data mining systems, have a limited expressiveness. Inductive Logic Programming provides an interesting alternative, particularly for learning from structured examples whose parts, each with its own attributes, are related to each other by means of first-order predicates. Several subsets of first-order logic (FOL) with different expressive power have been proposed in Inductive Logic Programming (ILP). The challenge lies in the fact that the more expressive the subset of FOL the learner works with, the more critical the dimensionality of the learning task. The Datalog language is expressive enough to represent realistic learning problems when data is given directly in a relational database, making it a suitable tool for data mining. Consequently, it is important to elaborate techniques that will dynamically decrease the dimensionality of learning tasks expressed in Datalog, just as Feature Subset Selection (FSS) techniques do it in attribute-value learning. The idea of re-using these techniques in ILP runs immediately into a problem as ILP examples have variable size and do not share the same set of literals. We propose here the first paradigm that brings Feature Subset Selection to the level of ILP, in languages at least as expressive as Datalog. The main idea is to first perform a change of representation, which approximates the original relational problem by a multi-instance problem. The representation obtained as the result is suitable for FSS techniques which we adapted from attribute-value learning by taking into account some of the characteristics of the data due to the change of representation. We present the simple FSS proposed for the task, the requisite change of representation, and the entire method combining those two algorithms. The method acts as a filter, preprocessing the relational data, prior to the model building, which outputs relational examples with empirically relevant literals. We discuss experiments in which the method was successfully applied to two real-world domains.  相似文献   

2.
Scaling Up Inductive Logic Programming by Learning from Interpretations   总被引:4,自引:0,他引:4  
When comparing inductive logic programming (ILP) and attribute-value learning techniques, there is a trade-off between expressive power and efficiency. Inductive logic programming techniques are typically more expressive but also less efficient. Therefore, the data sets handled by current inductive logic programming systems are small according to general standards within the data mining community. The main source of inefficiency lies in the assumption that several examples may be related to each other, so they cannot be handled independently.Within the learning from interpretations framework for inductive logic programming this assumption is unnecessary, which allows to scale up existing ILP algorithms. In this paper we explain this learning setting in the context of relational databases. We relate the setting to propositional data mining and to the classical ILP setting, and show that learning from interpretations corresponds to learning from multiple relations and thus extends the expressiveness of propositional learning, while maintaining its efficiency to a large extent (which is not the case in the classical ILP setting).As a case study, we present two alternative implementations of the ILP system TILDE (Top-down Induction of Logical DEcision trees): TILDEclassic, which loads all data in main memory, and TILDELDS, which loads the examples one by one. We experimentally compare the implementations, showing TILDELDS can handle large data sets (in the order of 100,000 examples or 100 MB) and indeed scales up linearly in the number of examples.  相似文献   

3.
李艳娟  郭茂祖 《电脑学习》2012,2(3):13-17,22
归纳逻辑程序设计是机器学习与逻辑程序设计交叉所形成的一个研究领域,克服了传统机器学习方法的两个主要限制:即知识表示的限制和背景知识利用的限制,成为机器学习的前沿研究课题。首先从归纳逻辑程序设计的产生背景、定义、应用领域及问题背景介绍了归纳逻辑程序设计系统的概貌,对归纳逻辑程序设计方法的研究现状进行了总结和分析,最后探讨了该领域的进一步的研究方向。  相似文献   

4.
A particularly successful role for Inductive Logic Programming (ILP) is as a tool for discovering useful relational features for subsequent use in a predictive model. Conceptually, the case for using ILP to construct relational features rests on treating these features as functions, the automated discovery of which necessarily requires some form of first-order learning. Practically, there are now several reports in the literature that suggest that augmenting any existing feature with ILP-discovered relational features can substantially improve the predictive power of a model. While the approach is straightforward enough, much still needs to be done to scale it up to explore more fully the space of possible features that can be constructed by an ILP system. This is in principle, infinite and in practice, extremely large. Applications have been confined to heuristic or random selections from this space. In this paper, we address this computational difficulty by allowing features and models to be constructed in a distributed manner. That is, there is a network of computational units, each of which employs an ILP engine to construct some small number of features and then builds a (local) model. We then employ an asynchronous consensus-based algorithm, in which neighboring nodes share information and update local models. This gossip-based information exchange results in the formation of non-stationary Markov chains. For a category of models (those with convex loss functions), it can be shown (using the Supermartingale Convergence Theorem) that the algorithm will result in all nodes converging to a consensus model. In practice, it may be slow to achieve this convergence. Nevertheless, our results on synthetic and real datasets suggest that in relatively short time the “best” node in the network reaches a model whose predictive accuracy is comparable to that obtained using more computational effort in a non-distributed setting (the best node is identified as the one whose weights converge first).  相似文献   

5.
This paper is concerned with problems that arise when submitting large quantities of data to analysis by an Inductive Logic Programming (ILP) system. Complexity arguments usually make it prohibitive to analyse such datasets in their entirety. We examine two schemes that allow an ILP system to construct theories by sampling from this large pool of data. The first, “subsampling”, is a single-sample design in which the utility of a potential rule is evaluated on a randomly selected sub-sample of the data. The second, “logical windowing”, is multiple-sample design that tests and sequentially includes errors made by a partially correct theory. Both schemes are derived from techniques developed to enable propositional learning methods (like decision trees) to cope with large datasets. The ILP system CProgol, equipped with each of these methods, is used to construct theories for two datasets—one artificial (a chess endgame) and the other naturally occurring (a language tagging problem). In each case, we ask the following questions of CProgol equipped with sampling: (1) Is its theory comparable in predictive accuracy to that obtained if all the data were used (that is, no sampling was employed)?; and (2) Is its theory constructed in less time than the one obtained with all the data? For the problems considered, the answers to these questions is “yes”. This suggests that an ILP program equipped with an appropriate sampling method could begin to address problems satisfactorily that have hitherto been inaccessible simply due to data extent.  相似文献   

6.
《Information Fusion》2008,9(3):344-353
In real-world sensor networks, the monitored processes generating time-stamped data may change drastically over time. An online data-mining algorithm called OLIN (on-line information network) adapts itself automatically to the rate of concept drift in a non-stationary data stream by repeatedly constructing a classification model from every sliding window of training examples. In this paper, we introduce a new real-time data-mining algorithm called IOLIN (incremental on-line information network), which saves a significant amount of computational effort by updating an existing model as long as no major concept drift is detected. The proposed algorithm builds upon the oblivious decision-tree classification model called “information network” (IN) and it implements three different types of model updating operations. In the experiments with multi-year streams of traffic sensors data, no statistically significant difference between the accuracy of the incremental algorithm (IOLIN) vs. the regenerative one (OLIN) has been observed.  相似文献   

7.
在大数据时代下, 以高效自主隐式特征提取能力闻名的深度学习引发了新一代人工智能的热潮, 然而其背后黑箱不可解释的“捷径学习”现象成为制约其进一步发展的关键性瓶颈问题. 解耦表征学习通过探索大数据内部蕴含的物理机制和逻辑关系复杂性, 从数据生成的角度解耦数据内部多层次、多尺度的潜在生成因子, 促使深度网络模型学会像人类一样对数据进行自主智能感知, 逐渐成为新一代基于复杂性的可解释深度学习领域内重要研究方向, 具有重大的理论意义和应用价值. 本文系统地综述了解耦表征学习的研究进展, 对当前解耦表征学习中的关键技术及典型方法进行了分类阐述, 分析并汇总了现有各类算法的适用场景并对此进行了可视化实验性能展示, 最后指明了解耦表征学习今后的发展趋势以及未来值得研究的方向.  相似文献   

8.
Three-dimensional models, or pharmacophores, describing Euclidean constraints on the location on small molecules of functional groups (like hydrophobic groups, hydrogen acceptors and donors, etc.), are often used in drug design to describe the medicinal activity of potential drugs (or ‘ligands’). This medicinal activity is produced by interaction of the functional groups on the ligand with a binding site on a target protein. In identifying structure-activity relations of this kind there are three principal issues: (1) It is often difficult to “align” the ligands in order to identify common structural properties that may be responsible for activity; (2) Ligands in solution can adopt different shapes (or `conformations’) arising from torsional rotations about bonds. The 3-D molecular substructure is typically sought on one or more low-energy conformers; and (3) Pharmacophore models must, ideally, predict medicinal activity on some quantitative scale. It has been shown that the logical representation adopted by Inductive Logic Programming (ILP) naturally resolves many of the difficulties associated with the alignment and multi-conformation issues. However, the predictions of models constructed by ILP have hitherto only been nominal, predicting medicinal activity to be present or absent. In this paper, we investigate the construction of two kinds of quantitative pharmacophoric models with ILP: (a) Models that predict the probability that a ligand is “active”; and (b) Models that predict the actual medicinal activity of a ligand. Quantitative predictions are obtained by the utilising the following statistical procedures as background knowledge: logistic regression and naive Bayes, for probability prediction; linear and kernel regression, for activity prediction. The multi-conformation issue and, more generally, the relational representation used by ILP results in some special difficulties in the use of any statistical procedure. We present the principal issues and some solutions. Specifically, using data on the inhibition of the protease Thermolysin, we demonstrate that it is possible for an ILP program to construct good quantitative structure-activity models. We also comment on the relationship of this work to other recent developments in statistical relational learning. Editors: Tamás Horváth and Akihiro Yamamoto  相似文献   

9.
RRL is a relational reinforcement learning system based on Q-learning in relational state-action spaces. It aims to enable agents to learn how to act in an environment that has no natural representation as a tuple of constants. For relational reinforcement learning, the learning algorithm used to approximate the mapping between state-action pairs and their so called Q(uality)-value has to be very reliable, and it has to be able to handle the relational representation of state-action pairs. In this paper we investigate the use of Gaussian processes to approximate the Q-values of state-action pairs. In order to employ Gaussian processes in a relational setting we propose graph kernels as a covariance function between state-action pairs. The standard prediction mechanism for Gaussian processes requires a matrix inversion which can become unstable when the kernel matrix has low rank. These instabilities can be avoided by employing QR-factorization. This leads to better and more stable performance of the algorithm and a more efficient incremental update mechanism. Experiments conducted in the blocks world and with the Tetris game show that Gaussian processes with graph kernels can compete with, and often improve on, regression trees and instance based regression as a generalization algorithm for RRL. Editors: David Page and Akihiro Yamamoto  相似文献   

10.
基于RDBMS的XML数据管理技术研究   总被引:1,自引:0,他引:1  
李黎  杨春  吴微 《计算机工程与设计》2007,28(24):6008-6011
XML是一种专门为Internet所设计的标记语言,但是它已逐渐成为Internet上数据表示以及数据交换的标准,是一种发展势头良好的新兴数据管理手段.关系数据库管理系统(RDBMS)是一种技术成熟、应用十分广泛的系统.在数据管理上,XML技术和数据库技术各有优势和不足,XML和数据库结合技术成为学术界的研究热点.在对XML和数据库结合技术进行了研究的基础上一个基于RDBMS的XML数据管理的实现框架(XRM)被提出,该框架依据不同的映射策略,解析Schema文件或DTD,生成对应的关系模式,利用RDBMS存储中间件,使用户能透明地通过RDBMS来管理XML数据.该框架充分考虑了结构的灵活性和扩展性.  相似文献   

11.
Relational rule learning algorithms are typically designed to construct classification and prediction rules. However, relational rule learning can be adapted also to subgroup discovery. This paper proposes a propositionalization approach to relational subgroup discovery, achieved through appropriately adapting rule learning and first-order feature construction. The proposed approach was successfully applied to standard ILP problems (East-West trains, King-Rook-King chess endgame and mutagenicity prediction) and two real-life problems (analysis of telephone calls and traffic accident analysis). Editors: Hendrik Blockeel, David Jensen and Stefan Kramer An erratum to this article is available at .  相似文献   

12.
The growth of machine-generated relational databases, both in the sciences and in industry, is rapidly outpacing our ability to extract useful information from them by manual means. This has brought into focus machine learning techniques like Inductive Logic Programming (ILP) that are able to extract human-comprehensible models for complex relational data. The price to pay is that ILP techniques are not efficient: they can be seen as performing a form of discrete optimisation, which is known to be computationally hard; and the complexity is usually some super-linear function of the number of examples. While little can be done to alter the theoretical bounds on the worst-case complexity of ILP systems, some practical gains may follow from the use of multiple processors. In this paper we survey the state-of-the-art on parallel ILP. We implement several parallel algorithms and study their performance using some standard benchmarks. The principal findings of interest are these: (1) of the techniques investigated, one that simply constructs models in parallel on each processor using a subset of data and then combines the models into a single one, yields the best results; and (2) sequential (approximate) ILP algorithms based on randomized searches have lower execution times than (exact) parallel algorithms, without sacrificing the quality of the solutions found. This is an extended version of the paper entitled Strategies to Parallelize ILP Systems, published in the Proceedings of the 15th International Conference on Inductive Logic Programming (ILP 2005), vol. 3625 of LNAI, pp. 136–153, Springer-Verlag.  相似文献   

13.
Inductive Logic Programming (ILP) studies learning from examples, within the framework provided by clausal logic. ILP has become a popular subject in the field of data mining due to its ability to discover patterns in relational domains. Several ILP-based concept discovery systems are developed which employs various search strategies, heuristics and language pattern limitations. LINUS, GOLEM, CIGOL, MIS, FOIL, PROGOL, ALEPH and WARMR are well-known ILP-based systems. In this work, firstly introductory information about ILP is given, and then the above-mentioned systems and an ILP-based concept discovery system called C2D are briefly described and the fundamentals of their mechanisms are demonstrated on a running example. Finally, a set of experimental results on real-world problems are presented in order to evaluate and compare the performance of the above-mentioned systems.  相似文献   

14.
Image retrieval based on augmented relational graph representation   总被引:1,自引:1,他引:0  
The “semantic gap” problem is one of the main difficulties in image retrieval tasks. Semi-supervised learning, typically integrated with the relevance feedback techniques, is an effective method to narrow down the semantic gap. However, in semi-supervised learning, the amount of unlabeled data is usually much greater than that of labeled data. Therefore, the performance of a semi-supervised learning algorithm relies heavily on its effectiveness of using the relationships between the labeled and unlabeled data. This paper proposes a novel algorithm to better explore those relationships by augmenting the relational graph representation built on the entire data set, expected to increase the intra-class weights while decreasing the inter-class weights and linking the potential intra-class data. The augmented relational matrix can be directly used in any semi-supervised learning algorithms. The experimental results in a range of feedback-based image retrieval tasks show that the proposed algorithm not only achieves good generality, but also outperforms other algorithms in the same semi-supervised learning framework.  相似文献   

15.
To help computers make better decisions, it is desirable to describe all our knowledge in computer-understandable terms. This is easy for knowledge described in terms on numerical values: we simply store the corresponding numbers in the computer. This is also easy for knowledge about precise (well-defined) properties which are either true or false for each object: we simply store the corresponding “true” and “false” values in the computer. The challenge is how to store information about imprecise properties. In this paper, we overview different ways to fully store the expert information about imprecise properties. We show that in the simplest case, when the only source of imprecision is disagreement between different experts, a natural way to store all the expert information is to use random sets; we also show how fuzzy sets naturally appear in such random set representation. We then show how the random set representation can be extended to the general (“fuzzy”) case when, in addition to disagreements, experts are also unsure whether some objects satisfy certain properties or not.  相似文献   

16.
多关系数据分类方法综述   总被引:1,自引:1,他引:0  
多关系数据分类是多关系数据挖掘重要任务之一,它能够直接从多关系数据表中发现有效模式,比命题分类方法具有更大优势。根据知识表示形式及相关策略的不同将多关系数据分类分为归纳逻辑程序设计关系分类方法、图的关系分类方法和基于关系数据库的关系分类方法。着重论述了它们所采用的具体关系分类技术及其特点,对这些方法进行了对比,最后讨论了它们当前所面临的挑战性问题。  相似文献   

17.
Relational learning can be described as the task of learning first-order logic rules from examples. It has enabled a number of new machine learning applications, e.g. graph mining and link analysis. Inductive Logic Programming (ILP) performs relational learning either directly by manipulating first-order rules or through propositionalization, which translates the relational task into an attribute-value learning task by representing subsets of relations as features. In this paper, we introduce a fast method and system for relational learning based on a novel propositionalization called Bottom Clause Propositionalization (BCP). Bottom clauses are boundaries in the hypothesis search space used by ILP systems Progol and Aleph. Bottom clauses carry semantic meaning and can be mapped directly onto numerical vectors, simplifying the feature extraction process. We have integrated BCP with a well-known neural-symbolic system, C-IL2P, to perform learning from numerical vectors. C-IL2P uses background knowledge in the form of propositional logic programs to build a neural network. The integrated system, which we call CILP++, handles first-order logic knowledge and is available for download from Sourceforge. We have evaluated CILP++ on seven ILP datasets, comparing results with Aleph and a well-known propositionalization method, RSD. The results show that CILP++ can achieve accuracy comparable to Aleph, while being generally faster, BCP achieved statistically significant improvement in accuracy in comparison with RSD when running with a neural network, but BCP and RSD perform similarly when running with C4.5. We have also extended CILP++ to include a statistical feature selection method, mRMR, with preliminary results indicating that a reduction of more than 90 % of features can be achieved with a small loss of accuracy.  相似文献   

18.
A Multistrategy Approach to Relational Knowledge Discovery in Databases   总被引:1,自引:0,他引:1  
When learning from very large databases, the reduction of complexity is extremely important. Two extremes of making knowledge discovery in databases (KDD) feasible have been put forward. One extreme is to choose a very simple hypothesis language, thereby being capable of very fast learning on real-world databases. The opposite extreme is to select a small data set, thereby being able to learn very expressive (first-order logic) hypotheses. A multistrategy approach allows one to include most of these advantages and exclude most of the disadvantages. Simpler learning algorithms detect hierarchies which are used to structure the hypothesis space for a more complex learning algorithm. The better structured the hypothesis space is, the better learning can prune away uninteresting or losing hypotheses and the faster it becomes.We have combined inductive logic programming (ILP) directly with a relational database management system. The ILP algorithm is controlled in a model-driven way by the user and in a data-driven way by structures that are induced by three simple learning algorithms.  相似文献   

19.
20.
针对目前归纳逻辑程序设计(inductive logic programming,ILP)系统要求训练数据充分且无法利用无标记数据的不足,提出了一种利用无标记数据学习一阶规则的算法——关系tri-training(relational-tri-training,R-tri-training)算法。该算法将基于命题逻辑表示的半监督学习算法tri-training的思想引入到基于一阶逻辑表示的ILP系统,在ILP框架下研究如何利用无标记样例信息辅助分类器训练。R-tri-training算法首先根据标记数据和背景知识初始化三个不同的ILP系统,然后迭代地用无标记样例对三个分类器进行精化,即如果两个分类器对一个无标记样例的标记结果一致,则在一定条件下该样例将被标记给另一个分类器作为新的训练样例。标准数据集上实验结果表明:R-tri-training能有效地利用无标记数据提高学习性能,且R-tri-training算法性能优于GILP(genetic inductive logic programming)、NFOIL、KFOIL和ALEPH。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号