Relational learning can be described as the task of learning first-order logic rules from examples. It has enabled a number of new machine learning applications, e.g. graph mining and link analysis. Inductive Logic Programming (ILP) performs relational learning either directly by manipulating first-order rules or through propositionalization, which translates the relational task into an attribute-value learning task by representing subsets of relations as features. In this paper, we introduce a fast method and system for relational learning based on a novel propositionalization called Bottom Clause Propositionalization (BCP). Bottom clauses are boundaries in the hypothesis search space used by the ILP systems Progol and Aleph. Bottom clauses carry semantic meaning and can be mapped directly onto numerical vectors, simplifying the feature extraction process. We have integrated BCP with a well-known neural-symbolic system, C-IL2P, to perform learning from numerical vectors. C-IL2P uses background knowledge in the form of propositional logic programs to build a neural network. The integrated system, which we call CILP++, handles first-order logic knowledge and is available for download from Sourceforge. We have evaluated CILP++ on seven ILP datasets, comparing results with Aleph and a well-known propositionalization method, RSD. The results show that CILP++ can achieve accuracy comparable to Aleph while being generally faster. BCP achieved a statistically significant improvement in accuracy over RSD when running with a neural network, but BCP and RSD perform similarly when running with C4.5. We have also extended CILP++ to include a statistical feature selection method, mRMR, with preliminary results indicating that a reduction of more than 90% of features can be achieved with a small loss of accuracy.
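The abstract above says that bottom clauses "can be mapped directly onto numerical vectors". A minimal sketch of one way such a mapping could work, assuming each bottom clause is already available as a list of body-literal strings and that each distinct literal becomes one binary feature. The literal names and the function names `build_feature_index` and `to_vector` are hypothetical, and the actual BCP procedure may differ in details the abstract does not give.

```python
def build_feature_index(bottom_clauses):
    """Assign one column to each distinct body literal, in first-seen order."""
    index = {}
    for body in bottom_clauses:
        for literal in body:
            index.setdefault(literal, len(index))
    return index


def to_vector(body, index):
    """Encode one bottom clause as a 0/1 vector over the known literals."""
    vector = [0] * len(index)
    for literal in body:
        if literal in index:  # literals unseen when indexing are ignored
            vector[index[literal]] = 1
    return vector


# Hypothetical bottom-clause bodies, one per training example.
clauses = [
    ["parent(A,B)", "female(A)"],
    ["parent(A,B)", "male(A)"],
]
index = build_feature_index(clauses)
vectors = [to_vector(body, index) for body in clauses]
```

The resulting vectors could then be passed to any attribute-value learner, which is the role the abstract assigns to C-IL2P and C4.5.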

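Chinese functional chunks describe the basic skeleton of a sentence and are an important bridge linking syntactic structure and semantic description. This paper proposes two different functional chunk analysis models, a boundary recognition model and a sequence labeling model, and simulates them computationally using different machine learning methods. By combining the analysis results of the two models and fully exploiting their complementarity, automatic recognition performance for the four typical functional chunks of Chinese sentences (subject, predicate, object, and adverbial) exceeds 80%. The experimental results show that machine learning algorithms based on local lexical context can accurately recognize most functional chunks from different perspectives, and that complex clauses and serial multi-verb constructions are the main recognition difficulties.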

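Inductive Logic Programming (ILP) is an important branch of machine learning. Given a set of examples and relevant background knowledge, ILP studies how to construct logic programs consistent with them; these logic programs consist of a finite number of first-order clauses. This paper describes ICCR, an algorithm that combines the strengths of several current ILP methods. ICCR merges the top-down search strategy typified by FOIL with the bottom-up search strategy typified by GOLEM, and can invent new predicates and learn recursive logic programs as needed. Comparative experiments show that, for the same examples and background knowledge, ICCR learns target logic programs of higher accuracy than FOIL and GOLEM.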

Nearly two decades of research in the area of Inductive Logic Programming (ILP) have seen steady progress in clarifying its theoretical foundations and regular demonstrations of its applicability to complex problems in very diverse domains. These results are necessary, but not sufficient, for ILP to be adopted as a tool for data analysis in an era of very large machine-generated scientific and industrial datasets, accompanied by programs that provide ready access to complex relational information in machine-readable forms (ontologies, parsers, and so on). Besides the usual issues about ease of use, ILP is now confronted with questions of implementation. We are concerned here with two of these, namely: can an ILP system construct models efficiently when (a) dataset sizes are too large to fit in the memory of a single machine; and (b) search space sizes become prohibitively large to explore using a single machine? In this paper, we examine the applicability to ILP of a popular distributed computing approach that provides a uniform way of performing data-parallel and task-parallel computations in ILP. The MapReduce programming model allows, in principle, very large numbers of processors to be used without any special understanding of the underlying hardware or software involved. Specifically, we show how the MapReduce approach can be used to perform the coverage test that is at the heart of many ILP systems, and to perform the multiple searches required by a greedy set-covering algorithm used by some popular ILP systems. Our principal findings with synthetic and real-world datasets for both data and task parallelism are these: (a) ignoring overheads, the time to perform the computations concurrently increases with the size of the dataset for data parallelism and with the size of the search space for task parallelism. For data parallelism this increase is roughly in proportion to increases in dataset size; (b) if a MapReduce implementation is used as part of an ILP system, then benefits for data parallelism can only be expected above some minimal dataset size, and for task parallelism can only be expected above some minimal search-space size; and (c) the MapReduce approach appears better suited to exploiting data parallelism in ILP.
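The data-parallel coverage test described in the abstract above can be sketched in a few lines, assuming the examples are split into partitions, a clause is represented as a plain Python predicate, and each example is an `(instance, is_positive)` pair. These representations and the names `map_coverage`, `reduce_counts`, and `coverage` are illustrative assumptions, not the paper's implementation; a real system would test logical entailment and run the map step on separate workers.

```python
from functools import reduce


def map_coverage(clause, partition):
    """Map step: count covered positives and negatives in one partition."""
    pos = sum(1 for x, is_positive in partition if is_positive and clause(x))
    neg = sum(1 for x, is_positive in partition if not is_positive and clause(x))
    return (pos, neg)


def reduce_counts(a, b):
    """Reduce step: add the per-partition counts component-wise."""
    return (a[0] + b[0], a[1] + b[1])


def coverage(clause, partitions):
    """Total (positives covered, negatives covered) over all partitions."""
    partial = (map_coverage(clause, p) for p in partitions)
    return reduce(reduce_counts, partial, (0, 0))


# Toy data: integers labelled positive or negative, split into two partitions.
partitions = [
    [(2, True), (3, False)],
    [(4, True), (6, True), (7, False)],
]


def is_even(x):
    return x % 2 == 0


total = coverage(is_even, partitions)
```

Because each partition is scanned independently, the map step is the part that would be distributed; only the small count pairs need to be combined.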

To date, Inductive Logic Programming (ILP) systems have largely assumed that all data needed for learning have been provided at the onset of model construction. Increasingly, for application areas like telecommunications, astronomy, text processing, financial markets and biology, machine-generated data are being produced continuously and on a vast scale. We see at least four kinds of problems that this presents for ILP: (1) it may not be possible to store all of the data, even in secondary memory; (2) even if it were possible to store the data, it may be impractical to construct an acceptable model using partitioning techniques that repeatedly perform expensive coverage or subsumption tests on the data; (3) models constructed at some point may become less effective, or even invalid, as more data become available (exemplified by the “drift” problem when identifying concepts); and (4) the representation of the data instances may need to change as more data become available (a kind of “language drift” problem). In this paper, we investigate the adoption of a stream-based on-line learning approach to relational data. Specifically, we examine the representation of relational data in both an infinite-attribute setting and the usual fixed-attribute setting, and develop implementations that use ILP engines in combination with on-line model constructors. The behaviour of each program is investigated using a set of controlled experiments, and performance in practical settings is demonstrated by constructing complete theories for some of the largest biochemical datasets examined by ILP systems to date, including one with a million examples; to the best of our knowledge, this is the first time this has been empirically demonstrated with ILP on a real-world dataset.

Hypotheses constructed by inductive logic programming (ILP) systems are finite sets of definite clauses. Top-down ILP systems usually adopt the following greedy clause-at-a-time strategy to construct such a hypothesis: start with the empty set of clauses and repeatedly add the clause that most improves the quality of the set. This paper formulates and analyses an alternative method for constructing hypotheses. The method, called cautious induction, consists of a first stage, which finds a finite set of candidate clauses, and a second stage, which selects a finite subset of these clauses to form a hypothesis. By using a less greedy method in the second stage, cautious induction can find hypotheses of higher quality than can be found with a clause-at-a-time algorithm. We have implemented a top-down, cautious ILP system called CILS. This paper presents CILS and compares it to Progol, a top-down clause-at-a-time ILP system. The sizes of the search spaces confronted by the two systems are analysed and an experiment examines their performance on a series of mutagenesis learning problems. Simon Anthony, BEng.: Simon, perhaps better known as “Mr. Cautious” in Inductive Logic Programming (ILP) circles, completed a BEng in Information Engineering at the University of York in 1995. He remained at York as a research student in the Intelligent Systems Group. Concentrating on ILP, his research interests are Cautious Induction and developing number handling techniques using Constraint Logic Programming. Alan M. Frisch, Ph.D.: He is the Reader in Intelligent Systems at the University of York (UK), and he heads the Intelligent Systems Group in the Department of Computer Science. He was awarded a Ph.D. in Computer Science from the University of Rochester (USA) in 1986 and has held faculty positions at the University of Sussex (UK) and the University of Illinois at Urbana-Champaign (USA). For over 15 years Dr. Frisch has been conducting research on a wide range of topics in the area of automated reasoning, including knowledge retrieval, probabilistic inference, constraint solving, parsing as deduction, inductive logic programming and the integration of constraint solvers into automated deduction systems.
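The greedy clause-at-a-time baseline described in the abstract above ("start with the empty set of clauses and repeatedly add the clause that most improves the quality of the set") can be sketched as follows. This is an illustration under stated assumptions, not CILS or Progol: candidate clauses are given up front as a dictionary from a clause name to the set of example ids it covers, and quality is taken to be positives covered minus negatives covered. Both choices, and the name `greedy_hypothesis`, are hypothetical.

```python
def greedy_hypothesis(candidates, positives, negatives):
    """Add, one at a time, the clause that most improves hypothesis quality."""

    def quality(selected):
        # Union of the examples covered by every selected clause.
        covered = set().union(*(candidates[name] for name in selected))
        return len(covered & positives) - len(covered & negatives)

    hypothesis = []
    while True:
        best_name, best_quality = None, quality(hypothesis)
        for name in candidates:
            if name in hypothesis:
                continue
            q = quality(hypothesis + [name])
            if q > best_quality:  # strict improvement only
                best_name, best_quality = name, q
        if best_name is None:  # no clause improves the set: stop
            return hypothesis
        hypothesis.append(best_name)


positives = {1, 2, 3, 4}
negatives = {5, 6}
candidates = {
    "c1": {1, 2, 5},  # covers two positives and one negative
    "c2": {3, 4},
    "c3": {1, 2},
}
result = greedy_hypothesis(candidates, positives, negatives)
```

A cautious second stage, as the abstract describes it, would instead choose a subset of the candidate clauses with a less greedy selection method; that selection method is not specified in the abstract and is not sketched here.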

New words can benefit many NLP tasks such as sentence chunking and sentiment analysis. However, automatic new word extraction is a challenging task because new words usually have no fixed language pattern and may even be existing words that have taken on new meanings. To tackle these problems, this paper proposes a novel method to extract new words. It not only considers domain specificity, but also combines multiple kinds of statistical language knowledge. First, we apply a filtering algorithm to obtain a candidate list of new words. Then, we employ the statistical language knowledge to extract the top-ranked new words. Experimental results show that our proposed method is able to extract a large number of new words from both Chinese and English corpora, and notably outperforms the state-of-the-art methods. Moreover, we also demonstrate that our method increases the accuracy of Chinese word segmentation by 10% on corpora containing new words.

Inductive Logic Programming (ILP) deals with the problem of finding a hypothesis covering positive examples and excluding negative examples, where both hypotheses and examples are expressed in first-order logic. In this paper we employ constraint satisfaction techniques to model and solve a problem known as template ILP consistency, which assumes that the structure of a hypothesis is known and the task is to find a unification of the contained variables. In particular, we present a constraint model with index variables accompanied by a Boolean model to strengthen inference and hence improve efficiency. The efficiency of the models is demonstrated experimentally.

This paper is concerned with problems that arise when submitting large quantities of data to analysis by an Inductive Logic Programming (ILP) system. Complexity arguments usually make it prohibitive to analyse such datasets in their entirety. We examine two schemes that allow an ILP system to construct theories by sampling from this large pool of data. The first, “subsampling”, is a single-sample design in which the utility of a potential rule is evaluated on a randomly selected sub-sample of the data. The second, “logical windowing”, is a multiple-sample design that tests and sequentially includes errors made by a partially correct theory. Both schemes are derived from techniques developed to enable propositional learning methods (like decision trees) to cope with large datasets. The ILP system CProgol, equipped with each of these methods, is used to construct theories for two datasets, one artificial (a chess endgame) and the other naturally occurring (a language tagging problem). In each case, we ask the following questions of CProgol equipped with sampling: (1) Is its theory comparable in predictive accuracy to that obtained if all the data were used (that is, no sampling was employed)?; and (2) Is its theory constructed in less time than the one obtained with all the data? For the problems considered, the answer to both questions is “yes”. This suggests that an ILP program equipped with an appropriate sampling method could begin to address satisfactorily problems that have hitherto been inaccessible simply due to the extent of the data.
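The "subsampling" scheme in the abstract above evaluates a rule's utility on a randomly selected sub-sample rather than on all the data. A minimal sketch of that idea, assuming a rule is a Python predicate, an example is an `(instance, label)` pair, and utility is plain accuracy; these choices and the name `estimate_accuracy` are assumptions made for illustration and are not taken from CProgol.

```python
import random


def estimate_accuracy(rule, examples, sample_size, seed=0):
    """Estimate a rule's accuracy from a random sub-sample of the examples."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    size = min(sample_size, len(examples))
    sample = rng.sample(examples, size)
    correct = sum(1 for instance, label in sample if rule(instance) == label)
    return correct / size


# Toy pool: the label says whether the integer is even.
examples = [(n, n % 2 == 0) for n in range(1000)]


def perfect_rule(n):
    return n % 2 == 0


def always_true(n):
    return True


exact = estimate_accuracy(perfect_rule, examples, sample_size=50)
rough = estimate_accuracy(always_true, examples, sample_size=50)
```

The estimate for `always_true` varies with the sample drawn, which is the trade-off the abstract studies: less evaluation time in exchange for a noisier utility value.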

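A key problem in automatic summarization systems is identifying the important sentences that can make up a summary. There are many ways to find these sentences, but few of them use machine learning. This paper proposes an automatic learning method for summary sentence patterns. The method takes a number of sentences that have undergone simple preprocessing as the training set and, starting from positive example sentences, performs bottom-up generalization to abstract general concepts about sentence patterns and form a set of sentence-pattern rules, which serves as an effective means of judging which sentences in a text can be used as summary sentences. This is the core part of the implementation of a summarization system.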

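The main task of chunking is to identify and delimit chunks, which simplifies the task of syntactic parsing to some extent. To address the difficulties encountered in chunking long sentences, this paper proposes a chunking method based on a divide-and-conquer strategy. The basic idea is first to identify the maximal noun phrases in a sentence and, based on the result, to decompose the sentence into a maximal-noun-phrase part and a sentence-frame part; different models are then used to analyze the different units, and the results are combined to complete the chunking process. The method decomposes a whole sentence into smaller chunking units, reducing sentence complexity. Experiments on the Penn Chinese Treebank CTB4 dataset show an average F1 of 91.79% across the chunk types, better than other current chunking methods.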

Artificial Intelligence, 2007, 171(16-17): 939-950
In this paper we propose a new formalization of the inductive logic programming (ILP) problem for a better handling of exceptions. It is now encoded in first-order possibilistic logic. This allows us to handle exceptions by means of prioritized rules, thus taking lessons from non-monotonic reasoning. Indeed, in classical first-order logic, the exceptions of the rules that constitute a hypothesis accumulate, and classifying an example in two different classes, even if one is the right one, is not correct. The possibilistic formalization provides a sound encoding of non-monotonic reasoning that copes with rules with exceptions and prevents an example from being classified in more than one class. The benefits of our approach with respect to the use of first-order decision lists are pointed out. The possibilistic logic view of the ILP problem leads to an optimization problem at the algorithmic level. An algorithm based on simulated annealing that computes, in a single pass, the set of rules together with their priority levels is proposed. The reported experiments show that the algorithm is competitive with standard ILP approaches on benchmark examples.

Rough Problem Settings for ILP Dealing With Imperfect Data   (cited 1 time in total; self-citations: 0; cited by others: 1)
This paper applies rough set theory to Inductive Logic Programming (ILP, a relatively new method in machine learning) to deal with imperfect data occurring in large real-world applications. We investigate various kinds of imperfect data in ILP and propose rough problem settings to deal with incomplete background knowledge (where essential predicates/clauses are missing), indiscernible data (where some examples belong to both sets of positive and negative training examples), missing classification (where some examples are unclassified) and too strong a declarative bias (hence the failure in searching for solutions). The rough problem settings relax the strict requirements in the standard normal problem setting for ILP, so that rough but useful hypotheses can be induced from imperfect data. We give simple measures of learning quality for the rough problem settings. For other kinds of imperfect data (noisy data, overly sparse data, missing values, real-valued data, etc.), while referring to their traditional handling techniques, we also point out the possibility of new methods based on rough set theory.

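Chen Xu, Wan Jiuqing. Acta Automatica Sinica, 2017, 43(3): 376-389
A new method for joint detection and tracking of multiple cells is proposed. A complete set of cell observation hypotheses is constructed by ellipse fitting, and several kinds of local events are defined to describe cell behavior and the errors that may occur in the detection stage. By introducing corresponding label variables, cell tracking is modeled as a structured prediction problem, and the globally optimal cell trajectories are obtained by solving a constrained integer programming problem. For parameter learning in the structured prediction model, the paper uses the Block-coordinate Frank-Wolfe optimization algorithm to solve for the optimal model parameters from given training samples, and also gives a nonlinear kernelized version of the algorithm. The proposed algorithm is validated on several public datasets, and the results show a significant improvement in experimental performance over traditional algorithms.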

Attribute-value based representations, standard in today's data mining systems, have a limited expressiveness. Inductive Logic Programming provides an interesting alternative, particularly for learning from structured examples whose parts, each with its own attributes, are related to each other by means of first-order predicates. Several subsets of first-order logic (FOL) with different expressive power have been proposed in Inductive Logic Programming (ILP). The challenge lies in the fact that the more expressive the subset of FOL the learner works with, the more critical the dimensionality of the learning task. The Datalog language is expressive enough to represent realistic learning problems when data is given directly in a relational database, making it a suitable tool for data mining. Consequently, it is important to elaborate techniques that will dynamically decrease the dimensionality of learning tasks expressed in Datalog, just as Feature Subset Selection (FSS) techniques do in attribute-value learning. The idea of re-using these techniques in ILP runs immediately into a problem, as ILP examples have variable size and do not share the same set of literals. We propose here the first paradigm that brings Feature Subset Selection to the level of ILP, in languages at least as expressive as Datalog. The main idea is to first perform a change of representation, which approximates the original relational problem by a multi-instance problem. The resulting representation is suitable for FSS techniques, which we adapted from attribute-value learning by taking into account some of the characteristics of the data due to the change of representation. We present the simple FSS proposed for the task, the requisite change of representation, and the entire method combining those two algorithms. The method acts as a filter that preprocesses the relational data prior to model building and outputs relational examples with empirically relevant literals. We discuss experiments in which the method was successfully applied to two real-world domains.

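An Efficient Pruning Algorithm for Least Squares Support Vector Machine Classifiers   (cited 2 times in total; self-citations: 0; cited by others: 2)
To address the loss of sparseness in least squares support vector machines, an efficient pruning algorithm is proposed. To avoid solving the initial system of linear algebraic equations, a bottom-up strategy is adopted. During training, block incremental learning and decremental learning alternate according to certain pruning conditions, so that a small set of support vectors forms automatically. The final classifier is constructed from this set. To test its effectiveness, the new algorithm was applied to five UCI datasets. The experimental results show that, with an incremental block size of 2, the new pruning algorithm yields sparse solutions with almost no loss of accuracy. In addition, the new algorithm is faster than the SMO algorithm. It is applicable not only to least squares support vector machine classifiers but can also be extended to least squares support vector regression.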

Several applications of Inductive Logic Programming (ILP) are presented. These belong to various areas of engineering, including mechanical, environmental, software, and dynamical systems engineering. The particular applications are finite element mesh design, biological classification of river water quality, data reification, inducing program invariants, learning qualitative models of dynamic systems, and learning control rules for dynamic systems. A number of other applications are briefly mentioned. Finally, a discussion of the advantages and disadvantages of ILP as compared to other approaches to machine learning is given.

Scheduling periodic tasks onto a multiprocessor architecture under several constraints such as performance, cost, energy, and reliability is a major challenge in embedded systems. In this paper, we present an Integer Linear Programming (ILP) based framework that maps a given task set onto a Heterogeneous Multiprocessor System-on-Chip (HMPSoC) architecture. Our framework can be used with several objective functions: minimizing energy consumption, minimizing cost (i.e., the number of heterogeneous processors), and maximizing reliability of the system under performance constraints. We use Dynamic Voltage Scaling (DVS) to reduce energy consumption, and we employ task duplication to maximize reliability. We illustrate the effectiveness of our approach through several experiments, each with a different number of tasks to be scheduled. We also propose two heuristics based on the Earliest Deadline First (EDF) algorithm for minimizing energy under performance and cost constraints. Our experiments on generated task sets show that the ILP-based method reduces energy consumption by up to 62% compared with a method that does not apply DVS. The heuristic methods obtain promising results when compared to the optimal results generated by our ILP-based method.
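The abstract above does not describe its two EDF-based heuristics, so the following is not a reconstruction of them. It is only a sketch of the classic feasibility check that any EDF-based mapping heuristic could use when deciding whether a set of periodic tasks fits on one processor: under preemptive EDF on a single processor, with each deadline equal to the task's period, a task set is schedulable exactly when its total utilization is at most 1. The task representation as `(wcet, period)` integer pairs and the name `edf_schedulable` are assumptions.

```python
from fractions import Fraction


def edf_schedulable(tasks):
    """Check the EDF utilization bound for one processor.

    tasks: list of (wcet, period) integer pairs, with deadline == period.
    """
    # Fractions avoid floating-point rounding at the boundary value 1.
    utilization = sum(Fraction(wcet, period) for wcet, period in tasks)
    return utilization <= 1


feasible = edf_schedulable([(1, 4), (2, 8), (1, 2)])  # utilization exactly 1
overloaded = edf_schedulable([(3, 4), (1, 2)])        # utilization 5/4
```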

Identifying the correct sense of a word in context is crucial for many tasks in natural language processing (machine translation is an example). State-of-the-art methods for Word Sense Disambiguation (WSD) build models using hand-crafted features that usually capture shallow linguistic information. Complex background knowledge, such as semantic relationships, is typically either not used or used in a specialised manner, due to the limitations of the feature-based modelling techniques used. On the other hand, empirical results from the use of Inductive Logic Programming (ILP) systems have repeatedly shown that they can use diverse sources of background knowledge when constructing models. In this paper, we investigate whether this ability of ILP systems could be used to improve the predictive accuracy of models for WSD. Specifically, we examine the use of a general-purpose ILP system as a method to construct a set of features using semantic, syntactic and lexical information. This feature set is then used by a common modelling technique in the field (a support vector machine) to construct a classifier for predicting the sense of a word. In our investigation we examine one-shot and incremental approaches to feature-set construction applied to monolingual and bilingual WSD tasks. The monolingual tasks use 32 verbs and 85 verbs and nouns (in English) from the SENSEVAL-3 and SemEval-2007 benchmarks, while the bilingual WSD task consists of 7 highly ambiguous verbs in translating from English to Portuguese. The results are encouraging: the ILP-assisted models show substantial improvements over those that simply use shallow features. In addition, incremental feature-set construction appears to identify smaller and better sets of features. Taken together, the results suggest that the use of ILP with diverse sources of background knowledge provides a way of making substantial progress in the field of WSD. A.S. is also an Adjunct Professor at the Department of Computer Science and Engineering, University of New South Wales; and a Visiting Professor at the Computing Laboratory, University of Oxford.
