Similar Literature (20 similar documents found)
1.
Unsupervised word sense disambiguation based on sense-gloss words in a vector space model. Cited: 22 (self: 0, others: 22)
鲁松, 白硕, 黄雄. 《软件学报》 (Journal of Software), 2002, 13(6): 1082-1089
Although the introduction of supervised machine learning has brought substantial progress to word sense disambiguation (WSD), the large amount of manual sense annotation it requires makes it hard to apply to large-scale WSD tasks. To address this problem, an unsupervised learning method is proposed that avoids the huge cost of manual sense annotation. Supported only by a knowledge base of sense-gloss words, it maps the ambiguous word and the gloss words of each sense into a vector space and selects the sense by computing their similarity with a k-NN (k=1) method. In disambiguation tests on 10 typical ambiguous words, the method achieved an average accuracy of 83.13%.
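To make the gloss-vector idea above concrete, here is a minimal Python sketch (not the authors' code): the gloss words of each sense and the target context are mapped into a shared bag-of-words space, and the most similar sense wins (k-NN with k=1, cosine similarity). The example word, senses, and gloss lists are invented for illustration.

```python
from collections import Counter
import math

def bow(tokens):
    # Bag-of-words vector as a token -> count mapping.
    return Counter(tokens)

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def disambiguate(context_tokens, sense_glosses):
    ctx = bow(context_tokens)
    # k-NN with k=1: the single most similar sense vector decides the label.
    return max(sense_glosses, key=lambda s: cosine(ctx, bow(sense_glosses[s])))

glosses = {  # hypothetical gloss words for two senses of "bank"
    "bank/finance": ["money", "deposit", "loan", "account"],
    "bank/river": ["river", "shore", "water", "slope"],
}
print(disambiguate(["she", "opened", "an", "account", "to", "deposit", "money"], glosses))
```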

2.
《计算机工程》 (Computer Engineering), 2017, (9): 210-213
Word sense disambiguation plays an important role in machine translation, information retrieval, and speech and semantic recognition. To improve disambiguation quality and refine feature granularity, a multi-feature WSD approach is proposed. Dependency parsing is used to extract features of the ambiguous word and its candidate senses from the context, including part of speech, dependency structure, and dependency words; a weight function is constructed over these features, and the sense with the largest weight is selected. Experimental results show that, compared with single-feature WSD, the dependency-based multi-feature approach refines the feature granularity and improves disambiguation accuracy.
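A hedged sketch of the weight-function idea: each candidate sense is scored by a weighted sum of matches between its features and the context features a dependency parser would extract. The feature names and weights below are assumptions, not values from the paper.

```python
FEATURE_WEIGHTS = {"pos": 0.2, "dep_relation": 0.4, "dep_word": 0.4}  # assumed weights

def sense_score(context_feats, sense_feats):
    # Weighted sum over features that the context and the sense agree on.
    return sum(w for f, w in FEATURE_WEIGHTS.items()
               if context_feats.get(f) == sense_feats.get(f))

def pick_sense(context_feats, candidate_senses):
    # candidate_senses: {sense_id: feature dict}; the largest-weight sense wins.
    return max(candidate_senses,
               key=lambda s: sense_score(context_feats, candidate_senses[s]))

context = {"pos": "NN", "dep_relation": "nsubj", "dep_word": "flow"}   # invented
senses = {
    "bank/river": {"pos": "NN", "dep_relation": "nsubj", "dep_word": "flow"},
    "bank/finance": {"pos": "NN", "dep_relation": "dobj", "dep_word": "rob"},
}
print(pick_sense(context, senses))
```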

3.
Word sense disambiguation is a key problem in natural language processing. To improve the accuracy of large-scale WSD, a template-based unsupervised method is proposed. Each sense of an ambiguous word is described by synonymous or near-synonymous monosemous words, and a context template is constructed that jointly considers the positions, context distances, and frequencies of co-occurring words, which effectively resolves the difficulty of determining the sense of an ambiguous word. Experimental results show that the proposed method clearly improves disambiguation performance.

4.
Unsupervised word sense disambiguation based on k-means clustering. Cited: 5 (self: 3, others: 5)
Unsupervised WSD avoids the huge cost of manual sense annotation, scales to large-scale disambiguation tasks, and therefore has broad application prospects. This paper proposes an unsupervised WSD method that builds context vectors from second-order contexts, clusters them with the k-means algorithm, and then disambiguates senses by computing similarities. The experiments were carried out on extracted terms; in two tests on several high-frequency ambiguous Chinese words, the method achieved good average accuracies of 82.67% and 80.87%.
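The clustering step might look like the following scikit-learn sketch on toy data; the paper's corpus, term extraction, and similarity-based mapping of clusters to senses are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Second-order context vectors: each context is represented by the average of
# the word vectors of its content words (here: random vectors standing in for
# two hypothetical senses).
contexts = np.vstack([rng.normal(0, 1, (50, 20)), rng.normal(3, 1, (50, 20))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(contexts)
# Each cluster would then be mapped to a sense by comparing its centroid with
# the sense descriptions (that similarity step is omitted in this sketch).
print(km.labels_[:5], km.labels_[-5:])
```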

5.
车玲, 张仰森. 《计算机工程》 (Computer Engineering), 2012, 38(20): 152-155
Conditional random fields (CRFs) are used as the probabilistic model for building a WSD model library. CRFs are trained separately on punctuation-delimited sentence corpora for high-frequency and low-frequency senses, and the generated model files are applied in disambiguation experiments. By analyzing the probability values in the labeling results, thresholds are determined that separate correctly labeled items from incorrect ones. The better-performing model files, together with their thresholds, are used to build a CRF model library for WSD. Experimental results show that modeling low-frequency senses disambiguates better than modeling high-frequency senses, reaching an accuracy above 80% together with a high recall.

6.
Unsupervised word sense disambiguation based on MDL clustering. Cited: 2 (self: 0, others: 2)
Unsupervised WSD avoids the huge cost of manual sense annotation, scales to large-scale disambiguation tasks, and therefore has broad application prospects. This paper proposes an unsupervised WSD method that uses the HowNet lexicon as its dictionary, builds context vectors from second-order contexts, clusters them with the MDL algorithm, and then disambiguates senses by computing similarities. The experiments were carried out on extracted terms; in tests on 8 high-frequency ambiguous Chinese words, the method achieved a good average accuracy of 81.12%.

7.
This paper presents a small word sense disambiguation system implemented in Perl under Linux. The algorithm consists of several stages: preprocessing, computing word vectors, and computing context vectors and ambiguous-word vectors. A sense is chosen when its gloss words are more similar to the target context than the gloss words of the other senses. Experiments on a corpus are reported, the shortcomings of the results are analyzed, and further research is left as future work.

8.
Word sense disambiguation addresses how to make a computer understand the specific meaning of an ambiguous word in context, and it plays a very important role in natural language processing tasks such as information retrieval, machine translation, text classification, and automatic summarization. By introducing syntactic information, a new WSD method is proposed. A syntax tree is constructed for the context of the ambiguous word, and syntactic, part-of-speech, and word-form information are extracted as disambiguation features. A Bayesian model is used to build the WSD classifier, which is then applied to the test set. Experimental results show that disambiguation accuracy improves, reaching 65%.
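A small illustration of a Bayes classifier over the kinds of features the abstract names (syntactic, part-of-speech, word-form); the feature dictionaries and sense labels below are invented, and the pipeline is a generic scikit-learn construction rather than the paper's implementation.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

X = [  # hypothetical per-occurrence feature dicts extracted from parse trees
    {"pos": "NN", "parent_label": "VP", "word_form": "bank"},
    {"pos": "NN", "parent_label": "PP", "word_form": "bank"},
    {"pos": "NN", "parent_label": "VP", "word_form": "banks"},
]
y = ["finance", "river", "finance"]

# DictVectorizer one-hot encodes the categorical features; MultinomialNB
# then plays the role of the Bayesian disambiguation model.
clf = make_pipeline(DictVectorizer(), MultinomialNB())
clf.fit(X, y)
print(clf.predict([{"pos": "NN", "parent_label": "VP", "word_form": "bank"}]))
```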

9.
Chinese word sense disambiguation using multi-classifier ensembles. Cited: 10 (self: 0, others: 10)
Word sense disambiguation has long been a hot and difficult topic in natural language processing, and ensemble methods are regarded as one of the four major trends in machine learning research. This paper systematically studies the application of nine ensemble learning methods to Chinese WSD: the product rule, mean, maximum, minimum, majority voting, rank-based voting, weighted voting, probability weighting, and single-classifier fusion; of these, the product rule, mean, and maximum had not previously been applied to WSD. Support vector machines, naive Bayes, and decision trees are chosen as the three base classifiers. Experiments are carried out on two datasets: 18 ambiguous words selected from a sense-annotated corpus of modern Chinese, and the Chinese-English translation selection WSD task of the international evaluation SemEval-2007. The results show that the three newly introduced ensemble methods (product, mean, and maximum) perform well: all three achieve higher disambiguation accuracy than the best single classifier (SVM) and outperform the other six ensemble methods.
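Three of the combination rules named above (product, mean, maximum) reduce to simple operations on the stacked per-class probability outputs of the base classifiers, as in this sketch (the toy probabilities are invented):

```python
import numpy as np

def combine(prob_list, rule="product"):
    """prob_list: list of (n_samples, n_classes) probability arrays, one per
    base classifier; returns the combined class index per sample."""
    P = np.stack(prob_list)          # shape: (n_clf, n_samples, n_classes)
    if rule == "product":
        scores = P.prod(axis=0)      # product rule
    elif rule == "mean":
        scores = P.mean(axis=0)      # mean rule
    elif rule == "max":
        scores = P.max(axis=0)       # maximum rule
    else:
        raise ValueError(rule)
    return scores.argmax(axis=1)

p1 = np.array([[0.7, 0.3], [0.4, 0.6]])
p2 = np.array([[0.6, 0.4], [0.2, 0.8]])
print(combine([p1, p2], "product"), combine([p1, p2], "mean"))
```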

10.
Supervised word sense disambiguation based on the vector space model. Cited: 22 (self: 1, others: 21)
Word sense disambiguation has always been a key problem in natural language understanding, and how well it is solved directly affects the quality of many downstream natural language processing applications. Because representing natural language knowledge is difficult and hand-crafted rules fail to achieve satisfactory disambiguation, various supervised machine learning methods have been applied to WSD. Building on earlier work, this paper borrows the term weighting techniques of the vector space model from information retrieval to represent the knowledge of each sense of an ambiguous word, proposes a method for computing context position weights, and presents a supervised machine learning method for WSD based on the vector space model. The senses and contexts of an ambiguous word are mapped into a vector space, and the sense of a context vector is determined with a k-NN (k=1) method by computing the distance between the context vector and each sense vector. In open and closed tests on 9 high-frequency ambiguous Chinese words, the method achieved outstanding results (average accuracy 96.31% in the closed test and 92.98% in the open test), verifying its effectiveness.
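The context position weighting can be sketched as a distance-based decay when building the context vector; the 1/(1+distance) form below is an assumption, not the paper's actual formula.

```python
from collections import defaultdict

def position_weighted_vector(tokens, target_index):
    # Words nearer the ambiguous word contribute more weight to the vector.
    vec = defaultdict(float)
    for i, tok in enumerate(tokens):
        if i != target_index:
            vec[tok] += 1.0 / (1 + abs(i - target_index))  # assumed decay
    return dict(vec)

print(position_weighted_vector(["deposit", "the", "money", "in", "the", "bank"], 5))
```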

11.
The article presents a new approach to calculating the weights of the base classifiers in a committee of classifiers. The obtained weights are interpreted in the context of interval-valued sets. The work proposes four different ways of calculating weights, all of which consider both the correctness and the incorrectness of the classification. The proposed weights are used in algorithms that combine the outputs of the base classifiers, at both the rank level and the measure level. The experiments involve several datasets from the UCI repository and two datasets with generated distributions, and compare algorithms whose weights are calculated by resubstitution with the algorithms proposed in the work. The ensemble of classifiers is also compared with the base classifiers that form the committee.

12.
It has been established that committee classifiers, in which the outputs of different individual network classifiers are combined in various ways, can produce better accuracy than the best individual in the committee. We describe results showing that these advantages are obtained when neural networks are applied to a taxonomic problem in marine science: the classification of images of marine phytoplankton. Significant benefits were found when individual networks, trained on different classes of input and having comparable individual performance, were combined. Combining networks of very different accuracy did not improve performance relative to the best single network, but neither did it reduce performance. An alternative architecture, which we term a collective machine, in which the different data types are combined in a single network, was found to have significantly better accuracy than the committee machine architectures. The performance gains and resilience to non-discriminatory types of data suggest that the techniques have great utility in the development of general-purpose network classifiers.

13.
Diversity of classifiers is generally accepted as being necessary for combining them in a committee. Quantifying the diversity of classifiers, however, is difficult, as there is no formal definition of it. Numerous measures have been proposed in the literature, but their performance is often known to be suboptimal. Here several common methods are compared with a novel approach focusing on the diversity of the errors made by the member classifiers. Experiments on combining classifiers for handwritten character recognition are presented. The results show that the diversity-of-errors approach is beneficial and that the novel exponential error count measure can consistently find an effective set of member classifiers.
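A generic way to look at the diversity of errors (the paper's exponential error count measure itself is not reproduced here) is to measure how the member classifiers' errors concentrate on the same samples:

```python
import numpy as np

def error_overlap(preds, y):
    """preds: (n_clf, n_samples) predictions; returns, per sample, the
    fraction of member classifiers that are wrong. Errors spread thinly
    across different samples (low overlap) are diverse and hence more
    easily corrected by the committee vote."""
    errors = (np.asarray(preds) != np.asarray(y)).astype(float)
    return errors.mean(axis=0)

preds = [[0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 1, 0]]
print(error_overlap(preds, [0, 1, 1, 1]))
```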

14.
In the information retrieval setting, there are problems where the goal is to recover objects of a particular class from large sets of unlabelled objects, and in some of them only examples of the class to be recovered are available. For such problems, the machine learning community has developed algorithms that learn binary classifiers in the absence of negative examples. Among them are the positive Bayesian network classifiers, algorithms that induce Bayesian network classifiers from positive and unlabelled examples. Their main drawback is that they require prior knowledge of the a priori probability of the class. In this paper, we propose a wrapper approach that handles learning when no such information is available, setting this probability to the value that is optimal in terms of the recovery of positive examples. Evaluating classifiers in positive-unlabelled learning problems is itself non-trivial; we have also worked on this problem and propose a new guiding metric, which we call the pseudo-F, to be used in the search for the optimal a priori probability of the positive class. We have empirically tested the proposed metric and the wrapper classifiers on both synthetic and real-life datasets. The results of this empirical comparison show that the wrapper Bayesian network classifiers give competitive results, particularly when the actual a priori probability of the positive class is high.

15.
This paper discusses two techniques for improving the accuracy of online handwritten character recognition: committee classification and adaptation to the user. Combining classifiers is a common method for improving recognition performance; improvements are possible because the member classifiers may make different errors. Much variation exists in handwritten characters, and adaptation is one feasible way of dealing with it. Even though adaptation is usually performed for single classifiers, it is also possible to use adaptive committees. Some novel adaptive committee structures, namely the dynamically expanding context (DEC), modified current best learning (MCBL), and class-confidence critic combination (CCCC), are presented and evaluated. They are shown to improve on their member classifiers, with CCCC offering the best performance. The effect of having more or less diverse sets of member classifiers is also considered.

16.
Pattern Recognition Letters, 2003, 24(1-3): 455-471
Bagging forms a committee of classifiers by bootstrap aggregation of training sets drawn from a pool of training data. A simple alternative to bagging is to partition the data into disjoint subsets. Experiments with decision tree and neural network classifiers on various datasets show that, given partitions and bags of the same size, disjoint partitions result in performance equivalent to, or better than, bootstrap aggregates (bags). Many applications (e.g., protein structure prediction) involve datasets that are too large to fit in the memory of a typical computer, so bagging with samples the size of the full data is impractical. Our results indicate that, in such applications, the simple approach of creating a committee of n classifiers from disjoint partitions, each of size 1/n (and therefore memory resident during learning), in a distributed way yields a classifier with a bagging-like performance gain. Learning from distributed disjoint partitions is significantly less complex and faster than bagging.
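A compact comparison of the two committee constructions, bootstrap bags versus disjoint partitions of the same total size, might look like this scikit-learn sketch (toy data, majority voting, evaluated on the training pool only to keep it short):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, random_state=0)
n = 5
rng = np.random.default_rng(0)

def committee_vote(members, X):
    votes = np.stack([m.predict(X) for m in members])
    return (votes.mean(axis=0) > 0.5).astype(int)  # majority vote, binary labels

# Disjoint partitions: each member sees a distinct 1/n slice of the data.
idx = rng.permutation(len(X))
parts = np.array_split(idx, n)
disjoint = [DecisionTreeClassifier(random_state=0).fit(X[p], y[p]) for p in parts]

# Bags: each member sees a bootstrap sample of the same size, for comparison.
bags = [DecisionTreeClassifier(random_state=0).fit(X[b], y[b])
        for b in (rng.choice(len(X), size=len(X) // n) for _ in range(n))]

print((committee_vote(disjoint, X) == y).mean(),
      (committee_vote(bags, X) == y).mean())
```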

17.
Ensembles that combine the decisions of classifiers generated by using perturbed versions of the training set, in which the classes of the training examples are randomly switched, can produce a significant error reduction, provided that large numbers of units and high class switching rates are used. The classifiers generated by this procedure have statistically uncorrelated errors on the training set. Hence, the ensembles they form exhibit a similar dependence of the training error on ensemble size, independently of the classification problem. In particular, for binary classification problems, the classification performance of the ensemble on the training data can be analysed in terms of a Bernoulli process. Experiments on several UCI datasets demonstrate the improvements in classification accuracy that can be obtained with these class-switching ensembles.
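A class-switching ensemble can be sketched in a few lines: each member is trained on a copy of the data in which a fixed fraction of labels has been flipped at random. The 30% switching rate and committee size below are illustrative, not the paper's values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=1)
rng = np.random.default_rng(1)

members = []
for _ in range(25):
    y_sw = y.copy()
    flip = rng.random(len(y)) < 0.3   # switch 30% of the training labels
    y_sw[flip] = 1 - y_sw[flip]       # binary problem: flip the class
    members.append(DecisionTreeClassifier(random_state=0).fit(X, y_sw))

votes = np.stack([m.predict(X) for m in members])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the committee:", (pred == y).mean())
```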

18.
The combination of classifiers leads to a substantial reduction of misclassification error in a wide range of applications and benchmark problems. We suggest using an out-of-bag sample for combining different classifiers. In our setup, a linear discriminant analysis is performed on the observations in the out-of-bag sample, and the corresponding discriminant variables computed for the observations in the bootstrap sample are used as additional predictors for a classification tree. Because two classifiers are combined, method and variable selection bias is no problem for the corresponding estimate of misclassification error, and the need for an additional test sample disappears. Moreover, the procedure performs comparably to the best classifiers in a number of artificial examples and applications.

19.
Many real-world applications reveal difficulties in learning classifiers from imbalanced data. Although several methods for improving classifiers have been introduced, identifying the conditions for the efficient use of a particular method is still an open research problem. It is also worthwhile to study the nature of imbalanced data, the characteristics of the minority class distribution, and their influence on classification performance. However, current studies of the difficulty factors of imbalanced data have mainly been done with artificial datasets, and their conclusions do not transfer easily to real-world problems, in part because the methods for identifying these factors are not sufficiently developed. In this paper, we capture the difficulties of the class distribution in real datasets by considering four types of minority class examples: safe, borderline, rare, and outliers. First, we confirm their occurrence in real data by exploring multidimensional visualizations of selected datasets. Then, we introduce a method for identifying these types of examples based on analyzing the class distribution in the local neighbourhood of the considered example. Two ways of modeling this neighbourhood are presented: with the k nearest examples and with kernel functions. Experiments with artificial datasets show that these methods are able to re-discover simulated types of examples. Further contributions of this paper include a comprehensive experimental study with 26 real-world imbalanced datasets, in which (1) we identify new data characteristics based on the analysis of the types of minority examples, and (2) we demonstrate that considering the results of this analysis makes it possible to differentiate the classification performance of popular classifiers and pre-processing methods and to evaluate their areas of competence. Finally, we highlight directions for exploiting the results of our analysis in the development of new algorithms for learning classifiers and new pre-processing methods.
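The neighbourhood-based typing of minority examples can be sketched with a k-nearest-neighbour count; the k=5 setting and the thresholds mapping neighbour counts to safe/borderline/rare/outlier follow a common convention for this kind of analysis and are assumptions here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def type_minority(X, y, minority_label, k=5):
    """X: (n_samples, n_features) array; returns {index: type} for each
    minority example, typed by how many of its k neighbours are minority."""
    y = np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because the query
    types = {}                                       # point is its own NN
    for i in np.where(y == minority_label)[0]:
        _, idx = nn.kneighbors(X[i:i + 1])
        same = (y[idx[0][1:]] == minority_label).sum()  # skip the point itself
        types[i] = ("safe" if same >= 4 else
                    "borderline" if same >= 2 else
                    "rare" if same == 1 else "outlier")
    return types
```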

20.
Many data mining applications have large amounts of data, but labeling the data is usually difficult, expensive, or time consuming, as it requires human experts for annotation. Semi-supervised learning addresses this problem by using unlabeled data together with labeled data in the training process. Co-Training is a popular semi-supervised learning algorithm that assumes each example is represented by multiple sets of features (views) and that these views are sufficient for learning and independent given the class. However, these assumptions are strong and are not satisfied in many real-world domains. In this paper, a single-view variant of Co-Training, called Co-Training by Committee (CoBC), is proposed, in which an ensemble of diverse classifiers is used instead of redundant and independent views. We introduce a new labeling confidence measure for unlabeled examples based on estimating the local accuracy of the committee members on the example's neighborhood. We then introduce two new learning algorithms, QBC-then-CoBC and QBC-with-CoBC, which combine the merits of committee-based semi-supervised learning and active learning. The random subspace method is applied to both C4.5 decision trees and 1-nearest-neighbor classifiers to construct the diverse ensembles used for semi-supervised and active learning. Experiments show that these two combinations can outperform other, non-committee-based ones.
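A heavily simplified sketch of the committee-based self-labelling loop behind CoBC, with confidence taken from committee agreement rather than the paper's local-accuracy estimate (a random forest stands in for the random-subspace ensembles):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def cobc_sketch(X_lab, y_lab, X_unlab, rounds=3, thresh=0.9):
    for _ in range(rounds):
        if len(X_unlab) == 0:
            break
        ens = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_lab, y_lab)
        proba = ens.predict_proba(X_unlab)
        conf = proba.max(axis=1)
        take = conf >= thresh                 # committee is confident enough
        if not take.any():
            break
        # Move confidently pseudo-labelled points into the training set.
        new_labels = ens.classes_[proba[take].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[take]])
        y_lab = np.concatenate([y_lab, new_labels])
        X_unlab = X_unlab[~take]
    return RandomForestClassifier(n_estimators=20, random_state=0).fit(X_lab, y_lab)
```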
