Similar articles
17 similar articles found
1.
张丽  黄东 《微机发展》2006,16(4):170-172
E-mail is one of the indispensable means of communication in daily life, but spam causes considerable harm. Focusing on Chinese spam, this paper presents the design of a prototype content-based anti-spam engine built on the Winnow algorithm, which discriminates unknown messages well. The message content is first decoded and segmented into words, and feature terms are selected by information gain; a classifier is then constructed with the Winnow algorithm; finally, part of the mail samples are used for testing, and the test results can be fed back for further learning. Analysis of the final test data shows that the system achieves fairly good results.
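As a rough illustration of the classifier-construction step in this abstract, the sketch below implements the basic Winnow update (multiplicative promotion and demotion over binary term features). The vocabulary, threshold and promotion factor are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a Winnow spam classifier over binary term features.
# The feature set, threshold and promotion factor are illustrative assumptions.

class WinnowClassifier:
    def __init__(self, vocabulary, alpha=2.0):
        self.vocab = {term: i for i, term in enumerate(vocabulary)}
        self.weights = [1.0] * len(vocabulary)   # Winnow starts all weights at 1
        self.theta = len(vocabulary) / 2.0       # common choice of threshold
        self.alpha = alpha                       # promotion/demotion factor

    def _active(self, terms):
        return [self.vocab[t] for t in set(terms) if t in self.vocab]

    def predict(self, terms):
        score = sum(self.weights[i] for i in self._active(terms))
        return 1 if score >= self.theta else 0   # 1 = spam, 0 = ham

    def update(self, terms, label):
        if self.predict(terms) == label:
            return
        factor = self.alpha if label == 1 else 1.0 / self.alpha
        for i in self._active(terms):            # multiplicative update on active features
            self.weights[i] *= factor


# Toy usage: terms would come from decoding and word-segmenting the message body.
clf = WinnowClassifier(vocabulary=["发票", "免费", "会议", "报告"])
clf.update(["发票", "免费"], 1)   # spam example
clf.update(["会议", "报告"], 0)   # ham example
print(clf.predict(["免费", "发票"]))
```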

2.
Design and Implementation of an Anti-Spam Engine Based on the Winnow Algorithm   (cited by: 1; self-citations: 1; by others: 0)
E-mail is one of the indispensable means of communication in daily life, but spam causes considerable harm. Focusing on Chinese spam, this paper presents the design of a prototype content-based anti-spam engine built on the Winnow algorithm, which discriminates unknown messages well. The message content is first decoded and segmented into words, and feature terms are selected by information gain; a classifier is then constructed with the Winnow algorithm; finally, part of the mail samples are used for testing, and the test results can be fed back for further learning. Analysis of the final test data shows that the system achieves fairly good results.
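The information-gain feature selection step mentioned above can be sketched as follows. The standard two-class IG over term presence/absence is used, and the document representation (a list of (term set, label) pairs) is assumed for illustration rather than taken from the paper.

```python
# Sketch of information-gain feature selection for a two-class (spam/ham) corpus.
# Documents are assumed to be (set_of_terms, label) pairs; not the paper's exact pipeline.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(docs, term):
    labels = [y for _, y in docs]
    base = entropy([labels.count(0), labels.count(1)])
    with_t = [y for terms, y in docs if term in terms]
    without_t = [y for terms, y in docs if term not in terms]
    cond = 0.0
    for subset in (with_t, without_t):
        if subset:
            cond += len(subset) / len(docs) * entropy([subset.count(0), subset.count(1)])
    return base - cond

def select_features(docs, k):
    vocab = set().union(*(terms for terms, _ in docs))
    return sorted(vocab, key=lambda t: information_gain(docs, t), reverse=True)[:k]

docs = [({"免费", "发票"}, 1), ({"会议", "纪要"}, 0), ({"免费", "中奖"}, 1), ({"项目", "报告"}, 0)]
print(select_features(docs, 3))
```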

3.
Through a study of online text classification for filtering spam text streams, a new conditional-probability ensemble method is proposed. Texts are represented as lexical sequences, classification knowledge is stored in an index structure, and online training and online classification algorithms for the model are designed and implemented. Several kinds of text features are extracted from e-mail and SMS messages, and spam filtering experiments are conducted on the TREC07P e-mail corpus and a real Chinese SMS corpus. The results show that the proposed method achieves very good spam filtering performance.
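The abstract does not spell out the conditional-probability ensemble, so the sketch below is a generic stand-in: an incrementally trainable conditional-probability (multinomial naive-Bayes-style) model whose counts live in a dictionary "index", supporting the kind of online training and online classification described. Names and smoothing choices are assumptions.

```python
# Sketch of an online conditional-probability text classifier with a dictionary "index".
# This is a generic stand-in, not the paper's exact ensemble method.
import math
from collections import defaultdict

class OnlineCondProbFilter:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                              # Laplace smoothing (assumed)
        self.term_counts = {0: defaultdict(int), 1: defaultdict(int)}
        self.total_terms = {0: 0, 1: 0}
        self.doc_counts = {0: 0, 1: 0}

    def train_online(self, tokens, label):
        self.doc_counts[label] += 1
        for t in tokens:
            self.term_counts[label][t] += 1
            self.total_terms[label] += 1

    def classify(self, tokens):
        vocab = len(set(self.term_counts[0]) | set(self.term_counts[1])) or 1
        total_docs = sum(self.doc_counts.values()) or 1
        scores = {}
        for c in (0, 1):
            score = math.log((self.doc_counts[c] + self.alpha) / (total_docs + 2 * self.alpha))
            for t in tokens:
                p = (self.term_counts[c].get(t, 0) + self.alpha) / (self.total_terms[c] + self.alpha * vocab)
                score += math.log(p)
            scores[c] = score
        return max(scores, key=scores.get)

# Online loop: classify each incoming message, then fold in its true label.
f = OnlineCondProbFilter()
stream = [(["free", "prize", "win"], 1), (["meeting", "agenda"], 0), (["win", "free", "cash"], 1)]
for tokens, label in stream:
    pred = f.classify(tokens)
    f.train_online(tokens, label)
```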

4.
Research and Implementation of a Text Classification Method Based on the Vector Space Model   (cited by: 14; self-citations: 0; by others: 14)
Text classification can effectively resolve the clutter of information and helps locate the information one needs. Traditional text classification methods usually perform feature extraction from a single or one-sided test metric, which causes individual features to be "overfitted". This paper considers several test metrics together, including frequency, dispersion and concentration, and proposes a new feature extraction algorithm so that the selected features are jointly optimal across these metrics. Applying the method to an improved vector space model, experiments show that it achieves high precision and recall.
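The abstract does not give the formulas for the three metrics, so the sketch below uses one plausible reading: frequency as the raw term count, dispersion as the entropy of the term over classes, and concentration as the share of occurrences in the dominant class. The rule combining them into a single score is likewise an assumption.

```python
# Hedged sketch of scoring terms by frequency, dispersion and concentration together.
# The metric definitions and the combination rule are assumptions, not the paper's formulas.
import math
from collections import Counter, defaultdict

def score_terms(docs):
    """docs: list of (token_list, class_label). Returns terms sorted by the combined score."""
    per_term = defaultdict(Counter)                # term -> class -> occurrence count
    for tokens, label in docs:
        for t in tokens:
            per_term[t][label] += 1
    scores = {}
    for term, by_class in per_term.items():
        freq = sum(by_class.values())                          # frequency: total occurrences
        probs = [c / freq for c in by_class.values()]
        dispersion = -sum(p * math.log2(p) for p in probs)     # dispersion: entropy over classes
        concentration = max(by_class.values()) / freq          # concentration: dominant-class share
        # Assumed combination: favour frequent, concentrated, low-dispersion terms.
        scores[term] = freq * concentration / (1.0 + dispersion)
    return sorted(scores, key=scores.get, reverse=True)

docs = [(["price", "discount", "offer"], "spam"), (["project", "schedule"], "ham"),
        (["discount", "offer", "free"], "spam"), (["schedule", "meeting"], "ham")]
print(score_terms(docs)[:5])
```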

5.
Automatic Entity Relation Extraction   (cited by: 36; self-citations: 7; by others: 36)
Entity relation extraction is an important research topic in information extraction. This paper applies two feature-vector-based machine learning algorithms, Winnow and support vector machines (SVM), to entity relation extraction experiments on the training data of the 2004 ACE (Automatic Content Extraction) evaluation. Appropriate feature selection is performed for both algorithms; the best extraction results are obtained when the two words to the left and right of each entity are chosen as features, with weighted average F-scores of 73.08% for Winnow and 73.27% for SVM. With the same feature set, different learning algorithms thus differ little in final performance on entity relation recognition, so work on automatic entity relation extraction should concentrate on finding good features.
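The feature choice reported to work best, the two words to the left and right of each entity, can be sketched as follows; the tokenized input format and feature naming are illustrative assumptions.

```python
# Sketch of extracting the left/right two-word context of each entity as features,
# the feature set reported to work best above (input format and naming are assumed).

def context_features(tokens, entity_spans, window=2):
    """tokens: list of words; entity_spans: list of (start, end) token indices, end exclusive."""
    features = []
    for idx, (start, end) in enumerate(entity_spans):
        left = tokens[max(0, start - window):start]
        right = tokens[end:end + window]
        for offset, word in enumerate(reversed(left), 1):
            features.append(f"e{idx}_L{offset}={word}")
        for offset, word in enumerate(right, 1):
            features.append(f"e{idx}_R{offset}={word}")
    return features

tokens = ["昨天", "张三", "访问", "了", "北京", "大学"]
spans = [(1, 2), (4, 6)]          # "张三" and "北京 大学"
print(context_features(tokens, spans))
```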

6.
A content-based spam SMS filtering system running on the mobile handset is designed. The system adopts the Balanced Winnow algorithm, which classifies quickly, performs well and supports online updating, making it suitable for handsets with limited resources whose classifiers must be updated in real time or periodically. A series of experiments demonstrates the effectiveness of the method and gives a comprehensive evaluation, providing a thorough analytical basis for future applications of the algorithm in information filtering.
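A minimal sketch of the Balanced Winnow update used here: two weight vectors, positive and negative, both adjusted multiplicatively on mistakes. The promotion/demotion factors and threshold are illustrative, not the system's settings.

```python
# Sketch of Balanced Winnow: score = (u - v) . x with multiplicative updates on both vectors.
# Parameter values are illustrative, not those used in the SMS filtering system.

class BalancedWinnow:
    def __init__(self, n_features, alpha=1.5, beta=0.5, theta=1.0):
        self.u = [2.0] * n_features      # positive weights
        self.v = [1.0] * n_features      # negative weights
        self.alpha, self.beta, self.theta = alpha, beta, theta

    def score(self, active):
        return sum(self.u[i] - self.v[i] for i in active)

    def predict(self, active):
        return 1 if self.score(active) >= self.theta else 0

    def update(self, active, label):
        if self.predict(active) == label:
            return
        if label == 1:                   # missed spam: promote u, demote v
            for i in active:
                self.u[i] *= self.alpha
                self.v[i] *= self.beta
        else:                            # false alarm: demote u, promote v
            for i in active:
                self.u[i] *= self.beta
                self.v[i] *= self.alpha

# 'active' is the list of indices of features present in a message.
bw = BalancedWinnow(n_features=4)
bw.update([0, 1], 1)
bw.update([2, 3], 0)
print(bw.predict([0]))
```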

7.
To address the feature redundancy of the traditional naive Bayes classification model in intrusion forensics, and its failure to consider the differences among the data attributes involved in intrusion behaviour, an improved attribute-weighted naive Bayes classification method is proposed. An improved information gain algorithm based on feature redundancy is used to optimize the feature set; on the basis of this result, the feature-redundancy discriminant function is extracted and introduced into the Bayesian classification algorithm as weights, so that different conditional attributes are given different weights. Experiments show that the algorithm selects feature vectors effectively, reduces classification interference and improves detection accuracy.
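The weighting idea can be sketched as follows: each conditional attribute contributes its log-likelihood scaled by its own weight. The derivation of the weights from the feature-redundancy discriminant function is not reproduced here; the weights and toy records in the example are placeholders.

```python
# Sketch of attribute-weighted naive Bayes for discrete attributes: each attribute's
# log-likelihood is scaled by a per-attribute weight w[j]. The weights here are
# placeholders, not the paper's redundancy-based discriminant values.
import math
from collections import defaultdict

class WeightedNaiveBayes:
    def __init__(self, weights, alpha=1.0):
        self.w = weights                  # one weight per attribute
        self.alpha = alpha
        self.class_counts = defaultdict(int)
        self.attr_counts = defaultdict(lambda: defaultdict(int))  # (class, j) -> value -> count

    def fit(self, X, y):
        for xi, yi in zip(X, y):
            self.class_counts[yi] += 1
            for j, v in enumerate(xi):
                self.attr_counts[(yi, j)][v] += 1

    def predict(self, xi):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for c, cc in self.class_counts.items():
            score = math.log(cc / total)
            for j, v in enumerate(xi):
                counts = self.attr_counts[(c, j)]
                p = (counts.get(v, 0) + self.alpha) / (cc + self.alpha * (len(counts) + 1))
                score += self.w[j] * math.log(p)      # weighted contribution of attribute j
            if score > best_score:
                best, best_score = c, score
        return best

# Toy connection records: [protocol, service, flag]; labels: "normal" / "attack".
X = [["tcp", "http", "SF"], ["udp", "dns", "SF"], ["tcp", "ftp", "REJ"], ["tcp", "http", "REJ"]]
y = ["normal", "normal", "attack", "attack"]
clf = WeightedNaiveBayes(weights=[0.5, 1.0, 2.0])
clf.fit(X, y)
print(clf.predict(["tcp", "smtp", "REJ"]))
```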

8.
叶军  金忠 《计算机科学》2017,44(7):309-314
Because concept factorization does not consider the higher-order geometric structure of the data space and the feature space at the same time, a concept factorization algorithm based on dual hypergraph regularization is proposed. By constructing undirected, weighted hypergraph Laplacian regularization terms in the data space and in the feature space, the algorithm captures the multivariate geometric structure of the data manifold and of the feature manifold, overcoming the limitation of traditional graph models, which can only express pairwise relations. The objective function is solved by alternating iterations, and the convergence of the algorithm is proved. Experiments on three real databases (TDT2, PIE, COIL20) show that the method outperforms other methods in the clustering representation of the data.
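The distinctive ingredient is the hypergraph Laplacian regularizer built in both spaces. The sketch below constructs an unnormalized kNN hypergraph Laplacian (one hyperedge per point, containing the point and its k nearest neighbours); built from the samples it gives the data-space term, built from the features (via the transpose) it gives the feature-space term. Uniform hyperedge weights and the unnormalized form are assumptions.

```python
# Sketch of a kNN hypergraph Laplacian, the kind of regularizer used in both spaces above.
# L = Dv - H W De^{-1} H^T, with one hyperedge per sample (itself + its k nearest neighbours).
# Uniform hyperedge weights are an assumption.
import numpy as np

def knn_hypergraph_laplacian(X, k=3):
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    H = np.zeros((n, n))                                   # incidence: vertex x hyperedge
    for e in range(n):
        neighbours = np.argsort(d2[e])[:k + 1]             # the point and its k neighbours
        H[neighbours, e] = 1.0
    w = np.ones(n)                                         # uniform hyperedge weights
    De = H.sum(axis=0)                                     # hyperedge degrees
    Dv = np.diag(H @ w)                                    # vertex degrees
    return Dv - H @ np.diag(w / De) @ H.T

# Data-space Laplacian on samples (rows) and feature-space Laplacian on features (columns):
X = np.random.rand(20, 8)
L_data = knn_hypergraph_laplacian(X, k=3)
L_feat = knn_hypergraph_laplacian(X.T, k=3)
```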

9.
Research on SMS Spam Filtering Based on CAPTCHA and the Winnow Algorithm   (cited by: 1; self-citations: 1; by others: 0)
To identify and filter the ever-growing volume of spam SMS, a filtering method based on CAPTCHA (a fully automatic system for telling humans and computers apart) and the Winnow algorithm is proposed. In the CAPTCHA step, humans and computers are distinguished by whether the sender can correctly recognize an image, which effectively filters bulk spam sent by computers. The improved Winnow filter processes raw text directly, saving Chinese word segmentation time, and uses an ensemble classification idea to improve accuracy. Experimental results show that combining CAPTCHA with the improved Winnow algorithm filters spam SMS fairly accurately.
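One common way to process raw Chinese text without word segmentation is to use character n-gram features; whether the paper's improved Winnow uses exactly this representation is not stated here, so the sketch below is an assumption.

```python
# Sketch: character bigram features let a filter work on raw Chinese SMS text
# without word segmentation. Using bigrams specifically is an assumption.

def char_ngrams(text, n=2):
    text = "".join(text.split())                 # drop whitespace, keep raw characters
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sms = "恭喜您获得免费大奖 请回复领取"
print(char_ngrams(sms))   # ['恭喜', '喜您', '您获', ...] — fed to the Winnow filter as features
```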

10.
Conventional MRF segmentation models cannot describe local image features effectively and often lead to over-segmentation. A locally adaptive prior MRF model is therefore proposed, which uses the information of adjacent image regions to build a locally adaptive feature MRF model. On top of this model, a region-level belief propagation (BP) algorithm with a fast convergence strategy is developed to pass the region messages of the MRF model, effectively reducing the heavy computation of region BP. Experimental results show that, compared with the conventional region BP algorithm, the proposed segmentation method is faster and more accurate.

11.
A method that pairs confusion words with preposition collocations is introduced on top of the Winnow algorithm. A training set is first obtained through the confusion sets; after preprocessing, a text feature extraction method yields a set of feature words; Winnow training then produces a weighted feature word set, and the prepositions that appear after confusion words are extracted to form preposition vectors; finally, features extracted from the test set are tested with the combination of the Winnow algorithm and the confusion-word/preposition-collocation method to obtain real-word error detection results. Adding the preposition-collocation method raises the precision, recall and F1 measure for some confusion words by 10%-20%, and in some cases to 100%.

12.
The performance of two online linear classifiers—the Perceptron and Littlestone’s Winnow—is explored for two anti-spam filtering benchmark corpora—PU1 and Ling-Spam. We study the performance for varying numbers of features, along with three different feature selection methods: information gain (IG), document frequency (DF) and odds ratio. The size of the training set and the number of training iterations are also investigated for both classifiers. The experimental results show that both the Perceptron and Winnow perform much better when using IG or DF than when using odds ratio. It is further demonstrated that when using IG or DF, the classifiers are insensitive to the number of features and the number of training iterations, and not greatly sensitive to the size of the training set. Winnow is shown to slightly outperform the Perceptron. It is also demonstrated that both of these online classifiers perform much better than a standard Naïve Bayes method. The theoretical and implementation computational complexities of these two classifiers are very low, and they are very easily updated adaptively. They outperform most of the published results, while being significantly easier to train and adapt. The analysis and promising experimental results indicate that the Perceptron and Winnow are two very competitive classifiers for anti-spam filtering.
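For contrast with the multiplicative Winnow sketches earlier in this list, below is the additive online Perceptron update over the same kind of sparse binary term features; the learning rate and representation are illustrative.

```python
# Sketch of the online Perceptron on sparse binary term features, the additive
# counterpart to Winnow's multiplicative update. Parameters are illustrative.

class SparsePerceptron:
    def __init__(self, n_features, lr=1.0):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, active):
        return 1 if sum(self.w[i] for i in active) + self.b > 0 else 0

    def update(self, active, label):            # label in {0, 1}
        if self.predict(active) == label:
            return
        step = self.lr if label == 1 else -self.lr
        for i in active:
            self.w[i] += step                    # additive correction on active features
        self.b += step

p = SparsePerceptron(n_features=4)
for active, label in [([0, 1], 1), ([2, 3], 0), ([0, 2], 1)]:
    p.update(active, label)
print(p.predict([0, 1]))
```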

13.
A Winnow-Based Approach to Context-Sensitive Spelling Correction   (cited by: 4; self-citations: 0; by others: 0)
Golding  Andrew R.  Roth  Dan 《Machine Learning》1999,34(1-3):107-130
A large class of machine-learning problems in natural language require the characterization of linguistic context. Two characteristic properties of such problems are that their feature space is of very high dimensionality, and their target concepts depend on only a small subset of the features in the space. Under such conditions, multiplicative weight-update algorithms such as Winnow have been shown to have exceptionally good theoretical properties. In the work reported here, we present an algorithm combining variants of Winnow and weighted-majority voting, and apply it to a problem in the aforementioned class: context-sensitive spelling correction. This is the task of fixing spelling errors that happen to result in valid words, such as substituting to for too, casual for causal, and so on. We evaluate our algorithm, WinSpell, by comparing it against BaySpell, a statistics-based method representing the state of the art for this task. We find: (1) When run with a full (unpruned) set of features, WinSpell achieves accuracies significantly higher than BaySpell was able to achieve in either the pruned or unpruned condition; (2) When compared with other systems in the literature, WinSpell exhibits the highest performance; (3) While several aspects of WinSpell's architecture contribute to its superiority over BaySpell, the primary factor is that it is able to learn a better linear separator than BaySpell learns; (4) When run on a test set drawn from a different corpus than the training set was drawn from, WinSpell is better able than BaySpell to adapt, using a strategy we will present that combines supervised learning on the training set with unsupervised learning on the (noisy) test set.
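The combination WinSpell uses, Winnow variants pooled by weighted-majority voting, can be sketched as follows with generic member classifiers; the members, their number and the penalty factor beta are assumptions.

```python
# Sketch of weighted-majority voting over several online classifiers, the combination
# scheme described for WinSpell. The member classifiers and beta are placeholders.

class WeightedMajority:
    def __init__(self, members, beta=0.5):
        self.members = members            # objects with .predict(x) -> 0/1 and .update(x, y)
        self.voter_w = [1.0] * len(members)
        self.beta = beta                  # weight penalty for a wrong vote

    def predict(self, x):
        vote1 = sum(w for w, m in zip(self.voter_w, self.members) if m.predict(x) == 1)
        vote0 = sum(self.voter_w) - vote1
        return 1 if vote1 >= vote0 else 0

    def update(self, x, y):
        for k, m in enumerate(self.members):
            if m.predict(x) != y:
                self.voter_w[k] *= self.beta   # demote members that voted wrongly
            m.update(x, y)                      # let each member keep learning online


class ThresholdStub:
    """Trivial stand-in member: predicts 1 when more than t features are active."""
    def __init__(self, t): self.t = t
    def predict(self, x): return 1 if len(x) > self.t else 0
    def update(self, x, y): pass

wm = WeightedMajority([ThresholdStub(1), ThresholdStub(2), ThresholdStub(3)])
wm.update([0, 1, 2], 1)
print(wm.predict([0, 1, 2]), wm.voter_w)
```

In WinSpell each member would be one Winnow variant (for example, trained with a different promotion factor), and the ensemble's own weights adapt as members make mistakes.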

14.
A Markov chain Monte Carlo method has previously been introduced to estimate weighted sums in multiplicative weight update algorithms when the number of inputs is exponential. However, the original algorithm still required extensive simulation of the Markov chain in order to get accurate estimates of the weighted sums. We propose an optimized version of the original algorithm that produces exactly the same classifications while often using fewer Markov chain simulations. We also apply three other sampling techniques and empirically compare them with the original Metropolis sampler to determine how effective each is in drawing good samples in the least amount of time, in terms of accuracy of weighted sum estimates and in terms of Winnow’s prediction accuracy. We found that two other samplers (Gibbs and Metropolized Gibbs) were slightly better than Metropolis in their estimates of the weighted sums. For prediction errors, there is little difference between any pair of MCMC techniques we tested. Also, on the data sets we tested, we discovered that all approximations of Winnow have no disadvantage when compared to brute force Winnow (where weighted sums are exactly computed), so generalization accuracy is not compromised by our approximation. This is true even when very small sample sizes and mixing times are used. An early version of this paper appeared as Tao and Scott (2003).

15.
The number of adjustments required to learn the average LTU function of d features, each of which can take on n equally spaced values, grows as approximately n^2 d when the standard perceptron training algorithm is used on the complete input space of n points and perfect classification is required. We demonstrate a simple modification that reduces the observed growth rate in the number of adjustments to approximately d^2 (log(d) + log(n)) with most, but not all, input presentation orders. A similar speed-up is also produced by applying the simple but computationally expensive heuristic "don't overgeneralize" to the standard training algorithm. This performance is very close to the theoretical optimum for learning LTU functions by any method, and is evidence that perceptron-like learning algorithms can learn arbitrary LTU functions in polynomial, rather than exponential, time under normal training conditions. Similar modifications can be applied to the Winnow algorithm, achieving similar performance improvements and demonstrating the generality of the approach.

16.
The basic Winnow algorithm, the Balanced Winnow algorithm and a Winnow algorithm with feedback learning were implemented and successfully applied to large-scale spam filtering. The three algorithms were compared experimentally on the SEWM2007 and SEWM2008 datasets. The results show that the Winnow algorithm and its variants outperform the Logistic algorithm in both classification effectiveness and efficiency.

17.
A real-word error in English text is one where the typed word is another valid word similar to the intended one. This paper studies the detection of such errors. Syntactic and semantic features are extracted from the context of the word to be checked, and document frequency and information gain are used to filter the features, achieving effective extraction of context features. Judging whether the word is used correctly is then treated as a classification problem, and the Winnow classification algorithm is used for training and testing. Under 5-fold cross-validation, the average precision and recall over the 61 collected confusion sets are 96% and 79.47%, respectively.
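The evaluation protocol here, 5-fold cross-validation reporting precision and recall, can be sketched as below with a generic classifier factory; the data format and the classifier interface (update/predict) are assumptions.

```python
# Sketch of 5-fold cross-validation computing precision and recall, the evaluation
# protocol used in the entry above. Classifier interface and data format are assumed.
import random

def cross_validate(data, make_classifier, k=5, seed=0):
    """data: list of (features, label) with label in {0, 1}."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    precisions, recalls = [], []
    for i in range(k):
        test = folds[i]
        train = [d for j, f in enumerate(folds) if j != i for d in f]
        clf = make_classifier()
        for x, y in train:
            clf.update(x, y)
        tp = fp = fn = 0
        for x, y in test:
            pred = clf.predict(x)
            tp += int(pred == 1 and y == 1)
            fp += int(pred == 1 and y == 0)
            fn += int(pred == 0 and y == 1)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / k, sum(recalls) / k
```

Any online classifier exposing update and predict, such as the Winnow sketches earlier in this list, can be plugged in through make_classifier.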

