Similar literature
Found 20 similar documents; search took 93 ms
1.
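A text feature evaluation method is proposed that is based on support vectors and can identify noisy features, together with a text classification system capable of self-feedback learning. Based on classification results, the system can identify noisy sentences in samples, adjust classifier parameters, maintain and optimize the text corpus, improve classifier performance, and let the classifier adapt to continually changing real-world conditions. Experimental results show that the method effectively improves the performance of automatic text classification systems.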

2.
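Constructing Bayesian classifiers in Matlab   Total citations: 2 (self-citations: 1, citations by others: 2)
Text classification is the foundation and core of text mining, and building the classifier is the key step in text classification; Bayesian networks can be used to construct classifiers with good classification performance. This paper uses Matlab to build two kinds of classifiers: the naive Bayes classifier (NBC), and the TANC, built with mutual information and conditional mutual information measures. The classifiers are validated on standard datasets downloaded from UCI; the experimental results show that their overall performance is somewhat higher than that reported in the literature, which indicates that the classifiers are effective and correct. The authors also optimize the classifiers and apply them to text classification.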

3.
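For the text classification problem, an efficient linear text classifier that requires no feature selection is designed, in which feature-distribution evaluation weights adjust the standard deviation of feature probabilities. The basic idea is to use the standard deviation of a feature's probability to quantify the feature's dispersion across document classes and take it as the feature's base weight. At the same time, based on the Beta distribution function of the posterior probability, a probability certainty density function is used to evaluate the feature's distribution information within each class and obtain a feature distribution weight, which adjusts the base weight to give the final feature weight, yielding a linear text classifier. Comparative experiments on three corpora (20Newsgroup, the Fudan Chinese classification corpus, and Reuters-21578) show that the new algorithm has a significant advantage in classification performance over traditional algorithms and is stable, efficient, and practical, making it suitable for large-scale text classification tasks.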

4.
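A feature representation method for large-scale text classification   Total citations: 4 (self-citations: 0, citations by others: 4)
With the rapid development of networks and information technology, text classification has become a key technology for processing and organizing large volumes of document data. The feature representation of text severely limits improvements in text classification performance. Building on the classic vector space model and the tf-idf weighting formula, an improved weighting formula aimed at text classification, the p-idf formula, is proposed. After comparing four typical text classifiers (Bayes, K-nearest neighbor, neural network, and support vector machine), an experimental text classification system is built with a support vector machine classifier. Experiments compare how the three weighting formulas tf-idf, p-idf, and LTC affect classifier performance in the system, and confirm that the proposed p-idf formula is reasonable and effective.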
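
The classic tf-idf weighting that the abstract above takes as its starting point can be sketched as follows. This is only a minimal illustration of the textbook formula (raw term frequency times log(N/df)); it is not the p-idf or LTC formula, which the abstract does not define. The function name `tf_idf` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = raw term frequency * log(N / document frequency),
    the textbook tf-idf variant (an assumption; other variants exist).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus, already tokenized.
docs = [["svm", "text", "text"], ["bayes", "text"], ["svm", "kernel"]]
weights = tf_idf(docs)
```

A term that occurs in every document gets weight zero under this variant, which is why such weights are commonly used to down-rank uninformative words in the vector space model.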

5.
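Design and implementation of a new fast Chinese text classifier   Total citations: 1 (self-citations: 0, citations by others: 1)
To improve the efficiency and accuracy of Chinese text classification, a new classifier is designed. The classifier selects features with a composite evaluation function based on term frequency, mutual information, and category information. For feature weighting, because the traditional TF-IDF method does not consider the distribution of features between and within classes, a weighting method that combines term frequency with the value of the composite evaluation function is proposed. Finally, a fast classifier based on Bayesian principles is designed. Experiments show that the classifier is simple and effective.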

6.
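This paper proposes a method for improving the generalization performance of Chinese text classifiers. In general, a text classifier can be obtained by training on a text collection with machine learning methods. The paper introduces common-sense knowledge of text semantic invariance and integrates it into the text classifier, giving a method for improving the classifier. Combined with a support vector machine, an improved text classifier is designed and implemented. Experiments on Chinese text classification show that using text semantic invariance effectively improves the classifier's generalization performance.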

7.
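The SVMDT classifier and its application to text classification   Total citations: 13 (self-citations: 0, citations by others: 13)
Classifiers based on SVM (Support Vector Machine) theory have developed into a general-purpose binary classifier, but they are not suited to multi-class settings. Based on an analysis of the classic SVM classification algorithm and the decision tree classification algorithm, a method that combines SVM with a binary decision tree is proposed to realize a multi-class classifier (SVMDT), which is applied to text classification. Experiments show good performance in both classification accuracy and speed.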

8.
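The workflow of Chinese text classification and related techniques are introduced. Based on an analysis of the shortcomings of traditional text feature selection, a text classification method combining rough sets with ensemble learning is proposed: rough sets perform text feature selection, and the ensemble learning algorithm AdaBoost.M1 is used to improve the classification performance of weak classifiers for classifying Chinese text. Experiments show that the F1 value of this algorithm's classification results is higher than that of both the C4.5 and kNN classifiers, giving better classification performance.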

9.
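Text detection and segmentation based on connected-component features   Total citations: 3 (self-citations: 0, citations by others: 3)
Text recognition in natural backgrounds has great application value, but its application has long been limited by text detection and segmentation techniques. To detect and segment text more effectively, an algorithm for detecting and segmenting text in natural scenes based on connected-component features is proposed. The algorithm first decomposes the original image into many connected components with the Niblack method; then a two-stage classification module, consisting of a cascade classifier and an SVM, verifies the text features of these connected components. Because text and non-text connected components differ in their features, most non-text components are discarded by the cascade classifier, and the SVM further verifies the remaining ones, so the final output is a binary image containing only text. Finally, the algorithm is evaluated on test data; the results show a detection precision above 90% and a response (recall) rate above 93%.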
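
The Niblack step mentioned in the abstract above is a local thresholding rule: each pixel is compared against T = m + k·s, where m and s are the mean and standard deviation of the gray levels in a window around it. The sketch below shows only that thresholding step on a tiny image, not the paper's connected-component extraction or classifiers. The window size, the value k = -0.2, the border handling, and the name `niblack_binarize` are assumptions chosen for illustration.

```python
import math

def niblack_binarize(image, window=3, k=-0.2):
    """Binarize a 2D list of gray levels with Niblack's rule T = m + k * s.

    Pixels brighter than their local threshold become 1, others 0.
    The window is clipped at the image border (an illustrative choice).
    """
    h, w = len(image), len(image[0])
    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Gather the neighbourhood around (x, y), clipped to the image.
            vals = [image[j][i]
                    for j in range(max(0, y - r), min(h, y + r + 1))
                    for i in range(max(0, x - r), min(w, x + r + 1))]
            m = sum(vals) / len(vals)
            s = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))
            out[y][x] = 1 if image[y][x] > m + k * s else 0
    return out

# Hypothetical 3x3 patch: one dark stroke pixel on a bright background.
patch = [[200, 200, 200],
         [200,  20, 200],
         [200, 200, 200]]
binary = niblack_binarize(patch)
```

In a full pipeline the zero-valued (dark) pixels would then be grouped into connected components before any classifier sees them.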

10.
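Feature selection is an important text preprocessing technique in text classification; it can effectively improve the accuracy and efficiency of classifiers. The key to feature selection in text classification is finding effective feature evaluation metrics. In general, the same metric works differently for different classifiers, so a good metric should take the characteristics of the classifier into account. Because the naive Bayes classifier is simple, efficient, and very sensitive to feature selection, research on feature selection methods for it is of considerable significance. Accordingly, an effective multi-class text feature evaluation metric for Bayesian classifiers, CDM, is proposed. Experiments were carried out with a Bayesian classifier on two multi-class text datasets. The results show that the proposed CDM metric gives better feature selection than other feature evaluation metrics.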

11.
苗学问  杨云  雷迅  张卫 《测控技术》2011,30(12):106-110
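Based on the health degradation curve, the meaning, basic functions, and capability requirements of prognostics and health management (PHM) technology for military aircraft are discussed. On this basis, with the goal of scientifically evaluating the diagnostic and prognostic capabilities of PHM systems, a system of performance metrics for PHM systems is proposed starting from capability requirements (including diagnostic performance metrics, prognostic performance metrics, and composite metrics), and ...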

12.
This work addresses the problem of detecting novel sentences in an incoming stream of text data by studying the performance of different novelty metrics and proposing a mixed metric that can adapt to different performance requirements. Existing novelty metrics can be divided into two types, symmetric and asymmetric, based on whether the ordering of sentences is taken into account. After a comparative study of several novelty metrics, we observe complementary behavior in the two types. This finding motivates a new framework of novelty measurement: a mixture of symmetric and asymmetric metrics. The new framework achieves superior performance under requirements ranging from high precision to high recall, as well as for data with different percentages of novel sentences. Because it does not require any prior information, the new metric is well suited to real-time knowledge base applications such as novelty mining systems, where no training data is available beforehand.
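
The idea of mixing a symmetric and an asymmetric novelty metric can be illustrated with a small sketch. The abstract above does not name the metrics it combines, so everything here is an assumption: cosine similarity stands in for the symmetric metric, the share of previously unseen words stands in for the asymmetric one, and the linear weight `alpha` is a hypothetical mixing parameter.

```python
import math
from collections import Counter

def cosine_novelty(sentence, history):
    """Symmetric stand-in: 1 minus the highest cosine similarity to any earlier sentence."""
    def cosine(a, b):
        ca, cb = Counter(a), Counter(b)
        dot = sum(ca[t] * cb[t] for t in ca)
        norm = (math.sqrt(sum(v * v for v in ca.values()))
                * math.sqrt(sum(v * v for v in cb.values())))
        return dot / norm if norm else 0.0
    return 1.0 - max((cosine(sentence, h) for h in history), default=0.0)

def new_word_ratio(sentence, history):
    """Asymmetric stand-in: share of the sentence's distinct words not seen earlier."""
    seen = {t for h in history for t in h}
    words = set(sentence)
    return len(words - seen) / len(words) if words else 0.0

def mixed_novelty(sentence, history, alpha=0.5):
    """Linear mixture of the two scores; alpha is a hypothetical weight."""
    return (alpha * cosine_novelty(sentence, history)
            + (1 - alpha) * new_word_ratio(sentence, history))

# Hypothetical stream: one earlier sentence, then a new one to score.
history = [["stocks", "fell", "today"]]
score = mixed_novelty(["rain", "expected"], history)
```

Shifting `alpha` toward one metric or the other is one simple way to trade precision against recall, though the paper's actual mixing scheme may differ.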

13.
Text categorization plays an important role in applications where information is filtered, monitored, personalized, categorized, organized or searched. Feature selection remains an effective and efficient technique in text categorization. Feature selection metrics are commonly based on the term frequency or document frequency of a word. We focus on the relative importance of these frequencies for feature selection metrics. The document frequency based metrics of discriminative power measure and GINI index were examined with term frequency for this purpose. The metrics were compared and analyzed on the Reuters-21578 dataset. Experimental results revealed that the term frequency based metrics may be useful, especially for smaller feature sets. Two characteristics of term frequency based metrics were observed by analyzing the scatter of features among classes and the rate at which information in the data was covered. These characteristics may contribute toward their superior performance for smaller feature sets.

14.
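Multi-aspect performance recognition in automobile review text based on multi-label learning   Total citations: 1 (self-citations: 0, citations by others: 1)
To address the multiple performance aspects that appear in automobile product review text, a method for recognizing them based on multi-label learning is proposed. First, combining text mining methods, a multi-label text feature selection method selects features and converts the unstructured text into a structured multi-label dataset. On this basis, four multi-label classification methods are used to annotate each review document to be recognized with one or more aspect labels. Finally, eight multi-label evaluation metrics are used to assess aspect recognition performance. Experiments on the Sina automobile review corpus show that the subset accuracy of aspect recognition reaches 95%, confirming the feasibility of the method.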
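
Subset accuracy, the metric quoted in the abstract above, counts a multi-label prediction as correct only when the predicted label set matches the true label set exactly. A minimal sketch follows; the function name and the aspect labels ("fuel", "comfort", "space") are hypothetical and are not taken from the paper.

```python
def subset_accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label set equals the true set exactly."""
    exact = sum(1 for t, p in zip(true_labels, predicted_labels) if set(t) == set(p))
    return exact / len(true_labels)

# Hypothetical aspect labels for three car reviews.
truth = [{"fuel", "comfort"}, {"space"}, {"fuel"}]
preds = [{"comfort", "fuel"}, {"space"}, {"fuel", "space"}]
accuracy = subset_accuracy(truth, preds)  # two of three sets match exactly
```

Because a single extra or missing label makes the whole sample count as wrong, subset accuracy is the strictest of the common multi-label metrics.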

15.
With the ever-increasing growth of the World Wide Web, there is an urgent need for an efficient information retrieval system that can search and retrieve handwritten documents when presented with user queries. However, unconstrained handwriting recognition remains a challenging task with inadequate performance, and is thus a major hurdle to providing a robust search experience in handwritten documents. In this paper, we describe our recent research with a focus on information retrieval from noisy text derived from imperfect handwriting recognizers. First, we describe a novel term frequency estimation technique that incorporates word segmentation information into the retrieval framework to improve overall system performance. Second, we outline a taxonomy of different techniques used for addressing the noisy text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR’ed text and uses the cleaned text for retrieval. The second method uses the uncorrected or raw OCR’ed text but modifies the standard vector space model to handle noisy text issues. The third method employs robust image features to index the documents instead of using noisy OCR’ed text. We describe these techniques in detail and also discuss their performance using standard IR evaluation metrics.

16.
Gaze interaction is a promising input modality for people who are unable to control their fingers and arms. This paper suggests a number of new metrics that can be applied to the analysis of gaze typing interfaces and to the evaluation of user performance. These metrics are derived from a close examination of eight subjects typing text by gazing at a dwell-time activated onscreen keyboard during a seven-day experiment. One of the metrics, termed “Attended keys per character”, measures the number of keys that are attended for each typed character. This metric turned out to be particularly well correlated with the actual number of errors committed (r = 0.915). In addition to introducing metrics specific to gaze typing, the paper discusses how the metrics could make remote progress monitoring possible and provides some general advice on how to introduce gaze typing to novice users.

17.
A number of metrics have been proposed in the literature to measure text re-use between pairs of sentences or short passages. These individual metrics fail to reliably detect paraphrasing or semantic equivalence between sentences, due to the subjectivity and complexity of the task, even for human beings. This paper analyzes a set of five simple but weak lexical metrics for measuring textual similarity and presents a novel paraphrase detector with improved accuracy based on abductive machine learning. The objective here is twofold. First, the performance of each individual metric is boosted through the abductive learning paradigm. Second, we investigate the use of decision-level and feature-level information fusion via abductive networks to obtain a more reliable composite metric for additional performance enhancement. Several experiments were conducted using two benchmark corpora, and the optimal abductive models were compared with other approaches. Results demonstrate that applying abductive learning significantly improved the results of individual metrics, and further improvement was achieved through fusion. Moreover, building simple models of polynomial functional elements that identify and integrate the smallest subset of relevant metrics yielded better results than those obtained from support vector machine classifiers using the same datasets and metrics. The results were also comparable to the best results reported in the literature, even those obtained with a larger number of more powerful features and/or more computationally intensive techniques.

18.
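When traditional text similarity measures are transplanted directly to short texts, the brevity of short texts causes data sparsity and biases the computed results. This paper represents short texts with complex networks and proposes a new short-text similarity measure. The method first preprocesses the short texts, then builds a complex network model for them and computes complex-network feature values for their words, then uses external tools to compute the semantic similarity between the words, and then combines these with a definition of short-text semantic similarity to compute the similarity between short texts. Finally, clustering experiments on benchmark datasets verify that, by the F-measure criterion, the proposed method outperforms the traditional TF-IDF method and another method based on term semantic similarity.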

19.
In this paper, we present an effective approach for grouping text lines in online handwritten Japanese documents by combining temporal and spatial information. With decision functions optimized by supervised learning, the approach has few artificial parameters and utilizes little prior knowledge. First, the strokes in the document are grouped into text line strings according to off-stroke distances. Each text line string, which may contain multiple lines, is segmented by optimizing a cost function trained by the minimum classification error (MCE) method. At the temporal merge stage, over-segmented text lines (caused by stroke classification errors) are merged with a support vector machine (SVM) classifier for making merge/non-merge decisions. Last, a spatial merge module corrects the segmentation errors caused by delayed strokes. Misclassified text/non-text strokes (stroke type classification precedes text line grouping) can be corrected at the temporal merge stage. To evaluate the performance of text line grouping, we provide a set of performance metrics that evaluate it from multiple aspects. In experiments on a large number of free-form documents in the Tokyo University of Agriculture and Technology (TUAT) Kondate database, the proposed approach achieves an entity detection metric (EDM) rate of 0.8992 and an edit-distance rate (EDR) of 0.1114. For grouping of pure text strokes, the performance reaches an EDM of 0.9591 and an EDR of 0.0669.
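
The edit-distance rate reported in the abstract above is built on an edit distance between a predicted grouping and the ground truth. The abstract does not give the exact definition or normalization of EDR, so the sketch below shows only the standard Levenshtein distance between two sequences (for example, sequences of per-stroke line labels). The function name and the example labels are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]

# Hypothetical per-stroke line labels: ground truth vs. predicted grouping.
truth = [1, 1, 1, 2, 2, 3]
predicted = [1, 1, 2, 2, 2, 3]
distance = edit_distance(truth, predicted)
```

Dividing such a distance by the length of the reference sequence would give a rate between 0 and 1 for equal-length sequences, but whether the paper normalizes this way is not stated in the abstract.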

20.
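Generative adversarial networks are an important method for image synthesis and currently the most widely used means of text-to-image generation. As research on cross-modal generation deepens, the realism and semantic relevance of images generated from text have improved greatly, with good results both for natural images such as flowers, birds, and faces and for scene graphs and layouts. At the same time, text-to-image generation faces several challenges, such as the difficulty of generating multiple objects in complex scenes, and the fact that existing evaluation metrics cannot accurately assess newly proposed text-to-image algorithms, so new evaluation metrics are needed. This paper reviews the development of text-to-image methods since they were first proposed and lists recent text-to-image algorithms, commonly used datasets, and evaluation metrics. Finally, it discusses current problems in terms of datasets, metrics, algorithms, and applications, and outlines future research directions.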
