Similar literature
Found 20 similar documents (search time: 15 ms)
1.
Text categorization presents unique challenges to traditional classification methods because datasets from real-world applications have a very large number of features and a great many training samples. In high-dimensional document data, each class is typically characterized by only a subset of the features, and these subsets usually differ between classes of different topics. This paper presents a simple but effective classifier for text categorization using a class-dependent projection-based method. By projecting onto a set of individual subspaces, the samples belonging to different document classes are separated so that they are easy to classify. This is achieved by developing a new supervised feature weighting algorithm to learn the optimized subspaces for all the document classes. Experiments carried out on common benchmark corpora showed that the proposed method achieved both higher classification accuracy and lower computational cost than some distinguished classifiers in text categorization, especially for datasets including document categories with overlapping topics.

2.
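He Li, Liu Jun. Computer Engineering (《计算机工程》), 2006, 32(20): 4-6
A naive Bayes (NB) document classification method based on concept feature vectors is proposed. On an unlabeled document set, the method uses SOM (Self-Organizing Maps) clustering to produce a number of initial document classes and assigns a class label to each; the concept feature vector of each document class is then built using a maximum information entropy method. The final document classifier, CFB-NB, is built on the concept feature vector space.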

3.
Grouping images into semantically meaningful categories using low-level visual features is a challenging and important problem in content-based image retrieval. Based on these groupings, effective indices can be built for an image database. In this paper, we show how a specific high-level classification problem (city images vs landscapes) can be solved from relatively simple low-level features geared for the particular classes. We have developed a procedure to qualitatively measure the saliency of a feature towards a classification problem based on the plot of the intra-class and inter-class distance distributions. We use this approach to determine the discriminative power of the following features: color histogram, color coherence vector, DCT coefficient, edge direction histogram, and edge direction coherence vector. We determine that the edge direction-based features have the most discriminative power for the classification problem of interest here. A weighted k-NN classifier is used for the classification, which results in an accuracy of 93.9% when evaluated on an image database of 2716 images using the leave-one-out method. This approach has been extended to further classify 528 landscape images into forests, mountains, and sunset/sunrise classes. First, the input images are classified as sunset/sunrise images vs forest & mountain images (94.5% accuracy) and then the forest & mountain images are classified as forest images or mountain images (91.7% accuracy). We are currently identifying further semantic classes to assign to images as well as extracting low-level features which are salient for these classes. Our final goal is to combine multiple 2-class classifiers into a single hierarchical classifier.
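
The evaluation protocol named in this abstract (a weighted k-NN classifier scored by leave-one-out) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the inverse-distance weighting, the Euclidean metric, the value of `k`, and the two-dimensional toy feature vectors are all assumptions, since the abstract does not specify them.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, query, k=3):
    """Return the label whose k nearest neighbours cast the largest
    inverse-distance-weighted vote (weighting scheme is an assumption)."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbours:
        # Closer neighbours contribute more; the small constant avoids division by zero.
        votes[label] += 1.0 / (math.dist(features, query) + 1e-9)
    return max(votes, key=votes.get)

def leave_one_out_accuracy(samples, k=3):
    """Classify each sample using all the other samples as training data."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if weighted_knn_predict(rest, features, k) == label:
            correct += 1
    return correct / len(samples)

# Hypothetical 2-D feature vectors standing in for edge-direction features.
samples = [
    ((0.0, 0.0), "landscape"), ((0.1, 0.2), "landscape"), ((0.2, 0.1), "landscape"),
    ((1.0, 1.0), "city"), ((0.9, 1.1), "city"), ((1.1, 0.9), "city"),
]
accuracy = leave_one_out_accuracy(samples, k=3)
```

Leave-one-out needs no separate test set, which suits a fixed image database, at the cost of one classification pass per sample.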

4.
We demonstrate a text-mining method, called the associative Naïve Bayes (ANB) classifier, for automated linking of MEDLINE documents to the gene ontology (GO). The approach of this paper is a nontrivial extension of document classification methodology from a fixed set of classes C={c1,c2,…,cn} to a knowledge hierarchy like GO. Due to the complexity of GO, we use a knowledge representation structure. With that structure, we develop the text-mining classifier, called the ANB classifier, which automatically links MEDLINE documents to GO. To check the performance, we compare several well-known classifiers on our datasets: the NB classifier, the large Bayes classifier, the support vector machine, and the ANB classifier. Our results, described in the following, indicate its practical usefulness.

5.
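Research on optimization algorithms for rapid prototyping slice data   Cited by: 4 (self-citations: 0, citations by others: 4)
To allow the slice contour data of STL models to be processed further without problems, algorithms for optimizing the slice data are proposed. Because defects in the STL model leave a large amount of redundant data in the contour information after slicing, an algorithm for filtering out the redundant data is proposed; for slice contours that are not closed, an effective repair algorithm is given; an algorithm for automatically identifying the inner and outer boundaries of slice contours is also given. The algorithms are simple and efficient; they improve the efficiency of subsequent data processing and the quality of the fabricated part, and improve the processing performance of part forming.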
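
The abstract does not say how its redundant-data filter works. One common form of redundancy in slice contours is repeated or collinear vertices, and the sketch below removes those from a closed 2-D contour. The function name `filter_redundant_points`, the tolerance, and the choice of a cross-product collinearity test are assumptions made for illustration only.

```python
def filter_redundant_points(contour, tol=1e-9):
    """Drop repeated and collinear vertices from a closed 2-D contour.

    contour: list of (x, y) tuples; the closing edge from the last
    vertex back to the first is implied.
    """
    # Step 1: remove consecutive duplicates, including a last point
    # that merely repeats the first one.
    unique = []
    for point in contour:
        if not unique or point != unique[-1]:
            unique.append(point)
    if len(unique) > 1 and unique[0] == unique[-1]:
        unique.pop()

    # Step 2: keep a vertex only if the contour actually turns there,
    # i.e. the cross product of the incoming and outgoing edges is non-zero.
    kept = []
    n = len(unique)
    for i in range(n):
        prev_pt, cur, next_pt = unique[i - 1], unique[i], unique[(i + 1) % n]
        cross = ((cur[0] - prev_pt[0]) * (next_pt[1] - cur[1])
                 - (cur[1] - prev_pt[1]) * (next_pt[0] - cur[0]))
        if abs(cross) > tol:
            kept.append(cur)
    return kept

# A square sliced into extra edge midpoints plus one repeated vertex.
square = [(0, 0), (1, 0), (2, 0), (2, 0), (2, 2), (1, 2), (0, 2), (0, 1), (0, 0)]
cleaned = filter_redundant_points(square)
```

With exact coordinates this reduces the nine input points to the four corners of the square; real slice data would need a tolerance matched to the slicer's numeric precision.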

6.
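Research on document image segmentation is of great importance for printing, faxing, and similar data processing tasks. A new algorithm for document image segmentation is proposed. The segmentation is based on the differing histogram characteristics of the various image types found in a document image. An important feature of the algorithm is that wavelet images are used to enhance the features of the original image, which improves accuracy.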

7.
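A text representation method based on relation weights is proposed. By optimizing the relation weights, the text vector reflects how the importance of each feature term differs from class to class, so that texts of different classes are distinguished more accurately under these weights. Classification experiments with SVM show that the relation-weight-based text representation achieves higher precision and recall than the traditional TF-IDF text representation.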
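
For reference, the TF-IDF baseline that this abstract compares against can be sketched as below. The sketch shows only the baseline, not the proposed relation weights, which the abstract does not define. Several TF-IDF variants exist; the one used here (raw term count multiplied by log(N/df)) is an assumption, as are the function name `tf_idf` and the toy documents.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists.
    Returns one {term: weight} dict per document, using
    weight = term count in the document * log(N / document frequency)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return weights

docs = [["svm", "text", "text"], ["svm", "image"]]
vectors = tf_idf(docs)
```

A term that occurs in every document, such as "svm" here, gets weight zero regardless of class. TF-IDF assigns one weight per term and document with no reference to class labels, which is the limitation the class-aware weighting described above targets.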

8.
A new differential LSI space-based probabilistic document classifier   Cited by: 1 (self-citations: 0, citations by others: 1)
We have developed a new effective probabilistic classifier for document classification by introducing the concept of differential document vectors and DLSI (differential latent semantic indexing) spaces. A combined use of the projections on and the distances to the DLSI spaces introduced from the differential document vectors improves the adaptability of the LSI (latent semantic indexing) method by capturing unique characteristics of documents. Using the intra- and extra-document statistics, both a simple posteriori calculation on a small example and an experiment on a large Reuters-21578 database demonstrate the advantage of the DLSI space-based probabilistic classifier over the LSI space-based classifier in classification performance.

9.
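Text classification algorithm based on the hidden Markov model   Cited by: 2 (self-citations: 0, citations by others: 2)
Yang Jian, Wang Haihang. 《计算机应用》 (Journal of Computer Applications), 2010, 30(9): 2348-2350
In recent years the field of automatic text classification has produced a number of mature classification algorithms, but these are mainly based on probabilistic statistical models and make no connection to the syntax and semantics of the text itself. An algorithm is proposed that applies the hidden Markov model (HMM) for sequence analysis to automatic text classification. A set of feature words representing each document class is constructed first, and the feature-word sequence of a document class serves as the observation sequence of the corresponding HMM classifier, while the HMM state transition sequence implicitly represents how the content of documents in each class forms and evolves. At classification time, the class label of the HMM classifier with the highest generation probability is the classification result for the test document. The classifier model built by the algorithm reflects, to some extent, the syntactic and semantic features of documents in different classes, supports multi-class automatic text classification, and classifies efficiently.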
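
The decision rule in this abstract (one HMM per class, pick the class whose model gives the observed feature-word sequence the highest generation probability) can be sketched with the standard forward algorithm. The sketch covers only the classification step; how the paper trains its models is not stated in the abstract. The two toy models, their state counts, and all probabilities are hypothetical.

```python
def sequence_likelihood(obs, start, trans, emit):
    """Forward algorithm: probability that a discrete HMM generates obs.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s]     : dict mapping a word to its emission probability in state s
    """
    n_states = len(start)
    alpha = [start[s] * emit[s].get(obs[0], 0.0) for s in range(n_states)]
    for word in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s].get(word, 0.0)
            for s in range(n_states)
        ]
    return sum(alpha)

def classify(obs, models):
    """Return the label of the model with the highest generation probability."""
    return max(models, key=lambda label: sequence_likelihood(obs, *models[label]))

# Hypothetical two-state models for two document classes.
models = {
    "sports": (
        [0.6, 0.4],
        [[0.7, 0.3], [0.4, 0.6]],
        [{"match": 0.6, "team": 0.3, "stock": 0.1},
         {"match": 0.2, "team": 0.6, "stock": 0.2}],
    ),
    "finance": (
        [0.5, 0.5],
        [[0.5, 0.5], [0.5, 0.5]],
        [{"stock": 0.7, "market": 0.2, "team": 0.1},
         {"stock": 0.3, "market": 0.6, "match": 0.1}],
    ),
}
label = classify(["match", "team", "team"], models)
```

For long documents the raw probabilities underflow, so a practical version would work in log space or rescale `alpha` at each step.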

10.
11.
This paper describes an algorithm for determining the zone content type of a given zone within a document image. We take a statistics-based approach and represent each zone with a 25-dimensional feature vector. An optimized decision tree classifier is used to classify each zone into one of nine zone content classes. A performance evaluation protocol is proposed. The training and testing data sets include a total of 24,177 zones from the University of Washington English Document Image database III. The algorithm accuracy is 98.45% with a mean false alarm rate of 0.50%.

12.
Searching for documents by their type or genre is a natural way to enhance the effectiveness of document retrieval. The layout of a document contains a significant amount of information that can be used to classify it by type in the absence of domain-specific models. Our approach to classification is based on “visual similarity” of layout structure and is implemented by building a supervised classifier, given examples of each class. We use image features such as percentages of text and non-text (graphics, images, tables, and rulings) content regions, column structures, relative point sizes of fonts, density of content area, and statistics of features of connected components which can be derived without class knowledge. In order to obtain class labels for training samples, we conducted a study where subjects ranked document pages with respect to their resemblance to representative page images. Class labels can also be assigned based on known document types, or can be defined by the user. We implemented our classification scheme using decision tree classifiers and self-organizing maps. Received June 15, 2000 / Revised November 15, 2000

13.
In order to process large numbers of explicit knowledge documents such as patents in an organized manner, automatic document categorization and search are required. In this paper, we develop a document classification and search methodology based on neural network technology that helps companies manage patent documents more effectively. The classification process begins by extracting key phrases from the document set by means of automatic text processing and determining the significance of key phrases according to their frequency in text. In order to maintain a manageable number of independent key phrases, correlation analysis is applied to compute the similarities between key phrases. Phrases with higher correlations are synthesized into a smaller set of phrases. Finally, the back-propagation network model is adopted as a classifier. The target output identifies a patent document’s category based on a hierarchical classification scheme, in this case, the international patent classification (IPC) standard. The methodology is tested using patents related to the design of power hand-tools. Related patents are automatically classified using pre-trained neural network models. In the prototype system, two modules are used for patent document management. The automatic classification module helps the user classify patent documents and the search module helps users find relevant and related patent documents. The result shows an improvement in document classification and identification over previously published methods of patent document management.

14.
This paper describes an independent handwriting style classifier that has been designed to select the best recognizer for a given style of writing. For this purpose, handwriting legibility has been defined and a method implemented that can predict this legibility. The technique consists of two phases. In the feature-extraction phase, a set of 36 features is extracted from the image contour. In the classification phase, two nonparametric classification techniques are applied to the extracted features in order to compare their effectiveness in classifying words into legible, illegible, and middle classes. In the first method, a multiple discriminant analysis (MDA) is used to transform the space of extracted features (36 dimensions) into an optimal discriminant space for a nearest-mean-based classifier. In the second method, a probabilistic neural network (PNN) based on the Bayes strategy and nonparametric estimation of the probability density function is used. The experimental results show that the PNN method gives superior classification results when compared with the MDA method. For the two-class problems, the method provides 86.5% (legible/illegible), 65.5% (legible/middle), and 90.5% (middle/illegible) correct classification. For the three-class legibility classification, the rate of correct classification is 67.33% using a PNN classifier. Received: 6 September 2002, Accepted: 19 September 2002, Published online: 6 June 2003
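
A probabilistic neural network of the kind named in this abstract amounts to a Parzen-window density estimate per class followed by a Bayes decision. The sketch below is a generic version under stated assumptions, not the authors' configuration: it uses a Gaussian kernel, equal class priors, a single hand-picked smoothing parameter `sigma`, and hypothetical 2-D feature vectors in place of the 36 contour features.

```python
import math

def pnn_classify(train, x, sigma=0.5):
    """train: dict mapping each class label to a list of feature vectors.
    Returns the label with the highest Parzen-window density estimate at x.

    The Gaussian normalisation constant is omitted because it is the same
    for every class when sigma and the dimensionality are shared.
    """
    scores = {}
    for label, vectors in train.items():
        kernel_sum = sum(
            math.exp(-math.dist(v, x) ** 2 / (2 * sigma ** 2)) for v in vectors
        )
        # Averaging over the class's samples gives the density estimate.
        scores[label] = kernel_sum / len(vectors)
    return max(scores, key=scores.get)

# Hypothetical training vectors for two legibility classes.
train = {
    "legible": [(0.0, 0.0), (0.2, 0.1)],
    "illegible": [(2.0, 2.0), (1.8, 2.1)],
}
prediction = pnn_classify(train, (0.1, 0.1))
```

The only parameter to tune is `sigma`: too small and the classifier behaves like nearest-neighbour, too large and the class densities blur together.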

15.
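Research on SVM-based image classification   Cited by: 1 (self-citations: 0, citations by others: 1)
Image classification technology has important application prospects and will actively promote the development of content-based image retrieval. Multi-class image classification is a difficult part of image classification. SVM-based multi-class image classification is studied, and a method for constructing a multi-class classifier on top of two-class support vector machines is proposed. Experimental results show that classification accuracy is considerably higher than with traditional methods.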

16.
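Remote sensing image classification based on a fuzzy Gaussian basis function neural network   Cited by: 8 (self-citations: 0, citations by others: 8)
In view of the characteristics of remote sensing image classification, a remote sensing image classifier based on a fuzzy Gaussian basis function neural network is proposed. The classifier combines fuzzy techniques with a neural network: the neural network implements the fuzzy inference, and its learning ability is used to adjust the fuzzy membership functions and the model rules, which makes the system adaptive. Experimental results show that, after training, the classifier can be applied to remote sensing image classification, and that its classification accuracy is clearly higher than that of the traditional maximum likelihood classification method.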

17.
In this paper, we report our experience on the use of phrases as basic features in the email classification problem. We performed extensive empirical evaluation using our large email collections and tested with three text classification algorithms, namely, a naive Bayes classifier and two k-NN classifiers using TF-IDF weighting and resemblance respectively. The investigation includes studies on the effect of phrase size, the size of local and global sampling, the neighbourhood size, and various methods to improve the classification accuracy. We determined suitable settings for various parameters of the classifiers and performed a comparison among the classifiers with their best settings. Our results show that no classifier dominates the others in terms of classification accuracy. Also, we made a number of observations on the special characteristics of emails. In particular, we observed that public emails are easier to classify than private ones.

18.
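A fast classification method for moving objects based on video surveillance   Cited by: 1 (self-citations: 0, citations by others: 1)
The visual image and motion information of a moving object are used to obtain its regularized contour, and binocular vision is used to measure the object's feature data. Moving objects are first divided into two classes, people and vehicles, and then classified further using different features. The refined classification uses a multi-feature tree classifier that refines the classification layer by layer. The method avoids sample training and template matching, so classification is fast.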

19.
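Density-based training sample pruning method for kNN text classifiers   Cited by: 36 (self-citations: 2, citations by others: 36)
With the rapid growth of the WWW, text classification has become a key technology for processing and organizing large volumes of document data. As a simple, effective, nonparametric classification method, kNN is widely used in text classification. However, it is computationally expensive, and an uneven distribution of training samples lowers classification accuracy. To address these two problems, a density-based method for pruning the training samples of a kNN classifier is proposed. The method not only reduces the computational cost of kNN but also makes the distribution density of the training samples more uniform, which reduces the misclassification of test samples near class boundaries. Experimental results show that the method performs very well.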

20.
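Textile defect classification is a key step in inspecting textile quality with computer vision. A new textile defect classification method based on wavelet frames is proposed. The method uses the wavelet frames of a textile image to describe the texture features of defects. Within a minimum classification error training framework, a feature extractor based on a linear transformation matrix and a classifier are designed jointly, so as to obtain wavelet frame features oriented to defect classification and to minimize the classifier's error probability. The method was evaluated in classification experiments on 329 samples covering 9 classes of textile defects and 328 defect-free samples, and achieved a classification accuracy of 93.1%, an improvement of 27.2% over the traditional classification method based on the wavelet transform.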
