首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
In this paper. we present the MIFS-C variant of the mutual information feature-selection algorithms. We present an algorithm to find the optimal value of the redundancy parameter, which is a key parameter in the MIFS-type algorithms. Furthermore, we present an algorithm that speeds up the execution time of all the MIFS variants. Overall, the presented MIFS-C has comparable classification accuracy (in some cases even better) compared with other MIFS algorithms, while its running time is faster. We compared this feature selector with other feature selectors, and found that it performs better in most cases. The MIFS-C performed especially well for the breakeven and F-measure because the algorithm can be tuned to optimise these evaluation measures. Jan Bakus received the B.A.Sc. and M.A.Sc. degrees in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 1996 and 1998, respectively, and Ph.D. degree in systems design engineering in 2005. He is currently working at Maplesoft, Waterloo, ON, Canada as an applications engineer, where he is responsible for the development of application specific toolboxes for the Maple scientific computing software. His research interests are in the area of feature selection for text classification, text classification, text clustering, and information retrieval. He is the recipient of the Carl Pollock Fellowship award from the University of Waterloo and the Datatel Scholars Foundation scholarship from Datatel. Mohamed S. Kamel holds a Ph.D. in computer science from the University of Toronto, Canada. He is at present Professor and Director of the Pattern Analysis and Machine Intelligence Laboratory in the Department of Electrical and Computing Engineering, University of Waterloo, Canada. Professor Kamel holds a Canada Research Chair in Cooperative Intelligent Systems. Dr. Kamel's research interests are in machine intelligence, neural networks and pattern recognition with applications in robotics and manufacturing. He has authored and coauthored over 200 papers in journals and conference proceedings, 2 patents and numerous technical and industrial project reports. Under his supervision, 53 Ph.D. and M.A.Sc. students have completed their degrees. Dr. Kamel is a member of ACM, AAAI, CIPS and APEO and has been named s Fellow of IEEE (2005). He is the editor-in-chief of the International Journal of Robotics and Automation, Associate Editor of the IEEE SMC, Part A, the International Journal of Image and Graphics, Pattern Recognition Letters and is a member of the editorial board of the Intelligent Automation and Soft Computing. He has served as a consultant to many Companies, including NCR, IBM, Nortel, VRP and CSA. He is a member of the board of directors and cofounder of Virtek Vision International in Waterloo.  相似文献   

2.
The management of a huge and growing amount of information available nowadays makes Automatic Document Classification (ADC), besides crucial, a very challenging task. Furthermore, the dynamics inherent to classification problems, mainly on the Web, make this task even more challenging. Despite this fact, the actual impact of such temporal evolution on ADC is still poorly understood in the literature. In this context, this work concerns to evaluate, characterize and exploit the temporal evolution to improve ADC techniques. As first contribution we highlight the proposal of a pragmatical methodology for evaluating the temporal evolution in ADC domains. Through this methodology, we can identify measurable factors associated to ADC models degradation over time. Going a step further, based on such analyzes, we propose effective and efficient strategies to make current techniques more robust to natural shifts over time. We present a strategy, named temporal context selection, for selecting portions of the training set that minimize those factors. Our second contribution consists of proposing a general algorithm, called Chronos, for determining such contexts. By instantiating Chronos, we are able to reduce uncertainty and improve the overall classification accuracy. Empirical evaluations of heuristic instantiations of the algorithm, named WindowsChronos and FilterChronos, on two real document collections demonstrate the usefulness of our proposal. Comparing them against state-of-the-art ADC algorithms shows that selecting temporal contexts allows improvements on the classification accuracy up to 10%. Finally, we highlight the applicability and the generality of our proposal in practice, pointing out this study as a promising research direction.  相似文献   

3.
We present a method for the classification of multi-labeled text documents explicitly designed for data stream applications that require to process a virtually infinite sequence of data using constant memory and constant processing time.Our method is composed of an online procedure used to efficiently map text into a low-dimensional feature space and a partition of this space into a set of regions for which the system extracts and keeps statistics used to predict multi-label text annotations. Documents are fed into the system as a sequence of words, mapped to a region of the partition, and annotated using the statistics computed from the labeled instances colliding in the same region. This approach is referred to as clashing.We illustrate the method in real-world text data, comparing the results with those obtained using other text classifiers. In addition, we provide an analysis about the effect of the representation space dimensionality on the predictive performance of the system. Our results show that the online embedding indeed approximates the geometry of the full corpus-wise TF and TF-IDF space. The model obtains competitive F measures with respect to the most accurate methods, using significantly fewer computational resources. In addition, the method achieves a higher macro-averaged F measure than methods with similar running time. Furthermore, the system is able to learn faster than the other methods from partially labeled streams.  相似文献   

4.
Abstract The bag-of-words approach to text document representation typically results in vectors of the order of 5000–20,000 components as the representation of documents. To make effective use of various statistical classifiers, it may be necessary to reduce the dimensionality of this representation. We point out deficiencies in class discrimination of two popular such methods, Latent Semantic Indexing (LSI), and sequential feature selection according to some relevant criterion. As a remedy, we suggest feature transforms based on Linear Discriminant Analysis (LDA). Since LDA requires operating both with large and dense matrices, we propose an efficient intermediate dimension reduction step using either a random transform or LSI. We report good classification results with the combined feature transform on a subset of the Reuters-21578 database. Drastic reduction of the feature vector dimensionality from 5000 to 12 actually improves the classification performance.An erratum to this article can be found at  相似文献   

5.
Boosting text segmentation via progressive classification   总被引:1,自引:4,他引:1  
A novel approach for reconciling tuples stored as free text into an existing attribute schema is proposed. The basic idea is to subject the available text to progressive classification, i.e., a multi-stage classification scheme where, at each intermediate stage, a classifier is learnt that analyzes the textual fragments not reconciled at the end of the previous steps. Classification is accomplished by an ad hoc exploitation of traditional association mining algorithms, and is supported by a data transformation scheme which takes advantage of domain-specific dictionaries/ontologies. A key feature is the capability of progressively enriching the available ontology with the results of the previous stages of classification, thus significantly improving the overall classification accuracy. An extensive experimental evaluation shows the effectiveness of our approach.  相似文献   

6.
Text categorization presents unique challenges to traditional classification methods due to the large number of features inherent in the datasets from real-world applications of text categorization, and a great deal of training samples. In high-dimensional document data, the classes are typically categorized only by subsets of features, which are typically different for the classes of different topics. This paper presents a simple but effective classifier for text categorization using class-dependent projection based method. By projecting onto a set of individual subspaces, the samples belonging to different document classes are separated such that they are easily to be classified. This is achieved by developing a new supervised feature weighting algorithm to learn the optimized subspaces for all the document classes. The experiments carried out on common benchmarking corpuses showed that the proposed method achieved both higher classification accuracy and lower computational costs than some distinguishing classifiers in text categorization, especially for datasets including document categories with overlapping topics.  相似文献   

7.
Text mining techniques include categorization of text, summarization, topic detection, concept extraction, search and retrieval, document clustering, etc. Each of these techniques can be used in finding some non-trivial information from a collection of documents. Text mining can also be employed to detect a document’s main topic/theme which is useful in creating taxonomy from the document collection. Areas of applications for text mining include publishing, media, telecommunications, marketing, research, healthcare, medicine, etc. Text mining has also been applied on many applications on the World Wide Web for developing recommendation systems. We propose here a set of criteria to evaluate the effectiveness of text mining techniques in an attempt to facilitate the selection of appropriate technique.  相似文献   

8.
中文特征词的选取是中文信息预处理内容之一,对文档分类有重要影响。中文分词处理后,采用特征词构建的向量模型表示文档时,导致特征词的稀疏性和高维性,从而影响文档分类的性能和精度。在分析、总结多种经典文本特征选取方法基础上,以文档频为主,实现文档集中的特征词频及其分布为修正的特征词选取方法(DC)。采用宏F值和微F值为评价指标,通过实验对比证明,该方法的特征选取效果好于经典文本特征选取方法。  相似文献   

9.
本文阐述了一个中文文本分类系统的设计和实现,对文本分类系统的系统结构、特征提取、训练算法、分类算法等进行了详细介绍,将基于统计的二元分词方法应用于中文文本分类,并提出了一种基于汉语中单字词及二字词统计特性的中文文本分类方法,实现了在事先没有词表的情况下,通过统计构造单字及二字词词表,从而对文本进行分词,然后再进行文本的分类。  相似文献   

10.
Using text classification and multiple concepts to answer e-mails   总被引:1,自引:0,他引:1  
In text mining, the applications domain of text classification techniques is very broad to include text filtering, word identification, and web page classification, etc. Through text classification techniques, documents can be placed into previously defined classifications in order to save on time costs especially when manual document search methods are employed. This research uses text classification techniques applied to e-mail reply template suggestions in order to lower the burden of customer service personnel in responding to e-mails. Suggested templates allows customer service personnel, using a pre-determined number of templates, to find the needed reply template, and not waste time in searching for relevant answers from too much information available. Current text classification techniques are still single-concept based. This research hopes to use a multiple concept method to integrate the relationship between concepts and classifications which will thus allow easy text classification. Through integration of different concepts and classifications, a dynamically unified e-mail concept can recommend different appropriate reply templates. In so doing, the differences between e-mails can be definitely determined, effectively improving the accuracy of the suggested template. In addition, for e-mails with two or more questions, this research tries to come up with an appropriate reply template. Based on experimental verification, the method proposed in this research effectively proposes a template for e-mails of multiple questions. Therefore, using multiple concepts to display the document topic is definitely a clearer way of extracting information that a document wants to convey when the vector of similar documents is used.  相似文献   

11.
Textual databases are useful sources of information and knowledge and if these are well utilised then issues related to future project management and product or service quality improvement may be resolved. A large part of corporate information, approximately 80%, is available in textual data formats. Text Classification techniques are well known for managing on-line sources of digital documents. The identification of key issues discussed within textual data and their classification into two different classes could help decision makers or knowledge workers to manage their future activities better. This research is relevant for most text based documents and is demonstrated on Post Project Reviews (PPRs) which are valuable source of information and knowledge. The application of textual data mining techniques for discovering useful knowledge and classifying textual data into different classes is a relatively new area of research. The research work presented in this paper is focused on the use of hybrid applications of text mining or textual data mining techniques to classify textual data into two different classes. The research applies clustering techniques at the first stage and Apriori Association Rule Mining at the second stage. The Apriori Association Rule of Mining is applied to generate Multiple Key Term Phrasal Knowledge Sequences (MKTPKS) which are later used for classification. Additionally, studies were made to improve the classification accuracies of the classifiers i.e. C4.5, K-NN, Naïve Bayes and Support Vector Machines (SVMs). The classification accuracies were measured and the results compared with those of a single term based classification model. The methodology proposed could be used to analyse any free formatted textual data and in the current research it has been demonstrated on an industrial dataset consisting of Post Project Reviews (PPRs) collected from the construction industry. The data or information available in these reviews is codified in multiple different formats but in the current research scenario only free formatted text documents are examined. Experiments showed that the performance of classifiers improved through adopting the proposed methodology.  相似文献   

12.
Detection of both scene text and graphic text in video images is gaining popularity in the area of information retrieval for efficient indexing and understanding the video. In this paper, we explore a new idea of classifying low contrast and high contrast video images in order to detect accurate boundary of the text lines in video images. In this work, high contrast refers to sharpness while low contrast refers to dim intensity values in the video images. The method introduces heuristic rules based on combination of filters and edge analysis for the classification purpose. The heuristic rules are derived based on the fact that the number of Sobel edge components is more than the number of Canny edge components in the case of high contrast video images, and vice versa for low contrast video images. In order to demonstrate the use of this classification on video text detection, we implement a method based on Sobel edges and texture features for detecting text in video images. Experiments are conducted using video images containing both graphic text and scene text with different fonts, sizes, languages, backgrounds. The results show that the proposed method outperforms existing methods in terms of detection rate, false alarm rate, misdetection rate and inaccurate boundary rate.  相似文献   

13.
In this paper, we propose a new discriminant analysis using composite features for pattern classification. A composite feature consists of a number of primitive features, each of which corresponds to an input variable. The covariance of composite features is obtained from the inner product of composite features and can be considered as a generalized form of the covariance of primitive features. It contains information on statistical dependency among multiple primitive features. A discriminant analysis (C-LDA) using the covariance of composite features is a generalization of the linear discriminant analysis (LDA). Unlike LDA, the number of extracted features can be larger than the number of classes in C-LDA, which is a desirable property especially for binary classification problems. Experimental results on several data sets indicate that C-LDA provides better classification results than other methods based on primitive features.  相似文献   

14.
This paper presents two algorithms for smoothing and feature extraction for fingerprint classification. Deutsch's(2) Thinning algorithm (rectangular array) is used for thinning the digitized fingerprint (binary version). A simple algorithm is also suggested for classifying the fingerprints. Experimental results obtained using such algorithms are presented.  相似文献   

15.
The evaluation of feature selection methods for text classification with small sample datasets must consider classification performance, stability, and efficiency. It is, thus, a multiple criteria decision-making (MCDM) problem. Yet there has been few research in feature selection evaluation using MCDM methods which considering multiple criteria. Therefore, we use MCDM-based methods for evaluating feature selection methods for text classification with small sample datasets. An experimental study is designed to compare five MCDM methods to validate the proposed approach with 10 feature selection methods, nine evaluation measures for binary classification, seven evaluation measures for multi-class classification, and three classifiers with 10 small datasets. Based on the ranked results of the five MCDM methods, we make recommendations concerning feature selection methods. The results demonstrate the effectiveness of the used MCDM-based method in evaluating feature selection methods.  相似文献   

16.
With the growth of computational culture, the concept of cultural syntactic unit has been investigated quite intensively in recent years. The existing approaches of extracting music feature always consider putting notes together according to certain rules. However, sometimes, it might be better take the integrity of melody into account. In this paper, we introduce a syntactic unit named music gene and propose a music feature extraction method for era classification. Music genes are extracted from XML files and estimated utilizing their intervals to investigate their representation of music feature. We evaluate the performance of music gene and the previous proposed motif to illustrate the effectiveness of music gene in classification. Support vector machines are applied to classify XML files into their respective classes by learning from training data. We obtain the classification accuracy for a collection of 764 music genes, demonstrating that significant classification can be achieved using higher level music feature.  相似文献   

17.
18.
Within the last decade increasing computing power and the scientific advancement of algorithms allowed the analysis of various aspects of human faces such as facial expression estimation [20], head pose estimation [17], person identification [2] or face model fitting [31]. Today, computer scientists can use a bunch of different techniques to approach this challenge 4, 29, 3, 17, 9 and 21. However, each of them still has to deal with non-perfect accuracy or high execution times.  相似文献   

19.
Text classification is usually based on constructing a model through learning from training examples to automatically classify text documents. However, as the size of text document repositories grows rapidly, the storage requirement and computational cost of model learning become higher. Instance selection is one solution to solve these limitations whose aim is to reduce the data size by filtering out noisy data from a given training dataset. In this paper, we introduce a novel algorithm for these tasks, namely a biological-based genetic algorithm (BGA). BGA fits a “biological evolution” into the evolutionary process, where the most streamlined process also complies with the reasonable rules. In other words, after long-term evolution, organisms find the most efficient way to allocate resources and evolve. Consequently, we can closely simulate the natural evolution of an algorithm, such that the algorithm will be both efficient and effective. The experimental results based on the TechTC-100 and Reuters-21578 datasets show the outperformance of BGA over five state-of-the-art algorithms. In particular, using BGA to select text documents not only results in the largest dataset reduction rate, but also requires the least computational time. Moreover, BGA can make the k-NN and SVM classifiers provide similar or slightly better classification accuracy than GA.  相似文献   

20.
Consider a supervised learning problem in which examples contain both numerical- and text-valued features. To use traditional feature-vector-based learning methods, one could treat the presence or absence of a word as a Boolean feature and use these binary-valued features together with the numerical features. However, the use of a text-classification system on this is a bit more problematic—in the most straight-forward approach each number would be considered a distinct token and treated as a word. This paper presents an alternative approach for the use of text classification methods for supervised learning problems with numerical-valued features in which the numerical features are converted into bag-of-words features, thereby making them directly usable by text classification methods. We show that even on purely numerical-valued data the results of text classification on the derived text-like representation outperforms the more naive numbers-as-tokens representation and, more importantly, is competitive with mature numerical classification methods such as C4.5, Ripper, and SVM. We further show that on mixed-mode data adding numerical features using our approach can improve performance over not adding those features.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号