Similar Documents
20 similar documents found (search time: 31 ms)
1.
In pattern classification, it is necessary to handle efficiently not only feature vectors but also feature matrices defined as two-way data, while preserving two-way structure such as spatio-temporal relationships. A classifier for feature matrices is generally formulated in a bilinear form composed of row and column weights that jointly form a matrix weight. The rank of this matrix should be low from the viewpoint of both generalization performance and computational cost. To that end, we propose a low-rank bilinear classifier based on efficient convex optimization. In the proposed method, the classifier is optimized by minimizing the trace norm of the classifier (matrix) to reduce its rank without imposing any hard constraint on it. We formulate the optimization problem in a tractable convex form and provide a procedure that solves it efficiently to the global optimum. In addition, we propose two novel extensions of the bilinear classifier, for multiple kernel learning and cross-modal learning. By kernelizing the bilinear method, we naturally induce a novel multiple kernel learning method that integrates both the inter-kernels between heterogeneous reproducing kernel Hilbert spaces (RKHSs) and the ordinary kernels within the respective RKHSs into a new discriminative kernel in a unified manner via the bilinear model. For cross-modal learning, we consider mapping multi-modal features into a common space in which they are subsequently classified. We show that the projection and the classification are jointly represented by the bilinear model, and we propose a method to optimize both simultaneously within the bilinear framework. In experiments on various visual classification tasks, the proposed methods exhibit favorable performance compared with other methods.
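As a rough illustration of the core idea, the sketch below minimizes a hinge loss plus a trace-norm penalty by proximal gradient descent (singular value thresholding). It is a minimal stand-in, not the paper's exact convex formulation or solver, and all hyperparameter values are assumptions.

```python
import numpy as np

def svt(W, tau):
    """Singular value thresholding: proximal operator of tau * ||W||_*."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def train_bilinear(X, y, lam=0.1, lr=0.01, n_iter=500):
    """X: (n, rows, cols) feature matrices; y: labels in {-1, +1}.
    Learns a matrix weight W whose rank the trace-norm penalty shrinks."""
    n, r, c = X.shape
    W = np.zeros((r, c))
    for _ in range(n_iter):
        margin = y * np.einsum('nrc,rc->n', X, W)
        active = margin < 1                       # hinge-loss violators
        grad = -np.einsum('n,nrc->rc', y[active], X[active]) / n
        W = svt(W - lr * grad, lr * lam)          # proximal step
    return W

# toy usage with a rank-1 ground-truth weight matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8, 6))
W_true = np.outer(rng.normal(size=8), rng.normal(size=6))
y = np.sign(np.einsum('nrc,rc->n', X, W_true))
W = train_bilinear(X, y)
print("numerical rank of W:", np.linalg.matrix_rank(W, tol=1e-3))
```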

2.
王铁建  吴飞  荆晓远 《计算机科学》2017,44(12):131-134, 168
This paper proposes a multiple kernel dictionary learning method for predicting whether software modules contain defects. Historical data used for software defect prediction is structurally complex and class-imbalanced. The method maps such data into a high-dimensional feature space using a composite kernel built from several kernel functions, and obtains a class-balanced multiple kernel dictionary by selecting the multiple kernel dictionary bases; the dictionary is then used to classify new software modules and predict whether they contain defects. Experiments on the NASA MDP datasets show that, compared with other software defect prediction methods, the multiple kernel dictionary learning method copes well with the structural complexity and class imbalance of historical defect data and solves the defect prediction problem effectively.

3.
Software defects, produced inevitably in software projects, seriously affect the efficiency of software testing and maintenance. An appealing solution is software defect prediction (SDP), which has achieved good performance in many software projects. However, differences between features, and differences in the same feature between training and test data, may degrade defect prediction performance when they violate the model's assumptions. To address this issue, we propose an SDP method based on feature transfer learning (FTL), which applies a transformation sequence to each feature in order to map the original features into another feature space. Specifically, FTL first uses a reinforcement learning scheme to automatically learn a strategy for transferring the potential feature knowledge from the training data. The learned feature knowledge then guides the transformation of the test data. The classifier is trained on the transformed training data and predicts defects on the transformed test data. We evaluate the validity of FTL on 43 projects from PROMISE and NASA MDP using three classifiers: logistic regression, random forest, and naive Bayes (NB). Experimental results indicate that FTL outperforms the original classifiers and performs best with the NB classifier. On PROMISE, after applying FTL, the average F1-score, AUC, and MCC are 0.601, 0.757, and 0.350, respectively, which are 24.9%, 2.6%, and 16.7% higher than the original NB results; the proportions of projects with improved performance are 83.87%, 83.87%, and 64.52%. FTL performs similarly well on NASA MDP. Moreover, compared with four feature engineering (FE) methods, FTL achieves a clear improvement on most projects, and its average performance is better than or close to that of the FE methods.
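Illustration only: the paper learns per-feature transformation sequences with reinforcement learning, which is not reproduced here. The greedy stand-in below picks, for each feature, the unary transform that most improves cross-validated F1 of a naive Bayes classifier, and records the plan so the same transforms can be replayed on test data; binary labels are assumed.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

TRANSFORMS = {
    'identity': lambda v: v,
    'log1p':    lambda v: np.log1p(v - v.min()),  # shift keeps the argument >= 0
    'sqrt':     lambda v: np.sqrt(v - v.min()),
    'square':   lambda v: v ** 2,
}

def greedy_feature_transfer(X, y):
    """Greedily pick one transform per feature by cross-validated F1."""
    X = np.asarray(X, dtype=float).copy()
    plan = []
    for j in range(X.shape[1]):
        def score(col):
            Xt = X.copy()
            Xt[:, j] = col
            return cross_val_score(GaussianNB(), Xt, y, cv=3, scoring='f1').mean()
        best = max(TRANSFORMS, key=lambda name: score(TRANSFORMS[name](X[:, j])))
        X[:, j] = TRANSFORMS[best](X[:, j])
        plan.append(best)          # replay the same plan on the test data
    return X, plan
```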

4.
N-gram analysis for computer virus detection
Generic computer virus detection is the need of the hour, as most commercial antivirus software fails to detect unknown and new viruses. Motivated by the success of data mining and machine learning techniques in intrusion detection systems, recent research on detecting malicious executables has been directed towards devising efficient non-signature-based techniques that can profile program characteristics from a set of training examples. Byte sequences and byte n-grams are considered the basis of feature extraction. Because the number of n-grams can be very large, several feature selection methods have been proposed in the literature. A recent report on information-gain-based feature selection yielded the best-known result in classifying malicious executables against benign ones. We observe that information gain models the presence of an n-gram in one class and its absence in the other, and we show through a simple example that this may lead to erroneous results. In this paper, we describe a new feature selection measure: the class-wise document frequency of byte n-grams. We empirically demonstrate that the proposed measure is better for feature selection. For detection, instead of using any single classifier, we combine several classifiers using Dempster-Shafer theory for better classification accuracy. Our experimental results show that this scheme detects virus programs far more effectively than earlier methods.
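As an illustration of the proposed measure, the sketch below computes the class-wise document frequency of byte n-grams and keeps the union of the per-class top-ranked n-grams; the n-gram length, the top-k cut-off, and the selection rule details are assumptions rather than the paper's exact settings.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 4):
    """Set of distinct byte n-grams in one executable."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def classwise_df_select(samples, labels, n=4, top_k=500):
    """samples: list of bytes objects; labels: class of each sample."""
    classes = set(labels)
    df = {c: Counter() for c in classes}
    sizes = Counter(labels)
    for data, c in zip(samples, labels):
        df[c].update(byte_ngrams(data, n))   # each file counts at most once
    selected = set()
    for c in classes:
        ranked = sorted(df[c], key=lambda g: df[c][g] / sizes[c], reverse=True)
        selected.update(ranked[:top_k])      # union of per-class top lists
    return selected

# usage: features = classwise_df_select(executables, labels); then build
# binary presence vectors over `features` for the downstream classifiers
```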

5.

Large scale online kernel learning aims to build an efficient and scalable kernel-based predictive model incrementally from a sequence of potentially infinite data points. Current state-of-the-art large scale online kernel learning focuses on improving efficiency. Two key approaches to gain efficiency through approximation are (1) limiting the number of support vectors and (2) using an approximate feature map. They often employ a kernel whose feature map has intractable dimensionality. While these approaches can deal with large scale datasets efficiently, this outcome is achieved by compromising predictive accuracy because of the approximation. We offer an alternative approach that puts the kernel used at the heart of the method. It focuses on creating a sparse and finite-dimensional feature map of a kernel called Isolation Kernel. Using this new approach, achieving the above aim of large scale online kernel learning becomes extremely simple: use Isolation Kernel instead of a kernel having a feature map with intractable dimensionality. We show that, using Isolation Kernel, large scale online kernel learning can be achieved efficiently without sacrificing accuracy.
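A minimal sketch of such a sparse, finite-dimensional feature map, assuming Voronoi-style partitionings built from randomly sampled points; the number of partitionings t and sample size psi are illustrative, not prescribed values.

```python
import numpy as np
from scipy.sparse import csr_matrix

def isolation_kernel_map(X, X_ref, t=100, psi=16, seed=0):
    """Map each row of X to a t*psi-dimensional sparse binary vector with
    exactly one active cell per partitioning: the index of the nearest of
    psi points sampled from the reference data (requires psi <= len(X_ref))."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    cols = np.empty((n, t), dtype=np.int64)
    for i in range(t):
        centers = X_ref[rng.choice(X_ref.shape[0], size=psi, replace=False)]
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        cols[:, i] = i * psi + d2.argmin(axis=1)
    rows = np.repeat(np.arange(n), t)
    data = np.full(n * t, 1.0 / np.sqrt(t))  # dot products then average agreements
    return csr_matrix((data, (rows, cols.ravel())), shape=(n, t * psi))

# phi_x.dot(phi_y.T) gives the kernel value, so any linear online learner
# (e.g., online gradient descent) can consume the feature map directly
```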


6.
Text search is a classical problem in Computer Science, with many data-intensive applications. For this problem, suffix arrays are among the most widely known and used data structures, enabling fast searches for phrases, terms, substrings and regular expressions in large texts. Potential application domains for these operations include large-scale search services, such as Web search engines, where it is necessary to efficiently process intensive-traffic streams of on-line queries. This paper proposes strategies to enable such services by means of suffix arrays. We introduce techniques for deploying suffix arrays on clusters of distributed-memory processors and then study the processing of multiple queries on the distributed data structure. Even though the cost of individual search operations in sequential (non-distributed) suffix arrays is low in practice, the problem of processing multiple queries on distributed-memory systems, so that hardware resources are used efficiently, is relevant to services aimed at achieving high query throughput at low operational costs. Our theoretical and experimental performance studies show that our proposals are suitable solutions for building efficient and scalable on-line search services based on suffix arrays.
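For concreteness, here is a compact sequential (non-distributed) suffix-array sketch with binary search over suffixes; the paper's distributed deployment and query scheduling across cluster nodes are not reproduced.

```python
def build_suffix_array(text: str):
    """Naive O(n^2 log n) construction; fine for illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text: str, sa, pattern: str):
    """Binary search for the block of suffixes that start with `pattern`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                         # leftmost suffix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                         # just past the matching block
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[i] for i in range(start, lo))

text = "banana"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ana"))   # -> [1, 3]
```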

7.
《Applied Soft Computing》2007,7(3):908-914
This paper presents a least squares support vector machine (LS-SVM) that classifies noisy document titles into predetermined categories. The system's potential is demonstrated on a corpus of 91,229 words from the University of Denver's Penrose Library catalogue. The classification accuracy of the proposed LS-SVM based system is found to be over 99.9%. The final classifier is an LS-SVM array with a Gaussian radial basis function (GRBF) kernel, which uses the coefficients generated by the latent semantic indexing algorithm to classify the text titles. These coefficients are also used to generate the confidence factors for the inference engine that presents the final decision of the entire classifier. The system is also compared with K-nearest neighbor (KNN) and naive Bayes (NB) classifiers, and the comparison clearly shows that the proposed LS-SVM based architecture outperforms the KNN and NB based systems. A comparison between conventional linear SVM based classifiers and neural network based classifying agents shows that the LS-SVM with LSI based classifying agents improves text categorization performance significantly and holds considerable potential for developing robust learning-based agents for text classification.
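The pipeline shape described above can be sketched as follows; scikit-learn's SVC with an RBF kernel stands in for the paper's LS-SVM array, and the component sizes are assumptions.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC

# TfidfVectorizer + TruncatedSVD is the standard LSI route in scikit-learn;
# 100 latent dimensions is an assumed, not reported, setting.
pipeline = make_pipeline(
    TfidfVectorizer(),                     # term-document weighting
    TruncatedSVD(n_components=100),        # LSI coefficients for each title
    SVC(kernel='rbf', probability=True),   # GRBF-kernel stand-in for LS-SVM
)
# pipeline.fit(train_titles, train_categories)
# pipeline.predict_proba(test_titles) plays the role of the paper's
# confidence factors feeding the inference engine
```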

8.
This paper presents a probabilistic mixture modeling framework for the hierarchic organisation of document collections. It is demonstrated that the probabilistic corpus model which emerges from the automatic or unsupervised hierarchical organisation of a document collection can be further exploited to create a kernel which boosts the performance of state-of-the-art support vector machine document classifiers. It is shown that the performance of such a classifier is further enhanced when employing the kernel derived from an appropriate hierarchic mixture model used for partitioning a document corpus rather than the kernel associated with a flat non-hierarchic mixture model. This has important implications for document classification when a hierarchic ordering of topics exists. This can be considered as the effective combination of documents with no topic or class labels (unlabeled data), labeled documents, and prior domain knowledge (in the form of the known hierarchic structure), in providing enhanced document classification performance.

9.
This paper presents the implementation of a new text document classification framework that uses the support vector machine (SVM) approach in the training phase and the Euclidean distance function in the classification phase, coined Euclidean-SVM. The SVM constructs a classifier by generating a decision surface, namely the optimal separating hyper-plane, to partition different categories of data points in the vector space. The concept of the optimal separating hyper-plane can be generalized to non-linearly separable cases by introducing kernel functions that map the data points from the input space into a high dimensional feature space in which they can be separated by a linear hyper-plane. As a consequence, the choice of kernel function has a high impact on the classification accuracy of the SVM. Besides the kernel function, the value of the soft margin parameter C is another critical component in determining the performance of the SVM classifier. Hence, one critical problem of the conventional SVM classification framework is the need to determine the appropriate kernel function and the appropriate value of parameter C for each dataset of varying characteristics in order to guarantee high accuracy of the classifier. In this paper, we introduce a distance measurement technique using the Euclidean distance function to replace the optimal separating hyper-plane as the classification decision making function in the SVM. In our approach, the support vectors for each category are identified from the training data points during the training phase using the SVM. In the classification phase, when a new data point is mapped into the original vector space, the average distances between the new data point and the support vectors from each category are measured using the Euclidean distance function. The classification decision is made in favor of the category of support vectors with the lowest average distance to the new data point, making the decision independent of the efficacy of the hyper-plane formed by the particular kernel function and soft margin parameter. We tested our proposed framework on several text datasets. The experimental results show that the accuracy of the Euclidean-SVM text classifier is largely insensitive to the choice of kernel function and soft margin parameter C.
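A minimal sketch of the Euclidean-SVM decision rule, assuming scikit-learn's SVC for the training phase; `euclidean_svm_predict` is a hypothetical helper that assigns each test point to the class whose support vectors lie at the smallest average Euclidean distance.

```python
import numpy as np
from sklearn.svm import SVC

def euclidean_svm_predict(clf, y_train, X_test):
    """Vote by average Euclidean distance to each category's support vectors,
    measured in the original input space rather than on the hyper-plane."""
    sv = clf.support_vectors_                     # found during SVM training
    sv_labels = np.asarray(y_train)[clf.support_]
    classes = np.unique(sv_labels)
    preds = []
    for x in X_test:
        dists = np.linalg.norm(sv - x, axis=1)
        avg = [dists[sv_labels == c].mean() for c in classes]
        preds.append(classes[int(np.argmin(avg))])
    return np.array(preds)

# clf = SVC(kernel='rbf').fit(X_train, y_train)         # kernel/C matter here...
# y_pred = euclidean_svm_predict(clf, y_train, X_test)  # ...far less here
```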

10.
Ji Haijin, Huang Song, Wu Yaning, Hui Zhanwei, Zheng Changyou 《Software Quality Journal》2019,27(3):923-968

Software defect prediction (SDP) plays a significant part in identifying the most defect-prone modules before software testing and in allocating limited testing resources. One of the most commonly used classifiers in SDP is naive Bayes (NB). Despite its simplicity, the NB classifier can often perform better than more complicated classification models. In NB, the features are assumed to be equally important, and the numeric features are assumed to follow a normal distribution. However, the features often do not contribute equally to the classification, and Kolmogorov-Smirnov tests show that they usually are not normally distributed; both facts may harm the performance of the NB classifier. Therefore, this paper proposes a new weighted naive Bayes method based on information diffusion (WNB-ID) for SDP. More specifically, for the equal importance assumption, we investigate six weight assignment methods for setting the feature weights and choose the most suitable one based on the F-measure. For the normal distribution assumption, we apply the information diffusion model (IDM) to compute the probability density of each feature instead of the default probability density function of the normal distribution. We carry out experiments on 10 software defect data sets from three types of projects in three different programming languages provided by the PROMISE repository. Several well-known classifiers and ensemble methods are included for comparison. The final experimental results demonstrate the effectiveness and practicability of the proposed method.
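A hedged sketch of the weighted-NB scoring idea: a kernel density estimate stands in for the paper's information diffusion model, and the feature weights `w` (which the paper selects via the F-measure among six schemes) are assumed given.

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_wnb(X, y):
    """Per-class priors plus one univariate KDE per (class, feature)."""
    classes = np.unique(y)
    priors = {c: float((y == c).mean()) for c in classes}
    dens = {c: [gaussian_kde(X[y == c][:, j]) for j in range(X.shape[1])]
            for c in classes}
    return classes, priors, dens

def predict_wnb(model, X, w):
    """w[j] scales feature j's log-likelihood contribution."""
    classes, priors, dens = model
    preds = []
    for x in X:
        scores = [np.log(priors[c]) +
                  sum(w[j] * np.log(dens[c][j](x[j])[0] + 1e-12)
                      for j in range(len(x)))
                  for c in classes]
        preds.append(classes[int(np.argmax(scores))])
    return np.array(preds)
```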


11.
Conducting customer analysis in a systematic way is one possible means for an enterprise to understand consumer behavior patterns efficiently and in depth. Further investigation of customer patterns helps the firm make sound decisions, which in turn optimizes the enterprise's business and maximizes consumer satisfaction. To conduct an effective assessment of the customers, naive Bayes (NB, also called simple Bayes), a machine learning model, is utilized. However, the efficacy of the NB model relies entirely on the consumer data used, and uncertain and redundant attributes in the data can drive the model to its worst predictions because of its independence assumption over the attributes. In practice the NB premise rarely holds in consumer data, and modeling over such redundant attributes yields poor prediction results. In this work, an ensemble attribute selection methodology is applied to overcome this problem and to pick a stable, uncorrelated attribute set to model with the NB classifier. In ensemble variable selection, two different strategies are applied: one based on data perturbation (a homogeneous ensemble: the same feature selector is applied to different subsamples drawn from the same learning set) and the other based on function perturbation (a heterogeneous ensemble: different feature selectors are applied to the same learning set). The feature sets captured by both ensemble strategies are then fed to NB individually and the outcomes are evaluated. The experimental results show that the proposed ensemble strategies are effective in choosing a stable attribute set and in improving NB classification performance.
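The two strategies can be sketched as follows, with chi-squared, ANOVA F, and mutual information as assumed selectors and summed ranks as the assumed aggregation rule.

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

def homogeneous_select(X, y, k=10, rounds=10, seed=0):
    """Data perturbation: one selector (chi2, so X must be nonnegative,
    e.g., counts) scored over bootstrap subsamples, aggregated by ranks."""
    rng = np.random.default_rng(seed)
    rank_sum = np.zeros(X.shape[1])
    for _ in range(rounds):
        idx = rng.choice(len(y), size=len(y), replace=True)
        scores, _ = chi2(X[idx], y[idx])
        rank_sum += np.argsort(np.argsort(-scores))   # rank 0 = best
    return np.argsort(rank_sum)[:k]

def heterogeneous_select(X, y, k=10):
    """Function perturbation: different selectors on the same data."""
    rank_sum = np.zeros(X.shape[1])
    for scores in (chi2(X, y)[0], f_classif(X, y)[0],
                   mutual_info_classif(X, y)):
        rank_sum += np.argsort(np.argsort(-scores))
    return np.argsort(rank_sum)[:k]

# X_stable = X[:, homogeneous_select(X, y)]; then fit a naive Bayes model
```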

12.
It is widely recognized that whether the selected kernel matches the data determines the performance of kernel-based methods. Ideally, the data would be linearly separable in the kernel-induced feature space, so the Fisher linear discriminant criterion can be used as a cost function to optimize the kernel function. However, in many applications the data may not be linearly separable even after the kernel transformation; for example, the data may have a multimodal distribution. In this case a nonlinear classifier is preferred, and the Fisher criterion is clearly not a suitable kernel optimization rule. Motivated by this issue, we propose a localized kernel Fisher criterion, instead of the traditional Fisher criterion, as the kernel optimization rule, to increase the local margins between embedded classes in the kernel-induced feature space. Experimental results on benchmark data and on measured radar high-resolution range profile (HRRP) data show that classification performance can be improved by the proposed method.

13.
Software defect prediction aims to predict the defect proneness of new software modules from historical defect data so as to improve the quality of a software system. Software historical defect data has a complicated structure and a marked class-imbalance characteristic; how to fully analyze and utilize the existing historical defect data and build more precise and effective classifiers has attracted considerable interest from both academia and industry. Multiple kernel learning and ensemble learning are effective techniques in the field of machine learning: multiple kernel learning can map the historical defect data to a higher-dimensional feature space where it is better represented, and ensemble learning can use a series of weak classifiers to reduce the bias introduced by the majority class and obtain better predictive performance. In this paper, we propose to use multiple kernel learning to predict software defects. Using the characteristics of the metrics mined from open source software, we obtain a multiple kernel classifier through an ensemble learning method, which combines the advantages of both multiple kernel learning and ensemble learning. We thus propose a multiple kernel ensemble learning (MKEL) approach for software defect classification and prediction. Considering the cost of risk in software defect prediction, we design a new sample weight vector updating strategy to reduce the cost of risk caused by misclassifying defective modules as non-defective ones. We employ the widely used NASA MDP datasets as test data to evaluate the performance of all compared methods; experimental results show that MKEL outperforms several representative state-of-the-art defect prediction methods.
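A hedged sketch of the cost-aware weight update idea alone: misclassified defective modules receive an extra cost multiplier so later rounds concentrate on them. The cost value is an assumption, and the full MKEL training loop is not reproduced.

```python
import numpy as np

def mkel_weight_update(w, y, pred, alpha, defect_cost=2.0):
    """One boosting-round update: y == 1 marks defective modules, and
    misclassifying them inflates their weight by an extra cost factor."""
    wrong = (pred != y).astype(float)
    cost = np.where((y == 1) & (pred != y), defect_cost, 1.0)
    w = w * np.exp(alpha * wrong * cost)
    return w / w.sum()
```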

14.
Gaussian mixture model (GMM) based approaches have been commonly used for speaker recognition tasks. Methods for estimating the parameters of GMMs include the expectation-maximization method, a non-discriminative learning method. Discriminative classifier based approaches to speaker recognition include support vector machine (SVM) based classifiers using dynamic kernels such as the generalized linear discriminant sequence kernel, the probabilistic sequence kernel, the GMM supervector kernel, the GMM-UBM mean interval (GUMI) kernel and the intermediate matching kernel. Recently, the pyramid match kernel (PMK), using grids in the feature space as histogram bins, and the vocabulary-guided PMK (VGPMK), using clusters in the feature space as histogram bins, have been proposed for recognizing objects in an image represented as a set of local feature vectors. In PMK, a set of feature vectors is mapped onto a multi-resolution histogram pyramid, and the kernel is computed between a pair of examples by comparing the pyramids using a weighted histogram intersection function at each level of the pyramid. We propose to use the PMK-based SVM classifier for speaker identification and verification from the speech signal of an utterance represented as a set of local feature vectors. The main issue in building the PMK-based SVM classifier is the construction of the pyramid of histograms. We first propose to form hard clusters, using the k-means clustering method, with an increasing number of clusters at different levels of the pyramid, to design the codebook-based PMK (CBPMK). We then propose the GMM-based PMK (GMMPMK), which uses soft clustering. We compare the performance of the GMM-based approaches and of the PMK and other dynamic kernel SVM-based approaches to speaker identification and verification. The 2002 and 2003 NIST speaker recognition corpora are used to evaluate the different approaches. Results of our studies show that the dynamic kernel SVM-based approaches give significantly better performance than the state-of-the-art GMM-based approaches. For the speaker recognition task, the GMMPMK-based SVM gives performance better than that of SVMs using many other dynamic kernels and comparable to that of SVMs using the state-of-the-art GUMI dynamic kernel. The storage requirements of the GMMPMK-based SVMs are lower than those of SVMs using any other dynamic kernel.
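A compact sketch of a codebook-based pyramid match in the spirit of CBPMK: k-means codebooks of decreasing size define the histogram bins per level, and the kernel accumulates down-weighted new matches from fine to coarse levels. Cluster counts per level are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebooks(pooled_vectors, levels=(64, 16, 4)):
    """One k-means codebook per pyramid level, ordered fine-to-coarse."""
    return [KMeans(n_clusters=k, n_init=4, random_state=0).fit(pooled_vectors)
            for k in levels]

def pyramid_histograms(codebooks, feats):
    """feats: (m, d) local feature vectors of one utterance."""
    return [np.bincount(cb.predict(feats), minlength=cb.n_clusters)
            for cb in codebooks]

def pmk(hx, hy):
    """Weighted histogram intersection: only matches newly formed at each
    coarser level count, with geometrically decreasing weight."""
    score, prev = 0.0, 0.0
    for level, (a, b) in enumerate(zip(hx, hy)):
        inter = np.minimum(a, b).sum()
        score += max(inter - prev, 0) / (2 ** level)
        prev = inter
    return score

# Gram matrix K[i, j] = pmk(H[i], H[j]) then feeds SVC(kernel='precomputed')
```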

15.
For classifying large data sets, we propose a discriminant kernel that introduces a nonlinear mapping from the joint space of input data and output labels to a discriminant space. Our method differs from traditional ones, which map nonlinearly from the input space to a feature space. The induced distance of our discriminant kernel is Euclidean and Fisher separable, as it is defined from distance vectors of the feature space to distance vectors on the discriminant space. Unlike support vector machines or kernel Fisher discriminant analysis, the classifier does not need to solve a quadratic programming problem or eigen-decomposition problems; it is therefore especially appropriate for processing large data sets. The classifier can be applied to face recognition, shape comparison and image classification benchmark data sets. The method is significantly faster than other methods and yet delivers comparable classification accuracy.

16.
樊康新 《计算机工程》2009,35(24):191-193
To address deficiencies of the naive Bayes (NB) classifier, such as the model's sensitivity to training samples and the difficulty of improving its classification accuracy, this paper proposes a combined NB text classifier based on multiple feature selection methods. Following the Boosting algorithm, several different feature selection methods are used to build the feature word sets of the texts, the NB classifiers trained on them serve as the base classifiers of the Boosting iterations, and the final combined NB text classifier is produced by weighted voting over the base classifiers. Experimental results show that the combined classifier achieves better classification performance than a single NB text classifier.
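A hedged sketch of the combined classifier: each round trains a multinomial NB base learner on a different feature-selection view and votes with an AdaBoost-style weight. The selector choices, feature counts, and weighting details are assumptions; X is assumed to hold nonnegative term counts.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB

def train_combined_nb(X, y, k=200):
    """One base NB per feature-selection view, boosted sequentially."""
    selectors = [SelectKBest(chi2, k=k),
                 SelectKBest(f_classif, k=k),
                 SelectKBest(mutual_info_classif, k=k)]
    w = np.full(len(y), 1.0 / len(y))              # boosting sample weights
    ensemble = []
    for sel in selectors:
        Xs = sel.fit_transform(X, y)
        clf = MultinomialNB().fit(Xs, y, sample_weight=w)
        miss = clf.predict(Xs) != y
        err = max(w[miss].sum(), 1e-12)
        alpha = 0.5 * np.log((1.0 - err) / err)    # vote weight
        w = w * np.exp(alpha * miss)               # emphasize mistakes
        w = w / w.sum()
        ensemble.append((sel, clf, alpha))
    return ensemble

def predict_combined_nb(ensemble, X, classes):
    """classes = np.unique(y_train); weighted vote over base classifiers."""
    votes = np.zeros((X.shape[0], len(classes)))
    for sel, clf, alpha in ensemble:
        pred = clf.predict(sel.transform(X))
        for ci, c in enumerate(classes):
            votes[:, ci] += alpha * (pred == c)
    return classes[votes.argmax(axis=1)]
```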

17.
Adverse-effect relations between chemicals and diseases have drawn increasing attention to chemical-disease relations. This paper introduces CDRExtractor, a system that extracts chemical-induced disease (CID) relations from the biomedical literature. The system first trains a sentence-level classifier to extract CID relations expressed within a single sentence. In the sentence-level training stage, feature-kernel and graph-kernel features are treated as two independent views, and a semi-supervised co-training method is used to train the model with a small manually annotated training set and a large unlabeled corpus. CDRExtractor then trains a document-level classifier using document-level chemical and disease features to extract cross-sentence CID relations at the document level. Finally, rules are used to merge the outputs of the two classifiers into the final result. Experimental results show that CDRExtractor achieves an F-score of 67.72% on the test set of the CID subtask of the BioCreative V CDR task.
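A minimal co-training loop mirroring the two-view setup above (feature-kernel view vs. graph-kernel view); the base classifier, confidence rule, and round counts are assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X_feat, X_graph, y, U_feat, U_graph, rounds=5, add_per_round=10):
    """X_*: labeled candidate pairs under the two views; U_*: unlabeled pool."""
    X1, X2, y = list(X_feat), list(X_graph), list(y)
    pool = list(range(len(U_feat)))
    c1 = LogisticRegression(max_iter=1000)
    c2 = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        if not pool:
            break
        c1.fit(X1, y)
        c2.fit(X2, y)
        # each view's classifier labels the examples it is most confident on
        for clf, view in ((c1, U_feat), (c2, U_graph)):
            if not pool:
                break
            proba = clf.predict_proba([view[i] for i in pool])
            picks = np.argsort(-proba.max(axis=1))[:add_per_round]
            for p in picks:
                i = pool[p]
                X1.append(U_feat[i])
                X2.append(U_graph[i])
                y.append(clf.classes_[int(proba[p].argmax())])
            for p in sorted(picks, reverse=True):
                pool.pop(p)
    return c1, c2
```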

18.
A New Statistics-Based Automatic Text Classification Method
Automatic text classification means having a computer determine, under a given classification scheme, the categories associated with a text according to its content. To improve classification performance, this paper proposes a multi-level feature extraction method for Chinese text and a kernel-based distance-weighted KNN algorithm. The multi-level feature extraction method extracts statistical features of a document at three levels (Chinese characters, a common-word lexicon, and a domain-specific lexicon) and thus better reflects the document's statistical distribution. The kernel-based distance-weighted KNN algorithm handles multimodal sample distributions, overlapping class boundaries, and the classifier's need for precise classification decisions. In practical applications, the Internet and text repositories provide large amounts of coarsely pre-classified training text, but sample quality is generally poor; this paper addresses that problem with a sample importance analysis technique. An experimental system demonstrates the effectiveness of the new method.
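A minimal sketch of kernel-based distance-weighted KNN, assuming an RBF kernel: the kernel-induced squared distance is k(x,x) - 2k(x,z) + k(z,z), and each of the k nearest neighbors votes with weight inversely proportional to that distance. All parameter values are illustrative.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_weighted_knn(X_train, y_train, x, k=5, gamma=0.5):
    """Vote among the k nearest neighbors under the kernel-induced distance,
    each weighted by the inverse of that distance."""
    d2 = np.array([rbf(x, x, gamma) - 2.0 * rbf(x, z, gamma) + rbf(z, z, gamma)
                   for z in X_train])
    nearest = np.argsort(d2)[:k]
    votes = {}
    for i in nearest:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (d2[i] + 1e-12)
    return max(votes, key=votes.get)
```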

19.
We demonstrate a text-mining method, called the associative naive Bayes (ANB) classifier, for automatically linking MEDLINE documents to the Gene Ontology (GO). The approach is a nontrivial extension of document classification methodology from a fixed set of classes C={c1,c2,…,cn} to a knowledge hierarchy like GO. Because of the complexity of GO, we use a knowledge representation structure, on top of which we develop the ANB classifier that automatically links MEDLINE documents to GO. To assess performance, we compare our datasets under several well-known classifiers: the NB classifier, the large Bayes classifier, the support vector machine, and the ANB classifier. Our results indicate the method's practical usefulness.

20.
For more than a decade, text categorization has been an active field of research in the machine learning community. Most approaches are based on term occurrence frequency. The performance of such surface-based methods can decrease when the texts are too complex, i.e., ambiguous. One alternative is to use semantic-based approaches that process textual documents according to their meaning. Furthermore, research in text categorization has mainly focused on "flat texts", whereas many documents are now semi-structured, especially in the XML format. In this paper, we propose a semantic kernel for semi-structured biomedical documents. The semantic meanings of words are extracted using the unified medical language system (UMLS) framework. The kernel, with an SVM classifier, has been applied to a text categorization task on a medical corpus of free-text documents. The results show that the semantic kernel outperforms the linear kernel and the naive Bayes classifier. Moreover, this kernel was ranked in the top 10 of 44 classification methods at the 2007 Computational Medicine Center (CMC) Medical NLP International Challenge.

