1.
2.
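《计算机应用与软件》 (Computer Applications and Software), 2016, (2)
For automatic hierarchical classification of large-scale text, the K-nearest-neighbor (KNN) algorithm classifies efficiently but is not very accurate for samples near class boundaries. Support vector machine (SVM) classifiers are more accurate, but many earlier multi-class SVM algorithms are built from several independent binary classifiers, which makes training slow and suits hierarchical class structures poorly. This paper proposes an automatic classification method that combines KNN with a hierarchical SVM. First, the KNN algorithm is modified to quickly obtain the class labels of the K nearest neighbors, which are used to filter a document's candidate classes. Then a jointly learned multi-class sparse hierarchical SVM classifier assigns classes top-down, giving efficient and accurate document classification. Experiments show that on both single-level and multi-level classification data sets the method is more accurate than either component used alone, while its classification time is close to that of the faster single classifier.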
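The two-stage idea in this abstract (KNN labels first narrow the candidate classes, then a second classifier decides among them) can be sketched as follows. This is a minimal illustration, not the paper's method: the function names are hypothetical, plain Euclidean KNN stands in for the modified KNN, and `score` is an assumed placeholder for the hierarchical SVM, which is not reproduced here.

```python
import math

def knn_candidate_labels(query, train_points, train_labels, k):
    """Return the set of labels found among the k training points nearest to query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(query, train_points[i]))
    return {train_labels[i] for i in order[:k]}

def classify_with_filter(query, train_points, train_labels, k, score):
    """Pick the best-scoring label, considering only KNN-filtered candidates.

    `score(query, label)` is a hypothetical stand-in for the second-stage
    classifier; higher means a better match.
    """
    candidates = knn_candidate_labels(query, train_points, train_labels, k)
    # sorted() only makes tie-breaking deterministic
    return max(sorted(candidates), key=lambda label: score(query, label))
```

The point of the filter is that the second stage only has to score the few classes that appear among the neighbors, rather than every class in the hierarchy.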
3.
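A multi-feature fusion method based on support vector machines and the k-nearest-neighbor classifier  Cited by: 1 (self-citations: 0, citations by others: 1)
To address the one-sidedness and limited accuracy of traditional methods that rely on a single classifier, and the tendency of SVMs to misclassify points near the separating hyperplane, a multi-feature fusion method based on support vector machines (SVM) and k-nearest neighbors (KNN) is proposed. In the algorithm, the features of the sample set are assumed to be divisible into L groups. First, the SVM algorithm constructs one separating hyperplane from each group of feature data in the training set, L in total. Next, the SVM-KNN method is applied to the test set, yielding a decision profile matrix made up of L groups of posterior probabilities. Finally, multi-feature fusion is applied to this matrix to produce the final classification. Numerical experiments on the Iris data set show that the SVM-KNN-based multi-feature fusion method improves average prediction accuracy by 28.7% and 1.9% over a single SVM and a single SVM-KNN classifier, respectively.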
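As a rough illustration of fusing a decision profile, the sketch below averages the L rows of posterior probabilities and picks the class with the highest fused score. The abstract does not state which fusion rule is used, so the mean rule and the function name `fuse_decision_profile` are assumptions made for this example only.

```python
def fuse_decision_profile(profile):
    """Fuse an L x C decision profile with the mean rule (an assumed choice).

    profile[g][c] is the posterior probability of class c reported by the
    classifier trained on feature group g. Returns (winning class index,
    fused per-class scores).
    """
    n_groups = len(profile)
    n_classes = len(profile[0])
    fused = [sum(row[c] for row in profile) / n_groups for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: fused[c])
    return best, fused
```

For example, with three feature groups reporting `[0.6, 0.3, 0.1]`, `[0.2, 0.5, 0.3]` and `[0.1, 0.6, 0.3]`, the second class wins even though the first group alone would have chosen the first class.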
4.
5.
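A feature selection method based on an improved F-score and support vector machines  Cited by: 1 (self-citations: 0, citations by others: 1)
The traditional F-score, which measures how well a feature discriminates between two classes, is generalized into an improved F-score that measures a feature's discriminative power not only between two classes but also among multiple classes. With the improved F-score as the selection criterion and a support vector machine (SVM) used to evaluate the selected feature subsets, effective feature selection is achieved. Experiments on six data sets from the UCI machine learning repository, with comparisons against SVM and PCA+SVM, show that the method improves classification accuracy, generalizes well, and trains faster than PCA+SVM.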
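One common way to extend the two-class F-score to several classes is to sum the squared deviations of every class mean from the overall mean and divide by the sum of the per-class sample variances. The sketch below computes that ratio for a single feature. It is an assumption that this matches the abstract's improved F-score; the paper's exact definition may differ, and the function name is hypothetical.

```python
def multiclass_f_score(values, labels):
    """F-score of one feature: between-class scatter over within-class variance.

    values[i] is the feature value of sample i and labels[i] its class.
    Each class needs at least two samples, and the within-class variance
    must not be zero for every class.
    """
    overall_mean = sum(values) / len(values)
    by_class = {}
    for value, label in zip(values, labels):
        by_class.setdefault(label, []).append(value)
    between = 0.0
    within = 0.0
    for group in by_class.values():
        mean = sum(group) / len(group)
        between += (mean - overall_mean) ** 2
        # unbiased sample variance of this class
        within += sum((v - mean) ** 2 for v in group) / (len(group) - 1)
    return between / within
```

Features can then be ranked by this score, with higher values indicating class means that are far apart relative to the spread inside each class.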
6.
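Traditional classifiers perform poorly on imbalanced data. To improve classification performance on imbalanced data sets, particularly for minority-class samples, an imbalanced-data classification algorithm based on BSMOTE and inverse under-sampling is proposed. The algorithm uses BSMOTE for over-sampling to synthetically increase the number of minority-class samples, then preferentially removes redundant and noisy samples and applies inverse under-sampling to invert the ratio of minority-class to majority-class samples. Repeating this sampling produces multiple training sets, and Bagging combines the classifiers trained on them to make better use of the available information. Experiments show that, compared with several existing algorithms, the algorithm improves both minority-class classification performance and overall classification accuracy.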
7.
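When a support vector machine (SVM) is trained on imbalanced data, the optimal separating surface shifts, which degrades the generalization of the classification model. To address this, a separating-surface correction method based on the average distribution ratio of Fisher within-class scatter is proposed. After SVM training on the sample data, the normal vector of the separating surface is obtained; the Fisher within-class scatter of the two classes along this normal direction is computed to assess how the two classes are distributed; and the position of the optimal separating surface is then corrected according to the average distribution ratio, which combines within-class scatter with sample counts. Experiments on benchmark data sets show that the method raises the SVM's recognition rate for the minority class on imbalanced data sets and thereby helps improve generalization.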
8.
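This paper proposes an automatic clustering method based on algorithm selection and result evaluation. For a given data set, the method first analyzes the latent cluster structure and, based on the structure found, selects a suitable set of candidate clustering algorithms. It then uses clustering validity indices to evaluate the clustering results of the algorithms in this set, to ensure a high-quality result. Experiments show that the method automatically selects clustering algorithms suited to the data set and obtains high-quality clusterings.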
9.
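A multi-class SVM classifier based on a directed acyclic graph  Cited by: 1 (self-citations: 0, citations by others: 1)
This paper proposes ACDMSVM, a multi-class SVM classifier based on a decision directed acyclic graph and active constraints. For a k-class problem, it combines k(k-1)/2 modified binary SVM classifiers. To speed up training and decision making, the standard binary SVM is modified in three ways: the large-margin method is used, the soft-margin error variables take a 2-norm form, and active constraints are applied. In the training phase, a rooted binary directed acyclic graph with k(k-1)/2 internal nodes and k leaf nodes is used for node selection. Numerical experiments show that this is a fast multi-class SVM classifier.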
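The sketch below illustrates how a decision directed acyclic graph over pairwise classifiers is typically evaluated: each node compares two of the remaining classes and eliminates the loser, so a k-class decision visits k - 1 of the k(k-1)/2 nodes. It shows the generic DAG evaluation scheme only, not ACDMSVM itself; `prefers_first` is a hypothetical stand-in for a trained binary SVM.

```python
def count_pairwise_classifiers(k):
    """Number of binary classifiers (internal DAG nodes) for k classes."""
    return k * (k - 1) // 2

def dag_predict(classes, prefers_first):
    """Walk a decision DAG from the root to a leaf.

    `prefers_first(a, b)` is a hypothetical pairwise classifier that returns
    True when it votes for class a over class b. Each step compares the first
    and last remaining classes and drops the one voted against.
    Returns (predicted class, number of pairwise evaluations).
    """
    remaining = list(classes)
    evaluations = 0
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        if prefers_first(first, last):
            remaining.pop()      # `last` is eliminated
        else:
            remaining.pop(0)     # `first` is eliminated
        evaluations += 1
    return remaining[0], evaluations
```

With four classes there are six pairwise classifiers to train, but any single prediction evaluates only three of them, which is where the decision-time speed of DAG-based schemes comes from.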
10.
11.
12.
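As a key component of wind turbines, rolling bearings are decisive for the safe operation of the whole unit. For rolling-bearing fault diagnosis in wind turbines, a fault diagnosis method based on a node-optimized directed-acyclic-graph large margin distribution machine (O-DAG-LDM) is proposed. Combining the multi-class extensibility of the DAG with the generalization performance of the binary LDM classifier, a DAG-structured LDM multi-class classifier for rolling-bearing fault diagnosis is constructed. Within the DAG-LDM framework, an optimization algorithm arranges the DAG nodes to reduce the accumulated error caused by random ordering and to improve LDM fault classification accuracy. Experiments show that, compared with other mainstream intelligent diagnosis methods, the proposed node-optimized DAG-LDM method achieves higher accuracy and better noise robustness.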
13.
Decision trees for hierarchical multi-label classification  Cited by: 3 (self-citations: 0, citations by others: 3)
Celine Vens, Jan Struyf, Leander Schietgat, Sašo Džeroski, Hendrik Blockeel 《Machine Learning》2008,73(2):185-214
Hierarchical multi-label classification (HMC) is a variant of classification where instances may belong to multiple classes
at the same time and these classes are organized in a hierarchy. This article presents several approaches to the induction
of decision trees for HMC, as well as an empirical study of their use in functional genomics. We compare learning a single
HMC tree (which makes predictions for all classes together) to two approaches that learn a set of regular classification trees
(one for each class). The first approach defines an independent single-label classification task for each class (SC). Obviously,
the hierarchy introduces dependencies between the classes. While they are ignored by the first approach, they are exploited
by the second approach, named hierarchical single-label classification (HSC). Depending on the application at hand, the hierarchy
of classes can be such that each class has at most one parent (tree structure) or such that classes may have multiple parents
(DAG structure). The latter case has not been considered before and we show how the HMC and HSC approaches can be modified
to support this setting. We compare the three approaches on 24 yeast data sets using as classification schemes MIPS’s FunCat
(tree structure) and the Gene Ontology (DAG structure). We show that HMC trees outperform HSC and SC trees along three dimensions:
predictive accuracy, model size, and induction time. We conclude that HMC trees should definitely be considered in HMC tasks
where interpretable models are desired.
14.
This research presents an augmentation of the original contour preserving classification technique that supports multi-class data and reduces the number of synthesized vectors, called multi-class outpost vectors (MCOVs). The technique has been shown to function correctly on both synthetic-problem data sets and real-world data sets. It also includes three methods to reduce the number of MCOVs by using minimum vector distance selection between fundamental multi-class outpost vectors and additional multi-class outpost vectors to select only MCOVs located at the decision boundary between consecutive classes of data. The three MCOV reduction methods are the FF-AA reduction method, the FA-AF reduction method, and the FAF-AFA reduction method. An evaluation has been conducted to show the reduction capability, the contour preservation capability, and the levels of classification accuracy of the three MCOV reduction methods on both non-overlapping and highly overlapping synthetic-problem data sets and highly overlapping real-world data sets. For non-overlapping problems, the experimental results show that the FA-AF reduction method can partially reduce the number of MCOVs while preserving the contour of the problem most accurately and obtaining levels of classification accuracy similar to those obtained with the whole set of MCOVs. For highly overlapping problems, the experimental results show that the FF-AA reduction method can partially reduce the number of MCOVs while preserving the contour of the problem most accurately and obtaining levels of classification accuracy similar to those obtained with the whole set of MCOVs.
15.
Various microarray experiments are now done in many laboratories, resulting in the rapid accumulation of microarray data in public repositories. One of the major challenges of analyzing microarray data is how to extract and select efficient features from it for accurate cancer classification. Here we introduce a new feature extraction and selection method based on information gene pairs that change significantly across different tissue samples. Experimental results on five public microarray data sets demonstrate that the feature subset selected by the proposed method performs well and achieves higher classification accuracy on several classifiers. We perform an extensive experimental comparison of the features selected by the proposed method and features selected by other methods using different evaluation methods and classifiers. The results confirm that the proposed method performs as well as other methods on acute lymphoblastic-acute myeloid leukemia, adenocarcinoma and breast cancer data sets using fewer information genes, and leads to a significant improvement in classification accuracy on colon and diffuse large B cell lymphoma cancer data sets.
16.
Many gene selection methods have been proposed to select a subset of genes that can give high prediction accuracy for cancer classification, and most set the same preference for all genes. However, many biological reports have pointed out that mutated or flawed genes, called risk genes, can be one of the major causes of a specific disease. This study proposes a gene selection method based on the risk genes found in biological reports. The information provided by risk genes can reduce the time complexity of gene selection and increase the accuracy of cancer classification. This gene selection method is composed of two stages. Since all risk genes must be chosen, the first stage is to remove the genes that have similar expression levels or functions to risk genes. The next stage is to perform gene selection and gene replacement based on the results of a process that divides the remaining genes into clusters. Based on the test results from four microarray data sets, our gene selection method outperforms those proposed by previous studies, and genes that have the potential to be new risk genes are presented.
17.
Sounak Chakraborty 《Computational statistics & data analysis》2009,53(4):1462-1474
Since most cancer treatments come with a certain degree of toxicity, it is essential to identify a cancer type correctly and then administer the relevant therapy. With the arrival of powerful tools such as gene expression microarrays, the basis of cancer classification is slowly changing from morphological properties to molecular signatures. Several recent studies have demonstrated a marked improvement in prediction accuracy of tumor types based on gene expression microarray measurements over clinical markers. The main challenge in working with gene expression microarrays is that there is a huge number of genes to work with, of which only a small fraction are actually relevant for differentiating between types of cancer. A Bayesian nearest neighbor model equipped with an integrated variable selection technique is proposed to overcome this challenge. This classification and gene selection model is able to classify different cancer types accurately and simultaneously identify the relevant or important genes. The proposed model is completely automatic in the sense that it adaptively picks the neighborhood size and the important covariates. The method is successfully applied to three simulated data sets and four well-known real data sets. To demonstrate the competitiveness of the method, a comparative study is also done with several other popular "off the shelf" classification methods. For all the simulated and real-life data sets, the proposed method produced highly competitive if not better results. While the standard approach is two-step model building, with gene selection followed by tumor prediction, this novel adaptive gene selection technique automatically selects the relevant genes along with tumor class prediction in one go. The biological relevance of the selected genes is also discussed to validate the claim.
18.
Hepatitis is a disease seen at all ages. Hepatitis alone is not lethal, but its early diagnosis and treatment are crucial because it triggers other diseases. In this study, a new hybrid medical decision support system based on rough set (RS) and extreme learning machine (ELM) is proposed for the diagnosis of hepatitis. RS-ELM consists of two stages. In the first, redundant features are removed from the data set through the RS approach. In the second, classification is performed through ELM using the remaining features. The hepatitis data set from the UCI machine learning repository is used to test the proposed hybrid model. A major part of the data set (48.3%) includes missing values. Because removing missing values from the data set leads to data loss, feature selection is done in the first stage without deleting missing values. In the second stage, classification is performed through ELM after the removal of missing values from the sub-featured data sets, which were reduced to different dimensions. The results show that the highest classification accuracy, 100.00%, was achieved through RS-ELM, and that the RS-ELM model was considerably more successful than the other methods in the literature. Furthermore, the most significant features for the diagnosis of hepatitis are determined in this study. The proposed method is expected to be useful in similar medical applications.
19.
Luca Scrucca 《Computational statistics & data analysis》2007,52(1):438-451
The monitoring of the expression profiles of thousands of genes has proved to be particularly promising for biological classification. DNA microarray data have recently been used for the development of classification rules, particularly for cancer diagnosis. However, microarray data present major challenges due to their complex, multiclass nature and the overwhelming number of variables characterizing gene expression profiles. A regularized form of sliced inverse regression (REGSIR) is proposed. It allows the simultaneous development of classification rules and the selection of those genes that are most important in terms of classification accuracy. The method is illustrated on some publicly available microarray data sets. Furthermore, an extensive comparison with other classification methods is reported. The REGSIR performance is comparable with the best classification methods available, and when appropriate feature selection is made the performance can be considerably improved.
20.
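Support vector machines are among the most effective classification techniques, with high classification accuracy and good generalization, but training them on large data sets remains very expensive. A classification method based on one-class support vector machines is therefore proposed. A random selection algorithm reduces the training set to speed up training, while classification accuracy is preserved by restoring the neighborhoods, in the original data, of the samples that lie in the intersection of the hyperspheres. Experiments show that the method substantially reduces computational complexity and thus speeds up training on large data sets.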