Similar literature
1.
With advances in computer technology, virtual screening of small molecules has come into use in drug discovery. Because the early phase of drug discovery involves thousands of compounds, a fast classification method that can distinguish active from inactive molecules can be used to screen large compound collections. In this study, we used Support Vector Machines (SVM) for this classification task. SVM is a powerful classification tool that is becoming increasingly popular in machine-learning applications. The data consist of 631 compounds in the training set and 216 compounds in a separate test set. In the data pre-processing step, Pearson's correlation coefficient was used as a filter to eliminate redundant features. After the correlation filter was applied, a single SVM was applied to the reduced data set. We also investigated the performance of SVM with different feature selection strategies, including SVM–Recursive Feature Elimination, the Wrapper Method and Subset Selection. All feature selection methods generally perform better than a single SVM, and Subset Selection outperforms the other feature selection methods. We tested SVM as a classification tool on a real-life drug discovery problem, and our results show that it could be a useful classification method in the early phase of drug discovery.
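The correlation filter mentioned in this abstract can be illustrated with a short sketch. The abstract does not give the cutoff or the order in which features are examined, so the greedy scan, the `threshold=0.9` value and the toy descriptor columns below are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant column counts as uncorrelated
    return cov / (sx * sy)


def correlation_filter(columns, threshold=0.9):
    """Return the indices of the columns kept by a greedy redundancy filter.

    A column is dropped when its absolute correlation with any column
    already kept exceeds `threshold` (a hypothetical cutoff).
    """
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) <= threshold for k in kept):
            kept.append(j)
    return kept


# Toy data: each inner list is one feature measured on four compounds.
features = [
    [1.0, 2.0, 3.0, 4.0],  # descriptor A
    [2.0, 4.0, 6.0, 8.0],  # exactly 2 * A, so redundant with A
    [1.0, 0.0, 1.0, 0.0],  # only weakly correlated with A
]
kept = correlation_filter(features)
```

Here the second column is dropped because it is a scaled copy of the first; the surviving columns would then be passed to the classifier.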

2.
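Tao Xinmin, Tong Zhijing, Liu Yu, Fu Dandan. Control and Decision (《控制与决策》), 2011, 26(10): 1535-1541
To address the poor classification performance of the traditional support vector machine (SVM) algorithm on imbalanced data, and to improve SVM classification performance on imbalanced data sets, a new stepwise-optimized decreasing under-sampling algorithm is proposed. The algorithm removes the large number of overlapping redundant and noisy samples, so that more useful information is retained while the data are reduced, and it is combined with the borderline synthetic minority over-sampling algorithm to balance the training data set. Experiments show that the algorithm not only effectively improves the classification performance of SVM on the minority class in imbalanced data but also improves overall classification performance.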

3.
Software developers, testers and customers routinely submit issue reports to software issue trackers to record the problems they face when using software. The issues are then directed to appropriate experts for analysis and fixing. However, submitters often misclassify an improvement request as a bug and vice versa. This costs valuable developer time. Hence automated classification of the submitted reports would be of great practical utility. In this paper, we analyze how machine learning techniques may be used to perform this task. We apply different classification algorithms, namely naive Bayes, linear discriminant analysis, k-nearest neighbors, support vector machine (SVM) with various kernels, decision tree and random forest, separately to classify the reports from three open-source projects. We evaluate their performance in terms of F-measure, average accuracy and weighted average F-measure. Our experiments show that random forests perform best, while SVM with certain kernels also achieves high performance.
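For readers unfamiliar with the metrics named in this abstract, the sketch below computes a per-class F-measure and a support-weighted average. The abstract does not define its weighting, so weighting by class support (true positives plus false negatives) is an assumption, and the counts are invented for illustration.

```python
def f_measure(tp, fp, fn):
    """F-measure (harmonic mean of precision and recall) for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f_measure(per_class):
    """Average the per-class F-measures, weighted by class support.

    `per_class` is a list of (tp, fp, fn) tuples; the support of a class
    is tp + fn (assumed weighting, not stated in the abstract).
    """
    total = sum(tp + fn for tp, fp, fn in per_class)
    return sum((tp + fn) * f_measure(tp, fp, fn) for tp, fp, fn in per_class) / total


# Hypothetical counts for a two-class "bug" vs "improvement" problem.
bug = (8, 2, 2)          # tp, fp, fn
improvement = (3, 2, 2)  # tp, fp, fn
score = weighted_f_measure([bug, improvement])
```

With these made-up counts the two classes score 0.8 and 0.6, and the larger class pulls the weighted average towards 0.8.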

4.
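Angiotensin-converting enzyme inhibitors (ACEI) are of great importance in the treatment of hypertension. Based on a candidate small-molecule data set built from a database of structurally complex compounds, molecular docking was used to screen samples from the data set for building classification models. Support vector machine, K-nearest neighbor, decision tree, random forest and Bayesian methods were each used to build classification models of potential angiotensin-converting enzyme inhibitors and non-inhibitors. A comparison of the results shows that the support vector machine has a higher prediction rate than the other methods, with an overall prediction rate of 82.4% and a correlation coefficient of 0.653. The study shows that the support vector machine method performs well in virtual screening of angiotensin-converting enzyme inhibitors.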

5.
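To improve the accuracy of computer-aided diagnosis (CAD) for breast cancer, a support vector machine method with feature weighting based on the Gini index of a random forest model (RFG-SVM) is proposed. The method uses the Gini index under the random forest model to measure the importance of each feature to the classification result, constructs a support vector machine whose kernel function operates on weighted feature vectors, and applies it to breast cancer diagnosis. Theoretical analysis and experimental validation show that, compared with the traditional support vector machine (SVM), the method improves classification and prediction performance; its results are also competitive with the latest methods, and it has advantages in medical diagnosis applications.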
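One plausible reading of a kernel on weighted feature vectors is an RBF kernel in which each feature difference is scaled by its importance. The abstract does not state the kernel form, so the function below, its `gamma` parameter and the importance values are assumptions, not the authors' formulation.

```python
import math


def weighted_rbf_kernel(x, y, weights, gamma=1.0):
    """RBF kernel on feature vectors scaled element-wise by `weights`.

    K(x, y) = exp(-gamma * sum_i (w_i * (x_i - y_i))^2)
    This particular form is an assumed illustration.
    """
    sq_dist = sum((w * (a - b)) ** 2 for w, a, b in zip(weights, x, y))
    return math.exp(-gamma * sq_dist)


# Hypothetical Gini-based importances for three features.
importances = [0.7, 0.3, 0.0]

a = [1.0, 2.0, 5.0]
b = [1.0, 2.0, -5.0]  # differs from `a` only in the zero-weight feature
c = [2.0, 2.0, 5.0]   # differs from `a` in the most important feature

k_ab = weighted_rbf_kernel(a, b, importances)
k_ac = weighted_rbf_kernel(a, c, importances)
```

Under this assumed form, a feature with zero importance has no effect on similarity, while a difference in a highly weighted feature lowers the kernel value.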

6.
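Lithology identification method based on multi-layer ensemble learning   Cited by: 1 (self-citations: 0, citations by others: 1)
Lithology identification is a key and difficult problem in reservoir geological interpretation, and the development and application of artificial intelligence, especially machine learning, offers a new technical route to solving it. This paper uses machine learning models including support vector machine (SVM), multi-grained cascade forest (GCForest), random forest (RF) and XGBoost (eXtreme gradient boosting) to build a heterogeneous multi-layer ensemble learning model, which overcomes the shortcomings of single models: high demands on the data set, poor generalization and low identification accuracy. Lithology identification experiments were carried out with both the ensemble model and the single models. The results show that the ensemble model reaches an average accuracy of 96.66% on the lithology classification test set, higher than the average accuracies of SVM (75.53%), GCForest (96.21%), random forest (95.06%) and XGBoost (95.77%). The ensemble model can be used effectively for lithology identification and classification in reservoir geological analysis, with strong adaptability and high identification accuracy.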

7.
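To monitor the machining quality of the hole group on the top surface of an engine cylinder block dynamically and in real time, a graded machining-quality monitoring model at process nodes is proposed that combines random forest (RF) and support vector machine (SVM). A machine vision system is designed to acquire images of the cylinder block hole group quickly between processes, extracting 7 feature parameters for each single cylinder hole and 3 spacings between adjacent holes. Principal component analysis is used to reduce the dimensionality of the feature parameters, and sample sets are built to train an RF graded monitoring model for the overall machining quality of the hole group and an SVM graded monitoring model for the machining quality of single holes. The model was applied to online monitoring of the machining quality of the top-surface hole group of an engine cylinder block. The results show that, compared with the decision tree, RF and SVM models, the proposed model reaches a grading accuracy of 97.778% for the overall hole group and 99.167% for single holes; it responds quickly to quality feedback control in cylinder block manufacturing and can effectively solve related practical engineering problems.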

8.
Mapping tropical forests to a sufficient level of spatial resolution and structural detail is a prerequisite for their rational management, yet it remains a largely unmet challenge. We explore the degree to which a forest canopy height model (CHM) derived from airborne laser scanning (ALS) can discriminate between five forest types of similar height but varying structure or composition. We systematically compare various textural features (Haralick, Fourier transform-based, and wavelet-based features) and various classification procedures (linear discriminant analysis (LDA), random forest (RF), and support vector machine (SVM)) applied to two sizes of sampling units (64 m × 64 m and 32 m × 32 m). Simple height distribution statistics achieve at best 70% classification accuracy in our sample set comprising 120 sampling units of 64 m × 64 m. Using wavelet-based features, this accuracy increases to 79% but drops by 10% with smaller sampling units (32 m × 32 m). Classifier performance depends on the texture feature set used, but SVM and RF tend to perform better than LDA. High discrimination rates between forest types of similar height indicate that the ALS-derived CHM provides information suitable for mapping tropical forest types. Wavelet-based texture features coupled with an SVM classifier were found to be the most promising combination of methods. Ancillary data derived from laser scans, notably topography, could be used jointly for an improved segmentation scheme.

9.
In recent years, considerable attention has been devoted to research on the link prediction (LP) problem in complex networks. The problem is to predict the likelihood that an association will appear in the future between two nodes of a network that are not currently connected. One of the most important approaches to the LP problem is based on supervised machine learning (ML) techniques for classification. Although many works have presented promising results with this approach, choosing the set of features (variables) to train the classifiers is still a major challenge. In this article, we report on the effects of three different automatic variable selection strategies (Forward, Backward and Evolutionary) applied to the feature-based supervised learning approach in LP applications. The results of the experiments show that the use of these strategies does lead to better classification models than classifiers built with the complete set of variables. The experiments were performed over three datasets (Microsoft Academic Network, Amazon and Flickr) that contained more than twenty different features each, including topological and domain-specific ones. We also describe the specification and implementation of the process used to support the experiments. It combines the feature selection strategies, six different classification algorithms (SVM, K-NN, naïve Bayes, CART, random forest and multilayer perceptron) and three evaluation metrics (Precision, F-Measure and Area Under the Curve). Moreover, this process includes a novel approach, inspired by ML voting committees, that suggests sets of features to represent data in LP applications. It mines the log of the experiments in order to identify sets of features frequently selected to produce classification models with high performance. The experiments showed interesting correlations between frequently selected features and datasets.

10.
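Taking Changbai Mountain as the test area, CHRIS/PROBA hyperspectral zero-degree-angle remote sensing data were selected and preprocessed, and several commonly used hyperspectral classification methods, namely maximum likelihood classification (MLC), minimum distance, support vector machine (SVM) and spectral angle mapper (SAM), were applied to classify forest types in the image. The classification results were validated with confusion matrices. In hyperspectral forest-type classification, SVM had the highest overall accuracy at 84.60%, followed by MLC at 83.53%, minimum distance at 73.81% and SAM at 56.49%. The Kappa coefficients, from high to low, were SVM 0.78, MLC 0.77, minimum distance 0.68 and SAM 0.52. The comparison shows that SVM achieved the highest accuracy, indicating the practicality and advantages of this method for hyperspectral remote sensing forest classification.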
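The two validation figures reported in this abstract, overall accuracy and the Kappa coefficient, are both derived from a confusion matrix. The sketch below shows the standard calculation; the 2 × 2 matrix is made up for illustration and is not taken from the study.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def cohen_kappa(cm):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(n)) / total
    # Chance agreement: sum over classes of (row total * column total) / N^2.
    expected = sum(
        sum(cm[i]) * sum(row[i] for row in cm) for i in range(n)
    ) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical confusion matrix: rows are reference classes,
# columns are predicted classes.
cm = [
    [40, 10],
    [5, 45],
]
accuracy = overall_accuracy(cm)
kappa = cohen_kappa(cm)
```

For this invented matrix the observed agreement is 0.85 and the chance agreement is 0.5, so Kappa is lower than the raw accuracy, which is the usual relationship between the two figures.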

11.
Support vector machine (SVM) is a novel pattern classification method that is valuable in many applications. Kernel parameter setting in the SVM training process, along with feature selection, significantly affects classification accuracy. The objective of this study is to obtain better parameter values while also finding a subset of features that does not degrade the SVM classification accuracy. This study develops a simulated annealing (SA) approach for parameter determination and feature selection in the SVM, termed SA-SVM. To evaluate the proposed SA-SVM approach, several datasets from the UCI machine learning repository are used to calculate the classification accuracy rate. The proposed approach was compared with grid search, which is a conventional method of parameter setting, and with various other methods. Experimental results indicate that the classification accuracy rates of the proposed approach exceed those of grid search and other approaches. The SA-SVM is thus useful for parameter determination and feature selection in the SVM.

12.
Most studies have been based on the original computation mode of semivariogram and discrete semivariance values. In this paper, a set of texture features is described to improve the accuracy of object-oriented classification of remotely sensed images. We propose a classification method, support vector machine (SVM) with spectral information and texture features (ST-SVM), which incorporates texture features of remotely sensed images into the SVM. Using kernel methods, the spectral information and texture features are used jointly for classification in an SVM formulation. The texture features are calculated from segmented block-matrix image objects using the panchromatic band. A comparison of classification results on real-world data sets demonstrates that the texture features in this paper are a useful supplement to spectral information in object-oriented classification, and that the proposed ST-SVM achieves higher classification accuracy than the traditional SVM method with only spectral information.

13.
Biometric authentication is the process that allows an individual to be identified based on a set of unique biological feature data. In this study, we present different experiments that use cardiac sound signals (phonocardiogram, "PCG") as a biometric authentication trait. We applied different feature extraction approaches and different classification techniques to use the PCG as a biometric trait. In all experiments, data acquisition is based on collecting the cardiac sounds from the HSCT-11 and PASCAL CHSC2011 datasets, while preprocessing is concerned with de-noising of cardiac sounds using multiresolution-decomposition and multiresolution-reconstruction (MDR-MRR). The de-noised signal is then segmented based on frame-windowing and Shannon energy (SE) methods. For feature extraction, Cepstral (Cp) domain (based on mel-frequency) and time-scale (T-S) domain (based on Wavelet Transform) features are extracted from the de-noised signal after segmentation. The features extracted from the Cp domain and the T-S domain are fed to four different classifiers: artificial neural networks (ANN), support vector machine (SVM), random forest (RF) and K-nearest neighbor (KNN). The performance of the classifiers is assessed using k-fold cross validation. The computational complexity of the feature extraction domains is expressed using Big-O notation. The T-S features are superior to PCG heart signals in terms of the classification accuracy. The experimental results give the highest classification accuracy with the lowest computational complexity for RF in the Cp domain and for SVM and ANN in the T-S domain.
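The frame-windowing and Shannon energy step can be sketched as follows. The abstract names the technique but not its formula, so the commonly used average Shannon energy, -(1/N) Σ x² log(x²) on a peak-normalized signal, is assumed here; the frame length, hop size and sample values are hypothetical.

```python
import math


def shannon_energy(frame):
    """Average Shannon energy of a frame of normalized samples.

    Uses -(1/N) * sum(x^2 * log(x^2)), with 0 * log(0) taken as 0.
    """
    total = 0.0
    for x in frame:
        p = x * x
        if p > 0:
            total -= p * math.log(p)
    return total / len(frame)


def frame_energies(signal, frame_len, hop):
    """Normalize `signal` to a peak of 1 and return one energy per frame."""
    peak = max(abs(s) for s in signal) or 1.0
    norm = [s / peak for s in signal]
    return [
        shannon_energy(norm[i:i + frame_len])
        for i in range(0, len(norm) - frame_len + 1, hop)
    ]


# Hypothetical de-noised samples; real PCG data would be far longer.
signal = [0.0, 2.0, 0.0, 1.0]
energies = frame_energies(signal, frame_len=2, hop=2)
```

Shannon energy emphasizes medium-intensity samples over both very small and very large ones, which is why samples at the normalized peak (value 1) contribute zero in this sketch.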

14.
We previously investigated the classification and prediction of dopamine D1 receptor agonists and antagonists using a topological fragment spectra (TFS)-based support vector machine (SVM), in which the dataset contained noise compounds that had no D1 receptor activity. This work extended the dataset to seven activity classes (dopamine D1, D2, and auto-receptor agonists, and D1, D2, D3, and D4 antagonists) and increased the noise ratio to ten times that of active compounds. In total, this study used 16,008 compounds for training and 1,779 compounds for prediction. The TFS-based SVM gave good, stable results for both classification and prediction, even in the case that included ten times the noise data. The resulting model correctly predicted 97.6% of the prediction set of 1,779 compounds.

15.
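Because the classification performance of random forests is affected by between-class imbalance and within-class irregularity in the sample set, a random forest optimization method based on a cluster under-sampling strategy is proposed. The method clusters the majority-class samples of the original data into as many sub-clusters as there are minority-class samples; draws one sample at random, with replacement, from each sub-cluster and merges these with the minority-class samples to form a balanced sample set; draws a bootstrap sample (random sampling with replacement) from the balanced set to form the training set of a single decision tree and builds the tree; uses the samples not selected in either sampling step as out-of-bag data for model testing; and repeats this process many times to form the random forest. Experiments on 10 imbalanced data sets show that the method outperforms the traditional random forest in both classification ability and stability on these data sets.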
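The sampling procedure for a single tree can be sketched as below. The clustering step is assumed to have been done already, since the abstract does not name the clustering algorithm; the function name `draw_training_set`, the sample identifiers and the cluster contents are hypothetical.

```python
import random


def draw_training_set(majority_clusters, minority, rng):
    """Build the balanced set, bootstrap set and out-of-bag set for one tree.

    `majority_clusters` is a list of sub-clusters (lists of sample ids),
    one per minority sample; `minority` is a list of minority sample ids.
    """
    # Step 1: draw one majority sample at random from each sub-cluster
    # and merge with all minority samples to get a balanced set.
    picked = [rng.choice(cluster) for cluster in majority_clusters]
    balanced = picked + list(minority)

    # Step 2: bootstrap (sampling with replacement) from the balanced set.
    bootstrap = [rng.choice(balanced) for _ in balanced]

    # Out-of-bag data: every original sample not used to train this tree,
    # i.e. not selected in step 1 or not selected again in step 2.
    all_ids = [s for cluster in majority_clusters for s in cluster] + list(minority)
    used = set(bootstrap)
    oob = [s for s in all_ids if s not in used]
    return balanced, bootstrap, oob


clusters = [["a1", "a2"], ["b1", "b2", "b3"], ["c1"]]
minority = ["m1", "m2", "m3"]
balanced, bootstrap, oob = draw_training_set(clusters, minority, random.Random(0))
```

Repeating this call with fresh random draws, and fitting one decision tree on each `bootstrap` set, would give the forest described in the abstract.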

16.
《Real》2003,9(3):179-188
A real-time implementation of an approximation of the support vector machine (SVM) decision rule is proposed. This method is based on an improvement of a supervised classification method using hyperrectangles, which is useful for real-time image segmentation. The final decision combines the accuracy of the SVM learning algorithm and the speed of a hyperrectangles-based method. We review the principles of the classification methods and we evaluate the hardware implementation cost of each method. We present the combination algorithm, which consists of rejecting ambiguities in the learning set using the SVM decision before using the learning step of the hyperrectangles-based method. We present results obtained using Gaussian distributions and give an example of image segmentation from an industrial inspection problem. The results are evaluated in terms of hardware cost as well as classification performance.

17.
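To address the low detection rate of rare attack classes when traditional machine learning methods handle massive, imbalanced, high-dimensional data, an intrusion detection model is proposed that combines deep learning with the random forest algorithm. To avoid the low classification accuracy, poor stability and low detection rate for rare attack classes that traditional random forests show on high-dimensional and imbalanced data, a generative adversarial network (GAN) and a stacked denoising autoencoder (SDAE) are introduced to improve the random forest (RF) algorithm. The rare attack-class data are fed into the GAN to generate new attack samples, which alleviates the imbalanced distribution of network intrusion data in the sample set. Stacked deep SDAE layers extract the distribution patterns of the network data layer by layer and, together with the coefficient penalty and reconstruction error of each encoding layer, determine the features in the high-dimensional data that are related to intrusion behaviour; the decision trees of the forest are then built on the dimension-reduced features. Experiments on the UNSW-NB15 data set show that, compared with SVM, KNN, CNN, LSTM and DBN, GAN-SDAE-RF improves overall detection accuracy by 9.39% on average, reduces the false alarm rate and missed detection rate by 9% and 15.24% on average, and improves the detection rate on the minority classes Analysis, Shellcode, Backdoor and Worms by 26.8%, 27.98%, 27.85% and 39.97%, respectively.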

18.
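Because identifying malicious TLS traffic with traditional machine learning methods depends heavily on expert experience and gives unsatisfactory identification and classification results, the HNNIM (Hybrid Neural Network Identification Model) model is proposed for identification and classification. The model has two layers: the first extracts features and the second performs identification and classification. In the first layer, the extracted features fall into two parts: one part is mined automatically by a deep neural network, and the other is selected on the basis of expert experience and further filtered by a deep neural network. The second layer aggregates the features selected by the first layer and uses a fully connected deep neural network for further learning and fitting. After analysis of a large number of TLS traffic samples, the ClientHello and ServerHello messages in TLS traffic and the TCP protocol interaction information were chosen as the feature space. Experimental results show that HNNIM achieves an F1 score of 0.989 on malicious samples in the malicious TLS traffic identification task, an improvement of 0.016, 0.016, 0.019 and 0.043 over random forest, SVM, XGBoost and convolutional neural network models, respectively; its average accuracy on the multi-class task is 89.28%, an improvement of 9.92%, 9.09%, 11.31% and 7.03% over the same four models, respectively.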

19.
Estimating the extent of tropical rainforest types is needed for biodiversity assessment and carbon accounting. In this study, we used statistical comparisons to determine the ability of Landsat Thematic Mapper (TM) bands and spectral vegetation indices to discriminate composition and structural types. A total of 144 old-growth forest plots established in northern Costa Rica were categorized via cluster analysis and ordination. Locations for palm swamps, forest regrowth and tree plantations were also acquired, making 11 forest types for separability analysis. Forest types classified using support vector machines (SVM), a theoretically superior method for solving complex classification problems, were compared with the random forest decision tree classifier (RF). Separability comparisons demonstrate that spectral data are sensitive to differences among forest types when tree species and structural similarity is low. SVM class accuracy was 66.6% for all forest types, minimally higher than the RF classifier (65.3%). TM bands and the Normalized Difference Vegetation Index (NDVI) combined with digital elevation data notably increased accuracies for SVM (84.3%) and RF (86.7%) classifiers. Rainforest types discriminated here are typically limited to one or two categories for remote sensing classifications. Our results indicate that TM bands and ancillary data combined via machine learning algorithms can yield accurate and ecologically meaningful rainforest classifications important to national and international forest monitoring protocols.
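The NDVI used as a predictor in this abstract is the standard ratio (NIR - Red) / (NIR + Red); for Landsat TM the near-infrared and red bands are bands 4 and 3. The sketch below computes it per pixel. The reflectance values are invented, and returning 0.0 when both bands are zero is an assumption about how empty pixels are handled.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    denom = nir + red
    if denom == 0:
        return 0.0  # assumption: no signal in either band
    return (nir - red) / denom


# Hypothetical (NIR, red) reflectance pairs for three pixels.
pixels = [(0.5, 0.1), (0.2, 0.2), (0.0, 0.0)]
values = [ndvi(nir, red) for nir, red in pixels]
```

Dense green vegetation reflects strongly in the near-infrared and absorbs red light, so it yields values well above zero, as the first pixel does here.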

20.