Similar literature
 20 similar documents found (search time: 375 ms)
1.
汪敏  武禹伯  闵帆 《计算机应用》2020,40(12):3437-3444
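To address the low accuracy of traditional lithology identification methods and the difficulty of integrating them with geological expertise, a multi-class active learning algorithm based on multiple clustering algorithms and multiple linear regression (ALCL) is proposed. First, several heterogeneous clustering algorithms are run to obtain a category matrix for each algorithm, and common points are queried to label and pre-classify the category matrices. Second, a maximum-priority search strategy and a most-confused query strategy are proposed to select the key instances used to train the weight coefficient model of the clustering algorithms. Then, an objective function is defined, and the weight coefficient of each clustering algorithm is obtained by training on the key instances. Finally, the weight coefficients are combined in the classification computation so that samples with high-confidence results are classified. Experiments on six public lithology data sets from oil wells in the Daqing Oilfield show that, at its highest classification accuracy, ALCL improves on traditional supervised learning algorithms and other active learning algorithms by 2.07% to 14.01%. Hypothesis testing and significance analysis confirm that ALCL achieves better classification performance on lithology identification.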

3.
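Imbalanced data classification is a hot topic in machine learning research. To address the low minority-class recognition rate of traditional classification algorithms on imbalanced data, this paper proposes an improved AdaBoost classification algorithm based on clustering. The algorithm first performs cluster-based undersampling: K-means clustering is run on the majority-class samples, the cluster centroids are extracted, and as many centroids as there are minority-class samples are combined with all minority-class samples to form a new balanced training set. To keep the training set from becoming too small, and classification accuracy from dropping, when there are very few minority-class samples, a minority oversampling technique is combined with the cluster-based undersampling. Then, borrowing from cost-sensitive learning, the classification error function of the AdaBoost base classifiers is modified to assign asymmetric misclassification losses to samples of different classes. Experimental results show that the algorithm yields highly representative training samples and improves minority-class accuracy while maintaining overall classification performance.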
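
The cluster-based undersampling step in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes a plain Lloyd's K-means written in pure Python, the names `sq_dist`, `kmeans`, and `undersample_majority` are hypothetical, and the oversampling and cost-sensitive AdaBoost steps are left out.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means (illustrative); returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                updated.append([sum(m[d] for m in members) / len(members)
                                for d in range(dim)])
            else:
                updated.append(centroids[i])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return centroids

def undersample_majority(majority, minority, seed=0):
    """Replace the majority class with as many centroids as minority samples.

    Assumes len(minority) <= len(majority). Label 0 marks majority
    centroids and label 1 marks the original minority samples.
    """
    centroids = kmeans(majority, k=len(minority), seed=seed)
    return [(c, 0) for c in centroids] + [(list(m), 1) for m in minority]
```

Calling `undersample_majority(majority, minority)` returns a balanced list of `(point, label)` pairs that any base classifier could then be trained on.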

4.
胡小生  张润晶  钟勇 《计算机科学》2013,40(11):271-275
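Classification of class-imbalanced data is a hot topic in machine learning and data mining research. Traditional classification algorithms are strongly biased and perform poorly on the minority class. A two-layer clustering cascade mining algorithm for class-imbalanced data is proposed. The algorithm first performs cluster-based undersampling: the majority-class samples are clustered, cluster centroids are extracted so that their number matches the number of minority-class samples, and these centroids are combined with all minority-class examples to form a new balanced training set. To keep the training set from becoming too small, and classification accuracy from dropping, when there are very few minority-class samples, SMOTE oversampling is combined with the cluster-based undersampling. Then, on the balanced training set, a classification method that cascades K-means clustering with the C4.5 decision tree algorithm is applied: K-means partitions the training examples into K clusters, C4.5 builds a decision tree within each cluster, and the decision trees on the K clusters refine the classification decision boundary. Experimental results show that the algorithm is advantageous for classifying class-imbalanced data.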

5.
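To address the low classification accuracy and poor stability that result when the label-mean semi-supervised support vector machine randomly selects unlabeled samples in image classification, a semi-supervised support vector machine algorithm based on cluster label means is proposed. The algorithm modifies the original algorithm's penalty term for unlabeled samples: the selected unlabeled samples are clustered, and the cluster label mean replaces the label mean. Experimental results show that a classifier trained with cluster label means greatly reduces misclassification between background and target and improves both classification accuracy and stability, making it suitable for image classification.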

6.
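This paper proposes a two-pass k-means clustering algorithm based on hidden Markov models and uses it to model and cluster gene sequence data. The algorithm first exploits a biological property, namely that the nucleotide ratios of homologous gene sequences tend to agree, to perform an initial k-means clustering of the gene sequences. It then uses the result of the first clustering to train hidden Markov models that characterize the sequences, and finally clusters again with a model-based k-means method. Experimental results show that the algorithm is feasible and achieves good clustering quality.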
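
As a rough illustration of the first pass, each sequence could be mapped to a vector of nucleotide ratios before k-means is applied. The abstract does not specify the exact feature definition, so the per-base frequency used here and the function name `nucleotide_ratios` are assumptions; the hidden Markov model training and the second, model-based pass are not shown.

```python
from collections import Counter

def nucleotide_ratios(seq):
    """Return the fractions of A, C, G, T in a DNA sequence.

    Symbols other than A, C, G, T are ignored (an assumption made here).
    """
    counts = Counter(ch for ch in seq.upper() if ch in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return [0.0, 0.0, 0.0, 0.0]
    return [counts[base] / total for base in "ACGT"]
```

For example, `nucleotide_ratios("AACG")` gives `[0.5, 0.25, 0.25, 0.0]`, a fixed-length vector that any k-means routine can cluster regardless of sequence length.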

7.
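Obtaining the true class of samples in a data stream is expensive, so labeling every sample is impractical, while randomly labeling a subset of samples makes the model unstable. To address these problems, this paper proposes a data stream classification algorithm based on the cluster assumption. Under the cluster assumption, samples that a clustering algorithm assigns to the same cluster are likely to share a class. The clustering result on the training data set is used to fit the sample distribution, and in the classification stage samples that are hard to classify or that may indicate concept drift are deliberately selected to update the model. A separate base classifier is built for the samples of each class in the training data set, so that when a class disappears from or reappears in the data stream, only the corresponding base classifier needs to be frozen or activated, without relearning knowledge that has already been acquired. Experiments show that, while adapting to concept drift, the algorithm reduces the number of samples needed to update the model and achieves classification performance comparable to or better than current data stream classification algorithms.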

8.
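To address the long training time and reduced generalization ability of the support vector machine (SVM) on large-scale data sets, an SVM acceleration algorithm based on boundary sample selection is proposed. First, unsupervised K-means clustering is performed. Then, within each cluster, the K-nearest-neighbor algorithm is applied according to the cluster's mixing degree and support to remove non-boundary samples, and the remaining samples in the class boundary regions are used to train the SVM model. Experiments on standard data sets show that the algorithm significantly reduces training time while preserving the classification generalization ability of the traditional SVM.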
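
One simple way to picture the removal of non-boundary samples is a nearest-neighbor test: a sample is kept only if its k nearest neighbors include a sample of another class. This is a simplified stand-in, not the paper's procedure. It skips the K-means stage and the mixing-degree and support criteria, and the name `boundary_samples` is hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def boundary_samples(points, labels, k=3):
    """Return indices of samples whose k nearest neighbours include another class."""
    keep = []
    for i, p in enumerate(points):
        # Rank all other samples by distance to p (brute force, O(n^2 log n) overall).
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: sq_dist(p, points[j]))
        if any(labels[j] != labels[i] for j in others[:k]):
            keep.append(i)
    return keep
```

Only the returned indices would be passed on to SVM training, which is where the reduction in training time would come from.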

9.
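A P2P traffic identification model combining a K-means ensemble with a support vector machine is proposed to ensure identification accuracy and stability and to overcome the drawbacks of clustering-based identification models, such as hard-to-determine parameter values and high complexity. Base clusterers are trained on a small number of labeled samples using K-means with random cluster centers; cluster labels are assigned by maximum a posteriori probability, and each unlabeled sample takes the label of its nearest cluster. The label information of the unlabeled samples is then combined by voting and, together with the original labeled samples, used to train the SVM identification model. The model exploits the stability of ensemble learning and the good generalization of SVMs on small sample sets. Theoretical analysis and simulation results demonstrate the feasibility of the scheme.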
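
The cluster labeling and voting steps can be sketched as below. This assumes that "maximum a posteriori" labeling reduces to taking the most frequent class among a cluster's labeled members, which may be simpler than the paper's rule. The names `cluster_label` and `vote` are hypothetical, and the K-means base clusterers and the final SVM are omitted.

```python
from collections import Counter

def cluster_label(member_labels):
    """Label a cluster with the most frequent class among its labeled members."""
    return Counter(member_labels).most_common(1)[0][0]

def vote(labels_per_clusterer):
    """Majority vote over the labels each base clusterer proposes for one sample."""
    return Counter(labels_per_clusterer).most_common(1)[0][0]
```

In this sketch, every unlabeled sample would receive one proposed label per base clusterer (from its nearest cluster), and `vote` would collapse those proposals into the single pseudo-label used for SVM training.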

10.
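Classification and clustering of mobile Internet traffic is an important foundation for managing network traffic effectively. However, the mobile Internet traffic data collected in the existing literature differ in source, in label granularity, and in the feature sets used to describe them, so the reported experimental results cannot be compared directly. Using the MobileGT system to collect network traffic generated by mobile apps, this work labels the traffic at two granularities (app level and function level) and derives different feature sets from unidirectional and bidirectional flows. It then comprehensively evaluates, by experiment, the classification and clustering performance of various machine learning algorithms on mobile Internet traffic data under the different label granularities and feature sets. The results show that, among flow statistical features, those based on unidirectional flows perform better; among classification algorithms, random forest and AdaBoost perform better; and among clustering algorithms, K-means performs better.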

11.
A new clustering technique for function approximation   Total citations: 5 (self-citations: 0, citations by others: 5)
To date, clustering techniques have always been oriented toward solving classification and pattern recognition problems. However, some authors have applied them unchanged to construct initial models for function approximators. Classification and function approximation problems nevertheless have quite different objectives, so it is necessary to design new clustering algorithms specialized for function approximation. This paper presents a new clustering technique, specially designed for function approximation problems, which improves the performance of the resulting approximator compared with other models derived from traditional classification-oriented clustering algorithms and input-output clustering techniques.

12.
This paper presents a novel host-based combinatorial method based on k-Means clustering and ID3 decision tree learning algorithms for unsupervised classification of anomalous and normal activities in computer network ARP traffic. The k-Means clustering method is first applied to the normal training instances to partition them into k clusters using Euclidean distance similarity. An ID3 decision tree is constructed on each cluster. Anomaly scores from the k-Means clustering algorithm and decisions of the ID3 decision trees are extracted. A special algorithm is used to combine the results of the two algorithms and obtain final anomaly score values. A threshold rule is then applied to decide whether a test instance is normal. Experiments are performed on captured network ARP traffic. Some anomaly criteria have been defined and applied to the captured ARP traffic to generate normal training instances. Performance of the proposed approach is evaluated using five defined measures and empirically compared with the performance of the individual k-Means clustering and ID3 decision tree classification algorithms and of other proposed approaches based on Markovian chains and stochastic learning automata. Experimental results show that the proposed approach has specificity and positive predictive value as high as 96% and 98%, respectively.
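
The abstract does not say how the two results are combined, so the following is only a hypothetical stand-in showing the general shape of a score combination followed by a threshold rule. The weighted average, the default `weight` and `threshold` values, and the names `combined_score` and `is_anomalous` are all assumptions made for illustration, not the paper's algorithm.

```python
def combined_score(kmeans_score, tree_says_anomaly, weight=0.5):
    """Blend a k-Means anomaly score in [0, 1] with a 0/1 decision-tree verdict.

    A plain weighted average is assumed here purely for illustration.
    """
    tree_score = 1.0 if tree_says_anomaly else 0.0
    return weight * kmeans_score + (1.0 - weight) * tree_score

def is_anomalous(score, threshold=0.5):
    """Threshold rule: flag the instance when the combined score reaches the threshold."""
    return score >= threshold
```

With these defaults, an instance with a k-Means score of 0.8 that its cluster's tree also flags would be reported as anomalous, while one with a score of 0.2 that the tree accepts would not.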

13.
An adaptive flocking algorithm for performing approximate clustering   Total citations: 1 (self-citations: 0, citations by others: 1)
This paper presents an approach based on an adaptive bio-inspired method to make state-of-the-art clustering algorithms scalable and to provide them with any-time behavior. The method is based on the biology-inspired paradigm of a flock of birds, i.e. a population of simple agents interacting locally with each other and with the environment. The flocking algorithm provides a model of decentralized adaptive organization useful for solving complex optimization, classification and distributed control problems. This approach avoids the sequential search of canonical clustering algorithms and permits a scalable implementation. The method is applied to design two novel clustering algorithms based on the main principles of two popular clustering algorithms: DBSCAN and SNN. This approach can identify clusters of widely varying shapes and densities and is able to extract an approximate view of the clusters whenever it is required. Both algorithms have been evaluated on synthetic and real-world data sets, and the impact of the flocking strategy on performance has been evaluated.

14.
P-AutoClass: scalable parallel clustering for mining large data sets   Total citations: 3 (self-citations: 0, citations by others: 3)
Data clustering is an important task in the area of data mining. Clustering is the unsupervised classification of data items into homogeneous groups called clusters. Clustering methods partition a set of data items into clusters, such that items in the same cluster are more similar to each other than items in different clusters according to some defined criteria. Clustering algorithms are computationally intensive, particularly when they are used to analyze large amounts of data. A possible approach to reduce the processing time is based on the implementation of clustering algorithms on scalable parallel computers. This paper describes the design and implementation of P-AutoClass, a parallel version of the AutoClass system based upon the Bayesian model for determining optimal classes in large data sets. The P-AutoClass implementation divides the clustering task among the processors of a multicomputer so that each processor works on its own partition and exchanges intermediate results with the other processors. The system architecture, its implementation, and experimental performance results on different processor numbers and data sets are presented and compared with theoretical performance. In particular, experimental and predicted scalability and efficiency of P-AutoClass versus the sequential AutoClass system are evaluated and compared.

15.
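PCCS partial clustering classification: a fast Web document clustering method   Total citations: 16 (self-citations: 1, citations by others: 15)
PCCS is a partial clustering method for quickly clustering Web documents, intended to help Web users pick out the documents they need from the large number of document snippets returned by a search engine. It first clusters a portion of the documents and then classifies the remaining documents using class models built from the clustering result. An interactive clustering method that refines one cluster digest at a time is used to quickly create a set of cluster digests, and the remaining documents are assigned using a Naive-Bayes classifier. To make clustering and classification more efficient, a hybrid feature selection method is proposed to reduce the dimensionality of the document representation: the entropy of each feature in the documents is recomputed and the top features with the largest entropy are selected, or features are selected based on the feature set of a persistent classification model. Experiments show that the partial clustering method organizes Web documents by topic quickly and accurately, letting users view search engine results at a higher topical level and select relevant documents from clusters of topically similar documents.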

16.
17.
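Research and implementation of a density-based clustering algorithm for complex clusters   Total citations: 3 (self-citations: 2, citations by others: 1)
In applications of clustering to pattern recognition, data analysis, image processing, and market research, the key technical problem is how to cluster complex groups of data objects effectively. Based on an analysis of existing clustering algorithms, an improved algorithm based on density and adaptive density-reachability is proposed. Experiments show that the algorithm effectively handles clusters of arbitrary shape, differing density, and differing scale, and that its computational complexity is markedly lower than that of traditional density-based clustering algorithms.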

18.
Modern-day computers cannot provide an optimal solution to the clustering problem. There are many clustering algorithms that attempt to provide an approximation of the optimal solution. These clustering techniques can be broadly classified into two categories. Techniques in the first category directly assign objects to clusters and then analyze the resulting clusters. Methods in the second category adjust representations of clusters and then determine the object assignments. In terms of disciplines, these techniques can be classified as statistical, genetic-algorithm based, and neural-network based. This paper reports the results of experiments comparing five different approaches: hierarchical grouping, object-based genetic algorithms, cluster-based genetic algorithms, Kohonen neural networks, and the K-means method. The comparisons cover time requirements and within-group errors. The theoretical analyses were tested on the clustering of highway sections and supermarket customers. All the techniques were applied to clustering of highway sections. The hierarchical grouping and genetic algorithm approaches were computationally infeasible for clustering the larger set of supermarket customers. Hence only Kohonen neural networks and K-means were applied to the second set to confirm some of the results from the previous experiments.

19.
Shell clustering algorithms are ideally suited for computer vision tasks such as boundary detection and surface approximation, particularly when the boundaries have jagged or scattered edges and when the range data is sparse. This is because shell clustering is insensitive to local aberrations, it can be performed directly in image space, and unlike traditional approaches it does not assume dense data and does not use additional features such as curvatures and surface normals. The shell clustering algorithms introduced in Part I of this paper assume that the number of clusters is known, however, which is not the case in many boundary detection and surface approximation applications. This problem can be overcome by considering cluster validity. We introduce a validity measure called surface density, which is explicitly meant for the type of applications considered in this paper. We show through theoretical derivations that surface density is relatively invariant to size and partiality (incompleteness) of the clusters. We describe unsupervised clustering algorithms that use the surface density measure and other measures to determine the optimum number of shell clusters automatically, and illustrate the application of the proposed algorithms to boundary detection in the case of intensity images and to surface approximation in the case of range images.

20.