Similar Literature
 Found 20 similar articles; search took 656 ms
1.
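Ma Fumin, Sun Jingyong, Zhang Tengfei. Control and Decision (《控制与决策》), 2022, 37(11): 2968-2976
Building on existing clustering results, the key to improving incremental clustering quality is how to measure the cluster membership of newly added data. Existing incremental clustering algorithms mostly consider the positional distribution of the new data and ignore the membership information of its neighboring data points. Based on the rough K-means clustering algorithm, and targeting the handling of uncertain information for new data points in boundary regions, a rough K-means incremental clustering algorithm based on neighborhood membership information is proposed. The algorithm jointly considers the positional distribution of new data samples in the boundary region and the cluster membership of their neighboring data points, which makes the membership measure between new data points and each cluster more reasonable. In addition, during incremental clustering, clusters are merged or split according to the changes in cluster structure caused by the new data points, so that the cluster partition adjusts adaptively. Comparative experiments on synthetic data sets and UCI benchmark data sets verify the effectiveness of the algorithm.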
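
The boundary-region idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the ratio threshold `ratio`, the neighbour count `k`, and both function names are assumptions introduced here. A new point whose second-nearest centre is almost as close as the nearest one is treated as a boundary point, and the cluster labels of its nearest already-clustered neighbours are then counted as extra membership evidence.

```python
import math

def rough_assign(x, centers, ratio=1.3):
    """Return (nearest, candidates); more than one candidate marks x as a boundary point.

    `ratio` is a hypothetical threshold, not a value taken from the paper.
    """
    dists = [math.dist(x, c) for c in centers]
    nearest = min(range(len(centers)), key=dists.__getitem__)
    d_min = dists[nearest]
    # A cluster is a candidate when its centre is almost as close as the nearest one.
    candidates = [j for j, d in enumerate(dists) if d <= ratio * d_min]
    return nearest, candidates

def neighbourhood_vote(x, labelled_points, candidates, k=3):
    """Count how many of the k nearest clustered points belong to each candidate cluster.

    labelled_points is a list of (point, cluster_index) pairs.
    """
    nearest_pts = sorted(labelled_points, key=lambda p: math.dist(x, p[0]))[:k]
    votes = {j: 0 for j in candidates}
    for _, label in nearest_pts:
        if label in votes:
            votes[label] += 1
    return votes
```

A point with a single candidate would go to that cluster's lower approximation; a point with several candidates stays in their upper approximations, and the vote gives one possible way to weight its membership.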

2.
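To address the high parameter sensitivity and the high time and space complexity of existing incremental clustering algorithms, an incremental clustering algorithm based on representative points is proposed. First, a representative-point clustering algorithm is applied to the static database. Then, based on the relationship between each newly added node and the existing representative points, the algorithm decides whether to add the node to the cluster of an existing representative point or to promote it to a new representative point. Finally, the representative-point clustering algorithm is applied again. Experimental results show that the algorithm has low parameter sensitivity, high efficiency, and a small memory footprint.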

3.
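At present, most search result clustering methods are document-based and cannot generate meaningful cluster labels. To solve this problem, a Chinese search result clustering method based on clustering key noun phrases is proposed. The method takes noun phrases and related search terms as candidate cluster labels, filters the labels using the C-Value algorithm and IDF values, clusters the labels with the Chameleon algorithm, and finally assigns the search results to the most relevant clusters. Experiments show that using key noun phrases and related search terms as cluster labels effectively improves the descriptiveness of the labels and reduces the time complexity of the clustering algorithm.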

4.
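A Web search result clustering method based on tolerance rough sets   Total citations: 1, self-citations: 0, citations by others: 1
Some Web clustering methods treat clusters as strictly mutually exclusive, which gives unsatisfactory results. A k-means clustering method based on tolerance rough sets addresses this problem. First, Web document information is represented with the vector space model, and a set of text feature terms is obtained by conventional methods. Then, the co-occurrence of certain feature terms is exploited to construct a tolerance relation over feature terms, which extends their descriptive power. Finally, the tolerance classes of feature terms are used to describe the similarity between documents, which yields the Web search result clustering, and a simple, intuitive T model for measuring clustering accuracy is proposed. Experimental results show that the cluster labels obtained with tolerance-relation clustering are descriptive and easy to understand, and that the method clearly outperforms the ordinary k-means algorithm.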

5.
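Zhang Junbo, Li Tianrui, Pan Yi, Luo Chuan, Teng Fei. Journal of Software (《软件学报》), 2015, 26(5): 1064-1078
Processing massive data that is increasingly complex and dynamic is a widely shared concern, and one of its core problems is how to use existing information to update knowledge quickly. Granular computing is a research field that has emerged in recent years. It is a new concept and computing paradigm for information processing, used mainly to describe and process uncertain, fuzzy, incomplete, and massive information, and to provide a problem-solving method based on granules and the relations between them. As an important part of granular computing theory, rough set theory is an effective mathematical tool for handling uncertainty and imprecision. Based on MapReduce, the parallel model used in cloud computing, algorithms are given for computing in parallel the equivalence classes, the decision classes, and the associations between them in rough sets. A parallel algorithm for computing rough approximations on large-scale data is then designed. To cope with massive data that changes dynamically, two parallel algorithms for incrementally updating rough approximations are designed by combining the MapReduce model with incremental updating under different incremental strategies. Experimental results show that the algorithms update knowledge quickly and effectively, and that the benefit grows with the data volume.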
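
As a hedged, single-machine illustration of the quantities named in the abstract above, the sketch below groups objects into equivalence classes by their condition-attribute values and derives the lower and upper approximations of one decision class. The grouping imitates a map step (emit the condition values as a key) followed by a reduce step (collect by key); it does not use a real MapReduce framework, and the function name and row layout are assumptions made here.

```python
from collections import defaultdict

def rough_approximations(rows, target):
    """rows: list of (object_id, condition_tuple, decision).

    Returns (lower, upper) sets of object ids for the decision class `target`.
    """
    # "Map": key each object by its condition values; "reduce": group by that key.
    classes = defaultdict(list)
    for obj_id, cond, decision in rows:
        classes[cond].append((obj_id, decision))

    lower, upper = set(), set()
    for members in classes.values():
        ids = {obj_id for obj_id, _ in members}
        decisions = {d for _, d in members}
        if target in decisions:
            upper |= ids          # the equivalence class overlaps the decision class
            if decisions == {target}:
                lower |= ids      # the equivalence class lies entirely inside it
    return lower, upper
```

Because each equivalence class is processed independently, the per-key work is the part that a parallel framework could distribute.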

6.
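A hierarchical method for clustering retrieval results   Total citations: 3, self-citations: 1, citations by others: 2
Clustering retrieval results helps users browse the results returned by a search engine quickly. Traditional clustering methods are unsuitable because they cannot generate meaningful category labels. To improve hierarchical clustering of retrieval results, a label-based clustering algorithm is adopted. A category label extraction algorithm is proposed that fuses DF, query logs, and query-term context features; the extracted labels are used to construct a base category graph, and the hierarchical clustering result is built with the GBCA algorithm. Experiments confirm the effectiveness of the multi-feature fusion model, and the GBCA algorithm improves substantially on the STC and Snaket algorithms in both evaluation measures, category label extraction and F-Measure.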

7.
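A granularity-based rough clustering analysis method   Total citations: 1, self-citations: 1, citations by others: 0
He Ming. Computer Engineering (《计算机工程》), 2008, 34(8): 203-204
Based on rough set theory, the relationship between equivalence relations and granularity is analyzed, and a granularity-based rough clustering method is proposed. The method forms an initial equivalence relation and equivalence classes from the relative similarity between data objects, with each equivalence class corresponding to one granule. A membership-degree factor for equivalence relations is introduced to measure the membership between equivalence relations; it serves as an effective parameter of the clustering process and controls the scale of the clustering. An optimized clustering result is obtained by iteratively computing the validity of the clustering. The clustering process shows that the analysis is carried out at a single, unified granularity, with an equivalence relation defined between sample points. Experimental results confirm the effectiveness of the method, and the clustering results, described by rule sets, are interpretable and reasonable.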

8.
Rough k-means clustering describes uncertainty by assigning some objects to more than one cluster. A rough cluster quality index based on decision theory is applicable to the evaluation of rough clustering. In this paper we analyze rough k-means clustering with respect to the selection of the threshold, the value of risk for assigning an object, and the uncertainty of objects. According to the analysis, clusters presented as interval sets with lower and upper approximations in rough k-means clustering are not adequate to describe clusters. This paper proposes an interval set clustering based on decision theory. Lower and upper approximations in the proposed algorithm are hierarchical and constructed as outer-level approximations and inner-level ones. Uncertainty of objects in the outer-level upper approximation is described by the assignment of objects among different clusters. Accordingly, ambiguity of objects in the inner-level upper approximation is represented by local uniform factors of objects. In addition, interval set clustering can be improved to obtain a satisfactory clustering result with the optimal number of clusters, as well as optimal values of parameters, by taking advantage of the usefulness of the rough cluster quality index in the evaluation of clustering. The experimental results on synthetic and standard data demonstrate how to construct clusters with satisfactory lower and upper approximations in the proposed algorithm. The experiments with a promotional campaign for the retail data illustrate the usefulness of interval set clustering for improving rough k-means clustering results.

9.
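A rough-set-based clustering algorithm for mixed-attribute data   Total citations: 2, self-citations: 0, citations by others: 2
Fan Lilin, Wang Juan. Journal of Computer Applications (《计算机应用》), 2010, 30(12): 3377-3379
Traditional clustering methods assign each object strictly to one class, but boundary objects often cannot be assigned so strictly. The rough-set-based k-means clustering algorithm and the rough-set-based leader clustering algorithm use rough set theory to place data objects in the upper or lower approximation of a cluster, which offers a new perspective on handling uncertainty and resolves this boundary uncertainty well. Their drawbacks are that they cannot handle mixed-attribute data and that the clustering result depends noticeably on the initial values. To address these shortcomings, a distance definition suitable for mixed-attribute data is given, an improved way of choosing the initial values is proposed, and a rough-set-based clustering algorithm for mixed-attribute data is presented. Simulation experiments show that, when the number of clusters is not known in advance, the clustering accuracy of the algorithm is clearly higher than that of the traditional k-means algorithm.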

10.
Efficient phrase-based document indexing for Web document clustering   Total citations: 4, self-citations: 0, citations by others: 4
Document clustering techniques mostly rely on single-term analysis of the document data set, such as the vector space model. To achieve more accurate document clustering, more informative features, including phrases and their weights, are particularly important. Document clustering is useful in many applications such as automatic categorization of documents, grouping search engine results, building a taxonomy of documents, and others. This article presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the document index graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.

11.
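Wang Ling, Meng Jianyao. Control and Decision (《控制与决策》), 2018, 33(3): 471-478
Traditional Bayesian incremental clustering algorithms require manually set parameters and cluster unevenly distributed data poorly. To address these problems, a Bayesian adaptive resonance theory incremental clustering algorithm based on local distribution is proposed. First, the data are read as data snapshots. Then, without any parameters to set, the local distribution of the clusters is taken into account to determine adaptively which class each new datum belongs to, and the winning cluster is updated. Finally, the evolutionary relationships between clusters in adjacent snapshots are determined. Simulation results on different data sets show that the proposed algorithm improves significantly in both accuracy and adaptivity.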

12.
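An improved rough clustering method based on a genetic algorithm   Total citations: 2, self-citations: 0, citations by others: 2
Traditional clustering algorithms all partition data objects by hard computing, yet in practice objects of different classes usually have no clear boundary between them. Rough set theory provides a way to handle the uncertainty of boundary objects, so rough set theory is combined with the k-means method. In addition, traditional k-means clustering requires the number of clusters k to be given in advance, although k is hard to determine in practice. Traditional k-means also has strong local search ability but easily falls into a local optimum, whereas a genetic algorithm can reach the global optimum but converges too quickly. In view of this, an improved rough clustering method based on a genetic algorithm is proposed. The algorithm generates the number of k-means clusters dynamically, uses the max-min principle to generate the initial cluster centers, and uses the upper and lower approximations of rough set theory to handle boundary objects. Finally, the algorithm is validated on the Iris data set from UCI. Experimental results show that the algorithm achieves high accuracy and more stable overall performance.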
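
The max-min principle for initial centers mentioned in the abstract above can be sketched as follows. This is a common formulation, not necessarily the authors' exact procedure: starting from the first point is an assumption, and `k` is passed in explicitly here even though the paper's method determines the number of clusters dynamically.

```python
import math

def maxmin_centers(points, k):
    """Pick k initial centers so that each new center is far from all chosen ones."""
    centers = [points[0]]  # assumption: seed with the first point
    while len(centers) < k:
        # Choose the point whose distance to its closest chosen center is largest.
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers
```

Spreading the initial centers in this way reduces the chance that two centers start inside the same natural group, which is one cause of poor local optima in k-means.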

13.
In this paper the problem of automatically clustering a data set is posed as solving a multiobjective optimization (MOO) problem, optimizing a set of cluster validity indices simultaneously. The proposed multiobjective clustering technique utilizes a recently developed simulated annealing based multiobjective optimization method as the underlying optimization strategy. Here a variable number of cluster centers is encoded in the string. The number of clusters present in different strings varies over a range. The points are assigned to different clusters based on the newly developed point symmetry based distance rather than the existing Euclidean distance. Two cluster validity indices, one based on the Euclidean distance, XB-index, and another recently developed point symmetry distance based cluster validity index, Sym-index, are optimized simultaneously in order to determine the appropriate number of clusters present in a data set. Thus the proposed clustering technique is able to detect both the proper number of clusters and the appropriate partitioning from data sets having either hyperspherical clusters or point symmetric clusters. A new semi-supervised method is also proposed in the present paper to select a single solution from the final Pareto optimal front of the proposed multiobjective clustering technique. The efficacy of the proposed algorithm is shown for seven artificial data sets and six real-life data sets of varying complexities. Results are also compared with those obtained by another multiobjective clustering technique, MOCK, and two single-objective genetic algorithm based automatic clustering techniques, VGAPS clustering and GCUK clustering.
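
For readers unfamiliar with the Euclidean XB-index named in the abstract above, the sketch below computes the Xie-Beni index for a crisp partition: total squared distance of points to their own centers, divided by the number of points times the smallest squared distance between two centers. The crisp (non-fuzzy) form and the function name are simplifications assumed here; the Sym-index is not shown.

```python
import math

def xb_index(points, labels, centers):
    """Xie-Beni index for a crisp partition; lower values mean compact, well-separated clusters.

    Requires at least two distinct centers.
    """
    # Compactness: squared distance of every point to the center of its own cluster.
    compactness = sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))
    # Separation: smallest squared distance between any two centers.
    separation = min(
        math.dist(a, b) ** 2
        for i, a in enumerate(centers)
        for b in centers[i + 1:]
    )
    return compactness / (len(points) * separation)
```

Evaluating the index for partitions with different numbers of clusters and preferring the smallest value is one simple way to use it for choosing the cluster count.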

14.
Hybrid mining approach in the design of credit scoring models   Total citations: 1, self-citations: 0, citations by others: 1
Unrepresentative data samples are likely to reduce the utility of data classifiers in practical application. This study presents a hybrid mining approach to the design of an effective credit scoring model, based on clustering and neural network techniques. We used clustering techniques to preprocess the input samples with the objective of grouping unrepresentative samples into isolated and inconsistent clusters, and used neural networks to construct the credit scoring model. The clustering stage involved a class-wise classification process. A self-organizing map clustering algorithm was used to automatically determine the number of clusters and the starting points of each cluster. Then, the K-means clustering algorithm was used to generate clusters of samples belonging to new classes and eliminate the unrepresentative samples from each class. In the neural network stage, samples with new class labels were used in the design of the credit scoring model. Experiments on two real-world credit data sets demonstrate that the hybrid mining approach can be used to build effective credit scoring models.

15.
Clustering, one of the most important tools in information retrieval, is a data mining method that organizes data by unsupervised learning, so it does not require any training data. However, some text clustering algorithms cannot update existing clusters incrementally and instead have to recompute a new clustering from scratch. In view of this, this paper presents a novel bottom-up incremental conceptual hierarchical text clustering approach using the CFu-tree (ICHTC-CF) representation, which starts with each item as a separate cluster. Term-based feature extraction is used for summarizing a cluster in the process. The Comparison Variation measure criterion is also adopted for judging whether the closest pair of clusters can be merged or a previous cluster can be split. In addition, our incremental clustering method is not sensitive to the input data order. Experimental results show that our method outperforms k-means, CLIQUE, single linkage clustering, and complete linkage clustering, which indicates that the new technique is efficient and feasible.

16.
Clustering in Dynamic Spatial Databases   Total citations: 2, self-citations: 0, citations by others: 2
Efficient clustering in dynamic spatial databases is currently an open problem with many potential applications. Most traditional spatial clustering algorithms are inadequate because they do not support incremental clustering efficiently. In this paper, we propose DClust, a novel clustering technique for dynamic spatial databases. DClust is able to provide a multi-resolution view of the clusters, generate arbitrarily shaped clusters in the presence of noise, generate clusters that are insensitive to the ordering of input data, and support incremental clustering efficiently. DClust utilizes the density criterion that captures arbitrary cluster shapes and sizes to select a number of representative points, and builds the Minimum Spanning Tree (MST) of these representative points, called R-MST. After the initial clustering, a summary of the cluster structure is built. This summary enables quick localization of the effect of data updates on the current set of clusters. Our experimental results show that DClust outperforms existing spatial clustering methods such as DBSCAN, C2P, DENCLUE, Incremental DBSCAN, and BIRCH in terms of clustering time and accuracy of the clusters found.

17.
Cluster analysis is a primary tool for detecting anomalous behavior in real-world data such as web documents, medical records of patients, or other personal data. Most existing methods for document clustering are based on the classical vector-space model, which represents each document by a fixed-size vector of weighted key terms, often referred to as key phrases. Since vector representations of documents are frequently very sparse, inverted files are used to prevent the tremendous computational overload that may arise in large and diverse document collections such as pages downloaded from the World Wide Web. In order to reduce computation costs and space complexity, many popular methods for clustering web documents, including those using inverted files, usually assume a relatively small prefixed number of clusters. We propose several new crisp and fuzzy approaches based on the cosine similarity principle for clustering documents that are represented by variable-size vectors of key phrases, without limiting the final number of clusters. Each entry in a vector consists of two fields. The first field refers to a key phrase in the document and the second denotes an importance weight associated with this key phrase within the particular document. Removing the restriction on the total number of clusters may moderately increase computing costs but, on the other hand, improves the method's performance in classifying incoming vectors as normal or abnormal, based on their similarity to the existing clusters. All the procedures presented in this work are characterized by two features: (a) the number of clusters is not restricted to some relatively small prefixed number, i.e., an arbitrary new incoming vector which is not similar to any of the existing cluster centers necessarily starts a new cluster, and (b) a vector that appears n times in the training set is counted as n distinct vectors rather than a single vector.
These features are the main reasons for the high quality performance of the proposed algorithms. We later describe them in detail and show their implementation in a real-world application from the area of web activity monitoring, in particular, by detecting anomalous documents downloaded from the internet by users with abnormal information interests.
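
Features (a) and (b) in the abstract above can be illustrated with a small crisp sketch. It is an illustration under stated assumptions, not the paper's procedure: documents are dictionaries mapping a key phrase to its weight, cluster centers are running sums of member vectors, and the similarity threshold `threshold` and both function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {key_phrase: weight}."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def assign(vec, centers, threshold=0.5):
    """Return (cluster_index, started_new_cluster) and update `centers` in place."""
    best, best_sim = -1, -1.0
    for i, c in enumerate(centers):
        s = cosine(vec, c)
        if s > best_sim:
            best, best_sim = i, s
    if best_sim < threshold:
        # (a) Not similar to any existing center: the vector starts a new cluster.
        centers.append(dict(vec))
        return len(centers) - 1, True
    # (b) Every occurrence is added, so a vector seen n times counts n times.
    for k, w in vec.items():
        centers[best][k] = centers[best].get(k, 0.0) + w
    return best, False
```

Keeping the center as a sum rather than a mean is sufficient because cosine similarity ignores vector length. In a monitoring setting, an incoming vector that starts a new cluster could be the one flagged as abnormal.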

18.
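Clustering is an important research direction in data mining. To address the shortcomings of the similarity measures used in existing clustering algorithms, this paper proposes a new similarity measure. On this basis, the discernibility notion from rough set theory is introduced into the clustering algorithm to measure the importance of attributes, which leads to a new weighted rough clustering algorithm that can handle symbolic data. Experiments on UCI data show that the algorithm is insensitive to the input order of the data, does not require the number of clusters to be given in advance, and improves clustering quality.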

19.
The number of features extracted in real-world applications is growing dramatically, and because machine learning methods perform worse in this setting, feature reduction is required. In particular, for fault diagnosis in rotating machinery, the number of extracted features is sizable in order to collect all the available information from several monitored signals. Several approaches achieve data reduction using supervised or unsupervised strategies; the supervised ones are the most reliable, and their main disadvantage is that the fault condition must be known beforehand. This work proposes a new unsupervised algorithm for feature selection based on attribute clustering and rough set theory. Rough set theory is used to compute similarities between features through the relative dependency. The clustering approach combines classification based on distance with clustering based on prototype to group similar features, without requiring the number of clusters as an input. Additionally, the algorithm has an evolving property that allows the dynamic adjustment of the cluster structure during the clustering process, even when a new set of attributes feeds the algorithm. This gives the algorithm an incremental learning property, avoiding a retraining process. These properties define the main contribution and significance of the proposed algorithm. Two fault diagnosis problems of fault severity classification in gears and bearings are studied to test the algorithm. Classification results show that the proposed algorithm is able to select adequate features with accuracy comparable to other feature selection and reduction approaches.

20.
Quality of clustering is an important issue in the application of clustering techniques. Most traditional cluster validity indices are geometry-based cluster quality measures. This paper proposes a cluster validity index based on the decision-theoretic rough set model by considering various loss functions. Experiments with synthetic, standard, and real-world retail data show the usefulness of the proposed validity index for the evaluation of rough and crisp clustering. The measure is shown to help determine the optimal number of clusters, as well as an important parameter called the threshold in rough clustering. The experiments with a promotional campaign for the retail data illustrate the ability of the proposed measure to incorporate financial considerations in evaluating the quality of a clustering scheme. This ability to deal with monetary values distinguishes the proposed decision-theoretic measure from other distance-based measures. The proposed validity index can also be extended for evaluating other clustering algorithms such as fuzzy clustering.
