Similar literature
 Found 20 similar documents (search time: 421 ms)
1.
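A density-based online clustering algorithm for spatial data streams   Total citations: 2 (self-citations: 0, citations by others: 2)
于彦伟, 王沁, 邝俊, 何杰. 《自动化学报》, 2012, 38(6): 1051-1059
To address the clustering of arbitrarily shaped clusters in spatial data streams, an on-line density-based clustering algorithm for spatial data streams (On-line density-based clustering algorithm for spatial datastream, OLDStream) is proposed. The algorithm clusters incremental spatial data on top of the previous clustering result, performing local clustering updates only for each newly added spatial point and for the data in its neighborhood that satisfy the core-point condition. This lowers the time complexity of clustering updates and enables online clustering of spatial data streams. OLDStream processes large-scale spatial data streams quickly, obtains global clusters of arbitrary shape in real time, is insensitive to the input order of the data stream, and can detect outliers. Comprehensive experiments on real and synthetic data verify the algorithm's clustering quality, efficiency, and scalability; statistical analysis of the results shows that only 4% of spatial points incur the worst-case running time, and the average clustering time per spatial point is about 0.033 ms.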

2.
Massive spatio-temporal data have been collected from earth observation systems for monitoring changes in natural resources and the environment. To find the interesting dynamic patterns embedded in spatio-temporal data, there is an urgent need to detect spatio-temporal clusters formed by objects with similar attribute values occurring together across space and time. Among clustering methods, density-based methods are widely used to detect such spatio-temporal clusters because they are effective at finding arbitrarily shaped clusters and rely on less prior knowledge (e.g. the number of clusters). However, a series of user-specified parameters is required to identify high-density objects and to determine cluster significance. In practice, it is difficult for users to determine the optimal clustering parameters; therefore, existing density-based clustering methods typically exhibit unstable performance. To overcome these limitations, a novel density-based spatio-temporal clustering method based on permutation tests is developed in this paper. High-density objects and cluster significance are determined from statistical information on the dataset. First, the density of each object is defined based on the local variance, and a fast permutation test is conducted to identify high-density objects. Then, a proposed two-stage grouping strategy is implemented to group high-density objects and their neighbors; hence, spatio-temporal clusters are formed by minimizing the inhomogeneity increase. Finally, another newly developed permutation test is conducted to evaluate cluster significance based on the cluster member permutation. Experiments on both simulated and meteorological datasets show that the proposed method outperforms two state-of-the-art clustering methods, ST-DBSCAN and ST-OPTICS. The proposed method not only identifies inherent cluster patterns in spatio-temporal datasets but also greatly alleviates the difficulty of selecting appropriate clustering parameters.

3.
Clustering data streams has drawn a lot of attention in the last few years due to their ever-growing presence. Data streams impose additional challenges on clustering, such as limited time and memory and one-pass processing. Furthermore, discovering clusters with arbitrary shapes is very important in data stream applications. Data streams are infinite and evolve over time, and the number of clusters is not known in advance. In a data stream environment, noise appears occasionally due to various factors. Density-based methods are a remarkable class of data stream clustering methods: they can discover clusters of arbitrary shape and detect noise, and they do not need the number of clusters in advance. Due to the characteristics of data streams, traditional density-based clustering is not directly applicable. Recently, many density-based clustering algorithms have been extended to data streams. The main idea of these algorithms is to use density-based methods in the clustering process while overcoming the constraints imposed by the data stream's nature. The purpose of this paper is to shed light on some algorithms in the literature on density-based clustering over data streams. We not only summarize the main density-based clustering algorithms for data streams and discuss their uniqueness and limitations, but also explain how they address the challenges of clustering data streams. Moreover, we investigate the evaluation metrics used to validate cluster quality and measure the algorithms' performance. It is hoped that this survey will serve as a stepping stone for researchers studying data stream clustering, particularly density-based algorithms.

4.
Outlier detection is an essential activity for obtaining valuable information with which to identify, for example, intrusions, faults, and system failures. This paper proposes a bio-inspired algorithm able to detect anomalous data in distributed systems. Each data object is associated with a mobile agent that follows the well-known bio-inspired flocking algorithm. The agents are randomly disseminated onto a virtual space where they move autonomously in order to form one or more flocks. Through a tailored similarity function, the agents associated with similar objects join the same flock, whereas the agents associated with dissimilar objects do not join any flock. The objects associated with isolated agents, or with agents grouped into a flock whose number of entities is lower than a given threshold, are the outliers. Experimental results on synthetic and real data sets confirm the validity of the approach.
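The final labeling rule in the abstract above (isolated agents and members of undersized flocks are outliers) can be sketched in a few lines of Python. This is only an illustration of that rule, not the paper's flocking simulation: the function name `outliers_from_flocks`, the `flock_of` mapping, and the convention that `None` marks an isolated agent are all assumptions made for this example.

```python
from collections import Counter


def outliers_from_flocks(flock_of, min_flock_size):
    """Return the objects considered outliers.

    flock_of: dict mapping object id -> flock id, or None if the
              object's agent is isolated (hypothetical convention).
    min_flock_size: flocks with fewer members than this are treated
                    as outlier groups.
    """
    # Count members per flock, ignoring isolated agents.
    sizes = Counter(f for f in flock_of.values() if f is not None)
    return {
        obj
        for obj, f in flock_of.items()
        if f is None or sizes[f] < min_flock_size
    }


assignment = {"a": 1, "b": 1, "c": 1, "d": 2, "e": None}
print(outliers_from_flocks(assignment, min_flock_size=2))
```

With this assignment, `d` (alone in flock 2) and `e` (isolated) are reported, while the three members of flock 1 are not.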

5.
A Novel Density-Based Clustering Framework by Using Level Set Method   Total citations: 1 (self-citations: 0, citations by others: 1)
In this paper, a new density-based clustering framework is proposed by adopting the assumption that the cluster centers in data space can be regarded as target objects in image space. First, level set evolution is adopted to find an approximation of the cluster centers by using a new initial boundary formation scheme. Accordingly, three types of initial boundaries are defined so that each of them can evolve to approach the cluster centers in a different way. To avoid the long iteration time of level set evolution in data space, an efficient termination criterion is presented to stop the evolution process when no more cluster centers can be found. Then, a new effective density representation called level set density (LSD) is constructed from the evolution results. Finally, valley seeking clustering is used to group data points into the corresponding clusters based on the LSD. Experiments on synthetic and real data sets demonstrate the efficiency and effectiveness of the proposed clustering framework. Comparisons with the DBSCAN, OPTICS, and valley seeking clustering methods further show that the proposed framework can successfully avoid the overfitting phenomenon and resolve the confusion between cluster boundary points and outliers.

6.
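周红芳, 赵雪涵, 周扬. 《计算机应用》, 2012, 32(8): 2182-2185
The traditional density-based algorithms DBSCAN and DBRS suffer from poor time performance and low clustering accuracy. To address this, a density clustering algorithm combined with data sampling in a limited region, DBLRS, is proposed. Without increasing time or space complexity, the algorithm uses the parameter Eps to find the neighborhood points and expansion points of core points, and samples data within the limited region (Eps, 2Eps). Experimental results show that selecting representative points within the limited region to expand clusters reduces the probability that large clusters are split, and improves the algorithm's efficiency and clustering accuracy.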
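To make the "limited region (Eps, 2Eps)" concrete, the following Python sketch selects the points lying in the annulus between radius Eps and radius 2·Eps around a core point. It illustrates only the geometric filter. The function name `annulus_candidates`, the Euclidean metric, and the treatment of both boundaries as exclusive are assumptions, and the abstract does not say how DBLRS then picks representatives from this region.

```python
import math


def annulus_candidates(points, center, eps):
    """Return the points p with eps < dist(p, center) < 2 * eps.

    Euclidean distance and open boundaries are assumptions made for
    this sketch; the abstract only names the region (Eps, 2Eps).
    """
    return [p for p in points if eps < math.dist(p, center) < 2 * eps]


pts = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (3.0, 0.0)]
print(annulus_candidates(pts, center=(0.0, 0.0), eps=1.0))
```

Here only `(1.5, 0.0)` falls inside the annulus: the first two points are within Eps of the center and the last lies beyond 2·Eps.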

7.
This paper examines a method of clustering within a fully decentralized multi-agent system. Our goal is to group agents with similar objectives or data, as is done in traditional clustering. However, we impose the additional constraint that agents must remain in place on a network, instead of first being collected into a centralized database. To do this, we connect agents in a random overlay network and have them search in a peer-to-peer fashion for other similar agents. We thus aim to tackle the basic clustering problem on an Internet scale, and create a method by which agents themselves can be grouped, forming coalitions. In order to investigate the feasibility of this decentralized approach, this paper presents simulation experiments that look into the quality of the clusters discovered. First, the clusters found by the agent method are compared to those created by k-means clustering for two-dimensional spatial data points. Results show that the decentralized agent method produces a better clustering than the centralized k-means algorithm, placing 95% to 99% of points correctly. A further experiment explores how agents can be used to cluster a straightforward text document set, demonstrating that agents can discover clusters and keywords that are reasonable estimates of those identified by the central word vector space approach.

8.
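Clustering algorithm for high-dimensional Turnstile data streams   Total citations: 3 (self-citations: 1, citations by others: 3)
Existing data stream clustering algorithms can handle only Time Series and Cash Register data streams, and their accuracy is unsatisfactory on high-dimensional data streams. A subspace clustering algorithm for high-dimensional Turnstile data streams, HT-Stream, is proposed. The algorithm partitions the data space into a grid, dynamically maintains grid-cell information online, stores summary statistics in tilted time windows, and outputs clustering results offline for a user-specified time span. Experiments on real and simulated data sets show that the algorithm has good applicability and effectiveness.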
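What distinguishes a Turnstile stream from a Cash Register stream is that updates may be negative as well as positive, so a grid summary must support deletions. The Python sketch below shows that idea for per-cell counts only. It is not HT-Stream: the class `GridCounts`, its parameters, and the fixed cell width are assumptions for illustration, and the tilted time windows and subspace search mentioned in the abstract are omitted.

```python
from collections import defaultdict


class GridCounts:
    """Per-cell point counts over a uniform grid (illustrative only)."""

    def __init__(self, cell_width):
        self.cell_width = cell_width
        self.counts = defaultdict(int)

    def cell_of(self, point):
        # Map each coordinate to the index of the cell that contains it.
        return tuple(int(x // self.cell_width) for x in point)

    def update(self, point, delta):
        """Apply a turnstile update: delta may be positive or negative."""
        cell = self.cell_of(point)
        self.counts[cell] += delta
        if self.counts[cell] == 0:
            del self.counts[cell]  # keep only non-empty cells in memory

    def dense_cells(self, min_count):
        return {c for c, n in self.counts.items() if n >= min_count}


grid = GridCounts(cell_width=1.0)
grid.update((0.2, 0.3), +1)
grid.update((0.7, 0.9), +1)
grid.update((2.5, 2.5), +1)
grid.update((0.2, 0.3), -1)  # a deletion, which a cash-register model cannot express
print(grid.dense_cells(min_count=1))
```

Dense cells found this way would still have to be connected into clusters, for example by joining adjacent dense cells; that step is left out here.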

9.
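于彦伟, 王欢, 王沁, 赵金东. 《软件学报》, 2015, 26(5): 1113-1128
A density-based cluster structure mining algorithm (mining density-based clustering structure over data streams, MCluStream) is proposed to address the difficulty of choosing input parameters and of identifying overlapping clusters in density-based data stream clustering. First, a tree-topology index structure, the CR-Tree, is designed; it maps each pair of directly core-reachable data points to a parent-child relationship in the tree, so that the CR-Tree, which encodes the dependencies among data points, covers the density-based cluster structures for a range of subEps parameters. Second, MCluStream updates the CR-Tree with a sliding window and maintains the cluster structure of the current window online, enabling fast evolutionary clustering analysis of massive data streams. Third, a method for quickly extracting the cluster structure from the CR-Tree is designed, so that a reasonable clustering result can be selected based on the visualized cluster structure. Finally, experiments on massive real and synthetic data verify that MCluStream mines effectively, clusters efficiently, and has low space overhead. MCluStream is applicable to adaptive density-based evolutionary clustering analysis in massive data stream applications.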

10.
Clustering is a useful data mining technique which groups data points such that the points within a single group have similar characteristics, while the points in different groups are dissimilar. Density-based clustering algorithms such as DBSCAN and OPTICS are one widely used kind of clustering algorithm. As more and more applications deal with vast amounts of data, clustering such big data is a challenging problem. Recently, parallelizing clustering algorithms on a large cluster of commodity machines using the MapReduce framework has received a lot of attention. In this paper, we first propose a new density-based clustering algorithm, called DBCURE, which is robust in finding clusters with varying densities and suitable for parallelization with MapReduce. We next develop DBCURE-MR, which is a parallelized DBCURE using MapReduce. While traditional density-based algorithms find each cluster one by one, our DBCURE-MR finds several clusters together in parallel. We prove that both DBCURE and DBCURE-MR find the clusters correctly based on the definition of density-based clusters. Our experimental results with various data sets confirm that DBCURE-MR finds clusters efficiently without being sensitive to clusters with varying densities, and scales up well with the MapReduce framework.

11.
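The cluster-pruning strategy based on the "vibration method" builds on density-based clustering in the field of data mining. After a preliminary clustering of the data elements in a data warehouse, the "core" of each cluster is determined and assigned an "energy"; energy "vibration" is then applied to specific data elements so as to reduce the differences between data elements, merge clusters with high similarity, and bring out the relationships between clusters. Because the method introduces a reversible mode of energy transfer for analyzing the state of data elements, it can be applied to merging clusters after clustering and can strengthen the relationships that exist between clusters, which facilitates post-clustering cluster analysis.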

12.
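To address the problems of traditional parallel density clustering algorithms in big-data environments, namely unreasonable data partitioning, low accuracy of clustering results, strong sensitivity of results to parameters, and low parallel efficiency, a parallel OPTICS algorithm on MapReduce using mean distance and relevance marking, POMDRM-MR, is proposed. The algorithm partitions the data set with a dimension-sparsity-based partitioning strategy that reduces boundary points (DS-PRBP). For each partition, it proposes a marked point-ordering cluster identification algorithm (MOPTICS), which builds the relevance between data points and core points and marks the iteration count of each data point; for distance measurement, it uses a neighborhood mean distance strategy (FMD) to compute the neighborhood mean distance of each data point, which replaces the reachability distance in the ordering, and outputs a relevance-marked sequence. It then applies a re-ordered sequence cluster extraction algorithm (REC), which sorts the output sequence a second time and extracts clusters, improving the accuracy and stability of local clustering. When merging global clusters, the algorithm proposes a boundary density filtering strategy (BD-FLC) to compute and select local clusters of similar density; it also proposes a parallel local cluster merging algorithm (MCNT-MR), based on n-ary-tree union merging and the MapReduce model, which accelerates the convergence of local clusters, merges them in parallel, and improves the efficiency of global cluster merging. Comparative experiments show that POMDRM-MR achieves better clustering quality and better parallel performance on large-scale data sets.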

13.
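To address the low clustering quality and high communication cost of distributed data stream clustering algorithms, a distributed data stream clustering algorithm that combines density-based and representative-point-based clustering is proposed. Local sites use affinity propagation clustering and introduce the concept of cluster representative points to describe summary information of the local distribution; the global site uses an improved density-based clustering algorithm to merge the summary data structures uploaded by the local sites and thereby obtain the global model. Simulation results show that the proposed algorithm markedly improves the quality of data stream clustering in distributed environments, and that using cluster representative points allows it to discover clusters of different shapes while significantly reducing the volume of data transmitted.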

14.
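瞿原, 邓维斌, 胡峰, 张其龙, 王鸿. 《计算机科学》, 2018, 45(1): 97-102, 107
The density-based clustering algorithm Ordering Points to Identify the Clustering Structure (OPTICS) can derive the intrinsic clustering structure of a data set in a visual form, and basic clustering information can be extracted from the cluster ordering. However, because of its high time and space complexity, the algorithm does not adapt well to today's large data sets. The development of cloud computing and parallel computing offers a way to overcome this complexity drawback of OPTICS, and a parallel OPTICS algorithm built on the Spark in-memory computing platform is presented. Experimental results show that it greatly reduces the time and space requirements of OPTICS.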

15.
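Online network anomaly detection for high-traffic backbone networks is currently one of the hot topics in network security research. A network anomaly detection method is proposed that processes large data streams online efficiently: it uses a density-based clustering algorithm to convert the stream into micro-clusters, which improve processing efficiency, and periodically invokes an outlier detection algorithm to discover attack behavior. The method requires no offline training, can discover arbitrary behavior patterns, supports large data streams, can balance detection accuracy against system resource requirements, and is efficient. Experiments show that the prototype system completes the analysis of the 2000 LLS_DDOS_1.0 data set in 20 s, with a detection rate of 82% and a false alarm rate of 6%, comparable to K-means.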

16.
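Research on high-dimensional data stream clustering and its evolution analysis   Total citations: 5 (self-citations: 0, citations by others: 5)
Clustering analysis algorithms for data streams have become a research hotspot. A subspace-based clustering and evolution analysis algorithm for high-dimensional data streams, CAStream, is proposed. The algorithm partitions the data space into a grid, records the statistics of grid cells approximately, stores snapshots of potentially dense grid cells in an improved pyramidal time frame, and finally performs clustering and evolution analysis using depth-first search. CAStream can handle high-dimensional data streams effectively and can discover clusters of arbitrary shape. Experiments on real and simulated data sets show that the algorithm has good applicability and effectiveness.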

17.
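A data stream clustering algorithm based on density and fractal dimension is proposed. It adopts a two-phase online/offline framework and combines the advantages of density-based clustering and fractal clustering to overcome the shortcomings of traditional data stream clustering algorithms. To account for the timeliness of data streams, a decay strategy is applied to data points when computing grid density. Experimental results show that the algorithm effectively improves the efficiency and accuracy of data stream clustering, and can discover clusters of arbitrary shape as well as clusters whose members are not adjacent in distance.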
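The abstract does not give the decay formula, so the Python sketch below assumes a common choice in grid-based stream clustering: each point's contribution to a cell's density is multiplied by 2^(-λ·Δt) after Δt time units. The class name `DecayedCell`, the parameter `decay_rate` (λ), and the exponential form itself are assumptions for illustration, not the paper's definition. The point is that the cell only needs its last density and last update time, not the individual points.

```python
class DecayedCell:
    """Grid cell whose density fades exponentially with time (assumed form)."""

    def __init__(self, decay_rate):
        self.decay_rate = decay_rate  # hypothetical lambda > 0
        self.density = 0.0            # density as of last_update
        self.last_update = 0.0

    def density_at(self, now):
        # Decay the stored density lazily, only when it is queried.
        elapsed = now - self.last_update
        return self.density * 2.0 ** (-self.decay_rate * elapsed)

    def add_point(self, now):
        # The new point contributes weight 1; older points have faded.
        self.density = self.density_at(now) + 1.0
        self.last_update = now


cell = DecayedCell(decay_rate=1.0)
cell.add_point(now=0.0)
cell.add_point(now=1.0)
print(cell.density_at(now=2.0))
```

Because decay is applied lazily, a cell that receives no points costs nothing to maintain until it is next read or updated.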

18.
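Traditional clustering algorithms face high time complexity, large storage requirements, and low accuracy when clustering streaming data. To address this, a streaming data clustering algorithm based on dissimilarity sampling is proposed. First, dissimilarity sampling is used to sample the streaming data, and a kernel matrix is constructed from the sample points; then the kernel fuzzy C-means clustering algorithm clusters the points in the kernel matrix to obtain a labeled sample kernel matrix; finally, the labeled sample kernel matrix is used to assign the points in the stream. A decay clustering mechanism updates the sample kernel matrix in real time. Experimental results show that, compared with traditional clustering algorithms, the algorithm achieves lower time complexity while clustering in real time, and produces satisfactory clustering results.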

19.
This paper presents a novel approach to handling large amounts of geometric data. Data stream clustering is used to reduce the amount of data and build a hierarchy of clusters. The data stream concept allows for the processing of very large data sets. The cluster hierarchy is then used in a dynamic triangulation to create a multiresolution model. It allows for the interactive selection of a different level of detail in various parts of the data. A method for removing multiple points from a Delaunay triangulation is proposed. It is significantly faster than the traditional approach. The clustering and the triangulation are supplemented by an elliptical metric to handle data with anisotropic properties. Compared to the closest competitive method by Isenburg et al., the presented algorithm requires only a single pass over the data and offers high flexibility. These advantages come at the cost of a long running time. The method was tested on several large digital elevation maps. The clustering phase can take up to a few hours. Once the cluster hierarchy is built, the terrains can be efficiently manipulated in real time.

20.
For streaming data that arrive continuously, such as multimedia data and financial transactions, clustering algorithms are typically allowed to scan the data set only once. Existing research in this domain mainly focuses on improving the accuracy of clustering. In this paper, a novel density-based hierarchical clustering scheme for streaming data is proposed in order to improve both accuracy and effectiveness; it is based on the agglomerative clustering framework. Traditionally, clustering algorithms for streaming data often use the cluster center to represent the whole cluster when merging clusters, which may lead to unsatisfactory results. We argue that even if the data set is accessed only once, some parameters, such as the within-cluster variance, the intra-cluster density, and the inter-cluster distance, can be calculated accurately. This may bring measurable benefits to the process of cluster merging. Furthermore, we employ a general framework that can incorporate different criteria and, given the same criteria, will produce similar clustering results for both streaming and non-streaming data. In experimental studies, the proposed method demonstrates promising results with reduced time and space complexity.
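The claim that within-cluster variance can be computed accurately in a single pass can be illustrated with Welford's online algorithm, together with the standard pairwise formula for combining two summaries when clusters are merged. This is a generic sketch in Python for one-dimensional values, not the paper's own procedure. The names `RunningStats` and `merge` are chosen for this example, and the population variance (dividing by n) is an assumption.

```python
class RunningStats:
    """One-pass mean and variance of a cluster's values (Welford)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance; returns 0.0 for an empty summary.
        return self.m2 / self.n if self.n else 0.0


def merge(a, b):
    """Combine two summaries without revisiting the underlying data."""
    out = RunningStats()
    out.n = a.n + b.n
    if out.n == 0:
        return out
    delta = b.mean - a.mean
    out.mean = a.mean + delta * b.n / out.n
    out.m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / out.n
    return out


left, right = RunningStats(), RunningStats()
for x in [2, 4, 4, 4]:
    left.add(x)
for x in [5, 5, 7, 9]:
    right.add(x)
merged = merge(left, right)
print(merged.mean, merged.variance())
```

Each summary stores only three numbers, so merging two clusters is a constant-time operation regardless of how many points each has absorbed.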
