Similar Documents
20 similar documents found (search time: 15 ms)
1.
Using Self-Similarity to Cluster Large Data Sets (cited 6 times: 0 self-citations, 6 by others)
Clustering is a widely used knowledge discovery technique that helps uncover structures in data that were not previously known. The clustering of large data sets has received much attention in recent years; however, clustering remains a challenging task because many published algorithms fail to scale with the size of the data set and the number of dimensions describing the points, to find clusters of arbitrary shape, or to deal effectively with noise. In this paper, we present a new clustering algorithm based on the self-similarity properties of data sets. Self-similarity is the property of being invariant with respect to the scale at which the data set is viewed. While fractals are self-similar at every scale, many data sets exhibit self-similarity over a range of scales. Self-similarity can be measured using the fractal dimension. The new algorithm, which we call Fractal Clustering (FC), incrementally places each point in the cluster for which the change in fractal dimension after adding the point is least. This is a natural way of clustering points, since points in the same cluster have a high degree of self-similarity among themselves (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will (providing the best answer possible at that point), and is incremental. We show via experiments that FC deals effectively with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.
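The core placement step can be sketched in Python: estimate each cluster's fractal dimension by box counting and add the point to the cluster whose dimension changes least. This is a simplified illustration, not the authors' implementation; the scale grid, the `fc_assign` helper, and the plain least-squares slope estimate are all assumptions of this sketch.

```python
import math

def box_count(points, r):
    """Count grid boxes of side r occupied by the points."""
    return len({tuple(math.floor(c / r) for c in p) for p in points})

def fractal_dimension(points, scales=(0.05, 0.1, 0.2, 0.4)):
    """Box-counting dimension: slope of log N(r) versus log(1/r)."""
    xs = [math.log(1.0 / r) for r in scales]
    ys = [math.log(box_count(points, r)) for r in scales]
    n = len(scales)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def fc_assign(point, clusters):
    """Place the point in the cluster whose fractal dimension changes least."""
    best, best_delta = None, None
    for idx, cl in enumerate(clusters):
        delta = abs(fractal_dimension(cl + [point]) - fractal_dimension(cl))
        if best_delta is None or delta < best_delta:
            best, best_delta = idx, delta
    clusters[best].append(point)
    return best
```

A point lying on an existing cluster's structure leaves its box counts, and hence its dimension, essentially unchanged, while adding it to a distant cluster creates new boxes at every scale.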

2.
Scalable OLAP Query Techniques for Large-Scale Clusters (cited 1 time: 0 self-citations, 1 by others)
In the big-data era, large clusters built from low- and mid-range hardware have become one of the mainstream platforms for massive data processing. However, traditional parallel OLAP query algorithms designed for high-end hardware are ill-suited to this massively parallel environment composed of unreliable computing units. To improve scalability and fault tolerance in the new environment, this paper redesigns the data organization and processing models of the traditional data warehouse, proposing a join-free snowflake schema and a TRM execution model. The join-free snowflake schema uses hierarchy encoding to compress key information, such as dimension hierarchies, into the fact table, so that the fact table can be processed independently, guaranteeing computational independence at the data-model level. The TRM execution model abstracts OLAP query processing into three operations, Transform, Reduce, and Merge, so that an OLAP query can be divided into many independent subtasks executable in parallel, guaranteeing high scalability at the execution level. For performance optimization, the paper proposes Scan-index and skip-scan algorithms to minimize I/O accesses, and designs optimizations such as parallel and batched predicate evaluation to accelerate local computation. Experiments show that the LaScOLAP prototype achieves good scalability and fault tolerance, outperforming HadoopDB by an order of magnitude.

3.
In the form of support vector machines and Gaussian processes, kernel-based systems are currently very popular approaches to supervised learning. Unfortunately, the computational load of training kernel-based systems increases drastically with the size of the training data set, so these systems are not ideal candidates for applications with large data sets. Nevertheless, research in this direction is very active. In this paper, I review some of the current approaches to scaling kernel-based systems to large data sets.
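One concrete technique in this family replaces the kernel with an explicit finite-dimensional feature map, so that training cost scales with the number of features rather than the number of training points. The sketch below uses random Fourier features for the RBF kernel; it is a generic illustration of kernel approximation, not necessarily one of the specific approaches covered in this review.

```python
import math
import random

def rbf_kernel(x, y, gamma=0.5):
    """Exact RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def random_fourier_features(dim, n_features, gamma=0.5, seed=0):
    """Sample the random weights once; reuse them for every input point.
    For k(x,y) = exp(-gamma r^2) the spectral density is N(0, 2*gamma)."""
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * gamma)
    ws = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(n_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def phi(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(ws, bs)]
    return phi

# The inner product of the feature maps approximates the kernel value:
phi = random_fourier_features(dim=2, n_features=4000, gamma=0.5, seed=1)
x, y = [0.3, -0.2], [0.1, 0.4]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = rbf_kernel(x, y, gamma=0.5)
```

A linear model trained on `phi(x)` then stands in for the kernel machine, at cost linear in the data size.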

4.
胡海苗  姜帆 《软件学报》2015,26(S2):228-238
This paper proposes a scalable locality-sensitive hashing index (SLSH) to address the loss of indexing efficiency that arises in high-dimensional dynamic data indexing when the size and distribution of the data set cannot be determined in advance. SLSH is built on E2LSH and inherits its fast indexing of high-dimensional data and its ability to index points in Euclidean space directly. To give the hash index dynamic similarity-discriminating power, SLSH modifies the E2LSH hash family, adaptively tuning the hash parameters through a constraint on hash-bucket capacity. As a result, SLSH yields robust partitions even for data spaces whose density distribution changes dynamically.
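A minimal sketch of the idea, assuming the standard E2LSH hash family h(v) = floor((a . v + b) / w) and modeling the capacity constraint as "split an over-full bucket with a halved bucket width". The `build_index` recursion and the halving rule are illustrative assumptions, not the paper's actual parameter-adaptation scheme.

```python
import math
import random

def make_e2lsh_hash(dim, w, seed=0):
    """One p-stable LSH function h(v) = floor((a . v + b) / w)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)
    def h(v):
        return math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)
    return h

def build_index(points, dim, w, capacity, seed=0):
    """Hash the points; re-hash any over-full bucket with a finer width,
    mimicking capacity-driven parameter adaptation on dense regions."""
    h = make_e2lsh_hash(dim, w, seed)
    buckets = {}
    for p in points:
        buckets.setdefault(h(p), []).append(p)
    index = {}
    for key, pts in buckets.items():
        if len(pts) <= capacity:
            index[key] = pts
        else:  # dense region: refine with a narrower bucket width
            index[key] = build_index(pts, dim, w / 2.0, capacity, seed + 1)
    return index
```

Sparse regions keep coarse buckets while dense regions are partitioned more finely, which is the robustness-to-density property the abstract describes.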

5.
For high-dimensional data in large relational data sets, this paper proposes a visualization method based on cluster-guided touring of three-dimensional projections. The data are clustered, cluster centers are selected for projection, the projections are mapped into tetrahedra in three-dimensional space, and a tour through all the tetrahedra then traverses and visualizes the data.

6.
曾志强  廖备水  高济 《计算机科学》2009,36(11):208-212
The standard SVM training algorithm requires O(l^3) time and O(l^2) space, where l is the number of training samples, and is therefore unsuitable for training on very large data sets. This paper proposes an SVM training algorithm based on approximate solutions, the Approximate Vector Machine (AVM). AVM uses an incremental learning strategy to find an approximately optimal separating hyperplane, and applies warm-start and sampling techniques during iteration to speed up training. Theoretical analysis shows that the computational complexity of the algorithm is independent of the number of training samples, giving it good time and space scalability. Experiments on very large data sets show that the algorithm greatly accelerates training while preserving the generalization performance of the original classifier, and yields fewer support vectors, so the resulting classifier is also faster at prediction time.

7.
Computing LTS Regression for Large Data Sets (cited 9 times: 0 self-citations, 9 by others)
Data mining aims to extract previously unknown patterns or substructures from large databases. In statistics, this is what methods of robust estimation and outlier detection were constructed for; see, e.g., Rousseeuw and Leroy (1987). Here we focus on least trimmed squares (LTS) regression, which is based on the subset of h cases (out of n) whose least squares fit possesses the smallest sum of squared residuals. The coverage h may be set between n/2 and n. The computation time of existing LTS algorithms grows too quickly with the size of the data set, precluding their use for data mining. In this paper we develop a new algorithm called FAST-LTS. The basic ideas are an inequality involving order statistics and sums of squared residuals, and techniques we call 'selective iteration' and 'nested extensions'. We also use an intercept adjustment technique to improve precision. For small data sets FAST-LTS typically finds the exact LTS, whereas for larger data sets it gives more accurate results than existing algorithms for LTS and is faster by orders of magnitude. This allows us to apply FAST-LTS to large databases.
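The concentration ("C-") step at the heart of LTS algorithms can be sketched as follows: fit least squares on a subset of h cases, then keep the h cases with the smallest squared residuals and refit. This is the basic random-restart scheme only, without FAST-LTS's selective iteration, nested extensions, or intercept adjustment; the simple-regression setting and the start/step counts are illustrative choices.

```python
import random

def ols(pts):
    """Ordinary least squares fit y = a + b*x; returns (a, b)."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    b = (sum((x - mx) * (y - my) for x, y in pts)
         / sum((x - mx) ** 2 for x, _ in pts))
    return my - b * mx, b

def c_step(pts, h, a, b):
    """Concentration step: keep the h points with the smallest squared
    residuals under the current fit, then refit on them."""
    ranked = sorted(pts, key=lambda p: (p[1] - a - b * p[0]) ** 2)
    subset = ranked[:h]
    return subset, ols(subset)

def lts_fit(pts, h, n_starts=20, n_steps=10, seed=0):
    """Crude LTS: random initial subsets followed by C-steps."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        subset = rng.sample(pts, h)
        a, b = ols(subset)
        for _ in range(n_steps):
            subset, (a, b) = c_step(pts, h, a, b)
        cost = sum((y - a - b * x) ** 2 for x, y in subset)
        if best is None or cost < best[0]:
            best = (cost, a, b)
    return best[1], best[2]
```

Because each C-step can only decrease the trimmed sum of squares, the iteration converges quickly; gross outliers acquire huge residuals and are dropped from the active subset.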

8.
To explore relationships, particularly non-functional ones, among the variables of a multidimensional data set, each dimension of the n-dimensional cube containing the data is partitioned; a natural probability density function defined on the resulting n-dimensional grid yields the mutual information of the data set under that particular partition. Taking the maximum mutual information over all partitions and normalizing it gives one entry of the characteristic tensor; the maximum entry of the characteristic tensor under a given bound on the number of grid cells defines MIC, which measures the degree of association among multidimensional variables.
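The grid-based mutual information underlying this construction can be sketched directly for two variables. The equal-width binning below is an illustrative simplification: the full MIC statistic maximizes over all grid partitions under the cell-count bound and normalizes by the log of the smaller grid side, both of which are omitted here.

```python
import math

def grid_mutual_information(xs, ys, x_bins, y_bins):
    """Empirical mutual information of the (x, y) sample under an
    equal-width x_bins-by-y_bins grid over the data range."""
    n = len(xs)
    def bin_of(v, lo, hi, k):
        if hi == lo:
            return 0
        return min(int((v - lo) / (hi - lo) * k), k - 1)
    lx, hx = min(xs), max(xs)
    ly, hy = min(ys), max(ys)
    joint = {}
    for x, y in zip(xs, ys):
        key = (bin_of(x, lx, hx, x_bins), bin_of(y, ly, hy, y_bins))
        joint[key] = joint.get(key, 0) + 1
    px, py = {}, {}
    for (i, j), c in joint.items():
        px[i] = px.get(i, 0) + c
        py[j] = py.get(j, 0) + c
    mi = 0.0
    for (i, j), c in joint.items():
        p = c / n
        mi += p * math.log(p / ((px[i] / n) * (py[j] / n)))
    return mi
```

A deterministic relationship concentrates mass on few cells and drives the grid MI toward its maximum, log of the grid side, while an independent pattern keeps it near zero.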

9.
For the problem of clustering large-scale log data, this paper proposes the DBk-means algorithm, which uses Hadoop to preprocess the raw log data and combines the respective strengths of the k-means and DBSCAN clustering algorithms. Experimental results show that DBk-means achieves better clustering than k-means alone, with accuracy above 83%.
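The abstract does not specify how the two algorithms are combined. One plausible reading, sketched here purely as an assumption, is to use DBSCAN-style density estimates to seed the k-means centers; the `eps` radius and the greedy seeding rule are illustrative choices, not the paper's method.

```python
import math

def density_seeds(points, k, eps):
    """DBSCAN-style seeding: rank points by eps-neighborhood size, then
    greedily keep high-density points that are at least eps apart."""
    counts = [(sum(1 for q in points if math.dist(p, q) < eps), p)
              for p in points]
    counts.sort(key=lambda t: -t[0])
    seeds = []
    for _, p in counts:
        if all(math.dist(p, s) >= eps for s in seeds):
            seeds.append(p)
        if len(seeds) == k:
            break
    return seeds

def kmeans(points, centers, n_iter=20):
    """Plain Lloyd iterations from the given initial centers."""
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            groups[i].append(p)
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers
```

Density-aware seeding avoids the classic k-means failure of placing two initial centers inside one dense blob.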

10.
刘贝贝  马儒宁  丁军娣 《软件学报》2015,26(11):2820-2835
To address the failure or poor performance of traditional clustering algorithms on big data, this paper proposes a density-based statistical merging algorithm for large data sets (DSML). The algorithm treats each feature of a data point as an independent random variable and derives a statistical merging criterion from the independent bounded differences inequality. First, the criterion is used to improve the Leaders algorithm, producing a set of representative points; then, combining the density and neighborhood information of the representatives, the criterion is applied again to cluster the whole data set. Theoretical analysis and experimental results show that DSML has near-linear time complexity, handles data sets of arbitrary shape, and is robust to noise, making it well suited to large-scale data sets.

11.
This paper addresses the problem of finding frequent closed patterns (FCPs) in very dense data sets. We introduce two compressed hierarchical FCP mining algorithms: C-Miner and B-Miner. The two algorithms compress the original mining space, hierarchically partition the whole mining task into independent subtasks, and mine each subtask progressively. They adopt different task-partitioning strategies: C-Miner partitions the mining task based on Compact Matrix Division, whereas B-Miner partitions it based on Base Rows Projection. The compressed hierarchical mining algorithms enhance mining efficiency and facilitate progressive refinement of results. Moreover, because the subtasks can be mined independently, C-Miner and B-Miner can be readily parallelized without incurring significant communication overhead. We have implemented C-Miner and B-Miner, and our performance study on synthetic data sets and real dense microarray data sets shows their effectiveness over existing schemes. We also report experimental results on parallel versions of these two methods.

12.
A Data Compression Algorithm Based on a Hierarchical Cluster Model in Sensor Networks (cited 1 time: 0 self-citations, 1 by others)
This paper proposes a distributed data compression algorithm based on a hierarchical cluster model for sensor networks. The network is mapped onto a hierarchy of clusters, and different wavelet transform models are applied for compression, exploiting the relative regularity of node deployment within low-level clusters and the relative irregularity within super-clusters. Theoretical analysis and simulation results show that the algorithm has good approximation performance, compresses sensor-network data effectively, and substantially reduces the volume of data transmitted, thereby extending the lifetime of the whole network.
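The wavelet compression step can be illustrated with the simplest case: a one-level Haar transform with thresholding of small detail coefficients. The abstract does not specify the paper's actual transform models for low-level versus super clusters, so this is a generic sketch (it assumes an even-length sequence of readings).

```python
def haar_forward(data):
    """One level of the Haar transform: pairwise averages and details."""
    avgs = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
    dets = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
    return avgs, dets

def haar_inverse(avgs, dets):
    """Reconstruct the original sequence from averages and details."""
    out = []
    for a, d in zip(avgs, dets):
        out.extend([a + d, a - d])
    return out

def compress(data, threshold):
    """Zero out detail coefficients below the threshold; smooth sensor
    readings compress well because neighboring samples are similar."""
    avgs, dets = haar_forward(data)
    kept = [d if abs(d) >= threshold else 0.0 for d in dets]
    return avgs, kept
```

Only the averages and the few large details need to be transmitted; the reconstruction error is bounded by the threshold.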

13.
To address the difficulty the original affinity propagation (AP) clustering algorithm has with multi-exemplar clustering, as well as its excessive time and space costs, this paper proposes the FRSMEAP algorithm (multi-exemplar affinity propagation using a fast reduced set density estimator). In the initial clustering stage, the fast reduced set density estimator (FRSDE) preprocesses the large data set to obtain a reduced set that adequately represents the sample's properties; in the clustering stage, multi-exemplar affinity propagation (MEAP) produces clearer clustering decision boundaries than AP, improving clustering accuracy; finally, the K-nearest neighbor (KNN) algorithm assigns the remaining points to obtain the final partition of the data. Simulation results on synthetic and real data sets show that the algorithm not only scales to large data sets but also offers high clustering accuracy and fast execution.

14.
This work has two main objectives: to introduce a novel algorithm, called the fast condensed nearest neighbor (FCNN) rule, for computing a training-set-consistent subset for the nearest neighbor decision rule, and to show that condensation algorithms for the nearest neighbor rule can be applied to huge collections of data. The FCNN rule has some interesting properties: it is order independent, its worst-case time complexity is quadratic but often with a small constant prefactor, and it is likely to select points very close to the decision boundary. Furthermore, its structure allows the triangle inequality to be exploited effectively to reduce the computational effort. The FCNN rule outperformed even enhanced variants of existing competence preservation methods, both in learning speed and learning scaling behavior and, often, in the size of the model, while guaranteeing the same prediction accuracy. Furthermore, it was three orders of magnitude faster than hybrid instance-based learning algorithms on the MNIST and Massachusetts Institute of Technology (MIT) Face databases and computed a model of accuracy comparable to that of methods incorporating a noise-filtering pass.
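The notion of a training-set-consistent subset can be illustrated with the classic Hart condensation loop, which FCNN improves upon: keep adding any point the current subset misclassifies until the subset classifies the whole training set correctly. Unlike FCNN, this sketch is order dependent and does no triangle-inequality pruning.

```python
import math

def nn_classify(subset, p):
    """1-NN prediction of point p using the condensed subset."""
    return min(subset, key=lambda s: math.dist(s[0], p))[1]

def condense(training):
    """Hart-style condensation: grow the subset with every misclassified
    training point until the subset is training-set consistent."""
    subset = [training[0]]
    changed = True
    while changed:
        changed = False
        for x, label in training:
            if nn_classify(subset, x) != label:
                subset.append((x, label))
                changed = True
    return subset
```

The loop invariant on termination is exactly consistency: 1-NN over the subset reproduces every training label, usually with far fewer points.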

15.
In this work, the parallel fast condensed nearest neighbor (PFCNN) rule, a distributed method for computing a consistent subset of a very large data set for the nearest neighbor classification rule, is presented. To cope with the communication overhead typical of distributed environments and to reduce memory requirements, different variants of the basic PFCNN method are introduced. An analysis of spatial cost, CPU cost, and communication overhead is provided for all the algorithms. Experimental results on both synthetic and real very large data sets reveal that these methods can be profitably applied to enormous collections of data: they scale up well and are efficient in memory consumption, confirming the theoretical analysis, and they achieve noticeable data reduction with good classification accuracy. To the best of our knowledge, this is the first distributed algorithm for computing a training-set-consistent subset for the nearest neighbor rule.

16.
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real-world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and to domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well-known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real-world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
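A minimal k-modes sketch: the matching dissimilarity counts mismatched attributes, and each cluster's mode is updated per attribute by frequency. The initial modes, tie-breaking (first most-common value wins), and iteration count are illustrative choices of this sketch.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical objects differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def k_modes(objects, modes, n_iter=10):
    """Lloyd-style iterations with modes in place of means."""
    groups = [[] for _ in modes]
    for _ in range(n_iter):
        groups = [[] for _ in modes]
        for obj in objects:
            i = min(range(len(modes)),
                    key=lambda j: matching_dissimilarity(obj, modes[j]))
            groups[i].append(obj)
        # the mode takes the most frequent value of each attribute
        modes = [tuple(Counter(col).most_common(1)[0][0] for col in zip(*g))
                 if g else modes[i]
                 for i, g in enumerate(groups)]
    return modes, groups
```

Replacing the mean with the mode is what lets the k-means machinery run on purely categorical attributes.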

17.
李海永  李晓  张岩 《计算机工程》2011,37(12):82-84
Given the resource constraints of wireless sensor networks (WSNs), this paper proposes an intra-cluster hierarchical data fusion algorithm for WSNs based on principal component analysis. A self-learning weighting method estimates the measurement variance of each sensor; the measurements of the sensor nodes within a cluster are corrected by linear unbiased minimum-variance estimation; and principal component analysis yields each sensor's overall support degree and the data fusion formula. A worked example and simulation results verify the effectiveness and reliability of the method.
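The linear unbiased minimum-variance correction can be sketched as inverse-variance weighting: each sensor's reading is weighted by the reciprocal of its estimated measurement variance. The self-learning variance estimation and the PCA-based support degrees of the paper are not reproduced here; the variances are taken as given.

```python
def fuse(estimates, variances):
    """Linear unbiased minimum-variance fusion: weights proportional to
    1/variance, normalized to sum to one."""
    inv = [1.0 / v for v in variances]
    total = sum(inv)
    weights = [w / total for w in inv]
    value = sum(w * e for w, e in zip(weights, estimates))
    fused_var = 1.0 / total  # variance of the fused estimate
    return value, weights, fused_var
```

The fused variance 1/Σ(1/σ_i²) is always below the best individual sensor's variance, which is why fusing even a noisy extra sensor helps.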

18.
A Two-Level Multi-Scale Representation Model and Retrieval Technique for Vector Data (cited 1 time: 0 self-citations, 1 by others)
Multi-scale representation of spatial data is one of the hot topics in contemporary GIS research. Addressing the need for fast visualization of vector data, and drawing on theory from the field of cartographic generalization, this paper proposes a two-level multi-scale representation model for vector data that abstracts the data at two levels: spatial features, and the coordinate points within a feature. At the feature level, with the spatial feature as the smallest unit of study, a multi-scale index describes changes in the number or nature of features as the scale changes; at the point level, with the coordinate point as the smallest unit, scale-level tags express the gradual change of a feature's coordinate points with scale. The model was implemented on the open-source database management system PostgreSQL by extending the corresponding indexes and functions, and a matching retrieval algorithm was designed; both were validated on 1:10,000 land-use data for a city. Experimental results show that, with essentially no loss of visualization quality, this multi-scale model greatly improves the efficiency of visualizing and transmitting massive vector data.

19.
Exploratory data analysis is a widely used technique for determining which factors have the most influence on the values in a multi-way table, or which cells of the table can be considered anomalous with respect to the others. In particular, median polish is a simple yet robust method for exploratory data analysis. Median polish is resistant to holes in the table (cells that have no values), but it may require many iterations through the data, which makes it difficult to apply to large multidimensional tables, since the I/O requirements may be prohibitive. This paper describes a technique that applies median polish to an approximation of a datacube, easing the I/O burden. The cube approximation is obtained by fitting log-linear models to the data, and the results are tested for quality using a variety of measures. The technique scales to large datacubes and gives a good approximation of the results that median polish would have produced on the original data.
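Median polish itself can be sketched in a few lines: alternately sweep row and column medians into the effect terms while preserving the decomposition cell = overall + row effect + column effect + residual. This sketch operates on a small in-memory two-way table, not on the datacube approximation the paper targets.

```python
from statistics import median

def median_polish(table, n_iter=10):
    """Tukey's median polish on a two-way table. The invariant
    table[i][j] == overall + row_eff[i] + col_eff[j] + res[i][j]
    holds (up to float error) after every sweep."""
    rows, cols = len(table), len(table[0])
    res = [row[:] for row in table]
    overall, row_eff, col_eff = 0.0, [0.0] * rows, [0.0] * cols
    for _ in range(n_iter):
        for i in range(rows):          # sweep row medians into row effects
            m = median(res[i])
            row_eff[i] += m
            res[i] = [v - m for v in res[i]]
        m = median(col_eff)            # re-center column effects
        overall += m
        col_eff = [v - m for v in col_eff]
        for j in range(cols):          # sweep column medians into column effects
            m = median(res[i][j] for i in range(rows))
            col_eff[j] += m
            for i in range(rows):
                res[i][j] -= m
        m = median(row_eff)            # re-center row effects
        overall += m
        row_eff = [v - m for v in row_eff]
    return overall, row_eff, col_eff, res
```

Cells with large residuals after polishing are the anomaly candidates the abstract refers to.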

20.
An SVM Parameter Optimization Method for Big Data (cited 2 times: 0 self-citations, 2 by others)
This paper studies fast parameter optimization for data regression. Traditional SVM parameter tuning uses exhaustive search over a large range, consumes a great deal of time, and is unsuitable for training on large data sets. To shorten the search and speed up tuning, a strategy based on uniform design and self-calling support vector regression is proposed. Uniform design generates 27 representative parameter combinations, and the mean squared error (MSE) of each is obtained by cross-validation on the training set; then, with MSE as the objective function, support vector regression is applied to itself to build a model relating MSE to the 27 parameter combinations. The model predicts the MSE of 729 parameter combinations, and the combination with the smallest predicted MSE is selected as optimal. Simulation results on three real data sets show that the new method greatly reduces training and modeling time while preserving prediction accuracy, providing an effective new solution to SVM parameter selection on large data sets.
