Similar documents
20 similar documents found (search time: 0 ms)
1.
Precomputing a full data cube yields the fastest query response, but for a large data cube the storage required is enormous, so in practice only some of the cube's aggregates can be precomputed. This paper proposes PCC (Partial Computation of Cube), an algorithm for computing partial data cubes. Its distinguishing features are a bottom-up partitioning scheme, the ability to derive the dimension-partitioning path from the aggregates that are actually requested, and pruning of unnecessary aggregates and partitions. Experiments show that PCC is more efficient than computing a partial data cube with BUC, the method used for computing full data cubes.
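The idea of computing only requested aggregates can be sketched as follows. This is a flat, illustrative sketch, not PCC itself: it materializes SUM for each requested group-by but does not reproduce PCC's bottom-up partitioning or its pruning of the recursion; all names and the data layout are illustrative.

```python
from itertools import combinations

def partial_cube(rows, dims, measure, wanted=None):
    """Materialize SUM for each requested group-by (a subset of dims).

    With wanted=None this materializes the full cube (every subset of dims);
    passing an explicit list of subsets computes only a partial cube.
    """
    targets = (wanted if wanted is not None
               else [s for k in range(len(dims) + 1)
                     for s in combinations(dims, k)])
    cube = {}
    for subset in targets:
        groups = {}
        for row in rows:
            key = tuple(row[d] for d in subset)  # group-by key for this subset
            groups[key] = groups.get(key, 0) + row[measure]
        cube[tuple(subset)] = groups
    return cube
```

Restricting `wanted` to the aggregates a workload actually needs is what saves the storage the abstract refers to; PCC additionally shares work between related group-bys during the bottom-up passes.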

2.
刘群 《计算机科学》2004,31(Z2):185-186
1 Introduction  With the rapid growth of the information and service resources offered by the Internet, many powerful search engines search Web documents by content, keywords, and similar means, but unfortunately the query results often fail to satisfy users. Cluster analysis can, when the characteristics of a data set are unknown, use an unsupervised learning process to gain a preliminary understanding of the set's distribution and aggregation properties; however, the choice of clustering model and the accuracy of the clustering results both affect the quality of the entire knowledge-discovery process.

3.
Query speed is a key performance metric in online analytical processing. A common way to accelerate queries is to materialize all possible aggregates in advance, but such full materialization comes at the cost of storage space. Exploiting the distribution characteristics of data-cube data together with compression techniques, this paper shows how to perform full materialization while maximizing storage savings, and then studies query processing on this representation, aiming at minimal storage combined with good query speed.

4.
Computing LTS Regression for Large Data Sets
Data mining aims to extract previously unknown patterns or substructures from large databases. In statistics, this is what methods of robust estimation and outlier detection were constructed for, see e.g. Rousseeuw and Leroy (1987). Here we will focus on least trimmed squares (LTS) regression, which is based on the subset of h cases (out of n) whose least squares fit possesses the smallest sum of squared residuals. The coverage h may be set between n/2 and n. The computation time of existing LTS algorithms grows too much with the size of the data set, precluding their use for data mining. In this paper we develop a new algorithm called FAST-LTS. The basic ideas are an inequality involving order statistics and sums of squared residuals, and techniques which we call ‘selective iteration’ and ‘nested extensions’. We also use an intercept adjustment technique to improve the precision. For small data sets FAST-LTS typically finds the exact LTS, whereas for larger data sets it gives more accurate results than existing algorithms for LTS and is faster by orders of magnitude. This allows us to apply FAST-LTS to large databases.
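The core of FAST-LTS is the concentration step ("C-step"): fit least squares on the current h-subset, then keep the h cases with the smallest residuals, which provably never increases the trimmed sum of squares. The sketch below shows random starts plus iterated C-steps; it omits the paper's selective iteration, nested extensions, and intercept adjustment, and the parameter defaults are illustrative.

```python
import numpy as np

def c_step(X, y, subset, h):
    """One concentration step: fit LS on `subset`, return the h cases
    with the smallest squared residuals over the whole data set."""
    beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
    resid = (y - X @ beta) ** 2
    return np.argsort(resid)[:h]

def fast_lts(X, y, h, n_starts=20, n_steps=10, rng=None):
    """Sketch of the FAST-LTS outline: many random p-subsets, each
    refined by C-steps; return the fit with the smallest trimmed SSR."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    best_ssr, best_beta = np.inf, None
    for _ in range(n_starts):
        subset = rng.choice(n, size=p, replace=False)  # elemental start
        for _ in range(n_steps):
            subset = c_step(X, y, subset, h)
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        ssr = np.sort((y - X @ beta) ** 2)[:h].sum()
        if ssr < best_ssr:
            best_ssr, best_beta = ssr, beta
    return best_beta
```

Because each start only needs a handful of C-steps before its trimmed SSR stops improving, the cost per start is a few least-squares fits rather than a combinatorial search over h-subsets.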

5.
For the problem of clustering large-scale log data, the DBk-means algorithm is proposed. It uses Hadoop to preprocess the raw log data and combines the respective strengths of the k-means and DBSCAN clustering algorithms. Experimental results show that clustering with DBk-means achieves better results than clustering with k-means, with accuracy above 83%.

6.
In this work, the parallel fast condensed nearest neighbor (PFCNN) rule, a distributed method for computing a consistent subset of a very large data set for the nearest neighbor classification rule is presented. In order to cope with the communication overhead typical of distributed environments and to reduce memory requirements, different variants of the basic PFCNN method are introduced. An analysis of spatial cost, CPU cost, and communication overhead is accomplished for all the algorithms. Experimental results, performed on both synthetic and real very large data sets, revealed that these methods can be profitably applied to enormous collections of data. Indeed, they scale up well and are efficient in memory consumption, confirming the theoretical analysis, and achieve noticeable data reduction and good classification accuracy. To the best of our knowledge, this is the first distributed algorithm for computing a training set consistent subset for the nearest neighbor rule.
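A "consistent subset" is a subset S of the training set such that 1-NN over S classifies every training point correctly. The sequential, Hart-style condensing idea that PFCNN distributes can be sketched as follows; this is the classical sequential rule, not the paper's parallel algorithm.

```python
import numpy as np

def condense(X, y):
    """Grow a subset S until 1-NN on S classifies all of X correctly:
    repeatedly scan the data and add every point that S misclassifies."""
    S = [0]  # seed with an arbitrary point
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            d = np.linalg.norm(X[S] - X[i], axis=1)
            if y[S[int(np.argmin(d))]] != y[i]:  # misclassified by 1-NN on S
                S.append(i)
                changed = True
    return np.array(S)
```

On well-separated classes the condensed subset is a small fraction of the data, which is the data-reduction effect the abstract reports.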

7.
The original affinity propagation (AP) clustering algorithm handles multi-exemplar clustering poorly and incurs excessive space and time costs. To address this, the fast multi-exemplar affinity propagation clustering algorithm (FRSMEAP, multi-exemplar affinity propagation using a fast reduced set density estimator) is proposed. In the initial phase, it applies the fast reduced set density estimator (FRSDE) to preprocess the large-scale data set, obtaining a reduced set that adequately represents the properties of the samples. In the clustering phase, it uses the multi-exemplar affinity propagation (MEAP) clustering algorithm, which yields clearer cluster decision boundaries than AP and thereby improves clustering accuracy. Finally, the K-nearest neighbor (KNN) algorithm assigns the remaining points to produce the final partition of the data. Simulation results on synthetic and real data sets show that the algorithm not only scales to large data sets but also offers high clustering accuracy and fast execution.

8.
An Algorithm for Quickly Generating a Minimal Condensed Data Cube
Semantic OLAP technology is a recent research focus, and the condensed data cube is one such technique. This paper presents SQCube, an algorithm for quickly generating a minimal condensed data cube. The algorithm works in two phases: first, the BottomUpBST algorithm generates a non-minimal condensed data cube; then a post-processing pass over this non-minimal cube compresses all of its pure BSTs and hidden BSTs into a single BST, yielding a minimal condensed data cube. Experiments show that SQCube clearly outperforms MinCube, the previously proposed algorithm for the same task.

9.
Exploratory data analysis is a widely used technique to determine which factors have the most influence on data values in a multi-way table, or which cells in the table can be considered anomalous with respect to the other cells. In particular, median polish is a simple yet robust method to perform exploratory data analysis. Median polish is resistant to holes in the table (cells that have no values), but it may require many iterations through the data. This factor makes it difficult to apply median polish to large multidimensional tables, since the I/O requirements may be prohibitive. This paper describes a technique that uses median polish over an approximation of a datacube, easing the burden of I/O. The cube approximation is achieved by fitting log-linear models to the data. The results obtained are tested for quality, using a variety of measures. The technique scales to large datacubes and proves to give a good approximation of the results that would have been obtained by median polish in the original data.  
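Median polish itself is simple enough to sketch in full. The version below performs Tukey's basic two-way sweep, alternately removing row and column medians to produce an additive decomposition table ≈ overall + row + col + residual; the paper's contribution is running this over a log-linear approximation of the datacube instead of the raw data, which the sketch does not attempt.

```python
import numpy as np

def median_polish(table, n_iter=10):
    """Tukey's median polish on a 2-D table.

    Returns (overall, row_effects, col_effects, residuals) such that
    table == overall + row[:, None] + col[None, :] + residuals.
    """
    resid = np.array(table, dtype=float)
    overall = 0.0
    row = np.zeros(resid.shape[0])
    col = np.zeros(resid.shape[1])
    for _ in range(n_iter):
        # sweep row medians out of the residuals
        rmed = np.median(resid, axis=1)
        row += rmed
        resid -= rmed[:, None]
        # fold the median of the column effects into the overall term
        delta = np.median(col)
        overall += delta
        col -= delta
        # sweep column medians out of the residuals
        cmed = np.median(resid, axis=0)
        col += cmed
        resid -= cmed[None, :]
        # fold the median of the row effects into the overall term
        delta = np.median(row)
        overall += delta
        row -= delta
    return overall, row, col, resid
```

Each sweep preserves the decomposition exactly (every quantity removed from the residuals is added to an effect term), which is why large residuals left at the end can be read directly as anomalous cells.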

10.
胡明  曾联明 《现代计算机》2010,(7):16-19,23
To address the low classification efficiency of SVM on large-scale training sample sets, caused by high sample dimensionality and heavy memory consumption, a computing strategy based on a grid environment is proposed. For this compute-intensive problem, the strategy offers three task-decomposition schemes: by step, by function, and by data; users choose among them according to the actual SVM training and classification workload. Comparative experiments on remote-sensing data in single-machine and grid environments show that the strategy improves training and classification speed, compensating at the level of the computing environment for the high performance demands of processing large-scale data.

11.
Data preprocessing is a key step in KDD; good preprocessing can greatly improve the efficiency of data mining. This paper proposes a data-generalization algorithm based on the data cube for data preprocessing, which provides a good data environment for data mining and improves its effectiveness.

12.
Data preprocessing plays a decisive role in data-mining projects and is one of the key steps of the whole mining process. Based on the characteristics of applying the probabilistic rough-set model to data mining, this paper proposes SRII, a probabilistic rough-set algorithm based on information induction, for data preprocessing. Experiments show that SRII combined with the C4.5 algorithm delivers good efficiency and markedly improved mining results when applied to data mining.

13.
《Computer》1980,13(8):85-87

14.
《Computer》1981,14(1):53-53

15.
Solving systems of linear algebraic equations is a problem frequently encountered in engineering, and their coefficient matrices are often large and sparse. This article describes a simple, practical compressed algorithm for solving such systems, already implemented in C. It concludes with a comparison of the compressed and uncompressed algorithms.
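The article does not spell out its storage scheme, so as an illustration, here is a sketch of one common compressed format, CSR (compressed sparse row), which stores only the nonzeros plus index arrays; the article's C implementation may differ. CSR keeps the matrix-vector product cheap, which is the operation iterative solvers for such systems rely on.

```python
def to_csr(dense):
    """Compress a dense matrix (list of lists) into CSR form:
    (values, column indices, row pointers). Only nonzeros are stored."""
    vals, cols, ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                vals.append(v)
                cols.append(j)
        ptr.append(len(vals))  # ptr[i+1] - ptr[i] = nonzeros in row i
    return vals, cols, ptr

def csr_matvec(vals, cols, ptr, x):
    """Compute y = A @ x using only the stored nonzeros."""
    y = []
    for i in range(len(ptr) - 1):
        y.append(sum(vals[k] * x[cols[k]] for k in range(ptr[i], ptr[i + 1])))
    return y
```

For a matrix with nnz nonzeros, storage drops from n·n numbers to nnz values plus nnz + n + 1 indices, and the multiply costs O(nnz) instead of O(n²).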

16.
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
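The two ingredients the abstract names, the simple matching dissimilarity and the frequency-based mode update, can be sketched directly. Initialization below (the first k objects) is a naive assumption for illustration, not the paper's seeding.

```python
from collections import Counter

def matching_dissim(a, b):
    """Simple matching dissimilarity: count of attributes where two
    categorical objects differ."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(data, k, n_iter=10):
    """Minimal k-modes sketch: assign each object to its nearest mode,
    then recompute each mode attribute-wise as the most frequent category."""
    modes = [list(data[i]) for i in range(k)]  # naive seeding
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for obj in data:
            nearest = min(range(k), key=lambda j: matching_dissim(obj, modes[j]))
            clusters[nearest].append(obj)
        for j, members in enumerate(clusters):
            if members:  # mode of each attribute column
                modes[j] = [Counter(col).most_common(1)[0][0]
                            for col in zip(*members)]
    return modes, clusters
```

Replacing means with modes is what lets the k-means iteration structure survive unchanged on purely categorical data; k-prototypes then mixes this dissimilarity with squared Euclidean distance on the numeric attributes.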

17.
18.
Data Mining in Very Large Databases
罗可  吴杰 《计算机工程与应用》2001,37(20):88-91,100
Data mining is also known as knowledge discovery in databases. Traditional data-analysis algorithms assume that the number of relevant records in a database is small; today, however, many databases are so large that main memory cannot hold them. To remain efficient, data-mining techniques applied to large databases must be highly scalable. This article discusses several state-of-the-art algorithms that handle three kinds of data-mining tasks: market-basket analysis, classification, and clustering, and proposes several directions for future research.

19.
An Algorithm for Mining Multi-dimensional, Multi-level Association Rules over Data Cubes
Drawing on the idea of online analytical mining, this paper studies multi-dimensional, multi-level association rule mining over data cubes. Based on the data cube and the FP algorithm, it proposes and builds the Hib&Dim FP-tree, which embodies the concept hierarchy, together with its mining algorithm, Hib&Dim FP, and applies this algorithm to multi-dimensional, multi-level association rule mining over data cubes. Experiments demonstrate the algorithm's effectiveness.

20.
《计算机工程》2017,(9):149-155
The stagewise orthogonal matching pursuit (StOMP) algorithm is fast and computationally cheap, making it well suited to compressed-sensing data reconstruction in wireless sensor networks (WSN). This paper analyzes how the choice of StOMP's threshold affects the reconstruction accuracy of WSN compressed sensing and proposes an adaptive threshold-adjustment method for StOMP. Following the idea of proportional-integral-derivative (PID) control, it computes a threshold adjustment from the current reconstruction error, reruns the reconstruction with the adjusted threshold, and repeats this process to improve reconstruction accuracy. Experimental results show that the method quickly finds a threshold satisfying the error requirement and achieves higher reconstruction accuracy than adjustment methods that use a fixed threshold.
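A basic StOMP stage, plus a deliberately simplified proportional-only caricature of the paper's PID-style outer loop, might look as follows. The threshold scaling by the residual noise level ||r||/√n follows the standard StOMP formulation; the outer-loop update rule, its gains (`t0`, `kp`), and the floor value are illustrative assumptions, not the paper's tuned PID rule.

```python
import numpy as np

def stomp(A, y, threshold, n_stages=10, tol=1e-8):
    """Basic StOMP: each stage selects every atom whose correlation with
    the residual exceeds threshold * (residual noise scale), then
    re-solves least squares on the enlarged support."""
    n, m = A.shape
    support = np.zeros(m, dtype=bool)
    x = np.zeros(m)
    r = y.copy()
    for _ in range(n_stages):
        c = A.T @ r                              # matched-filter correlations
        sigma = np.linalg.norm(r) / np.sqrt(n)   # formal noise level
        hits = np.abs(c) > threshold * sigma
        if not hits.any():
            break
        support |= hits
        x = np.zeros(m)
        x[support], *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        r = y - A @ x
        if np.linalg.norm(r) < tol:
            break
    return x

def adaptive_threshold(A, y, target_err, t0=2.0, kp=0.5, max_rounds=8):
    """Proportional-only sketch of the error-driven idea: rerun StOMP,
    nudging the threshold by the current reconstruction error until the
    error target is met (gains are illustrative)."""
    t = t0
    for _ in range(max_rounds):
        x = stomp(A, y, t)
        err = np.linalg.norm(y - A @ x)
        if err <= target_err:
            return x, t
        t = max(0.5, t - kp * err)  # lower threshold to admit more atoms
    return x, t
```

Because StOMP selects a whole batch of atoms per stage, a few stages usually suffice, which is the speed advantage the abstract cites over one-atom-per-iteration OMP.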
