A total of 20 similar documents were found.
1.
Precomputing a complete data cube gives the fastest query response, but for a large-scale data cube the required storage space is enormous, so usually only some of the aggregates in the cube can be precomputed. This paper proposes PCC (Partial Computation of Cube), an algorithm for computing partial data cubes. Its distinguishing feature is a bottom-up partitioning approach: it determines the dimension partitioning paths according to the aggregates that need to be computed and prunes unnecessary aggregates and partitions. Experiments show that PCC is more efficient than computing a partial data cube with BUC, a method designed for computing the complete data cube.
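As a rough illustration of what "computing only part of the cube" means in practice, the sketch below materializes just a requested subset of group-bys with pandas; the column names and requested cuboids are hypothetical, and this is not the PCC algorithm itself.

```python
# Not the PCC algorithm itself: a toy partial materialization that computes
# only the requested group-bys. Column names and cuboids are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["N", "N", "S", "S"],
    "product": ["A", "B", "A", "B"],
    "year":    [2023, 2023, 2024, 2024],
    "amount":  [10, 20, 30, 40],
})

# Only these aggregates are needed; the remaining cuboids of the full
# 2^3-cuboid cube are never computed.
requested_cuboids = [("region",), ("region", "product")]

partial_cube = {
    dims: sales.groupby(list(dims))["amount"].sum()
    for dims in requested_cuboids
}

print(partial_cube[("region", "product")])
```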
2.
1. Introduction. With the rapid growth of the information and service resources available on the Internet, many powerful search engines retrieve Web documents by content, keywords, and other criteria, but unfortunately the returned results often fail to satisfy users. Cluster analysis can use an unsupervised learning process to gain a preliminary understanding of the distribution and grouping characteristics of a data set whose features are unknown, but the choice of clustering model and the accuracy of the clustering results both affect the quality of the entire knowledge discovery process.
3.
Query speed is a key performance indicator in online analytical processing (OLAP). It is commonly improved by generating all possible aggregates in advance, but such full materialization comes at the cost of storage space. Exploiting the data-distribution characteristics of the data cube together with compression techniques, this paper describes how to perform full materialization while saving as much storage space as possible, and then studies query processing on this basis, aiming at minimal storage space together with good query speed.
4.
Computing LTS Regression for Large Data Sets (Cited by: 9; self-citations: 0; other citations: 9)
Data mining aims to extract previously unknown patterns or substructures from large databases. In statistics, this is what methods of robust estimation and outlier detection were constructed for, see e.g. Rousseeuw and Leroy (1987). Here we will focus on least trimmed squares (LTS) regression, which is based on the subset of h cases (out of n) whose least squares fit possesses the smallest sum of squared residuals. The coverage h may be set between n/2 and n. The computation time of existing LTS algorithms grows too much with the size of the data set, precluding their use for data mining. In this paper we develop a new algorithm called FAST-LTS. The basic ideas are an inequality involving order statistics and sums of squared residuals, and techniques which we call ‘selective iteration’ and ‘nested extensions’. We also use an intercept adjustment technique to improve the precision. For small data sets FAST-LTS typically finds the exact LTS, whereas for larger data sets it gives more accurate results than existing algorithms for LTS and is faster by orders of magnitude. This allows us to apply FAST-LTS to large databases.
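To make the LTS objective concrete, the sketch below iterates a plain "concentration" step: fit least squares on the current h cases, then keep the h cases with the smallest squared residuals. It is only a simplified illustration on made-up data and omits FAST-LTS's selective iteration, nested extensions, and intercept adjustment.

```python
# Simplified LTS "concentration" loop: repeatedly fit least squares on the
# h cases with the smallest squared residuals. Data, h, and the iteration cap
# are made up; FAST-LTS's refinements are deliberately omitted.
import numpy as np

rng = np.random.default_rng(0)
n, h = 200, 150                                  # coverage h between n/2 and n
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 2 + 3 * X[:, 1] + rng.normal(scale=0.5, size=n)
y[:20] += 15                                     # a block of gross outliers

subset = rng.choice(n, h, replace=False)
best_obj = np.inf
for _ in range(50):
    beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
    sq_res = (y - X @ beta) ** 2
    subset = np.argsort(sq_res)[:h]              # keep the h best-fitting cases
    obj = sq_res[subset].sum()
    if obj >= best_obj:                          # objective stopped decreasing
        break
    best_obj = obj

print("LTS-style coefficients:", beta)
```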
5.
6.
In this work, the parallel fast condensed nearest neighbor (PFCNN) rule, a distributed method for computing a consistent subset of a very large data set for the nearest neighbor classification rule is presented. In order to cope with the communication overhead typical of distributed environments and to reduce memory requirements, different variants of the basic PFCNN method are introduced. An analysis of spatial cost, CPU cost, and communication overhead is accomplished for all the algorithms. Experimental results, performed on both synthetic and real very large data sets, revealed that these methods can be profitably applied to enormous collections of data. Indeed, they scale up well and are efficient in memory consumption, confirming the theoretical analysis, and achieve noticeable data reduction and good classification accuracy. To the best of our knowledge, this is the first distributed algorithm for computing a training set consistent subset for the nearest neighbor rule.
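For readers unfamiliar with the notion of a "consistent subset", the sketch below shows the classic sequential condensed nearest neighbor idea (Hart's CNN rule) on toy data: grow a subset until every training point is classified correctly by its nearest neighbor in that subset. It only illustrates the underlying concept, not the parallel PFCNN algorithm.

```python
# Hart's classic condensed nearest neighbor rule (sequential, single machine),
# shown only to illustrate what a "consistent subset" is; this is not PFCNN.
import numpy as np

def condensed_subset(X, y):
    keep = [0]                                    # start from one stored point
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            # 1-NN prediction using only the points kept so far
            d = np.linalg.norm(X[keep] - X[i], axis=1)
            if y[keep[int(np.argmin(d))]] != y[i]:
                keep.append(i)                    # misclassified, so store it
                changed = True
    return np.array(keep)

# Hypothetical toy data: two labeled Gaussian clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
S = condensed_subset(X, y)
print(f"consistent subset keeps {len(S)} of {len(X)} points")
```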
7.
《计算机科学与探索》 (Journal of Frontiers of Computer Science and Technology), 2016, (2): 268-276
To address the difficulty of the original affinity propagation (AP) clustering algorithm in handling multi-exemplar clustering, as well as its excessive space and time overhead, a multi-exemplar affinity propagation clustering algorithm using a fast reduced set density estimator (FRSMEAP) is proposed. In the initial clustering stage, the algorithm applies the fast reduced set density estimator (FRSDE) to preprocess the large-scale data set, obtaining a reduced set that adequately represents the properties of the samples. In the clustering stage, multi-exemplar affinity propagation (MEAP) is used to obtain clearer clustering decision boundaries than AP, thereby improving clustering accuracy. Finally, the K-nearest neighbor (KNN) algorithm assigns the remaining points to produce the final partition of the data. Simulation results on synthetic and real data sets show that the algorithm can not only cluster large-scale data sets, but also offers high clustering accuracy and fast running speed.
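The three-stage pipeline described above can be approximated with off-the-shelf components, as in the sketch below: plain random subsampling stands in for FRSDE, scikit-learn's single-exemplar AffinityPropagation stands in for MEAP, and a KNN classifier assigns the remaining points. All data and parameters are illustrative assumptions.

```python
# Off-the-shelf approximation of the described pipeline. Random subsampling
# replaces FRSDE and standard AffinityPropagation replaces MEAP, so this only
# mirrors the structure (compress -> cluster -> assign), not the algorithm.
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (2000, 2)) for c in (0, 3, 6)])

# Stage 1: compress the large data set to a small representative subset
idx = rng.choice(len(X), size=300, replace=False)
X_reduced = X[idx]

# Stage 2: cluster only the reduced set with affinity propagation
ap = AffinityPropagation(random_state=0).fit(X_reduced)

# Stage 3: assign every remaining point to a cluster with KNN
knn = KNeighborsClassifier(n_neighbors=5).fit(X_reduced, ap.labels_)
labels = knn.predict(X)
print("clusters found:", len(np.unique(labels)))
```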
8.
An Algorithm for Fast Generation of the Minimal Condensed Data Cube (Cited by: 2; self-citations: 0; other citations: 2)
Semantic OLAP techniques have recently become a hot research topic, and the condensed data cube is one of them. This paper designs SQCube, an algorithm for quickly generating the minimal condensed data cube. The algorithm has two phases: it first uses the BottomUpBST algorithm to generate a non-minimal condensed data cube, and then post-processes this non-minimal cube, compressing all of its pure BSTs and hidden BSTs into a single BST, thereby producing a minimal condensed data cube. Experiments show that SQCube clearly outperforms MinCube, a previously proposed algorithm of the same kind.
9.
Exploratory data analysis is a widely used technique to determine which factors have the most influence on data values in a multi-way table, or which cells in the table can be considered anomalous with respect to the other cells. In particular, median polish is a simple yet robust method to perform exploratory data analysis. Median polish is resistant to holes in the table (cells that have no values), but it may require many iterations through the data. This factor makes it difficult to apply median polish to large multidimensional tables, since the I/O requirements may be prohibitive. This paper describes a technique that uses median polish over an approximation of a datacube, easing the burden of I/O. The cube approximation is achieved by fitting log-linear models to the data. The results obtained are tested for quality, using a variety of measures. The technique scales to large datacubes and proves to give a good approximation of the results that would have been obtained by median polish in the original data.
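Since the paper builds on median polish, the sketch below runs plain Tukey median polish on a small, made-up two-way table (rather than on a log-linear approximation of a datacube) to show how row effects, column effects, and residuals are swept out; the large residual flags the anomalous cell.

```python
# Plain Tukey median polish on a small, made-up two-way table: sweep row and
# column medians into row/column effects until the residuals stabilize.
import numpy as np

table = np.array([[12.0, 15.0, 14.0],
                  [18.0, 21.0, 19.0],
                  [11.0, 30.0, 13.0]])    # cell (2, 1) is deliberately anomalous

overall = 0.0
row_eff = np.zeros(table.shape[0])
col_eff = np.zeros(table.shape[1])
resid = table.copy()

for _ in range(10):                       # a handful of sweeps suffices here
    r = np.median(resid, axis=1)          # sweep row medians into row effects
    resid -= r[:, None]
    row_eff += r
    d = np.median(col_eff)
    col_eff -= d
    overall += d
    c = np.median(resid, axis=0)          # sweep column medians into column effects
    resid -= c
    col_eff += c
    d = np.median(row_eff)
    row_eff -= d
    overall += d

print("overall:", overall)
print("residuals:\n", resid)              # the large residual exposes the anomaly
```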
10.
To address the low efficiency encountered when SVM classification handles large-scale training sample sets, caused by high sample dimensionality and heavy memory consumption, a computing strategy based on a grid environment is proposed. For compute-intensive problems, the strategy offers three task-decomposition schemes: by step, by function, and by data; users choose a scheme according to the actual requirements of SVM training and classification. Comparative experiments on remote-sensing data in a single-machine environment and a grid environment show that the strategy improves training and classification speed and, at the level of the computing environment, compensates for the high computational demands of processing large-scale data.
11.
Data preprocessing is a key step in KDD; good preprocessing can greatly improve the efficiency of data mining. This paper proposes a data-cube-based data generalization algorithm for data preprocessing, which provides a favorable data environment for data mining and improves its effectiveness.
12.
Data preprocessing plays a pivotal role in data mining projects and is one of the key steps of the whole data mining process. Based on the characteristics of applying the probabilistic rough set model to data mining, this paper proposes SRII, a probabilistic rough set algorithm based on information induction for data preprocessing. Experiments show that applying SRII together with the C4.5 algorithm to data mining achieves good efficiency and significantly improved mining results.
13.
14.
15.
Solving systems of linear algebraic equations is a problem frequently encountered in engineering, and their coefficient matrices are often large sparse matrices. This paper describes a simple, practical compressed-storage algorithm for solving such systems, which has been implemented in C. Finally, the compressed and uncompressed algorithms are compared.
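As an illustration of the general idea (in Python/SciPy rather than the paper's C implementation), the sketch below stores a large sparse coefficient matrix in compressed sparse row form and solves the system without ever forming the dense matrix; the matrix and right-hand side are made up.

```python
# Compressed sparse storage of a large coefficient matrix plus a direct solve,
# in Python/SciPy instead of the paper's C code; matrix and rhs are made up.
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

n = 100_000
# Tridiagonal coefficient matrix in compressed sparse row (CSR) format:
# roughly 3n stored nonzeros instead of n*n dense entries.
A = diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

x = spsolve(A, b)
print("residual norm:", np.linalg.norm(A @ x - b))
```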
16.
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values (Cited by: 76; self-citations: 0; other citations: 76)
Zhexue Huang, Data Mining and Knowledge Discovery, 1998, 2(3): 283-304
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
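The two ingredients the abstract highlights, the simple matching dissimilarity and frequency-based mode updates, are easy to show in a few lines. The sketch below is a crude illustrative loop on made-up categorical data, not Huang's full k-modes algorithm (it has no empty-cluster handling or cost bookkeeping).

```python
# Illustrative toy loop, not Huang's full algorithm: the simple matching
# dissimilarity plus frequency-based mode updates on made-up categorical data.
import numpy as np

def matching_dissimilarity(mode, X):
    # number of attributes on which each record differs from the mode
    return (X != mode).sum(axis=1)

def cluster_mode(cluster):
    # per-attribute most frequent category: the cluster's "mode"
    modes = []
    for col in cluster.T:
        vals, counts = np.unique(col, return_counts=True)
        modes.append(vals[np.argmax(counts)])
    return np.array(modes)

X = np.array([["red",  "s", "yes"],
              ["red",  "m", "yes"],
              ["blue", "l", "no"],
              ["blue", "m", "no"],
              ["blue", "l", "no"]])

k = 2
modes = X[[0, 2]]                          # initialize with two distinct records
for _ in range(10):                        # crude loop, no empty-cluster handling
    d = np.stack([matching_dissimilarity(m, X) for m in modes])
    labels = d.argmin(axis=0)              # assign each record to its nearest mode
    modes = np.array([cluster_mode(X[labels == j]) for j in range(k)])

print(labels)
print(modes)
```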
17.
18.
Data Mining in Very Large Databases (Cited by: 6; self-citations: 3; other citations: 6)
Data mining is also known as knowledge discovery in databases. Traditional data-analysis algorithms assume that the relevant records in a database are relatively few; however, many of today's databases are so large that the whole database cannot fit in memory. To guarantee high efficiency, data mining techniques applied to large databases must be highly scalable. This paper discusses several state-of-the-art algorithms that handle three kinds of data mining tasks: market basket analysis, classification, and clustering. It also identifies several directions for future research.
19.