A total of 20 similar documents were found (search time: 0 ms)
1.
Using Self-Similarity to Cluster Large Data Sets (Cited by: 6; self-citations: 0; other citations: 6)
Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to scale well with the size of the data set and the number of dimensions that describe the points, to find arbitrarily shaped clusters, or to deal effectively with the presence of noise. In this paper, we present a new clustering algorithm based on the self-similarity properties of the data sets. Self-similarity is the property of being invariant with respect to the scale used to look at the data set. While fractals are self-similar at every scale used to look at them, many data sets exhibit self-similarity over a range of scales. Self-similarity can be measured using the fractal dimension. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will (providing the best answer possible at that point), and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.
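The core move of FC, growing whichever cluster's fractal dimension is least perturbed by a new point, can be illustrated with a box-counting estimate. The sketch below is a minimal illustration under assumed names (box_count_dimension, fc_assign) and a fixed scale list; the paper's FC maintains layered box counts incrementally rather than recomputing the dimension per candidate cluster as done here.

```python
import numpy as np

def box_count_dimension(points, scales=(1.0, 0.5, 0.25, 0.125)):
    """Estimate the fractal (box-counting) dimension of a point set:
    count occupied grid cells at several scales and take the slope of
    log(count) against log(1/scale)."""
    pts = np.asarray(points, dtype=float)
    counts = [len({tuple(c) for c in np.floor(pts / s).astype(int)})
              for s in scales]
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(scales)), np.log(counts), 1)
    return slope

def fc_assign(point, clusters):
    """Place a point in the cluster whose fractal dimension changes least
    when the point is added (clusters: dict label -> list of points, each
    assumed to be seeded with an initial batch)."""
    best, best_change = None, float("inf")
    for label, members in clusters.items():
        change = abs(box_count_dimension(members + [point])
                     - box_count_dimension(members))
        if change < best_change:
            best, best_change = label, change
    clusters[best].append(point)
    return best
```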
2.
Scalable OLAP Query Techniques for Large-Scale Clusters (Cited by: 1; self-citations: 0; other citations: 1)
In the big data era, large-scale clusters built from low-end and mid-range hardware have become one of the mainstream platforms for massive data processing. However, traditional parallel OLAP query algorithms designed for high-end hardware platforms are ill-suited to this massively parallel environment composed of unreliable computing units. To improve scalability and fault tolerance in the new computing environment, this paper reworks the data organization and processing models of the traditional data warehouse and proposes a new join-free snowflake model and the TRM execution model. The join-free snowflake model uses hierarchy encoding to compress key information such as dimension hierarchies into the fact table, so that the fact table can be processed independently, guaranteeing computational independence at the data-model level. The TRM execution model abstracts OLAP query processing into three operations, Transform, Reduce, and Merge, so that an OLAP query can be decomposed into many independent subtasks that execute in parallel, guaranteeing high scalability at the execution level. For performance optimization, the paper proposes Scan-index scanning and skip-scanning algorithms to reduce I/O accesses as much as possible, and designs optimizations such as parallel predicate evaluation and batch predicate evaluation to speed up local computation. Experiments show that the LaScOLAP prototype achieves good scalability and fault tolerance, and outperforms HadoopDB by an order of magnitude.
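As a rough illustration of the TRM decomposition described above, the sketch below runs a toy roll-up query as per-partition Transform and Reduce steps followed by a Merge of partial aggregates. The fact rows, the hierarchy-code layout, and the function names are all assumed toys, not the LaScOLAP implementation.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions of a fact table: (hierarchy_code, sales) rows,
# where dimension-hierarchy information is already encoded into each row.
partitions = [
    [("east-01", 10.0), ("east-02", 5.0)],
    [("east-01", 7.0), ("west-09", 3.0)],
]

def transform(row):
    """Transform: decode the hierarchy code and emit a (group, value) pair."""
    code, sales = row
    return code.split("-")[0], sales  # roll up to the top hierarchy level

def local_reduce(partition):
    """Reduce: aggregate one partition independently on its worker."""
    acc = Counter()
    for row in partition:
        group, value = transform(row)
        acc[group] += value
    return acc

def merge(a, b):
    """Merge: combine partial aggregates from different workers."""
    return a + b  # Counter addition sums values per key

result = reduce(merge, (local_reduce(p) for p in partitions))
print(dict(result))  # {'east': 22.0, 'west': 3.0}
```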
3.
Volker Tresp 《Data mining and knowledge discovery》2001,5(3):197-211
In the form of the support vector machine and Gaussian processes, kernel-based systems are currently very popular approaches to supervised learning. Unfortunately, the computational load for training kernel-based systems increases drastically with the size of the training data set, such that these systems are not ideal candidates for applications with large data sets. Nevertheless, research in this direction is very active. In this paper, I review some of the current approaches toward scaling kernel-based systems to large data sets.
4.
For high-dimensional data in large relational data sets, this paper proposes a visualization method based on cluster-guided tours of three-dimensional projections. The data in the data set are clustered, the cluster centers are selected for projection, the projections are mapped into tetrahedra in three-dimensional space, and all the tetrahedra are then toured once, thereby achieving traversal and visualization of the data.
5.
郝薇 《电脑与微电子技术》2012,(18):3-8,21
To explore the relationships among variables in multidimensional data sets, especially non-functional relationships, each dimension of the n-dimensional cube containing the data set is partitioned, and a natural probability density function is defined on the resulting n-dimensional grid, yielding the mutual information of the data set under a given partition. The maximum mutual information over all partitions, after normalization, forms one entry of the characteristic tensor so defined; MIC is defined as the value of the largest entry of the characteristic tensor under a given bound on the total number of grid cells, and it measures the degree of association among the multidimensional variables.
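A two-variable sketch of the computation the abstract describes may help; note that the abstract's version generalizes to n dimensions and that the real MIC maximizes over all grid partitions via dynamic programming, whereas this sketch substitutes equal-frequency bin edges. The names mutual_info and mic, and the default grid bound B(n) = n^0.6 (the default used in the original MIC paper), are otherwise assumptions.

```python
import numpy as np
from itertools import product

def mutual_info(x, y, nx, ny):
    """Mutual information of x, y discretized on an nx-by-ny grid with
    equal-frequency bin edges (a simplification of MIC's partition search)."""
    ex = np.quantile(x, np.linspace(0, 1, nx + 1))
    ey = np.quantile(y, np.linspace(0, 1, ny + 1))
    ix = np.clip(np.searchsorted(ex, x, side="right") - 1, 0, nx - 1)
    iy = np.clip(np.searchsorted(ey, y, side="right") - 1, 0, ny - 1)
    joint = np.zeros((nx, ny))
    np.add.at(joint, (ix, iy), 1.0)          # joint histogram
    p = joint / joint.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / np.outer(px, py)[mask])).sum())

def mic(x, y, b=None):
    """Maximum normalized mutual information over grids with nx*ny <= B."""
    b = b or int(len(x) ** 0.6)
    best = 0.0
    for nx, ny in product(range(2, b + 1), repeat=2):
        if nx * ny > b:
            continue
        best = max(best, mutual_info(x, y, nx, ny) / np.log(min(nx, ny)))
    return best
```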
6.
Computing LTS Regression for Large Data Sets (Cited by: 9; self-citations: 0; other citations: 9)
Data mining aims to extract previously unknown patterns or substructures from large databases. In statistics, this is what methods of robust estimation and outlier detection were constructed for, see e.g. Rousseeuw and Leroy (1987). Here we will focus on least trimmed squares (LTS) regression, which is based on the subset of h cases (out of n) whose least squares fit possesses the smallest sum of squared residuals. The coverage h may be set between n/2 and n. The computation time of existing LTS algorithms grows too much with the size of the data set, precluding their use for data mining. In this paper we develop a new algorithm called FAST-LTS. The basic ideas are an inequality involving order statistics and sums of squared residuals, and techniques which we call 'selective iteration' and 'nested extensions'. We also use an intercept adjustment technique to improve the precision. For small data sets FAST-LTS typically finds the exact LTS, whereas for larger data sets it gives more accurate results than existing algorithms for LTS and is faster by orders of magnitude. This allows us to apply FAST-LTS to large databases.
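The heart of such algorithms is the concentration step: given any fit, keeping the h cases with the smallest squared residuals and refitting least squares on them can only lower the trimmed objective. Below is a minimal LTS sketch built from random elemental starts plus such C-steps; the 'selective iteration' and 'nested extensions' speedups of FAST-LTS are omitted, and all names and defaults are assumptions.

```python
import numpy as np

def c_step(X, y, h, beta):
    """One concentration step ('C-step'): keep the h cases with the
    smallest squared residuals under the current fit, then refit least
    squares on that subset.  The trimmed objective never increases."""
    r2 = (y - X @ beta) ** 2
    keep = np.argsort(r2)[:h]
    beta_new, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return beta_new, np.sort(r2)[:h].sum()

def lts_regression(X, y, h, n_starts=50, n_steps=10, seed=0):
    """Naive LTS via random elemental starts plus C-steps (a sketch;
    FAST-LTS adds selective iteration and nested extensions on top)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        subset = rng.choice(n, size=p, replace=False)   # elemental start
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        for _ in range(n_steps):
            beta, obj = c_step(X, y, h, beta)
        if obj < best_obj:
            best_beta, best_obj = beta, obj
    return best_beta, best_obj
```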
7.
8.
Liping Ji Kian-Lee Tan Tung A.K.H. 《Knowledge and Data Engineering, IEEE Transactions on》2007,19(9):1175-1187
This paper addresses the problem of finding frequent closed patterns (FCPs) from very dense data sets. We introduce two compressed hierarchical FCP mining algorithms: C-Miner and B-Miner. The two algorithms compress the original mining space, hierarchically partition the whole mining task into independent subtasks, and mine each subtask progressively. The two algorithms adopt different task partitioning strategies: C-Miner partitions the mining task based on Compact Matrix Division, whereas B-Miner partitions the task based on Base Rows Projection. The compressed hierarchical mining algorithms enhance the mining efficiency and facilitate a progressive refinement of results. Moreover, because the subtasks can be mined independently, C-Miner and B-Miner can be readily parallelized without incurring significant communication overhead. We have implemented C-Miner and B-Miner, and our performance study on synthetic data sets and real dense microarray data sets shows their effectiveness over existing schemes. We also report experimental results on parallel versions of these two methods.
9.
《计算机科学与探索》2016,(2):268-276
To address the difficulties of the original affinity propagation (AP) clustering algorithm in handling multi-exemplar clustering, as well as its excessive space and time overhead, a fast multi-exemplar affinity propagation clustering algorithm (multi-exemplar affinity propagation using fast reduced set density estimator, FRSMEAP) is proposed. In the initial clustering stage, the algorithm applies the fast reduced set density estimator (FRSDE) to preprocess the large-scale data set, obtaining a reduced set that adequately represents the sample properties. In the clustering stage, the multi-exemplar affinity propagation (MEAP) clustering algorithm is used to obtain clearer clustering decision boundaries than AP, thereby improving clustering accuracy. Finally, the K-nearest neighbor (KNN) algorithm assigns the remaining points to obtain the final partition of the data. Simulation results on synthetic and real data sets show that the algorithm can not only cluster large-scale data sets, but also offers high clustering accuracy and fast execution.
10.
In this work, the parallel fast condensed nearest neighbor (PFCNN) rule, a distributed method for computing a consistent subset of a very large data set for the nearest neighbor classification rule, is presented. In order to cope with the communication overhead typical of distributed environments and to reduce memory requirements, different variants of the basic PFCNN method are introduced. An analysis of spatial cost, CPU cost, and communication overhead is carried out for all the algorithms. Experimental results, performed on both synthetic and real very large data sets, revealed that these methods can be profitably applied to enormous collections of data. Indeed, they scale up well and are efficient in memory consumption, confirming the theoretical analysis, and achieve noticeable data reduction and good classification accuracy. To the best of our knowledge, this is the first distributed algorithm for computing a training set consistent subset for the nearest neighbor rule.
11.
This work has two main objectives, namely, to introduce a novel algorithm, called the fast condensed nearest neighbor (FCNN) rule, for computing a training-set-consistent subset for the nearest neighbor decision rule, and to show that condensation algorithms for the nearest neighbor rule can be applied to huge collections of data. The FCNN rule has some interesting properties: it is order independent, its worst-case time complexity is quadratic but often with a small constant prefactor, and it is likely to select points very close to the decision boundary. Furthermore, its structure allows for the triangle inequality to be effectively exploited to reduce the computational effort. The FCNN rule outperformed even the enhanced variants of existing competence preservation methods introduced here, both in terms of learning speed and learning scaling behavior and, often, in terms of the size of the model, while it guaranteed the same prediction accuracy. Furthermore, it was three orders of magnitude faster than hybrid instance-based learning algorithms on the MNIST and Massachusetts Institute of Technology (MIT) Face databases and computed a model of accuracy comparable to that of methods incorporating a noise-filtering pass.
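A simplified condensation loop in the spirit of the FCNN rule, assuming Euclidean distance: seed one representative per class near its centroid, then repeatedly add, within each subset point's Voronoi cell, the closest misclassified training point. The paper's exact representative-selection rule differs, and the brute-force distance matrix here ignores the triangle-inequality pruning the authors exploit; fcnn_condense is a hypothetical name.

```python
import numpy as np

def fcnn_condense(X, y):
    """Grow a consistent subset S: while any training point is
    misclassified by its nearest neighbor in S, add per Voronoi cell
    of S the closest misclassified point."""
    S = []
    for c in np.unique(y):                    # seed: point nearest each
        idx = np.where(y == c)[0]             # class centroid
        centroid = X[idx].mean(axis=0)
        S.append(idx[np.argmin(((X[idx] - centroid) ** 2).sum(axis=1))])
    while True:
        d = ((X[:, None, :] - X[S][None, :, :]) ** 2).sum(axis=2)
        nearest = np.argmin(d, axis=1)        # column index into S
        wrong = np.where(y != y[np.array(S)][nearest])[0]
        if len(wrong) == 0:
            return np.array(S)                # subset is now consistent
        for j in range(len(S)):               # one addition per Voronoi
            cand = wrong[nearest[wrong] == j] # cell of S[j]
            if len(cand) > 0:
                S.append(cand[np.argmin(d[cand, j])])
```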
12.
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values (Cited by: 76; self-citations: 0; other citations: 76)
Zhexue Huang 《Data mining and knowledge discovery》1998,2(3):283-304
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
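A minimal sketch of k-modes built from the three ingredients the abstract names (simple matching dissimilarity, modes in place of means, and frequency-based mode updates). Initialization, tie-breaking, and the toy data are assumptions, and the k-prototypes extension for mixed attributes is omitted.

```python
import numpy as np
from collections import Counter

def matching_dissim(a, b):
    """Simple matching dissimilarity: the number of attributes that differ."""
    return sum(u != v for u, v in zip(a, b))

def k_modes(records, k, n_iter=10, seed=0):
    """Assign records to the nearest mode under simple matching, then
    update each mode attribute-wise with the most frequent category in
    its cluster (the frequency-based update)."""
    rng = np.random.default_rng(seed)
    modes = [records[i] for i in rng.choice(len(records), size=k, replace=False)]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for rec in records:
            j = min(range(k), key=lambda m: matching_dissim(rec, modes[m]))
            clusters[j].append(rec)
        for j, members in enumerate(clusters):
            if members:
                modes[j] = tuple(Counter(col).most_common(1)[0][0]
                                 for col in zip(*members))
    return modes, clusters

# Toy categorical records (color, size):
data = [("red", "s"), ("red", "m"), ("blue", "l"), ("blue", "l"), ("red", "s")]
modes, clusters = k_modes(data, k=2)
```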
13.
Evaluation models traditionally built on a single algorithm such as a neural network suffer from shortcomings including low accuracy and slow network learning. To address this, an evaluation method for large-scale data sets based on rough sets and an RBF neural network is proposed. First, rough set theory is used to classify and reduce the broad attribute set determined from large-scale, high-dimensional data; the processed data indicators are then used as inputs for training and simulating an RBF neural network. Taking the evaluation of highway pavement performance as...
14.
Exploratory data analysis is a widely used technique to determine which factors have the most influence on data values in a multi-way table, or which cells in the table can be considered anomalous with respect to the other cells. In particular, median polish is a simple yet robust method to perform exploratory data analysis. Median polish is resistant to holes in the table (cells that have no values), but it may require many iterations through the data. This factor makes it difficult to apply median polish to large multidimensional tables, since the I/O requirements may be prohibitive. This paper describes a technique that uses median polish over an approximation of a datacube, easing the burden of I/O. The cube approximation is achieved by fitting log-linear models to the data. The results obtained are tested for quality, using a variety of measures. The technique scales to large datacubes and proves to give a good approximation of the results that would have been obtained by median polish in the original data.
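For reference, classic median polish on a two-way table looks roughly as follows; holes are marked as NaN and skipped via nanmedian. The paper's contribution, running the polish over a log-linear approximation of the datacube to cut I/O, is not modeled here.

```python
import numpy as np

def median_polish(table, n_iter=10):
    """Alternately sweep row medians and column medians out of the cells,
    accumulating them as row and column effects; the cells that remain
    are residuals, and unusually large residuals flag anomalous cells."""
    t = np.array(table, dtype=float)
    row_eff = np.zeros(t.shape[0])
    col_eff = np.zeros(t.shape[1])
    for _ in range(n_iter):
        rmed = np.nanmedian(t, axis=1)       # sweep row medians
        t -= rmed[:, None]
        row_eff += rmed
        cmed = np.nanmedian(t, axis=0)       # sweep column medians
        t -= cmed[None, :]
        col_eff += cmed
    overall = np.median(row_eff) + np.median(col_eff)   # common effect
    row_eff -= np.median(row_eff)
    col_eff -= np.median(col_eff)
    return overall, row_eff, col_eff, t      # t holds the residuals
```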
15.
An SVM Parameter Optimization Method for Big Data (Cited by: 2; self-citations: 0; other citations: 2)
This paper studies fast parameter optimization for data regression. Traditional SVM parameter optimization relies on an exhaustive search over a wide range, which consumes a great deal of time and is unsuitable for training on large data sets. To shorten the search, a strategy based on uniform design and self-calling support vector regression is proposed. Uniform design generates 27 representative parameter combinations, and cross-validation on the training set yields the mean squared error (MSE) of each combination. With MSE as the objective function, self-calling support vector regression then builds a model relating MSE to the 27 parameter combinations. This model predicts the MSE of 729 parameter combinations, and the combination with the smallest predicted MSE is taken as optimal. Simulation results on three real data sets show that the new method greatly shortens training and modeling time while preserving prediction accuracy, providing a new and effective solution for SVM parameter selection on large data sets.
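A sketch of this strategy using scikit-learn's SVR: measure cross-validated MSE on a small representative grid, fit an SVR meta-model from log-parameters to MSE, then rank a much larger grid by predicted MSE. The 27 and 729 combination counts follow the abstract, but the grids below are full-factorial stand-ins rather than a true uniform-design table, and all names are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Hypothetical grids: 27 coarse (C, gamma, epsilon) triples and a
# 729-point fine grid (a uniform-design table would spread its 27
# points differently; a 3x3x3 factorial stands in for it here).
coarse = np.array([(c, g, e) for c in (0.1, 1.0, 10.0)
                   for g in (0.01, 0.1, 1.0) for e in (0.01, 0.1, 1.0)])
fine = np.array([(c, g, e) for c in np.logspace(-1, 1, 9)
                 for g in np.logspace(-2, 0, 9) for e in np.logspace(-2, 0, 9)])

def meta_model_search(X, y, coarse, fine, cv=5):
    """Fit an SVR meta-model from log-parameters to cross-validated MSE
    ('self-calling' SVR), then return the predicted-best triple."""
    mse = [-cross_val_score(SVR(C=c, gamma=g, epsilon=e), X, y, cv=cv,
                            scoring="neg_mean_squared_error").mean()
           for c, g, e in coarse]
    meta = SVR().fit(np.log10(coarse), mse)
    return fine[int(np.argmin(meta.predict(np.log10(fine))))]
```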
16.
Xiaoyan Li Lidan Shou Gang Chen Tianlei Hu Jinxiang Dong 《Knowledge and Data Engineering, IEEE Transactions on》2008,20(11):1566-1580
In this paper, we propose an image semantic model based on the knowledge and criteria in the field of linguistics and taxonomy. Our work bridges the "semantic gap" by seamlessly exploiting the synergy of both visual feature processing and semantic relevance computation in a new way, and provides improved query efficiency and effectiveness for large general image databases. Our main contributions are as follows: We design novel data structures, namely a Lexical Hierarchy, an Image-Semantic Hierarchy, and a number of Atomic Semantic Domains, to capture the semantics and the features of the database, and to provide the indexing scheme. We present a novel image query algorithm based on the proposed structures. In addition, we propose a novel term expansion mechanism to improve the lexical processing. Our extensive experiments indicate that our proposed techniques are effective in achieving high run-time performance with improved retrieval accuracy. The experiments also show that the proposed method has good scalability.
17.
Stergios Papadimitriou Seferina Mavroudi Liviu Vladutu G. Pavlides Anastasios Bezerianos 《Applied Intelligence》2002,16(3):185-203
Complex application domains involve difficult pattern classification problems. The state space of these problems consists of regions that lie near class separation boundaries and require the construction of complex discriminants, while for the remaining regions the classification task is significantly simpler. The motivation for developing the Supervised Network Self-Organizing Map (SNet-SOM) model is to exploit this fact for designing computationally effective solutions. Specifically, the SNet-SOM utilizes unsupervised learning for classifying at the simple regions and supervised learning for the difficult ones in a two-stage learning process. The unsupervised learning approach is based on the Self-Organizing Map (SOM) of Kohonen. The basic SOM is modified with a dynamic node insertion/deletion process controlled with an entropy-based criterion that allows an adaptive extension of the SOM. This extension proceeds until the total number of training patterns that are mapped to neurons with high entropy (and therefore with ambiguous classification) reduces to a size manageable numerically with a capable supervised model. The second learning phase (the supervised training) has the objective of constructing better decision boundaries at the ambiguous regions. At this phase, a special supervised network is trained for the computationally reduced task of performing the classification at the ambiguous regions only. The performance of the SNet-SOM has been evaluated on both synthetic data and on an ischemia detection application with data extracted from the European ST-T database. In all cases, the utilization of SNet-SOM with supervised learning based on both Radial Basis Functions and Support Vector Machines has significantly improved the results relative to those obtained with the unsupervised SOM and has enhanced the scalability of the supervised learning schemes. The highly disciplined design of the generalization performance of the Support Vector Machine allows designing the proper model for the particular training set.
18.
This paper describes an efficient approach to record linkage. Given two lists of records, the record-linkage problem consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined based on domain-specific similarities over individual attributes. The record-linkage problem arises naturally in the context of data cleansing that usually precedes data analysis and mining. Since the scalability issue of record linkage was addressed in [21], the repertoire of database techniques dealing with multidimensional data sets has significantly increased. Specifically, many effective and efficient approaches for distance-preserving transforms and similarity joins have been developed. Based on these advances, we explore a novel approach to record linkage. For each attribute of records, we first map values to a multidimensional Euclidean space that preserves domain-specific similarity. Many mapping algorithms can be applied, and we use the Fastmap approach [16] as an example. Given the merging rule that defines when two records are similar based on their attribute-level similarities, a set of attributes are chosen along which the merge will proceed. A multidimensional similarity join over the chosen attributes is used to find similar pairs of records. Our extensive experiments using real data sets show that our solution has very good efficiency and recall.

Part of this article was published in [28]. In addition to the prior materials, this article contains more analysis, a complete proof, and more experimental results that were not included in the original paper.
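The similarity-join stage described above can be sketched with a k-d tree radius join over record embeddings, assuming a FastMap-style, similarity-preserving mapping has already been applied; similarity_join, the toy vectors, and the eps threshold are assumptions, and the attribute selection and merging rule of the paper are not modeled.

```python
import numpy as np
from scipy.spatial import cKDTree

def similarity_join(A, B, eps):
    """Return candidate record pairs (i, j) whose embeddings lie within
    Euclidean distance eps; these would then be checked against the
    domain-specific merging rule."""
    ta, tb = cKDTree(A), cKDTree(B)
    pairs = ta.query_ball_tree(tb, r=eps)
    return [(i, j) for i, js in enumerate(pairs) for j in js]

# Toy usage with 2-D embeddings of two record lists:
A = np.array([[0.0, 0.0], [5.0, 5.0]])
B = np.array([[0.1, -0.1], [9.0, 9.0]])
print(similarity_join(A, B, eps=0.5))   # [(0, 0)]
```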
19.
Weili Wu Hong Gao Jianzhong Li 《Knowledge and Data Engineering, IEEE Transactions on》2006,18(12):1667-1680
Data compression is an effective technique to improve the performance of data warehouses. Since the cube operation represents the core of online analytical processing in data warehouses, it is a major challenge to develop efficient algorithms for computing cubes on compressed data warehouses. To our knowledge, very few cube computation techniques have been proposed for compressed data warehouses to date in the literature. This paper presents a novel algorithm to compute cubes on compressed data warehouses. The algorithm operates directly on compressed data sets without the need of first decompressing them. The algorithm is applicable to a large class of mapping-complete data compression methods. The complexity of the algorithm is analyzed in detail. The analytical and experimental results show that the algorithm is more efficient than all other existing cube algorithms. In addition, a heuristic algorithm to generate an optimal plan for computing the cube is also proposed.
20.