期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Making SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing 总被引：2，自引：0，他引：2

Hwanjo?Yu Email author Jiong?Yang Jiawei?Han Xiaolei?Li 《Data mining and knowledge discovery》2005,11(3):295-321

Support vector machines (SVMs) have been promising methods for classification and regression analysis due to their solid mathematical foundations, which include two desirable properties: margin maximization and nonlinear classification using kernels. However, despite these prominent properties, SVMs are usually not chosen for large-scale data mining problems because their training complexity is highly dependent on the data set size. Unlike traditional pattern recognition and machine learning, real-world data mining applications often involve huge numbers of data records. Thus it is too expensive to perform multiple scans on the entire data set, and it is also infeasible to put the data set in memory. This paper presents a method, Clustering-Based SVM (CB-SVM), that maximizes the SVM performance for very large data sets given a limited amount of resource, e.g., memory. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once to provide an SVM with high quality samples. These samples carry statistical summaries of the data and maximize the benefit of learning. Our analyses show that the training complexity of CB-SVM is quadratically dependent on the number of support vectors, which is usually much less than that of the entire data set. Our experiments on synthetic and real-world data sets show that CB-SVM is highly scalable for very large data sets and very accurate in terms of classification. A preliminary version of the paper, “Classifying Large Data Sets Using SVM with Hierarchical Clusters”, by H. Yu, J. Yang, and J. Han, appeared in Proc. 2003 Int. Conf. on Knowledge Discovery in Databases (KDD'03), Washington, DC, August 2003. However, this submission has substantially extended the previous paper and contains new and major-value added technical contribution in comparison with the conference publication.

Hwanjo Yu (Corresponding author)Email:

Jiong YangEmail:

Jiawei HanEmail:

Xiaolei LiEmail:

相似文献

2.

基于自相似的金融时间序列波动聚集性研究 总被引：1，自引：0，他引：1

黄超吴清烈武忠朱扬勇《计算机工程与应用》2005,41(32):12-14

自相似与波动聚集性是金融时间序列的两个重要特征,文章将这两个特征结合,提出了一种基于自相似的波动聚集模型。基于该模型提出了一种基于拟合优度与趋势变动的联机时间序列分割算法,算法能够根据波动的自相似特征将序列分割为多个子序列,从而用于研究在不同时段金融时间序列波动的自相似性。对实际数据的实验结果表明,文章所提出的模型和分割算法是有效的。相似文献

3.

大数据的密度统计合并算法

刘贝贝马儒宁丁军娣《软件学报》2015,26(11):2820-2835

针对处理大数据时传统聚类算法失效或效果不理想的问题,提出了一种大数据的密度统计合并算法(density-based statistical merging algorithm for large data sets,简称DSML).该算法将数据点的每个特征看作一组独立随机变量,并根据独立有限差分不等式获得统计合并判定准则.首先,使用统计合并判定准则对Leaders算法做出改进,获得代表点集;随后,结合代表点的密度和邻域信息,再次使用统计合并判定准则完成对整个数据集的聚类.理论分析和实验结果表明,DSML算法具有近似线性的时间复杂度,能处理任意形状的数据集,且对噪声具有良好的鲁棒性,非常有利于处理大规模数据集. 相似文献

4.

Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values 总被引：76，自引：0，他引：76

Zhexue Huang 《Data mining and knowledge discovery》1998,2(3):283-304

The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications. 相似文献

5.

自相似特性对网络技术的影响

谭晓玲梅成刚刘兰《计算机与数字工程》2006,34(2):5-8

介绍网络业务流自相似特性的研究进展，并对自相似特性的产生进行了深入分析。研究自相似特性对网络技术的影响，对进一步研究网络业务量的自相似性有理论指导意义。相似文献

6.

Scaling Kernel-Based Systems to Large Data Sets

Volker Tresp 《Data mining and knowledge discovery》2001,5(3):197-211

In the form of the support vector machine and Gaussian processes, kernel-based systems are currently very popular approaches to supervised learning. Unfortunately, the computational load for training kernel-based systems increases drastically with the size of the training data set, such that these systems are not ideal candidates for applications with large data sets. Nevertheless, research in this direction is very active. In this paper, I review some of the current approaches toward scaling kernel-based systems to large data sets. 相似文献

7.

数据挖掘中变量聚类方法的应用研究 总被引：5，自引：0，他引：5

汤效琴戴汝源徐琪《计算机工程与应用》2004,40(24):171-173,194

讨论了变量聚类方法中相同类型变量相似性测度方法,首次提出一种关于混合变量间相似性测度的方法。并将基于变量的聚类分析和模糊聚类结合起来,为解决数据挖掘中基于变量聚类问题提供了有效的分析工具。最后给出一个应用实例。相似文献

8.

基于核集合的大数据快速Kernel Grower 聚类方法

常亮邓小明郑碎武王永庆《自动化学报》2008,34(3):376-382

Kernel Grower 是一种有效的核聚类方法, 它具有计算精度高的优点. 然而, Kernel Grower在应用中的一个关键问题是对于大规模数据运算速度缓慢, 这在很大程度上制约了该方法的应用. 本文提出了一种大规模数据的快速核聚类方法, 该方法通过近似最小包含球快速算法, 显著地提高了的Kernel Grower计算速度, 并且该方法的计算复杂度仅与样本个数成线性关系. 在人工数据集和标准测试集上的模拟实验均说明本文算法的有效性. 本文还给出该方法在真实彩色图像分割中应用. 相似文献

9.

采用元组聚类的增量式数据分区方法

下载免费PDF全文

吕晨房俊韩燕波《计算机科学与探索》2011,5(8):719-729

数据分区是提升数据库可扩展能力的有效方法。在事务查询密集的系统中,合理的分区策略可减少分布式事务查询数量,并提高事务查询响应速度。提出了一种基于元组聚类的增量式分区方法,通过将元组聚簇和采用分区感知的数据筛选策略来降低算法的复杂度。首先依据时间窗口模型聚类元组,并构建簇节点图,然后利用分区感知策略对图进行删减,最后采用图划分算法对图进行子图划分来得到分区。与现有方法相比,该方法减少了分区响应时间,保证了较少的分布式事务数量,并提高了分区事务查询速度。相似文献

10.

Supporting Efficient Record Linkage for Large Data Sets Using Mapping Techniques

Chen Li Liang Jin Sharad Mehrotra 《World Wide Web》2006,9(4):557-584

This paper describes an efficient approach to record linkage. Given two lists of records, the record-linkage problem consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined based on domain-specific similarities over individual attributes. The record-linkage problem arises naturally in the context of data cleansing that usually precedes data analysis and mining. Since the scalability issue of record linkage was addressed in [21], the repertoire of database techniques dealing with multidimensional data sets has significantly increased. Specifically, many effective and efficient approaches for distance-preserving transforms and similarity joins have been developed. Based on these advances, we explore a novel approach to record linkage. For each attribute of records, we first map values to a multidimensional Euclidean space that preserves domain-specific similarity. Many mapping algorithms can be applied, and we use the Fastmap approach [16] as an example. Given the merging rule that defines when two records are similar based on their attribute-level similarities, a set of attributes are chosen along which the merge will proceed. A multidimensional similarity join over the chosen attributes is used to find similar pairs of records. Our extensive experiments using real data sets show that our solution has very good efficiency and recall. Part of this article was published in [28]. In addition to the prior materials, this article contains more analysis, a complete proof, and more experimental results that were not included in the original paper. 相似文献

11.

基于分治法的高维大数据集模糊聚类算法

王宝文阎俊梅刘文远石岩《计算机工程》2007,33(24):60-62

将高维的大数据集随机分成若干个子集,对每个子集聚类采用一种基于遗传算法的高维数据模糊聚类方法。该方法引入了一个模糊非相似矩阵来表示高维样本之间的非相似程度,并将高维样本随机初始化到二维平面,利用遗传算法迭代优化二维样本的坐标值,实现二维样本之间的欧氏距离向样本间的模糊非相似度的趋近。将得到的最优的二维样本用模糊C-均值聚类(FCM)算法聚类,克服了聚类有效性对高维样本空间分布的依赖。实验仿真表明,该算法有较好的聚类效果,且极大地提高了聚类的速度。相似文献

12.

利用改进变色龙算法进行有障碍物体聚类

文俊浩何光辉任海军《计算机工程与应用》2005,41(32):28-29,52

聚类分析由于其应用较为广泛,已经成为数据挖掘、数理统计等学科的一个活跃的研究领域。聚类技术可以应用于模式识别、数据分析、图像处理、网页挖掘、电子商务等。以往的聚类分析都没有考虑现实世界存在的物体障碍问题从而影响聚类结果。该文对有障碍物体聚类问题进行了初步的探讨,并且提出了一种称之为改进的变色龙(ADP-Chameleon)的算法来解决有障碍物体聚类问题。相似文献

13.

Efficient Fitting and Rendering of Large Scattered Data Sets Using Subdivision Surfaces 总被引：2，自引：0，他引：2

VincentScheib JörgHaber Ming C.Lin Hans-PeterSeidel 《Computer Graphics Forum》2002,21(3):353-362

相似文献

14.

大型关系数据集中高维数据可视化处理的一种方法

兰红王建雄刘发升《自动化技术与应用》2006,25(11):21-23

本文针对大型关系数据集的高维数据,提出了一种基于聚类指引旅行的三维投影的可视化方法.将数据集中的数据聚类,选择聚类的中心点进行投影,将投影映射到三维空间的四面体中,然后将所有的四面体旅行一遍,从而实现数据的遍历和可视化. 相似文献

15.

面向大规模机群的可扩展OLAP查询技术 总被引：1，自引：0，他引：1

王会举覃雄派王珊张延松李芙蓉《计算机学报》2015,38(1)

大数据时代,由中低端硬件组成的大规模机群逐渐成为海量数据处理的主流平台之一.然而传统基于高端硬件平台设计的并行OLAP查询算法并不适应这种由不可靠计算单元组成的大规模并行计算的环境.为改善其在新计算环境下的的扩展性和容错性,该文对传统数据仓库的数据组织模式及处理模式进行改造,提出了全新的无连接雪花模型和TRM执行模型.无连接雪花模型基于层次编码技术,将维表层次等关键信息压缩进事实表,使得事实表可以独立处理数据,从数据模型层保证了数据计算的独立性;TRM执行模型将OLAP查询的处理抽象为Transform、Reduce、Merge3个操作,使得OLAP查询可被划分为众多可并行执行的独立子任务,从执行层保证了系统的高度可扩展特性.在性能优化方面,该文提出了Scan-index扫描和跳跃式扫描算法,以尽可能地减少I/O访问操作;设计了并行谓词判断、批量谓词判断等优化算法,以加速本地计算速度.实验表明:LaScOLAP原型可以获得较好的扩展性和容错性,其性能比HadoopDB高出一个数量级. 相似文献

16.

面向超大数据集的SVM近似训练算法

曾志强廖备水高济《计算机科学》2009,36(11):208-212

标准SVM学习算法运行所需的时间和空间复杂度分别为O(l~3)和O(l~2),l为训练样本的数量,因此不适用于对超大数据集进行训练.提出一种基于近似解的SVM训练算法:Approximate Vector Machine(AVM).AVM采用增量学习的策略来寻找近似最优分类超平面,并且在迭代过程中采用热启动及抽样技巧来加快训练速度.理论分析表明,该算法的计算复杂度与训练样本的数量无关,因此具有良好的时间与空间扩展性.在超大数据集上的实验结果表明,该算法在极大提高训练速度的同时,仍然保持了原始分类器的泛化性能,并且训练完毕具有较少的支持向量,因此结果分类器具有更快的分类速度. 相似文献

17.

大型数据集的多维变量关联性挖掘

郝薇《电脑与微电子技术》2012,(18):3-8,21

为了探索多维数据集中各变量之间的关系,特别是非函数关系,对数据集所在的n维立方体的各个维度进行划分,在得到的n维网格中定义自然概率密度函数,以此得到数据集在特定划分下的互信息值。对所有的划分取互信息最大值,正态化后即为所定义的特征张量的一项,取特征张量在给定最大网格数下的最大项的值定义为MIC,它度量了多维变量间的关联程度。相似文献

18.

Computing LTS Regression for Large Data Sets 总被引：9，自引：0，他引：9

PETER J. ROUSSEEUW KATRIEN VAN DRIESSEN 《Data mining and knowledge discovery》2006,12(1):29-45

Data mining aims to extract previously unknown patterns or substructures from large databases. In statistics, this is what methods of robust estimation and outlier detection were constructed for, see e.g. Rousseeuw and Leroy (1987). Here we will focus on least trimmed squares (LTS) regression, which is based on the subset of h cases (out of n) whose least squares fit possesses the smallest sum of squared residuals. The coverage h may be set between n/2 and n. The computation time of existing LTS algorithms grows too much with the size of the data set, precluding their use for data mining. In this paper we develop a new algorithm called FAST-LTS. The basic ideas are an inequality involving order statistics and sums of squared residuals, and techniques which we call ‘selective iteration’ and ‘nested extensions’. We also use an intercept adjustment technique to improve the precision. For small data sets FAST-LTS typically finds the exact LTS, whereas for larger data sets it gives more accurate results than existing algorithms for LTS and is faster by orders of magnitude. This allows us to apply FAST-LTS to large databases. 相似文献

19.

Concept Decompositions for Large Sparse Text Data Using Clustering 总被引：27，自引：0，他引：27

Dhillon Inderjit S. Modha Dharmendra S. 《Machine Learning》2001,42(1-2):143-175

Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors–a few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high-dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain fractal-like and self-similar behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned by all the concept vectors. We empirically establish that the approximation errors of the concept decompositions are close to the best possible, namely, to truncated singular value decompositions. As our third contribution, we show that the concept vectors are localized in the word space, are sparse, and tend towards orthonormality. In contrast, the singular vectors are global in the word space and are dense. Nonetheless, we observe the surprising fact that the linear subspaces spanned by the concept vectors and the leading singular vectors are quite close in the sense of small principal angles between them. In conclusion, the concept vectors produced by the spherical k-means algorithm constitute a powerful sparse and localized basis for text data sets. 相似文献

20.

Evolutionary Design of Arbitrarily Large Sorting Networks Using Development

Luká??Sekanina Email author Michal?Bidlo 《Genetic Programming and Evolvable Machines》2005,6(3):319-347

An evolutionary algorithm is combined with an application-specific developmental scheme in order to evolve efficient arbitrarily large sorting networks. First, a small sorting network (that we call the embryo) has to be prepared to solve the trivial instance of a problem. Then the evolved program (the constructor) is applied on the embryo to create a larger sorting network (solving a larger instance of the problem). Then the same constructor is used to create a new instance of the sorting network from the created larger sorting network and so on. The proposed approach allowed us to rediscover the conventional principle of insertion which is traditionally used for constructing large sorting networks. Furthermore, the principle was improved by means of the evolutionary technique. The evolved sorting networks exhibit a lower implementation cost and delay. 相似文献