Similar Documents
 20 similar documents found
1.
We are witnessing the era of big data computing, where computing resources are becoming the main bottleneck in dealing with such large datasets. For high-dimensional data, where each view of the data has high dimensionality, feature selection is necessary to further improve clustering and classification results. In this paper, we propose a new feature selection method, the Incremental Filtering Feature Selection (IF2S) algorithm, and a new clustering algorithm, the Temporal Interval based Fuzzy Minimal Clustering (TIFMC) algorithm, which employ fuzzy rough sets for selecting an optimal subset of features and for effectively grouping large volumes of data, respectively. An extensive experimental comparison of the proposed method with other methods is carried out using four different classifiers. The proposed algorithms yield promising results in feature selection, clustering, and classification accuracy in the field of biomedical data mining.

2.
Frequent itemset mining is one of the data mining techniques used to discover frequent patterns, which are applied in prediction, association rule mining, classification, etc. The Apriori algorithm is an iterative algorithm used to find frequent itemsets in a transactional dataset. It scans the complete dataset in each iteration to generate the frequent itemsets of different cardinality, which is workable for small data but not feasible for big data. The MapReduce framework provides a distributed environment to run Apriori on big transactional data, but MapReduce is not well suited to iterative processing, which degrades performance. We introduce a novel algorithm named Hybrid Frequent Itemset Mining (HFIM), which utilizes a vertical layout of the dataset to avoid scanning the dataset in each iteration: the vertical dataset carries the information needed to compute the support of each itemset. We also include enhancements to reduce the number of candidate itemsets. The proposed algorithm is implemented on the Spark framework, which incorporates the concept of resilient distributed datasets and performs in-memory processing to optimize execution time. We compare the performance of HFIM with another Spark-based implementation of the Apriori algorithm on various datasets. Experimental results show that HFIM performs better in terms of execution time and space consumption.
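The core idea of the vertical layout can be illustrated with a small, self-contained sketch in plain Python (not the authors' Spark implementation): each item maps to the set of transaction IDs that contain it, so the support of any candidate itemset is the size of an intersection and no rescan of the raw data is needed. The sample data and the name `tid_lists` are illustrative.

```python
from itertools import combinations

# Horizontal view: transaction id -> items bought.
transactions = {
    1: {"a", "b", "c"},
    2: {"a", "c"},
    3: {"a", "d"},
    4: {"b", "e"},
}

# Build the vertical layout once: item -> set of transaction ids containing it.
tid_lists = {}
for tid, items in transactions.items():
    for item in items:
        tid_lists.setdefault(item, set()).add(tid)

def support(itemset):
    """Support of an itemset = size of the intersection of its tid-lists."""
    return len(set.intersection(*(tid_lists[i] for i in itemset)))

min_support = 2
frequent_pairs = [pair for pair in combinations(sorted(tid_lists), 2)
                  if support(pair) >= min_support]
print(frequent_pairs)   # [('a', 'c')]
```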

3.
Traditional data mining methods are no longer adequate for the massive information brought by big data. In recent years many researchers have proposed new data mining approaches or improved traditional ones, but these are still far from sufficient for processing such massive information. Building on a summary of existing methods, this paper proposes a Canopy-based parallel K-Means algorithm. Compared with the traditional K-Means algorithm, the proposed improvement determines the initial centers by density and then runs K-Means on a Hadoop distributed cluster. Experiments show that, while preserving accuracy, the method reduces computational complexity and thereby improves computational efficiency.
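A rough single-machine sketch of a Canopy pre-pass of this kind is shown below; it is not the paper's Hadoop implementation, and the thresholds t1 > t2 and the Euclidean distance are illustrative choices. The resulting canopy centers can then seed K-Means in place of random initialization.

```python
import random
import numpy as np

def canopy_pass(points, t1, t2, seed=0):
    """Simplified Canopy pre-pass (requires t1 > t2).
    Returns (centers, canopies): the centers can seed K-Means, the loose
    threshold t1 defines canopy membership, and the tight threshold t2 removes
    points from the candidate pool so they cannot start another canopy."""
    rng = random.Random(seed)
    candidates = set(range(len(points)))
    centers, canopies = [], []
    while candidates:
        idx = rng.choice(sorted(candidates))
        center = points[idx]
        dists = np.linalg.norm(points - center, axis=1)
        canopies.append(np.where(dists <= t1)[0])                 # loose membership
        candidates -= set(np.where(dists <= t2)[0].tolist())      # tight removal
        centers.append(center)
    return np.array(centers), canopies

points = np.random.default_rng(1).normal(size=(200, 2))
centers, _ = canopy_pass(points, t1=1.5, t2=0.8)
print(len(centers))   # number of canopies suggests k; the centers seed K-Means
```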

4.
BIRCH: A New Data Clustering Algorithm and Its Applications
Data clustering is an important technique for exploratory data analysis and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as a particular branch of it. However, existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and CPU cycles), so as the dataset size increases, they do not scale up well in terms of memory requirement, running time, and result quality. In this paper, an efficient and scalable data clustering method is proposed, based on a new in-memory data structure called the CF-tree, which serves as an in-memory summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability, and scalability; we also compare it with other available methods. Finally, BIRCH is applied to solve two real-life problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression.
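The clustering feature (CF) at the heart of the CF-tree can be summarized in a few lines. The sketch below shows only the (N, LS, SS) triple, its additivity, and the derived centroid and radius, not the tree maintenance itself.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CF:
    """Clustering Feature: the (N, LS, SS) summary of a set of points."""
    n: int
    ls: np.ndarray   # linear sum of the points
    ss: float        # sum of squared norms of the points

    @classmethod
    def of(cls, point):
        p = np.asarray(point, dtype=float)
        return cls(1, p, float(p @ p))

    def merge(self, other):
        # CF additivity: two subclusters merge by adding their summaries.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root of the average squared distance of member points from the centroid.
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))

# Points are absorbed into one CF entry without being stored individually.
cf = CF.of([1.0, 2.0])
for p in ([1.5, 1.8], [0.9, 2.2]):
    cf = cf.merge(CF.of(p))
print(cf.n, cf.centroid(), cf.radius())
```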

5.
Networks with billions of vertices introduce new challenges for performing graph analysis in a reasonable time. The clustering coefficient is an important analytical measure of networks such as social and biological networks. For computing the clustering coefficient in big graphs, existing distributed algorithms suffer from low efficiency: they may fail because they demand large amounts of memory, or, even when they complete successfully, their execution time is unacceptable for real-world applications. We present a distributed MapReduce-based algorithm, called CCFinder, to efficiently compute the clustering coefficient in very big graphs. CCFinder is executed on Apache Spark, a scalable data processing platform. It efficiently detects existing triangles using our proposed data structure, called FONL, which is cached in the distributed memory provided by Spark and reused multiple times. As data items in the FONL are fine-grained and contain only the minimum required information, CCFinder requires less storage space and has better parallelism than its competitors. To compute the clustering coefficient, our triangle-counting solution is extended to carry vertex degree information in the appropriate places. We performed several experiments on a Spark cluster with 60 processors. The results show that CCFinder achieves acceptable scalability and outperforms six existing competitor methods: four are methods built on graph processing systems (the GraphX, NScale, NScaleSpark, and Pregel frameworks), and the other two are Cohen's method and NodeIterator++, both based on MapReduce.
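For reference, the quantity CCFinder computes can be defined in a few lines over adjacency sets; this is a plain single-machine definition, not the FONL-based Spark implementation.

```python
from itertools import combinations

def clustering_coefficients(adj):
    """adj: dict mapping each vertex to the set of its neighbors (undirected).
    Local clustering coefficient of v = triangles through v divided by the
    d(v)*(d(v)-1)/2 possible pairs of neighbors."""
    cc = {}
    for v, nbrs in adj.items():
        d = len(nbrs)
        if d < 2:
            cc[v] = 0.0
            continue
        triangles = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        cc[v] = 2.0 * triangles / (d * (d - 1))
    return cc

graph = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
print(clustering_coefficients(graph))   # vertex 1: 1 triangle out of 3 neighbor pairs
```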

6.
王梅  宋晓晖  刘勇  许传海 《计算机应用》2022,42(11):3330-3336
To address the problem that the K-Means clustering algorithm updates cluster centers with the mean, which leaves clustering results sensitive to the sample distribution, a Neural Tangent Kernel K-Means clustering algorithm (NTKKM) was proposed. First, the data in the input space are mapped into a high-dimensional feature space by the Neural Tangent Kernel (NTK); then K-Means clustering is carried out in that feature space, with the cluster centers updated by a method that takes both inter-cluster and intra-cluster distances into account, yielding the final clustering result. Three evaluation measures, namely accuracy, Adjusted Rand Index (ARI) and FM index, were computed on the car and breast-tissue datasets. Experimental results show that both the clustering quality and the stability of NTKKM are better than those of the K-Means and Gaussian-kernel K-Means clustering algorithms. Compared with the traditional K-Means clustering algorithm, NTKKM improves accuracy by 14.9% and 9.4%, ARI by 9.7% and 18.0%, and the FM index by 12.0% and 12.0%, respectively, which verifies its good clustering performance.
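The kernel-space assignment step underlying NTKKM can be sketched generically. In the sketch below an RBF kernel stands in for the Neural Tangent Kernel (computing an actual NTK matrix is out of scope), and the paper's center-update rule balancing inter- and intra-cluster distance is not reproduced; this is only the standard kernel K-Means distance trick on a precomputed kernel matrix.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Stand-in kernel; the paper uses the Neural Tangent Kernel instead.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_kmeans(K, k, n_iter=20, seed=0):
    """Standard kernel K-Means on a precomputed kernel matrix K."""
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(0, k, size=n)
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            idx = np.where(labels == c)[0]
            if len(idx) == 0:
                continue
            # ||phi(x) - mu_c||^2 = K_xx - 2*mean_j K_xj + mean_jl K_jl  (j, l in cluster c)
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        labels = dist.argmin(axis=1)
    return labels

X = np.random.default_rng(1).normal(size=(60, 2))
labels = kernel_kmeans(rbf_kernel(X, gamma=0.5), k=3)
print(np.bincount(labels))
```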

7.
The Fuzzy K-Prototypes (FKP) algorithm combines the way K-Means handles numerical data with the way K-Modes handles categorical data, making it suitable for clustering mixed-type data. Its fuzzy formulation also makes FKP suitable for databases containing noise and missing data. However, when using FCM (the Fuzzy C-Means algorithm) or FKP, how to choose the weighting exponent α remains an open question. Many researchers, based on their experimental results, place the optimal weighting exponent for FCM in the interval [1.5, 2.5]; this paper instead proposes a search algorithm for the weighting exponent in FKP. Experimental results on several real datasets show that, for effective clustering, the weighting exponent in FKP should be less than 1.5.
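The role of the weighting exponent can be seen directly in the standard fuzzy membership update, written below for FCM (FKP adds the categorical part of the distance, which is omitted here); the toy data are illustrative.

```python
import numpy as np

def fcm_memberships(X, centers, m):
    """Fuzzy membership matrix (clusters x points) for weighting exponent m > 1:
    u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)); a smaller m gives crisper memberships."""
    d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2)    # (k, n)
    d = np.maximum(d, 1e-12)                                           # avoid division by zero
    ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))       # (d_ij / d_kj)^(2/(m-1))
    return 1.0 / ratio.sum(axis=1)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
for m in (1.3, 2.0, 2.5):   # the abstract argues for m < 1.5 in FKP
    print(m, fcm_memberships(X, centers, m).round(3))
```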

8.
A Distributed Swarm Pattern Mining Algorithm for Big Spatio-Temporal Trajectory Data
To meet the need for swarm pattern mining over big spatio-temporal trajectory data, an efficient distributed swarm pattern mining algorithm based on MapReduce is proposed. First, the concept of an object-set-closed swarm pattern based on the maximal set of moving objects is introduced, and the serial mining algorithm is optimized with a minimal time-support set. Second, a parallel mining model for swarm patterns is proposed: exploiting the time-domain independence of swarm patterns, the clustering step and the swarm pattern mining over sub time domains are parallelized. Third, a distributed parallel mining algorithm based on a chained MapReduce architecture is designed, which carries out parallel swarm pattern mining quickly in four stages. Finally, the effectiveness and efficiency of the distributed algorithm are verified and analyzed on the Hadoop platform using a real big dataset of traffic trajectories.

9.
Formal Concept Analysis (FCA), in which data is represented as a formal context, offers a framework for Association Rule Mining (ARM) by handling functional dependencies in the data. However, the number of rules grows exponentially with the size of the formal context. In this article, we apply Fuzzy K-Means clustering to the dataset to reduce the formal context, and FCA to the reduced dataset for mining association rules. With experiments on two real-world healthcare datasets, we offer evidence for the performance of FKM-based FCA in mining association rules.

10.
Advances in nanometer and integrated circuit technology allow a graphics card to carry its own memory and one or more processing units (GPUs), on which most graphics instructions can be processed in parallel. This computational resource can be used to improve the execution efficiency not only of graphics applications but also of other time-consuming applications such as data mining. The Clustering Affinity Search Technique is a well-known clustering algorithm widely used for clustering biological data. In this paper, we propose an algorithm that utilizes the GPU and the dedicated memory of the graphics card to accelerate execution. The experimental results show that our proposed algorithm delivers excellent performance in terms of execution time and is scalable to very large databases.

11.
In the context of big data, clustering analysis, of which K-Means is a representative, is very important for data analysis and mining. Processing massive high-dimensional data places strong performance demands on the K-Means algorithm. The recently proposed many-core architecture MIC (Many Integrated Core) provides thread-level parallelism across cores and instruction-level parallelism within each core, making it a good choice for accelerating K-Means. Based on an analysis of the characteristics of the basic K-Means algorithm, the bottlenecks of K-Means are identified, a vectorized K-Means algorithm that exploits data parallelism is proposed, and the data layout scheme of the vectorized algorithm is optimized. Finally, the vectorized K-Means algorithm is implemented on a heterogeneous CPU/MIC architecture, and optimization strategies for MIC in non-traditional HPC (high-performance computing) application domains are explored. Test results show that the vectorized K-Means algorithm has good computational performance and scalability.
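The data-parallel kernel that vectorization targets is the assignment step. Below is a minimal NumPy sketch of the same idea (illustrative only; the paper uses MIC/SIMD intrinsics and a tuned data layout rather than NumPy).

```python
import numpy as np

def assign_vectorized(X, centers):
    """Vectorized assignment step: squared distances to all centers at once,
    using ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2 as one matrix product."""
    x_sq = (X ** 2).sum(axis=1, keepdims=True)    # (n, 1)
    c_sq = (centers ** 2).sum(axis=1)             # (k,)
    d2 = x_sq - 2.0 * X @ centers.T + c_sq        # (n, k)
    return d2.argmin(axis=1)

def update_centers(X, labels, old_centers):
    # Mean of the points assigned to each cluster; keep the old center if a cluster is empty.
    return np.array([X[labels == c].mean(axis=0) if np.any(labels == c) else old_centers[c]
                     for c in range(len(old_centers))])

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 16))
centers = X[rng.choice(len(X), size=8, replace=False)]
for _ in range(10):
    labels = assign_vectorized(X, centers)
    centers = update_centers(X, labels, centers)
print(np.bincount(labels))
```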

12.
How to improve the performance of frequent pattern mining over massive uncertain datasets is currently a research hotspot. Most traditional algorithms use a single measure (expectation, probability, or weight) as the itemset support; in a big data setting, algorithms that consider both probabilistic and weighted support struggle to remain efficient. This paper therefore proposes a Spark-based frequent pattern mining algorithm for uncertain datasets (UWEFP). First, to take both the probability and the weight of data items into account, the maximum probability-weight value of each 1-itemset is computed and used for pruning. Then, to avoid repeated scans of the dataset and to exploit the advantages of the Spark framework, a novel UWEFP-tree structure with FP-tree characteristics is designed for building and mining the pattern tree. Finally, the method is validated with experiments on UCI datasets in a Spark environment. The results show that it improves efficiency while preserving the mining results.

13.
An important approach to image classification is the clustering of pixels in the spectral domain. Fast detection of different land cover regions, or clusters of arbitrarily varying shapes and sizes, in satellite images is a challenging task. In this article, an efficient, scalable, parallel clustering technique for multi-spectral remote sensing imagery using a recently developed point-symmetry-based distance norm is proposed. The proposed distributed, computing-time-efficient, point-symmetry-based K-Means technique is able to correctly identify the presence of overlapping clusters of any arbitrary shape and size, whether they are intra-symmetrical or inter-symmetrical in nature. A Kd-tree-based approximate nearest neighbor search is used as a speedup strategy for computing the point-symmetry-based distance. The superiority of this new parallel implementation with the novel two-phase speedup strategy over the existing parallel K-Means clustering algorithm is demonstrated, both quantitatively and in computing time, on two satellite images, from SPOT and Indian Remote Sensing, where the K-Means algorithm fails to detect the symmetry in the clusters. The different land cover regions classified by the algorithms for both images are also compared with the available ground truth information. A statistical analysis is also performed to establish the method's significance for classifying both satellite images and numeric remote sensing datasets described in terms of feature vectors.
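The point-symmetry-based distance itself can be stated in a few lines. Below is a simplified single-machine sketch that uses SciPy's Kd-tree for the nearest-neighbor part, as a stand-in for the paper's parallel two-phase speedup strategy; the constant `knear` and the sample data are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def point_symmetry_distance(x, center, tree, knear=2):
    """Simplified point-symmetry-based distance of x w.r.t. a cluster center:
    reflect x through the center, measure how close existing data points lie
    to the reflected point, and scale by the Euclidean distance to the center."""
    reflected = 2.0 * center - x
    d_near, _ = tree.query(reflected, k=knear)    # distances to the knear nearest data points
    return float(np.mean(d_near)) * float(np.linalg.norm(x - center))

rng = np.random.default_rng(0)
ring = rng.normal(size=(300, 2))
ring /= np.linalg.norm(ring, axis=1, keepdims=True)    # points on a unit circle
tree = cKDTree(ring)
center = ring.mean(axis=0)
print(point_symmetry_distance(ring[0], center, tree))  # small, because the ring is symmetric
```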

14.
Classical clustering algorithms like K-Means often converge to local optima and have slow convergence rates for larger datasets. To overcome such situations in clustering, swarm-based algorithms have been proposed; they attempt to reach the optimal solution for such problems in reasonable time. Many swarm-based algorithms, such as the Flower Pollination Algorithm (FPA), Cuckoo Search Algorithm (CSA), Black Hole Algorithm (BHA), Bat Algorithm (BA), Particle Swarm Optimization (PSO), Firefly Algorithm (FFA), and Artificial Bee Colony (ABC), have been successfully applied to many non-linear optimization problems. In this paper, an algorithm is proposed that hybridizes Chaos Optimization and Flower Pollination over K-Means to improve the efficiency of minimizing cluster integrity. The proposed algorithm, referred to as Chaotic FPA (CFPA), is compared with FPA, CSA, BHA, BA, FFA, and PSO over K-Means for the data clustering problem. Experiments are conducted on sixteen benchmark datasets. Algorithms are compared on four performance parameters: cluster integrity, execution time, number of iterations to converge (NIC), and stability. The results are analyzed statistically using the non-parametric Friedman test; if the Friedman test rejects the null hypothesis, pairwise comparison is done using the Nemenyi test. The experimental results demonstrate the following: (a) CFPA and BHA perform better than the other algorithms in terms of cluster integrity; (b) CFPA and CSA are superior to the others in terms of execution time; (c) CFPA and FPA converge to the optimal cluster integrity earlier than the other algorithms; and (d) CFPA and BHA produce more stable results than the other algorithms.
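The objective such swarm searches minimize can be sketched as follows. Here "cluster integrity" is interpreted as the total distance from every point to its nearest candidate center (an assumption; the paper may define it differently), and each swarm member encodes one candidate set of k centers.

```python
import numpy as np

def cluster_integrity(X, centers):
    """Fitness of one candidate solution: sum of distances from every point to its
    nearest candidate center (this reading of 'cluster integrity' is an assumption)."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # (n, k)
    return float(d.min(axis=1).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
k = 3
# One swarm individual encodes one candidate set of k centers; the swarm search
# (FPA, PSO, ...) perturbs such vectors to minimize the fitness below.
candidate = rng.uniform(X.min(axis=0), X.max(axis=0), size=(k, X.shape[1]))
print(cluster_integrity(X, candidate))
```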

15.
Ever-growing volumes of digital data retain valuable information, commonly called big data, and clustering is a standard technique for mining it. Clustering such data is computationally demanding, which motivates fuzzy clustering algorithms with low computational requirements that can still cluster vast and skewed datasets. This paper proposes Random Data Storing with an Optimization Fuzzy Logic Algorithm (RDS-FLA), which combines random data security storage with an optimized fuzzy logic clustering algorithm. Experiments on large-scale datasets evaluate RDS-FLA in terms of time and space complexity, running time, cluster quality, and scalability on a big data cluster. The results indicate that RDS-FLA completes clustering in a short time without degrading cluster quality, shortening processing time while storing data securely and efficiently. Benefits such as optimization and reduced data security cost can be seen from the algorithm's experimental results.

16.
Identifying clusters is an important aspect of analyzing large datasets. Clustering algorithms classically require access to the complete dataset. However, as huge amounts of data increasingly originate from multiple, dispersed sources in distributed systems, alternative solutions are required. Furthermore, data and network dynamicity in a distributed setting demand adaptable clustering solutions that offer accurate clustering models at a reasonable pace. In this paper, we propose GoSCAN, a fully decentralized density-based clustering algorithm which is capable of clustering dynamic and distributed datasets without requiring central control or message flooding. We identify two major tasks, finding the core data points and forming the actual clusters, which we execute in parallel employing gossip-based communication. This approach is very efficient, as it gives each peer enough authority to discover the clusters it is interested in. Our algorithm poses no extra burden of overlay formation on the network, while providing high levels of scalability. We also offer several optimizations to the basic clustering algorithm that improve communication overhead and processing costs. Coping with dynamic data is made possible by introducing an age factor, which gradually detects dataset changes and enables clustering updates. In our experimental evaluation, we show that GoSCAN can discover the clusters efficiently with scalable transmission cost.
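The core-point notion GoSCAN distributes is the standard density-based one. A minimal local sketch of that check is shown below (the gossip protocol and the age factor are not shown; `eps` and `min_pts` are illustrative).

```python
import numpy as np

def is_core_point(i, X, eps, min_pts):
    """Standard density test: point i is a core point if at least min_pts points
    (itself included) lie within distance eps of it. GoSCAN distributes this
    test via gossip-based communication."""
    dists = np.linalg.norm(X - X[i], axis=1)
    return int((dists <= eps).sum()) >= min_pts

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.2, size=(50, 2))
sparse = rng.uniform(-3.0, 3.0, size=(10, 2))
X = np.vstack([dense, sparse])
core = [is_core_point(i, X, eps=0.3, min_pts=5) for i in range(len(X))]
print(sum(core), "core points out of", len(X))
```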

17.
This paper uses a domain ontology to extract and consolidate the opinion targets in product review texts and, on that basis, builds an incomplete information system of product attributes, embedding the sentiment orientation of each feature into its weight computation. For the incomplete information system, a heuristic feature reduction method based on the discernibility matrix is given; the resulting dimensionality reduction lowers feature redundancy and data sparsity. The K-Means clustering algorithm is then applied to the reduced incomplete information system to achieve sentiment clustering of the opinion targets. To verify the effectiveness of the proposed method, experiments are carried out on real automobile review texts; the results show that good clustering performance is retained even after a certain degree of feature dimensionality reduction.

18.
田华  何翼 《计算机应用研究》2020,37(12):3586-3589
To address the scalability of big data analysis on massively parallel distributed systems and software platforms, a big data mining technique based on parameter-free clustering using binary splitting around centroids (CLUBS) is proposed. The technique works in a fully unsupervised manner: it performs divisive clustering under a least quadratic distance criterion to separate data from noise, uses an intermediate refinement step to identify blocks containing only outliers and to produce complete clusters from the remaining blocks, and a parallel version of CLUBS is designed to cluster big data quickly and effectively. Experiments show that the parallel CLUBS algorithm is insensitive to data dimensionality and noise, and is faster and more scalable than existing algorithms.

19.
This paper studies state-of-the-art classification techniques for electroencephalogram (EEG) signals: the Fuzzy Functions Support Vector Classifier, the Improved Fuzzy Functions Support Vector Classifier, and a novel technique designed using Particle Swarm Optimization and Radial Basis Function Networks (PSO-RBFN). The classification performance of the techniques is compared on standard EEG datasets that are publicly available and used by brain–computer interface (BCI) researchers. In addition to the standard EEG datasets, the proposed classifier is also tested on non-EEG datasets for thorough comparison. Within the scope of this study, several data clustering algorithms, such as Fuzzy C-Means, K-Means, and PSO clustering, are also studied and their clustering performance on the same datasets is compared. The results show that PSO-RBFN might reach the classification performance of state-of-the-art classifiers and might be a better alternative for classifying EEG signals in real-time applications. This has been demonstrated by implementing the proposed classifier in a real-time BCI application for mobile robot control.
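The RBF-network half of PSO-RBFN is standard; below is a minimal sketch of its forward pass, where the centers and widths are exactly the quantities a PSO search (or a clustering pass) would tune. The shapes and sample data are illustrative.

```python
import numpy as np

def rbf_forward(X, centers, sigmas, weights):
    """Forward pass of an RBF network: hidden unit j responds with
    exp(-||x - c_j||^2 / (2 * sigma_j^2)); the output is a linear combination."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (n, m)
    H = np.exp(-d2 / (2.0 * sigmas ** 2))                           # (n, m)
    return H @ weights                                              # (n, outputs)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))         # 5 samples, 4 features (e.g. EEG features)
centers = rng.normal(size=(3, 4))   # 3 hidden RBF units; their placement is what PSO tunes
sigmas = np.ones(3)
weights = rng.normal(size=(3, 2))   # 2 output classes
print(rbf_forward(X, centers, sigmas, weights).shape)   # (5, 2)
```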

20.
High utility itemset mining uses the notion of utilities to discover interesting and actionable patterns. Several data structures and heuristic methods have been proposed in the literature to efficiently mine high utility itemsets. This paper advances the state of the art and presents HMiner, a high utility itemset mining method. HMiner utilizes a few novel ideas and presents a compact utility list and a virtual hyperlink data structure for storing itemset information. It also makes use of several pruning strategies for efficiently mining high utility itemsets. The proposed ideas were evaluated on a set of benchmark sparse and dense datasets. The execution time improvements ranged from a modest thirty percent to three orders of magnitude across several benchmark datasets. Memory consumption also showed up to an order-of-magnitude improvement over the state-of-the-art methods. In general, HMiner was found to work well in the dense regions of both sparse and dense benchmark datasets.
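The quantity HMiner mines can be stated concretely. Below is a brute-force reference computation of itemset utility against a minimum-utility threshold (illustrative only; it uses none of HMiner's compact utility lists, virtual hyperlinks, or pruning, and the sample data are made up).

```python
# Transactions: tid -> {item: purchase quantity}; profit: item -> unit profit.
transactions = {
    1: {"a": 2, "b": 1, "c": 3},
    2: {"a": 1, "c": 2},
    3: {"b": 4, "c": 1, "d": 2},
}
profit = {"a": 5, "b": 2, "c": 1, "d": 4}

def utility(itemset):
    """Total utility of an itemset: for every transaction containing all of its
    items, add quantity * unit-profit of those items, then sum over transactions."""
    total = 0
    for items in transactions.values():
        if all(i in items for i in itemset):
            total += sum(items[i] * profit[i] for i in itemset)
    return total

min_util = 12
candidates = [{"a"}, {"a", "c"}, {"b", "c"}, {"c"}]
print([(sorted(s), utility(s)) for s in candidates if utility(s) >= min_util])
# [(['a'], 15), (['a', 'c'], 20), (['b', 'c'], 14)]
```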
