Similar Literature
20 similar documents were found.
1.
With the rapid development of information technology, the dimensionality of data in many application domains, such as text categorization and bioinformatics, continues to grow. High-dimensional data can cause problems for traditional learning algorithms in pattern classification, including overfitting, poor performance, and low efficiency. Feature selection aims to reduce the dimensionality of data and provide discriminative features to pattern learning algorithms. Because of its effectiveness, feature selection is attracting increasing attention from a variety of disciplines, and many efforts have been made in this field. In this paper, we propose a new supervised feature selection method that picks important features using information criteria. Unlike other selection methods, our method not only takes into account both maximal relevance to the class labels and minimal redundancy with the already selected features, but also works like feature clustering in an agglomerative way. To measure feature relevance and redundancy precisely, two information criteria, mutual information and the coefficient of relevance, are adopted. Performance evaluations on 12 benchmark data sets show that the proposed method outperforms other popular feature selection methods in most cases.
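The abstract describes the criterion only at a high level; the sketch below shows a generic greedy max-relevance/min-redundancy selection, with mutual information for relevance and absolute Pearson correlation standing in for the paper's coefficient-of-relevance term (an assumption). It is not the authors' implementation.

```python
# Greedy max-relevance / min-redundancy feature selection sketch (not the
# authors' exact algorithm): relevance is mutual information with the label,
# redundancy is approximated by absolute Pearson correlation with the
# already-selected features.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

def greedy_mi_selection(X, y, n_features=10):
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = [int(np.argmax(relevance))]        # start with the most relevant feature
    while len(selected) < n_features:
        best_score, best_j = -np.inf, None
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])
            score = relevance[j] - redundancy     # relevance minus redundancy
            if score > best_score:
                best_score, best_j = score, j
        selected.append(best_j)
    return selected

X, y = load_breast_cancer(return_X_y=True)
print(greedy_mi_selection(X, y, n_features=5))
```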

2.
We explore the efficient clustering of market-basket data. Unlike traditional data, market-basket data are known to be high dimensional and sparse. Without explicitly considering the presence of a taxonomy, most prior work on clustering market-basket data can be viewed as dealing with items at the leaf level of the taxonomy tree. Clustering transactions across different levels of the taxonomy is important both for marketing strategies and for presenting the results of clustering techniques for market-basket data. In view of these characteristics, we devise a novel measurement, called category-based adherence, and use it to perform the clustering. The distance of an item to a given cluster is defined as the number of links between this item and its nearest tode, and the category-based adherence of a transaction to a cluster is defined as the average distance of the items in that transaction to the cluster. With this measurement, we develop an efficient clustering algorithm, called k-todes, for market-basket data, with the objective of minimizing the category-based adherence. A validation model based on information gain is also devised to assess clustering quality for market-basket data. As validated on both real and synthetic datasets, our experimental results show that, with the taxonomy information, k-todes significantly outperforms prior work in both execution efficiency and clustering quality as measured by information gain, indicating the usefulness of category-based adherence in market-basket data clustering.
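To make the adherence measure concrete, here is a small pure-Python sketch under assumed representations: the taxonomy is given as a child-to-parent map and a cluster is summarized by a set of "todes" (taxonomy nodes). The toy tree and names are illustrative, not taken from the paper.

```python
# Sketch of the category-based adherence measure described above.

def depth_to_root(node, parent):
    # map every ancestor of `node` (including itself) to its link distance
    path, d = {}, 0
    while node is not None:
        path[node] = d
        node = parent.get(node)
        d += 1
    return path

def tree_distance(a, b, parent):
    # number of links on the taxonomy path between nodes a and b
    pa = depth_to_root(a, parent)
    d, node = 0, b
    while node not in pa:
        node = parent[node]
        d += 1
    return d + pa[node]

def adherence(transaction, todes, parent):
    # average distance from each item to its nearest tode of the cluster
    return sum(min(tree_distance(item, t, parent) for t in todes)
               for item in transaction) / len(transaction)

# toy taxonomy: food -> {dairy -> {milk, cheese}, fruit -> {apple}}
parent = {"dairy": "food", "fruit": "food",
          "milk": "dairy", "cheese": "dairy", "apple": "fruit"}
print(adherence(["milk", "apple"], todes={"dairy"}, parent=parent))   # (1 + 3) / 2 = 2.0
```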

3.
陈晓云  陈媛 《自动化学报》2022,48(4):1091-1104
Clustering high-dimensional, complex data usually requires dimensionality reduction before clustering. However, commonly used dimensionality reduction methods do not consider the within-class aggregation of the data or the correlations between samples, so it is hard to guarantee that the reduction step matches the clustering algorithm, which leads to a loss of clustering information. The extreme learning machine autoencoder (ELM-AE), a nonlinear unsupervised dimensionality reduction method, has in recent years been widely used for dimensionality reduction and …

4.
Network connection records can be grouped by basic connection features, content features, network traffic features, and host traffic features. Exploiting this property, a K-means-based method is proposed that clusters within each feature group, efficiently achieving feature reduction and data dimensionality reduction. By adjusting the clustering parameters, the difference information within each feature group is preserved, and the C4.5 decision tree algorithm is then applied to the reduced data for intrusion classification. Experimental results show that the method reduces the number of clustering features of the KDD Cup 99 data set from 41 to 4, with an overall detection rate of 99.73% for network connections and a false alarm rate of 0; the detection rates for normal connections and Probe attacks both reach 100%.
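The abstract does not say exactly how each feature group is reduced to a single clustering feature; the sketch below illustrates one plausible reading, in which each group is summarized by the distance to its nearest K-means centroid, and a CART tree (scikit-learn's stand-in for C4.5) classifies the reduced data. The group column indices and the data are placeholders.

```python
# One feature per group: distance to the nearest K-means centroid of that
# group's columns (an assumed interpretation), then a CART decision tree
# classifies the 4-dimensional reduced data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

def reduce_by_feature_groups(X, groups, n_clusters=5, seed=0):
    X = StandardScaler().fit_transform(X)
    reduced = []
    for cols in groups:   # e.g. KDD-99: basic / content / traffic / host feature groups
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X[:, cols])
        reduced.append(km.transform(X[:, cols]).min(axis=1))   # distance to nearest centroid
    return np.column_stack(reduced)

# placeholder column indices for the four feature groups
groups = [list(range(0, 9)), list(range(9, 22)), list(range(22, 31)), list(range(31, 41))]
X, y = np.random.rand(200, 41), np.random.randint(0, 2, 200)   # stand-in data
clf = DecisionTreeClassifier().fit(reduce_by_feature_groups(X, groups), y)
```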

5.
Document clustering using locality preserving indexing
We propose a novel document clustering method that aims to cluster documents into different semantic classes. The document space is generally of high dimensionality, and clustering in such a high-dimensional space is often infeasible due to the curse of dimensionality. Using locality preserving indexing (LPI), the documents can be projected into a lower-dimensional semantic space in which documents related to the same semantics are close to each other. Unlike previous document clustering methods based on latent semantic indexing (LSI) or nonnegative matrix factorization (NMF), our method tries to discover both the geometric and the discriminating structures of the document space. Theoretical analysis shows that LPI is an unsupervised approximation of the supervised linear discriminant analysis (LDA) method, which provides the intuitive motivation for our approach. Extensive experimental evaluations are performed on the Reuters-21578 and TDT2 data sets.
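LPI is not available in common Python libraries, so the sketch below substitutes scikit-learn's SpectralEmbedding (Laplacian eigenmaps), a related locality-preserving projection, to illustrate the overall embed-then-cluster pipeline on TF-IDF vectors. It is not the authors' implementation, and the documents are placeholders.

```python
# Illustrative pipeline: TF-IDF vectors -> locality-preserving embedding -> k-means.
# SpectralEmbedding stands in for LPI, which scikit-learn does not provide.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import SpectralEmbedding

docs = ["stocks fell sharply on the market today",
        "the market rallied after the bank decision",
        "the central bank raised interest rates",
        "the team won the championship final",
        "a late goal decided the football match",
        "the coach praised the team after the match"]

X = TfidfVectorizer().fit_transform(docs).toarray()
# embed documents so that semantically close ones stay close
Z = SpectralEmbedding(n_components=2, n_neighbors=4, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
print(labels)
```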

6.
Model-based co-clustering divides the data along two main axes and simultaneously trains a supervised model for each co-cluster using all other input features. For example, in the rating-prediction task of a recommender system, the two main axes are items and users. In each co-cluster, we train a regression model that predicts the rating from other features such as the user's characteristics (e.g., gender), the item's characteristics (e.g., genre), contextual features (e.g., location), and so on. In reality, users and items do not necessarily belong to a single co-cluster but can be associated with several co-clusters. We extend model-based co-clustering to support fuzzy co-clustering. In this setting, each item-user pair is associated with every co-cluster through a membership grade that indicates how relevant the pair is to that co-cluster. Furthermore, we propose a distributed algorithm, based on a map-reduce approach, to handle big datasets. Evaluating the fuzzy co-clustering algorithm on three datasets shows a significant improvement compared with a regular co-clustering algorithm. In addition, a map-reduce version of the fuzzy co-clustering algorithm significantly reduces the runtime.

7.
Dimensionality reduction is a great challenge in processing high-dimensional unlabelled data. Existing dimensionality reduction methods tend to rely on a similarity matrix and a spectral clustering algorithm. However, noise in the original data often makes the similarity matrix unreliable and degrades clustering performance. Moreover, existing spectral clustering methods focus only on local structures and ignore global discriminative information, which may lead to overfitting in some cases. To address these issues, this paper proposes a novel unsupervised two-dimensional (2D) dimensionality reduction method that incorporates similarity matrix learning and global discriminative information into the dimensionality reduction procedure. In particular, the number of connected components in the learned similarity matrix equals the number of clusters. We compare the proposed method with several 2D unsupervised dimensionality reduction methods and evaluate clustering performance with K-means on several benchmark data sets. The experimental results show that the proposed method outperforms the state-of-the-art methods.

8.
An energy-optimization-based data collection and fusion algorithm for WSNs
To address the heavy cluster-head load in the WSN routing protocol LEACH, an improved data collection and fusion algorithm, LEACH-E, is proposed. During cluster formation, cluster heads are selected according to the nodes' residual energy and relative distances. During communication, principal component analysis is applied at the cluster heads to reduce the dimensionality of the collected data, and the fused data are then forwarded to the base station in a multi-hop fashion along the optimal path found by an ant colony algorithm. Simulation results show that the algorithm performs better in terms of uniform clustering, balanced node energy consumption, and prolonged network lifetime.
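As a small illustration of the dimensionality-reduction step only (the cluster-head election and ant-colony routing are not reproduced), the sketch below shows a cluster head compressing the readings it has collected with PCA before forwarding them; the data and the number of components are placeholders.

```python
# PCA-based fusion at a cluster head (illustrative): compress the member
# nodes' readings to a few principal components before multi-hop forwarding.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
readings = rng.normal(size=(50, 8))        # 50 samples x 8 sensor channels (placeholder)

pca = PCA(n_components=2).fit(readings)
compressed = pca.transform(readings)       # what the cluster head actually transmits
print(compressed.shape, pca.explained_variance_ratio_.sum())
```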

9.
Large volume, high velocity, diverse sources, and noise are the four main characteristics of big data, and they pose new challenges for data integration. Entity resolution is an important step in data integration, but in a big data environment, traditional entity resolution algorithms perform poorly in terms of efficiency and quality, and especially in noise tolerance. To resolve the conflicts in resolution results caused by data noise in big data environments, common neighbors are introduced into the correlation clustering problem. The upper-layer pre-blocking algorithm is designed around neighbor relations and can therefore complete initial blocking quickly and effectively; the introduction of a core concept defines the degree of association between a node and a cluster more precisely, so that the lower-layer adjustment algorithm can accurately determine which cluster a node belongs to, thereby improving the accuracy of correlation clustering. The two-layer algorithm uses a relatively coarse similarity distance function, which keeps it both simple and efficient. At the same time, because neighbor relations are introduced, the algorithm's noise tolerance is clearly improved. Extensive experiments show that the two-layer correlation clustering algorithm outperforms traditional algorithms in resolution quality, noise tolerance, and scalability.

10.
Cluster analysis of software defect data partitions defect data objects into several classes according to given criteria, so that defects within a class are similar and defects in different classes differ. Its purpose is to discover the distribution patterns of software defects, formulate targeted test plans, and optimize the testing process. To address the problem that the clustering results of the traditional K-means method depend on the initial spatial distribution of the samples, a PSO-based dimensionality reduction method, DRPS, is proposed. Simulation experiments show that both the clustering accuracy and the clustering quality of the data improve to a certain extent after dimensionality reduction with this method.

11.
With the rapid development of information technology and the arrival of the big data era, data exhibit complex characteristics such as high dimensionality and nonlinearity. For high-dimensional data it is often difficult to find, in the full-dimensional space, feature regions that reflect the distribution patterns, and most traditional clustering algorithms scale well only to low-dimensional data. As a result, the clustering results produced by traditional algorithms on high-dimensional data may no longer meet current needs. Subspace clustering algorithms instead search for clusters that exist in subspaces of high-dimensional data, dividing the original feature space into different feature subsets, reducing the influence of irrelevant features, and retaining the main features of the original data. Subspace clustering can reveal information that is hard to see in high-dimensional data, and visualization techniques can then expose the internal structure of data attributes and dimensions, providing an effective means for visual analysis of high-dimensional data. This paper surveys recent progress in subspace clustering-based visual analysis of high-dimensional data, covering three kinds of approaches: feature selection, subspace exploration, and subspace clustering. It also analyzes the corresponding interactive analysis methods and applications, and discusses future trends in visual analysis of high-dimensional data.

12.
Stability of a learning algorithm with respect to small input perturbations is an important property, as it implies that the derived models are robust to the presence of noisy features and/or fluctuations in the data sample. The qualitative nature of the stability property complicates the development of practical, stability-optimizing data mining algorithms, since several issues naturally arise, such as how "much" stability is enough, or how stability can be effectively associated with intrinsic data properties. In this work we take these issues into account and explore the effect of stability maximization in the continuous (PCA-based) k-means clustering problem. Our analysis is based on both mathematical optimization and statistical arguments that complement each other and allow a solid interpretation of the algorithm's stability properties. Interestingly, we find that stability maximization naturally introduces a tradeoff between cluster separation and variance, leading to the selection of features that have a high cluster separation index that is not artificially inflated by the features' variance. The proposed algorithmic setup is based on a Sparse PCA approach that greedily selects the features maximizing stability. We also analyze several stability-related properties of Sparse PCA that promote it as a viable feature selection mechanism for clustering. The practical relevance of the proposed method is demonstrated in the context of cancer research, where we consider the problem of detecting potential tumor biomarkers from microarray gene expression data. Applying our method to a leukemia dataset shows that the tradeoff between cluster separation and variance leads to the selection of features corresponding to important biomarker genes, some of which have relatively low variance and are not detected without the direct optimization of stability in Sparse PCA based k-means. Apart from this qualitative evaluation, we have also verified our approach as a feature selection method for k-means clustering on four cancer research datasets. The quantitative empirical results illustrate the practical utility of our framework as a feature selection mechanism for clustering.
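The greedy stability-maximizing criterion itself is not given in the abstract; the sketch below only mirrors the general pattern of using Sparse PCA loadings to pick features before k-means, with scikit-learn's SparsePCA, an arbitrary sparsity level, and the feature count as assumptions rather than the paper's settings.

```python
# Select the features with the largest Sparse PCA loadings, then run k-means
# on the reduced data (general pattern only, not the stability criterion).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import SparsePCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

spca = SparsePCA(n_components=3, alpha=2.0, random_state=0).fit(X)
importance = np.abs(spca.components_).max(axis=0)      # per-feature loading magnitude
keep = np.argsort(importance)[::-1][:8]                # keep the 8 most loaded features

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, keep])
print(sorted(keep.tolist()), np.bincount(labels))
```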

13.
万静  郑龙君  何云斌  李松 《计算机应用》2019,39(11):3280-3287
Reducing the impact of uncertain data on high-dimensional clustering is a current research challenge. To address the low clustering accuracy caused by uncertain data and the curse of dimensionality, the approach taken is to first convert uncertain data into certain data and then cluster the certain data. During this certainization step, uncertain data are divided into value-uncertain and dimension-uncertain data, which are handled separately to improve efficiency. A K-nearest-neighbor (KNN) query combined with the expected distance is used to obtain the approximate values of uncertain data that have the least influence on the clustering result, improving clustering accuracy. Once certain data are obtained, subspace clustering is used to avoid the curse of dimensionality. Experimental results show that the CLIQUE-based clustering algorithm for high-dimensional uncertain data (UClique) performs well on UCI data sets, with good noise tolerance and scalability; it obtains good clustering results on high-dimensional data and achieves relatively high accuracy on different uncertain data sets, demonstrating that the algorithm is robust and can effectively cluster high-dimensional uncertain data sets.

14.
To reduce the dimensionality of the feature space, a text feature clustering method based on distribution distance is proposed: features whose distributions are close in the feature space are merged, thereby reducing the dimensionality. Experiments on the TanCorpusV1.0 corpus show that, when the dimensionality of the feature space is reduced to roughly 10% of the original, an SVM classifier achieves higher classification accuracy than feature extraction methods.
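The abstract does not name the distribution distance it uses; the sketch below assumes the Jensen-Shannon distance between each term's class-conditional distribution and merges features with average-linkage hierarchical clustering, which is one plausible instantiation rather than the paper's exact method. The data are placeholders.

```python
# Cluster text features whose class-conditional distributions are close
# (Jensen-Shannon distance is an assumed choice), then merge each feature
# cluster into a single term-count column.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import jensenshannon, squareform

def cluster_features(counts, labels, n_groups):
    classes = np.unique(labels)
    # per-feature distribution over classes, i.e. P(class | term)
    dist = np.vstack([counts[labels == c].sum(axis=0) for c in classes]).T
    dist = dist / np.maximum(dist.sum(axis=1, keepdims=True), 1e-12)
    n = dist.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = jensenshannon(dist[i], dist[j])
    groups = fcluster(linkage(squareform(D), method="average"),
                      n_groups, criterion="maxclust")
    # merge the columns of each feature group by summing their counts
    return np.column_stack([counts[:, groups == g].sum(axis=1) for g in np.unique(groups)])

counts = np.random.randint(0, 5, size=(100, 40))     # placeholder term-count matrix
labels = np.random.randint(0, 3, size=100)
print(cluster_features(counts, labels, n_groups=8).shape)   # reduced feature matrix
```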

15.
A dynamically weighted fuzzy kernel clustering algorithm
To overcome the influence of noisy feature vectors on clustering, and to fully account for the different contributions of individual feature vectors to the clustering result, a new dynamically weighted fuzzy kernel clustering algorithm is proposed, which uses a Mercer kernel to map the data to be clustered into a high-dimensional space. Through dynamic weighting, the algorithm automatically weakens the role of noisy feature vectors in classification. Without any prior information about the data, it can not only partition linear data accurately but also perform nonlinear partitioning of non-globular data. Classification results on both simulated and real data show that noise in the data has little effect on the results and that the algorithm is highly practical.

16.
For big data offline-analysis and interactive-query workloads, we first analyze their commonalities, extract a set of common operations, and organize them into groups. We then measure the microarchitectural characteristics of these workloads while they run on a big data platform, and apply PCA and the SimpleKMeans algorithm to the architectural feature parameters for dimensionality reduction and clustering. The experimental analysis shows that the workloads share common operations, such as Join and Cross Production, and that some workloads have similar properties; for example, Difference and Projection share the same microarchitectural characteristics. The results provide guidance for the design of processors and other hardware platforms and for application optimization, and serve as a reference for the design of big data benchmarking platforms.
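As an illustration of the analysis step only: reduce per-workload microarchitecture metrics with PCA and group the workloads with k-means (scikit-learn's KMeans standing in for Weka's SimpleKMeans). The metric matrix below is a random placeholder, not measured data.

```python
# PCA + k-means over per-workload microarchitecture metrics (e.g. IPC,
# cache miss rates). KMeans stands in for SimpleKMeans; data are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
metrics = rng.normal(size=(12, 20))       # 12 workloads x 20 hardware counters (placeholder)

Z = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(metrics))
groups = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)
print(groups)                             # workloads with similar microarchitectural behaviour
```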

17.
张亮  杜子平  张俊  李杨 《计算机工程》2011,37(9):216-217,220
Affinity propagation has difficulty handling data sets with a manifold structure. To address this, an affinity propagation clustering algorithm based on Laplacian eigenmaps (APPLE) is proposed, which strengthens the manifold-learning ability of standard affinity propagation. Geodesic distances are used to compute the similarity between data points, and Laplacian eigenmaps are used to reduce the dimensionality of the data set and extract features. Experimental results on image clustering show that APPLE achieves better clustering than standard affinity propagation.
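A minimal sketch of the geodesic-similarity idea named in the abstract, assuming a k-nearest-neighbor graph and scikit-learn's AffinityPropagation with precomputed similarities; the Laplacian-eigenmaps reduction step is omitted, so this is not the APPLE algorithm itself and the data set is a stand-in.

```python
# Geodesic similarities from a k-NN graph, then affinity propagation on the
# precomputed similarity matrix (the Laplacian-eigenmaps step is omitted).
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

graph = kneighbors_graph(X, n_neighbors=8, mode="distance")
geodesic = shortest_path(graph, method="D", directed=False)    # approximate manifold distances
finite_max = geodesic[np.isfinite(geodesic)].max()
geodesic[np.isinf(geodesic)] = 2 * finite_max                  # guard against disconnected parts

ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(-geodesic)
print(len(ap.cluster_centers_indices_), "exemplars found")
```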

18.
Shared Nearest Neighbours (SNN) techniques are well known to overcome several shortcomings of traditional clustering approaches, notably high dimensionality and metric limitations. However, previous methods were limited to a single information source, whereas such methods appear to be very well suited to heterogeneous data, typically in multi-modal contexts. In this paper, we propose a new technique to accelerate the calculation of shared neighbours, and we introduce a new multi-source shared-neighbours scheme applied to multi-modal image clustering. We first extend existing SNN-based similarity measures to the case of multiple sources, and we introduce an original automatic source-selection step when building candidate clusters. The key point is that each resulting cluster is built with its own optimal subset of modalities, which improves robustness to noisy or outlier information sources. We apply our method to multi-modal search result clustering, visual search mining, and subspace clustering. Experimental results on both synthetic and real data involving different information sources and several datasets show the effectiveness of our method.
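For reference, here is a minimal single-source sketch of the shared-nearest-neighbour similarity such methods build on: the similarity of two points is the number of neighbours their k-NN lists share. The multi-source extension and the acceleration described in the paper are not reproduced, and the data are placeholders.

```python
# Single-source shared-nearest-neighbour similarity: S[i, j] = |kNN(i) ∩ kNN(j)|.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, k=10):
    knn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = knn.kneighbors(X, return_distance=False)[:, 1:]     # drop each point itself
    member = np.zeros((X.shape[0], X.shape[0]), dtype=np.int32)
    np.put_along_axis(member, idx, 1, axis=1)                 # row i marks the k-NN list of point i
    return member @ member.T                                  # shared-neighbour counts

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
S = snn_similarity(X, k=10)
print(S.shape, S.max())
```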

19.
Clustering is the process of organizing objects into groups whose members are similar in some way. Most clustering methods handle numeric data only; however, this representation may not be adequate to model complex information such as histograms, distributions, or intervals. Symbolic Data Analysis (SDA) was developed to deal with these types of data. In multivariate data analysis, it is common for some variables to be more relevant than others, and less relevant variables can mask the cluster structure. This work proposes a fuzzy clustering method that produces weighted multivariate memberships for interval-valued data. These memberships can change at each iteration of the algorithm and differ from one variable to another and from one cluster to another. Furthermore, a distinct relevance weight is associated with each variable and may also differ from one cluster to another. The advantage of this method is that it is robust to ambiguous cluster-membership assignment, since the weights represent how important the different variables are to the clusters. Experiments are performed on synthetic data sets to compare the performance of the proposed method against methods already established in the clustering literature, and an application to interval-valued scientific production data is also presented. The clustering quality results show that the proposed method offers higher accuracy when variables have different variabilities.

20.
A robust SIFT feature matching method based on data clustering
To address the low robustness of SIFT feature matching caused by noise sensitivity, a two-stage feature matching method based on data clustering is proposed. The k-d data structure is extended, while still meeting the essential nearest-neighbor requirement of geometric-distance feature matching, so that it can both perform offline clustering of arithmetically averaged matching features and carry out online matching against the clustered features in the first stage. On this basis, a probabilistically optimal voting strategy is given to select key images for second-stage matching, and finally all matched feature pairs from both stages that belong to the key images are merged. Experimental results show that, for image collections with extensive overlap, the method effectively reduces the number of duplicate features, lowers the interference of noise with feature matching, and greatly improves the robustness of feature matching.
