Similar literature
 Found 20 similar documents; search time 15 ms
1.
Vast amounts of textual information are now collected and stored in databases around the world, including the Internet, the largest database of all. This rapid growth of published text means that even the most avid reader cannot hope to keep up with all the reading in a field, so nuggets of insight or new knowledge risk languishing undiscovered in the literature. Text mining addresses this problem by replacing or supplementing the human reader with automatic systems that are undeterred by the text explosion. It involves analyzing a large collection of documents to discover previously unknown information. Text clustering is one of the most important areas of text mining; it comprises text preprocessing, dimension reduction by selecting some terms (features), and finally clustering using the selected terms. Feature selection appears to be the most important step in the process. Conventional unsupervised feature selection methods define a measure of the discriminating power of terms in order to select suitable terms from the corpus. However, the evaluation of terms in groups has not been investigated in previously reported work. In this paper, a new and robust unsupervised feature selection approach is proposed that evaluates terms in groups. In addition, a new Modified Term Variance measure is proposed for evaluating groups of terms. Furthermore, a genetic-based algorithm is designed and implemented to find the most valuable groups of terms according to the new measure. These terms are then used to generate the final feature vector for the clustering process. To evaluate and justify the approach, the proposed method and a conventional term variance method are implemented and tested on the Reuters-21578 corpus collection. For a more accurate comparison, the methods were tested on three corpora; for each corpus the clustering task was run ten times and the results were averaged.
The comparison is very promising and shows that the proposed method produces better average accuracy and F1-measure than the conventional term variance method.
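
The conventional term variance baseline mentioned in this abstract can be illustrated with a minimal sketch. It assumes the commonly used definition in which a term's score is the sum of squared deviations of its frequency from its mean over all documents; the abstract does not give the formula, and the proposed Modified Term Variance and the genetic search are not reproduced here. The names `term_variance` and `select_top_terms` are hypothetical.

```python
from typing import List


def term_variance(doc_term: List[List[float]]) -> List[float]:
    """Score every term (column) of a document-term matrix.

    Assumed definition: sum over documents of (frequency - mean frequency)^2.
    """
    n_docs = len(doc_term)
    n_terms = len(doc_term[0])
    scores = []
    for j in range(n_terms):
        column = [doc_term[i][j] for i in range(n_docs)]
        mean = sum(column) / n_docs
        scores.append(sum((f - mean) ** 2 for f in column))
    return scores


def select_top_terms(doc_term: List[List[float]], k: int) -> List[int]:
    """Return the column indices of the k highest-scoring terms, in ascending order."""
    scores = term_variance(doc_term)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Rows are documents, columns are term frequencies (toy data).
    matrix = [[1, 0, 5], [1, 0, 0], [1, 2, 5], [1, 0, 0]]
    print(term_variance(matrix))
    print(select_top_terms(matrix, 2))
```

A term that occurs equally often in every document (the first column above) scores zero and is dropped first, which is the intuition behind using variance as a measure of discriminating power.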

2.
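Filtering technique based on image spatial clustering   Total citations: 2, self-citations: 0, citations by others: 2
赵红蕊 (Zhao Hongrui), 唐中实 (Tang Zhongshi). 《计算机应用》 (Journal of Computer Applications), 2006, 26(11): 2691-2693
Traditional image filters lose edges and details in the image while suppressing noise. Drawing on clustering methods for remote sensing images and improving on them, a spatial clustering method is proposed. The spatial clustering method emphasizes identifying and protecting the spatial distribution patterns of the image, and can effectively separate out the noise that most affects visual quality in a noisy image. On the basis of this signal-noise separation, a filtering operation is applied to noise points, while relatively noise-free pixels are processed by weighted-mean fusion. Experiments show that the method ensures the denoising effect while preserving, as far as possible, the edge and detail information that is not contaminated by noise, and that it works well on multiple types of noise and even mixed noise.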

3.
Given the wide application of Gaussian mixture clustering, e.g., in semantic video classification [H. Luo, J. Fan, J. Xiao, X. Zhu, Semantic principal video shot classification via mixture Gaussian, in: Proceedings of the 2003 International Conference on Multimedia and Expo, vol. 2, 2003, pp. 189-192], selecting the useful features in Gaussian mixture clustering without class labels is a nontrivial task. This paper therefore proposes a new feature selection method that not only identifies the most relevant features but also eliminates redundant ones, so that the smallest relevant feature subset can be found. We integrate this method with our recently proposed Gaussian mixture clustering approach, the rival penalized expectation-maximization (RPEM) algorithm [Y.M. Cheung, A rival penalized EM algorithm towards maximizing weighted likelihood for density mixture clustering with automatic model selection, in: Proceedings of the 17th International Conference on Pattern Recognition, 2004, pp. 633-636; Y.M. Cheung, Maximum weighted likelihood via rival penalized EM for density mixture clustering with automatic model selection, IEEE Trans. Knowl. Data Eng. 17(6) (2005) 750-761], which can determine the number of components in a Gaussian mixture (i.e., the model order) automatically. Data clustering, model selection, and feature selection are thus all performed in a single learning process. Experimental results show the efficacy of the proposed approach.

4.
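Feature selection is a commonly used data preprocessing technique in data mining and machine learning. For the unsupervised learning setting, a measure of average feature correlation is defined, and on this basis a feature selection method based on feature clustering, FSFC, is proposed. The method uses a clustering algorithm to search for clusters in different subspaces so that features with strong dependencies (i.e., redundancy) are assigned to the same cluster; a representative subset is then picked from each cluster, and together these form the feature subset, thereby removing irrelevant and redundant features. Experimental results on UCI datasets show that FSFC achieves feature reduction and classification performance comparable to several classical supervised feature selection methods.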

5.
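Representative point (prototype) selection is an important part of data preprocessing for data mining and pattern recognition, and an important way to improve the accuracy and efficiency of classifiers. A representative point selection algorithm based on a voting mechanism is proposed. The algorithm places the selected representative points as close to the class boundaries as possible, and the voting mechanism makes it easy to exclude outliers and reduce the amount of data, which helps improve the classification accuracy and efficiency of the nearest neighbor classifier. Experimental comparison with several classical representative point selection algorithms shows that the proposed voting-based algorithm has certain advantages in both the classification accuracy of the nearest neighbor classifier and the data reduction rate.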

6.
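To avoid exposing patients to the high-dose X-ray radiation of PET/CT and to better perform signal attenuation correction for PET/MRI hybrid imaging systems, a transfer fuzzy clustering algorithm is used, under the guidance of the tissue segmentation approach, to partition magnetic resonance imaging (MRI) images, which are harmless to the human body, into different tissue components such as air, fluid, soft tissue, and bone; each tissue is then assigned its own linear attenuation coefficient, thereby achieving attenuation correction of the registered PET image. The method has three main benefits: (1) the transfer fuzzy clustering algorithm can use advanced historical knowledge to assist the MRI tissue segmentation task for the current patient, which ensures clinical effectiveness and robustness and reduces the adverse effects of environmental noise, missing data, and individual anatomical differences on the algorithm; (2) the simple sampling strategy based on transfer learning that is embedded in the algorithm greatly shortens the overall clustering time while preserving robustness, making it suitable for fast clustering segmentation of large medical MRI data and thus improving the practicality of the algorithm; (3) the historical MRI knowledge involved is a high-level summary derived from historical MRI source data rather than the source data itself, which effectively protects patient privacy and meets the basic requirements of medical diagnosis. Experiments on real datasets demonstrate these advantages.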

7.
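周玉 (Zhou Yu). 《计算机应用研究》 (Application Research of Computers), 2021, 38(6): 1683-1688
To improve the performance of neural network classifiers, a segmented sample data selection method based on K-means clustering is proposed. First, K-means clustering groups the training samples into as many clusters as there are known classes; the samples of each class before and after clustering are compared to find the set of incorrectly clustered samples and the set of correctly clustered samples. The correctly clustered samples are sorted by their distance to the cluster center and divided evenly into five segments; the odd-numbered segments of each class, together with the incorrectly clustered samples, form the new training set. The method extracts highly informative samples and removes redundant ones, reducing the number of samples while improving their quality. Using this method, simulation experiments were carried out on three different neural network classifiers with artificial and UCI datasets; the results show that, with an average training-sample compression ratio of 66.93%, the performance of all three classifiers improved.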
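
The selection step described in this abstract can be sketched roughly as follows. The sketch assumes that K-means has already been run, so the cluster assignments and centers are inputs, and it assumes each cluster is matched to the majority class of its members; the abstract does not say how clusters are matched to classes or how uneven segment sizes are handled. The function name `select_samples` and its parameters are hypothetical.

```python
import math
from collections import Counter, defaultdict


def select_samples(points, labels, clusters, centers, n_segments=5):
    """Return sorted indices of the training samples to keep.

    points   : list of coordinate tuples
    labels   : known class label of each sample
    clusters : cluster id assigned to each sample by K-means (run beforehand)
    centers  : mapping or list from cluster id to its center
    """
    members = defaultdict(list)
    for idx, c in enumerate(clusters):
        members[c].append(idx)

    keep = []
    for c, idxs in members.items():
        # Assumption: a cluster stands for the majority class of its members.
        majority = Counter(labels[i] for i in idxs).most_common(1)[0][0]
        wrong = [i for i in idxs if labels[i] != majority]
        right = [i for i in idxs if labels[i] == majority]

        # Incorrectly clustered samples are always kept.
        keep.extend(wrong)

        # Correctly clustered samples: sort by distance to the center,
        # cut into n_segments slices, keep the 1st, 3rd, 5th, ... slice.
        right.sort(key=lambda i: math.dist(points[i], centers[c]))
        size = math.ceil(len(right) / n_segments)
        for s in range(0, n_segments, 2):
            keep.extend(right[s * size:(s + 1) * size])
    return sorted(keep)


if __name__ == "__main__":
    pts = [(float(v),) for v in range(10)] + [(2.5,)]
    lab = ["a"] * 10 + ["b"]
    print(select_samples(pts, lab, [0] * 11, {0: (0.0,)}))
```

Keeping alternate distance bands retains samples both near the center and near the edge of each class, while the misclustered samples, which are likely to lie close to class boundaries, are never discarded.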

8.
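To further improve the video compression ratio, a new fast intra-mode prediction selection algorithm based on the wavelet transform is proposed. The algorithm uses the low-frequency sub-image obtained after the wavelet transform, combined with an improved Pan algorithm, for intra-mode selection, and from this determines the likely prediction modes for each 4×4 sub-block of a macroblock. Experimental results show that the method significantly increases H.264/AVC intra-coding speed while maintaining good video image quality.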

9.
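Application of unsupervised clustering to lithium-ion battery classification   Total citations: 1, self-citations: 0, citations by others: 1
The consistency of the individual cells determines the performance of a battery pack, and how to select cells with consistent performance has long been a key issue in battery pack research. This paper collects six performance indicators (voltage before and after aging, capacity, internal resistance, 1C discharge plateau, and cell thickness) for 100 qualified lithium-ion batteries and studies the data structure using three unsupervised clustering methods: principal component analysis (PCA), kernel principal component analysis (KPCA), and random forest (RF). The results show that the indicators are related in a complex nonlinear way; neither PCA nor KPCA produces clear clusters, whereas the random forest data clearly form four classes in the low-dimensional space. Four batteries selected arbitrarily from these were assembled into a pack for a simulated cycle-performance test, and the results show that cells selected by this method have good consistency.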

10.
Clustering is an unsupervised learning problem: a procedure that partitions data objects into matching clusters. Data objects in the same cluster are quite similar to each other and dissimilar to objects in other clusters. Density-based clustering algorithms find clusters based on the density of data points in a region. DBSCAN is one such algorithm. It can discover clusters of arbitrary shape and requires only two input parameters. DBSCAN has proved very effective for analyzing large and complex spatial databases. However, it needs a large amount of memory and often has difficulty with high-dimensional data and with clusters of very different densities. The partitioning-based DBSCAN algorithm (PDBSCAN) was proposed to solve these problems, but PDBSCAN gives poor results when the density of the data is non-uniform. Moreover, both DBSCAN and PDBSCAN are to some extent sensitive to the initial parameters. In this paper, we propose a new hybrid algorithm based on PDBSCAN. We use a modified ant clustering algorithm (ACA) and design a new partitioning algorithm based on ‘point density’ (PD) in the data preprocessing phase. We name the new hybrid algorithm PACA-DBSCAN. Its performance is compared with that of DBSCAN and PDBSCAN on five data sets. Experimental results indicate the superiority of PACA-DBSCAN.
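
For readers unfamiliar with the baseline, plain DBSCAN and its two input parameters can be sketched as below. This is a textbook-style illustration with a brute-force neighbor search, not the PDBSCAN or PACA-DBSCAN variants discussed in the abstract; the parameter names `eps` and `min_pts` are the conventional ones and are assumed here.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.

    eps     : neighborhood radius
    min_pts : minimum number of points (including the point itself)
              within eps for a point to count as a core point
    """
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbors(i):
        # Brute-force O(n) region query; real implementations use an index.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(dbscan(data, eps=1.5, min_pts=3))
```

Because a single global `eps` and `min_pts` apply to every region, clusters of very different densities are hard to capture with one setting, which is the limitation the abstract refers to.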

11.
Most well-known clustering methods based on distance measures, distance metrics and similarity functions share the problem of getting stuck in local optima, and their performance depends strongly on the initial values of the cluster centers. This paper presents a new approach to clustering problems that uses the bio-inspired Cuttlefish Algorithm (CFA) to search for the cluster centers that minimize the clustering metrics. Various UCI Machine Learning Repository datasets are used to test and evaluate the performance of the proposed method. For comparison, we also analysed several other algorithms: K-means, the Genetic Algorithm and the Particle Swarm Optimization (PSO) Algorithm. The simulations and results demonstrate that the proposed CFA-Clustering method outperforms the counterpart algorithms in most cases. The CFA can therefore be considered an alternative stochastic method for solving clustering problems.

12.
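Feature selection based on the Fisher criterion and feature clustering   Total citations: 2, self-citations: 0, citations by others: 2
王飒 (Wang Sa), 郑链 (Zheng Lian). 《计算机应用》 (Journal of Computer Applications), 2007, 27(11): 2812-2813
Feature selection is one of the important problems in fields such as machine learning and pattern recognition. For high-dimensional data, a feature selection method based on the Fisher criterion and feature clustering is proposed. First, a feature subset with strong discriminative power is preselected using the Fisher criterion; hierarchical clustering is then applied to the features in the preselected subset, ultimately removing irrelevant and redundant features. Experimental results show that the method is an effective feature selection method.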
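
The Fisher-criterion preselection step can be illustrated with a short sketch. It assumes one common per-feature form of the criterion, the ratio of between-class scatter to within-class scatter; the abstract does not state the exact formula used, and the subsequent hierarchical clustering of features is not shown. The function name `fisher_scores` is hypothetical.

```python
from collections import defaultdict


def fisher_scores(samples, labels):
    """Return one Fisher score per feature (higher means more discriminative).

    Assumed form: sum_c n_c * (mean_c - mean)^2 / sum_c sum_{x in c} (x - mean_c)^2
    """
    n_features = len(samples[0])
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    scores = []
    for j in range(n_features):
        overall_mean = sum(x[j] for x in samples) / len(samples)
        between = 0.0
        within = 0.0
        for rows in by_class.values():
            values = [x[j] for x in rows]
            mean = sum(values) / len(values)
            between += len(values) * (mean - overall_mean) ** 2
            within += sum((v - mean) ** 2 for v in values)
        if within > 0:
            scores.append(between / within)
        else:
            # No within-class spread: perfectly separable, or a constant feature.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


if __name__ == "__main__":
    X = [[0, 1], [2, 3], [10, 1], [12, 3]]
    y = [0, 0, 1, 1]
    print(fisher_scores(X, y))
```

In the toy data the first feature separates the two classes and scores high, while the second has identical class means and scores zero, so a preselection threshold would keep only the first.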

13.
Surface defect recognition is important for improving the surface quality of end products. Many methods in this area are based on convolutional neural networks (CNNs), because a CNN can extract features automatically. The extracted features determine recognition performance, so it is important for CNN-based methods to extract effective and sufficient features. However, feature extraction needs a large-scale dataset, which is hard to obtain. To save the cost of collecting samples and still extract effective features, ensemble methods were proposed that make full use of the features extracted by the CNN in order to guarantee good performance with limited samples. These methods, however, are confined to a single sample: they extract multi-level features from one individual sample and ignore the vast information in the dataset. Because the information in one sample is limited, this paper turns its attention to the training dataset and attempts to mine the multi-level information in the dataset for prediction. The proposed method, named Prototype vectors fusion-based CNN (ProtoCNN), utilizes the prototype information in the training dataset. During training, it trains a VGG11 as the base model, and prototype vectors corresponding to each defect class are generated in multiple feature layers of VGG11. During prediction, the prototype vectors are fused to predict unknown samples. Experiments on three well-known datasets (NEU-CLS, a wood dataset, and a textile dataset) indicate that the proposed ProtoCNN outperforms conventional ensemble models and other models for surface defect recognition. On these datasets, ProtoCNN achieves accuracies of 99.86%, 90.01%, and 81.28% respectively, improvements of 1.05%, 4.07%, and 19.53% over its base model. Finally, this paper analyzes the effectiveness and practicality of prototype vectors, showing that the proposed ProtoCNN is practical for real-world applications.

14.
Spectro-temporal representation of speech has become one of the leading signal representation approaches in speech recognition systems in recent years. This representation suffers from the high dimensionality of the feature space, which makes the domain unsuitable for practical speech recognition systems. In this paper, a new clustering-based method is proposed for secondary feature selection/extraction in the spectro-temporal domain. In the proposed representation, Gaussian mixture models (GMM) and weighted K-means (WKM) clustering are applied to the spectro-temporal domain to reduce the dimensionality of the feature space. The elements of the centroid vectors and covariance matrices of the clusters are taken as attributes of the secondary feature vector of each frame. To evaluate the proposed approach, the new feature vectors were tested on phoneme classification within the main phoneme categories of the TIMIT database. The proposed secondary feature vector significantly improves the classification rate for different sets of phonemes compared with MFCC features. The average improvement in the classification rate of voiced plosives over MFCC features is 5.9% with WKM clustering and 6.4% with GMM clustering. The greatest improvement, about 7.4%, is obtained with WKM clustering in the classification of front vowels.

15.
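《电子技术应用》 (Application of Electronic Technique), 2018, (3): 94-98
To meet the low-latency, high-reliability communication requirements of the Internet of Vehicles, a vehicle clustering method based on cluster stability is proposed, which effectively increases communication time and improves communication reliability. On this basis, the relay selection problem for intra-cluster data distribution under power constraints is studied, and a relay selection method based on power pre-allocation is proposed. In this method, vehicles within a cluster cooperate using the HDAF forwarding protocol; before relay selection, power is allocated between the source node and the potential relay nodes to find the power allocation factor that minimizes the system outage probability, and the equivalent channel gains of the nodes, incorporating the power optimization factor, are then compared to select the optimal set of relay nodes. Numerical results show that the cluster-stability-based vehicle clustering method is more stable than clustering based on geographic location. In addition, under the same conditions the proposed relay selection method achieves a lower outage probability than the traditional single-relay and all-relay selection schemes.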

16.
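Text feature extraction method based on pattern clustering and genetic algorithm   Total citations: 2, self-citations: 1, citations by others: 1
郝占刚 (Hao Zhangang), 王正欧 (Wang Zheng'ou). 《计算机应用》 (Journal of Computer Applications), 2005, 25(7): 1632-1633
Pattern clustering and a genetic algorithm are used for text feature extraction, and a Kohonen network is used for classification. Pattern clustering effectively reduces the dimensionality of the text features from several thousand to several hundred. Several hundred dimensions are still too many for a Kohonen network, however, so a genetic algorithm is used to reduce the dimensionality further. Experimental results show that combining the two methods greatly reduces text dimensionality and improves classification accuracy.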

17.
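Because traditional gene selection methods select many redundant genes, which leads to low sample prediction accuracy, a gene selection algorithm based on clustering and particle swarm optimization is proposed. First, a clustering algorithm divides the genes into a fixed number of clusters; next, an extreme learning machine is used as the classifier to evaluate the classification performance of the feature genes in each cluster, yielding a candidate gene pool; finally, a wrapper method based on particle swarm optimization and the extreme learning machine selects from the candidate pool the gene subset with the highest classification rate and the fewest genes. The selected genes have good classification performance. Experimental results on two public microarray datasets show that, compared with some classical methods, the new method achieves higher classification performance with fewer genes.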

18.
A new unconstrained global optimization method based on clustering and parabolic approximation (GOBC-PA) is proposed. Although the proposed method is broadly similar to other evolutionary and stochastic methods, it represents a significant advance in global optimization technology for four reasons. First, it is orders of magnitude faster than existing methods for global optimization of unconstrained problems. Second, it has significantly better repeatability, numerical stability, and robustness than current methods when dealing with high-dimensional functions and functions with many local minima. Third, it finds local minima more easily and quickly by using the parabolic approximation instead of gradient descent or crossover operations. Fourth, it can easily be adapted to any theoretical or industrial system that uses heuristic methods as an intelligent system, such as neural network and neuro-fuzzy inference system training, packing or allocation of objects, and game optimization problems. In this study, we assume that the best cluster center gives the position of the possible global optimum. The use of clustering and curve fitting techniques gives the proposed method multi-start and local search properties. Experimental studies on benchmark functions, a real-world optimization problem, and the tuning of neural network parameters for classification problems show that the proposed methodology is simple and fast, and that it performs better than some state-of-the-art methods.

19.
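Based on the structural characteristics of scientific and technical literature, a four-layer mining model is built and a text feature selection method for classifying such literature is proposed. The method first divides a document into four layers according to its structure, then uses K-means clustering to extract feature words layer by layer from the first three layers, and finally uses the Apriori algorithm to find the maximal frequent itemsets of the fourth layer, which serve as that layer's set of feature words. Because K-means is strongly affected by the initial centers, the method first corrects the distance function between objects by weighting the clustering objects with information entropy, and then uses the weighting function values from the initial clustering to choose more suitable initial cluster centers. In addition, a standard value is set for the K-means termination condition to reduce the number of iterations and hence the learning time, and redundant information produced by dynamic changes in the information is deleted to reduce interference during dynamic clustering, so that the algorithm clusters more accurately and efficiently. These measures allow the method to find feature words in a literature corpus more accurately, a substantial improvement over previous methods that is especially suited to scientific and technical literature. Experimental results show that, when the data volume is large, the method combined with the improved K-means algorithm performs well in classifying scientific and technical literature.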

20.
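To address the optimization of transfer prototype clustering, this paper takes fuzzy knowledge matching transfer prototype clustering as its basis, introduces the transfer learning mechanism from the source domain to the target domain in clustering scenarios, and makes clear that the source-domain cluster centers help the target domain obtain better clustering results. Such transfer mechanisms still face the following challenges: 1) how to overcome the negative effects of forced knowledge matching between different classes in existing transfer prototype clustering methods; 2) when the similarity between the source and target domains is low, how to avoid amplifying the unreasonableness of fuzzy forced matching and the drawback of over-reliance on source-domain knowledge. To this end, a new transfer prototype clustering mechanism, the possibilistic matching knowledge transfer prototype mechanism, is studied, and two concrete transfer clustering algorithms are implemented on this basis. Drawing on the idea of possibilistic matching, the algorithms automatically select and emphasize useful source-domain knowledge, overcome the forced-matching constraint between the source and target domains, and are readily adjustable. Experiments on simulated datasets and the real NG20groups dataset under different transfer scenarios show that the proposed algorithms perform better than existing related algorithms.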
