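K-means text clustering algorithm based on density and nearest neighbors   Total citations: 4, self-citations: 0, citations by others: 4
张文明  吴江  袁小蛟, Journal of Computer Applications (《计算机应用》), 2010, 30(7): 1933-1935
The choice of initial centers strongly affects the clustering quality of the traditional K-means algorithm and easily traps the clustering in a local optimum. To address this problem, the ideas of density and nearest neighbors are introduced and an algorithm for generating initial cluster centers is proposed. Using the selected centers in K-means yields the DN-K-means algorithm, which is better suited to text clustering. Experimental results show that the algorithm produces clusters of high quality and good stability.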
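The abstract does not give the seeding procedure, so the following is only a rough sketch of what density-based seeding could look like, not the DN-K-means method itself. It assumes that density is the number of points within a fixed `radius`, and that a new seed must lie farther than `radius` from every seed already chosen. The name `density_seeds` and both parameters are hypothetical.

```python
import math

def density_seeds(points, k, radius):
    """Pick up to k initial centers: dense points that are not close to an earlier seed.

    Assumption (not from the abstract): density = number of other points
    within `radius`; a candidate is rejected if it lies within `radius`
    of a seed that was already chosen.
    """
    n = len(points)
    density = [
        sum(1 for j in range(n) if j != i and math.dist(points[i], points[j]) <= radius)
        for i in range(n)
    ]
    # Visit points from densest to sparsest (ties keep input order).
    order = sorted(range(n), key=lambda i: -density[i])
    seeds = []
    for i in order:
        if all(math.dist(points[i], s) > radius for s in seeds):
            seeds.append(points[i])
        if len(seeds) == k:
            break
    return seeds

points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1), (10, 0)]
seeds = density_seeds(points, k=2, radius=1.0)
```

The returned seeds would then be passed to an ordinary K-means run as its initial centers, so the isolated point `(10, 0)` is never chosen as a starting center here.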

In this paper, a new approach is presented to find the reference set for the nearest neighbor classifier. The optimal reference set, which has minimum sample size and satisfies a certain error rate threshold, is obtained through a Tabu search algorithm. When the error rate threshold is set to zero, the algorithm obtains a near minimal consistent subset of a given training set. When the threshold is set to an appropriately small value, the obtained reference set may compensate for the bias of the nearest neighbor estimate. An aspiration criterion for Tabu search is introduced, which aims to prevent the search process from wandering inefficiently between the feasible and infeasible regions of the search space and to speed up convergence. Experimental results based on a number of typical data sets are presented and analyzed to illustrate the benefits of the proposed method. Compared to conventional methods, such as CNN and Dasarathy's algorithm, the size of the reduced reference sets is much smaller, and the nearest neighbor classification performance is better, especially when the error rate thresholds are set to appropriate nonzero values. The experimental results also illustrate that the MCS (minimal consistent set) of Dasarathy's algorithm is not minimal, and that its candidate consistent set is not guaranteed to shrink monotonically. A counterexample is also given to confirm this claim.

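To further improve prediction accuracy, the original Pareto dominance relation between candidate solutions is modified and a d-Pareto dominance nearest-neighbor prediction method is proposed. Drawing on the characteristics of multi-objective optimization, a d-Pareto dominance nearest-neighbor prediction framework is given, and d-Pareto dominance prediction is shown to have a lower average prediction error rate than Pareto dominance prediction. The interaction between d-Pareto dominance prediction and multi-objective evolutionary algorithms is also given a preliminary study. Experiments on several classic multi-objective optimization problems show that d-Pareto dominance prediction is feasible and effective to a certain degree.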

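In big-data settings, the high time complexity of the multi-label K-nearest-neighbor algorithm (ML-KNN) becomes especially prominent; moreover, ML-KNN does not consider how the k nearest neighbors each influence the final classification result. To address these problems, the training set is first clustered, and the training cluster closest to the test set is taken as the new training set. The distance weights of the nearest-neighbor samples are then computed and used to describe the influence of the nearest neighbor and the other neighbors on the prediction. Finally, a new objective function is used to classify the test sample. Experiments on image and Web-page text datasets show that the proposed algorithm achieves better classification results and greatly reduces time complexity.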
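The three steps described above can be illustrated with a small sketch. It is an assumption-laden approximation, not the paper's objective function: clusters are taken as given, the nearest cluster is chosen by centroid distance, neighbors are weighted by inverse distance, and a label is predicted when its weighted share of votes reaches a hypothetical `threshold`. All names here are illustrative.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def predict_labels(clusters, query, k=3, threshold=0.5):
    """clusters: list of clusters, each a list of (vector, set_of_labels) pairs."""
    # Step 1: keep only the cluster whose centroid is closest to the query.
    nearest = min(clusters, key=lambda c: math.dist(centroid([v for v, _ in c]), query))
    # Step 2: take the k nearest neighbors inside that cluster and weight
    # each one by inverse distance (an assumed weighting scheme).
    neighbors = sorted(nearest, key=lambda item: math.dist(item[0], query))[:k]
    weights = [1.0 / (math.dist(v, query) + 1e-9) for v, _ in neighbors]
    total = sum(weights)
    # Step 3: predict a label when its weighted vote share reaches the threshold.
    labels = set().union(*(ls for _, ls in neighbors))
    return {
        label for label in labels
        if sum(w for w, (_, ls) in zip(weights, neighbors) if label in ls) / total >= threshold
    }

clusters = [
    [((0, 0), {"cat"}), ((0, 1), {"cat", "pet"}), ((1, 0), {"pet"})],
    [((10, 10), {"car"}), ((10, 11), {"car", "road"})],
]
result = predict_labels(clusters, (10, 10.4), k=2)
```

Restricting the neighbor search to one cluster is what reduces the cost: distances are computed against that cluster's members rather than the whole training set.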

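Nearest-neighbor clustering algorithm based on genetic evolution and its application   Total citations: 4, self-citations: 0, citations by others: 4
A nearest-neighbor clustering algorithm based on genetic evolution is proposed, combining a genetic algorithm (GA) with the nearest-neighbor clustering algorithm (NN). The samples to be classified and their features are selected through optimization: ambiguous samples on class boundaries are removed, features that help classification are amplified, and features that hinder it are suppressed, which improves classification accuracy. Applied to operating-condition classification of pumped-storage generating units, the algorithm greatly improves the recognition of unit operating conditions, confirming the effectiveness of the GA-based nearest-neighbor clustering algorithm.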

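Classification methods based on k-nearest neighbors (kNN) underlie many high-performance pattern recognition techniques, but they are easily affected by the neighborhood parameter k, and choosing a neighborhood for each dataset is difficult when nothing is known about the data's characteristics. To address this, a new supervised classification method, the extended natural neighbor (ENaN) method, is introduced and shown to give better classification results without manual selection of the neighborhood parameter. Compared with...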

In this paper, two novel classifiers based on the locally nearest neighborhood rule, called nearest neighbor line and nearest neighbor plane, are presented for pattern classification. Compared with nearest feature line and nearest feature plane, the proposed methods have much lower computational cost and achieve competitive performance.

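For the results returned by image search engines, the image set is organized by visual similarity so that visually similar images are grouped together, giving users an effective browsing interface. To reduce computation time, a nearest-neighbor search algorithm based on key dimensions is proposed. Experiments demonstrate the effectiveness of these algorithms.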

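金波  张志勇  赵婷, Journal of Computer Applications (《计算机应用》), 2020, 40(8): 2340-2344
To address the leakage of personal location privacy during nearest-neighbor location queries in social networks, the geo-indistinguishability mechanism is used to add random noise to location data, and a privacy budget allocation method is proposed. First, the spatial region is partitioned into a grid, and the privacy budget is allocated individually according to the number of location visits the user makes in each region. Then, to solve the low hit rate of nearest-neighbor queries over perturbed location datasets, a combined incremental nearest-neighbor query (CINQ) algorithm is proposed, which expands the retrieval range of the demand space and uses combined queries to filter redundant data. In simulation experiments, the query hit rate of CINQ is 13.7 percentage points higher than that of the SpaceTwist algorithm. The results show that CINQ effectively solves the low query hit rate caused by perturbing the location of the query target and is suitable for nearest-neighbor queries over perturbed locations in social network applications.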
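As an illustration of the noise-adding step only, the sketch below draws planar Laplace noise, the standard way of achieving geo-indistinguishability: a uniformly random direction and a radius drawn from a Gamma distribution with shape 2 and scale 1/ε. It assumes planar coordinates (for example, meters in a local projection) and does not reproduce the paper's grid-based budget allocation or the CINQ query. The function name `planar_laplace` is hypothetical.

```python
import math
import random

def planar_laplace(x, y, epsilon, rng=random):
    """Perturb the planar point (x, y) with planar Laplace noise.

    epsilon is the privacy parameter per unit of distance; smaller values
    mean more noise. The radial part of the planar Laplace density is
    epsilon^2 * r * exp(-epsilon * r), i.e. Gamma(shape=2, scale=1/epsilon).
    """
    theta = rng.uniform(0.0, 2.0 * math.pi)    # random direction
    r = rng.gammavariate(2.0, 1.0 / epsilon)   # random displacement length
    return x + r * math.cos(theta), y + r * math.sin(theta)

rng = random.Random(0)
displacements = []
for _ in range(20000):
    px, py = planar_laplace(0.0, 0.0, 0.01, rng)
    displacements.append(math.hypot(px, py))
mean_displacement = sum(displacements) / len(displacements)
```

The expected displacement is 2/ε, so with ε = 0.01 per meter a reported location is on average about 200 m from the true one. This is the kind of perturbation that lowers the hit rate of nearest-neighbor queries.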

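Because outer membrane proteins are located on the bacterial surface, they are of great research value for antibiotic and vaccine development. Accurately distinguishing outer membrane proteins from globular proteins, inner membrane proteins and others is an important task, both for identifying outer membrane proteins in genome sequences and for predicting their secondary and tertiary structures. In recent years several methods have been proposed for predicting outer membrane proteins from protein sequences. This paper uses a new kernel method, the kernel nearest-neighbor algorithm, combined with the subsequence distribution of protein sequences, to predict outer membrane proteins, and compares it with the support vector machine method and the traditional nearest-neighbor algorithm. The results show that the algorithm is no worse than existing prediction methods and is simpler and easier to implement. We also find that residue order plays an important role in outer membrane protein prediction.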

Neighbor-weighted K-nearest neighbor for unbalanced text corpus   Total citations: 10, self-citations: 0, citations by others: 10
Text categorization or classification is the automated assignment of text documents to pre-defined classes based on their contents. Many classification algorithms assume that the training examples are evenly distributed among the classes. However, unbalanced data sets often appear in practical applications. To deal with uneven text sets, we propose the neighbor-weighted K-nearest neighbor algorithm, NWKNN. The experimental results indicate that NWKNN achieves a significant improvement in classification performance on imbalanced corpora.
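The abstract does not spell out the weighting, so the sketch below shows one plausible form of neighbor weighting and should be read as an assumption rather than the published formula. Each class receives a weight that shrinks as the class grows, and a neighbor's cosine similarity is multiplied by the weight of its class before votes are summed. The function names and the `exponent` parameter are illustrative.

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nwknn_predict(train, query, k=3, exponent=2.0):
    """train: list of (vector, label) pairs. Returns the predicted label."""
    counts = Counter(label for _, label in train)
    smallest = min(counts.values())
    # Assumed class weight: the smallest class gets 1.0, larger classes get less.
    weight = {c: 1.0 / (n / smallest) ** (1.0 / exponent) for c, n in counts.items()}
    # k most similar training documents.
    neighbors = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)[:k]
    scores = defaultdict(float)
    for vec, label in neighbors:
        scores[label] += weight[label] * cosine(vec, query)
    return max(scores, key=scores.get)

train = [((1, 0.5), "big")] * 4 + [((1, 1), "small")]
prediction = nwknn_predict(train, (1, 1), k=3)
```

In this toy corpus an unweighted vote among the three neighbors would go to `big` (two neighbors against one), while the class weights let the single exact match from the minority class win.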

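Text classification automatically assigns a set of predefined categories or topics to a document. In text classification, the document representation strongly affects the learning performance of the learner. With the goal of classifying Kazakh text, stemming for Kazakh text is designed and implemented according to Kazakh grammar rules, completing the preprocessing of Kazakh text. A sample distance formula based on the nearest support vectors is proposed, which avoids having to choose the parameter k, and Kazakh text is classified with SV-NN, a special combination of the SVM and KNN classification algorithms. Text classification simulation experiments on a self-built Kazakh text corpus demonstrate the effectiveness of the proposed algorithm and confirm the theoretical results.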

We prove a lower bound of d^(1−o(1)) on the query time for any deterministic algorithm that solves approximate nearest neighbor searching in Yao's cell probe model. Our result greatly improves the best previous lower bound for this problem [A. Chakrabarti et al., in: Proc. 31st Ann. ACM Symp. Theory of Computing, 1999, pp. 305-311]. Our proof is also much simpler than the proof of A. Chakrabarti et al.

Superposition of radial basis functions centered at given prototype patterns constitutes one of the most suitable energy forms for gradient systems that perform nearest neighbor classification with real-valued static prototypes. It is shown in this paper that a continuous-time dynamical neural network model, employing a radial basis function sub-network and a sigmoid multi-layer perceptron sub-network, is capable of maximizing such an energy form locally, thus performing almost perfect nearest neighbor classification when initiated by a distorted pattern. The proposed design scheme allows for explicit representation of prototype patterns as network parameters, as well as adding new memory patterns or forgetting existing ones. The dynamical classification scheme implemented by the network eliminates all comparisons, which are the vital steps of the conventional nearest neighbor classification process. The performance of the proposed network model is demonstrated on binary and gray-scale image reconstruction applications.

This paper presents an approach to select the optimal reference subset (ORS) for the nearest neighbor classifier. The optimal reference subset, which has minimum sample size and satisfies a certain resubstitution error rate threshold, is obtained through a tabu search (TS) algorithm. When the error rate threshold is set to zero, the algorithm obtains a near minimal consistent subset of a given training set. When the threshold is set to an appropriately small value, the obtained reference subset may have reasonably good generalization capacity. A neighborhood exploration method and an aspiration criterion are proposed to improve the efficiency of TS. Experimental results based on a number of typical data sets are presented and analyzed to illustrate the benefits of the proposed method. The performance of the resulting consistent and non-consistent reference subsets is evaluated.

Cluster analysis plays an important role in identifying the natural structure of the target dataset. It has been widely used in many fields, such as pattern recognition, machine learning, image segmentation and document clustering. There are many different methods for conducting cluster analysis. However, most real datasets are non-spherical and have complex shapes, and although these methods are widely used for clustering tasks, they are susceptible to noise and arbitrary shapes. We therefore propose a novel clustering algorithm, called RNN-NSDC, which is based on the natural reverse nearest neighbor structure. First, we apply reverse nearest neighbors to extract core objects. Second, the algorithm uses the neighbor structure information of the core objects to cluster. Because noise effects are excluded, the core sets represent the structure of the clusters well. RNN-NSDC can therefore obtain the optimal number of clusters for datasets that contain outliers and clusters of arbitrary shape. To verify the efficiency and accuracy of RNN-NSDC, synthetic and real datasets are used for experiments. The results indicate the superiority of RNN-NSDC over K-means, DBSCAN, DPC, SNNDPC, DCore and NaNLORE.
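The notion of a reverse nearest neighbor used for core-object extraction can be made concrete with a brute-force sketch. The reverse-neighbor definition is standard (point i is a reverse neighbor of j when j is among i's k nearest neighbors). The rule that a core object needs at least `min_reverse` reverse neighbors is an assumption made for illustration and is not taken from RNN-NSDC.

```python
import math

def reverse_knn(points, k):
    """Return rnn, where rnn[j] lists the indices i that have j among their k nearest neighbors."""
    n = len(points)
    rnn = [[] for _ in range(n)]
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in others[:k]:
            rnn[j].append(i)
    return rnn

def core_objects(points, k, min_reverse):
    """Indices of points with at least `min_reverse` reverse neighbors (assumed rule)."""
    rnn = reverse_knn(points, k)
    return [i for i, r in enumerate(rnn) if len(r) >= min_reverse]

points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
rnn = reverse_knn(points, k=2)
cores = core_objects(points, k=2, min_reverse=2)
```

The outlier `(10, 10)` still has nearest neighbors of its own, but no other point lists it as a neighbor, so its reverse-neighbor list is empty and it is not a core object. This asymmetry is what makes reverse neighbors useful for separating noise from cluster structure.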

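Pseudo K-nearest-neighbor text classification algorithm based on cosine distance metric learning   Total citations: 2, self-citations: 0, citations by others: 2
Distance metric learning is widely used in classification. When it is applied to text classification, the vectors produced by the TF*IDF weighting of the commonly used vector space model (VSM) all have the same dimension and are normalized, so the Euclidean distance used as the similarity criterion in traditional distance metric learning often fails to achieve the expected results. Inspired by the LMNN algorithm from distance metric learning, a cosine distance metric learning algorithm suited to text classification, called CS-LMNN, is proposed. Because class skew among samples is common in text classification, a pseudo K-nearest-neighbor classification algorithm is combined with CS-LMNN to perform text classification. The algorithm first applies CS-LMNN to learn a distance metric on the training data, and then classifies the test data with the pseudo K-nearest-neighbor algorithm according to the learned metric. Experimental results show that the algorithm effectively improves classification accuracy.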
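The classification step can be sketched as follows, under stated assumptions. The sketch uses plain cosine distance in place of the learned CS-LMNN metric, and it follows one common formulation of the pseudo nearest neighbor rule: for every class, take the k nearest training samples of that class, sum their distances with weights 1/rank, and pick the class with the smallest score. Dividing by the sum of the weights is an addition made here so that classes with fewer than k samples are not favored. All names are hypothetical.

```python
import math
from collections import defaultdict

def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def pseudo_knn_predict(train, query, k=2):
    """train: list of (vector, label) pairs. Returns the predicted label."""
    by_class = defaultdict(list)
    for vec, label in train:
        by_class[label].append(cosine_distance(vec, query))
    best_label, best_score = None, float("inf")
    for label, dists in by_class.items():
        nearest = sorted(dists)[:k]  # k nearest samples within this class
        weights = [1.0 / rank for rank in range(1, len(nearest) + 1)]
        # Rank-weighted average distance to this class's nearest samples.
        score = sum(w * d for w, d in zip(weights, nearest)) / sum(weights)
        if score < best_score:
            best_label, best_score = label, score
    return best_label

train = [
    ((1, 0), "a"), ((0.9, 0.1), "a"), ((0.8, 0.2), "a"),
    ((0, 1), "b"), ((0.1, 0.9), "b"),
]
prediction = pseudo_knn_predict(train, (0.2, 0.8), k=2)
```

Because every class contributes its own k nearest samples, a large class cannot outvote a small one simply by having more members near the query, which is why this rule suits skewed text collections.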

Text categorization is one of the most common themes in the data mining and machine learning fields. Unlike structured data, unstructured text data is more difficult to analyze because it contains both complicated syntactic and semantic information. In this paper, we propose a two-level representation model (2RM) to represent text data: one level represents syntactic information and the other represents semantic information. At the syntactic level, each document is represented as a term vector in which the value of each component is the term frequency-inverse document frequency weight. The Wikipedia concepts related to the terms at the syntactic level are used to represent the document at the semantic level. We also design a multi-layer classification framework (MLCLA) to make use of the semantic and syntactic information represented in the 2RM model. The MLCLA framework contains three classifiers. Two of them are applied to the syntactic level and the semantic level in parallel. The outputs of these two classifiers are combined and passed to the third classifier, which produces the final results. Experimental results on benchmark data sets (20Newsgroups, Reuters-21578 and Classic3) show that the proposed 2RM model plus MLCLA framework improves text classification performance compared with existing flat text representation models (Term-based VSM, Term Semantic Kernel Model, Concept-based VSM, Concept Semantic Kernel Model and Term + Concept VSM) plus existing classification methods.

This paper proposes a two-stage system for text detection in video images. In the first stage, text lines are detected based on the edge map of the image, leading to a high recall rate at low computational cost. In the second stage, the result is refined using a sliding window and an SVM classifier trained on features obtained by a new Local Binary Pattern-based operator (eLBP) that describes the local edge distribution. The whole algorithm is used in a multiresolution fashion, enabling detection of characters over a broad size range. Experimental results, based on a new evaluation methodology, show the promising overall performance of the system on a challenging corpus, and prove the superior discriminating ability of the proposed feature set against the best features reported in the literature.

We report on the development and implementation of a robust algorithm for extracting text in digitized color video. The algorithm first computes maximum gradient difference to detect potential text line segments from horizontal scan lines of the video. Potential text line segments are then expanded or combined with potential text line segments from adjacent scan lines to form text blocks, which are then subject to filtering and refinement. Color information is then used to more precisely locate text pixels within the detected text blocks. The robustness of the algorithm is demonstrated by using a variety of color images digitized from broadcast television for testing. The algorithm also performs well on images after JPEG compression and decompression, and on images corrupted with different types of noise.
