Similar literature
20 similar documents found
1.
李欣倩  杨哲  任佳 《测控技术》2022,41(2):36-40
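Addressing the conditional-independence assumption that naive Bayes places on features, this paper proposes an improved naive Bayes algorithm based on two-stage feature selection with mutual information and hierarchical clustering. Irrelevant features are first removed using mutual information. The remaining features are then clustered hierarchically by Euclidean distance, with the number of clusters determined by particle swarm optimization. Finally, the feature in each cluster that has the highest mutual information with the class is merged into a feature subset, on which naive Bayes is run to obtain the classification accuracy. Experimental results show that the algorithm effectively reduces the correlation among features and improves classification performance.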
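The filter step and the per-cluster pick described in entry 1 can be sketched roughly as follows. This is a minimal illustration, assuming discrete feature values and clusters that have already been computed; the hierarchical clustering and particle swarm steps are not reproduced, and the names `mutual_information`, `filter_irrelevant`, `best_per_cluster`, and the `threshold` value are hypothetical, not taken from the paper.

```python
from collections import Counter
from math import log2


def mutual_information(feature, labels):
    """Mutual information I(X; Y) in bits between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) / (p(x) * p(y)) simplifies to count * n / (px * py)
        mi += (count / n) * log2(count * n / (px[x] * py[y]))
    return mi


def filter_irrelevant(columns, labels, threshold=0.05):
    """Keep names of feature columns whose MI with the labels exceeds threshold.

    The threshold is an assumed value chosen for illustration only.
    """
    return [name for name, col in columns.items()
            if mutual_information(col, labels) > threshold]


def best_per_cluster(clusters, columns, labels):
    """From each (precomputed) cluster keep the feature with the highest MI."""
    return [max(cluster, key=lambda name: mutual_information(columns[name], labels))
            for cluster in clusters]


labels = [0, 0, 1, 1]
columns = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
kept = filter_irrelevant(columns, labels)
```

Here feature `a` copies the labels and carries one full bit of information, while `b` is independent of them, so only `a` survives the filter.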

2.
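Traditional visual dictionaries are usually generated by K-means clustering. On the one hand, this unsupervised learning does not make full use of prior class information; on the other hand, the limitations of K-means itself lead to visual dictionaries of poor quality. To address these problems, an algorithm that builds the visual dictionary with spectral clustering is proposed: the training samples are partitioned according to their class labels, features are selected using a dynamic mutual information measure, and spectral clustering is performed in the feature space to generate the final visual dictionary. The method makes full use of the samples' class information and the advantages of spectral clustering, and effectively handles the problems caused by the high dimensionality and structural complexity of the image feature space. Experimental results on the Scene-15 data set verify its effectiveness.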

3.
Research on Automatic Topic Detection Based on Incremental Clustering   Cited by: 1 (self-citations: 0, citations by others: 1)
张小明  李舟军  巢文涵 《软件学报》2012,23(6):1578-1587
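With the rapid growth of online information, collecting and organizing relevant information is becoming increasingly difficult. Topic detection and tracking (TDT) is a research direction proposed to solve this problem. Topic detection is one of the important research tasks in TDT; its main goal is to cluster stories that discuss the same topic. Although topic detection has been studied for many years, ever-changing online information makes it more challenging. This paper proposes an automatic topic detection method based on incremental clustering, which aims to improve the efficiency of topic detection and to detect the number of topics in a text collection automatically. An improved weighting algorithm computes feature weights, and document clustering accuracy is improved by adaptively refining text features with strong topic-discriminating power. During clustering, the Bayesian information criterion (BIC) is used to determine the number of topic classes, and the continuity of topics is exploited to pre-cluster documents, which speeds up topic detection. Experimental results on the TDT-4 corpus show that the method substantially improves both the efficiency and the accuracy of topic detection.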

4.
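Observing that latent correlated features exist in the feature space, this paper uses spectral clustering to explore correlations among features and neighborhood mutual information to find the maximally relevant feature subset, and proposes a feature selection algorithm that combines the two. First, neighborhood mutual information is used to remove features irrelevant to the class label. Spectral clustering then groups the features into clusters so that features within a cluster are strongly correlated and features in different clusters are strongly dissimilar. Next, based on neighborhood mutual information, a subset of features that is strongly relevant to the class label and has low redundancy with the other features in its group is selected from each cluster. Finally, all selected subsets are combined into the final feature selection result. Experiments with two base classifiers show that the algorithm achieves high classification performance with a small number of well-chosen features.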

5.
An Improved Feature Weighting Algorithm for Text Classification   Cited by: 6 (self-citations: 2, citations by others: 4)
台德艺  王俊 《计算机工程》2010,36(9):197-199
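TF-IDF is a feature-term weighting algorithm widely used in text classification. It focuses on factors such as term frequency and inverse document frequency but cannot capture how a term is distributed between and within classes. To raise the weight of representative terms that occur frequently in a class and are evenly distributed within it, a distribution concentration coefficient for terms is introduced to improve the IDF function and a dispersion coefficient is used for weighting, yielding the TF-IIDF-DIC weight function. Experimental results show that K-NN text classification with TF-IIDF-DIC weighting improves the macro-averaged F1 score by 6.79% over TF-IDF.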
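For reference, the baseline that entry 5 improves on can be sketched as below. This is plain TF-IDF in one common variant (raw term count times log(N / df)), assuming pre-tokenised documents; it is not the TF-IIDF-DIC function, whose coefficients the abstract does not define, and the name `tf_idf` is illustrative.

```python
from collections import Counter
from math import log


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    Weight = raw term frequency * log(N / document frequency).
    Class labels play no part, which is the limitation the abstract points out.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({term: tf[term] * log(n_docs / df[term]) for term in tf})
    return weights


docs = [["a", "b", "a"], ["b", "c"]]
weights = tf_idf(docs)
```

In this toy input the term `b` occurs in every document, so its weight is zero no matter how it is distributed across classes.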

6.
An Improved Feature Weighting Algorithm for Text Classification   Cited by: 3 (self-citations: 2, citations by others: 1)
台德艺  王俊 《计算机工程》2010,36(9):197-199
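TF-IDF is a feature-term weighting algorithm widely used in text classification. It focuses on factors such as term frequency and inverse document frequency but cannot capture how a term is distributed between and within classes. To raise the weight of representative terms that occur frequently in a class and are evenly distributed within it, a distribution concentration coefficient for terms is introduced to improve the IDF function and a dispersion coefficient is used for weighting, yielding the TF-IIDF-DIC weight function. Experimental results show that K-NN text classification with TF-IIDF-DIC weighting improves the macro-averaged F1 score by 6.79% over TF-IDF.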

7.
Text stream analysis is of great importance and practical value today. It has several applications, such as newsgroup filtering, topic detection and tracking (TDT), and user-characterized recommendation. Clustering is one of the most important methods of analyzing text streams. However, most text stream clustering algorithms rarely consider that features may change over a long clustering run, which is usually the case, and this leads to unsatisfactory clustering results. The paper focuses on the problem of adaptive feature selection for clustering text streams. A validity-index-based method of adaptive feature selection is proposed, and a new text stream clustering algorithm incorporating it is developed. During the clustering process, a threshold on the cluster validity index is used to automatically trigger feature re-selection in order to ensure the validity of the clustering. An experiment using the Reuters-21578 text set as the text source shows that the clustering algorithm produces reasonable, high-quality results.

8.
A Feature Selection Method Based on Document Frequency   Cited by: 1 (self-citations: 1, citations by others: 0)
杨凯峰  张毅坤  李燕 《计算机工程》2010,36(17):33-35,38
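The traditional document frequency (DF) method considers only the DF of a term within a category when selecting features, and ignores the term frequency (TF) of the term in each document. To address this, a new feature selection algorithm is proposed that combines the TF of a term in each document with its DF in the category, and a classifier is trained with a support vector machine. Experimental results show that taking into account the contribution of high-frequency terms to the category during feature selection improves the classification performance of the traditional DF method.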

9.
The State of the Art in Flow Visualisation: Feature Extraction and Tracking   Cited by: 3 (self-citations: 0, citations by others: 3)
Flow visualisation is an attractive topic in data visualisation, offering great challenges for research. Very large data sets must be processed, consisting of multivariate data at large numbers of grid points, often arranged in many time steps. Recently, the steadily increasing performance of computers has again become a driving force for new advances in flow visualisation, especially in techniques based on texturing, feature extraction, vector field clustering, and topology extraction. In this article we present the state of the art in feature-based flow visualisation techniques. We will present numerous feature extraction techniques, categorised according to the type of feature. Next, feature tracking and event detection algorithms are discussed, for studying the evolution of features in time-dependent data sets. Finally, various visualisation techniques are demonstrated. ACM CSS: I.3.8 Computer Graphics - applications

10.
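Effective features in microblog short texts are sparse and hard to extract, which harms the accuracy of microblog text representation, classification, and clustering. To address this, a feature-term selection algorithm for microblog short texts that combines statistical and semantic information is proposed. Based on part-of-speech combination matching rules, the algorithm builds a composite evaluation function from each term's TF-IDF, part of speech, and word-length factor, and combines it with the semantic relatedness between the term and the text content to select feature terms, so that the selected terms accurately represent the topic of the short text. The new feature-term selection algorithm was combined with a naive Bayes classifier in experiments on a microblog classification corpus. The results show higher classification accuracy for microblog short texts than with other traditional algorithms, indicating that the selected feature terms represent the topic of the short text more accurately.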

11.
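Objective: Human action recognition is an important research topic in computer vision with broad application prospects. To address the limitations of local and global spatio-temporal features in action recognition, a novel and effective mid-level spatio-temporal feature for human actions is proposed. Method: The feature describes the structured distribution of local features in the neighborhood of spatio-temporal interest points in a video, which strengthens the discriminative power of those interest points; at the same time, it avoids a global description of the human action and so adapts flexibly to intra-class variation. Mutual information measures the relevance between the mid-level spatio-temporal features and the action classes, and a video is recognized as the action class with which it has maximal mutual information. Results: Experiments show that the mid-level spatio-temporal feature outperforms methods based on local spatio-temporal features and other methods in recognition accuracy, reaching 96.3% on the KTH data set and 98.0% on the Activities of Daily Living (ADL) data set. Conclusion: By exploiting the spatio-temporal distribution of local features, the proposed mid-level feature markedly strengthens action discrimination and can effectively recognize a variety of complex human actions.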

12.
Term frequency–inverse document frequency (TF–IDF) is one of the most popular feature (also called term or word) weighting methods used to describe documents in the vector space model and in applications related to text mining and information retrieval. It can effectively reflect the importance of a term in a collection of documents in which all documents play the same role. However, TF–IDF does not take into account differences in a term's IDF weighting when the documents play different roles in the collection, such as the positive and negative training sets in text classification. In view of this, the paper presents a novel improved TF–IDF feature weighting approach, which reflects the importance of the term in the positive and the negative training examples, respectively. We also build a weighted voting classifier by iteratively applying the support vector machine algorithm, and implement one-class support vector machine and Positive Example Based Learning methods for comparison. During classification, an improved 1-DNF algorithm, called 1-DNFC, is also adopted, aiming at identifying more reliable negative documents from the unlabeled example set. The experimental results show that the classifier based on term frequency inverse positive–negative document frequency outperforms the TF–IDF-based one, and that the weighted voting classifier also outperforms both the one-class support vector machine-based classifier and the Positive Example Based Learning-based classifier. Copyright © 2013 John Wiley & Sons, Ltd.

13.
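The Internet is full of user-generated documents, such as forum posts. How to monitor this disorganized content is one of the main concerns of security departments, and topic detection and tracking (TDT) is an effective means of monitoring. However, forum replies are short and topics shift quickly, which makes forum-oriented TDT exceptionally difficult. In view of these characteristics, three TDT models are given: first, a baseline model; second, to alleviate "topic drift", an improved model that represents a topic as a seed vector plus subsequent vectors; and third, the improved model with the latest named-entity (NE) weight adjustment strategy applied. Because forum posts are irregularly formatted and TDT systems must process them quickly, a feature extraction method is also proposed. Finally, experimental results for the TDT models on a real data set confirm the effectiveness of the models and of the feature extraction method.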

14.
Modified sequential k-means clustering addresses a k-means clustering problem in which the clustering machine additionally uses output similarity. While conventional clustering methods commonly recognize similar instances at the feature level, modified sequential clustering also takes advantage of the response. To this end, the approach we pursue is to enhance the quality of clustering by using additional relevant information. This information enables the clustering machine to detect more patterns and dependencies that may be relevant. It allows one to determine, for instance, which fashion products exhibit similar behaviour in terms of sales. Conventional clustering methods cannot handle such cases, because they treat attributes solely at the feature level without considering any response. In this study, we introduce a novel approach built on minimum conditional entropy clustering and show its advantages in terms of data analytics. In particular, we achieve this by modifying the conventional sequential k-means algorithm. The modified clustering approach reflects the response effect in a consistent manner. To verify the feasibility and performance of this approach, we conducted several experiments on real data from the apparel industry.

15.
Design and Implementation of an Internet Topic Detection and Tracking System   Cited by: 1 (self-citations: 0, citations by others: 1)
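For the massive volume of natural-language text published on Internet forums and news sites, this paper designs a topic detection and tracking system that classifies and organizes the data and aggregates it into topics. The core of the system uses an SVM for text classification and implements topic aggregation based on a knowledge base and a network flow algorithm. Test results show a classification accuracy of 92% and a clustering accuracy of 88%, indicating high practical value.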

16.
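To address problems with using the LDA topic model for product feature extraction, the SA-LDA method, which combines syntactic analysis with the topic model, is proposed. First, syntactic analysis is applied to all product reviews in the product's category to extract explicit features, which are clustered to produce a feature set and an opinion set; a corpus is built from these. Then, for each review of the product under analysis, subjective sentences are extracted and their topics are learned with an improved LDA model. Must-link and cannot-link constraints are constructed from the corpus and are used to constrain and guide topic updates; each topic corresponds to one feature class. Experiments show that the method works well for both explicit and implicit features and, compared with traditional methods and other improved methods, improves precision to some extent while maintaining recall.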

17.
Unsupervised feature selection is an important problem, especially for high-dimensional data. However, until now, it has been scarcely studied and the existing algorithms cannot provide satisfactory performance. Thus, in this paper, we propose a new unsupervised feature selection algorithm using similarity-based feature clustering, Feature Selection-based Feature Clustering (FSFC). FSFC removes redundant features according to the results of feature clustering based on feature similarity. First, it clusters the features according to their similarity. A new feature clustering algorithm is proposed, which overcomes the shortcomings of K-means. Second, it selects a representative feature from each cluster, one that contains the most interesting information of the features in the cluster. The efficiency and effectiveness of FSFC are tested on real-world data sets and compared with two representative unsupervised feature selection algorithms, Feature Selection Using Similarity (FSUS) and Multi-Cluster-based Feature Selection (MCFS), in terms of runtime, feature compression ratio, and the clustering results of K-means. The results show that FSFC can not only reduce the feature space in less time, but also significantly improve the clustering performance of K-means.
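The two-stage idea in entry 17 (cluster features by similarity, then keep one representative per cluster) can be illustrated with the sketch below. It is not the FSFC algorithm: the abstract does not specify the similarity measure or the clustering procedure, so absolute Pearson correlation, a greedy seed-based grouping, and the `threshold` value are all assumptions made for illustration.

```python
from math import sqrt


def abs_pearson(a, b):
    """Absolute Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant feature is treated as uncorrelated
    return abs(cov) / sqrt(var_a * var_b)


def cluster_features(columns, threshold=0.9):
    """Greedy grouping: a feature joins the first cluster whose seed it resembles."""
    clusters = []  # each cluster is a list of feature names; the first is the seed
    for name in columns:
        for cluster in clusters:
            if abs_pearson(columns[name], columns[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def pick_representatives(columns, clusters):
    """Keep, per cluster, the feature most similar on average to its cluster mates."""
    reps = []
    for cluster in clusters:
        def avg_sim(name):
            others = [o for o in cluster if o != name]
            if not others:
                return 0.0
            return sum(abs_pearson(columns[name], columns[o]) for o in others) / len(others)
        reps.append(max(cluster, key=avg_sim))
    return reps


columns = {"x": [1, 2, 3, 4], "x2": [2, 4, 6, 8], "y": [1, -1, 1, -1]}
clusters = cluster_features(columns)
selected = pick_representatives(columns, clusters)
```

In the toy data `x2` is an exact multiple of `x`, so the two fall into one cluster and only one of them is kept, while `y` forms its own cluster.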

18.
Selecting a subset of salient features for performing clustering using a clustering learning algorithm has been explored extensively in many real-world applications. To select salient features during training, the filter model evaluates the intrinsic characteristics of each individual feature but is not permitted to use a clustering learning algorithm that provides clustered information to train the features. In particular, the filter model aims to predict unobservable clusters and measure how the features help provide satisfactory within-cluster and between-cluster scatters to achieve a good clustering quality. However, it is generally difficult to achieve both scatters in the filter model. For example, a random variable with a large variance may raise only the between-cluster scatter, whereas another variable following a uniform distribution may raise only the within-cluster scatter. In this paper, we present a new filter-based method to quantify features that considers feature compactness and separability to ensure that both scatters are raised. Moreover, our method adopts a new search strategy to locate the best feature salience vector instead of visiting the space of all the possible feature subsets. After the benchmark data sets are tested, the experimental results indicate that our method performs better than many benchmark filter-based methods at selecting a feature subset to perform clustering.

19.
With the rapid development of information techniques, the dimensionality of data in many application domains, such as text categorization and bioinformatics, is getting higher and higher. High-dimensional data may cause many problems for traditional learning algorithms in pattern classification, such as overfitting, poor performance, and low efficiency. Feature selection aims at reducing the dimensionality of data and providing discriminative features for pattern learning algorithms. Due to its effectiveness, feature selection is now gaining increasing attention from a variety of disciplines, and many efforts have been made in this field. In this paper, we propose a new supervised feature selection method to pick important features by using information criteria. Unlike other selection methods, the main characteristic of our method is that it not only takes both maximal relevance to the class labels and minimal redundancy to the selected features into account, but also works like feature clustering in an agglomerative way. To measure the relevance and redundancy of features exactly, two different information criteria, i.e., mutual information and coefficient of relevance, have been adopted in our method. The performance evaluations on 12 benchmark data sets show that the proposed method can achieve better performance than other popular feature selection methods in most cases.
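The "maximal relevance, minimal redundancy" trade-off mentioned in entry 19 can be sketched with a greedy selector. This is an assumption-laden illustration using the classic difference criterion (relevance minus mean redundancy, both measured by mutual information on discrete data); it does not implement the paper's agglomerative procedure or its coefficient of relevance, and `greedy_mrmr` is a hypothetical name.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information in bits between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def greedy_mrmr(columns, labels, k):
    """Select k feature names, trading relevance against redundancy."""
    relevance = {name: mutual_information(col, labels)
                 for name, col in columns.items()}
    selected = []
    candidates = list(columns)
    while candidates and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            # Mean MI with the features already chosen acts as the redundancy penalty.
            redundancy = sum(mutual_information(columns[name], columns[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
labels = [0, 0, 1, 1, 0, 1, 1, 1]
chosen = greedy_mrmr({"a": a, "a_dup": list(a), "b": b}, labels, 2)
```

The duplicate `a_dup` is as relevant as `a` but fully redundant with it, so the second pick goes to the weaker but independent feature `b`.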

20.
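To address the difficulty of detecting hot topics caused by the sparsity of microblog text data, a topic detection model based on IDLDA-ITextRank is proposed. First, the IDLDA topic text clustering model is built by introducing microblog time-series features and term-frequency features, and is used to cluster texts on the same topic into a text set TS. Next, the ITextRank summarization and keyword extraction model is built with a similarity measure that combines edit distance and character vectors, and is used to extract a summary and keywords from TS. Finally, word mutual information and left and right information entropy are used to convert the extracted keywords into key topic phrases, which are combined with the summary to describe the topic. Experiments show that IDLDA clusters topic texts better than the traditional BTM and LDA models, and that describing microblog topics with key topic phrases and summaries is more understandable than describing them with topic words alone.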
