Similar Documents
20 similar documents found (search time: 140 ms)
1.
Feature weighting in text sentiment analysis generally considers two factors: the importance of a term in the document (ITD) and the importance of a term in expressing sentiment (ITS). Building on two supervised feature-weighting algorithms with high classification accuracy in this field, a new ITS algorithm is proposed. The new algorithm considers both a feature's document frequency within one class of documents (the number of documents in that class in which the feature appears) and the proportion this makes up of the feature's total document frequency, so that features that appear mainly, and in large numbers, within a single class of documents receive higher ITS weights. Experiments show that the new algorithm improves the accuracy of text sentiment classification.
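The abstract does not give the exact weighting formula, so the following Python sketch only illustrates the idea under a stated assumption: the ITS weight of a term rewards both its document frequency inside one class and that frequency's share of the term's total document frequency. The combination `df_class * ratio` is hypothetical.

```python
import numpy as np

def its_weights(binary_doc_term, labels, target_class):
    """Sketch of a class-based ITS weight: terms that occur mostly, and
    frequently, in documents of one class get higher weights.

    binary_doc_term : (n_docs, n_terms) 0/1 matrix of term presence
    labels          : (n_docs,) class label per document
    """
    in_class = binary_doc_term[labels == target_class]
    df_class = in_class.sum(axis=0)                 # document frequency inside the class
    df_total = binary_doc_term.sum(axis=0) + 1e-9   # document frequency over all classes
    ratio = df_class / df_total                     # share of the term's documents in this class
    # Hypothetical combination: reward both absolute and relative dominance.
    return df_class * ratio

# toy usage
X = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1]])
y = np.array([0, 0, 1, 1])
print(its_weights(X, y, target_class=0))
```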

2.
Ensemble learning is applied to XML document clustering to remedy the shortcomings of traditional clustering algorithms. An XML document vector model combining tags and paths is proposed. Based on this model, the original document set is sampled several times, K-means clustering is run on each sampled set, and hierarchical clustering is then applied to the collection of resulting cluster centers. Experiments on synthetic and real data sets show that the algorithm outperforms K-means in recall and precision and is more robust.
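A minimal Python sketch of the ensemble step described above, assuming the tag-and-path document vectors are already built; scikit-learn's KMeans and AgglomerativeClustering stand in for the paper's specific implementations, and the sampling scheme and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def ensemble_cluster(doc_vectors, k=5, n_samples=10, sample_frac=0.6, seed=0):
    """Sketch: run K-means on several samples of the document set, then merge
    all resulting centers with hierarchical clustering into k consensus centers."""
    rng = np.random.default_rng(seed)
    n = doc_vectors.shape[0]
    centers = []
    for _ in range(n_samples):
        idx = rng.choice(n, size=int(sample_frac * n), replace=False)
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(doc_vectors[idx])
        centers.append(km.cluster_centers_)
    centers = np.vstack(centers)
    # Group the collected centers; each group's mean is a consensus center.
    agg = AgglomerativeClustering(n_clusters=k).fit(centers)
    final_centers = np.array([centers[agg.labels_ == c].mean(axis=0) for c in range(k)])
    # Assign each document to its nearest consensus center.
    dists = np.linalg.norm(doc_vectors[:, None, :] - final_centers[None, :, :], axis=2)
    return dists.argmin(axis=1), final_centers
```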

3.
Multi-label classification essentially predicts, for a given instance, the set of labels associated with it. Typical methods fall into two categories: problem transformation and algorithm adaptation. This paper focuses on problem-transformation methods based on the label powerset. Because existing label powerset algorithms can hardly discover, and may even ignore, important label sets hidden in the training set, a label powerset method based on label clustering is proposed: an improved balanced k-means clustering is used to discover potentially important label sets in the training set, which are then used to form a new training set for multi-label classification. Experiments show that, on several evaluation metrics, the algorithm achieves better classification performance than the original label powerset method.
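A rough Python sketch of the label-powerset-via-clustering idea, with plain k-means on the binary label vectors standing in for the paper's improved balanced k-means; the choice of base classifier (logistic regression) and the rounding of cluster centers back to label sets are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def label_powerset_via_clustering(X, Y, n_label_sets=8, seed=0):
    """Sketch: discover representative label sets by clustering the binary
    label vectors, then train one multi-class classifier over those sets."""
    km = KMeans(n_clusters=n_label_sets, n_init=10, random_state=seed).fit(Y)
    # Round each cluster center back to a 0/1 label combination.
    label_sets = (km.cluster_centers_ > 0.5).astype(int)
    targets = km.labels_                          # each instance -> its label-set id
    clf = LogisticRegression(max_iter=1000).fit(X, targets)
    return clf, label_sets

def predict_labels(clf, label_sets, X_new):
    return label_sets[clf.predict(X_new)]         # map predicted id back to a label set
```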

4.
In traditional text classification, the document vector space matrix suffers from the "curse of dimensionality" and extreme sparsity; extracting the keywords most relevant to each class as classification features helps solve both problems. Based on this observation, a short text classification framework based on keyword similarity is proposed. The framework first trains a word2vec word-vector model on a large corpus; it then extracts keywords for each class of text with TextRank and deduplicates the keyword collection to form the feature set. For each feature, the similarity between every word of a short text and that feature is computed with the word-vector model, and the maximum similarity is taken as the feature's weight. Finally, K-nearest neighbors (KNN) and support vector machine (SVM) classifiers are trained. In experiments on a Chinese news-headline data set, classification performance improved by about 6% on average over traditional short text classification methods, validating the framework's effectiveness.
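A small Python sketch of the feature-weighting step, assuming a trained word-vector lookup `wv` (e.g. from word2vec) and a deduplicated feature keyword list (e.g. obtained with TextRank) are available.

```python
import numpy as np

def text_to_feature_vector(words, feature_keywords, wv):
    """Sketch: for each feature keyword, the weight is the maximum cosine
    similarity between that keyword and any word of the short text.

    words            : tokenized short text
    feature_keywords : deduplicated keywords extracted per class
    wv               : mapping word -> embedding vector (e.g. a trained word2vec model)
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    vec = np.zeros(len(feature_keywords))
    in_vocab = [w for w in words if w in wv]
    for j, kw in enumerate(feature_keywords):
        if kw in wv and in_vocab:
            vec[j] = max(cos(wv[kw], wv[w]) for w in in_vocab)
    return vec

# The resulting vectors can then be fed to KNN or SVM classifiers as in the abstract.
```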

5.
李勇  相中启 《计算机应用》2019,39(1):245-250
To address the problems that existing ciphertext retrieval schemes in cloud computing do not support semantic expansion of query keywords, lack precision, and cannot rank results, a rankable ciphertext retrieval scheme supporting semantic expansion of query keywords is proposed. First, term frequency-inverse document frequency (TF-IDF) is used to compute the relevance score between a keyword and a document, different position weights are assigned to keywords in different fields of a document, a field-weighted score is computed, and the product of the relevance score and the position-weight score is stored at the keyword's position in the document index vector. Second, the query keywords entered by an authorized user are semantically expanded with the WordNet semantic network to obtain an expanded query keyword set; the edit-distance formula is used to compute the similarity between keywords in this set, and the similarity value is stored at the keyword's position in the document query vector. Finally, a secure index and a retrieval trapdoor are produced by encryption, inner products are computed under the vector space model (VSM), and the results are used to rank the retrieved ciphertext documents. Theoretical analysis and simulation show that the scheme is secure under the known-ciphertext model and the known-background model and can rank retrieval results; compared with the multi-keyword ranked search over encrypted data (MRSE) scheme, it supports semantic keyword expansion, achieves more accurate and reliable query precision, and has comparable retrieval time.
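A hedged Python sketch of how an index-vector entry might be formed as the product of a TF-IDF relevance score and a field (position) weight score; the TF-IDF variant, the field names, and the use of the maximum field weight are all assumptions, since the scheme's exact formulas are not given in the abstract.

```python
import math

def index_entry(tf, df, n_docs, field_weights, fields_of_term):
    """Sketch: the value stored at a keyword's slot in the document index
    vector is (TF-IDF relevance score) x (field/position weight score).

    field_weights  : e.g. {"title": 3.0, "body": 1.0}  (hypothetical)
    fields_of_term : fields of the document in which the keyword occurs
    """
    tfidf = (tf / (1 + tf)) * math.log((n_docs + 1) / (df + 1))     # one common TF-IDF variant
    position_score = max(field_weights[f] for f in fields_of_term)  # field-weighted score
    return tfidf * position_score

print(index_entry(tf=4, df=12, n_docs=1000,
                  field_weights={"title": 3.0, "body": 1.0},
                  fields_of_term=["title", "body"]))
```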

6.
An association-rule-based algorithm for generating Chinese concept sets   Cited: 1 (self: 0, others: 1)
This paper proposes an association-rule-based algorithm for generating Chinese concept sets. The algorithm first produces a Chinese keyword set for each document and represents documents with the vector space model (VSM); it then treats Chinese keywords as transaction items and documents as transactions, and applies a mature association-rule algorithm to find frequent Chinese keyword sets; it finally generates the initial concept set and clusters it, producing the Chinese concept set automatically. The algorithm also supports incremental updating of the concept set. Experiments show that the algorithm can effectively generate Chinese concept sets and can be used for semantic dimensionality reduction of the high-dimensional feature vectors representing Chinese documents, so it has practical value.
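A toy Python sketch of the frequent-keyword-set step, with documents as transactions and keywords as items; brute-force counting stands in for the "mature association-rule algorithm" (e.g. Apriori) mentioned above, and the support threshold is illustrative.

```python
from itertools import combinations
from collections import Counter

def frequent_keyword_sets(doc_keyword_sets, min_support=2, max_size=3):
    """Sketch: documents are transactions, keywords are items; count every
    keyword combination up to max_size and keep those meeting min_support.
    (A real implementation would use Apriori/FP-growth pruning.)"""
    counts = Counter()
    for kws in doc_keyword_sets:
        kws = sorted(set(kws))
        for size in range(1, max_size + 1):
            for combo in combinations(kws, size):
                counts[combo] += 1
    return {itemset: c for itemset, c in counts.items() if c >= min_support}

docs = [{"索引", "检索", "向量"}, {"索引", "检索"}, {"向量", "聚类"}]
print(frequent_keyword_sets(docs))
```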

7.
A reduced support vector machine based on the Morlet wavelet kernel   Cited: 7 (self: 0, others: 7)
To address the fact that support vector machine (SVM) training is only practical for relatively small sample sets, a reduced support vector machine based on the Morlet wavelet kernel (MWRSVM-DC) is proposed. The core of the algorithm is to use density-based clustering to find the edge points of each cluster as the reduced set, and to search for support vectors within this reduced set. Experiments show that, with the wavelet kernel, the algorithm improves not only classification accuracy but also overall classification efficiency.
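For reference, the Morlet wavelet kernel commonly used in wavelet SVMs has the product form below; the reduced-set construction via density clustering is not shown, and the scale parameter `a` is illustrative.

```python
import numpy as np

def morlet_wavelet_kernel(x, z, a=1.0):
    """Morlet wavelet kernel in the product form used for wavelet SVMs:
    K(x, z) = prod_i cos(1.75 * (x_i - z_i) / a) * exp(-(x_i - z_i)^2 / (2 * a^2))."""
    d = (np.asarray(x, dtype=float) - np.asarray(z, dtype=float)) / a
    return float(np.prod(np.cos(1.75 * d) * np.exp(-0.5 * d ** 2)))

# Such a kernel can be plugged into an SVM that accepts custom or precomputed kernels,
# e.g. sklearn.svm.SVC(kernel="precomputed") on a Gram matrix built with this function.
```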

8.
Based on an analysis of two SVM training algorithms (SMO and SVMlight), an improved algorithm based on set partitioning and SMO, SDBSMO, is proposed. The algorithm partitions the training set into several subsets according to how severely each sample violates the optimality conditions; after each iteration it uses this partition information to quickly update the working set and related parameters, reducing the cost per iteration and speeding up training. Experimental results show that the algorithm substantially improves SVM training speed.

9.
Falls are one of the leading causes of injury and death among the elderly; in China, about 40 million people over 65 fall accidentally each year. This paper proposes a human fall-detection algorithm based on smartphone sensors such as the accelerometer and barometer. The algorithm first trains a support vector machine (SVM) on the training set to obtain a weak binary classifier (the optimal hyperplane and the support vector set), and then computes the distance from a test sample to the optimal hyperplane. If this distance exceeds a preset margin, the SVM decision is used directly; otherwise, the support vector set is used as a labeled training set for K-nearest-neighbor (KNN) classification. Given the multi-dimensional nature of the features, the standardized Euclidean distance is used in place of the ordinary Euclidean distance. Simulation and experimental results show that, compared with the traditional SVM algorithm, the proposed algorithm effectively improves fall-detection accuracy and is not restricted by where the smartphone is placed.
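A Python sketch of the SVM-then-KNN decision rule described above, using scikit-learn; the linear kernel, the margin threshold, and the number of neighbors are assumptions, while the standardized Euclidean distance corresponds to the "seuclidean" metric.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def fit_hybrid(X_train, y_train):
    """Sketch: a linear SVM gives the hyperplane and support vectors; samples
    close to the hyperplane are re-classified by KNN over the support vectors
    with a standardized Euclidean distance."""
    svm = SVC(kernel="linear").fit(X_train, y_train)
    sv_X = svm.support_vectors_
    sv_y = y_train[svm.support_]
    V = X_train.var(axis=0) + 1e-12   # per-feature variances for the standardized metric
    knn = KNeighborsClassifier(n_neighbors=3, algorithm="brute",
                               metric="seuclidean", metric_params={"V": V}).fit(sv_X, sv_y)
    return svm, knn

def predict_hybrid(svm, knn, X, margin=1.0):
    w_norm = np.linalg.norm(svm.coef_)
    dist = np.abs(svm.decision_function(X)) / w_norm   # geometric distance to the hyperplane
    use_svm = dist > margin
    return np.where(use_svm, svm.predict(X), knn.predict(X))
```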

10.
A document structural similarity measure based on mining frequent subtrees with the TreeMiner algorithm is proposed, avoiding both the high computational cost of traditional edit-distance methods and the inability of path-matching methods to handle repeated tags. The method builds a new retrieval model, the frequent-structure vector model, gives the structural vector representation of documents and the weight function, and constructs a formula for computing XML document structural similarity. The TreeMiner algorithm is also improved in its data structures and mining procedure, making it better suited to structure mining over large document sets. Experimental results show that the method achieves high computational precision and accuracy.

11.
A bilingual information filtering algorithm based on singular value decomposition   Cited: 1 (self: 0, others: 1)
This paper proposes a bilingual information filtering algorithm based on SVD (singular value decomposition). Bilingual documents are given a unified representation, so that algorithms designed for monolingual filtering can be applied directly to bilingual filtering; at the same time, the document vectors are compressed and noise is filtered out. As an application, the bilingual filtering algorithm is used for personalized, proactive information filtering on the Internet.
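A minimal Python sketch of SVD-based compression of document vectors, assuming (hypothetically) that Chinese and English terms simply share one vocabulary so each bilingual document maps to a single TF-IDF vector; the paper's actual unified representation may differ.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical unified representation: Chinese and English terms share one vocabulary,
# so each (bilingual) document becomes a single TF-IDF vector.
docs = ["information filtering 信息 过滤",
        "vector space 向量 空间 模型",
        "personalized 个性化 过滤"]

X = TfidfVectorizer(token_pattern=r"\S+").fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)   # compress the vectors and filter noise
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)   # (3, 2): each document now lives in a low-dimensional latent space
```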

12.
In the context of information retrieval (IR) from text documents, the term weighting scheme (TWS) is a key component of the matching mechanism when using the vector space model. In this paper, we propose a new TWS that is based on computing the average term occurrences of terms in documents and also uses a discriminative approach based on the document centroid vector to remove less significant weights from the documents. We call our approach Term Frequency With Average Term Occurrence (TF-ATO). An analysis of commonly used document collections shows that test collections are not fully judged, as achieving that is expensive and may be infeasible for large collections. A document collection being fully judged means that every document in the collection acts as a relevant document to a specific query or a group of queries. The discriminative approach used in our proposed method is a heuristic for improving IR effectiveness and performance, and it has the advantage of not requiring previous knowledge about relevance judgements. We compare the performance of the proposed TF-ATO to the well-known TF-IDF approach and show that using TF-ATO results in better effectiveness in both static and dynamic document collections. In addition, this paper investigates the impact that stop-word removal and our discriminative approach have on TF-IDF and TF-ATO. The results show that both stop-word removal and the discriminative approach have a positive effect on both term-weighting schemes. More importantly, it is shown that using the proposed discriminative approach is beneficial for improving IR effectiveness and performance with no information on the relevance judgements for the collection.
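The exact TF-ATO formula is not reproduced in the abstract, so the following Python sketch is only one plausible reading: each weight is the raw term frequency divided by the term's average occurrence over the documents containing it, and the discriminative step zeroes weights not above the document-centroid value.

```python
import numpy as np

def tf_ato_weights(tf, discriminative=True):
    """Hedged sketch of TF-ATO as described in the abstract.

    tf : (n_docs, n_terms) raw term-frequency matrix.
    Assumed reading: divide each term frequency by that term's average
    occurrence over the documents that contain it; the discriminative step
    then removes (zeroes) weights not above the document-centroid value.
    """
    present = tf > 0
    ato = tf.sum(axis=0) / np.maximum(present.sum(axis=0), 1)   # average term occurrence
    W = tf / np.maximum(ato, 1e-12)
    if discriminative:
        centroid = W.mean(axis=0)                               # document centroid vector
        W = np.where(W > centroid, W, 0.0)                      # drop less significant weights
    return W
```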

13.
In a microblog search system, ranking posts by search relevance and importance and returning them as a list is currently the main way of presenting results. Scoring functions based on the vector space model are widely used in such systems. In practice, the actual value of a post's importance score is not visible to users; a document's influence is expressed only through its rank. How, then, can the influence of a document outside a given query be measured within the corpus of an information retrieval system? General search engines and IR systems cannot answer this question well. This paper introduces the concept of social influence for microblog short texts and, by adding reverse position markers on top of the text's inverted index, presents a new influence metric that effectively answers this question. Theoretical analysis and experiments on real data verify the effectiveness and efficiency of the algorithm.

14.
An RFID anti-collision algorithm based on BIBD(4,2,1)   Cited: 3 (self: 1, others: 2)
Tag anti-collision is one of the key technologies in radio frequency identification (RFID) systems, determining the tag read rate and accuracy. Based on the binary search algorithm and the balanced incomplete block design BIBD(4,2,1), a new deterministic RFID tag anti-collision algorithm is proposed. Tags are divided into segments, each containing only subsets of BIBD(4,2,1), and are read segment by segment to achieve fast identification. Mathematical analysis and simulation show that the algorithm identifies tags faster than the binary and dynamic binary algorithms, reaching more than six times the speed of the binary algorithm, and is well suited to environments with many tags and long UIDs.

15.
Automatic classification of Web documents based on latent semantic indexing   Cited: 6 (self: 1, others: 6)
Web mining has broad commercial prospects, but existing Web mining techniques suffer from heavy computation and limited accuracy. The proposed LSIWAC algorithm first uses latent semantic indexing to compress the word space of Web pages into a low-dimensional feature space; it then applies optimal clustering in this feature space to partition the sample set into clusters; the discriminative features of each cluster are further compressed and extracted with an optimal discriminant transform, and the resulting feature vectors are used for classification. The method overcomes the high-dimensionality problem, effectively improves classification accuracy, and reduces computation. Experimental results verify the effectiveness of the proposed method.

16.
Clustering groups document objects represented as vectors. A very large vector space can hinder the application of these methods, so the vector space was reduced with principal component analysis (PCA). The conventional cosine measure is not the only choice with PCA, which involves mean-correction of the data. Since mean-correction changes the location of the origin, the angles between the document vectors also change. To avoid this, we used a connection between the cosine measure and the Euclidean distance in association with PCA, and grounded searching on the latter. We applied single linkage, complete linkage, and Ward clustering to Finnish documents, utilizing their relevance assessments as a new feature. After normalization of the data, PCA was run and the relevant documents were clustered.
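A short Python sketch of the idea: for L2-normalized vectors, ||a - b||^2 = 2(1 - cos(a, b)), so after normalization Euclidean distance is a monotone transform of the cosine measure, and PCA (a rotation plus truncation) approximately preserves those distances; the data and parameters below are illustrative.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

# For unit-length vectors, ||a - b||^2 = 2 * (1 - cos(a, b)), so Euclidean
# distance on normalized vectors orders pairs the same way cosine does.
X = np.random.default_rng(0).random((30, 200))   # toy document-term matrix
Xn = normalize(X)                                # L2-normalize each document vector
Xp = PCA(n_components=10).fit_transform(Xn)      # reduce the vector space

Z = linkage(Xp, method="ward")                   # Ward clustering grounded on Euclidean distance
labels = fcluster(Z, t=4, criterion="maxclust")  # cut the dendrogram into four clusters
print(labels)
```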

17.
In this paper, we extend the work of Kraft et al. to present a new method for fuzzy information retrieval based on fuzzy hierarchical clustering and fuzzy inference techniques. First, we present a fuzzy agglomerative hierarchical clustering algorithm for clustering documents and obtaining the cluster centers of the document clusters. Then, we present a method to construct fuzzy logic rules based on the document clusters and their cluster centers. Finally, we apply the constructed fuzzy logic rules to modify the user's query for query expansion and to guide the information retrieval system in retrieving documents relevant to the user's request. The fuzzy logic rules can represent three kinds of fuzzy relationships (i.e., fuzzy positive association, fuzzy specialization and fuzzy generalization) between index terms. The proposed fuzzy information retrieval method is more flexible and more intelligent than existing methods because it can expand users' queries for fuzzy information retrieval in a more effective manner.

18.
Kwong  Linus W.  Ng  Yiu-Kai 《World Wide Web》2003,6(3):281-303
To retrieve Web documents of interest, most Web users rely on Web search engines. All existing search engines provide a query facility for users to search for the desired documents using keywords. However, when a search engine retrieves a long list of Web documents, the user may need to browse through each retrieved document to determine which are of interest. We observe that two kinds of problems are involved in the retrieval of Web documents: (1) an inappropriate selection of keywords by the user; and (2) poor precision in the retrieved Web documents. To solve these problems, we propose an automatic binary-categorization method for recognizing multiple-record Web documents of interest, which appear often in advertisement Web pages. Our categorization method uses application ontologies and is based on two information retrieval models, the Vector Space Model (VSM) and the Clustering Model (CM). We analyze and cull Web documents to just those applicable to a particular application ontology. The culling analysis (i) uses CM to find a virtual centroid for the records in a Web document, (ii) computes a vector in a multi-dimensional space for this centroid, and (iii) compares the vector with the predefined ontology vector of the same multi-dimensional space using VSM, in which we consider the magnitudes of the vectors as well as the angle between them. Our experimental results show that we have achieved an average of 90% recall and 97% precision in recognizing Web documents belonging to the same category (i.e., domain of interest). Thus our categorization discards very few documents it should have kept and keeps very few it should have discarded.
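A simplified Python sketch of the culling comparison, assuming the record vectors and the ontology vector are already built; the cosine threshold and magnitude-ratio range are hypothetical stand-ins for the paper's decision criteria.

```python
import numpy as np

def matches_ontology(record_vectors, ontology_vector,
                     cos_threshold=0.8, mag_ratio_range=(0.5, 2.0)):
    """Sketch: take the centroid of a page's record vectors (the CM step) and
    compare it with the ontology vector (the VSM step), checking both the
    angle and the relative magnitude."""
    centroid = np.asarray(record_vectors, dtype=float).mean(axis=0)
    ont = np.asarray(ontology_vector, dtype=float)
    cos = centroid @ ont / (np.linalg.norm(centroid) * np.linalg.norm(ont) + 1e-12)
    ratio = np.linalg.norm(centroid) / (np.linalg.norm(ont) + 1e-12)
    return cos >= cos_threshold and mag_ratio_range[0] <= ratio <= mag_ratio_range[1]
```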

19.
In this paper, we describe a document clustering method called novelty-based document clustering. This method clusters documents based on similarity and novelty. The method assigns higher weights to recent documents than to old ones and generates clusters with the focus on recent topics. The similarity function is derived probabilistically, extending the conventional cosine measure of the vector space model by incorporating a document forgetting model to produce novelty-based clusters. The clustering procedure is a variation of the K-means method. An additional feature of our clustering method is an incremental update facility, which is applied when new documents are incorporated into a document repository. Performance of the clustering method is examined through experiments. Experimental results show the efficiency and effectiveness of our method.
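A small Python sketch of a novelty-weighted similarity, where an exponential decay over document age stands in for the paper's probabilistically derived forgetting model; the half-life parameter is hypothetical.

```python
import numpy as np

def novelty_similarity(doc_vec, other_vec, doc_age_days, half_life_days=30.0):
    """Sketch: extend cosine similarity with a document-forgetting factor so
    that older documents contribute less to cluster formation."""
    cos = float(doc_vec @ other_vec /
                (np.linalg.norm(doc_vec) * np.linalg.norm(other_vec) + 1e-12))
    forget = 0.5 ** (doc_age_days / half_life_days)   # exponential forgetting of old documents
    return forget * cos

print(novelty_similarity(np.array([1.0, 2.0, 0.0]), np.array([1.0, 1.0, 1.0]), doc_age_days=60))
```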

20.
A new gait feature extraction method is proposed. First, the Canny operator is used to extract the edge contour of the leading leg, and spurious points that may appear at the bottom are removed. Next, the forward displacement of each pixel on the leg edge between consecutive frames is computed and used as the raw gait feature. The features are then normalized with respect to distance, eliminating the effect of varying distance between the subject and the camera. Finally, principal component analysis reduces the feature space from 60 dimensions to 3-4. In the recognition stage, the normalized Euclidean distance measures the similarity between samples. On a small pure database of 3 subjects with 4 sequences each, a nearest-exemplar classifier achieves a correct classification rate of 83.33%; on a small mixed database of 5 subjects with 4 sequences each, the rate is 55%.
