Similar Documents
19 similar documents found.
1.
曾超  吕钊  顾君忠 《计算机应用》2008,28(12):3248-3250
An email classification method based on a concept vector space model is proposed. When extracting an email's feature vector, the method builds on the WordNet lexical ontology: synset concepts replace individual terms, and hypernym/hyponym relations between synsets are also taken into account, yielding a concept vector space model as the email's feature representation. Concept weights are then adjusted with the TF*IWF*IWF scheme, and the email's category is finally determined by a simple vector-distance classifier. Experimental results show that when the training set is limited in size, the method effectively improves email classification accuracy.
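A minimal sketch of the TF*IWF*IWF weighting step, assuming `corpus_freq` is a word-frequency Counter over a reference corpus and IWF is the log ratio of total corpus tokens to a term's corpus frequency (one common reading of the scheme; the paper's exact formulation may differ):

```python
from collections import Counter
from math import log

def tf_iwf_iwf(doc_tokens, corpus_freq):
    """Weight each term as TF * IWF^2, where
    IWF = log(total corpus tokens / corpus frequency of the term)."""
    total = sum(corpus_freq.values())
    tf = Counter(doc_tokens)
    weights = {}
    for term, freq in tf.items():
        iwf = log(total / corpus_freq.get(term, 1))  # rare terms get large IWF
        weights[term] = freq * iwf * iwf
    return weights
```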

2.
A Logical Paragraph Division Method Based on Semantic Features and Its Application (cited by 1: 0 self, 1 external)
A paragraph-level matching scheme centered on logical concepts is introduced. Built on a concept dictionary, the method analyzes the logical concepts contained in the text to be classified, clusters paragraphs that express the same meaning to obtain a logical hierarchy, and defines a notion of "logical paragraph" on the basis of this hierarchical division; the logical paragraphs are then used to measure how much each paragraph contributes to the representation of the text's topic. To handle polysemy and synonymy during matching, synonym-based concept expansion and related-word expansion are introduced. Experiments show that the method achieves higher content-filtering precision and effectively improves classification performance.

3.
A Concept-Based Text Representation Model (cited by 5: 1 self, 4 external)
Text information processing is moving toward semantics, yet the dominant text representation model, the vector space model (VSM), uses individual words as feature terms. This ignores the semantic relations between words in natural language and leaves texts full of synonyms and polysemous words, severely degrading the precision of text processing. Drawing on techniques and results from natural language processing, this work introduces concepts and concept distance into the vector space model and, starting from semantics, uses concepts as the feature terms of a text to build a concept-based text representation model. Experiments show that this approach handles synonymy and polysemy well and improves both the recall and the precision of text classification.

4.
Since synonym substitution damages the overall statistical characteristics of the original text, a text steganography algorithm that prefers low-distortion substitutions is proposed. Based on how well each synonym fits its context, the algorithm constructs a distortion function that measures how much a substitution changes the text's statistical characteristics. For each synonym set it selects the two best-fitting words to form a substitution pair; it then takes a global view, sorts all candidate substitution pairs in the original text by the distortion they introduce, and embeds information preferentially through the pairs with the smallest distortion, thereby reducing the change to the statistical characteristics of the synonym sequence. Experimental results show that the algorithm resists steganalysis attacks based on synonym pairing and on relative synonym-frequency statistics, giving strong detection resistance and improving the security of the hidden information.
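A sketch of the low-distortion-first embedding idea under stated assumptions: `syn_pairs` maps substitutable positions to a two-word synonym pair (one word per bit value) and `distortion` stands in for the paper's distortion function, treated here as a black box; all names are illustrative:

```python
def embed_low_distortion_first(text_tokens, syn_pairs, bits, distortion):
    """Rank candidate substitution sites by distortion, then embed
    one secret bit per site, lowest-distortion sites first."""
    sites = sorted(syn_pairs, key=lambda pos: distortion(text_tokens, pos))
    stego = list(text_tokens)
    for pos, bit in zip(sites, bits):
        word_a, word_b = syn_pairs[pos]      # the pair encodes bit 0 / bit 1
        stego[pos] = word_b if bit else word_a
    return stego
```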

5.
A Compactness-Based Semi-Supervised Text Classification Method (cited by 2: 0 self, 2 external)
Automatic text classification has become an important research topic. In practice, many training corpora provide only a limited set of positive examples, and the positive and unlabeled documents are usually distributed unevenly. Such classification tasks therefore differ from traditional text classification, and conventional classifiers applied to them directly rarely achieve satisfactory results. This paper proposes a compactness-based method for this class of problems. Because no labeled negative documents are available, the method first extracts a set of reliable negatives and then expands this set using a compactness measure, yielding a training set that contains both positive and negative examples and thus improving classifier performance. The method needs no special external knowledge base for feature extraction, so it adapts well to different classification settings. Experiments on the text classification task of the TREC 2005 (Text REtrieval Conference) Genomics track show that the algorithm performs very well on this semi-supervised classification problem.
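One way the reliable-negative extraction and compactness-based expansion could look, assuming tf-idf document vectors as NumPy arrays; the paper's exact compactness measure is not reproduced:

```python
import numpy as np

def expand_negatives(X_pos, X_unlabeled, k_reliable, margin):
    """Take the unlabeled documents farthest from the positive centroid
    as reliable negatives, then grow the negative set with unlabeled
    documents that lie clearly closer to the negative centroid."""
    pos_centroid = X_pos.mean(axis=0)
    d_pos = np.linalg.norm(X_unlabeled - pos_centroid, axis=1)
    reliable = np.argsort(d_pos)[-k_reliable:]          # farthest from positives
    neg_centroid = X_unlabeled[reliable].mean(axis=0)
    d_neg = np.linalg.norm(X_unlabeled - neg_centroid, axis=1)
    expanded = np.where(d_neg + margin < d_pos)[0]      # confidently negative
    return np.union1d(reliable, expanded)
```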

6.
In natural-language text steganography based on synonym substitution, inaccurate selection of candidate synonyms often leaves the modified sentences with obvious errors or logical ambiguity. To address this, a steganography algorithm based on binary dependency relations is proposed. The algorithm first obtains from WordNet the words that share the target word's part of speech and are semantically similar; it then applies dependency parsing to the target sentence to extract the binary dependency relations associated with each synonym, computes the vector distances of these dependency pairs over a large-scale corpus, and derives the best set of replacement synonyms. Experimental results show that the stego text generated by the algorithm keeps the text's feature statistics unchanged after embedding; compared with current improved synonym-substitution algorithms, it better preserves grammatical correctness and semantic integrity, resists synonym-pairing and relative-frequency steganalysis more effectively, and improves the security of covert communication.

7.
Analysis and Extraction of Sentiment Topic Sentences from Chinese Text (cited by 3: 0 self, 3 external)
樊娜  蔡皖东  赵煜  李慧贤 《计算机应用》2009,29(4):1171-1173
A method for extracting sentiment topic sentences from Chinese text is proposed. First, the generalization power of the semantic concepts in a text is evaluated to determine the text's topic concepts. Sentences containing topic concepts are treated as candidate topic sentences, each candidate's importance is computed, and the text's topic sentences are selected. A conditional random field (CRF) model is then trained with sentiment-orientation features and transition-word features to extract the sentiment topic sentences from the topic-sentence set. Experiments show that sentiment analysis built on this method avoids interference from sentences unrelated to the topic and effectively improves the accuracy of text sentiment analysis.

8.
Formal contexts must be extracted from actual data sources. When the data source is unstructured Chinese text, a representation must be chosen. The mainstream Chinese text representation, the vector space model (VSM) with words as feature terms, ignores the semantic relations between words in natural language and cannot express a text's semantic information. This paper discusses an improved method that uses HowNet as its knowledge base and replaces single feature words with sets of similar words, building a concept vector space for Chinese text. From texts represented in this concept vector space, a formal context matching the user's specific requirements can be extracted conveniently. The method is illustrated on a corpus of 214 Chinese texts on transportation.

9.
In traditional text classification, the text-vector matrix suffers from the curse of dimensionality and extreme sparsity; extracting the keywords most relevant to each category as classification features helps with both problems. Building on this observation, a short-text classification framework based on keyword similarity is proposed. The framework first trains a word2vec word-vector model on a large corpus; it then uses TextRank to obtain the keywords of each category and deduplicates them to form the feature set. For each feature, the similarity between every word in a short text and that feature is computed with the word-vector model, and the maximum similarity is taken as the feature's weight. Finally, k-nearest neighbors (KNN) and support vector machine (SVM) classifiers are trained. In experiments on a Chinese news-headline dataset, classification performance improved by about 6% on average over traditional short-text classification methods, validating the framework's effectiveness.
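A sketch of the feature-weighting step with gensim's word2vec, assuming `keywords` is the deduplicated TextRank keyword set; the function name and usage lines are illustrative, not the paper's code:

```python
from gensim.models import Word2Vec

def keyword_features(tokens, keywords, model):
    """For each keyword feature, use the maximum word2vec similarity
    between the feature and any word in the short text as its weight.
    Out-of-vocabulary words are skipped."""
    feats = []
    for kw in keywords:
        sims = [model.wv.similarity(kw, t) for t in tokens
                if kw in model.wv and t in model.wv]
        feats.append(max(sims) if sims else 0.0)
    return feats

# Illustrative usage; `sentences` and `category_keywords` are assumed inputs.
# model = Word2Vec(sentences, vector_size=100, min_count=2)
# x = keyword_features(["央行", "降息"], category_keywords, model)
```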

10.
A Word Semantic Similarity Algorithm Based on a Sense Vector Model (cited by 1: 0 self, 1 external)
李小涛  游树娟  陈维 《自动化学报》2020,46(8):1654-1669
Word-vector-based semantic similarity measures compute poorly in three situations: polysemous words, non-neighborhood words, and synonyms. To address this, a word semantic similarity algorithm based on a sense vector model is proposed. Unlike existing word-vector models, the sense vector model splits each polysemous word into several monosemous words by sense, so that each vector corresponds uniquely to one sense of a word. Using the prior sense classification in the synonym thesaurus Cilin (同义词词林), polysemous words in different contexts of the corpus are first disambiguated; a sense vector model is then trained on the disambiguated text, achieving the precise sense representation that existing word-vector models cannot provide. Finally, the two words being compared are decomposed into senses and expanded with synonyms, and their semantic similarity is computed from the sense vector model together with Cilin. Experimental results show that the algorithm significantly improves similarity accuracy in all three problem cases.
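A sketch of sense-level comparison under a sense vector model, assuming `senses` maps a word to its sense identifiers and `sense_vecs` maps each sense to its trained vector; the Cilin-based synonym expansion is omitted:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def word_similarity(w1, w2, senses, sense_vecs):
    """Decompose each word into its senses and return the maximum
    similarity over all sense pairs, one common way to compare
    words once each sense has its own vector."""
    return max(cos(sense_vecs[a], sense_vecs[b])
               for a in senses[w1] for b in senses[w2])
```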

11.
Document clustering is a text-processing task that groups documents with similar concepts. It is usually considered an unsupervised learning approach, because there is no teacher to guide the training process and topical information is often assumed to be unavailable. A guided approach to document clustering that integrates top-down linguistic knowledge from WordNet into text vector representations, based on the extended significance vector weighting technique, improves both classification accuracy and average quantization error. In our guided self-organization approach we integrate topical and semantic information from WordNet. Because a document training set with preclassified information implies relationships between a word and its preference class, we propose a novel document vector representation approach to extract these relationships for document clustering. Furthermore, by merging statistical methods, competitive neural models, and semantic relationships from symbolic WordNet, our hybrid learning approach is robust and scales up to a real-world task of clustering 100,000 news documents.

12.
Traditional text classification relies on surface features of the text, most commonly through the vector space model, which represents texts as vectors and assigns categories by similarity comparison. To overcome the term-independence assumption of the vector space model, this paper proposes a text classification model based on latent semantic indexing (LSI). Statistical analysis of a large document collection reveals the contextual usage of words, and singular value decomposition effectively reduces the dimensionality of the vector space and removes the influence of synonymy and polysemy, thereby improving the precision of text classification.
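A compact LSI sketch with scikit-learn: tf-idf vectors are projected into a latent semantic space by truncated SVD, then classified by cosine nearest neighbors in that space (the paper's exact classifier may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier

def lsi_classifier(docs, labels, n_topics=100):
    """Fit an LSI-based classifier; `docs` is a list of raw texts
    and `labels` their categories (assumed inputs)."""
    pipe = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=n_topics),     # the SVD step of LSI
        KNeighborsClassifier(metric="cosine"),   # similarity in latent space
    )
    return pipe.fit(docs, labels)
```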

13.
The vector-space retrieval method is improved into an ontology-based semantic information retrieval model. WordNet is used as the reference ontology to compute semantic similarity between concepts; the weights of index terms in the query vector are adjusted according to the similarity between the query's index terms, and the index terms are expanded with synonyms, hypernyms, and hyponyms from the WordNet ontology, on which basis the similarity between query and document is defined. Compared with traditional word-form-based retrieval, this method improves retrieval precision at the semantic level.
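A sketch of the WordNet expansion step using NLTK's WordNet interface; the model's weight-adjustment scheme for expanded terms is not reproduced:

```python
from nltk.corpus import wordnet as wn

def expand_query_terms(terms):
    """Add synonyms plus hypernym and hyponym lemmas for each
    index term, as the retrieval model describes."""
    expanded = set(terms)
    for term in terms:
        for syn in wn.synsets(term):
            expanded.update(l.name() for l in syn.lemmas())    # synonyms
            for rel in syn.hypernyms() + syn.hyponyms():       # broader/narrower
                expanded.update(l.name() for l in rel.lemmas())
    return expanded
```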

14.
As a valuable tool for text understanding, semantic similarity measurement enables discriminative semantics-based applications in natural language processing, information retrieval, computational linguistics and artificial intelligence. Most existing studies have used structured taxonomies such as WordNet to explore lexical semantic relationships; however, improving computational accuracy remains a challenge for them. To address this problem, we propose CSSM-ICSP, a hybrid WordNet-based approach to measuring concept semantic similarity that leverages the information content (IC) of concepts to weight the shortest-path distance between concepts. To improve IC computation, we also develop a novel model of the intrinsic IC of concepts in which a variety of semantic properties embedded in the structure of WordNet are taken into consideration. In addition, we summarize and classify the technical characteristics of previous WordNet-based approaches and evaluate our approach against them on various benchmarks. The experimental results of the proposed approach correlate more strongly with human judgments of similarity in terms of the correlation coefficient, indicating that our IC model and similarity measure are comparable or even superior to others for semantic similarity measurement.
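A sketch of an intrinsic-IC-weighted path similarity in the spirit of CSSM-ICSP, using NLTK's WordNet; the intrinsic-IC formula shown is the common hyponym-count formulation, not necessarily the paper's richer model:

```python
from math import log
from nltk.corpus import wordnet as wn

MAX_WN = 82115  # approximate number of noun synsets in WordNet 3.0

def intrinsic_ic(synset):
    """Concepts with many hyponyms are less informative."""
    hypo = len(list(synset.closure(lambda s: s.hyponyms())))
    return 1.0 - log(hypo + 1) / log(MAX_WN)

def ic_weighted_similarity(s1, s2):
    """Shortest-path distance discounted by the concepts' IC, so that
    paths between specific (informative) concepts weigh more."""
    dist = s1.shortest_path_distance(s2)
    if dist is None:          # no path (e.g., different parts of speech)
        return 0.0
    ic = (intrinsic_ic(s1) + intrinsic_ic(s2)) / 2.0
    return ic / (dist + 1.0)
```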

15.
Knowledge-based vector space model for text clustering (cited by 5: 4 self, 1 external)
This paper presents a new knowledge-based vector space model (VSM) for text clustering. In the new model, semantic relationships between terms (e.g., words or concepts) are included when representing text documents as a set of vectors. The idea is to compute the dissimilarity between two documents more accurately so that text clustering results can be enhanced. Here, the semantic relationship between two terms is defined by their similarity, which is used to re-weight term frequency in the VSM. We consider and study two different similarity measures for computing the semantic relationship between two terms, based on two different approaches. The first is based on existing ontologies such as WordNet and MeSH; we define a new similarity measure that combines the edge-counting technique, average distance and position weighting to compute the similarity of two terms from an ontology hierarchy. The second makes use of text corpora to construct relationships between terms and then calculates their semantic similarities. Three clustering algorithms (bisecting k-means, feature-weighting k-means and a hierarchical clustering algorithm) have been used to cluster real-world text data represented in the new knowledge-based VSM. The experimental results show that clustering performance based on the new model is much better than that based on the traditional term-based VSM.
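The re-weighting idea reduces to a matrix product once a term-term similarity matrix is available; a sketch, assuming `sim` comes from an ontology or corpus as the paper describes:

```python
import numpy as np

def reweight_tf(tf, sim):
    """Augment each term's frequency with the frequencies of related
    terms, scaled by semantic similarity. `tf` is an (n_docs, n_terms)
    matrix and `sim` an (n_terms, n_terms) similarity matrix.
    Note: mutates `sim`'s diagonal in place."""
    np.fill_diagonal(sim, 1.0)   # a term is fully similar to itself
    return tf @ sim              # tf'[d, t] = sum_s tf[d, s] * sim[s, t]
```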

16.
Text categorization is one of the most common themes in the data mining and machine learning fields. Unlike structured data, unstructured text is more difficult to analyze because it contains complicated syntactic and semantic information. In this paper, we propose a two-level representation model (2RM) for text data, with one level for syntactic information and the other for semantic information. At the syntactic level, each document is represented as a term vector whose component values are term frequency and inverse document frequency weights. The Wikipedia concepts related to the terms at the syntactic level are used to represent the document at the semantic level. We also design a multi-layer classification framework (MLCLA) to exploit the semantic and syntactic information represented in the 2RM model. The MLCLA framework contains three classifiers: two are applied to the syntactic and semantic levels in parallel, and their outputs are combined and fed to the third classifier to obtain the final results. Experimental results on benchmark data sets (20Newsgroups, Reuters-21578 and Classic3) show that the proposed 2RM model plus the MLCLA framework improves text classification performance compared with existing flat text representation models (term-based VSM, term semantic kernel model, concept-based VSM, concept semantic kernel model and term + concept VSM) combined with existing classification methods.
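A sketch of the multi-layer idea with scikit-learn-style estimators; the base and meta classifier choices are illustrative, not the paper's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_multilayer(base_syn, base_sem, X_syn, X_sem, y):
    """Train two base classifiers in parallel on the syntactic (term)
    and semantic (concept) representations, then train a third,
    meta-level classifier on their concatenated class probabilities."""
    base_syn.fit(X_syn, y)
    base_sem.fit(X_sem, y)
    z = np.hstack([base_syn.predict_proba(X_syn),
                   base_sem.predict_proba(X_sem)])
    meta = LogisticRegression().fit(z, y)
    return base_syn, base_sem, meta
```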

17.
Semantics-oriented service matching is one of the challenges in automatic Web service discovery. Service users may search for Web services using keywords and receive the matching services in terms of their functional profiles. A number of approaches to computing the semantic similarity between words have been developed to enhance the precision of matchmaking; they can be classified into ontology-based and corpus-based approaches. Ontology-based approaches commonly use the differentiated concept information provided by a large ontology to measure lexical similarity with word sense disambiguation. Nevertheless, most ontologies are domain-specific and limited in lexical coverage, so their applicability is limited. Corpus-based approaches, on the other hand, rely on the distributional statistics of context to represent each word as a vector and measure the distance between word vectors; however, polysemy may lead to low computational accuracy. In this paper, to augment the semantic information content of word vectors, we propose a multiple semantic fusion (MSF) model that generates a sense-specific vector for each word. In this model, various semantic properties of the general-purpose ontology WordNet are integrated to fine-tune the distributed word representations learned from a corpus, using several vector combination strategies. The retrofitted word vectors are modeled as semantic vectors for estimating semantic similarity. The MSF model-based similarity measure is validated against other similarity measures on multiple benchmark datasets. Experimental results of word similarity evaluation indicate that our method obtains a higher correlation coefficient with human judgment in most cases. Moreover, the proposed similarity measure is shown to improve the performance of Web service matchmaking based on a single semantic resource. Accordingly, our findings provide a new method and perspective for understanding and representing lexical semantics.
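One standard way to fine-tune corpus vectors with ontology neighbors is a retrofitting-style update; a sketch in that spirit (the MSF model's specific combination strategies are not reproduced):

```python
import numpy as np

def retrofit(vectors, neighbors, iterations=10, alpha=1.0, beta=1.0):
    """Iteratively pull each word vector toward the average of its
    WordNet-related words while staying anchored to the original
    corpus vector. `vectors` maps word -> np.array; `neighbors`
    maps word -> list of ontology-related words."""
    new = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, nbrs in neighbors.items():
            nbrs = [n for n in nbrs if n in new]
            if word not in vectors or not nbrs:
                continue
            nbr_sum = sum(new[n] for n in nbrs)
            new[word] = ((alpha * vectors[word] + beta * nbr_sum)
                         / (alpha + beta * len(nbrs)))
    return new
```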

18.
To enrich the semantic information of the vector space model, a method for constructing dependency-triple features is proposed and applied to text sentiment classification. Starting from the complete dependency parse tree, the method first merges and deletes redundant nodes according to rules derived from the characteristics of Chinese grammar; the retained dependency tree is then converted into triples, which are generalized and used as feature terms of the vector space model. To verify the effectiveness of this representation, feature spaces are built from unigrams combined in different ways with syntactic features, simple dependency features, and lexicon scores. The dependency-triple feature vectors and the various combined feature vectors are fed to support vector machines and deep belief networks. The results show that the dependency-triple representation achieves higher classification accuracy than the other feature combinations, further indicating that dependency-triple features express the semantic information of a text more fully.
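A sketch of dependency-triple extraction with spaCy's Chinese pipeline (an assumption; the paper's parser and its rule-based merging and deletion of redundant nodes are not reproduced):

```python
import spacy

nlp = spacy.load("zh_core_web_sm")  # Chinese pipeline; illustrative choice

def dependency_triples(sentence):
    """Extract (head, relation, dependent) triples from the dependency
    parse, skipping punctuation and the root's self-loop."""
    doc = nlp(sentence)
    return [(tok.head.text, tok.dep_, tok.text)
            for tok in doc
            if tok.dep_ != "punct" and tok.head is not tok]
```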

19.
Text categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. TC has been an application domain for many learning approaches that have proved effective; nevertheless, it still poses many challenges to machine learning. In this paper, we suggest integrating external WordNet lexical information to supplement the training data of a semi-supervised clustering algorithm for text categorization. The algorithm (i) uses a finite design set of labeled data to (ii) help an agglomerative hierarchical clustering (AHC) algorithm partition a finite set of unlabeled data, and then (iii) terminates without the capacity to classify other objects. This algorithm is the "semi-supervised agglomerative hierarchical clustering algorithm" (ssAHC). Our experiments use the Reuters-21578 database and consist of binary classifications for categories selected from the 89 TOPICS classes of the Reuters collection. Using the vector space model (VSM), each document is represented by its original feature vector augmented with an external feature vector generated using WordNet. We verify experimentally that the integration of WordNet helps ssAHC improve its performance, effectively addresses the classification of documents into categories with few training documents, and does not interfere with the use of training data.
