Similar Documents
A total of 20 similar documents were retrieved.
1.
Text classification techniques mostly rely on single-term analysis of the document set, while many concepts, especially specific ones, are conveyed by sets of terms. To build a more accurate text classifier, more informative features, including frequent co-occurring words in the same sentence and their weights, are particularly important. In this paper, we propose a novel approach to text classification using sentential frequent itemsets, a concept from association rule mining: it views a sentence rather than a document as a transaction, and uses a variable-precision rough set based method to evaluate each sentential frequent itemset's contribution to the classification. Experiments over the Reuters and newsgroup corpora validate the practicability of the proposed system.
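As a hedged illustration of the transaction view described above, the sketch below mines frequent itemsets with each sentence treated as a transaction of terms. The brute-force enumeration (rather than a full Apriori pass), the support threshold, and the toy corpus are illustrative assumptions, not details from the paper.

```python
from itertools import combinations
from collections import Counter

def sentential_frequent_itemsets(sentences, min_support=2, max_size=2):
    """Return itemsets (frozensets of terms) occurring in >= min_support sentences."""
    transactions = [set(s.lower().split()) for s in sentences]
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            for itemset in combinations(sorted(t), k):
                counts[frozenset(itemset)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

sentences = [
    "rough set theory supports classification",
    "rough set theory handles uncertainty",
    "frequent itemsets improve classification",
]
# Each surviving itemset would become one feature for the downstream classifier.
for itemset, support in sorted(sentential_frequent_itemsets(sentences).items(),
                               key=lambda x: -x[1]):
    print(set(itemset), support)
```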

2.
With the rapid growth of information and knowledge, automatic text classification is becoming a hotspot of knowledge management. A critical capability of knowledge management systems is to classify text documents into categories that are meaningful to users. In this paper, a text topic classification model based on a domain ontology and the Vector Space Model is proposed. Eigenvectors, the input to the vector space model, are constructed from the concepts and hierarchical structure of the ontology, which also provides the domain knowledge. However, a limited-vocabulary problem is encountered when mapping keywords to their corresponding ontology concepts. A synonymy lexicon is used to extend the ontology and compress the eigenvector, solving the problem that eigenvectors are too large and complex to be calculated by traditional methods. Finally, combining the concepts' support, a top-down method following the ontology structure completes the topic classification. An experimental system is implemented and the model is applied to this practical system. Test results show that the model is feasible.
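A minimal sketch of the mapping step described above, assuming a tiny invented synonymy lexicon and ontology: keywords are normalized through the lexicon and then projected onto ontology concepts, so the feature vector is indexed by concepts rather than raw words. A real system would load a domain ontology instead.

```python
from collections import Counter

# Hypothetical lexicon and concept map, for illustration only.
SYNONYMS = {"auto": "car", "automobile": "car", "ml": "machine_learning"}
CONCEPT_OF = {"car": "Vehicle", "truck": "Vehicle", "machine_learning": "AI"}

def concept_vector(tokens):
    """Compress a token list into a concept-indexed eigenvector (as counts)."""
    canonical = (SYNONYMS.get(t, t) for t in tokens)       # synonym normalization
    concepts = [CONCEPT_OF[w] for w in canonical if w in CONCEPT_OF]
    return Counter(concepts)

print(concept_vector("the auto and the truck use ml".split()))
# Counter({'Vehicle': 2, 'AI': 1}) -- two words collapse onto one concept
```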

3.
4.
In formal concept analysis, the concept lattice, as the fundamental data structure, can be constructed from a formal context. However, it is required that the relation between object and feature in the formal context be certain. For uncertain relations, this paper applies the ideas of upper and lower approximation from rough set theory, and gives the corresponding definitions of missing-value context and rough formal concept. Based on these, the paper employs the rough concept lattice, formed by rough formal concepts and the partial order relation on them, as the basic data structure for concept analysis and knowledge acquisition. A theorem is then presented describing the method of extracting rules from the constructed rough formal concept lattice, and the semantic interpretation of the discovered rules is explained.

5.
Object-oriented software development is a promising software methodology that leads to a wholly new way of solving problems. In research on the rapid construction of a Structured Development Environment (SDE) that supports detailed design and coding in software development, a generator that can generate the SDE has been applied as a metatool. The kernel of the SDE is a syntax-directed editor based on object-oriented concepts. The key issue in the design of the SDE is how to represent the elements of the target language with the class concept and a program internally. In this paper, the key concepts and design of the SDE and its generator, as well as the implementation of a prototype, are discussed.

6.
This paper proposes a new text similarity computing method based on concept similarity for Chinese text processing. The method first converts text to a word vector space model, then splits words into sets of concepts. By computing the inner products between concepts, it obtains the similarity between words, and finally computes the similarity of texts from the similarity of words. The contributions of the paper are: 1) a new similarity formula between words; 2) a new text similarity computing method based on word similarity; 3) a successful application of the method to similarity computation for Web news; and 4) validation of the method through extensive experiments.
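A hedged sketch of the two-level idea (not the paper's exact formulas): word similarity as the cosine of concept vectors computed via inner products, and text similarity as the average best match between the two word sets. The toy concept vectors are assumptions for illustration.

```python
import numpy as np

# Hypothetical concept decompositions of words, for illustration.
WORD_CONCEPTS = {
    "car":   np.array([1.0, 0.2, 0.0]),
    "auto":  np.array([0.9, 0.3, 0.0]),
    "apple": np.array([0.0, 0.1, 1.0]),
}

def word_sim(a, b):
    """Cosine similarity of the two words' concept vectors."""
    u, v = WORD_CONCEPTS[a], WORD_CONCEPTS[b]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def text_sim(words1, words2):
    """Average, over words in text 1, of the best similarity to any word in text 2."""
    return sum(max(word_sim(w1, w2) for w2 in words2) for w1 in words1) / len(words1)

print(text_sim(["car", "apple"], ["auto", "apple"]))
```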

7.
Ambiguous words are words that have multiple meanings, such as apple and window. In text classification they are usually removed by feature reduction methods such as Information Gain. Sometimes there are too many ambiguous words in the corpus, which makes throwing away all of them not a viable option, as when classifying documents from the Web. In this paper we look for a method to classify titled documents with the help of ambiguous words. Titled documents are a kind of documents that have a simple s...

8.
This paper presents a novel classification approach based on rough set theory and the support vector machine. Classification samples often have many attributes, which makes classification difficult. In this approach, the attributes of the data set are first reduced by rough set theory, and classification is then carried out using a support vector machine; the classification results are obtained through the proposed model. Compared with the traditional Naive Bayes algorithm, the proposed model achieves higher prediction accuracy and reduces the cost of calculation.
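A hedged sketch of that pipeline: a greedy rough-set-style reduct on discrete attributes (an attribute is dropped while the remaining attributes' partition still determines the decision class, the "positive region" idea), followed by an SVM on the reduced data. The toy data and scikit-learn usage are illustrative, not the paper's exact reduction algorithm.

```python
from sklearn.svm import SVC

def consistent(rows, labels, attrs):
    """True if rows that agree on attrs always share a label (dependency degree 1)."""
    seen = {}
    for row, y in zip(rows, labels):
        key = tuple(row[a] for a in attrs)
        if seen.setdefault(key, y) != y:
            return False
    return True

def greedy_reduct(rows, labels):
    attrs = list(range(len(rows[0])))
    for a in list(attrs):
        trial = [x for x in attrs if x != a]
        if trial and consistent(rows, labels, trial):
            attrs = trial  # attribute a is dispensable
    return attrs

rows   = [[0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]]
labels = [0, 0, 1, 1]
keep = greedy_reduct(rows, labels)
X = [[r[a] for a in keep] for r in rows]
clf = SVC(kernel="linear").fit(X, labels)
print("reduct:", keep, "training accuracy:", clf.score(X, labels))
```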

9.
Object-oriented software development is a promising software methodology that leads to a wholly new way of solving problems. In research on the rapid construction of a Structured Development Environment (SDE) that supports detailed design and coding in software development, a generator that can generate the SDE has been applied as a metatool. The kernel of the SDE is a syntax-directed editor based on object-oriented concepts. The key issue in the design of the SDE is how to represent the elements of the target language with the class concept and a program internally. In this paper, the key concepts and design of the SDE and its generator, as well as the implementation of a prototype, are discussed.

10.
Similarity Measures between Rough Sets
Applying rough set theory to incomplete information systems is key to putting rough sets into real applications. In this paper, after analyzing basic concepts of classical and extended rough set theory, similarity measures are developed between two rough sets in classical rough set theory, based on the indiscernibility relation, and between two rough sets in extended rough set theory, based on the limited tolerance relation. Properties of the two similarity measures are then derived, and finally the two measures are compared.
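As a hedged illustration of one common style of similarity measure between two rough sets: average the Jaccard overlap of their lower approximations and of their upper approximations. This is representative of the genre, not the paper's exact formula.

```python
def jaccard(a, b):
    """Jaccard overlap of two plain sets; empty-vs-empty counts as identical."""
    return len(a & b) / len(a | b) if a | b else 1.0

def rough_set_similarity(lower1, upper1, lower2, upper2):
    """Average Jaccard similarity of lower and upper approximations."""
    return 0.5 * (jaccard(lower1, lower2) + jaccard(upper1, upper2))

A_low, A_up = {1, 2}, {1, 2, 3, 4}
B_low, B_up = {2, 3}, {2, 3, 4}
print(rough_set_similarity(A_low, A_up, B_low, B_up))  # 0.5 * (1/3 + 3/4)
```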

11.
A fuzzy classification method for design documents in the manufacturing industry is proposed. Using the hierarchical structure of a domain ontology and the semantic relations between concepts, design documents are structurally segmented and annotated, and feature weights are computed from the distance between feature words and concepts and from positional importance, improving the accuracy of design document classification.

12.
This paper analyzes the shortcomings of the traditional feature selection methods chi-square statistics and information gain in text classification, and concludes that the key to feature selection is choosing feature words that are concentrated in one class of documents, evenly distributed within that class, and frequently occurring. Accordingly, taking into account document frequency, term frequency, between-class concentration, and within-class dispersion of feature words, a feature selection evaluation function based on within-class and between-class document frequency and term frequency statistics is proposed. The function is used to select a fixed proportion of feature words from each class of the training set to form that class's feature lexicon, and the training set's feature lexicon is the union of the per-class lexicons. Experiments on SVM-based Chinese text classification show that, compared with traditional chi-square statistics and information gain, the method improves classification performance to a certain extent.
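A hedged sketch of the evaluation function's ingredients: for each term and class, it combines term frequency, document frequency, between-class concentration, and within-class coverage. The multiplicative combination below is an illustrative stand-in, not the paper's published formula.

```python
from collections import defaultdict

def score_terms(docs, labels):
    """docs: list of token lists; labels: class label per doc. Returns per-class term scores."""
    classes = set(labels)
    scores = defaultdict(dict)
    vocab = {t for d in docs for t in d}
    n_c = {c: labels.count(c) for c in classes}
    for t in vocab:
        df = {c: 0 for c in classes}   # docs of class c containing t
        tf = {c: 0 for c in classes}   # occurrences of t in class c
        for d, y in zip(docs, labels):
            cnt = d.count(t)
            tf[y] += cnt
            df[y] += 1 if cnt else 0
        total_df = sum(df.values())
        for c in classes:
            concentration = df[c] / total_df   # between-class concentration
            coverage = df[c] / n_c[c]          # proxy for within-class dispersion
            scores[c][t] = tf[c] * concentration * coverage
    return scores

docs = [["rough", "set"], ["rough", "theory"], ["svm", "kernel"], ["svm", "margin"]]
labels = ["A", "A", "B", "B"]
print(score_terms(docs, labels)["A"])  # top-scoring terms form class A's lexicon
```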

13.
Research on Web Mining Based on Kernel Methods
Word-space classification methods have difficulty handling the high dimensionality of text and capturing semantic concepts. Using kernel principal component analysis and support vector machines, this paper proposes a new method that extracts semantic concepts by reducing the dimensionality of text data and classifies text based on those concepts. Documents are first mapped into a high-dimensional linear feature space to eliminate nonlinear features; principal component analysis in the mapped space then removes correlations between variables, achieving dimensionality reduction and semantic concept extraction and yielding a semantic concept space for the documents; finally, a support vector machine performs classification in that space. With a newly defined kernel function, the mapping to the semantic concept space need not be computed explicitly, so concept-based classification can be carried out directly in the original document vector space. A kernelized GHA method adaptively and iteratively solves for the eigenvectors and eigenvalues of the kernel matrix, making the approach suitable for large-scale text classification. Experimental results show that the method is effective in improving text classification performance.
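A hedged scikit-learn sketch of this pipeline: TF-IDF document vectors are mapped through kernel PCA into a low-dimensional "semantic concept" space, then classified with an SVM. Here scikit-learn's eigendecomposition-based KernelPCA stands in for the paper's iterative kernelized GHA, and the corpus, kernel, and dimensionality are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import KernelPCA
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

docs = ["rough set theory", "rough set classification",
        "kernel methods for svm", "svm kernel classification"]
labels = [0, 0, 1, 1]

model = make_pipeline(
    TfidfVectorizer(),                       # word-space document vectors
    KernelPCA(n_components=2, kernel="rbf"), # project to a 2-D concept space
    SVC(kernel="linear"),                    # classify in the concept space
)
model.fit(docs, labels)
print(model.predict(["rough set methods"]))
```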

14.
A Classification Method for Massive Short Texts Based on Frequent Word-Set Clustering
王永恒  贾焰  杨树强 《计算机工程与设计》2007,28(8):1744-1746,1780
The rapid development of information technology has produced a massive accumulation of text data, a large part of which is short text. Text classification is important for automatically acquiring knowledge from these massive short texts, but for short texts in which keywords appear only a few times, existing general text mining algorithms struggle to reach acceptable accuracy. Some semantics-based classification methods achieve better accuracy but are too inefficient for massive data. To address this problem, a novel short-text classification algorithm based on frequent word-set clustering is proposed: it uses frequent word-set clustering to compress the data and uses semantic information for classification. Experiments show that when classifying massive short texts, the algorithm's accuracy and performance exceed those of other algorithms.
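A hedged sketch of the compression idea: short texts sharing the same frequent words collapse into one cluster, which can then be labeled as a unit. The grouping key below (the frequent words a text contains) and the support threshold are simplifications of the paper's clustering step.

```python
from collections import Counter, defaultdict

texts = ["rough set theory", "rough set model", "svm kernel trick",
         "svm kernel method", "rough set theory model"]

# Words appearing in at least two texts count as "frequent" (illustrative threshold).
word_counts = Counter(w for t in texts for w in set(t.split()))
frequent = {w for w, c in word_counts.items() if c >= 2}

clusters = defaultdict(list)
for t in texts:
    key = frozenset(w for w in t.split() if w in frequent)
    clusters[key].append(t)

for key, members in clusters.items():
    print(sorted(key), "->", members)   # each cluster is classified as one unit
```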

15.
Maps such as concept maps and knowledge maps are often used as learning materials. These maps consist of nodes and links, with nodes as key concepts and links as relationships between them; from a map, the user can recognize the important concepts and the relationships among them. Building concept or knowledge maps requires domain experts, who are hard to obtain, so the cost of map creation is high. In this study, an attempt was made to automatically build a domain knowledge map for e-learning using text mining techniques. From a set of documents about a specific topic, keywords are extracted using the TF/IDF algorithm. A domain knowledge map (K-map) is based on ranking pairs of keywords according to the number of appearances in a sentence and the number of words in a sentence. The experiments analyzed the number of relations required to identify the important ideas in the text, compared K-map learning to document learning, and found that the K-map identifies the more important ideas.
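A hedged sketch of the K-map construction: keywords are extracted by TF/IDF, and keyword pairs are ranked by sentence co-occurrence, weighted here by inverse sentence length, which is one plausible reading of "the number of words in a sentence". The corpus and the keyword threshold are illustrative.

```python
from itertools import combinations
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["rough sets extend classical sets",
             "rough sets support attribute reduction",
             "attribute reduction helps classification"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(sentences)
scores = tfidf.sum(axis=0).A1                      # corpus-level TF-IDF mass per term
keywords = {w for w, s in zip(vec.get_feature_names_out(), scores) if s > 0.5}

edge_weight = Counter()
for s in sentences:
    words = [w for w in s.split() if w in keywords]
    for a, b in combinations(sorted(set(words)), 2):
        edge_weight[(a, b)] += 1.0 / len(s.split())  # shorter sentences weigh more

for edge, w in edge_weight.most_common(5):           # top-ranked K-map links
    print(edge, round(w, 2))
```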

16.

This article presents a system that carries out highly effective searches over collections of textual information, such as those found on the Internet. The system is made up of two major parts. The first part is an agent, Musag, that learns to relate concepts that are semantically "similar" to one another; in other words, this agent dynamically builds a dictionary of expressions for a given concept, capturing the words people have in mind when mentioning that concept. We aim to achieve this by learning from the context in which these words appear. The second part is another agent, Sag, which is responsible for retrieving documents given a set of keywords with relative weights. This retrieval makes use of the dictionary learned by Musag, in the sense that the documents retrieved for a query are related to the given concept according to the context of previously scanned documents. In this way, we overcome two main problems with current text search engines, which are largely based on syntactic methods. One problem is that the keyword given in the query may have an ambiguous meaning, leading to the retrieval of documents unrelated to the requested topic. The second concerns relevant documents that are not recommended to the user because they do not include the specific keyword mentioned in the query. Using context-learning methods, such documents can be retrieved if they include other words, learned by Musag, that are related to the main concept. We describe the agents' system architecture, along with the nature of their interactions. We describe our learning and search algorithms and present results from experiments performed on specific concepts. We also discuss the notion of "cost of learning" and how it influences the learning process and the quality of the dictionary at any given time.

17.
A text classification model based on fuzzy formal concept analysis is proposed. Texts are conceptualized into a more abstract concept form, concepts rather than raw texts serve as training samples, and a nearest-neighbor classification algorithm then makes the final classification decision. Experimental results show that the algorithm performs well.

18.
Similarity-based text classification is a popular text processing method. The feature-membership similarity measure for text classification measures document similarity through the membership relations between features and documents. Based on these relations, features are partitioned into fully-member, partially-member, and non-member word sets, and membership functions are defined for the three sets: fully-member words belong to both documents, with membership decreasing as the weight difference grows; partially-member words belong to only one of the two documents and take a constant membership value; non-member words belong to neither document and have zero membership. When measuring similarity, the partial membership relation is weighted above the full membership relation. Since documents of the same class have similar word sets while those of different classes differ markedly, measuring similarity via feature-document membership clearly delineates the membership of word sets to classes and improves classification precision. Experiments on the 20-Newsgroups and Reuters-21578 data sets show that the feature-membership similarity measure outperforms currently popular similarity measures.
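A hedged sketch of the membership partition only: the vocabulary of two documents is split into fully-member words (in both), partially-member words (in exactly one), and non-member words (implicitly everything else), with a weight-difference decay for full members and a constant for partial members, following the description above. The constants and the way the two contributions are combined are illustrative assumptions, not the paper's formula.

```python
from collections import Counter

def similarity(doc1, doc2, partial_value=0.3):
    w1, w2 = Counter(doc1), Counter(doc2)
    full = set(w1) & set(w2)        # fully-member words: in both documents
    partial = set(w1) ^ set(w2)     # partially-member words: in exactly one
    # Full membership decays as the weight (term-frequency) difference grows.
    full_score = sum(1.0 / (1.0 + abs(w1[t] - w2[t])) for t in full)
    # Partial membership is a constant per word; the paper ranks this relation
    # above full membership, but the combination below is only illustrative.
    partial_score = partial_value * len(partial)
    total = len(set(w1) | set(w2))
    return (full_score + partial_score) / total if total else 1.0

print(similarity("rough set rough theory".split(), "rough set model".split()))
```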

19.
Automatic Classification of English Texts Based on Concept Hierarchy
This paper aims to design and implement an automatic categorization and retrieval system for English texts, focusing on improving classification accuracy. In automatic text classification systems, text content is generally stored as an N-dimensional feature space, so the feature extraction method and its accuracy greatly affect the correctness of classification results. Traditional methods are based on word forms and do not examine word meaning, ignoring the diversity and uncertainty of word forms under the same meaning and the relations between word senses, especially hypernym-hyponym relations. The method proposed here builds on the vector space model (VSM), takes "concepts" as the basic unit, and considers hypernym relations between word senses, so that more general information can be distilled from words during training, thereby improving classification precision.
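A hedged sketch of concept-level features with hypernyms, using WordNet as one possible concept hierarchy: each word contributes its synset plus its direct hypernym, so semantically related words share more general features. The first-sense choice is naive and for illustration; this requires NLTK with the WordNet corpus installed.

```python
from collections import Counter
from nltk.corpus import wordnet as wn

def concept_features(tokens):
    """Map tokens to synset + direct-hypernym features (counts)."""
    feats = Counter()
    for t in tokens:
        synsets = wn.synsets(t)
        if not synsets:
            continue
        s = synsets[0]                  # naive first-sense choice, for illustration
        feats[s.name()] += 1
        for h in s.hypernyms():         # add the superordinate concept
            feats[h.name()] += 1
    return feats

print(concept_features(["dog", "cat"]))
# direct hypernyms here are distinct ('canine.n.02' vs 'feline.n.01'), but walking
# further up the hierarchy would reach a shared ancestor such as 'carnivore.n.01'
```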

20.
In information processing, most existing text classification algorithms are based on the vector space model, which cannot effectively express a document's structural information and therefore cannot fully express its semantic information. To express semantic information more effectively, this paper first proposes a new document representation model, a graph model, in which a weighted labeled graph expresses a document's feature terms and their positional associations. On this basis, a new document similarity measure is proposed and applied to Chinese text classification. Experimental results show that this graph-based document representation is effective and feasible.
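A minimal sketch of such a weighted labeled graph: nodes are terms, and an edge links two terms that appear within a small positional window, weighted by proximity and frequency. The window size and weighting scheme are illustrative assumptions, not the paper's definitions.

```python
from collections import defaultdict

def document_graph(tokens, window=2):
    """Build {(term_a, term_b): weight} edges from positional co-occurrence."""
    edges = defaultdict(float)
    for i, a in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            b = tokens[j]
            if a != b:
                edges[tuple(sorted((a, b)))] += 1.0 / (j - i)  # closer pairs weigh more
    return dict(edges)

g = document_graph("rough set theory supports rough set classification".split())
for edge, w in sorted(g.items(), key=lambda x: -x[1]):
    print(edge, round(w, 2))
```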
