     

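Short Documents Classification Method in Very Large Text Database
Cite this article: Wang Yongheng, Jia Yan, Yang Shuqiang. Short Documents Classification Method in Very Large Text Database [J]. Computer Engineering and Applications, 2006, 42(22): 5-7.
Authors: Wang Yongheng  Jia Yan  Yang Shuqiang
Affiliation: Institute of Network, Computer School, National University of Defense Technology, Changsha 410073
Funding: Research, Development and Application of High Specific-Capacitance Electronic Aluminum Foil project
Abstract: The rapid development of information technology has led to the accumulation of large volumes of text data, a large share of which consists of short documents. Text classification is important for automatically extracting knowledge from these massive collections of short documents. However, because keywords occur only a few times in a short document and labeled training samples are usually scarce, existing general-purpose text mining algorithms struggle to reach acceptable accuracy. Some semantics-based classification methods achieve better accuracy but are too inefficient to apply to massive data. This paper proposes a novel short-document classification algorithm. The algorithm is based on a text semantic feature graph and classifies documents with a kNN-like method. Experiments show that, when classifying massive collections of short documents, its accuracy and performance exceed those of other algorithms.

Keywords: text mining  classification  short document  very large text database
Article ID: 1002-8331-(2006)22-0005-03
Received: 2006-05
Revised: May 1, 2006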

Short Documents Classification Method in Very Large Text Database
Wang Yongheng, Jia Yan, Yang Shuqiang. Short Documents Classification Method in Very Large Text Database [J]. Computer Engineering and Applications, 2006, 42(22): 5-7.
Authors: Wang Yongheng  Jia Yan  Yang Shuqiang
Affiliation: Institute of Network, Computer School, National University of Defense Technology, Changsha 410073
Abstract: With the rapid development of information technology, huge amounts of data are accumulated, and a vast share of this data takes the form of short documents. Classifying such short documents is very useful for extracting knowledge from the data automatically. However, most current classification algorithms cannot reach acceptable accuracy, because keywords appear only a few times in short documents and labeled training examples are usually scarce. Some classification algorithms based on semantic information are more accurate, but they are too inefficient to process very large document sets. In this paper, we propose a novel classification method based on a semantic text feature graph and a kNN-like method. Our experimental study shows that our algorithm is more accurate and more efficient than other classification algorithms when classifying large-scale collections of short documents.
Keywords: text mining  classification  short document  very large text database
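
Illustration: the abstract names a kNN-like classification step but gives no details of the semantic text feature graph. The sketch below is therefore not the authors' algorithm. It is a plain k-nearest-neighbour classifier for short documents that assumes Jaccard overlap of token sets as a stand-in similarity measure. The names `jaccard`, `knn_classify` and `TRAINING`, and the toy data, are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two token sets.

    Assumption: this stands in for the semantic similarity that the
    paper derives from its feature graph, which the abstract does not describe.
    """
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def knn_classify(query, training, k=3):
    """Label `query` by majority vote among its k most similar training documents."""
    q = set(query.lower().split())
    scored = [
        (jaccard(q, set(text.lower().split())), label)
        for text, label in training
    ]
    # Highest similarity first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy labeled short documents (hypothetical data, for illustration only).
TRAINING = [
    ("stock prices fall sharply", "finance"),
    ("bank raises interest rates", "finance"),
    ("market rally lifts stock index", "finance"),
    ("team wins the final match", "sports"),
    ("striker scores in cup match", "sports"),
    ("coach praises the team", "sports"),
]

if __name__ == "__main__":
    print(knn_classify("stock market falls", TRAINING))
    print(knn_classify("team wins match", TRAINING))
```

With so few tokens per document, pure token overlap is often zero even between related texts ("falls" and "fall" do not match here). That sparsity is the problem the abstract attributes to short documents and the reason it turns to semantic features.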
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.