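Research and Implementation of a Real-Time Text Categorization System Based on a Suffix Tree Model
Cite this article: GUO Li, ZHANG Ji, TAN Jian-long. Research and Implementation of a Real-Time Text Categorization System Based on a Suffix Tree Model [J]. Journal of Chinese Information Processing (中文信息学报), 2005, 19(5): 18-25.
Authors: GUO Li, ZHANG Ji, TAN Jian-long
Affiliation: Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080
Abstract: Targeting network content analysis, this paper proposes a suffix-tree-based text vector space model (VSM) and implements a text categorization system on top of it. Compared with a word-based VSM, the model uses the fast matching of a suffix tree to obtain the vector representation of a text in real time, without complex computation such as word segmentation or feature extraction. The model also ensures that changes to the texts in the training set affect classification results in real time. Experimental results and algorithm analysis show that the time complexity of text preprocessing in our system is O(N), far better than the preprocessing time complexity of a word segmentation system. Moreover, because neither word segmentation nor feature extraction is required, the classification process does not depend on any specific language, so the method is language independent.

Keywords: computer applications; Chinese information processing; real-time text categorization; vector space model; suffix tree
Article ID: 1003-0077(2005)05-0016-08
Received: 2004-07-21
Revised: 2005-01-19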

Research and Implementation of On-line Text Categorization System Based on Suffix Tree
GUO Li, ZHANG Ji, TAN Jian-long. Research and Implementation of On-line Text Categorization System Based on Suffix Tree [J]. Journal of Chinese Information Processing, 2005, 19(5): 18-25.
Authors: GUO Li, ZHANG Ji, TAN Jian-long
Affiliation: Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
Abstract: We propose a text vector space model (VSM) based on a suffix tree and implement a text categorization system on top of the model. With the fast matching that a suffix tree supports, the model obtains the vector representation of a text without complex computation such as word segmentation or feature extraction. In addition, the model guarantees that alterations to the training set affect classification results in real time. Experiments and analysis of the algorithm show that the time complexity of text preprocessing in our system is O(N), which is much better than that of word segmentation methods. Moreover, because word segmentation and feature extraction are avoided, the categorization process is independent of any specific language, making this a language-independent method.
Keywords: computer application; Chinese information processing; online text categorization; vector space model; suffix tree
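
The idea summarized in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no construction or weighting details, so everything below is an assumption made for illustration. In particular, the sketch swaps the suffix tree for a simpler depth-bounded character suffix trie, and the class name `SuffixTrieClassifier`, the `max_depth` parameter, and the normalized-count scoring are all hypothetical. It shows three properties the abstract claims: input is scanned character by character with no word segmentation, the scan costs O(N · max_depth) (linear in the text length N for a fixed depth), and adding a training text changes later results immediately.

```python
from collections import defaultdict


def _new_node():
    # Each trie node keeps its children and how often each class passed through it.
    return {"children": {}, "counts": defaultdict(int)}


class SuffixTrieClassifier:
    """Hypothetical depth-bounded character suffix trie with per-class counts."""

    def __init__(self, max_depth=4):
        self.max_depth = max_depth
        self.root = _new_node()
        self.class_totals = defaultdict(int)  # total node visits per class

    def add(self, text, label):
        """Insert every suffix of `text` (truncated to max_depth) under `label`."""
        for start in range(len(text)):
            node = self.root
            for ch in text[start:start + self.max_depth]:
                node = node["children"].setdefault(ch, _new_node())
                node["counts"][label] += 1
                self.class_totals[label] += 1

    def vector(self, text):
        """Return per-class scores by matching substrings of `text` in the trie."""
        scores = defaultdict(float)
        for start in range(len(text)):
            node = self.root
            for ch in text[start:start + self.max_depth]:
                node = node["children"].get(ch)
                if node is None:
                    break  # no training text contains this substring
                for label, count in node["counts"].items():
                    # Assumed weighting: count normalized by class size.
                    scores[label] += count / self.class_totals[label]
        return dict(scores)

    def classify(self, text):
        """Return the best-scoring class, or None if nothing matched."""
        scores = self.vector(text)
        return max(scores, key=scores.get) if scores else None


clf = SuffixTrieClassifier(max_depth=3)
clf.add("后缀树快速匹配", "cs")
clf.add("股票市场价格", "finance")
print(clf.classify("后缀树"))    # matches only substrings seen under "cs"
print(clf.classify("市场价格"))  # matches only substrings seen under "finance"
```

Because the trie works on raw characters, the same code handles Chinese and English input alike, and a call to `add` is visible to the very next `classify` call without any retraining step.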
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.