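Using Maximum Entropy Model for Chinese Text Categorization (使用最大熵模型进行中文文本分类)
Cite this article: Li Ronglu, Wang Jianhui, Chen Xiaoyun, Tao Xiaopeng, Hu Yunfa. Using Maximum Entropy Model for Chinese Text Categorization [J]. Journal of Computer Research and Development, 2005, 42(1): 94-101.
Authors: Li Ronglu, Wang Jianhui, Chen Xiaoyun, Tao Xiaopeng, Hu Yunfa
Affiliation: Department of Computer and Information Technology, Fudan University, Shanghai 200433 (all five authors)
Funding: National Natural Science Foundation of China (60173027)
Abstract: With the rapid growth of the WWW, text classification has become a key technology for processing and organizing large volumes of document data. Because the maximum entropy model can combine many kinds of observed probabilistic knowledge, whether correlated or not, it achieves good results on many problems. However, very little research has applied the maximum entropy model to text classification, and no study using it for Chinese text classification has yet been reported. This paper applies the maximum entropy model to Chinese text classification. Experiments compare and analyze the performance of the maximum-entropy classifier under different methods of generating Chinese text features, different numbers of features, and with smoothing applied. The classifier is also compared with three typical text classifiers, Bayes, KNN, and SVM. The results show that it outperforms the Bayes method and is comparable to KNN and SVM, indicating that it is a very promising text classification method.

Keywords: text classification; maximum entropy model; features; N-Gram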

Using Maximum Entropy Model for Chinese Text Categorization
Li Ronglu, Wang Jianhui, Chen Xiaoyun, Tao Xiaopeng, Hu Yunfa. Using Maximum Entropy Model for Chinese Text Categorization [J]. Journal of Computer Research and Development, 2005, 42(1): 94-101.
Authors: Li Ronglu, Wang Jianhui, Chen Xiaoyun, Tao Xiaopeng, Hu Yunfa
Abstract: With the rapid development of the World Wide Web, text classification has become a key technology for organizing and processing large amounts of document data. The maximum entropy model is a probability estimation technique widely used for a variety of natural language tasks. It offers a clean and flexible framework for combining diverse pieces of contextual information to estimate the probability of a given linguistic phenomenon. On many NLP tasks this approach performs at or near the state of the art, or outperforms competing probabilistic methods when trained and tested under similar conditions. However, relatively little work has applied the maximum entropy model to text categorization, and no previous work has focused on using it to classify Chinese documents. In this paper the maximum entropy model is used for Chinese text categorization. Its categorization performance is compared and analyzed under different approaches to text feature generation, different numbers of features, and with a smoothing technique. Moreover, in experiments it is compared with Bayes, KNN, and SVM classifiers; its performance is higher than that of Bayes and comparable to that of KNN and SVM. It is a promising technique for text categorization.
Keywords:text classification  maximum entropy model  features  N-Gram
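
Illustrative sketch (not taken from the paper): the abstract names the ingredients of the approach, namely a maximum entropy classifier, N-Gram features, and smoothing, but not how they are implemented. The Python below is one minimal way such a classifier could look, under stated assumptions. It uses character bigrams as features, the standard conditional form p(c | d) = exp(sum_i w[c, g_i] * count_i(d)) / Z(d), plain batch gradient ascent for training, and an L2 (Gaussian prior) penalty as a stand-in for whatever smoothing the authors used. The function names, hyperparameters, and toy documents are all hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Count overlapping character n-grams; no word segmentation is needed."""
    chars = [ch for ch in text if not ch.isspace()]
    return Counter("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))

def predict_proba(weights, labels, feats):
    """Maximum entropy form: p(c | d) = exp(sum_i w[c, g_i] * count_i) / Z(d)."""
    scores = {c: sum(weights.get((c, g), 0.0) * v for g, v in feats.items())
              for c in labels}
    top = max(scores.values())  # subtract the max score for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def train(docs, n=2, epochs=200, lr=0.5, l2=0.01):
    """Batch gradient ascent on the L2-penalised conditional log-likelihood.

    Assumption: the L2 penalty stands in for the paper's smoothing technique,
    and gradient ascent stands in for its (unspecified here) training method.
    """
    labels = sorted({c for _, c in docs})
    data = [(char_ngrams(t, n), c) for t, c in docs]
    weights = {}
    for _ in range(epochs):
        grad = Counter()
        for feats, gold in data:
            probs = predict_proba(weights, labels, feats)
            for c in labels:
                # observed feature count minus expected count under the model
                err = (1.0 if c == gold else 0.0) - probs[c]
                for g, v in feats.items():
                    grad[(c, g)] += err * v
        for key, gval in grad.items():
            w = weights.get(key, 0.0)
            weights[key] = w + lr * (gval / len(data) - l2 * w)
    return weights, labels

def classify(weights, labels, text, n=2):
    """Return the label with the highest conditional probability."""
    probs = predict_proba(weights, labels, char_ngrams(text, n))
    return max(probs, key=probs.get)

# Hypothetical toy corpus, for illustration only.
TOY_DOCS = [
    ("股票市场今日上涨", "finance"),
    ("银行利率下调", "finance"),
    ("足球比赛今晚开始", "sports"),
    ("篮球联赛冠军", "sports"),
]
toy_weights, toy_labels = train(TOY_DOCS)
```

Calling `classify(toy_weights, toy_labels, text)` on a new string returns the most probable label. Character N-Grams are used here because they avoid a separate word segmentation step; this is a design choice of the sketch, not a claim about which feature generation method the paper found best.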
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.