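SAT-FOIL+: Sentence-Level Association Based Text Classification
Cite this article: FENG Yu-Cai, LI Qu, HE Yu, FENG Jian-Lin. SAT-FOIL+: Sentence-Level Association Based Text Classification[J]. Computer Science, 2005, 32(3): 207-212
Authors: FENG Yu-Cai, LI Qu, HE Yu, FENG Jian-Lin
Affiliation: School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074
Funding: National Natural Science Foundation of China (No. 60303030), Chongqing Natural Science Foundation (No. 8721)
Abstract: When mining frequent itemsets and association rules, previous word-association-based methods treated an entire document as a single transaction, yet the basic semantic unit of a text is actually the sentence. A group of words that co-occur in the same sentence is more strongly related semantically than a group of words that merely co-occur in the same document. Based on this observation, we treat each sentence of a document as a separate transaction, which led to SAT-FOIL, a classification method based on sentence-level association. In this paper we propose new score models to obtain an improved algorithm, SAT-FOIL+. Extensive experiments on the standard text collection Reuters not only demonstrate the superiority of the new models, but also show that the classification performance of SAT-FOIL+ is comparable to that of several other classification methods and far better than that of previous document-level association-based methods. In addition, the mined classification rules are easy to read and easy to modify.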

Keywords: Text classification; Sentence level; Association rules; Frequent itemsets

SAT-FOIL+: Sentence-Level Association Based Text Classification
FENG Yu-Cai, LI Qu, HE Yu, FENG Jian-Lin. SAT-FOIL+: Sentence-Level Association Based Text Classification[J]. Computer Science, 2005, 32(3): 207-212
Authors: FENG Yu-Cai, LI Qu, HE Yu, FENG Jian-Lin
Affiliation: Department of Computer Science, Huazhong University of Science and Technology, Wuhan 430074
Abstract: While previous association-based methods mainly mined frequently co-occurring words (frequent itemsets) at the document level, the basic semantic unit in a document is actually the sentence. Words within the same sentence are typically more semantically related than words that merely appear in the same document. Our previously proposed SAT-FOIL therefore views a sentence, rather than a document, as a transaction. In this paper we propose new score models to obtain the improved algorithm SAT-FOIL+. Extensive experiments on the popular benchmark text collection Reuters show that SAT-FOIL+ is not only better than our former algorithm SAT-FOIL, but also comparable to well-known alternatives and much better than previous document-level association-based methods. In addition, the classification rules it acquires are inherently readable and easy to refine.
Keywords: Text classification; Sentence-level; Association rules; Frequent itemsets
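
The difference between document-level and sentence-level transactions described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`sentence_transactions`, `document_transaction`, `pair_support`), the naive punctuation-based sentence splitter, and the example text are all assumptions made for illustration, and neither the FOIL-style rule learning nor the SAT-FOIL+ score models are shown. The sketch only counts in how many transactions each word pair co-occurs.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_transactions(document):
    """Treat each sentence as one transaction (a set of lowercase words).

    Assumption: sentences are split naively on '.', '!' and '?'.
    """
    transactions = []
    for sentence in re.split(r"[.!?]+", document):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if words:  # skip empty fragments left over by the split
            transactions.append(words)
    return transactions


def document_transaction(document):
    """Treat the whole document as a single transaction."""
    return set(re.findall(r"[a-z]+", document.lower()))


def pair_support(transactions):
    """Count, for every word pair, the number of transactions containing both words."""
    counts = Counter()
    for transaction in transactions:
        for pair in combinations(sorted(transaction), 2):
            counts[pair] += 1
    return counts


# Hypothetical two-sentence document used only for illustration.
doc = "Oil prices rose sharply. The central bank kept rates unchanged."

sent_pairs = pair_support(sentence_transactions(doc))
doc_pairs = pair_support([document_transaction(doc)])

print(("oil", "prices") in sent_pairs)  # same sentence
print(("bank", "oil") in sent_pairs)    # different sentences
print(("bank", "oil") in doc_pairs)     # same document
```

At the sentence level, "bank" and "oil" never share a transaction, so that pair cannot contribute to a frequent itemset, whereas the document-level view counts every pair of words in the document as co-occurring.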
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.