A Method of Text Classifier for Focused Crawler (一种主题爬虫文本分类器的构建)
Cite this article: JIANG Peng, SONG Jihua. A Method of Text Classifier for Focused Crawler[J]. Journal of Chinese Information Processing (中文信息学报), 2010, 24(6): 92-97.
Authors: JIANG Peng (姜鹏), SONG Jihua (宋继华)
Affiliation: College of Information Science and Technology, Beijing Normal University, Beijing 100875, China
Abstract: This paper uses a feature selection method that combines document frequency (DF) with the CHI statistic to extract features from web pages related to teaching Chinese as a second language (TCSL). On this basis, it builds a two-step topic-relevance classifier that combines the page title with the main text. The classifier is then applied to crawling TCSL-related web pages. Experiments show that both efficiency and recall improve considerably over traditional classifiers. The classifier is now used to supply source data for building a large-scale TCSL corpus.

Keywords: DF; CHI statistic; classifier; focused crawling
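
The abstract does not say how DF and CHI are combined or how the two steps are ordered, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes that DF is used as a pre-filter for rare terms, that the surviving terms are ranked by the standard chi-square statistic, and that the title is checked before the main text. The function names, thresholds, scoring rule, and toy documents are all hypothetical.

```python
from collections import Counter


def select_features(docs, labels, min_df=2, top_k=3):
    """Drop terms below min_df, then rank the rest by the CHI statistic.

    docs is a list of token lists; labels holds 1 (on-topic) or 0 (off-topic).
    Combining DF and CHI this way is an assumption, not the paper's recipe.
    """
    n = len(docs)
    n_pos = sum(labels)
    df = Counter()      # number of documents containing each term
    df_pos = Counter()  # number of on-topic documents containing each term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            if label:
                df_pos[term] += 1

    scores = {}
    for term, count in df.items():
        if count < min_df:
            continue  # DF filter: CHI is unreliable for very rare terms
        a = df_pos[term]   # on-topic documents that contain the term
        b = count - a      # off-topic documents that contain the term
        c = n_pos - a      # on-topic documents that lack the term
        d = n - n_pos - b  # off-topic documents that lack the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        chi = n * (a * d - c * b) ** 2 / denom if denom else 0.0
        if a * d > c * b:  # keep only terms that indicate the topic
            scores[term] = chi
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


def relevance(tokens, features):
    """Hypothetical score: fraction of tokens that are selected features."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in features) / len(tokens)


def is_on_topic(title, body, features, title_thr=0.3, body_thr=0.1):
    """Two-step check: accept on the title alone, else fall back to the body."""
    if relevance(title, features) >= title_thr:
        return True
    return relevance(body, features) >= body_thr


# Toy corpus (invented for illustration only).
docs = [
    ["chinese", "teaching", "grammar"],
    ["chinese", "teaching", "vocabulary"],
    ["chinese", "characters", "teaching"],
    ["football", "match", "score"],
    ["stock", "market", "score"],
    ["football", "market", "news"],
]
labels = [1, 1, 1, 0, 0, 0]
features = select_features(docs, labels)
```

Checking the short title first lets a crawler skip scoring the full text when the title is already decisive, which is one way a two-step design could improve efficiency. Whether the paper orders the steps this way is not stated in the abstract.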
This article is indexed by Wanfang Data and other databases.