
Realization and Evaluation of Paodingjieniu Chinese Segmentation in Nutch
Cite this article: SUN Dian-zhe, WEI Hai-ping, CHEN Yan. Realization and Evaluation of Paodingjieniu Chinese Segmentation in Nutch[J]. Computer and Modernization, 2010(6): 187-190.
Authors: SUN Dian-zhe  WEI Hai-ping  CHEN Yan
Affiliation: 1. Graduate School, Liaoning Shihua University, Fushun, Liaoning 113001, China
2. School of Computer and Communication Engineering, Liaoning Shihua University, Fushun, Liaoning 113001, China
Abstract: Chinese word segmentation is one of the main challenges facing search engines. After analyzing Nutch's document scoring mechanism, and because the output of Nutch's Chinese word segmentation module does not conform to Chinese usage, this paper proposes using the dictionary-based Paodingjieniu (庖丁解牛) segmentation module to segment the data that Nutch collects. It describes how to implement the Paodingjieniu module in Nutch and tests the module. Experiments show that the Paodingjieniu module's segmentation results conform better to Chinese usage and that its terms cover documents more evenly; in addition, the storage space occupied by the index files is reduced by 20% to 65%.

Keywords: Chinese word segmentation  scoring mechanism  Paodingjieniu
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.