
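A Quantitative Model of Unknown-Word Omission in Segmentation-Based Repeat Extraction
Cite as: ZHANG Haijun, SHI Shumin, DING Xiyuan, HUANG Heyan. A Quantitative Model of Unknown-Word Omission in Segmentation-Based Repeat Extraction[J]. Journal of Chinese Information Processing (中文信息学报), 2011, 25(2): 122-129
Authors: ZHANG Haijun (张海军), SHI Shumin (史树敏), DING Xiyuan (丁溪源), HUANG Heyan (黄河燕)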
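Affiliations: 1. School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027;
2. Research Center of Computer & Language Information Engineering, Chinese Academy of Sciences, Beijing 100097;
3. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081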
Funding: National Natural Science Foundation of China; key project of the National 863 Program
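Abstract: Building a candidate word set from repeated strings (repeats) is an important approach to unknown word identification (UWI). Two strategies are currently used to extract repeats: character-based and segmentation-based. This paper carries out extensive comparative experiments on the two strategies and proposes a quantitative omission model for unknown words under segmentation-based repeat extraction, which is used to assess how many unknown words fail to be recalled. The analysis shows that the model and the experimental data validate each other well. Based on a discussion of the quantitative model, the paper draws conclusions on applying ...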

Keywords: unknown word identification; repeats; conditional random field (CRF) model; Chinese word segmentation

Quantitative Omission Model of Candidate Unknown Words for Chinese Word Segmentation Based Repeat Extraction
ZHANG Haijun, SHI Shumin, DING Xiyuan, HUANG Heyan. Quantitative Omission Model of Candidate Unknown Words for Chinese Word Segmentation Based Repeat Extraction[J]. Journal of Chinese Information Processing, 2011, 25(2): 122-129
Authors: ZHANG Haijun, SHI Shumin, DING Xiyuan, HUANG Heyan
Affiliation: 1. School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, China;
2. Research Center of Computer & Language Information Engineering, Chinese Academy of Sciences, Beijing 100097, China;
3. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
Abstract: Constructing a candidate word set from repeats is an important approach to Chinese Unknown Word Identification (UWI). Two strategies are used to extract repeats: character-based and Chinese Word Segmentation (CWS)-based. In this paper, extensive comparative experiments are carried out on the two strategies, and a quantitative omission model for candidate unknown words under CWS-based extraction is presented to evaluate the omission of unknown words. The analysis shows good agreement between the experimental results and the model's predictions. Based on a discussion of the quantitative model, conclusions are drawn on applying the two strategies to Chinese UWI, which may serve as a reference for follow-up UWI research.
Keywords: unknown words identification; repeats; CRF; Chinese word segmentation
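
The two extraction strategies named in the abstract can be illustrated with a small sketch. This is not the paper's quantitative model, only a toy showing how a segmentation-based extractor can miss a repeat that a character-based one finds. The function name `repeats`, the sample text, and the hand-written token list are all assumptions made for illustration; the tokens stand in for the output of a segmenter that wrongly merged "AB" in the first occurrence.

```python
from collections import Counter

def repeats(units, max_len=4, min_count=2):
    """Return strings formed by 2..max_len consecutive units
    that occur at least min_count times (hypothetical helper)."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(units) - n + 1):
            counts["".join(units[i:i + n])] += 1
    return {s for s, c in counts.items() if c >= min_count}

# Toy text in which the unknown word "BC" occurs twice.
text = "ABCDEBCF"

# Character-based strategy: every character is a unit.
char_repeats = repeats(list(text))

# Segmentation-based strategy: units are tokens. This token list is
# hand-written (an assumption, not real segmenter output); the first
# "BC" is split by the erroneous token "AB".
tokens = ["AB", "C", "D", "E", "B", "C", "F"]
token_repeats = repeats(tokens)

print(char_repeats)   # {'BC'}
print(token_repeats)  # set()
```

In this toy, the character-based extractor keeps "BC" as a candidate, while the segmentation-based extractor omits it because one occurrence is broken by a wrong token boundary.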
This article is indexed in Wanfang Data and other databases.