Quantitative Omission Model of Candidate Unknown Words for Chinese Word Segmentation Based Repeat Extraction (基于分词提取重复串的未登录词遗漏量化模型)

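Cite this article:	ZHANG Haijun, SHI Shumin, DING Xiyuan, HUANG Heyan. Quantitative Omission Model of Candidate Unknown Words for Chinese Word Segmentation Based Repeat Extraction[J]. Journal of Chinese Information Processing, 2011, 25(2): 122-129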

Authors:	ZHANG Haijun, SHI Shumin, DING Xiyuan, HUANG Heyan

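Affiliations:	1. School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027; 2. Research Center of Computer & Language Information Engineering, Chinese Academy of Sciences, Beijing 100097; 3. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081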

Funding:	National Natural Science Foundation of China; key project of the National 863 Program

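Abstract (translated from the Chinese):	Building a candidate word set from repeated strings (repeats) is an important method for unknown word identification (UWI). Two strategies are currently used to extract repeats: character-based and word-segmentation-based. This paper carries out extensive comparative experiments on the two strategies and proposes a quantitative omission model for unknown words under segmentation-based repeat extraction, which is used to assess how many unknown words fail to be recalled. The analysis shows that the quantitative model and the experimental data validate each other well. Based on a discussion of the quantitative model, the paper reaches conclusions on applying ...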
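Keywords (translated from the Chinese):	unknown word identification; repeats; conditional random field model; Chinese word segmentation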
Quantitative Omission Model of Candidate Unknown Words for Chinese Word Segmentation Based Repeat Extraction

ZHANG Haijun,SHI Shumin,DING Xiyuan,HUANG Heyan. Quantitative Omission Model of Candidate Unknown Words for Chinese Word Segmentation Based Repeat Extraction[J]. Journal of Chinese Information Processing, 2011, 25(2): 122-129

Authors:	ZHANG Haijun, SHI Shumin, DING Xiyuan, HUANG Heyan

Affiliation:	1. School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, China; 2. Research Center of Computer & Language Information Engineering, Chinese Academy of Sciences, Beijing 100097, China; 3. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China

Abstract:	Constructing a candidate word set from repeats is an important approach to Chinese Unknown Word Identification (UWI). Two strategies are used to extract repeats: character-based and Chinese Word Segmentation (CWS) based. This paper carries out extensive comparative experiments on the two strategies and presents a quantitative omission model for candidate unknown words under CWS-based extraction, which is used to evaluate the omission of unknown words. The analysis shows good agreement between the experimental results and the model's outcomes. Based on a discussion of the quantitative model, a reliable conclusion about Chinese UWI under the two strategies is reached, which has reference value for follow-up research on UWI.

Keywords:	unknown words identification; repeats; CRF; Chinese word segmentation
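
The difference between the two extraction strategies named in the abstract can be illustrated with a toy sketch. This is not the paper's omission model or its experimental setup: the function `extract_repeats`, the two example sentences, and the hard-coded segmentation are all assumptions made for illustration. It shows only the basic mechanism by which a segmentation error can keep an unknown word out of the candidate set.

```python
from collections import Counter

def extract_repeats(sequences, max_len=4, min_count=2):
    """Return every n-gram (2 <= n <= max_len units) seen at least min_count times.

    Each element of `sequences` is one sentence given as a list of units:
    single characters for the character-based strategy, tokens for the
    segmentation-based strategy. N-grams never cross sentence boundaries.
    """
    counts = Counter()
    for units in sequences:
        for n in range(2, max_len + 1):
            for i in range(len(units) - n + 1):
                counts[tuple(units[i:i + n])] += 1
    return {"".join(gram) for gram, c in counts.items() if c >= min_count}

# Two hypothetical sentences that both contain the (assumed) unknown word "奥巴马".
sentences = ["奥巴马上任", "奥巴马访华"]

# Hypothetical segmenter output (an assumption, not the output of any real tool):
# in the first sentence "马上" is wrongly grouped, so a token boundary falls
# inside the unknown word.
segmented = [
    ["奥", "巴", "马上", "任"],
    ["奥", "巴", "马", "访华"],
]

# Character-based strategy: units are single characters.
char_repeats = extract_repeats([list(s) for s in sentences])

# Segmentation-based strategy: units are whole tokens.
token_repeats = extract_repeats(segmented)

print("character-based:", sorted(char_repeats))
print("segmentation-based:", sorted(token_repeats))
```

In this toy input the character-based strategy keeps "奥巴马" as a candidate, while the segmentation-based strategy only finds "奥巴", because a repeat built from whole tokens cannot end in the middle of the mis-segmented token "马上".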
This article is indexed in Wanfang Data and other databases.