
Unsupervised Chinese Word Segmentation Combining HDP and Mutual Information

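Cite this article:	CAO Ziqiang, LI Sujian. Unsupervised Chinese Word Segmentation Combining HDP and Mutual Information[J]. Journal of Chinese Information Processing, 2013, 27(6): 1-6.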

Authors:	CAO Ziqiang, LI Sujian

Affiliation:	Key Laboratory of Computational Linguistics (Peking University), Ministry of Education, Beijing 100871, China

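Funding:	National Natural Science Foundation of China (61273278); National Social Science Foundation of China (12&ZD227); National Key Technology R&D Program sub-project (2011BAH10B04-03); National 863 Program (2012AA011101).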

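Abstract:	This paper investigates Chinese word segmentation in the unsupervised setting, which is of great benefit for building robust, language-independent segmentation systems. Mutual information and HDP (Hierarchical Dirichlet Process) are commonly used models for unsupervised segmentation; this paper combines the two and improves the sampling algorithm. With punctuation disregarded, the F-scores obtained on two test corpora of different sizes are 0.693 and 0.741, improvements of 5.8% and 3.9%, respectively, over the baseline HDP. The model is also applied to semi-supervised segmentation, where the experimental result is 2.6% higher than that of the commonly used supervised CRF segmentation.
Keywords:	HDP; mutual information; unsupervised word segmentation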
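
Since the abstract names mutual information as one of the two models being combined, a minimal sketch of how pointwise mutual information between adjacent characters can suggest word boundaries may be useful. This is not the paper's method: the functions `adjacent_pmi` and `segment`, the maximum-likelihood probability estimates, and the fixed `threshold` are all assumptions made for illustration, and the HDP component is omitted entirely.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """Estimate PMI for every adjacent character pair in an unsegmented corpus.

    `corpus` is an iterable of strings. Probabilities are plain relative
    frequencies (an assumption of this sketch, with no smoothing).
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def segment(sentence, pmi, threshold=0.0):
    """Split `sentence` wherever the adjacent-pair PMI falls below `threshold`.

    Pairs never seen in the corpus are treated as boundaries.
    """
    if not sentence:
        return []
    words = [sentence[0]]
    for a, b in zip(sentence, sentence[1:]):
        if pmi.get((a, b), float("-inf")) < threshold:
            words.append(b)      # weak association: start a new word
        else:
            words[-1] += b       # strong association: extend the current word
    return words
```

Calling `segment(text, adjacent_pmi(corpus), threshold)` on raw text yields a list of candidate words. The threshold is a hypothetical tuning knob here; a purely local cut-off like this is only a rough illustration of the mutual-information signal.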
Unsupervised Chinese Word Segmentation Based on HDP and Mutual Information Getting together

CAO Ziqiang, LI Sujian. Unsupervised Chinese Word Segmentation Based on HDP and Mutual Information Getting together[J]. Journal of Chinese Information Processing, 2013, 27(6): 1-6.

Authors:	CAO Ziqiang, LI Sujian

Affiliation:	Key Laboratory of Computational Linguistics (Peking University), Ministry of Education, Beijing 100871, China

Abstract:	This paper explores Chinese word segmentation without training data, which greatly benefits the construction of language-independent word segmentation systems. Mutual information and HDP are both widely used methods for the unsupervised segmentation task. We combine these two models and improve the sampling algorithm. Disregarding punctuation, the F-scores on two test corpora of different sizes are 0.693 and 0.741. Compared with the HDP baseline, the scores rise by 5.8% and 3.9%, respectively. Finally, our model is applied to semi-supervised word segmentation, where the F-score is 2.6% higher than that of the common supervised CRF model.

Keywords:	HDP; mutual information; unsupervised word segmentation
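
The abstract does not spell out how its F-scores are computed. A common convention for word segmentation, assumed here, counts a predicted word as correct only when both of its boundaries match the gold segmentation. The sketch below follows that convention; the helper names `to_spans` and `segmentation_f_score` are hypothetical, and the punctuation filtering mentioned in the abstract is not implemented.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f_score(gold, predicted):
    """Return (precision, recall, F) for one sentence.

    `gold` and `predicted` are lists of words covering the same character
    sequence. A predicted word counts as correct only if its exact span
    also appears in the gold segmentation.
    """
    gold_spans = to_spans(gold)
    pred_spans = to_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

For a corpus-level score, one would typically sum the correct, predicted, and gold word counts over all sentences before computing precision and recall, rather than averaging per-sentence values.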
