Chinese Pinyin-to-character Conversion Based on Cascaded Reranking


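Cite this article:	LI Xin-Xin, WANG Xuan, YAO Lin, GUAN Jian. Chinese Pinyin-to-character Conversion Based on Cascaded Reranking[J]. Acta Automatica Sinica, 2014, 40(4): 624-634.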

Authors:	LI Xin-Xin (李鑫鑫), WANG Xuan (王轩), YAO Lin (姚霖), GUAN Jian (关键)

Affiliation:	1. Computer Application Research Center, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055;

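Funding:	Supported by the Major Science and Technology Project of the Ministry of Science and Technology of China (2011ZX03002-004-01) and the Shenzhen Key Basic Research Projects (JC201104210032A, JC201005260112A)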

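Abstract:	The n-gram language model is the most commonly used approach to Chinese pinyin-to-character conversion. During decoding, however, each new word is determined only by the neighboring words that precede it, so the model lacks syntactic and grammatical constraints between distant words. We introduce sub-models such as part-of-speech tagging and dependency parsing to strengthen these constraints, and adopt two reranking approaches to exploit the information these sub-models provide: 1) a linear reranking approach, which uses minimum error learning to obtain the weight of each sub-model and then produces a probability for each candidate word sequence; 2) an averaged perceptron approach, which reranks the candidate word sequences and can exploit complex features such as part-of-speech tags and dependency relations. Experimental results show that both approaches effectively improve on the word n-gram language model. Cascading the two approaches, that is, first applying linear reranking and then using the resulting probability as the initial probability for perceptron reranking, achieves the best performance.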
Keywords:	Chinese pinyin-to-character conversion; reranking; minimum error learning; perceptron method
Received:	2013-04-22
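
The linear reranking step described in the abstract can be illustrated with a minimal sketch. It assumes each candidate word sequence already carries a log-probability from every sub-model and that the combined score is a weighted sum of those log-probabilities. The sub-model names, the weights, and the scores below are all invented for illustration; the paper learns the weights with minimum error learning, which this sketch does not implement.

```python
from typing import Dict, List

# Hypothetical sub-model weights. The paper learns them with minimum
# error learning; here they are hand-picked for illustration only.
WEIGHTS: Dict[str, float] = {"word_lm": 1.0, "pos": 0.5, "dep": 0.5}


def linear_score(log_probs: Dict[str, float],
                 weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the sub-model log-probabilities of one candidate."""
    return sum(weights[name] * log_probs[name] for name in weights)


def rerank_linear(candidates: List[dict],
                  weights: Dict[str, float] = WEIGHTS) -> List[dict]:
    """Return the candidates sorted from best to worst combined score."""
    return sorted(candidates,
                  key=lambda c: linear_score(c["log_probs"], weights),
                  reverse=True)


# Toy candidates for the toneless pinyin input "shishi" (scores are made up).
candidates = [
    {"text": "事实", "log_probs": {"word_lm": -4.0, "pos": -6.0, "dep": -7.0}},
    {"text": "实施", "log_probs": {"word_lm": -4.5, "pos": -3.0, "dep": -3.5}},
]

best = rerank_linear(candidates)[0]
print(best["text"])
```

In this toy input the first candidate has the better word n-gram score, but the second one wins once the part-of-speech and dependency scores are added in, which is the kind of correction the sub-models are meant to supply.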
Chinese Pinyin-to-character Conversion Based on Cascaded Reranking

LI Xin-Xin, WANG Xuan, YAO Lin, GUAN Jian. Chinese Pinyin-to-character Conversion Based on Cascaded Reranking[J]. Acta Automatica Sinica, 2014, 40(4): 624-634.

Authors:	LI Xin-Xin, WANG Xuan, YAO Lin, GUAN Jian

Affiliation:	1. Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055; 2. Shenzhen Applied Technology Engineering Laboratory for Internet Multimedia Application, Shenzhen 518055; 3. Public Service Platform of Mobile Internet Application Security Industry, Shenzhen 518057

Abstract:	The word n-gram language model is the most common approach to Chinese pinyin-to-character conversion. It is simple, efficient, and widely used in practice. However, in the decoding phase of the word n-gram model, each word is determined only by the words that precede it, so the model lacks long-distance grammatical and syntactic constraints. In this paper, we propose two reranking approaches to address this problem. The linear reranking approach uses minimum error learning to combine different sub-models, including word and character n-gram language models, a part-of-speech tagging model, and a dependency model. The averaged perceptron reranking approach reranks the candidates generated by the word n-gram model using features extracted from the word sequence, part-of-speech tags, and dependency tree. Experimental results on the "Lancaster Corpus of Mandarin Chinese" and "People's Daily" show that both reranking approaches make effective use of syntactic structure information and outperform the word n-gram model. The perceptron reranking approach that takes the probability output of the linear reranking approach as its initial weight achieves the best performance.

Keywords:	Chinese pinyin-to-character conversion; reranking approach; minimum error learning; averaged perceptron
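
The cascaded setup in the abstract, where the perceptron reranker starts from the output of the linear reranker, can be sketched roughly as follows. This is only an illustration under stated assumptions: the linear reranker's output is modeled as a fixed `init` score added to the perceptron's feature score, and the feature names, the toy training example, and the epoch count are all hypothetical rather than taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def score(weights: Dict[str, float], cand: dict) -> float:
    """Initial score from the linear reranker plus the perceptron feature score."""
    return cand["init"] + sum(weights.get(f, 0.0) * v
                              for f, v in cand["features"].items())


def rerank(weights: Dict[str, float], candidates: List[dict]) -> int:
    """Index of the highest-scoring candidate."""
    return max(range(len(candidates)), key=lambda i: score(weights, candidates[i]))


def train_averaged_perceptron(data: List[Tuple[List[dict], int]],
                              epochs: int = 5) -> Dict[str, float]:
    """Standard averaged perceptron over (candidate list, gold index) pairs."""
    weights: Dict[str, float] = defaultdict(float)
    totals: Dict[str, float] = defaultdict(float)
    steps = 0
    for _ in range(epochs):
        for candidates, gold in data:
            pred = rerank(weights, candidates)
            if pred != gold:
                # Move the weights toward the gold features, away from the prediction.
                for f, v in candidates[gold]["features"].items():
                    weights[f] += v
                for f, v in candidates[pred]["features"].items():
                    weights[f] -= v
            steps += 1
            # Accumulate the current weights so they can be averaged at the end.
            for f, w in weights.items():
                totals[f] += w
    return {f: t / steps for f, t in totals.items()}


# Toy training example: candidate 1 is the gold sequence, but the linear
# reranker (the "init" score) initially prefers candidate 0.
toy_candidates = [
    {"init": -1.0, "features": {"pos=v_n": 1.0}},
    {"init": -1.5, "features": {"dep=obj": 1.0}},
]
avg_weights = train_averaged_perceptron([(toy_candidates, 1)])
print(rerank(avg_weights, toy_candidates))
```

Keeping the linear score as a fixed offset is one simple reading of "initial weight"; another would be to treat it as an ordinary feature with its own learned weight.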