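Chinese Pinyin-to-character Conversion Based on Cascaded Reranking
Cite this article:LI Xin-Xin, WANG Xuan, YAO Lin, GUAN Jian. Chinese Pinyin-to-character Conversion Based on Cascaded Reranking[J]. Acta Automatica Sinica, 2014, 40(4): 624-634.
Author names:LI Xin-Xin  WANG Xuan  YAO Lin  GUAN Jian
Author affiliation:1.Computer Application Research Center, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055;
Funding:Supported by the Major Science and Technology Project of the Ministry of Science and Technology of China (2011ZX03002-004-01) and the Key Basic Research Projects of Shenzhen (JC201104210032A, JC201005260112A)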
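Abstract (translated from the Chinese):The N-gram language model is the most commonly used method for Chinese pinyin-to-character conversion. During decoding, however, each new word is determined only by the words immediately preceding it, so the model lacks syntactic and grammatical constraints between distant words. We introduce sub-models such as part-of-speech tagging and dependency parsing to strengthen these constraints, and use two reranking methods to exploit the information the sub-models provide: 1) a linear reranking method, which uses minimum error learning to obtain the weight of each sub-model and then produces a probability for each candidate word sequence; 2) an averaged perceptron method, which reranks the candidate word sequences and can exploit complex features such as parts of speech and dependency relations. Experimental results show that both methods effectively improve on the word N-gram language model. Cascading the two methods, that is, applying linear reranking first and then using the resulting probability as the initial probability for perceptron reranking, gives the best performance.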

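Keywords (translated from the Chinese):Chinese pinyin-to-character conversion    reranking    minimum error learning    perceptron method
Received:2013-04-22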
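
The linear reranking step described in the abstract can be illustrated with a minimal sketch. It assumes each candidate word sequence already carries a log-probability from every sub-model and that the combined score is a weighted sum of those log-probabilities. The sub-model names, the weights, and the scores below are all made up for illustration; in the paper the weights come from minimum error learning, which this sketch does not implement.

```python
from typing import Dict, List

# Hypothetical sub-model names (assumption); the abstract mentions word and
# character n-gram models, a part-of-speech tagging model and a dependency model.
SUB_MODELS = ["word_ngram", "char_ngram", "pos", "dependency"]


def linear_score(log_probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the sub-model log-probabilities of one candidate."""
    return sum(weights[name] * log_probs[name] for name in SUB_MODELS)


def linear_rerank(candidates: List[dict], weights: Dict[str, float]) -> List[dict]:
    """Return the candidates sorted from best to worst combined score."""
    return sorted(
        candidates,
        key=lambda c: linear_score(c["log_probs"], weights),
        reverse=True,
    )


# Hand-set weights and invented scores, for illustration only.
weights = {"word_ngram": 1.0, "char_ngram": 0.5, "pos": 0.5, "dependency": 0.5}
candidates = [
    {"text": "candidate A",
     "log_probs": {"word_ngram": -10.0, "char_ngram": -14.0, "pos": -8.0, "dependency": -9.0}},
    {"text": "candidate B",
     "log_probs": {"word_ngram": -10.5, "char_ngram": -12.0, "pos": -6.0, "dependency": -7.0}},
]
ranked = linear_rerank(candidates, weights)
```

In this toy input the word n-gram model alone prefers candidate A, while the combined score prefers candidate B, which is the kind of correction the extra sub-models are meant to provide.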

Chinese Pinyin-to-character Conversion Based on Cascaded Reranking
LI Xin-Xin, WANG Xuan, YAO Lin, GUAN Jian. Chinese Pinyin-to-character Conversion Based on Cascaded Reranking[J]. Acta Automatica Sinica, 2014, 40(4): 624-634.
Authors:LI Xin-Xin  WANG Xuan  YAO Lin  GUAN Jian
Affiliation:1.Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055;2.Shenzhen Applied Technology Engineering Laboratory for Internet Multimedia Application, Shenzhen 518055;3.Public Service Platform of Mobile Internet Application Security Industry, Shenzhen 518057
Abstract:The word n-gram language model is the most common approach to Chinese pinyin-to-character conversion. It is simple, efficient, and widely used in practice. However, in the decoding phase of the word n-gram model, each word is determined only by the words preceding it, so the model lacks long-distance grammatical and syntactic constraints. In this paper, we propose two reranking approaches to solve this problem. The linear reranking approach uses the minimum error learning method to combine different sub-models, which include word and character n-gram language models, a part-of-speech tagging model, and a dependency model. The averaged perceptron reranking approach reranks the candidates generated by the word n-gram model using features extracted from the word sequence, part-of-speech tags, and dependency tree. Experimental results on the "Lancaster Corpus of Mandarin Chinese" and "People's Daily" show that both reranking approaches effectively use syntactic structure information and outperform the word n-gram model. The perceptron reranking approach that takes the probability output of the linear reranking approach as its initial weight achieves the best performance.
Keywords:Chinese pinyin-to-character conversion  reranking approach  minimum error learning  averaged perceptron
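
The cascaded setup in the abstract, where the perceptron reranker starts from the output of linear reranking, can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the initial score is simply added to the perceptron's feature score, the feature names are invented, and the averaging is done the naive way by summing the weight vector after every training example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (initial score from an earlier stage, sparse feature vector); assumed layout.
Candidate = Tuple[float, Dict[str, float]]


def score(cand: Candidate, w: Dict[str, float]) -> float:
    """Initial score plus the dot product of the weights and the features."""
    init, feats = cand
    return init + sum(w.get(k, 0.0) * v for k, v in feats.items())


def best_index(cands: List[Candidate], w: Dict[str, float]) -> int:
    """Index of the highest-scoring candidate."""
    return max(range(len(cands)), key=lambda i: score(cands[i], w))


def train(data: List[Tuple[List[Candidate], int]], epochs: int = 5) -> Dict[str, float]:
    """Averaged perceptron: return the mean of the weights over all steps."""
    w: Dict[str, float] = defaultdict(float)
    total: Dict[str, float] = defaultdict(float)
    steps = 0
    for _ in range(epochs):
        for cands, gold in data:
            pred = best_index(cands, w)
            if pred != gold:
                # Move the weights toward the gold candidate's features
                # and away from the wrongly predicted candidate's features.
                for k, v in cands[gold][1].items():
                    w[k] += v
                for k, v in cands[pred][1].items():
                    w[k] -= v
            for k, v in w.items():
                total[k] += v
            steps += 1
    return {k: v / steps for k, v in total.items()}


# Toy data with invented features: the initial score prefers candidate 0,
# but candidate 1 is marked as the correct one.
cands = [(-23.0, {"pos=N_V": 1.0}), (-24.0, {"dep=subj": 1.0})]
data = [(cands, 1)]
avg_w = train(data)
```

With empty weights the reranker falls back to the initial score and picks candidate 0; after training on the toy example, `best_index(cands, avg_w)` picks candidate 1.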
This article is indexed in CNKI and other databases.