Application of Spelling Correction in Pinyin Input Methods

Cite this article:	陈正 (CHEN Zheng), 李开复 (LEE Kai-Fu). Spelling Correction in Pinyin Input [J]. 计算机学报 (Chinese Journal of Computers), 2001, 24(7): 758-763.

Authors:	陈正 (CHEN Zheng), 李开复 (LEE Kai-Fu)

Affiliation:	Microsoft Research China

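Abstract:	Chinese input has long been a difficult problem in Chinese language research. Building on a sentence-based Pinyin input method, this paper proposes automatic spelling correction during Chinese input. By analyzing the various errors that users make while typing, the authors build an effective and practical typing model, whose parameters are estimated statistically from collected real user input. In addition, a powerful Chinese language model is trained on a large amount of Chinese text and combined with the typing model; using techniques similar to those of speech recognition, the system corrects the various errors in the user's input and produces the most suitable Chinese characters. Spelling correction can also adapt to the individual user, and it is applicable to a variety of languages.
Keywords:	typing model; language model; Pinyin input method; Chinese input method; spelling correction; computer
Revised manuscript date:	November 15, 1999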
Spelling Correction in Pinyin Input

CHEN Zheng, LEE Kai-Fu. Spelling Correction in Pinyin Input [J]. Chinese Journal of Computers, 2001, 24(7): 758-763.

Authors:	CHEN Zheng, LEE Kai-Fu

Abstract:	Chinese input is one of the difficult problems in Chinese language processing. Because it is easy to learn and use, Pinyin is the most popular Chinese input method: over 97% of users in China use Pinyin for input. Despite these advantages, Pinyin input suffers from several problems, including Pinyin-to-character conversion errors and user typing errors. Based on a sentence-based Pinyin input method, we propose a new typing model to address these problems. The system accepts correct typing but also tolerates common typing errors. After analyzing the most common errors made by typists, we build a typing model that is trained on real data and learns the probabilities of typing errors, including substitution, insertion, and deletion errors. We also design a unified approach to Chinese statistical language modeling, which enhances trigram-based statistical language modeling with automatic, maximum-likelihood-based methods to segment words, select the lexicon, and filter the training data. Compared to the commercial product, our system has an error rate up to 50% lower at the same memory size, and is about 76% better with no memory limits at all. In Pinyin-to-Hanzi conversion, the typing model probabilities are combined with the language model probabilities to find the most probable interpretation of a sequence of typed Roman letters. Furthermore, spelling correction can automatically adapt to typists according to their typing skill. In a real system, a skilled typist could be assigned a lower LM weight, and a typist's skill can be determined from their typing speed. Moreover, the method is applicable to any language. Compared to the baseline system, our system achieves approximately 30% error reduction on the open test set.
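The way the abstract combines typing-model and language-model probabilities can be illustrated with a small noisy-channel sketch. This is not the authors' implementation: the paper describes a trigram language model over Chinese text, whereas the sketch below swaps in a toy unigram model over three Pinyin strings, and every name and number in it (`P_MATCH`, `P_SUB`, `P_INS`, `P_DEL`, `TOY_LM`, `typing_log_prob`, `decode`, `lm_weight`) is a hypothetical chosen for illustration. The per-keystroke probabilities are also not a properly normalized distribution; they only show the shape of the computation.

```python
import math

# Hypothetical per-keystroke probabilities (illustrative, not from the paper).
P_MATCH = 0.94  # intended letter typed correctly
P_SUB = 0.02    # intended letter replaced by another letter
P_INS = 0.02    # extra letter typed
P_DEL = 0.02    # intended letter omitted

# Hypothetical toy unigram "language model" over intended Pinyin strings.
TOY_LM = {"zhongguo": 0.6, "zhonghua": 0.3, "zhongyao": 0.1}


def typing_log_prob(typed: str, intended: str) -> float:
    """Log-probability of the best edit alignment from `intended` to `typed`."""
    n, m = len(intended), len(typed)
    neg_inf = float("-inf")
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = best[i][j]
            if cur == neg_inf:
                continue
            if i < n and j < m:  # match or substitution
                p = P_MATCH if intended[i] == typed[j] else P_SUB
                best[i + 1][j + 1] = max(best[i + 1][j + 1], cur + math.log(p))
            if i < n:  # deletion: the user skipped intended[i]
                best[i + 1][j] = max(best[i + 1][j], cur + math.log(P_DEL))
            if j < m:  # insertion: the user typed an extra typed[j]
                best[i][j + 1] = max(best[i][j + 1], cur + math.log(P_INS))
    return best[n][m]


def decode(typed: str, lm_weight: float = 1.0) -> str:
    """Pick the candidate maximizing lm_weight * log P_LM + log P_typing."""
    return max(
        TOY_LM,
        key=lambda w: lm_weight * math.log(TOY_LM[w]) + typing_log_prob(typed, w),
    )


if __name__ == "__main__":
    print(decode("zhonguo"))   # input with one letter missing
    print(decode("zhonghuo"))  # input with one letter substituted
```

In this sketch `lm_weight` scales how much the language model counts relative to the typing model, which is the knob the abstract proposes adjusting per typist. A full system would search over whole sentences rather than score a fixed candidate list.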

Keywords:	typing model; language model; Viterbi beam search; trigram; adaptation
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.