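Application of Spelling Correction in Pinyin Input Methods
Cite this article: CHEN Zheng, LEE Kai-Fu. Application of Spelling Correction in Pinyin Input Methods [J]. Chinese Journal of Computers, 2001, 24(7): 758-763.
Authors: CHEN Zheng, LEE Kai-Fu
Affiliation: Microsoft Research China
Abstract: Chinese input has long been a difficult problem in Chinese language research. Building on a sentence-level Pinyin input method, this paper proposes automatic spelling correction during Chinese input. By analyzing the various errors that users make while typing, we build an effective and practical typing model, and we estimate its parameters statistically from data collected from real user input. We also train a powerful Chinese language model on a large body of Chinese text and combine it with the Chinese typing model; using techniques similar to those of speech recognition, the system corrects the various errors in the user's input and produces the most suitable Chinese characters. In addition, spelling correction not only supports user adaptation but is also applicable to various languages.

Keywords: typing model; language model; Pinyin input method; Chinese input method; spelling correction; computer
Revised: November 15, 1999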

Spelling Correction in Pinyin Input
CHEN Zheng, LEE Kai-Fu. Spelling Correction in Pinyin Input [J]. Chinese Journal of Computers, 2001, 24(7): 758-763.
Authors: CHEN Zheng, LEE Kai-Fu
Abstract: Chinese input is one of the difficult problems in Chinese language processing. Because it is easy to learn and use, Pinyin is the most popular Chinese input method: over 97% of users in China use Pinyin for input. Despite these advantages, the Pinyin input method suffers from several problems, including Pinyin-to-character conversion errors and user typing errors. Based on a sentence-based Pinyin input method, we propose a new typing model to address these problems. The system accepts correct typing but also tolerates common typing errors. After analyzing the most common errors made by typists, we build a typing model that is trained on real data and learns the probabilities of typing errors, including substitution, insertion, and deletion errors. We also design a unified approach to Chinese statistical language modeling. This approach enhances trigram-based statistical language modeling with automatic, maximum-likelihood-based methods to segment words, select the lexicon, and filter the training data. Compared with a commercial product, our system has an error rate up to 50% lower at the same memory size, and is about 76% better with no memory limit. In Pinyin-to-Hanzi conversion, the typing model probabilities are combined with the language model probabilities to find the most probable interpretation of a typed sequence of Roman letters. Furthermore, spelling correction can automatically adapt to typists according to their typing skill. In a real system, a skilled typist could be assigned a lower LM weight, and a typist's skill can be estimated from typing speed. Moreover, the method is applicable to any language. Compared with the baseline system, our system achieves an error reduction of approximately 30% on the open test set.
Keywords: typing model; language model; Viterbi beam search; trigram; adaptation
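
Illustration: the abstract says that typing model probabilities are combined with language model probabilities to pick the most probable interpretation of the typed letters. The following is a minimal Python sketch of that idea, not the authors' implementation. It assumes a weighted edit-distance typing model with made-up probabilities for substitution, insertion, and deletion errors, a tiny hand-written candidate list with made-up language model scores, and exhaustive scoring in place of the Viterbi beam search named in the keywords. All function names, constants, and candidates are hypothetical.

```python
import math

# Hypothetical per-operation log-probabilities. A real typing model would
# estimate these from collected user typing data.
LOG_P_MATCH = math.log(0.97)
LOG_P_SUB = math.log(0.01)   # wrong letter typed
LOG_P_INS = math.log(0.01)   # extra letter typed
LOG_P_DEL = math.log(0.01)   # intended letter omitted


def typing_log_prob(typed: str, intended: str) -> float:
    """Best-alignment log P(typed | intended) under the toy error model."""
    rows, cols = len(intended) + 1, len(typed) + 1
    dp = [[-math.inf] * cols for _ in range(rows)]
    dp[0][0] = 0.0
    for i in range(rows):
        for j in range(cols):
            if i > 0 and j > 0:
                same = intended[i - 1] == typed[j - 1]
                step = LOG_P_MATCH if same else LOG_P_SUB
                dp[i][j] = max(dp[i][j], dp[i - 1][j - 1] + step)
            if i > 0:  # deletion: an intended letter is missing from the input
                dp[i][j] = max(dp[i][j], dp[i - 1][j] + LOG_P_DEL)
            if j > 0:  # insertion: the input has an extra letter
                dp[i][j] = max(dp[i][j], dp[i][j - 1] + LOG_P_INS)
    return dp[-1][-1]


def decode(typed, candidates, lm_weight=1.0):
    """Return the Hanzi string with the best combined score.

    candidates: list of (pinyin, hanzi, lm_log_prob) tuples.
    lm_weight scales the language model score; the abstract suggests a
    lower weight for skilled typists, whose keystrokes are more reliable.
    """
    def score(candidate):
        pinyin, _hanzi, lm_log_prob = candidate
        return lm_weight * lm_log_prob + typing_log_prob(typed, pinyin)

    return max(candidates, key=score)[1]


# Made-up candidates and language model probabilities, for illustration only.
CANDIDATES = [
    ("zhongguo", "中国", math.log(0.010)),
    ("zhonghua", "中华", math.log(0.005)),
    ("zhengfu", "政府", math.log(0.010)),
]

best = decode("zhognguo", CANDIDATES)  # input with two letters transposed
```

Here `best` is "中国": the mistyped input "zhognguo" is only two errors away from "zhongguo", so the typing model score outweighs the alternatives even before the language model score is added.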
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.