
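Research on Post-Processing Methods for Printed Chinese Character Recognition (印刷体汉字识别后处理方法的研究)
Cite this article: ZHANG Hongtao, LONG Chong, ZHU Xiaoyan, SUN Jun. Research on Post-Processing Methods for Printed Chinese Character Recognition[J]. Journal of Chinese Information Processing (中文信息学报), 2009, 23(6): 67-72
Authors: ZHANG Hongtao (张宏涛), LONG Chong (龙翀), ZHU Xiaoyan (朱小燕), SUN Jun (孙俊)
Affiliations: 1. State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology (in preparation), Department of Computer Science and Technology, Tsinghua University, Beijing 100084;
2. Fujitsu R&D Center Co. Ltd., Beijing 100016
Funding: Supported by the Fujitsu R&D Center research project on OCR post-processing methods
Abstract: High-order N-gram language models are widely used in OCR post-processing, but they suffer from data sparseness caused by high model complexity and consume considerable time and memory. For the post-processing of printed Chinese character recognition, this paper proposes a post-processing algorithm built on a byte-based language model. Using the byte as the basic unit of the language model greatly reduces model complexity, which largely alleviates the data sparseness problem. Experiments show that a post-processing system using the byte-based language model achieves good recognition performance with very little time and space overhead. On a test set containing some segmentation errors, accuracy rises from 88.67% to 98.32%, an 85.18% reduction in error rate. Running speed is substantially higher than that of character-based and word-based systems, which improves the overall performance of the post-processing system. Compared with the commonly used word-based language model post-processing system, the new system saves 95% of the running time and 98% of the memory, while the recognition rate drops by only 1.11%.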

Keywords: computer applications; Chinese information processing; Chinese character recognition; OCR; language model; post-processing
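
As a rough illustration of the byte-based language model idea described in the abstract, the sketch below trains a byte bigram model with add-one smoothing and uses it to pick the most plausible string from a small lattice of OCR candidates. Everything specific here is an assumption made for illustration and is not taken from the paper: the GB2312 encoding, the bigram order, the smoothing, the brute-force search, the toy corpus, and the function names `train_byte_bigram`, `log_prob`, and `best_candidate`.

```python
import math
from collections import Counter
from itertools import product

# Assumption: a two-byte Chinese encoding; the abstract does not name one.
ENCODING = "gb2312"


def train_byte_bigram(corpus):
    """Count byte contexts and byte bigrams over the encoded sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in corpus:
        data = sentence.encode(ENCODING)
        contexts.update(data[:-1])            # every byte that has a successor
        bigrams.update(zip(data, data[1:]))   # adjacent byte pairs
    return contexts, bigrams


def log_prob(text, contexts, bigrams, vocab=256):
    """Add-one smoothed byte-bigram log-probability of a string."""
    data = text.encode(ENCODING)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab))
        for a, b in zip(data, data[1:])
    )


def best_candidate(lattice, contexts, bigrams):
    """Pick the highest-scoring string from per-position OCR candidates.

    Brute-force enumeration, fine for a toy lattice only; a real system
    would use dynamic programming such as Viterbi search.
    """
    candidates = ("".join(chars) for chars in product(*lattice))
    return max(candidates, key=lambda s: log_prob(s, contexts, bigrams))


if __name__ == "__main__":
    # Hypothetical toy training corpus and candidate lattice.
    contexts, bigrams = train_byte_bigram(["汉字识别", "汉字处理", "文字识别"])
    lattice = [["汉", "汶"], ["字", "宇"], ["识"], ["别"]]
    print(best_candidate(lattice, contexts, bigrams))
```

In this sketch the model vocabulary is at most 256 byte values rather than thousands of characters or a much larger word list, so the count tables stay small. That is one plausible reading of why the abstract reports less data sparseness and lower memory use for the byte-based model.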
  

Post-Processing Approach for Printed Chinese Character Recognition
ZHANG Hongtao, LONG Chong, ZHU Xiaoyan, SUN Jun. Post-Processing Approach for Printed Chinese Character Recognition[J]. Journal of Chinese Information Processing, 2009, 23(6): 67-72
Authors: ZHANG Hongtao, LONG Chong, ZHU Xiaoyan, SUN Jun
Affiliation: 1. State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
2. Information Technology Laboratory, Fujitsu R&D Center Co. Ltd., Beijing 100016, China
Abstract: In Chinese OCR post-processing, high-order n-gram language models, such as word-based trigram and four-gram models, remain challenging to use because of data sparseness and the large memory cost caused by model size. In this paper, we focus on the post-processing of printed Chinese character recognition and propose a byte-based language model. By choosing the byte as the basic unit of the language model, we achieve a remarkable reduction in model size, which alleviates the sparseness problem to a great extent. The experimental results show that the byte-based language model achieves high recognition performance at very low time and space cost. On a test set with segmentation errors, recognition accuracy increases from 88.67% to 98.32%, an 85.18% reduction in error rate. Compared with a system using a traditional word-based trigram model, the new system saves 95% of the running time and nearly 98% of the memory at almost no cost in accuracy.
Keywords: OCR
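
The error-reduction figure in the abstract can be checked from the two reported accuracies. The short sketch below computes the relative drop in error rate; the function name `relative_error_reduction` is a hypothetical helper, not something from the paper. With the rounded accuracies 88.67% and 98.32% it gives a value close to the reported 85.18%, and the small difference is presumably due to rounding of the published figures.

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative reduction in error rate (percent), given accuracies in percent."""
    err_before = 100.0 - acc_before   # error rate before post-processing
    err_after = 100.0 - acc_after     # error rate after post-processing
    return 100.0 * (err_before - err_after) / err_before


if __name__ == "__main__":
    # Accuracies as reported in the abstract.
    print(round(relative_error_reduction(88.67, 98.32), 2))
```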
This article is indexed in Wanfang Data (万方数据) and other databases.