
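Research on Chinese/English Mixed Document Recognition
Cite this article: WANG Kai, WANG Qing-Ren. Research on Chinese/English Mixed Document Recognition[J]. Journal of Software, 2005, 16(5): 786-798
Authors: WANG Kai  WANG Qing-Ren
Affiliation: Institute of Machine Intelligence, Nankai University, Tianjin 300071 (both authors)
Funding: Supported by the National Natural Science Foundation of China under Grant No.TY10026002-04-04-01 (Tianyuan Fund of the National Natural Science Foundation of China)
Abstract: Many OCR (optical character recognition) classifiers have already been designed for a single character set (or language). At the same time, with globalization, multilingual documents are becoming increasingly common, so designing multilingual document processing systems is imperative. This paper proposes a general solution: two OCR techniques, one system, and language identification. To make the work concrete, a Chinese/English mixed document processing system is implemented. It involves three key problems: system flow control, separation of Chinese and English language regions, and segmentation of English characters. Compared with previous systems, this system adds a Chinese/English region separation module and applies a new equidistance-based method to that module. To verify the effectiveness of the system, a second system was implemented by combining previously published methods. Experimental results show that the proposed system clearly outperforms the second system: the recognition rate rises from 98.48% to 99.13% on magazine samples and from 98.68% to 99.25% on book samples.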

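Keywords: systems design  language discrimination  character segmentation  multilingual optical character recognition system  document image processing
Article ID: 1000-9825/2005/16(05)0786
Received: 2004-03-03
Revised: 2004-05-08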
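
The abstract names an equidistance-based method for separating Chinese and English regions but does not describe it. The sketch below is only one plausible reading, not the authors' algorithm: it assumes that Chinese characters in a text line sit at a nearly constant pitch while English letters vary in width, and it classifies a run of character boxes by the coefficient of variation of their centre-to-centre distances. The function names, the `(left, right)` box format, and the `threshold` value of 0.15 are all hypothetical.

```python
from statistics import mean, pstdev

def pitch_variation(boxes):
    """Return the coefficient of variation of centre-to-centre distances.

    `boxes` is a list of (left, right) x-coordinates of character boxes,
    assumed to be sorted left to right (hypothetical input format).
    Returns None when there are too few boxes to judge.
    """
    centers = [(left + right) / 2 for left, right in boxes]
    pitches = [b - a for a, b in zip(centers, centers[1:])]
    if len(pitches) < 2:
        return None
    avg = mean(pitches)
    if avg <= 0:
        return None
    return pstdev(pitches) / avg

def classify_run(boxes, threshold=0.15):
    """Label a run as 'chinese' (near-constant pitch) or 'english'.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    cv = pitch_variation(boxes)
    if cv is None:
        return "unknown"
    return "chinese" if cv <= threshold else "english"

if __name__ == "__main__":
    # Evenly pitched boxes, as in a run of Chinese characters.
    even_run = [(0, 30), (32, 62), (64, 94), (96, 126)]
    # Boxes of varying width, as in a run of English letters.
    uneven_run = [(0, 6), (8, 30), (32, 36), (38, 44), (46, 70)]
    print(classify_run(even_run), classify_run(uneven_run))
```

A real system would also have to handle punctuation, touching characters, and monospaced Latin fonts, none of which this sketch attempts.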

Research on Chinese/English Mixed Document Recognition
WANG Kai and WANG Qing-Ren. Research on Chinese/English Mixed Document Recognition[J]. Journal of Software, 2005, 16(5): 786-798
Authors:WANG Kai and WANG Qing-Ren
Abstract:Currently, OCR (optical character recognition) classifiers are generally designed for a single character set (or language). Meanwhile, the number of multilingual documents is increasing rapidly due to globalization. Designing a document processing system with multilingual capability is therefore very important. A general scheme is presented in this paper: two OCR techniques, one system, and language classification. To make the scheme concrete, a Chinese/English mixed document processing system is implemented. Three key problems are considered: control of the system flow, classification of Chinese/English regions, and segmentation of English characters. Compared with the systems presented in earlier papers, this system adds a module for classifying Chinese/English regions, and a novel approach based on equidistance is applied in that module. To verify the effectiveness of the system, a second system is implemented according to the methods presented in earlier papers. Experiments show that the new system is more effective than the second system: the recognition rate increases from 98.48% to 99.13% on magazine samples and from 98.68% to 99.25% on book samples.
Keywords:systems design  language discrimination  character segmentation  multilingual OCR (optical character recognition) system  document image processing
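
The recognition rates quoted in the abstract are easier to compare as error rates. A short calculation using only those four figures shows that the residual error drops from 1.52% to 0.87% on magazine samples and from 1.32% to 0.75% on book samples, a relative reduction of roughly 43% in both cases. The function name below is a hypothetical helper for this illustration.

```python
def relative_error_reduction(old_rate, new_rate):
    """Fraction of the old error rate removed, given recognition rates in percent."""
    old_err = 100.0 - old_rate
    new_err = 100.0 - new_rate
    return (old_err - new_err) / old_err

# Recognition rates (percent) as reported in the abstract.
reported = [("magazine", 98.48, 99.13), ("book", 98.68, 99.25)]

for name, old, new in reported:
    reduction = relative_error_reduction(old, new)
    print(f"{name}: error {100 - old:.2f}% -> {100 - new:.2f}%, "
          f"relative reduction {reduction:.1%}")
```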
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.