Uyghur Character Models with Shared Structure Information for Segmentation-free Recognition under Low Data Resource Conditions
Cite this article: Jiang Zhi-wei, Ding Xiao-qing, Peng Liang-rui, Liu Chang-song. Uyghur Character Models with Shared Structure Information for Segmentation-free Recognition under Low Data Resource Conditions[J]. Journal of Electronics & Information Technology, 2015, 37(9): 2103-2109.
Authors: Jiang Zhi-wei  Ding Xiao-qing  Peng Liang-rui  Liu Chang-song
Funding: National Natural Science Foundation of China (61032008) and the National 973 Program of China (2013CB329403)
Abstract: Segmentation-free Uyghur document recognition effectively avoids character segmentation errors, but for new sample types with low data resources, existing models often fail to reach high recognition performance. To address this, this paper proposes sharing the relatively stable character structure information among commonly used Uyghur fonts, and using the Bootstrap method to make more efficient use of the available samples. Experiments on real book samples show that, using new-type samples amounting to only about 1/5 of the original training samples, the proposed method achieves an average character recognition accuracy of 95.05% on the test set. Compared with the commonly used Maximum A Posteriori (MAP) estimation method, it also lowers the recognition error rate by a relative 55.76%~63.84%. The method therefore effectively solves the Uyghur character modeling problem under low data resource conditions and achieves high-performance recognition of new sample types.

Keywords: Character recognition    Hidden Markov Model (HMM)    Statistical learning    Uyghur
Received: 2015-01-06
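
The abstract reports its gain over MAP estimation as a relative decrease in recognition error rate, which is easy to confuse with an absolute difference in accuracy. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The function name `relative_error_reduction` and the baseline value are assumptions made for illustration; only the 95.05% accuracy comes from the abstract.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the fraction by which new_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Error rate implied by the reported 95.05% average accuracy (in percent).
proposed_error = 100.0 - 95.05

# HYPOTHETICAL baseline error rate, chosen for illustration only.
# It is not a figure taken from the paper.
hypothetical_map_error = 12.0

reduction = relative_error_reduction(hypothetical_map_error, proposed_error)
print(f"relative error reduction: {reduction:.2%}")
```

Under this definition, a relative reduction in the reported 55.76%~63.84% range means the proposed method's error rate is less than half of the baseline's, whatever the absolute baseline error is.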
This article is indexed in Wanfang Data (万方数据) and other databases.