Optical character recognition errors and their effects on natural language processing

Authors:	Daniel Lopresti

Affiliation:	(1) Lexicography MasterClass Ltd, Brighton, UK;(2) Trinity College, Dublin, Ireland

Abstract:	Errors are unavoidable in advanced computer vision applications such as optical character recognition, and the noise induced by these errors presents a serious challenge to downstream processes that attempt to make use of such data. In this paper, we apply a new paradigm we have proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Our methodology formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach. Errors and their cascading effects are isolated and analyzed as they travel through the pipeline. We present experimental results based on a large collection of scanned pages to study the varying impact depending on the nature of the error and the character(s) involved. This dataset has also been made available online to encourage future investigations.
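The abstract frames error classification as an optimization problem solved by dynamic programming. As a rough illustration only, the sketch below aligns a ground-truth string with its OCR output using a plain, single-level character edit distance and reports the non-matching operations. This is a simplified stand-in, not the hierarchical method the paper describes; the function names `align` and `errors`, the unit costs, and the tie-breaking order are all assumptions made for this example.

```python
from typing import List, Tuple

Op = Tuple[str, str, str]  # (kind, ground-truth char, OCR char)


def align(truth: str, ocr: str) -> List[Op]:
    """Align two strings with unit-cost edit distance (hypothetical helper)."""
    n, m = len(truth), len(ocr)
    # cost[i][j] = minimum number of edits turning truth[:i] into ocr[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    # Assumption: prefer the diagonal move, then deletion, then insertion.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])):
            kind = "match" if truth[i - 1] == ocr[j - 1] else "substitute"
            ops.append((kind, truth[i - 1], ocr[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("delete", truth[i - 1], ""))  # character lost by OCR
            i -= 1
        else:
            ops.append(("insert", "", ocr[j - 1]))  # spurious OCR character
            j -= 1
    ops.reverse()
    return ops


def errors(truth: str, ocr: str) -> List[Op]:
    """Keep only the operations where the OCR output differs from the truth."""
    return [op for op in align(truth, ocr) if op[0] != "match"]


if __name__ == "__main__":
    # A dropped period is the kind of error that can disturb
    # sentence boundary detection downstream.
    print(errors("end. Next", "end Next"))
    print(errors("the", "tbe"))
```

Running the same alignment at more than one level (for example, over tokens and then over characters within mismatched tokens) would be closer in spirit to a hierarchical scheme, but that extension is not shown here and its details are not given in the abstract.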

Keywords:
This article is indexed in SpringerLink and other databases.