Optical character recognition errors and their effects on natural language processing |
| |
Authors: | Daniel Lopresti |
| |
Affiliation: | (1) Lexicography MasterClass Ltd, Brighton, UK;(2) Trinity College, Dublin, Ireland |
| |
Abstract: | Errors are unavoidable in advanced computer vision applications such as optical character recognition, and the noise induced
by these errors presents a serious challenge to downstream processes that attempt to make use of such data. In this paper,
we apply a new paradigm we have proposed for measuring the impact of recognition errors on the stages of a standard text analysis
pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Our methodology formulates error classification
as an optimization problem solvable using a hierarchical dynamic programming approach. Errors and their cascading effects
are isolated and analyzed as they travel through the pipeline. We present experimental results based on a large collection
of scanned pages to study the varying impact depending on the nature of the error and the character(s) involved. This dataset
has also been made available online to encourage future investigations. |
| |
This article is indexed in SpringerLink and other databases. |
|
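As a rough illustration of the idea in the abstract of treating error classification as an optimization problem solved by dynamic programming, the sketch below aligns a ground-truth string with its OCR output and labels each difference. It is a flat, character-level edit-distance alignment with unit costs, not the hierarchical method the paper describes; the function name `align`, the cost scheme, and the sample strings are assumptions made for this example.

```python
from typing import List, Tuple

Op = Tuple[str, str, str]  # (kind, ground-truth char, OCR char)


def align(truth: str, ocr: str) -> Tuple[int, List[Op]]:
    """Align ground truth with OCR output using unit-cost edit distance.

    Hypothetical helper: returns the minimum number of edits and the
    per-character operations (match, substitution, deletion, insertion).
    """
    n, m = len(truth), len(ocr)
    # cost[i][j] = minimum edits turning truth[:i] into ocr[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = cost[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the final cell to recover one optimal alignment.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])):
            kind = "match" if truth[i - 1] == ocr[j - 1] else "substitution"
            ops.append((kind, truth[i - 1], ocr[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            # Character present in the ground truth but lost by the OCR.
            ops.append(("deletion", truth[i - 1], ""))
            i -= 1
        else:
            # Character produced by the OCR that is absent from the ground truth.
            ops.append(("insertion", "", ocr[j - 1]))
            j -= 1
    ops.reverse()
    return cost[n][m], ops


if __name__ == "__main__":
    distance, ops = align("the.", "the")
    errors = [op for op in ops if op[0] != "match"]
    print(distance, errors)
```

In this made-up example the only error is a dropped period, the kind of single-character loss that could plausibly affect a later stage such as sentence boundary detection.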