Consistent Partition and Labelling of Text Blocks

Authors:	J. Liang, I. T. Phillips, R. M. Haralick

Affiliation:	(1) MathSoft, Inc., Seattle, WA, USA; (2) Department of Computer Science/Software Engineering, Seattle University, Seattle, WA, USA; (3) Department of Electrical Engineering, University of Washington, Seattle, WA, USA

Abstract:	This paper presents a text block extraction algorithm that takes as its input a set of text lines of a given document, and partitions the text lines into a set of text blocks, where each text block is associated with a set of homogeneous formatting attributes, e.g. text-alignment, indentation. The text block extraction algorithm described in this paper is probability based. We adopt an engineering approach to systematically characterising the text block structures based on a large document image database, and develop statistical methods to extract the text block structures from the image. All the probabilities are estimated from an extensive training set of various kinds of measurements among the text lines, and among the text blocks in the training data set. The off-line probabilities estimated in the training then drive all decisions in the on-line text block extraction. An iterative, relaxation-like method is used to find the partitioning solution that maximizes the joint probability. To evaluate the performance of our text block extraction algorithm, we used a three-fold validation method and developed a quantitative performance measure. The algorithm was evaluated on the UW-III database of some 1600 scanned document image pages. The text block extraction algorithm identifies and segments 91% of text blocks correctly.
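The iterative, relaxation-like search described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the authors' model: the line representation (`top`, `bottom`, `left`), the hand-picked logistic scoring in `pair_logp`, and the coordinate-wise flipping of block boundaries in `partition`. In the paper the probabilities are estimated from training data rather than set by hand.

```python
import math

def pair_logp(a, b, same_block):
    """Hypothetical log-probability that adjacent lines a and b do (or do not) share a block."""
    gap = b["top"] - a["bottom"]          # vertical whitespace between the two lines
    indent = abs(b["left"] - a["left"])   # difference in left alignment
    # Toy logistic model with made-up weights (an assumption, not trained values).
    score = 2.0 - 0.4 * gap - 0.1 * indent
    p_same = 1.0 / (1.0 + math.exp(-score))
    return math.log(p_same if same_block else 1.0 - p_same)

def joint_logp(lines, breaks):
    """Joint log-probability of a partition; breaks[i] is True if a block ends after line i."""
    return sum(
        pair_logp(lines[i], lines[i + 1], not breaks[i])
        for i in range(len(lines) - 1)
    )

def partition(lines, max_iters=10):
    """Flip one boundary at a time, keeping a flip only if the joint log-probability rises."""
    if not lines:
        return [], 0.0
    breaks = [False] * (len(lines) - 1)   # start with all lines in a single block
    best = joint_logp(lines, breaks)
    for _ in range(max_iters):
        changed = False
        for i in range(len(breaks)):
            breaks[i] = not breaks[i]
            candidate = joint_logp(lines, breaks)
            if candidate > best:
                best, changed = candidate, True
            else:
                breaks[i] = not breaks[i]  # revert a flip that did not help
        if not changed:                    # converged: no single flip improves the score
            break
    blocks = [[0]]
    for i, is_break in enumerate(breaks):
        if is_break:
            blocks.append([i + 1])
        else:
            blocks[-1].append(i + 1)
    return blocks, best

# Five 10-unit-high lines; a large vertical gap separates the third and fourth.
example_lines = [
    {"top": t, "bottom": t + 10, "left": 0} for t in (0, 12, 24, 60, 72)
]
example_blocks, example_score = partition(example_lines)
```

Because this toy score is a sum of independent pairwise terms, the search settles after a single sweep. A model with block-level terms, such as the homogeneous formatting attributes the abstract mentions, would make the repeated sweeps matter.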

Keywords:	Document structure; Hidden Markov Model; Layout analysis; Statistical-based; Text block extraction; UW-III database
This article is indexed in SpringerLink and other databases.