A word extraction algorithm for machine-printed documents using a 3D neighborhood graph model |
| |
Authors: | Hwan-Chul Park Se-Young Ok Young-Jung Yu Hwan-Gue Cho |
| |
Affiliation: | (1) R&D Center, PAXVR, Seocho Jeil B/D, 1624-2, Seocho-Dong, Seocho-Ku, Seoul 137-878, Korea, KR;(2) LG Innotek, Yongin-shi, Kyunggi-do, Korea, KR;(3) Graphics Application Lab., Department of Computer Science, Pusan National University, Kum-Jung-Ku, Pusan 609-735, Korea, KR |
| |
Abstract: | Automatic character recognition and image understanding of a given paper document are the main objectives of the computer vision field. For these problems, a basic step is to isolate characters and group words from these isolated characters. In this paper, we propose a new method for extracting characters from a mixed text/graphic machine-printed document and an algorithm for distinguishing words from the isolated characters. For extracting characters, we exploit several features (size, elongation, and density) of characters and propose a characteristic value for classification using the run-length frequency of the image component. In the context of word grouping, previous works have largely been concerned with words which are placed on a horizontal or vertical line. Our word grouping algorithm can group words which are on inclined lines, intersecting lines, and even curved lines. To do this, we introduce the 3D neighborhood graph model which is very useful and efficient for character classification and word grouping. In the 3D neighborhood graph model, each connected component of a text image segment is mapped onto 3D space according to the area of the bounding box and positional information from the document. We conducted tests with more than 20 English documents and more than ten oriental documents scanned from books, brochures, and magazines. Experimental results show that more than 95% of words are successfully extracted from general documents, even in very complicated oriental documents. Received August 3, 2001 / Accepted August 8, 2001 |
| |
Keywords: | Document analysis – Text extraction – 3D Neighborhood graph – Word grouping |
This article is indexed in SpringerLink and other databases. |
|
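The abstract describes mapping each connected component into 3D space using its bounding-box area and its position in the document, and classifying components by size, elongation, and density. The following is a minimal sketch of that idea, not the authors' implementation. The abstract does not give the exact axis scaling or the neighbor criterion, so the names `Component`, `point3d`, and `neighborhood_graph`, the use of the square root of the area as the third coordinate, and the Euclidean `radius` test are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist, sqrt


@dataclass(frozen=True)
class Component:
    """A connected component, described by its bounding box and ink pixel count."""
    x: int        # left edge of the bounding box
    y: int        # top edge of the bounding box
    width: int    # bounding-box width in pixels (assumed >= 1)
    height: int   # bounding-box height in pixels (assumed >= 1)
    pixels: int   # number of foreground pixels inside the box

    @property
    def area(self):
        return self.width * self.height

    @property
    def elongation(self):
        # Ratio of the longer side to the shorter side (always >= 1.0).
        return max(self.width, self.height) / min(self.width, self.height)

    @property
    def density(self):
        # Fraction of the bounding box covered by foreground pixels.
        return self.pixels / self.area

    def point3d(self):
        # Assumed mapping: box centre for (x, y) and sqrt(area) for z,
        # so that all three axes are expressed in pixel units.
        return (self.x + self.width / 2,
                self.y + self.height / 2,
                sqrt(self.area))


def neighborhood_graph(components, radius):
    """Link every pair of components whose 3D points lie within `radius`."""
    points = [c.point3d() for c in components]
    graph = {i: set() for i in range(len(components))}
    for i, j in combinations(range(len(components)), 2):
        if dist(points[i], points[j]) <= radius:
            graph[i].add(j)
            graph[j].add(i)
    return graph


# Two adjacent letter-sized boxes and one large graphic placed next to them.
components = [
    Component(x=0, y=0, width=10, height=10, pixels=60),
    Component(x=12, y=0, width=10, height=10, pixels=55),
    Component(x=24, y=0, width=40, height=40, pixels=900),
]
graph = neighborhood_graph(components, radius=15)
```

In this sketch the two letter-sized components are linked, while the large graphic stays isolated even though it sits right beside them, because its bounding-box area pushes it away along the third axis. Under these assumptions, that separation is what would let a single distance test keep characters and graphics apart, and it does not depend on the text line being horizontal.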