Document retrieval from compressed images |
| |
Authors: | Yue Lu, Chew Lim Tan |
| |
Affiliation: | Department of Computer Science, School of Computing, National University of Singapore, Kent Ridge, 117543 Singapore |
| |
Abstract: | With the emergence of digital libraries, more and more documents are stored and transmitted over the Internet as compressed images. It is therefore important to develop a system capable of retrieving documents from these compressed document images. Targeting CCITT Group 4, a popular standard widely used for compressing document images, this paper presents an approach to retrieving documents from CCITT Group 4 compressed document images. The black and white changing elements are extracted directly from the compressed document images to serve as feature pixels, and the connected components are detected at the same time. Word bounding boxes are then formed by merging the connected components. A weighted Hausdorff distance is proposed, with which an unsupervised classifier assigns all word objects from both the query document and the database document to corresponding classes, while probable stop words are excluded. Document vectors are built from the occurrence frequencies of the word object classes, and the pairwise similarity of two document images is given by the scalar product of their document vectors. Nine groups of articles from different domains are used to test the validity of the approach. Preliminary experimental results on document images captured from students’ theses show that the proposed approach achieves promising performance. |
| |
Keywords: | Document image retrieval; Compressed image; Object matching; Document similarity; Weighted Hausdorff distance |
This article is indexed in ScienceDirect and other databases. |
|
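The final step described in the abstract, building document vectors from word-object class frequencies and comparing two documents by the scalar product of those vectors, can be illustrated with a minimal sketch. This is not the authors' implementation: the integer class labels, the function names, and the `stop_classes` parameter are hypothetical, and the length-normalized variant is an added assumption that the abstract does not mention.

```python
from collections import Counter
import math


def document_vector(word_classes, stop_classes=frozenset()):
    """Count occurrences of each word-object class, skipping stop-word classes.

    `word_classes` is assumed to be the list of class labels that a
    classifier has already assigned to the word objects of one document.
    """
    return Counter(c for c in word_classes if c not in stop_classes)


def scalar_product(u, v):
    """Scalar (dot) product of two sparse frequency vectors."""
    return sum(count * v[c] for c, count in u.items() if c in v)


def normalized_similarity(u, v):
    """Scalar product divided by the vector lengths (an assumed refinement)."""
    norm = math.sqrt(scalar_product(u, u) * scalar_product(v, v))
    return scalar_product(u, v) / norm if norm else 0.0


# Hypothetical class labels; class 1 stands in for a stop-word class.
query = document_vector([3, 7, 7, 12, 1], stop_classes={1})
stored = document_vector([7, 12, 12, 5, 1, 1], stop_classes={1})

print(scalar_product(query, stored))
print(normalized_similarity(query, stored))
```

Keeping the vectors sparse (a `Counter` keyed by class label) means only the classes that actually occur in a document are stored, which suits a setting where the number of word-object classes may be large.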