Classification of document pages using structure-based features

Authors:	Christian Shin, David Doermann, Azriel Rosenfeld

Affiliation:	(1) Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA

Abstract:	Searching for documents by their type or genre is a natural way to enhance the effectiveness of document retrieval. The layout of a document contains a significant amount of information that can be used to classify it by type in the absence of domain-specific models. Our approach to classification is based on “visual similarity” of layout structure and is implemented by building a supervised classifier, given examples of each class. We use image features such as percentages of text and non-text (graphics, images, tables, and rulings) content regions, column structures, relative point sizes of fonts, density of content area, and statistics of features of connected components which can be derived without class knowledge. In order to obtain class labels for training samples, we conducted a study in which subjects ranked document pages with respect to their resemblance to representative page images. Class labels can also be assigned based on known document types, or can be defined by the user. We implemented our classification scheme using decision tree classifiers and self-organizing maps.

Received June 15, 2000 / Revised November 15, 2000

Keywords:	Document image categorization – Document image databases and retrieval – Layout structures – Visual similarity – Similarity searching – Decision tree classifiers – Self-organizing maps
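
Illustration:	The sketch below is not the authors' implementation. It only illustrates how a few of the layout features named in the abstract (percentage of text versus non-text content and density of content area) might be computed once a page has been segmented into regions. The `Region` class, the `layout_features` function, and the assumption that regions are non-overlapping rectangles are all hypothetical and introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical segmented region: a rectangle with a content kind."""
    width: int
    height: int
    kind: str  # e.g. "text", "graphics", "image", "table", "ruling"

    @property
    def area(self) -> int:
        return self.width * self.height


def layout_features(regions, page_width, page_height):
    """Compute a small feature dictionary for one page.

    Assumes the regions do not overlap, so their areas can simply be summed.
    """
    page_area = page_width * page_height
    content_area = sum(r.area for r in regions)
    text_area = sum(r.area for r in regions if r.kind == "text")

    if page_area == 0 or content_area == 0:
        # Blank page: no content, so every feature is zero.
        return {"text_pct": 0.0, "non_text_pct": 0.0, "content_density": 0.0}

    return {
        # Share of the content area occupied by text regions.
        "text_pct": text_area / content_area,
        # Share occupied by graphics, images, tables, and rulings.
        "non_text_pct": (content_area - text_area) / content_area,
        # Fraction of the whole page covered by any content.
        "content_density": content_area / page_area,
    }


# Example page: 100 x 100 pixels with one text block and one image.
example = layout_features(
    [Region(50, 40, "text"), Region(50, 20, "image")],
    page_width=100,
    page_height=100,
)
```

Feature dictionaries of this kind, one per page, could then be passed with their class labels to any supervised learner; the abstract names decision tree classifiers and self-organizing maps as the ones the authors used.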
This article is indexed in SpringerLink and other databases.
