An alternative, layout‐driven approach to the clustering of documents |
| |
Authors: | Vincenzo Loia Sabrina Senatore |
| |
Affiliation: | Dipartimento di Matematica e Informatica, Università degli Studi di Salerno, via Ponte Don Melillo, 84084 Fisciano (SA), Italy |
| |
Abstract: | The Internet has become a huge repository of information and knowledge, based on the sharing of electronic documents. Recent trends in knowledge management focus on knowledge representation based on document content. Most established approaches achieve document understanding by analyzing the “portions of information” in the document that describe its content, through text parsing and extraction techniques. This paper presents an alternative approach that departs from these consolidated techniques of document management and focuses on the logical structure of a PDF document as a discriminating source of document knowledge. The main idea is based on the fact that, when a reader looks at a paper, the first perception is related to the layout of the document. The analysis of the layout, typesetting, pagination, and graphical arrangement of a document provides useful information for understanding its content; in general, documents in the same category present similar page layouts, fonts, and figure arrangements. In this sense, this work presents an alternative way to deal with document recognition and understanding, through the analysis of the layout of electronic PDF documents and their classification. © 2008 Wiley Periodicals, Inc. |