An alternative, layout-driven approach to the clustering of documents
Authors: Vincenzo Loia, Sabrina Senatore
Affiliation: Dipartimento di Matematica e Informatica, Università degli Studi di Salerno, via Ponte Don Melillo, 84084 Fisciano (SA), Italy
Abstract: The Internet has become a huge repository of information and knowledge, based on the sharing of electronic documents. Recent trends in knowledge management focus on knowledge representation based on document content. In fact, most established approaches achieve document understanding by analyzing the “portions of information” in the document that describe its content, through text parsing and extraction techniques. This paper presents an alternative approach that departs from the consolidated techniques of document management and focuses on the logical structure of a PDF document as a discriminating source of document knowledge. The main idea is based on the fact that, when a reader looks at a paper, the first perception is related to the layout of the document. The analysis of the layout, typesetting, pagination, and graphical arrangement of a document provides useful information for understanding its content; in general, documents in the same category present similar page layouts, fonts, and figure arrangements. In this sense, this work presents an alternative way to deal with document recognition and understanding, through the analysis of the layout of electronic PDF documents and their classification. © 2008 Wiley Periodicals, Inc.