Transforming paper documents into XML format with WISDOM++
Authors: Oronzo Altamura, Floriana Esposito, Donato Malerba
Affiliation: (1) Dipartimento di Informatica, Università degli Studi di Bari, via Orabona 4, 70126 Bari, Italy; e-mail: {altamura,esposito,malerba}@di.uniba.it
Abstract: The transformation of scanned paper documents into a form suitable for an Internet browser is a complex process that requires solutions to several problems. Applying OCR to some parts of the document image is only one of them. In fact, generating documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. Adopting an XML format is even better, since it can facilitate the retrieval of documents on the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmark of the system components implementing these innovative aspects is reported.
Received: June 15, 2000 / Revised: November 7, 2000
Keywords: Document image analysis – Layout analysis – Induction of decision trees – Transformation into HTML/XML format
This article is indexed in SpringerLink and other databases.