Automatic document classification and indexing in high-volume applications |
| |
Authors: | E Appiani F Cesarini AM Colla M Diligenti M Gori S Marinai G Soda |
| |
Affiliation: | (1) Elsag spa TRI Department, Via G. Puccini, 2, 16154 Genova, Italy; e-mail: {enrico.appiani,annamaria.colla}@elsag.it, IT;(2) DSI, Università di Firenze, Via S. Marta, 3, 50139 Firenze, Italy; e-mail: {cesarini,simone,giovanni}@dsi.unifi.it, IT;(3) DII, Università di Siena, Via Roma, 56, 53100 Siena, Italy; e-mail: {diligmic,marco}@ultrA3.dii.unisi.it, IT |
| |
Abstract: | In this paper a system for analysis and automatic indexing of imaged documents for high-volume applications is described.
This system, named STRETCH (STorage and RETrieval by Content of imaged documents), is based on an Archiving and Retrieval Engine, which overcomes the bottleneck of document profiling by bypassing some limitations of existing pre-defined indexing schemes.
The engine exploits a structured document representation and can activate appropriate methods to characterise and automatically
index heterogeneous documents with variable layout. The originality of STRETCH lies principally in the possibility for unskilled
users to define the indexes relevant to the document domains of their interest by simply presenting visual examples and applying
reliable automatic information extraction methods (document classification, flexible reading strategies) to index the documents
automatically, thus creating archives as desired. STRETCH offers ease of use and application programming and the ability to
dynamically adapt to new types of documents. The system has been tested in two applications in particular, one concerning
passive invoices and the other bank documents. In these applications, several classes of documents are involved. The indexing
strategy first automatically classifies the document, thus avoiding pre-sorting, then locates and reads the information pertaining
to the specific document class. Experimental results are encouraging overall; in particular, document classification results
fulfill the requirements of high-volume applications. Integration into production lines is under way.
Received March 30, 2000 / Revised June 26, 2001 |
| |
Keywords: | Document classification – Decision tree – MXY tree – Reading strategy |
This article is indexed in SpringerLink and other databases. |
|
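
The indexing strategy in the abstract (classify the document first, then locate and read only the fields that belong to its class) can be illustrated with a minimal sketch. This is not the STRETCH implementation: the names `FieldSpec`, `classify`, `read_fields` and `index_document` are hypothetical, zones are assumed to arrive as `(x, y, text)` triples from some upstream layout analysis and OCR step, and a simple keyword count stands in for the layout-based classifier (decision tree over an MXY tree) named in the keywords.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed input: one (x, y, text) triple per text zone found on the page.
Zone = Tuple[int, int, str]


@dataclass
class FieldSpec:
    """Hypothetical per-class index field: a name and the page region to read."""
    name: str
    region: Tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive


def classify(zones: List[Zone], class_keywords: Dict[str, List[str]]) -> str:
    """Toy stand-in classifier: pick the class whose keywords occur most often."""
    text = " ".join(z[2].lower() for z in zones)
    scores = {
        label: sum(text.count(keyword) for keyword in keywords)
        for label, keywords in class_keywords.items()
    }
    return max(scores, key=scores.get)


def read_fields(zones: List[Zone], specs: List[FieldSpec]) -> Dict[str, str]:
    """Collect the text of every zone whose origin falls inside a field's region."""
    index = {}
    for spec in specs:
        x0, y0, x1, y1 = spec.region
        hits = [t for (x, y, t) in zones if x0 <= x < x1 and y0 <= y < y1]
        index[spec.name] = " ".join(hits)
    return index


def index_document(
    zones: List[Zone],
    class_keywords: Dict[str, List[str]],
    class_fields: Dict[str, List[FieldSpec]],
) -> Dict[str, str]:
    """Step 1: classify (no pre-sorting). Step 2: read only that class's fields."""
    doc_class = classify(zones, class_keywords)
    return {"class": doc_class, **read_fields(zones, class_fields[doc_class])}
```

Because the class is decided before any field is read, each document class can carry its own list of regions, and a new class is added by supplying another entry in `class_keywords` and `class_fields` rather than by changing the pipeline.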