Automatic document classification and indexing in high-volume applications
Authors: E. Appiani, F. Cesarini, A.M. Colla, M. Diligenti, M. Gori, S. Marinai, G. Soda
Affiliation: (1) Elsag spa TRI Department, Via G. Puccini, 2, 16154 Genova, Italy; e-mail: {enrico.appiani,annamaria.colla}@elsag.it; (2) DSI, Università di Firenze, Via S. Marta, 3, 50139 Firenze, Italy; e-mail: {cesarini,simone,giovanni}@dsi.unifi.it; (3) DII, Università di Siena, Via Roma, 56, 53100 Siena, Italy; e-mail: {diligmic,marco}@ultrA3.dii.unisi.it
Abstract: This paper describes a system for the analysis and automatic indexing of imaged documents in high-volume applications. The system, named STRETCH (STorage and RETrieval by Content of imaged documents), is based on an Archiving and Retrieval Engine that overcomes the bottleneck of document profiling by bypassing some limitations of existing pre-defined indexing schemes. The engine exploits a structured document representation and can activate appropriate methods to characterise and automatically index heterogeneous documents with variable layout. The originality of STRETCH lies principally in letting unskilled users define the indexes relevant to the document domains of their interest simply by presenting visual examples; reliable automatic information extraction methods (document classification, flexible reading strategies) are then applied to index the documents automatically, creating archives as desired. STRETCH offers ease of use and of application programming, and it can adapt dynamically to new types of documents. The system has been tested in two applications in particular, one concerning passive invoices and the other bank documents, each involving several classes of documents. The indexing strategy first classifies the document automatically, thus avoiding pre-sorting, and then locates and reads the information pertaining to the specific document class. Experimental results are encouraging overall; in particular, the document classification results fulfill the requirements of high-volume applications. Integration into production lines is under way.
Received: March 30, 2000 / Revised: June 26, 2001
Keywords: Document classification, Decision tree, MXY tree, Reading strategy
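
The indexing strategy summarised in the abstract (classify the document first, then locate and read only the fields defined for that class) can be illustrated with a minimal Python sketch. It is not the STRETCH implementation: every name below (`Document`, `classify`, `read_after`, `index_document`) is hypothetical, the document is reduced to plain text standing in for OCR output, and keyword counting is used as a placeholder for the layout-based classification (decision tree, MXY tree) named in the keywords.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for an imaged document: just its OCR text."""
    text: str


# A field reader takes a document and returns the extracted value ("" if absent).
FieldReader = Callable[[Document], str]


def classify(doc: Document, class_keywords: Dict[str, List[str]]) -> str:
    """Toy classifier: choose the class whose keywords occur most often.

    Placeholder only; the paper classifies on document layout, not keywords.
    """
    lowered = doc.text.lower()
    scores = {
        cls: sum(lowered.count(kw) for kw in keywords)
        for cls, keywords in class_keywords.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def read_after(label: str) -> FieldReader:
    """Build a reader that returns the token following `label` (lowercase)."""
    def reader(doc: Document) -> str:
        tokens = doc.text.split()
        for i, token in enumerate(tokens[:-1]):
            if token.lower() == label:
                return tokens[i + 1]
        return ""
    return reader


def index_document(
    doc: Document,
    class_keywords: Dict[str, List[str]],
    readers: Dict[str, Dict[str, FieldReader]],
) -> dict:
    """Classify first, then apply only the readers registered for that class."""
    doc_class = classify(doc, class_keywords)
    class_readers = readers.get(doc_class, {})  # no readers for unknown classes
    fields = {name: read(doc) for name, read in class_readers.items()}
    return {"class": doc_class, "fields": fields}


# Assumed example configuration for the two application domains in the abstract.
CLASS_KEYWORDS = {"invoice": ["invoice", "vat"], "bank": ["account", "iban"]}
READERS = {
    "invoice": {"total": read_after("total:")},
    "bank": {"iban": read_after("iban:")},
}

if __name__ == "__main__":
    sample = Document("INVOICE no. 17 VAT 22% Total: 120.00")
    print(index_document(sample, CLASS_KEYWORDS, READERS))
```

Because the class is determined before any field is read, documents of mixed types can be fed in without manual pre-sorting, and supporting a new document type in this sketch only means adding one entry to each of the two configuration dictionaries.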
This article is indexed in SpringerLink and other databases.