Genre identification for office document search and browsing |
| |
Authors: | Francine Chen, Andreas Girgensohn, Matthew Cooper, Yijuan Lu, Gerry Filby |
| |
Affiliation: | 1. FX Palo Alto Laboratory, Inc., 3400 Hillview Ave, Bldg. 4, Palo Alto, CA, 94304, USA; 2. Texas State University, San Marcos, TX, 78747, USA |
| |
Abstract: | When searching or browsing documents, the genre of a document is an important consideration that complements topical characterization. We examine design considerations for automatic tagging of office document pages with genre membership. These include selecting features that characterize genre-related information in office documents, examining the utility of text-based features and image-based features, and proposing a simple ensemble method to improve the performance of genre identification. Experiments were conducted on the open-set identification of four coarse office document genres: technical paper, photo, slide, and table. Our experiments show that when combined with image-based features, text-based features do not significantly influence performance. These results provide support for a topic-independent approach to identification of coarse office document genres. Experiments also show that our simple ensemble method significantly improves performance relative to using a support vector machine (SVM) classifier alone. We demonstrate the utility of our approach by integrating our automatic genre tags in a faceted search and browsing application for office document collections. |
| |
Keywords: | |
This article is indexed in SpringerLink and other databases. |
|