Genre identification for office document search and browsing
Authors: Francine Chen, Andreas Girgensohn, Matthew Cooper, Yijuan Lu, Gerry Filby
Affiliation: 1. FX Palo Alto Laboratory, Inc., 3400 Hillview Ave, Bldg. 4, Palo Alto, CA, 94304, USA
2. Texas State University, San Marcos, TX, 78747, USA
Abstract: When searching or browsing documents, the genre of a document is an important consideration that complements topical characterization. We examine design considerations for automatic tagging of office document pages with genre membership. These include selecting features that characterize genre-related information in office documents, examining the utility of text-based features and image-based features, and proposing a simple ensemble method to improve the performance of genre identification. Experiments were conducted on the open-set identification of four coarse office document genres: technical paper, photo, slide, and table. Our experiments show that when combined with image-based features, text-based features do not significantly influence performance. These results provide support for a topic-independent approach to identification of coarse office document genres. Experiments also show that our simple ensemble method significantly improves performance relative to using a support vector machine (SVM) classifier alone. We demonstrate the utility of our approach by integrating our automatic genre tags in a faceted search and browsing application for office document collections.
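
Illustrative sketch (not the authors' method): the abstract does not specify how the ensemble is built or which features are used, so the following Python example only illustrates the general idea of open-set genre tagging, in which a page may receive one genre tag, several, or none. It assumes a hypothetical majority vote over several binary scorers per genre. The two features (text density, image area fraction), the weights, and the names `make_linear`, `ensemble_vote`, and `tag_page` are all invented for illustration.

```python
from typing import Callable, Dict, List, Sequence

# A binary scorer maps a feature vector to a signed score;
# a score > 0 is read as "this page belongs to the genre".
Scorer = Callable[[Sequence[float]], float]


def make_linear(weights: Sequence[float], bias: float) -> Scorer:
    """Build a toy linear scorer (a stand-in for a trained classifier)."""
    def score(features: Sequence[float]) -> float:
        return sum(w * x for w, x in zip(weights, features)) + bias
    return score


def ensemble_vote(scorers: List[Scorer], features: Sequence[float]) -> bool:
    """Accept the genre only when a strict majority of scorers vote positive."""
    votes = sum(1 for s in scorers if s(features) > 0)
    return votes * 2 > len(scorers)


def tag_page(ensembles: Dict[str, List[Scorer]],
             features: Sequence[float]) -> List[str]:
    """Open-set tagging: return every genre whose ensemble accepts the page.

    The result may be empty when no genre ensemble accepts the page.
    """
    return [genre for genre, scorers in ensembles.items()
            if ensemble_vote(scorers, features)]


# Hypothetical features: [text_density, image_area_fraction].
# Only two of the four genres are modelled to keep the sketch short.
ensembles = {
    "technical paper": [
        make_linear([1.0, -1.0], -0.2),
        make_linear([1.0, 0.0], -0.5),
        make_linear([0.0, -1.0], 0.4),
    ],
    "photo": [
        make_linear([-1.0, 1.0], -0.2),
        make_linear([0.0, 1.0], -0.6),
        make_linear([-1.0, 0.0], 0.3),
    ],
}

print(tag_page(ensembles, [0.9, 0.05]))  # text-heavy page
print(tag_page(ensembles, [0.1, 0.9]))   # image-heavy page
print(tag_page(ensembles, [0.4, 0.5]))   # ambiguous page, may get no tag
```

Because each genre is decided independently, an ambiguous page can be left untagged instead of being forced into the nearest genre, which is the property that distinguishes open-set identification from ordinary multi-class classification.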
This article is indexed in SpringerLink and other databases.