Post-Supervised Template Induction for Information Extraction from Lists and Tables in Dynamic Web Sources
Authors: Z. Shi, E. Milios, N. Zincir-Heywood
Affiliation: (1) Faculty of Computer Science, Dalhousie University, Halifax, N.S., B3H 1W5, Canada
Abstract: Dynamic web sites commonly return information in the form of lists and tables. Hand-crafting an extraction program for a specific template is straightforward but time-consuming, so it is desirable to generate template extraction programs automatically from examples of lists and tables in HTML documents. Supervised approaches have been shown to achieve high accuracy, but they require manual labelling of training examples, which is also time-consuming. Fully unsupervised approaches, which extract rows and columns by detecting regularities in the data, cannot provide sufficient accuracy for practical domains. We describe a novel technique, Post-supervised Learning, which exploits unsupervised learning to avoid the need for training examples while involving the user only minimally to achieve high accuracy. We have developed unsupervised algorithms to extract the number of rows and adopted a dynamic programming algorithm for extracting columns. Our method achieves high performance while requiring minimal user input compared to fully supervised techniques.
Keywords: information extraction, grammar induction, template induction, unsupervised learning
This article is indexed in SpringerLink and other databases.