首页 | 本学科首页   官方微博 | 高级检索  
     


TEG—a hybrid approach to information extraction
Authors:Ronen Feldman  Benjamin Rosenfeld  Moshe Fresko
Affiliation:(1) Computer Science Department, Bar-Ilan University, Ramat Gan, 52900, Israel
Abstract:This paper describes a hybrid statistical and knowledge-based information extraction model, able to extract entities and relations at the sentence level. The model attempts to retain and improve the high accuracy levels of knowledge-based systems while drastically reducing the amount of manual labour by relying on statistics drawn from a training corpus. The implementation of the model, called TEG (trainable extraction grammar), can be adapted to any IE domain by writing a suitable set of rules in a SCFG (stochastic context-free grammar)-based extraction language and training them using an annotated corpus. The system does not contain any purely linguistic components, such as PoS tagger or shallow parser, but allows to using external linguistic components if necessary. We demonstrate the performance of the system on several named entity extraction and relation extraction tasks. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and smaller amounts of training data. We also demonstrate the robustness of our system under conditions of poor training-data quality. Ronen Feldman is a senior lecturer at the Mathematics and Computer Science Department of Bar-Ilan University in Israel, and the Director of the Data Mining Laboratory. He received his B.Sc. in Math, Physics and Computer Science from the Hebrew University, M.Sc. in Computer Science from Bar-Ilan University, and his Ph.D. in Computer Science from Cornell University in NY. He was an Adjunct Professor at NYU Stern Business School. He is the founder of ClearForest Corporation, a Boston based company specializing in development of text mining tools and applications. He has given more than 30 tutorials on next mining and information extraction and authored numerous papers on these topics. He is currently finishing his book “The Text Mining Handbook” to the published by Cambridge University Press. Benjamin Rosenfeld is a research scientist at ClearForest Corporation. He received his B.Sc. in Mathematics and Computer Science from Bar-Ilan University. He is the co-inventor of the DIAL information extraction language. Moshe Fresko is finalizing his Ph.D. in Computer Science Department at Bar-Ilan University in Israel. He received his B.Sc. in Computer Engineering from Bogazici University, Istanbul/Turkey on 1991, and M.Sc. on 1994. He is also an adjunct lecturer at the Computer Science Department of Bar-Ilan University and functions as the Information-Extraction Group Leader in the Data Mining Laboratory.
Keywords:Text mining  Information extraction  Hidden Markov models  Rule bases systems  Hybrid approaches  Stochastic context free grammars
本文献已被 SpringerLink 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号