TEG—a hybrid approach to information extraction |
| |
Authors: | Ronen Feldman, Benjamin Rosenfeld, Moshe Fresko |
| |
Affiliation: | (1) Computer Science Department, Bar-Ilan University, Ramat Gan, 52900, Israel |
| |
Abstract: | This paper describes a hybrid statistical and knowledge-based information extraction model, able to extract entities and relations
at the sentence level. The model attempts to retain and improve the high accuracy levels of knowledge-based systems while
drastically reducing the amount of manual labour by relying on statistics drawn from a training corpus. The implementation
of the model, called TEG (trainable extraction grammar), can be adapted to any IE domain by writing a suitable set of rules
in an SCFG (stochastic context-free grammar)-based extraction language and training them using an annotated corpus. The system
does not contain any purely linguistic components, such as a PoS tagger or shallow parser, but allows external linguistic
components to be used if necessary. We demonstrate the performance of the system on several named entity extraction and relation extraction
tasks. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems,
while requiring orders of magnitude less manual rule writing and smaller amounts of training data. We also demonstrate the
robustness of our system under conditions of poor training-data quality.
Ronen Feldman is a senior lecturer at the Mathematics and Computer Science Department of Bar-Ilan University in Israel, and the Director
of the Data Mining Laboratory. He received his B.Sc. in Math, Physics and Computer Science from the Hebrew University, M.Sc.
in Computer Science from Bar-Ilan University, and his Ph.D. in Computer Science from Cornell University in NY. He was an Adjunct
Professor at NYU Stern Business School. He is the founder of ClearForest Corporation, a Boston-based company specializing
in the development of text mining tools and applications. He has given more than 30 tutorials on text mining and information extraction
and authored numerous papers on these topics. He is currently finishing his book “The Text Mining Handbook”, to be published
by Cambridge University Press.
Benjamin Rosenfeld is a research scientist at ClearForest Corporation. He received his B.Sc. in Mathematics and Computer Science from Bar-Ilan
University. He is the co-inventor of the DIAL information extraction language.
Moshe Fresko is finalizing his Ph.D. in the Computer Science Department at Bar-Ilan University in Israel. He received his B.Sc. in Computer
Engineering from Bogazici University, Istanbul, Turkey, in 1991, and his M.Sc. in 1994. He is also an adjunct lecturer at the Computer
Science Department of Bar-Ilan University and functions as the Information-Extraction Group Leader in the Data Mining Laboratory. |
| |
Keywords: | Text mining; Information extraction; Hidden Markov models; Rule-based systems; Hybrid approaches; Stochastic context-free grammars |
This article is indexed in SpringerLink and other databases. |
|