A web-based Bengali news corpus for named entity recognition |
| |
Authors: | Asif Ekbal, Sivaji Bandyopadhyay |
| |
Affiliation: | (1) Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India |
| |
Abstract: | The rapid development of language resources and tools using machine learning techniques for less computerized languages requires appropriately tagged corpora. A tagged Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. A web crawler retrieves the web pages in Hyper Text Markup Language (HTML) format from the news archive. At present, the corpus contains approximately 34 million wordforms. Named Entity Recognition (NER) systems based on pattern-based shallow parsing, with and without linguistic knowledge, have been developed using a part of this corpus. The NER system that uses linguistic knowledge performed better, yielding its highest F-Score values of 75.40%, 72.30%, 71.37%, and 70.13% for person, location, organization, and miscellaneous names, respectively. |
| |
Keywords: | Web as corpus; News corpus; Web-based tagged Bengali news corpus; Named entity; Named entity recognition |
This article is indexed in SpringerLink and other databases. |
|
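
The abstract reports per-category F-Score values without giving the underlying precision and recall. As a reminder of how such a figure is usually derived, here is a minimal sketch that assumes the standard F-measure (the balanced F1 when beta = 1); the abstract does not state which variant the authors used. The function name `f_score` and the counts in the usage lines are hypothetical and are not taken from the paper.

```python
def f_score(true_positives: int, false_positives: int,
            false_negatives: int, beta: float = 1.0) -> float:
    """Return the F-beta score computed from entity-level counts.

    Assumption: the standard definition
        F = (1 + beta^2) * P * R / (beta^2 * P + R)
    where P is precision and R is recall. beta = 1 gives the balanced F1.
    """
    predicted = true_positives + false_positives   # entities the system tagged
    actual = true_positives + false_negatives      # entities in the gold data
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for one entity category (illustration only).
score = f_score(true_positives=75, false_positives=25, false_negatives=25)
print(f"{score:.2%}")
```

With these made-up counts, precision and recall are both 0.75, so the balanced F-Score is also 0.75.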