A web-based Bengali news corpus for named entity recognition
Authors:Asif Ekbal  Sivaji Bandyopadhyay
Affiliation:(1) Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India
Abstract:The rapid development of language resources and tools using machine learning techniques for less computerized languages requires an appropriately tagged corpus. A tagged Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. A web crawler retrieves the web pages in Hyper Text Markup Language (HTML) format from the news archive. At present, the corpus contains approximately 34 million wordforms. Two Named Entity Recognition (NER) systems based on pattern-based shallow parsing, one using linguistic knowledge and one without it, have been developed using a part of this corpus. The NER system that uses linguistic knowledge performed better, yielding its highest F-Score values of 75.40%, 72.30%, 71.37%, and 70.13% for person, location, organization, and miscellaneous names, respectively.
Contact Information:Sivaji Bandyopadhyay
Keywords:Web as corpus  News corpus  Web-based tagged Bengali news corpus  Named entity  Named entity recognition
This article is indexed in SpringerLink and other databases.
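
The abstract reports per-category F-Score values but does not say how they were computed. As a minimal sketch, assuming the standard balanced F-score (the harmonic mean of precision and recall), the calculation looks like the following. The function name `f_score` and the counts in the usage lines are hypothetical illustrations, not data from the paper.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1.0 gives the balanced F-score (F1).

    Assumption: the abstract's "F-Score" is the usual harmonic mean of
    precision and recall. The abstract itself does not state the formula.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical counts for one entity category (not taken from the paper):
true_positives, false_positives, false_negatives = 3, 1, 2
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
print(f"F-Score: {f_score(precision, recall):.2%}")
```

Because the score is a harmonic mean, it stays low unless precision and recall are both high, which is why it is commonly reported separately for each name category in NER evaluation.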