A web-based Bengali news corpus for named entity recognition

Authors:

Asif Ekbal, Sivaji Bandyopadhyay

Affiliation:

(1) Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India

Abstract:

The rapid development of language resources and tools using machine learning techniques for less computerized languages requires an appropriately tagged corpus. A tagged Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. A web crawler retrieves the web pages in Hyper Text Markup Language (HTML) format from the news archive. At present, the corpus contains approximately 34 million wordforms. Named Entity Recognition (NER) systems based on pattern-based shallow parsing, with and without linguistic knowledge, have been developed using a part of this corpus. The NER system that uses linguistic knowledge performed better, yielding the highest F-Score values of 75.40%, 72.30%, 71.37%, and 70.13% for person, location, organization, and miscellaneous names, respectively.
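The F-Score figures quoted in the abstract can be read with the standard definition in mind: the (optionally weighted) harmonic mean of precision and recall. The following is a minimal sketch of that definition, not the authors' evaluation code. The abstract reports only F-Scores, so the function names, the balanced default `beta = 1`, and the counts used in the usage note are illustrative assumptions.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall) from raw counts; 0.0 when a denominator is empty."""
    predicted = true_positives + false_positives   # entities the system proposed
    actual = true_positives + false_negatives      # entities in the gold annotation
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta).

    beta = 1.0 gives the balanced F1 measure; that default is an assumption,
    since the abstract does not state the weighting used.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denominator


if __name__ == "__main__":
    # Hypothetical counts for one entity class, chosen only for illustration.
    p, r = precision_recall(true_positives=80, false_positives=20, false_negatives=20)
    print(f"precision={p:.4f} recall={r:.4f} F={f_score(p, r):.4f}")
```

Computing the score separately for each entity class (person, location, organization, miscellaneous) from per-class counts would give one value per class, which matches the way the abstract reports its results.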

Keywords:

Web as corpus; News corpus; Web-based tagged Bengali news corpus; Named entity; Named entity recognition

This article is indexed in SpringerLink and other databases.
