Building a text collection for Urdu information retrieval
Authors: Imran Rasheed, Haider Banka, Hamaid M. Khan
Affiliation: 1. Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, India; 2. Aluteam, Fatih Sultan Mehmet Vakif University, Istanbul, Turkey
Abstract: Urdu is widely spoken in the Indian subcontinent and has over 300 million speakers worldwide. However, linguistic advancements in Urdu are rare compared with those in European and other Asian languages. Therefore, following Text Retrieval Conference (TREC) standards, we constructed an extensive text collection of 85,304 documents from diverse categories, covering over 52 topics, with relevance judgment sets at a pool depth of 100. We also present several applications to demonstrate the effectiveness of the collection. Although the collection is primarily intended for text retrieval, it can also be used, with suitable modifications, for named entity recognition, text summarization, and other linguistic applications. Ours is the most extensive existing collection for the Urdu language, and it will be freely available for future research and academic education.
Keywords: assessors agreement; relevance judgment; text collection construction and evaluation; Urdu corpus; Urdu information retrieval