Extracting an Arabic Lexicon from Arabic Newspaper Text |
| |
Authors: | Saleem Abuleil and Martha Evens |
| |
Affiliation: | (1) Chicago State University, 9501 S. King Drive, Chicago, IL 60628, USA; (2) Illinois Institute of Technology, 10 West 31 Street, Chicago, IL 60616, USA |
| |
Abstract: | We describe how to build a large, comprehensive, integrated Arabic lexicon by automatic parsing of newspaper text. We have built a parser system to read Arabic newspaper articles, isolate the tokens from them, and find the part of speech and the features for each token. To achieve this goal we designed a set of algorithms, generated several sets of rules, and developed a set of techniques and a set of components to carry out these techniques. As each sentence is processed, new words and features are added to the lexicon, so that it grows continuously as the system runs. To test the system we have used 100 articles (80,444 words) from the Al-Raya newspaper. The system consists of several modules: the tokenizer module to isolate the tokens, the type finder system to find the part of speech of each token, the proper noun phrase parser module to mark the proper nouns and to discover some information about them, and the feature finder module to find the features of the words. |
| |
Keywords: | morphology analyzer, parser, part of speech, proper nouns, tokenizer |
This article is indexed in SpringerLink and other databases. |
|