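Directed Graph Model of Uyghur Morphological Analysis (维吾尔语词法分析的有向图模型)
Cite this article: Mairehaba · AILI, JIANG Wen-Bin, WANG Zhi-Yang, Tuergen · YIBULAYIN, LIU Qun. 维吾尔语词法分析的有向图模型 [Directed Graph Model of Uyghur Morphological Analysis][J]. 软件学报 (Journal of Software), 2012, 23(12): 3115-3129.
Authors: Mairehaba · AILI  JIANG Wen-Bin  WANG Zhi-Yang  Tuergen · YIBULAYIN  LIU Qun
Affiliations: 1. College of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830046; Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190
2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; Graduate University of the Chinese Academy of Sciences, Beijing 100049
3. College of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830046
4. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190
Funding: National Natural Science Foundation of China; National Social Science Foundation of China; Electronic Development Fund of the Ministry of Industry and Information Technology (工信部财; Scientific Research Training Fund for Young Teachers of Xinjiang Universities; Xinjiang University Outstanding Doctoral Innovation Project Fund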
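Abstract: Uyghur is a typical agglutinative language with strong derivational capacity and rich morphological variation. It also obeys the rules of phonetic harmony, and word formation triggers phonetic changes such as weakening, epenthesis, and elision. These properties determine the difficulties of Uyghur morphological analysis: stem extraction, restoration of letters altered by phonetic change, and tagging. This paper introduces the hierarchical structure of Uyghur words into morphological analysis and proposes a directed graph model. The model describes Uyghur morphological analysis as a directed graph whose nodes represent stems, affixes, and their corresponding tags, and whose edges represent the transition or generation probabilities between nodes; these probabilities are the basis for selecting the best candidate. To handle the phonetic changes that occur during morphological variation, the paper also proposes an automatic restoration model based on an intra-word letter alignment algorithm. Under the assumption that phonetic change can be generalized to every letter, the model turns restoration into a problem similar to POS tagging and solves it with statistical methods. In experiments on the "Million-Word Uyghur Morphological Analysis Corpus" (《维吾尔语百万词词法分析语料库》), manually annotated by the Xinjiang Key Laboratory of Multilingual Information Technology, stem extraction accuracy reached 94.7%, and the F-score for segmenting and tagging stems and affixes reached 92.6%.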

Keywords: Uyghur language  morphological analysis  word segmentation  POS tagging  directed graph
Received: 4/8/2011
Revised: 2012/2/22
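
The directed graph described in the abstract can be illustrated with a small best-path search. The following is a minimal sketch, not the authors' implementation: it assumes a lattice whose states are (position in word, tag), scores each edge as a transition probability times a generation probability, and keeps the highest-scoring path. The lexicon, the tag names (`N`, `PL`, `ACC`, `SUF`), the Latin transliteration, and every probability value are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy lexicon: morpheme -> [(tag, generation probability P(morpheme | tag))].
LEXICON = {
    "kitab": [("N", 0.6)],
    "kita": [("N", 0.05)],
    "blar": [("SUF", 0.01)],
    "lar": [("PL", 0.9)],
    "ni": [("ACC", 0.8)],
    "larni": [("SUF", 0.02)],
}

# Hypothetical transition probabilities P(tag | previous tag); "<s>" marks the word start.
TRANS = {
    ("<s>", "N"): 0.7,
    ("N", "PL"): 0.5,
    ("N", "ACC"): 0.3,
    ("N", "SUF"): 0.1,
    ("PL", "ACC"): 0.6,
}


def best_analysis(word):
    """Return the most probable [(morpheme, tag), ...] for word, or None."""
    # best[(pos, tag)] = (log probability, previous state, morpheme on the incoming edge)
    best = {(0, "<s>"): (0.0, None, None)}
    # Edges only point forward, so visiting positions left to right is a
    # valid topological order for the graph.
    for start in range(len(word)):
        states = [s for s in best if s[0] == start]
        for state in states:
            score, prev_tag = best[state][0], state[1]
            for end in range(start + 1, len(word) + 1):
                piece = word[start:end]
                for tag, gen in LEXICON.get(piece, []):
                    trans = TRANS.get((prev_tag, tag), 0.0)
                    if trans == 0.0:
                        continue  # no edge between these two nodes
                    cand = score + math.log(trans) + math.log(gen)
                    key = (end, tag)
                    if key not in best or cand > best[key][0]:
                        best[key] = (cand, state, piece)
    finals = [(v[0], k) for k, v in best.items() if k[0] == len(word)]
    if not finals:
        return None
    state = max(finals)[1]
    path = []
    while best[state][1] is not None:  # follow back-pointers to the start
        _, back, piece = best[state]
        path.append((piece, state[1]))
        state = back
    return path[::-1]


if __name__ == "__main__":
    print(best_analysis("kitablarni"))
```

Working in log space avoids underflow when many small probabilities are multiplied, and the back-pointers recover the chosen stem and affix sequence together with its tags.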

Directed Graph Model of Uyghur Morphological Analysis
Mairehaba · AILI, JIANG Wen-Bin, WANG Zhi-Yang, Tuergen · YIBULAYIN, LIU Qun. Directed Graph Model of Uyghur Morphological Analysis[J]. Journal of Software, 2012, 23(12): 3115-3129.
Authors:Mairehaba · AILI  JIANG Wen-Bin  WANG Zhi-Yang  Tuergen · YIBULAYIN  LIU Qun
Affiliation:1(College of Information Science and Engineering,Xinjiang University,Urumqi 830046,China) 2(Institute of Computing Technology,The Chinese Academy of Sciences,Beijing 100190,China) 3(Graduate University,The Chinese Academy of Sciences,Beijing 100049,China)
Abstract: Uyghur is a typical agglutinative language. It has strong derivational ability and a very rich morphology, and it follows the rules of phonetic harmony. During word formation, phonetic changes such as weakening, epenthesis, and elision may occur. These characteristics determine the difficulties of Uyghur morphological analysis: stemming, restoring letters altered by phonetic change, and tagging. This paper introduces the hierarchical structure of the Uyghur word and proposes a directed graph model for Uyghur morphological analysis. In this model, the analysis is described as a directed graph in which nodes represent stems, affixes, and their corresponding tags, while edges represent the transition or generation probabilities between nodes; these probabilities are the basis for selecting the best candidate. To handle the phonetic changes that arise during morphological variation in Uyghur, this paper also proposes an automatic restoration model based on aligning the letters within a word. Under the assumption that a phonetic change can affect any single letter, the model converts the restoration problem into a sequence labeling problem, which is solved by statistical methods. Experimental results on the "Mega-words Corpus of Morphological Analysis of Uyghur", which was manually annotated by the Xinjiang Multilingual Information Technology Key Laboratory, show that stemming accuracy reaches 94.7% and that the F-score for segmenting and tagging stems and affixes reaches 92.6%.
Keywords:Uyghur language  morphological analysis  word segmentation  POS tagging  directed graph
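
The idea of treating restoration as per-letter labeling can be sketched as follows. This is an illustration under stated assumptions, not the method of the paper: Python's `difflib.SequenceMatcher` stands in for the letter alignment algorithm, which the abstract does not specify, and a most-frequent-label lookup over (previous, current, next) letter contexts stands in for the statistical model. The function names and the transliterated word pairs are hypothetical.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def letter_labels(surface, restored):
    """Give each surface letter the restored string it maps to.

    Assumes a non-empty surface form, so that the labels joined
    together always equal the restored form.
    """
    out = [""] * len(surface)
    pending = ""  # restored letters that precede the first surface letter
    matcher = SequenceMatcher(None, surface, restored, autojunk=False)
    for _, i1, i2, j1, j2 in matcher.get_opcodes():
        chunk = restored[j1:j2]
        n = i2 - i1
        if n == 0:  # letters missing on the surface (elision)
            if i1 > 0:
                out[i1 - 1] += chunk
            else:
                pending = chunk
            continue
        # equal, replace, delete: spread the chunk over the n surface
        # letters; the last letter absorbs any remainder.
        for k in range(n):
            out[i1 + k] = chunk[k:k + 1] if k < n - 1 else chunk[k:]
    if pending and out:
        out[0] = pending + out[0]
    return out


def contexts(word):
    """(previous, current, next) letter triples, with '#' as the boundary."""
    padded = "#" + word + "#"
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


def train(pairs):
    """Learn the most frequent label for every letter context."""
    counts = defaultdict(Counter)
    for surface, restored in pairs:
        labels = letter_labels(surface, restored)
        for ctx, label in zip(contexts(surface), labels):
            counts[ctx][label] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def restore(model, surface):
    """Label every letter; unseen contexts keep the letter unchanged."""
    return "".join(model.get(ctx, ctx[1]) for ctx in contexts(surface))


if __name__ == "__main__":
    # Toy (surface, restored) pairs in Latin transliteration.
    model = train([("balilar", "balalar"), ("kitablar", "kitablar")])
    print(restore(model, "balilar"))
```

Because each label is simply the restored realization of one surface letter, weakening, epenthesis, and elision all reduce to the same tagging decision; a real system would replace the lookup table with a trained sequence labeler.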
This article is indexed in databases including CNKI and Wanfang Data.