Vietnamese Word Segmentation with Conditional Random Fields and Ambiguity Model
Cite this article:Xiong Mingming, Li Ying, Guo Jianyi, Mao Cunli, Yu Zhengtao. Vietnamese Word Segmentation with Conditional Random Fields and Ambiguity Model[J]. Journal of Data Acquisition and Processing (数据采集与处理), 2017, 32(3): 636-642.
Authors:Xiong Mingming (熊明明), Li Ying (李英), Guo Jianyi (郭剑毅), Mao Cunli (毛存礼), Yu Zhengtao (余正涛)
Affiliation:1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; 2. The Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500, China
Abstract:Based on a study of the lexical characteristics of Vietnamese, the basic features of the language are incorporated into conditional random fields (CRFs), and a Vietnamese word segmentation method based on CRFs and an ambiguity model is proposed. A segmentation corpus of 25,981 Vietnamese entries, obtained by machine annotation followed by manual proofreading, serves as the training corpus for the CRFs. Crossing ambiguity is widespread in Vietnamese sentences. To reduce its effect, 5,377 ambiguous fragments are extracted from the training corpus using dictionary-based forward and backward matching algorithms, and a maximum entropy model is trained on them to obtain an ambiguity model, which is then integrated into the segmentation model. The training corpus is divided evenly into ten parts for cross-validation experiments, in which the segmentation accuracy reaches 96.55%. Compared with the existing Vietnamese segmentation tool VnTokenizer, the experimental results show that the method improves the accuracy, recall, and F-value of Vietnamese word segmentation.

Keywords:conditional random fields (CRFs)  Vietnamese word segmentation  morphology  basic features  maximum entropy  ambiguity model
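
The abstract states that ambiguous fragments are extracted by running dictionary-based forward and backward matching over the training corpus, but gives no further detail. The following is a minimal sketch of one common way to do this, not the authors' implementation: it assumes greedy maximum matching over whitespace-separated syllables, and it treats any span where the two segmentations place different word boundaries as one ambiguous fragment. The function names, the `max_len` limit, and the two-entry toy lexicon are all hypothetical.

```python
from typing import List, Set


def forward_max_match(syllables: List[str], lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right matching: take the longest lexicon word at each position."""
    words, i = [], 0
    while i < len(syllables):
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            cand = " ".join(syllables[i:i + size])
            # A single syllable is always accepted as a fallback.
            if size == 1 or cand in lexicon:
                words.append(cand)
                i += size
                break
    return words


def backward_max_match(syllables: List[str], lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy right-to-left matching: take the longest lexicon word ending at each position."""
    words, j = [], len(syllables)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            cand = " ".join(syllables[j - size:j])
            if size == 1 or cand in lexicon:
                words.append(cand)
                j -= size
                break
    words.reverse()
    return words


def boundaries(words: List[str]) -> Set[int]:
    """Syllable offsets at which a segmentation places a word boundary (0 included)."""
    pos, cuts = 0, {0}
    for word in words:
        pos += len(word.split(" "))
        cuts.add(pos)
    return cuts


def ambiguity_fragments(syllables: List[str], lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Return the spans on which forward and backward matching disagree."""
    fwd = boundaries(forward_max_match(syllables, lexicon, max_len))
    bwd = boundaries(backward_max_match(syllables, lexicon, max_len))
    shared = sorted(fwd & bwd)   # boundaries both segmentations agree on
    disputed = fwd ^ bwd         # boundaries proposed by only one of them
    fragments = []
    for start, end in zip(shared, shared[1:]):
        if any(start < cut < end for cut in disputed):
            fragments.append(" ".join(syllables[start:end]))
    return fragments


# Toy lexicon (hypothetical): "học sinh" = student, "sinh học" = biology.
toy_lexicon = {"học sinh", "sinh học"}
sentence = "tôi là học sinh học sinh học".split()

print(forward_max_match(sentence, toy_lexicon))   # ['tôi', 'là', 'học sinh', 'học sinh', 'học']
print(backward_max_match(sentence, toy_lexicon))  # ['tôi', 'là', 'học', 'sinh học', 'sinh học']
print(ambiguity_fragments(sentence, toy_lexicon)) # ['học sinh học sinh học']
```

In this toy case neither greedy pass recovers the intended reading ("học sinh | học | sinh học"), which illustrates why a separately trained disambiguation model, such as the maximum entropy model described in the abstract, is useful for such fragments.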