
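Word Segmentation of Petroleum-Domain Documents Based on an Adaptive Hidden Markov Model
Cite this article: GONG Fa-ming, ZHU Peng-hai. Word Segmentation of Petroleum-Domain Documents Based on an Adaptive Hidden Markov Model[J]. Computer Science, 2018, 45(Z6): 97-100.
Authors: GONG Fa-ming (宫法明), ZHU Peng-hai (朱朋海)
Affiliation: College of Computer & Communication Engineering, China University of Petroleum (East China), Qingdao, Shandong 266580, China (both authors)
Funding: This work was supported by the Ministry of Science and Technology Innovation Method Program: Research and Application Demonstration of Innovative Methods for Oil and Gas Extraction in a Big Data Environment (2015IM010300).
Abstract: Chinese word segmentation is the process of converting a string of Chinese characters that has no delimiters into a sequence of words that matches how the language is actually used, and it is the first step in building a petroleum-domain ontology. Petroleum-domain documents have distinctive characteristics that make segmentation harder, and no effective segmentation algorithm exists for them yet. By introducing a terminology set on top of the hidden Markov segmentation model, this paper proposes a segmentation algorithm based on an adaptive hidden Markov model. The algorithm builds on the adaptive hidden Markov model, combines a domain dictionary with mutual information, and calibrates the segmentation with semantic constraints and word-sense constraints, so that petroleum-domain technical terms and compound words are identified precisely. A comparison with the NLPIR Chinese word segmentation system from the Chinese Academy of Sciences shows that the proposed algorithm achieves markedly higher accuracy and recall.

Keywords: Chinese word segmentation  Hidden Markov model  Compound words  Petroleum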
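The abstract names mutual information as one of the signals used to recognize compound words but does not say how it is computed. As a rough illustration only, the sketch below scores adjacent tokens with pointwise mutual information (PMI) and merges pairs above a threshold. The function names `adjacent_pmi` and `merge_compounds`, the threshold, and the toy corpus are all assumptions made for this example; they are not taken from the paper.

```python
import math
from collections import Counter


def adjacent_pmi(sentences):
    """Return the PMI (log base 2) of every adjacent token pair in the corpus."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def merge_compounds(tokens, scores, threshold):
    """Greedily merge adjacent tokens whose PMI is at least `threshold`."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores.get(pair, float("-inf")) >= threshold:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Invented toy corpus of pre-segmented sentences (not data from the paper).
corpus = [
    ["钻井", "液", "密度"],
    ["钻井", "液", "循环"],
    ["密度", "测量"],
    ["循环", "系统"],
]
scores = adjacent_pmi(corpus)
print(merge_compounds(["钻井", "液", "密度"], scores, threshold=3.0))
```

With real data the threshold would have to be tuned, and a candidate compound would presumably also be checked against the domain dictionary before being accepted.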

Word Segmentation Based on Adaptive Hidden Markov Model in Oilfield
GONG Fa-ming and ZHU Peng-hai.Word Segmentation Based on Adaptive Hidden Markov Model in Oilfield[J].Computer Science,2018,45(Z6):97-100.
Authors:GONG Fa-ming and ZHU Peng-hai
Affiliation: College of Computer & Communication Engineering, China University of Petroleum, Qingdao, Shandong 266580, China (both authors)
Abstract: Chinese word segmentation is the first step in constructing a petroleum-domain ontology. Petroleum-domain documents have distinctive characteristics that make word segmentation more difficult, and so far no effective segmentation algorithm exists for them. Building on the hidden Markov segmentation model and introducing a terminology set, this paper proposes an adaptive hidden Markov word segmentation model that combines a domain dictionary with mutual information. The proposed algorithm calibrates the segmentation with semantic constraints and word-sense constraints, and accurately identifies technical terms and compound words in the petroleum domain. A comparison with the NLPIR Chinese word segmentation system developed by the Chinese Academy of Sciences shows that the proposed algorithm achieves markedly higher accuracy and recall.
Keywords: Chinese word segmentation  Hidden Markov model  Compound words  Petroleum
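The abstract does not describe the adaptive model itself. For background, the following is a minimal sketch of the standard (non-adaptive) hidden Markov segmenter that such work usually starts from: each character is tagged B, M, E or S (begin, middle, end of a word, or single-character word), and Viterbi decoding picks the most likely tag sequence. Everything here is an assumption for illustration, including the add-alpha smoothing, the `<unk>` placeholder for unseen characters, the function names, and the tiny invented training corpus; none of it is the authors' implementation.

```python
import math
from collections import defaultdict

STATES = "BMES"  # Begin / Middle / End of a multi-character word, Single-character word
ALLOWED = {"B": "ME", "M": "ME", "E": "BS", "S": "BS"}  # valid tag transitions
UNK = "<unk>"  # stands in for characters never seen in training


def tag_word(word):
    """Map one word to its per-character BMES tags."""
    return "S" if len(word) == 1 else "B" + "M" * (len(word) - 2) + "E"


def train(segmented_sentences, alpha=1.0):
    """Estimate log start, transition and emission probabilities by counting."""
    start = defaultdict(float)
    trans = {s: defaultdict(float) for s in STATES}
    emit = {s: defaultdict(float) for s in STATES}
    vocab = {UNK}
    for words in segmented_sentences:
        if not words:
            continue
        chars = "".join(words)
        tags = "".join(tag_word(w) for w in words)
        vocab.update(chars)
        start[tags[0]] += 1
        for ch, tag in zip(chars, tags):
            emit[tag][ch] += 1
        for prev, cur in zip(tags, tags[1:]):
            trans[prev][cur] += 1

    def log_dist(counts, support):
        # Add-alpha smoothing over the given support set.
        total = sum(counts.values()) + alpha * len(support)
        return {k: math.log((counts[k] + alpha) / total) for k in support}

    return (
        log_dist(start, "BS"),
        {s: log_dist(trans[s], ALLOWED[s]) for s in STATES},
        {s: log_dist(emit[s], vocab) for s in STATES},
    )


def viterbi(text, model):
    """Return the most likely valid BMES tag string for a non-empty text."""
    start, trans, emit = model

    def e(state, ch):
        return emit[state].get(ch, emit[state][UNK])

    score = [{s: start[s] + e(s, text[0]) for s in "BS"}]
    back = [{}]
    for ch in text[1:]:
        cur, ptr = {}, {}
        for s in STATES:
            cands = [(score[-1][p] + trans[p][s], p) for p in score[-1] if s in trans[p]]
            if cands:
                best, prev = max(cands)
                cur[s], ptr[s] = best + e(s, ch), prev
        score.append(cur)
        back.append(ptr)
    # A complete segmentation must end on E or S.
    last = max((s for s in score[-1] if s in "ES"), key=score[-1].get)
    tags = [last]
    for ptr in reversed(back[1:]):
        tags.append(ptr[tags[-1]])
    return "".join(reversed(tags))


def segment(text, model):
    """Split text into words by cutting after every E or S tag."""
    if not text:
        return []
    words, buf = [], ""
    for ch, tag in zip(text, viterbi(text, model)):
        buf += ch
        if tag in "ES":
            words.append(buf)
            buf = ""
    return words


# Invented toy training data (not from the paper).
model = train([["钻井", "液", "密度"], ["油藏", "压力", "测量"], ["钻井", "平台"]])
print(segment("钻井液密度", model))
```

A plain model like this has no knowledge of domain terminology beyond what appears in its training data, which is presumably the gap that the terminology set, domain dictionary and mutual information are meant to close.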