
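Research on Recognizing Domain-Specific Vocabulary in Electric Power Texts Based on Unsupervised Methods
Authors: 朱婷婷 (ZHU Tingting)  杜一帆 (DU Yifan)  李睿凡 (LI Ruifan)  熊永平 (XIONG Yongping)
Affiliations: School of Computer Science, Beijing University of Posts and Telecommunications; Institute of Network Technology, Beijing University of Posts and Telecommunications
Funding: Science and Technology Project of State Grid Corporation of China Headquarters (5200-201918255A-0-0-00)
Abstract: Recognizing electric power terminology is the foundation for intelligent applications on substation operation and maintenance documents, such as deep language understanding and knowledge graph construction. Domain-independent recognition methods perform unsatisfactorily, so this paper proposes an unsupervised terminology discovery method for the power domain, based on the linguistic characteristics of power-domain vocabulary. First, the power document corpus is segmented with a general-purpose dictionary. Then, sliding windows of different sizes are set according to the characteristics of power terminology, and the various combinations of the segmentation results are taken as candidate words. Next, four statistics are computed for each candidate word: accessor variety, information entropy, pointwise mutual information, and word frequency. Finally, the candidates are filtered using the combined linguistic features and three grammatical rules on word-formation boundaries, yielding new power-domain terms. Comparative experiments against baseline methods on a public dataset demonstrate the effectiveness of the proposed method.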
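The candidate-generation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `candidate_words`, the `max_window` parameter, and the sample tokens are all hypothetical, and the abstract does not state which window sizes the paper actually uses.

```python
from collections import Counter


def candidate_words(tokens, max_window=3):
    """Join every run of 2..max_window adjacent tokens into a candidate word.

    `tokens` is assumed to be the output of a general-purpose segmenter.
    Returns a Counter mapping each candidate string to its frequency.
    """
    counts = Counter()
    for size in range(2, max_window + 1):
        # Slide a window of `size` tokens across the segmented text.
        for start in range(len(tokens) - size + 1):
            counts["".join(tokens[start:start + size])] += 1
    return counts


# Hypothetical segmentation result for a short maintenance note.
tokens = ["变压器", "油", "温度", "变压器", "油", "色谱"]
candidates = candidate_words(tokens, max_window=3)
```

Here the two-token candidate `变压器油` occurs twice while every other candidate occurs once, so word frequency alone already separates it from the rest. In a real corpus the later statistical and rule-based filters would do most of the work.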

Keywords: domain dictionary  unsupervised learning  new word recognition  sliding window
Received: 2019/10/15
Revised: 2020/1/20

An unsupervised approach to recognizing new words in power domain
Authors:ZHU Tingting  DU Yifan  LI Ruifan  XIONG Yongping
Affiliation:School of Computer Science, Beijing University of Posts and Telecommunications
Abstract:Terminology recognition in the power domain lays the foundation for deep language understanding of power documents and for knowledge graph construction. By incorporating the morphology of power-domain vocabulary, an unsupervised approach to recognizing new terminology in documents is proposed. First, a general-purpose dictionary is used to segment the corpus. Then, the segmented words are combined using sliding windows of different sizes, chosen according to the features of power terminology, to form candidate words. Next, four statistics are computed for each candidate: accessor variety, information entropy, pointwise mutual information, and word frequency. Finally, the candidates are screened using these statistics together with three types of word-formation grammatical rules, yielding the final set of new power-domain words. Experimental results on a public dataset demonstrate the effectiveness of the proposed method.
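To make the four statistics concrete, the sketch below computes them for two-token candidates using their common textbook definitions. The abstract does not give the paper's exact formulas, so these definitions are assumptions. The function name `bigram_statistics` and the sample tokens are hypothetical, and only right-hand neighbors are considered to keep the example short.

```python
import math
from collections import Counter, defaultdict


def bigram_statistics(tokens):
    """Compute frequency, PMI, right accessor variety, and right entropy.

    Assumed definitions (not taken from the paper):
      PMI(a, b)        = log2( P(a, b) / (P(a) * P(b)) )
      accessor variety = number of distinct tokens following the candidate
      entropy          = Shannon entropy (bits) of the following-token counts
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1

    # Record which token follows each adjacent pair.
    right = defaultdict(Counter)
    for i in range(len(tokens) - 2):
        right[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1

    stats = {}
    for (a, b), freq in bigrams.items():
        p_ab = freq / n_bi
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        followers = right.get((a, b), Counter())
        total = sum(followers.values())
        entropy = -sum(
            (c / total) * math.log2(c / total) for c in followers.values()
        ) if total else 0.0
        stats[(a, b)] = {
            "freq": freq,
            "pmi": math.log2(p_ab / (p_a * p_b)),
            "accessor_variety": len(followers),
            "entropy": entropy,
        }
    return stats


# Hypothetical segmentation result, as a general-purpose segmenter might emit.
tokens = ["变压器", "油", "温度", "变压器", "油", "色谱"]
stats = bigram_statistics(tokens)
pair = stats[("变压器", "油")]
```

A candidate whose parts co-occur more often than chance (high PMI) and that appears in varied contexts (high accessor variety and entropy) is a plausible term. How the paper combines these scores with its grammatical rules is not specified in the abstract.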
Keywords:domain dictionary  unsupervised learning  new word recognition  sliding window  statistical features