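Research of Technology on Building China English New Words Corpus (中国英语新词语料库构建技术研究)
Cite this article:LIU Yongfang, HAO Xiaoyan, LIU Rong. Research of Technology on Building China English New Words Corpus[J]. Computer Engineering and Applications (计算机工程与应用), 2020, 56(16): 165-168. DOI: 10.3778/j.issn.1002-8331.1906-0128
Authors:LIU Yongfang (刘永芳)  HAO Xiaoyan (郝晓燕)  LIU Rong (刘荣)
Affiliation:1. College of Information and Computer, Taiyuan University of Technology, Taiyuan 030000; 2. Foreign Language College, Taiyuan University of Technology, Taiyuan 030000
Funding:Natural Science Foundation of Shanxi Province; Humanities and Social Sciences Research Project of the Ministry of Education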
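Abstract:As new China English words appear in large numbers, the lack of a corpus of such words has become a major obstacle to research on China English, and new-word identification is the main technical problem in building one. Existing identification algorithms based on pointwise mutual information (PMI) and branch entropy (BE) yield candidates with low internal cohesion. In addition, a single PMI threshold lets through many invalid phrases that score above it, while genuine new phrases that score below it go unrecognized. To address these problems, an improved China English new-word identification algorithm based on multi-word PMI and branch entropy is proposed, which identifies new words using multi-word PMI together with a double PMI threshold. Experimental results show that, on the same data and in the same experimental environment, the method improves precision, recall, and F-measure, and that it is effective and feasible for corpus construction.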

Keywords:China English  corpus of China English new words  new-word identification  pointwise mutual information (PMI)  double threshold

Research of Technology on Building China English New Words Corpus
LIU Yongfang,HAO Xiaoyan,LIU Rong. Research of Technology on Building China English New Words Corpus[J]. Computer Engineering and Applications, 2020, 56(16): 165-168. DOI: 10.3778/j.issn.1002-8331.1906-0128
Authors:LIU Yongfang  HAO Xiaoyan  LIU Rong
Affiliation:1. College of Information and Computer, Taiyuan University of Technology, Taiyuan 030000, China; 2. Foreign Language College, Taiyuan University of Technology, Taiyuan 030000, China
Abstract:Specialized corpora of new words are too scarce to support systematic study of the growing number of China English new words, and new-word identification is the main technical problem in constructing such a corpus. Existing new-word recognition algorithms based on Pointwise Mutual Information (PMI) and Branch Entropy (BE) produce candidates with low inner cohesion. Moreover, a single PMI threshold admits many invalid phrases that score above it and fails to recognize new phrases that score below it. To address these problems, a recognition algorithm for China English new words based on improved multi-word PMI and BE is proposed. New words are identified using multi-word PMI together with a double PMI threshold. Experimental results show that, on the same data and in the same experimental environment, the proposed method improves precision, recall, and F-measure, and that it is effective and feasible for corpus construction.
Keywords:China English  corpus of China English new words  identification of new words  Pointwise Mutual Information(PMI)  double threshold  
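
The abstract names the ingredients of the method (multi-word PMI, branch entropy, and a double PMI threshold) but does not give formulas or threshold values. The following is a minimal sketch of how such a pipeline could look, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names, the use of the lowest PMI over all two-way splits as the "multi-word" score, the rule that candidates between the two PMI thresholds must pass a stricter entropy check, and all default parameter values.

```python
import math
from collections import Counter, defaultdict


def branch_entropy(neighbors):
    """Shannon entropy (in bits) of a neighbor-token frequency table."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


def multiword_pmi(gram, counts, total):
    """Lowest PMI over every two-way split of the n-gram.

    ASSUMPTION: this is one common way to extend PMI beyond two words;
    the abstract does not state which formula the paper uses.
    """
    p_whole = counts[gram] / total
    scores = []
    for k in range(1, len(gram)):
        p_left = counts[gram[:k]] / total
        p_right = counts[gram[k:]] / total
        scores.append(math.log2(p_whole / (p_left * p_right)))
    return min(scores)


def find_new_words(tokens, max_n=3, pmi_low=1.0, pmi_high=3.0,
                   be_min=1.0, be_strict=1.5, min_count=2):
    """Return {n-gram: (pmi, entropy)} for candidates that pass both filters.

    ASSUMPTION (hypothetical double-threshold rule):
      pmi >= pmi_high            -> accept if entropy >= be_min
      pmi_low <= pmi < pmi_high  -> accept only if entropy >= be_strict
      pmi < pmi_low              -> reject
    """
    counts = Counter()
    left = defaultdict(Counter)   # tokens seen immediately before each n-gram
    right = defaultdict(Counter)  # tokens seen immediately after each n-gram
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            counts[gram] += 1
            if n > 1:
                if i > 0:
                    left[gram][tokens[i - 1]] += 1
                if i + n < len(tokens):
                    right[gram][tokens[i + n]] += 1

    total = len(tokens)  # token count used as the denominator for all orders
    found = {}
    for gram, c in counts.items():
        if len(gram) < 2 or c < min_count:
            continue
        pmi = multiword_pmi(gram, counts, total)
        if pmi < pmi_low:
            continue
        # A free-standing unit should vary on both sides, so take the minimum.
        be = min(branch_entropy(left[gram]), branch_entropy(right[gram]))
        needed = be_min if pmi >= pmi_high else be_strict
        if be >= needed:
            found[gram] = (pmi, be)
    return found
```

In this sketch the lower threshold keeps weakly cohesive candidates in play instead of discarding them outright, and the stricter entropy requirement is what screens them; whether the paper combines the two thresholds this way is not stated in the abstract.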
This article is indexed in Wanfang Data (万方数据) and other databases.