
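Enterprise abbreviation prediction based on constitution pattern and conditional random field (基于构成模式和条件随机场的企业简称预测)
Cite this article: SUN Liping (孙丽萍), GUO Yi (过弋), TANG Wenwu (唐文武), XU Yongbin (徐永斌). Enterprise abbreviation prediction based on constitution pattern and conditional random field[J]. Journal of Computer Applications (计算机应用), 2016, 36(2): 449-454. DOI: 10.11772/j.issn.1001-9081.2016.02.0449
Authors: SUN Liping (孙丽萍)  GUO Yi (过弋)  TANG Wenwu (唐文武)  XU Yongbin (徐永斌)
Affiliations: 1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China; 2. College of Information Science and Technology, Shihezi University, Shihezi, Xinjiang 832007, China
Funding: National Natural Science Foundation of China (61462073, 61272198).
Abstract: As enterprise marketing deepens, enterprise abbreviations are widely used in news media, yet as new words they are difficult to recognize effectively. To address this problem, an enterprise abbreviation prediction method based on constitution patterns and Conditional Random Field (CRF) is proposed. First, the constitution patterns of enterprise full names and abbreviations are summarized from a linguistic perspective, and the Bi-gram algorithm is improved by combining a lexicon with rules, yielding the CBi-gram algorithm, which performs structured segmentation of enterprise full names and improves the accuracy of core word recognition within them. Then, enterprise types are further subdivided according to the segmentation results, and abbreviation rule sets for the different enterprise types are built through manual summarization and rule self-learning. Finally, candidate abbreviation sets are generated from these rules, which reduces the noise that inapplicable rules introduce when abbreviations are generated for different types of enterprises. In addition, to compensate for the limitation of a purely rule-based approach in handling names that mix abbreviation of the full form with abbreviation of an already shortened form, CRF is introduced to predict abbreviations statistically. The word, its tone, and its position within the components of the full name are selected as model features for training, so that the two methods complement each other. Experimental results show that the method achieves high accuracy and that the output abbreviation set basically covers the commonly used abbreviations of enterprises.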

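Keywords: enterprise abbreviation  constitution pattern  abbreviation prediction  core word recognition  conditional random field
Received: 2015-08-29
Revised: 2015-09-11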

Enterprise abbreviation prediction based on constitution pattern and conditional random field
SUN Liping,GUO Yi,TANG Wenwu,XU Yongbin. Enterprise abbreviation prediction based on constitution pattern and conditional random field[J]. Journal of Computer Applications, 2016, 36(2): 449-454. DOI: 10.11772/j.issn.1001-9081.2016.02.0449
Authors:SUN Liping  GUO Yi  TANG Wenwu  XU Yongbin
Affiliation: 1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China; 2. College of Information Science and Technology, Shihezi University, Shihezi, Xinjiang 832007, China
Abstract: With the continuous development of enterprise marketing, enterprise abbreviations have come into wide use. Nevertheless, as one of the main sources of unknown words, enterprise abbreviations cannot be identified effectively. A method for predicting enterprise abbreviations based on constitution patterns and Conditional Random Field (CRF) was proposed. First, the constitution patterns of enterprise names and abbreviations were summarized from the perspective of linguistics, and the Bi-gram algorithm was improved by combining a lexicon with rules; the result is called CBi-gram. The CBi-gram algorithm was used to segment enterprise names automatically and to improve the recognition accuracy of a company's core word. Then the enterprise type was subdivided according to the CBi-gram segmentation, and abbreviation rule sets were collected by manual summarization and a self-learning method, which reduces the noise caused by unsuitable rules. In addition, to compensate for the limitations of manually built rules on mixed abbreviation forms, CRF was introduced to generate enterprise abbreviations statistically, with word, tone and word position used as features to train the model, as a supplement to the rule-based approach. The experimental results show that the method performs well and that its output basically covers the usual range of enterprise abbreviations.
Keywords:enterprise abbreviation   constitution pattern   abbreviation prediction   core word recognition   Conditional Random Field(CRF)
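
The two stages described in the abstract (rule-based candidate generation, then CRF features built from word, tone and position) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the component tags (R, C, I, O), the four rules in `RULES`, the toy tone table `TONES`, the made-up company name, and the choice to build features per character rather than per word. The actual CBi-gram segmentation and the learned rule sets are not reproduced.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical component tags for an already segmented full name:
# R = region, C = core word, I = industry, O = organization type.
Segment = Tuple[str, str]  # (text, tag)

# Hypothetical abbreviation rules: each rule lists the component tags to keep.
RULES: List[Tuple[str, ...]] = [
    ("C",),
    ("R", "C"),
    ("C", "I"),
    ("R", "C", "I"),
]

# Toy tone lookup (Mandarin tones 1-4) covering only the demo characters;
# a real system would need a pronunciation dictionary.
TONES: Dict[str, int] = {
    "上": 4, "海": 3, "华": 2, "信": 4, "科": 1,
    "技": 4, "有": 3, "限": 4, "公": 1, "司": 1,
}


def candidate_abbreviations(segments: List[Segment]) -> Set[str]:
    """Apply every rule whose required tags all occur in the segmented name."""
    by_tag = {tag: text for text, tag in segments}
    candidates: Set[str] = set()
    for rule in RULES:
        if all(tag in by_tag for tag in rule):
            candidates.add("".join(by_tag[tag] for tag in rule))
    return candidates


def char_features(segments: List[Segment]) -> List[Dict[str, object]]:
    """Build one feature dict per character: the character itself, its tone,
    the component it belongs to, and its position inside that component."""
    features: List[Dict[str, object]] = []
    for text, tag in segments:
        for index, char in enumerate(text):
            features.append({
                "char": char,
                "tone": TONES.get(char, 0),  # 0 marks an unknown tone
                "component": tag,
                "position": index,
            })
    return features


# A made-up company name, segmented by hand for the demo.
segments: List[Segment] = [
    ("上海", "R"), ("华信", "C"), ("科技", "I"), ("有限公司", "O"),
]

if __name__ == "__main__":
    print(sorted(candidate_abbreviations(segments)))
    for feature in char_features(segments):
        print(feature)
```

A sequence of feature dicts like this is the kind of input a CRF toolkit would typically be trained on, paired with a per-character label such as "keep" or "drop"; that labeling scheme is likewise an assumption, since the abstract does not state one.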