A patent literature term extraction method based on the boundary tag sets

Cite this article:	DING Jie, LÜ Xue-qiang, LIU Ke-hui. A patent literature term extraction method based on the boundary tag sets[J]. Computer Engineering & Science, 2015, 37(8): 1591-1598.

Authors:	DING Jie, LÜ Xue-qiang, LIU Ke-hui

Affiliation:	1. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China; 2. Beijing Research Center of Urban System Engineering, Beijing 100035, China

Funding:	National Natural Science Foundation of China (61271304); Key Project of the Science and Technology Development Plan of the Beijing Municipal Education Commission and Class B Key Project of the Beijing Natural Science Foundation (KZ20131123237); Innovative Team Building and Teacher Career Development Plan for Beijing Municipal Universities (IDHT20130519)

Received:	2014-04-21
Revised:	2015-08-25

Abstract:	Most current methods for determining term boundaries select a suitable statistic and set a threshold on it to measure how tightly adjacent strings are bound, but such methods perform poorly when extracting long terms. To address the low recall of long terms in term extraction, we studied a large body of patent literature and propose a term extraction method based on patent term boundary tag sets. We introduce the concept of a boundary tag set and construct patent term boundary tag sets from the characteristics of term boundaries in patent literature. We also propose a seed term weighting method for extracting seed terms. Using the People's Daily corpus as a contrast corpus, we extract a term component library from the patent literature, which improves the termhood of the candidate terms. Finally, the recognized terms are filtered by left and right boundary entropy. Experiments show that the proposed method reaches a precision of 81.67%, a recall of 71.92%, and an F-value of 0.765, a considerable improvement over the comparison experiments.
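The final filtering step can be illustrated with a minimal sketch of left and right boundary entropy. The abstract does not give the paper's exact formula or threshold, so the following assumes the common definition: the Shannon entropy of the distribution of tokens seen immediately to the left (or right) of a candidate term. The function names, the tuple-of-tokens representation, and the default `threshold=0.5` are hypothetical choices for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def _entropy(counts):
    """Shannon entropy (natural log) of a Counter of neighbor frequencies."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log(1/p) is non-negative for every neighbor, so the sum is >= 0.
    return sum((c / total) * math.log(total / c) for c in counts.values())


def boundary_entropies(tokens, candidate):
    """Return (left_entropy, right_entropy) for `candidate`, a tuple of tokens.

    `tokens` is a segmented corpus given as a flat list of tokens.
    """
    n = len(candidate)
    left, right = Counter(), Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == candidate:
            if i > 0:
                left[tokens[i - 1]] += 1       # token just before the match
            if i + n < len(tokens):
                right[tokens[i + n]] += 1      # token just after the match
    return _entropy(left), _entropy(right)


def filter_by_boundary_entropy(tokens, candidates, threshold=0.5):
    """Keep candidates whose left AND right entropies reach the threshold.

    The threshold value is an assumption; the abstract does not state one.
    """
    kept = []
    for cand in candidates:
        left_h, right_h = boundary_entropies(tokens, cand)
        if left_h >= threshold and right_h >= threshold:
            kept.append(cand)
    return kept
```

A candidate that always follows the same token has a left entropy of zero, which suggests it is a fragment of a longer term and should be dropped. A candidate with varied neighbors on both sides is more likely to be a complete term.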

Keywords:	boundary tag set; seed term; term component library; boundary entropy
This article is indexed in Wanfang Data and other databases.