
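A Collocation-based Method for Semantic Similarity Measure for Chinese Words (一种基于搭配的中文词汇语义相似度计算方法)
Cite this article: WANG Shi, CAO Cungen, PEI Yajun, XIA Fei. A Collocation-based Method for Semantic Similarity Measure for Chinese Words [J]. Journal of Chinese Information Processing (中文信息学报), 2013, 27(1): 7-15.
Authors: WANG Shi, CAO Cungen, PEI Yajun, XIA Fei
Affiliations: 1. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190;
2. University of Chinese Academy of Sciences, Beijing 100049; 3. China National Committee for Terms in Sciences and Technologies, Beijing 100717
Funding: National Natural Science Foundation of China; National 863 Program; Key Program of the National Social Science Foundation of China
Abstract: Computing the semantic similarity between words is fundamental to many natural language processing applications. This paper proposes a new method that is efficient, practical, and reasonably accurate. Starting from the traditional distributional similarity hypothesis that "similar words occur in similar contexts", the method takes as a word's context not its adjacent words in sentences but its collocates in two-word noun phrases, which better reflect the word's semantic features and yield better results. Working from a large, automatically built collection of two-word noun phrases, it first constructs tf-idf weighted vectors of direct and indirect collocates, and then obtains the semantic similarity between two words as the cosine similarity between their collocate vectors. To allow comparison with related methods, a benchmark test set for Chinese word semantic similarity was built from human ratings. On the nouns, verbs, and adjectives in this test set, the method achieves correlation coefficients of 0.703, 0.509, and 0.700, respectively, with 100% coverage.

Keywords: semantic similarity; word collocation; similarity benchmark test set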

A Collocation-based Method for Semantic Similarity Measure for Chinese Words
WANG Shi, CAO Cungen, PEI Yajun, XIA Fei. A Collocation-based Method for Semantic Similarity Measure for Chinese Words [J]. Journal of Chinese Information Processing, 2013, 27(1): 7-15.
Authors: WANG Shi, CAO Cungen, PEI Yajun, XIA Fei
Affiliation: 1. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
2. University of Chinese Academy of Sciences, Beijing 100049, China;
3. China National Committee for Terms in Sciences and Technologies, Beijing 100717, China
Abstract: Word similarity measures play a basic role in many NLP applications. In this paper, we propose a novel and practical method for this purpose with acceptable precision. Guided by the classic distributional hypothesis that "similar words occur in similar contexts", we suggest that the collocates of a word in two-word noun phrases serve as better contexts than its adjacent words in sentences, because the former are more semantically related to it. Using automatically built, large-scale two-word noun phrases, we first construct tf-idf weighted word vectors containing direct and indirect collocations, and then take the cosine similarity between these vectors as the desired semantic similarity. To compare with related approaches, we build a benchmark test set based on human ratings. On this benchmark, the proposed method achieves correlation coefficients of 0.703, 0.509, and 0.700 on nouns, verbs, and adjectives, respectively, with 100% coverage.
Key words: semantic similarity, word collocation, similarity benchmark set
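The pipeline summarized in the abstract (collocates from two-word noun phrases, tf-idf weighting, cosine comparison) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and not the authors' implementation: the function names `collocation_vectors` and `cosine`, the toy English phrases, the symmetric treatment of the two positions in a phrase, and the particular tf-idf variant (raw co-occurrence count times `log(N / df)`). Only direct collocates are modeled, because the abstract does not define how indirect collocations are built.

```python
import math
from collections import Counter, defaultdict


def collocation_vectors(phrases):
    """Build tf-idf weighted collocate vectors from two-word phrases.

    `phrases` is an iterable of (left, right) word pairs. This is a
    hypothetical simplification: both words of a phrase are treated as
    each other's direct collocate, regardless of position.
    """
    counts = defaultdict(Counter)
    for left, right in phrases:
        counts[left][right] += 1
        counts[right][left] += 1

    n_words = len(counts)
    # Document frequency: number of distinct words that take this collocate.
    df = Counter()
    for collocates in counts.values():
        df.update(collocates.keys())

    # Assumed weighting: raw count (tf) times log(N / df) (idf).
    return {
        word: {c: tf * math.log(n_words / df[c]) for c, tf in collocates.items()}
        for word, collocates in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy phrase list, invented for this example.
phrases = [
    ("red", "apple"), ("green", "apple"),
    ("red", "pear"), ("green", "pear"),
    ("fast", "car"), ("red", "car"),
    ("fast", "train"),
]
vectors = collocation_vectors(phrases)

print("apple ~ pear :", cosine(vectors["apple"], vectors["pear"]))
print("apple ~ car  :", cosine(vectors["apple"], vectors["car"]))
print("apple ~ train:", cosine(vectors["apple"], vectors["train"]))
```

In this toy data, "apple" and "pear" share all of their collocates and so score higher than "apple" and "car", which share only one, while "apple" and "train" share none. A real system would need the large automatically extracted phrase collection and the indirect collocations that the abstract mentions.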
This article is indexed in Wanfang Data (万方数据) and other databases.