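A Wikipedia-Based Learning Algorithm for Chinese Word Relatedness
Cite this article: HUANG Lan, DU Youfu. A Wikipedia-Based Learning Algorithm for Chinese Word Relatedness[J]. Journal of Chinese Information Processing, 2016, 30(3): 36-45.
Authors: HUANG Lan  DU Youfu
Affiliation: College of Computer Science, Yangtze University, Jingzhou, Hubei 434000
Funding: Yangtze Youth Fund (2015cqn52)
Abstract: Computing the degree of relatedness between words is fundamental to semantic computing. Wikipedia is currently the largest and most rapidly updated open online encyclopedia. It covers a broad range of concepts, explains them in detail, and encodes a large number of relations between concepts, which makes it a rich source of background knowledge for semantic computing. The Chinese Wikipedia, however, suffers from severe data sparseness, which reduces the effectiveness of methods for computing Chinese word relatedness. To address this problem, this paper uses machine learning to propose a new word relatedness learning algorithm based on multiple Wikipedia resources. Experiments on three standard datasets confirm the effectiveness of the new algorithm, which improves on the best previously known results by 20% to 40%.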


Keywords: word relatedness  Wikipedia  Chinese information processing  regression  link structure
  

Learning the Semantic Relatedness of Chinese Words from Wikipedia
HUANG Lan, DU Youfu. Learning the Semantic Relatedness of Chinese Words from Wikipedia[J]. Journal of Chinese Information Processing, 2016, 30(3): 36-45.
Authors: HUANG Lan  DU Youfu
Affiliation: College of Computer Science, Yangtze University, Jingzhou, Hubei 434000, China
Abstract: Semantic word relatedness measures are fundamental to many text analysis tasks such as information retrieval, classification and clustering. As the largest online encyclopedia today, Wikipedia has been successfully exploited as background knowledge to overcome the lexical differences between words and derive accurate semantic word relatedness measures. The Chinese Wikipedia, however, covers only ten percent of its English counterpart. This sparseness in the concept space and in the associated resources adversely impacts word relatedness computation. To address the sparseness problem, we propose a method that utilizes different types of structured information automatically extracted from various resources in Wikipedia, such as articles' full text and their associated hyperlink structures. We use machine learning algorithms to learn the best combination of the different resources from manually labeled training data. Experiments on three standard Chinese benchmark datasets show that our method is 20%-40% more consistent with an average human labeler than state-of-the-art methods.
Keywords: word relatedness  Wikipedia  Chinese information processing  regression  hyperlink structure
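
The abstract describes learning a combination of several Wikipedia-derived signals from manually labeled word pairs by regression. The sketch below illustrates that general idea only and is not the authors' implementation: it fits an ordinary least-squares linear combination of two hypothetical per-pair scores (a full-text similarity and a hyperlink-based similarity). The feature names, the toy values, and the choice of plain linear regression are all assumptions; the abstract does not say which features or which regression model the paper uses.

```python
from typing import List, Sequence


def solve_linear_system(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; features may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                for k in range(col, n + 1):
                    m[row][k] -= factor * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_combination(features: Sequence[Sequence[float]],
                           labels: Sequence[float]) -> List[float]:
    """Least-squares weights (bias first) from the normal equations X^T X w = X^T y."""
    rows = [[1.0] + list(f) for f in features]  # prepend a bias term
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * y for r, y in zip(rows, labels)) for i in range(d)]
    return solve_linear_system(xtx, xty)


def predict(weights: Sequence[float], feature_vector: Sequence[float]) -> float:
    """Predicted relatedness for one word pair."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_vector))


# Hypothetical features per word pair: (full_text_similarity, link_similarity).
train_features = [(0.9, 0.8), (0.1, 0.2), (0.5, 0.1), (0.3, 0.7), (0.0, 0.0)]
# Synthetic "human" ratings generated from a known linear rule, so the fit
# can be checked; real ratings would come from a labeled benchmark.
train_labels = [0.5 + 3.0 * t + 1.0 * l for t, l in train_features]

weights = fit_linear_combination(train_features, train_labels)
score = predict(weights, (0.2, 0.4))  # relatedness estimate for an unseen pair
```

With real data, the learned weights would indicate how much each resource contributes, and predictions would typically be compared against human ratings with a rank or linear correlation coefficient.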