A domain keyword analysis approach extending Term Frequency-Keyword Active Index with Google Word2Vec model
Authors: Kai Hu, Huayi Wu, Kunlun Qi, Jingmin Yu, Siluo Yang, Tianxing Yu, Jie Zheng, Bo Liu
Affiliation: 1. The State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China; 2. Collaborative Innovation Center of Geospatial Technology, Wuhan University, Wuhan, China; 3. Faculty of Information Engineering, China University of Geosciences (Wuhan), Wuhan, China; 4. Changjiang Spatial Information Technology Engineering CO., LTD, Wuhan, China; 5. School of Information Management, Wuhan University, Wuhan, China; 6. Faculty of Geomatics, East China Institute of Technology, Nanchang, China
Abstract: In bibliometric research, keyword analysis of publications is an effective way both to investigate the knowledge structure of research domains and to explore developing trends within them. Many approaches have been proposed to identify the most representative keywords. Most of them rely on statistical regularities, syntax, grammar, or network-based characteristics to select representative keywords for domain analysis. In this paper, we argue that domain knowledge is reflected by the semantic meanings behind keywords rather than by the keywords themselves. We apply the Google Word2Vec model, a word distribution model based on deep learning, to represent the semantic meanings of the keywords. Based on this work, we propose a new domain knowledge approach, the Semantic Frequency-Semantic Active Index, which is similar to Term Frequency-Inverse Document Frequency and links domain and background information to identify infrequent but important keywords. We measure semantic similarity before any statistical computation, so that the frequencies computed are those of “semantic units” rather than of individual keywords. Semantic units are generated by clustering word vectors, and the Inverse Document Frequency is extended to a semantic inverse document frequency, so that only words in the inverse documents that reach a certain similarity are counted. Taking geographical natural hazards as the domain and natural hazards as the background discipline, we identify the domain-specific knowledge that distinguishes geographical natural hazards from other types of natural hazards. We compare the proposed method with existing methods and discuss its advantages and disadvantages, finding that introducing the semantic meaning of the keywords supports more effective domain knowledge analysis.
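The pipeline described in the abstract (cluster word vectors into semantic units, count unit frequencies in the domain, then discount units that also appear in the background) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the toy 2-D vectors stand in for trained Word2Vec embeddings, the greedy seed-based clustering and the 0.95 similarity threshold are arbitrary choices, and the smoothed TF-IDF-style weight is a stand-in because the abstract does not give the Semantic Active Index formula.

```python
import math
from collections import Counter

# Hypothetical 2-D vectors standing in for trained Word2Vec embeddings.
VECTORS = {
    "landslide": (0.90, 0.10),
    "landslides": (0.88, 0.12),
    "debris flow": (0.80, 0.30),
    "earthquake": (0.10, 0.90),
    "seismic": (0.15, 0.88),
}

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def semantic_units(keywords, threshold):
    """Greedy clustering (an assumed stand-in for word vector clustering):
    a keyword joins the first unit whose seed (first member) is similar enough,
    otherwise it starts a new unit."""
    units = []
    for kw in keywords:
        for unit in units:
            if cosine(VECTORS[kw], VECTORS[unit[0]]) >= threshold:
                unit.append(kw)
                break
        else:
            units.append([kw])
    return units

def score_units(domain_docs, background_docs, threshold=0.95):
    """Return {unit: (semantic frequency, background doc frequency, score)}."""
    vocab = sorted({kw for doc in domain_docs for kw in doc})
    counts = Counter(kw for doc in domain_docs for kw in doc)
    n_background = len(background_docs)
    scores = {}
    for unit in semantic_units(vocab, threshold):
        # Semantic frequency: occurrences of any member of the unit in the domain.
        sf = sum(counts[kw] for kw in unit)
        seed = VECTORS[unit[0]]
        # A background document counts only if it holds a sufficiently similar word.
        df = sum(
            1 for doc in background_docs
            if any(cosine(VECTORS[kw], seed) >= threshold for kw in doc)
        )
        # Smoothed inverse document frequency (assumed form, always positive).
        sidf = math.log((1 + n_background) / (1 + df)) + 1
        scores[tuple(unit)] = (sf, df, sf * sidf)
    return scores

# Hypothetical keyword lists: one list per publication.
DOMAIN_DOCS = [["landslide", "debris flow"], ["landslides", "seismic"], ["landslide"]]
BACKGROUND_DOCS = [["earthquake"], ["seismic", "earthquake"], ["landslide"],
                   ["earthquake", "seismic"]]

result = score_units(DOMAIN_DOCS, BACKGROUND_DOCS)
for unit, (sf, df, score) in result.items():
    print(unit, sf, df, round(score, 3))
```

In this toy data, "landslide", "landslides", and "debris flow" fall into a single semantic unit, so their domain counts are pooled instead of being split across three keywords, while "seismic" is discounted because similar words are common in the background documents.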
Keywords:
This article is indexed in SpringerLink and other databases.