
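Construction of topic word embeddings based on HDP: Khmer as an example
Cite this article: LI Chao, YAN Xin, XIE Jun, XU Guang-yi, ZHOU Feng, MO Yuan-yuan. Construction of topic word embeddings based on HDP: Khmer as an example[J]. Computer Engineering & Science, 2020, 42(6): 1111-1119.
Authors: LI Chao  YAN Xin  XIE Jun  XU Guang-yi  ZHOU Feng  MO Yuan-yuan
Affiliations: (1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, Yunnan; 2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650504, Yunnan; 3. Yunnan Nantian Electronic Information Industry Co., Ltd., Kunming 650400, Yunnan; 4. School of Southeast & South Asia Languages and Culture, Yunnan Minzu University, Kunming 650500, Yunnan; 5. Institute of Linguistics, Shanghai Normal University, Shanghai 200234)
Abstract: To address polysemy (one word with several meanings) and synonymy (one meaning expressed by several words) in a single word embedding, this paper proposes a method for constructing topic word embeddings based on the HDP topic model, taking Khmer as an example. The method adds topic information to the single word embedding. First, a topic label for each word is obtained with the HDP topic model. Each label is then treated as a pseudo-word and fed into the Skip-Gram model together with the words, so that topic vectors and word vectors are trained at the same time. Finally, the topic vector carrying the text's topic information is concatenated with the trained word vector, giving a topic word embedding for every word in the text. Compared with word embedding models that do not include topic information, the method achieves better results on both word similarity and text classification, and the resulting topic word embeddings carry more semantic information.

Keywords: HDP topic model  topic word embeddings  Skip-Gram model
Received: 2019-07-13
Revised: 2019-12-11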

Construction of topic word embeddings based on HDP: Khmer as an example
LI Chao, YAN Xin, XIE Jun, XU Guang-yi, ZHOU Feng, MO Yuan-yuan. Construction of topic word embeddings based on HDP: Khmer as an example[J]. Computer Engineering & Science, 2020, 42(6): 1111-1119.
Authors: LI Chao  YAN Xin  XIE Jun  XU Guang-yi  ZHOU Feng  MO Yuan-yuan
Affiliation: (1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504; 2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650504; 3. Yunnan Nantian Electronic Information Industry Co., Ltd., Kunming 650400; 4. School of Southeast & South Asia Languages and Culture, Yunnan Minzu University, Kunming 650500; 5. Institute of Linguistics, Shanghai Normal University, Shanghai 200234, China)
Abstract: To address the problems of polysemy and synonymy in a single word embedding, a method for constructing topic word embeddings based on HDP (Hierarchical Dirichlet Process) is proposed, with Khmer as the example language. The method integrates topic information into the single word embedding. First, the topic label of each word is obtained through the HDP topic model. The label is then treated as a pseudo-word and input into the Skip-Gram model together with the word, so that the topic vectors and the word vectors are trained simultaneously. Finally, the topic vector that carries the topic information of the text is concatenated with the trained word vector, which yields the topic word embedding of each word in the text. Compared with word embedding models that do not integrate topic information, this method achieves better results in both word similarity and text classification, and the resulting topic word embeddings carry more semantic information.
Keywords: HDP topic model  topic word embeddings  Skip-Gram model
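
The pipeline described in the abstract (topic labels as pseudo-words, joint Skip-Gram training, then concatenation) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the per-word topic labels are given by hand rather than inferred by an HDP sampler, each word is simply followed by its pseudo-word `<topic_k>` in the training stream, the toy English corpus stands in for Khmer text, and `interleave_topics`, `train_skipgram` and `topic_word_embedding` are hypothetical helper names. The trainer is a small Skip-Gram with negative sampling written in plain Python, not the authors' implementation.

```python
import math
import random

def interleave_topics(tokens, topics):
    """Follow each word with its topic label rendered as a pseudo-word (assumed scheme)."""
    mixed = []
    for word, topic in zip(tokens, topics):
        mixed.extend([word, f"<topic_{topic}>"])
    return mixed

def train_skipgram(sequences, dim=8, window=2, negatives=3, epochs=30, lr=0.05, seed=0):
    """Tiny Skip-Gram with negative sampling; returns {token: input vector}."""
    rng = random.Random(seed)
    vocab = sorted({tok for seq in sequences for tok in seq})
    index = {tok: i for i, tok in enumerate(vocab)}
    w_in = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in vocab]
    w_out = [[0.0] * dim for _ in vocab]
    for _ in range(epochs):
        for seq in sequences:
            for pos, center in enumerate(seq):
                c = index[center]
                lo, hi = max(0, pos - window), min(len(seq), pos + window + 1)
                for ctx in range(lo, hi):
                    if ctx == pos:
                        continue
                    # One observed context (label 1) plus random negatives (label 0).
                    samples = [(index[seq[ctx]], 1.0)]
                    samples += [(rng.randrange(len(vocab)), 0.0) for _ in range(negatives)]
                    grad = [0.0] * dim
                    for target, label in samples:
                        score = sum(a * b for a, b in zip(w_in[c], w_out[target]))
                        err = (1.0 / (1.0 + math.exp(-score)) - label) * lr
                        for k in range(dim):
                            grad[k] += err * w_out[target][k]
                            w_out[target][k] -= err * w_in[c][k]
                    for k in range(dim):
                        w_in[c][k] -= grad[k]
    return {tok: w_in[index[tok]] for tok in vocab}

def topic_word_embedding(vectors, word, topic):
    """Concatenate the word vector with the vector of its topic pseudo-word."""
    return vectors[word] + vectors[f"<topic_{topic}>"]

# Toy corpus (not from the paper): "bank" is labelled topic 0 in one
# sentence and topic 1 in the other; the labels are hand-assigned.
docs = [
    (["bank", "loan", "money"], [0, 0, 0]),
    (["bank", "river", "water"], [1, 1, 1]),
]
sequences = [interleave_topics(tokens, topics) for tokens, topics in docs]
vectors = train_skipgram(sequences)

bank_finance = topic_word_embedding(vectors, "bank", 0)
bank_river = topic_word_embedding(vectors, "bank", 1)
print(len(bank_finance))
```

In this sketch the two embeddings of "bank" share the same word half and differ only in the topic half, which is how concatenation lets one surface form receive a different representation per topic.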
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).