Chinese-Vietnamese news topic discovery method based on cross-language neural topic model
Citation: YANG Weiya, YU Zhengtao, GAO Shengxiang, SONG Ran. Chinese-Vietnamese news topic discovery method based on cross-language neural topic model[J]. Journal of Computer Applications, 2021, 41(10): 2879-2884.
Authors: YANG Weiya  YU Zhengtao  GAO Shengxiang  SONG Ran
Affiliation: 1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China; 2. Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming, Yunnan 650500, China
Funding: National Natural Science Foundation of China (61972196, 61762056, 61472168); Yunnan Provincial Major Science and Technology Special Project (202002AD080001); Yunnan Provincial High-Tech Industry Special Project (201606).
Abstract: In the Chinese-Vietnamese cross-language news topic discovery task, Chinese-Vietnamese parallel corpora are scarce, which makes high-quality bilingual word embeddings difficult to train, and news texts are generally long, so methods based on bilingual word embeddings struggle to represent them well. To address these problems, a Chinese-Vietnamese news topic discovery method based on a Cross-Language Neural Topic Model (CL-NTM) is proposed. The method represents each news text by its topic information, converting the bilingual semantic alignment task into a bilingual topic alignment task. First, neural topic models based on the variational autoencoder are trained separately for Chinese and Vietnamese to obtain monolingual abstract topic representations. Then, a small-scale parallel corpus is used to map the bilingual topics into the same semantic space. Finally, the K-means method is used to cluster the bilingual topic representations and thereby discover the topics of news event clusters. Experimental results show that, compared with the Improved Chinese-English Latent Dirichlet Allocation model (ICE-LDA), the proposed method improves the Macro-F1 value and topic coherence by 4 percentage points and 7 percentage points respectively, indicating that it can effectively improve both the clustering performance and the interpretability of news topics.

Keywords: cross-language  topic alignment  Neural Topic Model (NTM)  K-means clustering  topic discovery
Received: 2020-12-29
Revised: 2021-04-22
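
The last two steps of the abstract (mapping bilingual topics into one space with a small parallel corpus, then K-means clustering) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the mapping is learned, so a plain linear map fitted by least squares (via gradient descent) is assumed here, the neural topic models are skipped entirely, and every topic vector below is made-up data. The names `fit_linear_map`, `apply_map`, and `kmeans` are hypothetical helpers written for this example.

```python
# Toy sketch: align Vietnamese topic vectors to the Chinese topic space,
# then cluster documents of both languages together.
# All data and function names are hypothetical; NTM training is omitted.

def apply_map(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def fit_linear_map(src, tgt, lr=0.5, epochs=500):
    """Fit W minimising the mean of ||W s - t||^2 over parallel pairs (s, t).

    Assumption: a linear map is adequate; the source does not specify this.
    """
    k_src, k_tgt, n = len(src[0]), len(tgt[0]), len(src)
    W = [[0.0] * k_src for _ in range(k_tgt)]
    for _ in range(epochs):
        grad = [[0.0] * k_src for _ in range(k_tgt)]
        for s, t in zip(src, tgt):
            pred = apply_map(W, s)
            for i in range(k_tgt):
                err = pred[i] - t[i]
                for j in range(k_src):
                    grad[i][j] += 2.0 * err * s[j] / n
        for i in range(k_tgt):
            for j in range(k_src):
                W[i][j] -= lr * grad[i][j]
    return W

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain K-means with deterministic farthest-first initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Small "parallel corpus": the same documents as topic vectors in each language.
# In this toy setup the Vietnamese topic indices are a permutation of the Chinese ones.
vi_parallel = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
zh_parallel = [[v[2], v[0], v[1]] for v in vi_parallel]
W = fit_linear_map(vi_parallel, zh_parallel)

# Non-parallel news documents, one per event in each language.
zh_docs = [[0.7, 0.2, 0.1], [0.1, 0.75, 0.15], [0.1, 0.2, 0.7]]
vi_docs = [[0.15, 0.1, 0.75], [0.7, 0.1, 0.2], [0.2, 0.7, 0.1]]
vi_mapped = [apply_map(W, v) for v in vi_docs]

labels = kmeans(zh_docs + vi_mapped, k=3)
print(labels)  # cluster id per document: three Chinese docs, then three Vietnamese docs
```

In this toy setting each Vietnamese document ends up in the same cluster as the Chinese document about the same event, which is the behaviour the topic alignment step is meant to enable. A real system would replace the hand-written vectors with the outputs of the trained neural topic models and would likely use a library implementation of K-means.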
This article is indexed in Wanfang Data and other databases.