
A phrase-based Khmer-Chinese bilingual LDA topic model construction method
Citation: XIE Qing, YAN Xin, NUO Yu, XU Guang-yi, ZHOU Feng, GUO Jian-yi. A phrase-based Khmer-Chinese bilingual LDA topic model construction method[J]. Computer Engineering & Science, 2019, 41(8): 1497-1503.
Authors: XIE Qing  YAN Xin  NUO Yu  XU Guang-yi  ZHOU Feng  GUO Jian-yi
Affiliation: (1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, China; 2. Yunnan Nantian Electronics Information Co., Ltd., Kunming 650041, China)
Funding: National Natural Science Foundation of China (61462055, 61562049)
Received: 2018-07-03
Revised: 2019-08-25
Abstract: To obtain the topic distribution of bilingual documents effectively, we propose a phrase-based Khmer-Chinese bilingual LDA topic model. We replace the bag-of-words assumption of the traditional LDA topic model with a phrase (N-gram) representation, so that word order and context are taken into account during topic prediction, and we apply the resulting model to a bilingual setting built on comparable corpora. The model is a three-layer Bayesian network. Within this framework, we first collect comparable Chinese and Khmer corpora, in which each pair of comparable bilingual documents shares a single topic distribution. We then introduce a topic model that discovers both topics and topical phrases: for each word, a topic is sampled first; next, the word's status as part of a phrase is sampled; finally, the word itself is sampled from the phrase distribution of that topic. Experimental results show that the phrase-based bilingual LDA topic model captures the topics of an article better than a general bilingual LDA model and has better topic prediction ability.
Keywords: Khmer-Chinese bilingual; phrase; topic model
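
The three sampling steps named in the abstract (topic, phrase status, word) can be illustrated with a small generative sketch. The abstract does not give the model's equations, so everything beyond those three steps is an assumption made here for illustration: the symmetric Dirichlet and Beta priors, the bigram form of the phrase distribution (borrowed from topical n-gram style models), the dependence of the phrase flag on the previous topic and word, and all function and parameter names. It should not be read as the authors' specification.

```python
import numpy as np

def sample_language_params(rng, n_topics, vocab_size, beta=0.5, delta=0.5, gamma=1.0):
    """Draw hypothetical per-language parameters (priors are assumptions)."""
    # phi[k]: unigram word distribution of topic k, shape (K, V)
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    # sigma[k, v]: distribution of the next word given topic k and previous word v
    sigma = rng.dirichlet(np.full(vocab_size, delta), size=(n_topics, vocab_size))
    # psi[k, v]: probability that the word following v (under topic k) continues a phrase
    psi = rng.beta(gamma, gamma, size=(n_topics, vocab_size))
    return {"phi": phi, "sigma": sigma, "psi": psi}

def generate_document(rng, theta, params, length):
    """Generate one document: topic, then phrase flag, then word, per position."""
    topics, flags, words = [], [], []
    for i in range(length):
        # Step 1: sample the topic of this word from the document-pair mixture.
        z = int(rng.choice(len(theta), p=theta))
        # Step 2: sample the phrase flag; the first word cannot continue a phrase.
        if i == 0:
            x = 0
        else:
            x = int(rng.random() < params["psi"][topics[-1], words[-1]])
        # Step 3: sample the word from the topic's phrase (bigram) distribution
        # if the flag is set, otherwise from the topic's unigram distribution.
        dist = params["sigma"][z, words[-1]] if x == 1 else params["phi"][z]
        w = int(rng.choice(len(dist), p=dist))
        topics.append(z)
        flags.append(x)
        words.append(w)
    return topics, flags, words

def generate_pair(rng, n_topics, params_km, params_zh, len_km, len_zh, alpha=0.5):
    """Generate a comparable Khmer/Chinese pair that shares one topic mixture."""
    theta = rng.dirichlet(np.full(n_topics, alpha))
    doc_km = generate_document(rng, theta, params_km, len_km)
    doc_zh = generate_document(rng, theta, params_zh, len_zh)
    return theta, doc_km, doc_zh

rng = np.random.default_rng(0)
params_km = sample_language_params(rng, n_topics=3, vocab_size=20)
params_zh = sample_language_params(rng, n_topics=3, vocab_size=30)
theta, doc_km, doc_zh = generate_pair(rng, 3, params_km, params_zh, len_km=12, len_zh=15)
```

The shared `theta` is the only link between the two languages in this sketch, mirroring the statement that each comparable document pair shares one topic distribution; the vocabularies and word distributions stay language-specific. Inference (for example Gibbs sampling over the topic and phrase-flag variables) is not shown.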
This article is indexed in Wanfang Data and other databases.