
News-Case Correlation Analysis Based on Case Element Guidance and Deep Clustering

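Cite this article:	MAO Cunli, WU Xia, ZHU Junguo, YU Zhengtao, LI Yunlong, WANG Zhenhan. News-Case Correlation Analysis Based on Case Element Guidance and Deep Clustering[J]. Journal of Chinese Information Processing, 2021, 34(11): 60-69.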

Authors:	MAO Cunli, WU Xia, ZHU Junguo, YU Zhengtao, LI Yunlong, WANG Zhenhan

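Affiliation:	1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China; 2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, Yunnan 650500, China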

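Funding:	National Natural Science Foundation of China (61732005, 61662041, 61761026, 61866019, 61972186); Key Project of the Yunnan Applied Basic Research Program (2019FA023); Yunnan Reserve Talent Program for Young and Middle-aged Academic and Technical Leaders (2019HB006)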

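Abstract:	Correlation analysis between news and cases is the foundation of news public-opinion analysis in the legal case domain, and it can be cast as a text clustering problem. Lacking effective supervision, traditional clustering methods tend to diverge, which lowers the accuracy of the results. Based on the characteristics of case and news texts, this paper proposes a news-case correlation analysis method based on case element guidance and deep clustering. The method first extracts important sentences to represent each text. It then represents each case by its case elements, which are used to initialize the cluster centers and guide the clustering search. Finally, it uses a convolutional autoencoder to obtain text representations and trains the network jointly with a reconstruction loss and a clustering loss, so that the text representations move closer to the cases. Text representation and clustering are thus unified in a single framework, and the autoencoder parameters and clustering model parameters are updated alternately to cluster the texts. Experiments show that the proposed method improves accuracy by 4.61% over the baseline methods.
Keywords:	correlation analysis; deep clustering; text representation; case elements
Received:	2020-03-09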
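
The clustering step summarized in the abstract can be illustrated with a small sketch. The abstract does not give the form of the clustering loss, so this sketch assumes the DEC-style formulation (Student's t soft assignment plus a KL divergence to a sharpened target distribution), which may differ from the authors' loss. The function names, the weight `gamma`, and the toy 2-D vectors standing in for autoencoder embeddings and case-element representations are all hypothetical.

```python
import math

def soft_assign(z, centers, alpha=1.0):
    """Student's t soft assignment of each embedding to each cluster center."""
    q = []
    for point in z:
        weights = []
        for c in centers:
            d2 = sum((a - b) ** 2 for a, b in zip(point, c))
            weights.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(weights)
        q.append([w / total for w in weights])
    return q

def target_distribution(q):
    """Sharpened targets: square q, divide by cluster frequency, renormalize rows."""
    freq = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        w = [row[j] ** 2 / freq[j] for j in range(len(row))]
        total = sum(w)
        p.append([v / total for v in w])
    return p

def joint_loss(x, x_rec, q, p, gamma=0.1):
    """Reconstruction MSE plus gamma * KL(P || Q), averaged over documents."""
    n_vals = sum(len(row) for row in x)
    rec = sum((a - b) ** 2
              for rx, rr in zip(x, x_rec)
              for a, b in zip(rx, rr)) / n_vals
    kl = sum(pi * math.log(pi / qi)
             for rp, rq in zip(p, q)
             for pi, qi in zip(rp, rq)) / len(q)
    return rec + gamma * kl

# Hypothetical case-element vectors used as the initial cluster centers.
case_centers = [[0.0, 0.0], [4.0, 4.0]]
# Hypothetical news embeddings (stand-ins for the autoencoder's latent codes).
news_z = [[0.2, -0.1], [-0.3, 0.4], [3.8, 4.1], [4.3, 3.7]]

q = soft_assign(news_z, case_centers)
p = target_distribution(q)
labels = [row.index(max(row)) for row in q]
# With a perfect reconstruction, only the clustering term contributes.
loss = joint_loss(news_z, news_z, q, p)
```

In a full training loop the encoder, decoder, and centers would be updated by gradient descent on this loss while the targets `p` are refreshed periodically. That alternation is omitted here to keep the sketch dependency-free.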
Chinese-Burmese Parallel Sentence Pair Extraction Based on CNN-CorrNet

MAO Cunli, WU Xia, ZHU Junguo, YU Zhengtao, LI Yunlong, WANG Zhenhan. Chinese-Burmese Parallel Sentence Pair Extraction Based on CNN-CorrNet[J]. Journal of Chinese Information Processing, 2021, 34(11): 60-69.

Authors:	MAO Cunli, WU Xia, ZHU Junguo, YU Zhengtao, LI Yunlong, WANG Zhenhan

Affiliation:	1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China; 2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, Yunnan 650500, China

Abstract:	A bilingual parallel corpus is a key resource for improving the quality of machine translation. We propose a Chinese-Burmese parallel sentence pair extraction method based on a CNN-CorrNet network. Specifically, we first use BERT to obtain vector representations of Chinese and Burmese words, and use a convolutional neural network to represent Chinese and Burmese sentences and capture their important features. Then, to maximize the correlation between the cross-language representations of the two languages, existing Chinese-Burmese parallel sentence pairs are used as constraints, and CorrNet (Correlational Neural Networks) is applied to map the Chinese and Burmese sentence representations into a common semantic space. Finally, the distance between Chinese and Burmese sentences in the common semantic space is calculated to identify true bilingual sentence pairs. The experimental results show that, compared with the maximum entropy model and the Siamese network model, the F₁ value of the proposed method increases by 13.3% and 5.1%, respectively.

Keywords:	Chinese-Burmese; bilingual parallel sentence pair; CNN; correlational neural networks; common semantic space
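
The final step of the English abstract, comparing sentences in the common semantic space, can be sketched as follows. This is only an illustration: the abstract does not name the distance measure or the decision rule, so cosine similarity, best-match selection, and the `threshold` value are assumptions, and the toy vectors stand in for CNN-CorrNet outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def extract_pairs(zh_vecs, my_vecs, threshold=0.8):
    """For each Chinese sentence vector, keep its best-matching Burmese
    sentence vector if the similarity reaches the (assumed) threshold."""
    pairs = []
    if not my_vecs:
        return pairs
    for i, u in enumerate(zh_vecs):
        j, score = max(((j, cosine(u, v)) for j, v in enumerate(my_vecs)),
                       key=lambda item: item[1])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs

# Hypothetical common-space representations of three Chinese sentences...
zh = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
# ...and of two Burmese sentences.
my = [[0.0, 0.9, 0.1], [0.95, 0.05, 0.0]]

pairs = extract_pairs(zh, my)
matched = [(i, j) for i, j, _ in pairs]
```

Here the third Chinese vector has no sufficiently close Burmese counterpart and is left unpaired, which mirrors how non-parallel sentences would be filtered out.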
