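Textual Feature Based Bilingual Sentence Similarity Measure Between Chinese and Lao
Cite as: TAN Qihui, ZHOU Lanjiang, LIU Chang. Textual Feature Based Bilingual Sentence Similarity Measure Between Chinese and Lao[J]. Journal of Chinese Information Processing (中文信息学报), 2021, 35(10): 64-72
Authors: TAN Qihui (谭琪辉), ZHOU Lanjiang (周兰江), LIU Chang (刘畅)
Affiliations: 1. Key Laboratory of Intelligent Information Processing, School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China;
2. School of Information Science and Technology, Southwest Jiaotong University, Chengdu, Sichuan 611756, China
Funding: National Natural Science Foundation of China (61662040)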
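Abstract: Bilingual sentence similarity measures the degree of semantic similarity between sentences in different languages and plays an important role in information retrieval, parallel corpus construction, and machine translation. Because Chinese-Lao parallel corpora are scarce and Lao differs markedly from Chinese in semantic expression and sentence structure, Chinese-Lao bilingual sentence similarity is difficult to study. This paper proposes a Chinese-Lao bilingual sentence similarity method that fuses textual features, and builds a sentence similarity model. First, the model fuses textual features of Chinese and Lao, such as part of speech and numeral co-occurrence, with GloVe pretrained word vectors, which enriches the sentence features and improves the model's accuracy. Second, a multi-layer siamese network composed of self-attention-based bidirectional long short-term memory (BiLSTM) networks extracts long-distance contextual features and deep semantic information; the self-attention mechanism ensures that the semantic information is used effectively. Finally, transfer learning is used to initialize the model with the parameters of a general model, and different fine-tuning strategies are used to strengthen the model's generalization ability. Experiments show that the proposed method reaches a recall of 82.5%, a precision of 85.78%, and an F1 score of 84.00%.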

Keywords: bilingual sentence similarity; Lao; transfer learning; textual features
Received: 2020-03-19

Textual Feature Based Bilingual Sentence Similarity Measure Between Chinese and Lao
TAN Qihui, ZHOU Lanjiang, LIU Chang. Textual Feature Based Bilingual Sentence Similarity Measure Between Chinese and Lao[J]. Journal of Chinese Information Processing, 2021, 35(10): 64-72
Authors: TAN Qihui, ZHOU Lanjiang, LIU Chang
Affiliation: 1. The Key Laboratory of Intelligent Information Processing, School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China; 2. School of Information Science and Technology, Southwest Jiaotong University, Chengdu, Sichuan 611756, China
Abstract: Bilingual sentence similarity measures the semantic similarity between sentences in different languages and has substantial applications in information retrieval, parallel corpus construction, and machine translation. To address the scarcity of Chinese-Lao parallel corpora and the marked semantic and syntactic differences between Lao and Chinese, this paper proposes a bilingual sentence similarity model for Chinese and Lao that incorporates textual features. First, textual features of Chinese and Lao, including part of speech and numeral co-occurrence, are fused with GloVe pretrained word vectors. Second, long-distance contextual features and deep semantic information are extracted by a multi-layer siamese network composed of self-attention-based bidirectional long short-term memory (BiLSTM) networks. Finally, transfer learning is used to initialize the model with the parameters of a general model, and different fine-tuning strategies are used to improve the model's generalization ability. Experimental results show that the proposed method reaches a recall of 82.5%, a precision of 85.78%, and an F1 score of 84.00%.
Keywords: bilingual sentence similarity; Lao; transfer learning; text features
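
As a rough illustration of the first step described in the abstract, the sketch below concatenates a word vector with a one-hot part-of-speech vector and a numeral co-occurrence flag, then compares two sentences with a shared (siamese-style) encoder. It is not the authors' implementation. The tag set `POS_TAGS`, all function names, the toy embedding table, and the fixed query vector used for attention pooling are assumptions made here. The softmax pooling only stands in for the self-attention BiLSTM layers, which the abstract does not specify in detail. The sketch also assumes that both sentences write numerals with the same digits and that the word vectors of the two languages live in one shared space.

```python
import math

# Hypothetical coarse POS tag set; the abstract does not give the real inventory.
POS_TAGS = ["NOUN", "VERB", "NUM", "OTHER"]


def one_hot(tag):
    """One-hot encode a POS tag over POS_TAGS (unknown tags map to OTHER)."""
    index = POS_TAGS.index(tag) if tag in POS_TAGS else POS_TAGS.index("OTHER")
    return [1.0 if i == index else 0.0 for i in range(len(POS_TAGS))]


def numerals(tokens):
    """Collect the tokens that consist only of digits."""
    return {tok for tok in tokens if tok.isdigit()}


def fuse(tokens, tags, embeddings, shared_numerals):
    """Build [word vector | POS one-hot | numeral co-occurrence flag] per token."""
    dim = len(next(iter(embeddings.values())))
    fused = []
    for tok, tag in zip(tokens, tags):
        vec = embeddings.get(tok, [0.0] * dim)  # zero vector for unknown tokens
        flag = 1.0 if tok in shared_numerals else 0.0
        fused.append(list(vec) + one_hot(tag) + [flag])
    return fused


def attention_pool(vectors, query):
    """Softmax-weighted average of token vectors, scored against `query`."""
    scores = [sum(v * q for v, q in zip(vec, query)) for vec in vectors]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * vec[i] for w, vec in zip(weights, vectors)) / total
        for i in range(len(query))
    ]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity(zh, lo, embeddings, query):
    """Encode both sentences with the same functions and compare the results."""
    zh_tokens, zh_tags = zh
    lo_tokens, lo_tags = lo
    shared = numerals(zh_tokens) & numerals(lo_tokens)
    zh_vec = attention_pool(fuse(zh_tokens, zh_tags, embeddings, shared), query)
    lo_vec = attention_pool(fuse(lo_tokens, lo_tags, embeddings, shared), query)
    return cosine(zh_vec, lo_vec)


# Toy example: hypothetical 2-dimensional vectors, not real GloVe values.
toy_embeddings = {"我": [1.0, 0.0], "ຂ້ອຍ": [0.9, 0.1], "3": [0.0, 1.0]}
zh_sentence = (["我", "3"], ["OTHER", "NUM"])
lo_sentence = (["ຂ້ອຍ", "3"], ["OTHER", "NUM"])
uniform_query = [0.0] * 7  # 2 embedding + 4 POS + 1 flag dims; zeros weight tokens equally
print(round(similarity(zh_sentence, lo_sentence, toy_embeddings, uniform_query), 3))
```

Sharing one set of encoding functions across both inputs is what makes the comparison siamese-style. In a trained model the query vector and the encoder weights would be learned, whereas here they are fixed by hand.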