Language Identification Method for Similar Languages in Short Text Based on Deep Learning

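Cite this article:	Zhang Linlin, Yang Yating, Chen Zhanheng, Pan Yirong, Li Yu. Language identification method for similar languages in short text based on deep learning [J]. Computer Applications and Software, 2020, 37(2): 124-129, 176.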

Authors:	Zhang Linlin, Yang Yating, Chen Zhanheng, Pan Yirong, Li Yu

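Affiliations:	Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, Xinjiang; University of Chinese Academy of Sciences, Beijing 100049; Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Urumqi 830011, Xinjiang; Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, Xinjiang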

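Funding:	National Natural Science Foundation of China; "West Light" Talent Training Program; Youth Innovation Promotion Association of the Chinese Academy of Sciences; Xinjiang Uygur Autonomous Region project; High-Level Talent Introduction Project of Xinjiang Uygur Autonomous Region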

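Abstract:	In language identification, the traditional N-gram method depends heavily on text length and therefore cannot identify the language of short texts effectively. Existing neural network models cannot take into account both the information within a word and the information from combinations of words, which lowers the quality of short-text language identification. To address these problems, a character-level short-text language identification method based on deep learning is proposed. A convolutional neural network extracts the character combination information within each word from character vectors; a long short-term memory network captures the features between words; a fully connected network performs the identification of similar languages. Experimental results on Uyghur, Kazakh, and the DSL2017 dataset show that the method effectively improves identification accuracy for short texts in similar languages.
Keywords:	language identification; similar languages; short text; neural network; text classification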
LANGUAGE IDENTIFICATION METHOD FOR SIMILAR LANGUAGE IN SHORT TEXT BASED ON DEEP LEARNING

Zhang Linlin, Yang Yating, Chen Zhanheng, Pan Yirong, Li Yu. LANGUAGE IDENTIFICATION METHOD FOR SIMILAR LANGUAGE IN SHORT TEXT BASED ON DEEP LEARNING [J]. Computer Applications and Software, 2020, 37(2): 124-129, 176.

Authors:	Zhang Linlin, Yang Yating, Chen Zhanheng, Pan Yirong, Li Yu

Affiliation:	Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, Xinjiang, China; University of the Chinese Academy of Sciences, Beijing 100049, China; Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Urumqi 830011, Xinjiang, China

Abstract:	In language identification, the traditional N-gram method depends heavily on text length, so it cannot identify the language of short texts effectively. Moreover, existing neural network models cannot consider the information within a word and the information from word combinations at the same time, which reduces the quality of short-text language identification. To address these problems, this paper proposes a character-level short-text language identification method based on deep learning. A CNN obtains the character combination information within each word from the character vectors. An LSTM then captures the features between words. Finally, a fully connected network performs the identification of similar languages. Experimental results on Uyghur and Kazakh corpora and on DSL2017 show that this method effectively improves identification accuracy for short texts in similar languages.

Keywords:	Language identification; Similar language; Short text; Neural network; Text categorization
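
The pipeline summarized in the abstract (character vectors, a CNN within each word, an LSTM across words, then a fully connected classifier) can be illustrated with a minimal forward-pass sketch in plain Python. This is not the authors' implementation: the layer sizes, the single convolution width, the max-pooling step, the omission of bias terms, the whitespace tokenization, and the two language labels are all assumptions made for illustration, and the weights are random and untrained, so the output probabilities carry no meaning beyond showing the data flow.

```python
import math
import random

random.seed(0)

# Hypothetical sizes and labels, chosen only for illustration.
CHAR_DIM, N_FILTERS, KERNEL, HIDDEN = 8, 6, 3, 5
LANGS = ["uig", "kaz"]


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


char_emb = {}  # character -> vector, created lazily with random values


def embed(ch):
    if ch not in char_emb:
        char_emb[ch] = [random.uniform(-0.1, 0.1) for _ in range(CHAR_DIM)]
    return char_emb[ch]


# Untrained parameters (biases omitted to keep the sketch short).
conv_w = rand_matrix(N_FILTERS, KERNEL * CHAR_DIM)
lstm_w = {gate: rand_matrix(HIDDEN, N_FILTERS + HIDDEN) for gate in "ifog"}
fc_w = rand_matrix(len(LANGS), HIDDEN)


def word_vector(word):
    """Character CNN: convolve over character windows, ReLU, max-pool over positions."""
    pad = [[0.0] * CHAR_DIM] * (KERNEL - 1)
    chars = pad + [embed(c) for c in word] + pad
    windows = [sum(chars[i:i + KERNEL], []) for i in range(len(chars) - KERNEL + 1)]
    feats = [[max(0.0, a) for a in matvec(conv_w, w)] for w in windows]
    return [max(col) for col in zip(*feats)]


def lstm(inputs):
    """Run a single-layer LSTM over the word vectors; return the final hidden state."""
    h, c = [0.0] * HIDDEN, [0.0] * HIDDEN
    for x in inputs:
        z = x + h  # concatenate the current input with the previous hidden state
        i = [sigmoid(a) for a in matvec(lstm_w["i"], z)]
        f = [sigmoid(a) for a in matvec(lstm_w["f"], z)]
        o = [sigmoid(a) for a in matvec(lstm_w["o"], z)]
        g = [math.tanh(a) for a in matvec(lstm_w["g"], z)]
        c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h


def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def predict(text):
    """Fully connected layer plus softmax over the final LSTM state."""
    h = lstm([word_vector(w) for w in text.split()])
    return dict(zip(LANGS, softmax(matvec(fc_w, h))))
```

Calling `predict("salam dunya")` returns a dictionary mapping each hypothetical label to a probability. In a real system the parameters would be learned from labeled sentences, and a deep learning framework would typically replace the hand-written loops.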
This article is indexed in databases including VIP and Wanfang Data.