LANGUAGE IDENTIFICATION METHOD FOR SIMILAR LANGUAGE IN SHORT TEXT BASED ON DEEP LEARNING
Original title (Chinese): 基于深度学习的相似语言短文本的语种识别方法
Citation: Zhang Linlin, Yang Yating, Chen Zhanheng, Pan Yirong, Li Yu. LANGUAGE IDENTIFICATION METHOD FOR SIMILAR LANGUAGE IN SHORT TEXT BASED ON DEEP LEARNING[J]. Computer Applications and Software, 2020, 37(2): 124-129, 176.
Authors: Zhang Linlin (张琳琳)  Yang Yating (杨雅婷)  Chen Zhanheng (陈沾衡)  Pan Yirong (潘一荣)  Li Yu (李毓)
Affiliation: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, Xinjiang, China; University of the Chinese Academy of Sciences, Beijing 100049, China; Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Urumqi 830011, Xinjiang, China
Funding: National Natural Science Foundation of China; "Light of the West" Talent Training Program; Youth Innovation Promotion Association of the Chinese Academy of Sciences project; Xinjiang Uygur Autonomous Region project; Xinjiang Uygur Autonomous Region High-Level Talent Introduction Project
Abstract: In language identification, the traditional N-gram method depends heavily on text length, so it cannot identify the language of short texts effectively. Existing neural-network models cannot take into account both the information within a word and the information carried by combinations of words, which lowers the quality of short-text language identification. To address these problems, this paper proposes a character-level method for short-text language identification based on deep learning. A convolutional neural network (CNN) extracts the character-combination information within each word from the character vectors; a long short-term memory (LSTM) network then captures the features between words; finally, a fully connected network identifies the language among similar languages. Experimental results on Uyghur and Kazakh data and on the DSL2017 dataset show that the method effectively improves identification accuracy for short texts in similar languages.

Keywords: Language identification  Similar languages  Short text  Neural network  Text classification
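
The three stages named in the abstract (a CNN over the characters of each word, an LSTM over the resulting word vectors, and a fully connected output layer) can be pictured with the forward pass below. This is only a minimal sketch with random, untrained weights, not the authors' implementation: the class name `CharCnnLstmSketch`, the helper `encode`, the Latin toy alphabet, and every hyperparameter (embedding size, filter count, kernel width, hidden size, ReLU activation, max-pooling) are assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class CharCnnLstmSketch:
    """Illustrative forward pass only; all sizes are assumed, not from the paper."""

    def __init__(self, n_chars, n_langs, char_dim=16, n_filters=32, kernel=3, hidden=24):
        self.kernel = kernel
        self.hidden = hidden
        self.emb = rng.normal(0, 0.1, (n_chars, char_dim))            # character vectors
        self.conv_w = rng.normal(0, 0.1, (kernel * char_dim, n_filters))
        self.conv_b = np.zeros(n_filters)
        self.lstm_w = rng.normal(0, 0.1, (n_filters + hidden, 4 * hidden))
        self.lstm_b = np.zeros(4 * hidden)
        self.out_w = rng.normal(0, 0.1, (hidden, n_langs))            # fully connected layer
        self.out_b = np.zeros(n_langs)

    def word_vector(self, char_ids):
        """CNN stage: convolve over a word's characters, then max-pool."""
        x = self.emb[char_ids]                                        # (word_len, char_dim)
        pad = max(0, self.kernel - len(char_ids))                     # pad very short words
        x = np.vstack([x, np.zeros((pad, x.shape[1]))])
        windows = np.stack([x[i:i + self.kernel].ravel()
                            for i in range(len(x) - self.kernel + 1)])
        feats = np.maximum(windows @ self.conv_w + self.conv_b, 0.0)  # ReLU
        return feats.max(axis=0)                                      # (n_filters,)

    def forward(self, words):
        """LSTM stage over word vectors, then softmax over languages."""
        h = np.zeros(self.hidden)
        c = np.zeros(self.hidden)
        for char_ids in words:
            z = np.concatenate([self.word_vector(char_ids), h]) @ self.lstm_w + self.lstm_b
            i, f, o, g = np.split(z, 4)                               # input, forget, output, candidate
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
        logits = h @ self.out_w + self.out_b
        e = np.exp(logits - logits.max())                             # numerically stable softmax
        return e / e.sum()

def encode(text, vocab):
    """Split on whitespace and map each character to an id (0 = unknown)."""
    return [[vocab.get(ch, 0) for ch in word] for word in text.split()]

# Toy usage: a hypothetical Latin alphabet and three hypothetical language labels.
alphabet = "abcdefghijklmnopqrstuvwxyz"
vocab = {ch: i + 1 for i, ch in enumerate(alphabet)}
model = CharCnnLstmSketch(n_chars=len(vocab) + 1, n_langs=3)
probs = model.forward(encode("a short text sample", vocab))
print(probs)
```

Because the weights are untrained, the printed probabilities carry no meaning; the sketch only shows how character-level information inside each word and sequence information across words can be combined in one model, which is the gap the abstract attributes to earlier neural approaches.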
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).