
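Identification of Similar Languages in Short Conversational-Style Texts Using Out-of-Domain Data
Cite this article: HE Junqing, HUANG Xian, ZHAO Xuemin, ZHANG Keliang. Identification of Similar Languages in Short Conversational-Style Texts Using Out-of-Domain Data[J]. Journal of Chinese Information Processing, 2019, 33(3): 71-78.
Authors: HE Junqing  HUANG Xian  ZHAO Xuemin  ZHANG Keliang
Affiliation: 1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China;
2. University of Chinese Academy of Sciences, Beijing 100049, China;
3. Luoyang Division, Information Engineering University, Luoyang, Henan 471003, China
Funding: National Natural Science Foundation of China (11590771, 11590770-4, 11722437, 61650202, U1536117, 61671442, 11674352, 11504406, 61601453); National Key Research and Development Program of China (2016YFB0801203, 2016YFC0800503, 2017YFB1002803); Major Science and Technology Project of the Xinjiang Uygur Autonomous Region (2016A03007-1); Youth Talent Program of the Institute of Acoustics, Chinese Academy of Sciences (QNYC201603)
Abstract: Taking Uyghur and Kazakh as an example of a pair of similar languages, this paper supplements the original corpus with out-of-domain data when Kazakh data are limited and, after assimilating the added data, improves the accuracy of language identification on short conversational-style texts. The paper analyzes the morphological characteristics of Uyghur and Kazakh, designs several kinds of features, and builds a maximum entropy classifier. On the test set, the classifier identifies Uyghur and Kazakh short conversational-style texts with an accuracy of 95.7%, whereas a CNN classifier reaches only 69.1%. The experimental results show that the system is also applicable to language identification of short conversational-style texts in other languages.

Keywords: similar language identification  out-of-domain data  short conversational-style texts  character-level morphological features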

A Study on Discrimination Between Identification of Similar Languages on Short Conversational Texts with Out-of-domain Data
HE Junqing,HUANG Xian,ZHAO Xuemin,ZHANG Keliang.A Study on Discrimination Between Identification of Similar Languages on Short Conversational Texts with Out-of-domain Data[J].Journal of Chinese Information Processing,2019,33(3):71-78.
Authors:HE Junqing  HUANG Xian  ZHAO Xuemin  ZHANG Keliang
Affiliation:1.Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China;
2.University of Chinese Academy of Sciences, Beijing 100049, China;
3.Luoyang Division, Information Engineering University, Luoyang, Henan 471003, China
Abstract: This paper aims to identify similar languages, such as Uyghur and Kazakh, in short conversational texts. To alleviate the severe data imbalance resulting from low-resource Kazakh, we apply a compensation strategy and an assimilation method by selecting appropriate out-of-domain data. We then construct a maximum entropy (MaxEnt) classifier based on morphological features to discriminate between the two languages, and we investigate the contribution of each feature. Experimental results suggest that the MaxEnt classifier effectively discriminates between Uyghur and Kazakh on the test set with an accuracy of 95.7%, and that it outperforms the winning system of the VarDial’2016 DSL shared task on test sets B1 and B2 by 0.6% and 1.2%, respectively.
Keywords:similar language identification  out-of-domain data  short conversational texts  morphological features  
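
The abstract names a maximum entropy classifier over morphological features but does not list the features or the training setup. The following is a minimal sketch under stated assumptions, not the authors' system: for two classes a MaxEnt model reduces to logistic regression, character 1- to 3-grams stand in for the unspecified morphological features, and the training strings are invented placeholders rather than Uyghur or Kazakh data. The names `char_ngrams`, `train_maxent`, `predict`, and `TOY` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams; padding spaces mark word boundaries,
    so prefixes and suffixes get their own features."""
    padded = f" {text} "
    feats = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            feats[padded[i:i + n]] += 1
    return feats


def score(weights, bias, feats):
    """Linear score of a feature vector under the current model."""
    return bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_maxent(samples, epochs=100, lr=0.1, l2=1e-3):
    """Binary MaxEnt (logistic regression) fitted by stochastic gradient
    ascent on the L2-regularised log-likelihood."""
    weights, bias = {}, 0.0
    data = [(char_ngrams(text), label) for text, label in samples]
    for _ in range(epochs):
        for feats, label in data:
            err = label - sigmoid(score(weights, bias, feats))
            bias += lr * err
            for f, v in feats.items():
                w = weights.get(f, 0.0)
                weights[f] = w + lr * (err * v - l2 * w)
    return weights, bias


def predict(model, text):
    """Return 1 or 0 depending on the sign of the linear score."""
    weights, bias = model
    return 1 if score(weights, bias, char_ngrams(text)) > 0 else 0


# Invented toy strings (assumption): the two classes share stems and
# differ only in the suffix, mimicking a morphological cue.
TOY = [
    ("talar", 1), ("kelar", 1), ("barar", 1), ("sonar", 1),
    ("talip", 0), ("kelip", 0), ("barip", 0), ("sonip", 0),
]

model = train_maxent(TOY)
```

Calling `predict(model, "jurar")` on an unseen stem shows how the suffix n-grams drive the decision. A real system would replace `TOY` with labelled corpus sentences and would need a held-out test set to report accuracy.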
This article is indexed in VIP (维普) and other databases.