
Similar Language Identification for Short Conversational Texts Using Out-of-domain Data

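Cite this article:	HE Junqing, HUANG Xian, ZHAO Xuemin, ZHANG Keliang. Similar Language Identification for Short Conversational Texts Using Out-of-domain Data[J]. Journal of Chinese Information Processing, 2019, 33(3): 71-78.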

Authors:	HE Junqing, HUANG Xian, ZHAO Xuemin, ZHANG Keliang

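Affiliation:	1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China; 3. Luoyang Division, Information Engineering University, Luoyang, Henan 471003, China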

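Funding:	National Natural Science Foundation of China (11590771, 11590770-4, 11722437, 61650202, U1536117, 61671442, 11674352, 11504406, 61601453); National Key Research and Development Program of China (2016YFB0801203, 2016YFC0800503, 2017YFB1002803); Major Science and Technology Project of Xinjiang Uygur Autonomous Region (2016A03007-1); Young Talent Program of the Institute of Acoustics, Chinese Academy of Sciences (QNYC201603)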

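Abstract:	Taking Uyghur and Kazakh as an example pair of similar languages, this paper supplements the original corpus with out-of-domain data when Kazakh data are limited and, after assimilation, improves the accuracy of language identification on short conversational texts. The paper analyzes the morphological characteristics of Uyghur and Kazakh, designs several kinds of features, and builds a maximum entropy classifier. On the test set, the classifier identifies Uyghur and Kazakh short conversational texts with an accuracy of 95.7%, whereas a CNN classifier reaches only 69.1%. The experimental results indicate that the system is also applicable to language identification on short conversational texts in other languages.
Keywords:	similar language identification; out-of-domain data; short conversational texts; character-level morphological features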
A Study on Discrimination Between Identification of Similar Languages on Short Conversational Texts with Out-of-domain Data

HE Junqing, HUANG Xian, ZHAO Xuemin, ZHANG Keliang. A Study on Discrimination Between Identification of Similar Languages on Short Conversational Texts with Out-of-domain Data[J]. Journal of Chinese Information Processing, 2019, 33(3): 71-78.

Authors:	HE Junqing, HUANG Xian, ZHAO Xuemin, ZHANG Keliang

Affiliation:	1.Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China; 2.University of Chinese Academy of Sciences, Beijing 100049, China; 3.Luoyang Division, Information Engineering University, Luoyang, Henan 471003, China

Abstract:	This paper aims to identify similar languages, such as Uyghur and Kazakh, in short conversational texts. To alleviate the severe data imbalance resulting from low-resource Kazakh, we apply a compensation strategy and an assimilation method that select appropriate out-of-domain data. We then construct a maximum entropy (MaxEnt) classifier based on morphological features to discriminate between the two languages and investigate the contribution of each feature. Experimental results suggest that the MaxEnt classifier effectively discriminates between Uyghur and Kazakh on the test set with an accuracy of 95.7%, outperforming the champion of the VarDial’2016 DSL shared task on test sets B1 and B2 by 0.6% and 1.2%, respectively.

Keywords:	similar language identification; out-of-domain data; short conversational texts; morphological features
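
For readers unfamiliar with the classifier family named in the abstract, the following is a minimal sketch of a two-class maximum entropy model, which is equivalent to logistic regression. It is not the authors' system: the abstract does not specify the morphological features, so character n-gram counts are used here as an assumed stand-in, and the training strings, labels, and all function names (`char_ngrams`, `predict_proba`, `train_maxent`) are hypothetical and purely synthetic.

```python
import math
from collections import Counter


def char_ngrams(text, n_max=3):
    """Count character n-grams (n = 1..n_max) of a space-padded string.

    Assumption: n-gram counts stand in for the paper's morphological
    features, which the abstract does not describe.
    """
    padded = f" {text} "
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            feats[padded[i:i + n]] += 1
    return feats


def predict_proba(weights, feats):
    """Probability of class 1 under a binary MaxEnt (logistic) model."""
    z = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(samples, epochs=100, lr=0.1):
    """Fit feature weights by stochastic gradient ascent on log-likelihood."""
    data = [(char_ngrams(text), label) for text, label in samples]
    weights = {}
    for _ in range(epochs):
        for feats, label in data:
            err = label - predict_proba(weights, feats)
            for f, v in feats.items():
                weights[f] = weights.get(f, 0.0) + lr * err * v
    return weights


# Synthetic toy strings (not real Uyghur or Kazakh): label 1 and label 0
# denote two hypothetical "languages" with different character inventories.
TRAIN = [
    ("abab aba", 1), ("baab ab", 1), ("aab abba", 1),
    ("xyxy yx", 0), ("yxx xy", 0), ("xyy yxxy", 0),
]

if __name__ == "__main__":
    w = train_maxent(TRAIN)
    for text in ("abba ab", "xyx yx"):
        print(text, round(predict_proba(w, char_ngrams(text)), 3))
```

With real data, the toy strings would be replaced by labeled short texts, and the feature extractor by whatever morphological features are appropriate for the language pair.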