
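Emotional Speaker Recognition via Joint Learning of Two-Level Features
Cite this article:LIU Jinlin, LI Dongdong, WANG Zhe, CAI Lizhi. Emotional Speaker Recognition via Joint Learning of Two-Level Features[J]. Computer Engineering and Applications, 2023, 59(1): 149-155.
Authors:LIU Jinlin  LI Dongdong  WANG Zhe  CAI Lizhi
Affiliation:1.School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237 2.Jiangsu Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006
Funding:National Natural Science Foundation of China (61806078); National Major Science and Technology Project for New Drug Development (2019ZX09210004); "Shuguang Program" of the Shanghai Education Development Foundation and the Shanghai Municipal Education Commission (61725301)
Abstract:To address the problem that speaker recognition performance is easily degraded by emotional factors, this paper proposes a method that jointly learns segment-level and frame-level features. A long short-term memory (LSTM) network is trained on the speaker recognition task, and its sequence output is extracted as the segment-level emotional speaker feature, which preserves the original information of the speech frame features while strengthening the expression of emotional information. A fully connected network then further learns the speaker information of each feature frame within the segment-level feature, which enhances the speaker representation ability of the frame-level feature. Finally, the segment-level and frame-level features are concatenated to obtain the final speaker feature, strengthening its representational power. Experiments on the Mandarin emotional speech corpus (MASC) verify the effectiveness of the proposed method and also examine how the number of speech frames contained in a segment-level feature and different emotional states affect emotional speaker recognition.

Keywords:emotional speaker recognition  long short-term memory network  deep neural network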

Segment-Level Feature and Frame-Level Feature Joint Learning for Emotional Speaker Recognition
LIU Jinlin, LI Dongdong, WANG Zhe, CAI Lizhi. Segment-Level Feature and Frame-Level Feature Joint Learning for Emotional Speaker Recognition[J]. Computer Engineering and Applications, 2023, 59(1): 149-155.
Authors:LIU Jinlin  LI Dongdong  WANG Zhe  CAI Lizhi
Affiliation:1.School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China 2.Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China
Abstract:Speaker recognition performance is easily degraded by emotional factors. This paper proposes a joint learning method that uses segment-level and frame-level features. To retain the original speaker information of each frame while fully expressing the emotional information, a long short-term memory (LSTM) network is used to extract the sequence output as a segment-level emotional speaker embedding. Each frame of the segment-level feature is then learned by a fully connected network to improve the representation ability of the frame-level feature. Finally, the segment-level and frame-level features are concatenated to form the final speaker embedding, which further improves the expressive power of the feature. Experiments on the Mandarin emotional speech corpus (MASC) verify the effectiveness of the proposed method. The paper also discusses the suitable number of frames contained in a segment-level feature and the effects of different emotional states on emotional speaker recognition.
Keywords:emotional speaker recognition  long short-term memory  deep neural network
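
The data flow described in the abstract (LSTM sequence output as the segment-level feature, a fully connected layer applied to each frame of that output, then concatenation) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not give layer sizes, activations, or how per-frame outputs are combined, so the dimensions `T`, `D`, `H`, `F`, the ReLU activation, the flattening of the sequence output, and the mean pooling over frames are all assumptions made for illustration. The weights are random and untrained, whereas the paper trains the network on a speaker recognition task.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyLSTM:
    """Single-layer LSTM that returns the hidden state at every time step."""

    GATES = ("i", "f", "g", "o")  # input, forget, cell candidate, output

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        # One weight matrix per gate, acting on the concatenation [x; h].
        self.W = {k: rand_matrix(hid_dim, in_dim + hid_dim, rng) for k in self.GATES}
        self.b = {k: [0.0] * hid_dim for k in self.GATES}

    def forward(self, frames):
        h = [0.0] * self.hid_dim
        c = [0.0] * self.hid_dim
        outputs = []
        for x in frames:
            xh = list(x) + h
            pre = {k: [a + b for a, b in zip(matvec(self.W[k], xh), self.b[k])]
                   for k in self.GATES}
            i_t = [sigmoid(v) for v in pre["i"]]
            f_t = [sigmoid(v) for v in pre["f"]]
            g_t = [math.tanh(v) for v in pre["g"]]
            o_t = [sigmoid(v) for v in pre["o"]]
            c = [f * ck + i * g for f, ck, i, g in zip(f_t, c, i_t, g_t)]
            h = [o * math.tanh(ck) for o, ck in zip(o_t, c)]
            outputs.append(h)
        return outputs  # T vectors of size H (the "sequence output")


def joint_embedding(frames, lstm, fc_W, fc_b):
    seq_out = lstm.forward(frames)
    # Segment-level feature: the whole sequence output, flattened (assumption).
    segment_feat = [v for h in seq_out for v in h]
    # Frame-level branch: the same fully connected layer (with ReLU, an
    # assumption) is applied to each time step of the sequence output.
    frame_out = [[max(0.0, a + b) for a, b in zip(matvec(fc_W, h), fc_b)]
                 for h in seq_out]
    # Mean pooling over frames is an assumption, not stated in the abstract.
    frame_feat = [sum(col) / len(frame_out) for col in zip(*frame_out)]
    # Final speaker feature: concatenation of the two levels.
    return segment_feat + frame_feat


rng = random.Random(0)
T, D, H, F = 5, 8, 4, 3  # frames per segment, input dim, LSTM dim, FC dim (all hypothetical)
frames = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(T)]
lstm = TinyLSTM(D, H, rng)
fc_W, fc_b = rand_matrix(F, H, rng), [0.0] * F
emb = joint_embedding(frames, lstm, fc_W, fc_b)
print(len(emb))  # T*H + F = 23
```

Under these assumptions the final feature has length T*H + F, so the number of frames per segment directly sets the size of the segment-level part, which is one reason the choice of segment length matters.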