
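Double-Weighted Disambiguation Algorithm for Long-tail Polyphone Problem
Cite this article: GAO Yu, XIONG Yijin, YE Jiancheng. Double-Weighted Disambiguation Algorithm for Long-tail Polyphone Problem [J]. Journal of Chinese Information Processing, 2022, 36(11): 169-176.
Authors: GAO Yu, XIONG Yijin, YE Jiancheng
Affiliation: AI Innovation Center, Midea Group (Shanghai) Co., Ltd., Shanghai 201702, China
Abstract: Long-tail data distributions are a common problem in NLP practice. Taking polyphone disambiguation in the text-to-speech (TTS) front end as an example, the extreme imbalance of polyphone data and the scarcity of tail data degrade the practical performance of industrial TTS systems. This paper observes that Chinese polyphones are long-tail distributed along both the "character" and "pronunciation" dimensions, and therefore proposes a targeted double-weighted (DW) algorithm. DW can be combined with either of two long-tail algorithms, MARC and Decouple-cRT, to further improve model performance. On open-source and industrial data, DW achieves varying degrees of accuracy improvement over the baseline model and the two original algorithms, offering a solution and a point of reference for multi-dimensional long-tail problems.

Keywords: polyphone disambiguation; long-tail distribution; re-weighting; decoupling representation and classifier
Received: 2022-04-21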

Double-Weighted Disambiguation Algorithm for Long-tail Polyphone Problem
GAO Yu, XIONG Yijin, YE Jiancheng. Double-Weighted Disambiguation Algorithm for Long-tail Polyphone Problem [J]. Journal of Chinese Information Processing, 2022, 36(11): 169-176.
Authors: GAO Yu, XIONG Yijin, YE Jiancheng
Affiliation: AI Innovation Center, Midea Group (Shanghai) Co., Ltd., Shanghai 201702, China
Abstract: Long-tail data distributions are common in NLP practice. Taking polyphone disambiguation in the text-to-speech (TTS) front end as an example, extreme data imbalance and the scarcity of tail data degrade industrial online TTS applications. Observing that Chinese polyphones are long-tail distributed along both the "character" and "pronunciation" dimensions, this paper proposes a double-weighted (DW) algorithm, which can be combined with two existing long-tail algorithms, MARC and Decouple-cRT. On both open-source and industrial data, DW improves accuracy over the baseline model and the two original algorithms.
Keywords: polyphone disambiguation; long-tail distribution; re-weighting; decoupling representation and classifier
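
The abstract does not give the DW weighting formula, so the following is only an illustrative sketch of what re-weighting along two long-tail dimensions could look like, not the authors' method. It assumes inverse-frequency weights: one factor computed per character and one per pronunciation within that character, multiplied together and normalised. The function name `double_weights`, the exponents `alpha` and `beta`, and the toy data are all hypothetical.

```python
from collections import Counter


def double_weights(samples, alpha=0.5, beta=1.0):
    """Return a loss weight for every (character, pronunciation) pair.

    samples: list of (character, pronunciation) training labels.
    alpha:   strength of the character-level re-weighting (assumed).
    beta:    strength of the pronunciation-level re-weighting (assumed).
    """
    char_counts = Counter(ch for ch, _ in samples)
    pair_counts = Counter(samples)

    weights = {}
    for (ch, pron), n_pair in pair_counts.items():
        # Dimension 1: rare characters get a larger weight.
        w_char = (1.0 / char_counts[ch]) ** alpha
        # Dimension 2: rare pronunciations of the same character get a larger weight.
        w_pron = (char_counts[ch] / n_pair) ** beta
        weights[(ch, pron)] = w_char * w_pron

    # Normalise so that the average weight over all training samples is 1.
    mean = sum(weights[s] for s in samples) / len(samples)
    return {pair: w / mean for pair, w in weights.items()}


# Toy data (made up for illustration): a frequent character with a skewed
# pronunciation split, and a rare character with a balanced split.
toy = ([("行", "xing2")] * 6 + [("行", "hang2")] * 2
       + [("朴", "pu3")] + [("朴", "piao2")])
weights = double_weights(toy)
for pair, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(pair, round(w, 3))
```

In a training loop, such weights would typically multiply each sample's cross-entropy term, so that tail characters and tail pronunciations contribute more to the gradient than their raw frequency would allow.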