
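Social Media Text Classification Method Based on Character-Word Feature Self-attention Learning
Cite this article: WANG Xiaoli, YE Dongyi. Social Media Text Classification Method Based on Character-Word Feature Self-attention Learning[J]. Pattern Recognition and Artificial Intelligence, 2020, 33(4): 287-294. DOI: 10.16451/j.cnki.issn1003-6059.202004001
Authors: WANG Xiaoli, YE Dongyi
Affiliations: 1. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108
2. Key Laboratory of Spatial Data Mining and Information Sharing, Ministry of Education, Fuzhou University, Fuzhou 350108
Funding: National Natural Science Foundation of China; Fujian Provincial University-Industry Cooperation Science and Technology Project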
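Abstract: The pronounced long-tail effect and the large number of out-of-vocabulary (OOV) words in social media text cause severe feature sparsity, which lowers the accuracy of classification models. To address this problem, this paper proposes a social media text classification method based on character-word feature self-attention learning. Global features are constructed at the character level and used to learn the attention weight distribution over the words in a text. The existing multi-head attention mechanism is modified to reduce its parameter count and computational complexity. To better analyze the role of character-word feature fusion, an OOV sensitivity measure is proposed to quantify how strongly different types of features are affected by OOV words. Experiments on several social media text classification tasks show that the method fuses character and word features effectively and yields a noticeable improvement in classification accuracy. The quantitative OOV sensitivity results further support the feasibility and effectiveness of the method.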

Keywords: Social Media Text Classification  Self-attention Mechanism  Character-Word Feature Fusion  Out-of-Vocabulary Sensitivity
Received: 2020-01-02

Social Media Text Classification Method Based on Character-Word Feature Self-attention Learning
WANG Xiaoli,YE Dongyi. Social Media Text Classification Method Based on Character-Word Feature Self-attention Learning[J]. Pattern Recognition and Artificial Intelligence, 2020, 33(4): 287-294. DOI: 10.16451/j.cnki.issn1003-6059.202004001
Authors: WANG Xiaoli, YE Dongyi
Affiliations: 1. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108
2. Key Laboratory of Spatial Data Mining and Information Sharing, Ministry of Education, Fuzhou University, Fuzhou 350108
Abstract: The pronounced long-tail effect and the excess of out-of-vocabulary (OOV) words in social media texts cause severe feature sparsity and reduce classification accuracy. To address this problem, a social media text classification method based on character-word feature self-attention learning is proposed. Global features are constructed at the character level to learn the attention weight distribution over the words in a text, and the existing multi-head attention mechanism is modified to reduce its parameter count and computational complexity. To further analyze character-word feature fusion, an OOV sensitivity measure is proposed to quantify the impact of OOV words on different types of features. Experiments on several social media text classification tasks indicate that the proposed method fuses word and character features effectively and noticeably improves classification accuracy. Moreover, the quantitative OOV sensitivity results support the feasibility and effectiveness of the proposed method.
Keywords: Social Media Text Classification  Self-attention Learning  Character-Word Feature Fusion  Out-of-Vocabulary Sensitivity
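
The abstract gives no formulas, so the following is only a rough illustration of the general idea it describes: a global feature built from character embeddings acts as a query that assigns attention weights to word embeddings. Everything concrete here is an assumption made for illustration and is not taken from the paper: the mean-pooled character vector, the scaled dot-product scoring, the shared embedding dimension, the function names, and the toy values. The paper's modified multi-head attention and its OOV sensitivity measure are not reproduced.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(word_vecs, char_vecs):
    """Weight word vectors by similarity to a character-level global vector.

    Assumption: the global feature is the mean of the character embeddings,
    and word and character embeddings share the same dimension.
    """
    query = mean_vector(char_vecs)
    scale = math.sqrt(len(query))
    # Scaled dot product between each word vector and the global query.
    scores = [sum(w * q for w, q in zip(vec, query)) / scale for vec in word_vecs]
    weights = softmax(scores)
    dim = len(word_vecs[0])
    # Attention-weighted sum of the word vectors.
    pooled = [sum(a * vec[i] for a, vec in zip(weights, word_vecs)) for i in range(dim)]
    return weights, pooled


# Toy embeddings, made up for illustration. The all-zero third word vector
# stands in for an OOV word that has no trained word embedding.
word_vecs = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
char_vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
weights, pooled = attend(word_vecs, char_vecs)
print(weights, pooled)
```

In this toy setup the character-level query still carries signal when a word embedding is missing, which is the intuition behind combining the two feature types; how the paper actually realizes this is described only in the full text.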
This article is indexed in databases including VIP and Wanfang Data.