
Detection of unsupervised offensive speech based on multilingual BERT
Cite this article:Xiayang SHI, Fengyuan ZHANG, Jiaqi YUAN, Min HUANG. Detection of unsupervised offensive speech based on multilingual BERT[J]. Journal of Computer Applications, 2022, 42(11): 3379-3385. DOI: 10.11772/j.issn.1001-9081.2021112005
Authors:Xiayang SHI  Fengyuan ZHANG  Jiaqi YUAN  Min HUANG
Affiliation:College of Software Engineering, Zhengzhou University of Light Industry, Zhengzhou, Henan 450001, China
College of Mathematics and Information Science, Zhengzhou University of Light Industry, Zhengzhou, Henan 450001, China
Abstract:Offensive speech has a serious negative impact on social stability. Automatic detection of offensive speech currently focuses on a few high-resource languages, and the lack of sufficient annotated offensive-speech corpora makes detection difficult for low-resource languages. To address this problem, a cross-language unsupervised offensiveness transfer detection method was proposed. Firstly, the multilingual BERT (multilingual Bidirectional Encoder Representations from Transformers, mBERT) model was used to learn offensive features on a high-resource English dataset, yielding an original model. Then, by analyzing the language similarity between English and Danish, Arabic, Turkish, and Greek, the original model was transferred to these four low-resource languages to achieve automatic detection of offensive speech in them. Experimental results show that, compared with the four methods of BERT, Linear Regression (LR), Support Vector Machine (SVM), and Multi-Layer Perceptron (MLP), the proposed method improves both the accuracy and the F1 score of offensive speech detection in Danish, Arabic, Turkish, and Greek by nearly 2 percentage points, approaching the performance of current supervised detection. This shows that combining cross-language model transfer learning with transfer detection can achieve unsupervised offensiveness detection for low-resource languages.

Keywords:cross-language model  offensive speech detection  BERT (Bidirectional Encoder Representations from Transformers)  unsupervised method  Transfer Learning (TL)
Received:2021-11-25
Revised:2021-12-31
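
The abstract reports its results as accuracy and F1 score. The following is a minimal sketch of how these two metrics can be computed for a binary offensive/not-offensive task. It assumes labels are encoded as 1 (offensive) and 0 (not offensive) and that F1 is taken on the offensive class; the abstract does not state whether the paper uses binary or macro-averaged F1, and the function name `accuracy_and_f1` is hypothetical, chosen only for illustration.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float]:
    """Return (accuracy, F1 on the positive class) for two label sequences.

    Assumption: binary labels, with `positive` marking the offensive class.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts for the positive (offensive) class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return accuracy, f1


# Illustrative labels only; these are not data from the paper.
gold = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(gold, predicted)
```

A gain of "nearly 2 percentage points" then corresponds to a difference of roughly 0.02 between two such scores when they are expressed as fractions.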