
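Evidence classification method of chat text based on DSR and BGRU model
Cite this article: Yu ZHANG, Binglong LI, Xuejuan LI, Heyu ZHANG. Evidence classification method of chat text based on DSR and BGRU model[J]. Chinese Journal of Network and Information Security, 2022, 8(2): 150-159. DOI: 10.11959/j.issn.2096-109x.2022007
Authors: Yu ZHANG  Binglong LI  Xuejuan LI  Heyu ZHANG
Affiliations: 1. Information Engineering University, Zhengzhou 450001, China; 2. Henan Polytechnic University, Jiaozuo 454003, China
Funding: National Natural Science Foundation of China (60903220)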
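Abstract: Chat text evidence produced by instant messaging and other social software is large in volume and contains complex semantics such as criminal slang, so digital forensics cannot quickly identify and extract the chat text evidence related to a criminal event. To address this, a chat text evidence classification model (DSR-BGRU) is proposed, based on the DSR (dynamic semantic representation) model and the BGRU (bidirectional gated recurrent unit) model. The chat text data is preprocessed so that it retains the features of the criminal domain. A DSR-based semantic feature representation method for chat text evidence is designed and implemented, which represents chat text at the semantic level: semantic words are selected by a clustering algorithm, the word vectors of non-semantic words are represented by a weighted combination of word attributes and semantic words, and the semantic words are used to build sparse representations of new words. Using the Keras framework, a multi-layer chat text feature extraction and classification model is built, consisting of a DSR input layer, BGRU hidden layers, and a softmax classification layer. The model takes as input the text matrix formed from the DSR word vectors, the multiple BGRU hidden layers extract contextual features from the text composed of these word vectors so that the semantic information of the chat text is understood more accurately, and the softmax classification layer is used to perform chat text...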
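The data flow described in the abstract (a matrix of word vectors, a bidirectional GRU, then a softmax layer) can be illustrated with a small forward pass. This is only a sketch under stated assumptions, not the authors' Keras implementation: it is written in plain Python, uses a single bidirectional layer rather than several, omits biases, uses random untrained weights, and substitutes random vectors for the DSR word vectors. The names `GRUCell`, `classify`, and all dimension constants are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """One GRU cell with random weights (biases omitted for brevity)."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # Input (W) and recurrent (U) weights for the update gate z,
        # the reset gate r, and the candidate state n.
        self.W = {g: rand_matrix(hidden_dim, input_dim, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hidden_dim, hidden_dim, rng) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], reset_h))]
        # The update gate blends the previous state with the candidate state.
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def run(self, sequence):
        h = [0.0] * self.hidden_dim
        for x in sequence:
            h = self.step(x, h)
        return h


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(text_matrix, fwd, bwd, out_weights):
    # Read the message left-to-right and right-to-left,
    # then concatenate the two final hidden states.
    features = fwd.run(text_matrix) + bwd.run(list(reversed(text_matrix)))
    return softmax(matvec(out_weights, features))


rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM, NUM_CLASSES = 4, 3, 2  # hypothetical sizes
fwd = GRUCell(EMBED_DIM, HIDDEN_DIM, rng)
bwd = GRUCell(EMBED_DIM, HIDDEN_DIM, rng)
out_weights = rand_matrix(NUM_CLASSES, 2 * HIDDEN_DIM, rng)
# Random stand-in for the DSR word vectors of a five-word chat message.
text_matrix = rand_matrix(5, EMBED_DIM, rng)
probs = classify(text_matrix, fwd, bwd, out_weights)
print(probs)
```

Because the weights are untrained, the printed probabilities carry no meaning. The point is the shape of the computation: each message becomes one probability per class, and the backward pass lets words late in the message influence how earlier words are read.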

Keywords: text semantic representation  polysemy  text classification  digital forensics

Evidence classification method of chat text based on DSR and BGRU model
Yu ZHANG,Binglong LI,Xuejuan LI,Heyu ZHANG. Evidence classification method of chat text based on DSR and BGRU model[J]. Chinese Journal of Network and Information Security, 2022, 8(2): 150-159. DOI: 10.11959/j.issn.2096-109x.2022007
Authors: Yu ZHANG  Binglong LI  Xuejuan LI  Heyu ZHANG
Affiliation: 1. Information Engineering University, Zhengzhou 450001, China; 2. Henan Polytechnic University, Jiaozuo 454003, China
Abstract: It is difficult to efficiently identify and extract chat text evidence related to criminal events, because chat content contains complex semantics such as slang and because social software such as instant messaging generates a huge amount of chat text data. To address this, a chat text evidence classification model (DSR-BGRU) based on the DSR (dynamic semantic representation) model and the BGRU (bidirectional gated recurrent unit) model was proposed. The chat text data was pre-processed to preserve the characteristics of the criminal domain. Then a multi-layer chat text feature extraction and classification model was built using the Keras framework. Taking as input the text matrix composed of the DSR vector representations of words, the DSR input layer represented the chat text at the semantic level. The BGRU hidden layers then extracted the contextual features of the text composed of these word vectors. The softmax classification layer recognized and extracted the chat text evidence. The experimental results show that the proposed DSR-BGRU identifies and extracts chat records more accurately than other text classification models and methods, and that it effectively extracts criminal text information from chat data, with an accuracy of 92.06% and an F1 score of 91.00%.
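The two reported figures, accuracy and F1 score, are standard classification metrics. The sketch below shows how they are computed from true and predicted labels for a two-class task. The abstract does not say whether its F1 score is binary or averaged over classes, so the binary form here is an assumption, and the function name `binary_metrics` and the toy labels are hypothetical rather than taken from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for a two-class task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Toy labels: 1 = crime-related message, 0 = unrelated message.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Accuracy counts every correct decision equally, while F1 only rewards correct detection of the positive class, which is why the two numbers can differ when the classes are imbalanced.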
Keywords: text semantic representation  polysemy  text classification  digital forensics