
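Fast Short Text Data Stream Classification Method Based on Spark
Cite this article: HU Yang, HU Xuegang, LI Peipei. Fast Short Text Data Stream Classification Method Based on Spark[J]. Computer Engineering and Applications, 2020, 56(14): 138-147.
Authors: HU Yang  HU Xuegang  LI Peipei
Affiliations: 1. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China 2. Key Laboratory of Industrial Safety and Emergency Technology of Anhui Province, Hefei 230009, China
Funding: Natural Science Foundation of Anhui Province; National Natural Science Foundation of China
Abstract: Short text data streams emerging from social network platforms such as Weibo and Facebook are massive, high-dimensional and sparse, and fast-changing, which makes short text data stream classification highly challenging. Existing short text data stream classification methods struggle to handle high-dimensional, sparse features effectively, and they incur high time costs when processing massive data streams. To address this, a distributed fast short text data stream classification method based on Spark is proposed. On the one hand, an external corpus is used to build a Word2vec word vector model that addresses the high dimensionality and sparsity of short texts, and an extended word vector library is built to adapt to the fast-changing nature of the texts; an ensemble model of LR classifiers is proposed for short text data stream classification, in which the classifiers use an FTRL method to update model parameters online and a time-factor weighting mechanism is introduced to adapt to concept drift. On the other hand, the proposed method uses distributed processing to improve the efficiency of processing massive short text data streams. Experiments on three real short text data streams show that the proposed method reduces time consumption while improving classification accuracy.

Keywords: short text data stream classification  distributed processing  Spark environment  concept drift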

Fast Short Text Data Stream Classification Method Based on Spark
HU Yang,HU Xuegang,LI Peipei.Fast Short Text Data Stream Classification Method Based on Spark[J].Computer Engineering and Applications,2020,56(14):138-147.
Authors:HU Yang  HU Xuegang  LI Peipei
Affiliation:1.School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China 2.Key Laboratory of Industrial Safety and Emergency Technology Anhui Province, Hefei 230009, China
Abstract:Short text data streams emerging on social network platforms such as Weibo and Facebook are characterized by massive volume, high dimensionality, sparsity and rapid change, which makes short text data stream classification a major challenge. Existing short text data stream classification methods struggle to handle high-dimensional, sparse features effectively, and they incur high time costs when processing massive data streams. Motivated by this, a distributed fast short text data stream classification method based on Spark is proposed in this paper. On the one hand, an external corpus is used to construct a Word2vec model that addresses the high dimensionality and sparsity of short texts, and an extended word vector library is constructed to adapt to the rapid change of the texts. An ensemble model of LR classifiers is then proposed for classifying short text data streams. The classifiers use an FTRL method to update model parameters online, and a time-factor weighting mechanism is introduced to adapt to concept drift. On the other hand, the proposed method uses distributed processing to improve the efficiency of handling massive short text streams. Finally, experiments conducted on three real short text data streams show that the proposed method greatly reduces time consumption while improving classification accuracy.
Keywords:short text data stream classification  distributed processing  Spark  concept drift  
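
The pipeline summarized in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: it assumes that "LR" denotes logistic regression, that a short text is represented by the mean of its word vectors, that the FTRL update is the standard per-coordinate FTRL-Proximal rule, and that the time factor is an exponential decay over classifier age. The names `embed`, `FTRLLogisticRegression`, `time_weights`, `ensemble_proba`, and the `decay` parameter are all hypothetical, and the Spark distribution step is omitted.

```python
import math


def embed(tokens, vectors, dim):
    """Dense text feature: mean of the known word vectors (assumed scheme)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def _sigmoid(s):
    s = max(-35.0, min(35.0, s))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-s))


class FTRLLogisticRegression:
    """Binary logistic regression updated online with FTRL-Proximal."""

    def __init__(self, dim, alpha=0.5, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated adjusted gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def _weight(self, i):
        # Closed-form per-coordinate weight; L1 zeroes small coordinates.
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict_proba(self, x):
        return _sigmoid(sum(self._weight(i) * xi for i, xi in enumerate(x)))

    def update(self, x, y):
        """One online step on example x with label y in {0, 1}."""
        w = [self._weight(i) for i in range(len(x))]
        p = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i, xi in enumerate(x):
            g = (p - y) * xi  # gradient of the log loss for coordinate i
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w[i]
            self.n[i] += g * g


def time_weights(num_models, decay=0.5):
    """Normalized weights exp(-decay * age); the newest model is last."""
    raw = [math.exp(-decay * (num_models - 1 - k)) for k in range(num_models)]
    total = sum(raw)
    return [r / total for r in raw]


def ensemble_proba(models, x, decay=0.5):
    """Time-weighted average of the member classifiers' probabilities."""
    weights = time_weights(len(models), decay)
    return sum(w * m.predict_proba(x) for w, m in zip(weights, models))
```

In this sketch, older classifiers contribute less to the vote, so the ensemble follows the most recent data when the class distribution drifts. In a Spark setting, one plausible arrangement would be to embed and score each micro-batch in parallel across partitions; how the paper actually partitions the work is not stated in the abstract.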
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).