
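Streaming Parallel Text Proofreading Based on Spark Streaming
Cite this article: YANG Zong-lin, LI Tian-rui, LIU Sheng-jiu, YIN Cheng-feng, JIA Zhen, ZHU Jie. Streaming Parallel Text Proofreading Based on Spark Streaming[J]. Computer Science (计算机科学), 2020, 47(4): 36-41
Authors: YANG Zong-lin  LI Tian-rui  LIU Sheng-jiu  YIN Cheng-feng  JIA Zhen  ZHU Jie
Affiliations: School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756; Institute of Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756; Department of Computer Science, Tibet University, Lhasa 850000
Funding: National Natural Science Foundation of China; Sichuan Science and Technology Service Industry Demonstration Project
Abstract: The rapid growth of the Internet has produced massive volumes of online text, which poses new performance challenges for traditional serial text proofreading algorithms. Although automatic text proofreading has attracted considerable attention in recent years, most related work concentrates on serial algorithms and rarely addresses the parallelization of proofreading. This paper first generalizes serial proofreading algorithms into a general framework for serial proofreading. Then, to address the long running time of serial algorithms on large-scale text, it proposes three general methods for parallelizing text proofreading: 1) thread-parallel proofreading based on multi-threading, which uses a thread pool to parallelize over paragraphs and over proofreading functions at the same time; 2) batch-parallel proofreading based on Spark MapReduce, which proofreads paragraphs in parallel through RDD parallel computation; 3) streaming-parallel proofreading based on the Spark Streaming framework, which turns real-time computation over a text stream into a series of small, time-sliced batch jobs, thereby avoiding fixed overhead and significantly shortening proofreading latency. Because streaming computation combines low latency with high throughput, the paper adopts streaming proofreading to build its parallel proofreading system. Performance comparison experiments show that thread parallelism suits small-scale text, batch parallelism suits offline proofreading of large-scale text, and streaming-parallel proofreading reduces fixed latency by about 110 s; compared with batch proofreading, streaming proofreading built on the Streaming computation framework achieves a very large performance improvement.

Keywords: Automatic proofreading  Streaming computing  Parallel computing  Multi-threading  Spark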
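
The thread-parallel method in the abstract parallelizes over paragraphs and over proofreading functions at once. A minimal sketch of that idea follows, assuming two toy checkers (`check_double_spaces`, `check_repeated_words`) and a driver `proofread_parallel`; all of these names are hypothetical illustrations, not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical checkers: each takes a paragraph and returns a list of issues.
def check_double_spaces(paragraph):
    return ["double space"] if "  " in paragraph else []


def check_repeated_words(paragraph):
    words = paragraph.lower().split()
    return [f"repeated word: {a}" for a, b in zip(words, words[1:]) if a == b]


CHECKERS = [check_double_spaces, check_repeated_words]


def proofread_parallel(paragraphs, checkers=CHECKERS, max_workers=4):
    """Run every (paragraph, checker) pair as its own task in a thread pool."""
    results = [[] for _ in paragraphs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per pair: parallel over paragraphs and over checkers.
        futures = [
            (index, pool.submit(checker, paragraph))
            for index, paragraph in enumerate(paragraphs)
            for checker in checkers
        ]
        # Collect in submission order so each paragraph's issues stay grouped.
        for index, future in futures:
            results[index].extend(future.result())
    return results


if __name__ == "__main__":
    sample = ["This is is fine.", "Two  spaces here.", "Clean text."]
    print(proofread_parallel(sample))
```

In CPython, threads mainly help when the checkers wait on I/O (for example, dictionary or service lookups); for CPU-bound checkers a process pool may be the better fit.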

Streaming Parallel Text Proofreading Based on Spark Streaming
YANG Zong-lin, LI Tian-rui, LIU Sheng-jiu, YIN Cheng-feng, JIA Zhen, ZHU Jie. Streaming Parallel Text Proofreading Based on Spark Streaming[J]. Computer Science, 2020, 47(4): 36-41
Authors: YANG Zong-lin  LI Tian-rui  LIU Sheng-jiu  YIN Cheng-feng  JIA Zhen  ZHU Jie
Affiliation: (School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China; Institute of Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, China; Department of Computer Science, Tibet University, Lhasa 850000, China)
Abstract: The rapid development of the Internet has generated massive amounts of online text, which poses new performance challenges for traditional serial text proofreading algorithms. Although automatic text proofreading has received increasing attention in recent years, related research mostly focuses on serial algorithms and rarely addresses the parallelization of proofreading. First, the serial proofreading algorithm is generalized and a general framework for serial proofreading is given. Then, to address the long running time of serial proofreading on large-scale text, three general methods for parallelizing text proofreading are proposed: 1) a parallel proofreading method based on multi-threading, which uses a thread pool to parallelize over paragraphs and proofreading functions simultaneously; 2) a batch-processing parallel proofreading method based on Spark MapReduce, which proofreads paragraphs in parallel by means of RDD parallel computing; 3) a Spark Streaming-based parallel proofreading approach, which converts the real-time computation of text streams into a series of small, time-sliced batch jobs, effectively avoiding fixed overhead and significantly reducing proofreading latency. Because streaming computing offers both low latency and high throughput, the paper chooses the streaming method to build the parallel proofreading system. Performance comparison experiments show that thread parallelism is suitable for proofreading small-scale text, batch processing is suitable for offline proofreading of large-scale text, and streaming parallel proofreading reduces fixed latency by about 110 seconds. Compared with batch proofreading, streaming proofreading using the Streaming computing framework achieves a large performance improvement.
Keywords: Automatic proofreading  Streaming computing  Parallel computing  Multi-threading  Spark
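
The streaming approach described in the abstract turns a text stream into a series of small, time-sliced batch jobs. The sketch below imitates only that slicing step in plain Python, without Spark; `micro_batches`, `proofread`, `run_stream`, and the 2-second interval are hypothetical names and values chosen for illustration, not the paper's system.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, paragraph) events into consecutive time slices."""
    slices = defaultdict(list)
    for timestamp, paragraph in events:
        # Events whose timestamps fall in the same interval share one batch.
        slices[int(timestamp // batch_interval)].append(paragraph)
    # Slices that received no events are simply skipped in this sketch.
    return [slices[key] for key in sorted(slices)]


def proofread(paragraph):
    # Hypothetical stand-in for a real proofreading routine:
    # counts doubled spaces in the paragraph.
    return paragraph.count("  ")


def run_stream(events, batch_interval=2.0):
    """Process each time slice as one small batch job."""
    return [
        [proofread(paragraph) for paragraph in batch]
        for batch in micro_batches(events, batch_interval)
    ]


if __name__ == "__main__":
    stream = [(0.1, "a  b"), (1.5, "c"), (2.2, "d  e  f"), (5.0, "g")]
    print(run_stream(stream))
```

Each slice is processed as soon as it is formed, so a paragraph waits at most about one batch interval before being proofread, rather than waiting for a large batch job to be assembled and launched.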
This article is indexed in VIP, Wanfang Data, and other databases.