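A Spark-Based Model for Aligning Short Gene Sequences
Cite this article: FENG Xiao-long, GAO Jing. A Spark-Based Model for Aligning Short Gene Sequences[J]. Computer Simulation, 2020, 37(2): 231-236.
Authors: FENG Xiao-long, GAO Jing
Affiliation: College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia 010018
Abstract: To address the long running time of short gene sequence alignment in bioinformatics analysis, a distributed computing model was designed on the Spark platform using RDD datasets and the distributed file system HDFS. A divide-and-conquer strategy splits the large computing job into many non-overlapping small tasks that run in parallel on a distributed cluster. Data distribution is handled by a partitioning algorithm that divides the input into equal parts by position offset; short sequences are processed one by one by encapsulating them in an RDD; and alignment is performed by passing the gene alignment algorithm to the RDD's Map function. The model makes a serial alignment algorithm scalable on a distributed cluster and significantly reduces computation time, and its output is compatible with downstream bioinformatics analysis. Experiments show that the model has good stability and scalability and achieves an excellent speedup on a Spark cluster.

Keywords: gene sequence alignment; short-read mapping; distributed computing; parallel computing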

A Scalable Distributed Computing Model for Biological Short Reads Mapping Algorithm
FENG Xiao-long, GAO Jing. A Scalable Distributed Computing Model for Biological Short Reads Mapping Algorithm[J]. Computer Simulation, 2020, 37(2): 231-236.
Authors: FENG Xiao-long, GAO Jing
Affiliation: College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia 010018, China
Abstract: To address the long running time of short reads mapping in bioinformatics analysis, a distributed computing model was designed using the Spark platform, RDD datasets, and the distributed file system HDFS. Using a divide-and-conquer strategy, an enormous computing job was divided into several small, non-overlapping tasks that were executed in parallel on a distributed cluster. Data distribution was implemented by a data partitioning algorithm based on position offset, short sequences were processed by encapsulating them in RDD datasets, and short reads mapping was implemented by passing the alignment algorithm to the Map function of the RDD. The model makes the serial alignment algorithm scalable on a distributed cluster and significantly reduces computation time. The results are compatible with subsequent bioinformatics analysis. Experimental results show that the computing model has good stability and scalability and achieves an excellent speedup ratio on the Spark cluster.
Keywords: Gene sequence alignment; Short reads mapping; Distributed computing; Parallel computing
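
The abstract names three steps: partitioning the input by position offset, processing the reads one at a time, and applying a serial aligner through a map. The following is a minimal plain-Python sketch of that pattern, not the authors' implementation. It makes several assumptions that do not come from the paper: the input holds one read per line (real FASTQ records span four lines), a line belongs to the span in which its first byte falls, the aligner is a toy exact-match search, and an ordinary loop stands in for the RDD map that Spark would run on executors. All function names are hypothetical.

```python
from typing import List, Tuple


def split_offsets(total: int, parts: int) -> List[Tuple[int, int]]:
    """Cut the byte range [0, total) into `parts` near-equal, non-overlapping spans."""
    step, extra = divmod(total, parts)
    spans, start = [], 0
    for i in range(parts):
        end = start + step + (1 if i < extra else 0)
        spans.append((start, end))
        start = end
    return spans


def reads_in_span(data: bytes, start: int, end: int) -> List[str]:
    """Return the reads (one per line) whose first byte lies in [start, end)."""
    pos = start
    # A span that begins mid-line skips ahead: the previous span owns that line.
    if start > 0 and data[start - 1:start] != b"\n":
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    reads = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        # A read that starts inside the span is taken whole, even past `end`.
        reads.append(data[pos:stop].decode("ascii"))
        pos = stop + 1
    return reads


def align(read: str, reference: str) -> Tuple[str, int]:
    """Toy stand-in for a serial aligner: first exact-match position, or -1."""
    return read, reference.find(read)


def map_reads(data: bytes, reference: str, parts: int) -> List[Tuple[str, int]]:
    """Apply `align` to every read, span by span (the role an RDD map would play)."""
    results: List[Tuple[str, int]] = []
    for start, end in split_offsets(len(data), parts):
        results.extend(align(r, reference) for r in reads_in_span(data, start, end))
    return results


REFERENCE = "ACGTACGTTAGC"
READS = b"ACGT\nGTTA\nTAGC\nGGGG\n"

if __name__ == "__main__":
    for read, pos in map_reads(READS, REFERENCE, parts=3):
        print(read, pos)
```

Because each read is assigned to exactly one span, the result does not depend on the number of spans, which is the property that lets a serial aligner be applied unchanged across partitions.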
This article is indexed in databases including VIP and Wanfang Data.