使用分布式De Bruijn图遍历基因拼接并行构建和化简 Parallelized De Bruijn Graph Construction and Simplification for Genome Assembly期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

使用分布式De Bruijn图遍历基因拼接并行构建和化简

引用本文：	曾理,成杰峰,孟金涛,涂志兵,冯圣中.使用分布式De Bruijn图遍历基因拼接并行构建和化简[J].软件学报,2013,24(S2):140-149.

作者姓名：	曾理成杰峰孟金涛涂志兵冯圣中

作者单位：	中国科学院深圳先进技术研究院先进计算与数字工程研究所, 广东深圳 518055;中国科学院深圳先进技术研究院先进计算与数字工程研究所, 广东深圳 518055;中国科学院深圳先进技术研究院先进计算与数字工程研究所, 广东深圳 518055;中国科学院深圳先进技术研究院先进计算与数字工程研究所, 广东深圳 518055;中国科学院深圳先进技术研究院先进计算与数字工程研究所, 广东深圳 518055

基金项目：	国家自然科学基金（61103049）

摘要：	目前基因拼接软件中应用最广泛的技术是基于De Bruijn图的基因拼接算法，需要对长达数十亿BP长度的基因组测序数据进行处理.针对海量的基因测序数据，快速、高效和可扩展的基因拼接算法非常重要.虽然已出现一些并行拼接算法（如YAGA）开始研究这些问题，但是拼接过程中时间、空间消耗较大的构图和单链化简这两大步骤在海量数据的挑战下仍然是最主要的计算瓶颈.这是因为现有工作在处理这几个步骤时通常使用了并行的表排序（list ranking），而该方法需要多次对De Bruijn图的海量顶点信息进行分布式的排序，产生了大量的计算节点间的通信.单链化简可由1次De Bruijn 图深度优先遍历完成而不再需要表排序，于是提出一种基于分布式海量图遍历方法对单链化简进行优化，极大地减少了处理器间的通信和计算节点之间的数据移动，因而取得较好的扩展性，其算法复杂度为O（g/p），通信复杂度为O（g），这里g为参考序列的长度，p为处理器的核数.当对E.coli和Yeast数据集进行测试，处理器的核数从8个增加到512个时，算法可以得到13倍和10倍的加速比;当对C.elegans和人类1号染色体（chr1）数据集进行测试，处理器的核数从32个增加到512个时，算法可以得到7倍和10倍的加速比.
关键词：	并行算法 De Bruijn 图基因拼接图遍历分布式处理
收稿时间：	8/5/2012 12:00:00 AM
修稿时间：	2013/7/22 0:00:00
Parallelized De Bruijn Graph Construction and Simplification for Genome Assembly

ZENG Li,CHENG Jie-Feng,MENG Jin-Tao,TU Zhi-Bing and FENG Sheng-Zhong.Parallelized De Bruijn Graph Construction and Simplification for Genome Assembly[J].Journal of Software,2013,24(S2):140-149.

Authors:	ZENG Li CHENG Jie-Feng MENG Jin-Tao TU Zhi-Bing and FENG Sheng-Zhong

Affiliation:	Institute of Advanced Computing and Digital Engineering, Shenzhen Institutes of Advanced Technology, The Chinese Academy of Sciences, Shenzhen 518055, China;Institute of Advanced Computing and Digital Engineering, Shenzhen Institutes of Advanced Technology, The Chinese Academy of Sciences, Shenzhen 518055, China;Institute of Advanced Computing and Digital Engineering, Shenzhen Institutes of Advanced Technology, The Chinese Academy of Sciences, Shenzhen 518055, China;Institute of Advanced Computing and Digital Engineering, Shenzhen Institutes of Advanced Technology, The Chinese Academy of Sciences, Shenzhen 518055, China;Institute of Advanced Computing and Digital Engineering, Shenzhen Institutes of Advanced Technology, The Chinese Academy of Sciences, Shenzhen 518055, China

Abstract:	De Bruijn graph is a vastly used technique for developing genome assembly software nowadays. The scale of this kind of graph can reach billions of vertices and edges, posing great challenges to the genome assembly task. It is of great importance to study scalable genome assembly algorithms in order to cope with this situation. Despite some recent works and therefore begin to address the scalability problem with parallel assembly algorithms, massive De Bruijn graph processing is still very time consuming which needs optimized operations. This paper aims to significantly improve the efficiency of massive De Bruijn graph processing. Specifically, focus is placed on the time consuming and memory intensive processing in the construction phase and the simplification phase of De Bruijn graph. The study observes observe that the existing list ranking approach repeatedly performs parallel global sorting over all De Bruijin graph vertices, which results in a huge amount of communications between computing nodes. It therefore proposes to use depth-first traversal over the underlying De Bruijn graph once to achieve the same objective as the existing list ranking approach. The new method is fast, effective and can be executed in parallel. It has a computing complexity of O(g/p) and communication complexity of O(g), where g is the length of genome reference, p is the number of processors, which is smaller than the existing list ranking approach, Experimental results using error-free data show that, when the number of CPUs scales from 8 to 512, our algorithm has a speedup of 13 and 10 times on processing the data sets of E.coli and Yeast respectively; and when the number of CPUs scales from 32 to 512, the algorithm has a speedup of 7 and 10 times on processing the data sets of C.elegans and chr1 respectively.

Keywords:	parallelized algorithm De Brujin graph assembler graph traversal distributed processing

	点击此处可从《软件学报》浏览原始摘要信息
	点击此处可从《软件学报》下载全文

设为首页 | 免责声明 | 关于勤云 | 加入收藏