

SDAFT: A novel scalable data access framework for parallel BLAST
Affiliation: 1. EECS, University of Central Florida, Orlando, United States; 2. Department of Computer Science, Virginia Tech, Blacksburg, VA 2406, United States; 1. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China; 2. School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China; 1. Pacific Northwest National Laboratory Richland, WA 99354, USA; 2. Pacific Northwest National Laboratory Seattle, WA 98109, USA; 3. NVIDIA Research Santa Clara, CA 95051, USA; 1. Mathematics and Computer Science Division, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA; 2. Department of Computer Science, Illinois Institute of Technology, 10 West 31st Street, Chicago, IL 60616, USA; 3. Environmental Sciences Division, Oak Ridge National Laboratory, 1 Bethel Valley Road, MS 6301, Oak Ridge, TN 37831-6301, USA; 4. Department of Earth and Planetary Sciences, Department of Electrical Engineering and Computer Science, University of Tennessee, 1412 Circle Drive, Knoxville, TN 37936, USA; 1. Department of Computer Science and Industrial Engineering, INSPIRES Research Institute, Universitat de Lleida, Av. Jaume II 69, E-25001 Lleida, Spain; 2. Department of Mathematics, INSPIRES Research Institute, Universitat de Lleida, Av. Jaume II 69, E-25001 Lleida, Spain
Abstract: To run tasks in a parallel and load-balanced fashion, existing scientific parallel applications such as mpiBLAST introduce a data-initialization stage that moves database fragments from shared storage to local cluster nodes. Unfortunately, with sequence databases growing exponentially in today's big data era, this approach is inefficient. In this paper, we develop SDAFT, a scalable data access framework that addresses the data movement problem for scientific applications whose data analysis is dominated by read operations. SDAFT employs a distributed file system (DFS) to provide scalable data access for parallel sequence searches. It consists of two interlocked components: (1) a data-centric load-balanced scheduler (DC-scheduler) that enforces data-process locality and (2) a translation layer that translates conventional parallel I/O operations into HDFS I/O. In experiments with our SDAFT prototype using a real-world database and queries on a wide variety of computing platforms, we found that SDAFT can reduce I/O cost by a factor of 4–10 and double overall execution performance compared with existing schemes.
Keywords: MPI/POSIX I/O; HDFS; Parallel sequence search; mpiBLAST
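
The abstract describes the DC-scheduler only at a high level: it enforces data-process locality while keeping the load balanced. The sketch below illustrates that general idea with a simple greedy heuristic and should not be read as the paper's actual algorithm. The function name `assign_fragments`, the `replicas` mapping, and the node names are all hypothetical; in a real deployment the replica locations would come from the distributed file system's block metadata.

```python
from collections import defaultdict


def assign_fragments(replicas, workers):
    """Greedy locality-aware assignment (illustrative only).

    replicas: dict mapping a fragment name to the list of nodes
              that hold a copy of it (hypothetical input format).
    workers:  list of node names that run search processes.
    Returns a dict mapping each used worker to its fragments.
    """
    load = {w: 0 for w in workers}
    plan = defaultdict(list)

    # Place fragments with the fewest replica choices first, so the
    # most constrained fragments are not crowded out of local nodes.
    for frag in sorted(replicas, key=lambda f: (len(replicas[f]), f)):
        # Prefer workers that already store the fragment locally.
        local = [w for w in replicas[frag] if w in load]
        # Fall back to any worker (a remote read) if none is local.
        candidates = local or workers
        # Among the candidates, pick the least-loaded worker.
        target = min(candidates, key=lambda w: (load[w], w))
        plan[target].append(frag)
        load[target] += 1

    return dict(plan)


if __name__ == "__main__":
    example = {
        "frag0": ["n1", "n2"],
        "frag1": ["n1"],
        "frag2": ["n2", "n3"],
        "frag3": ["n3"],
    }
    for node, frags in sorted(assign_fragments(example, ["n1", "n2", "n3"]).items()):
        print(node, frags)
```

Handling the most constrained fragments first is one simple design choice among many. A production scheduler would likely also weigh fragment sizes and per-node speed, which this sketch ignores.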
This article is indexed in ScienceDirect and other databases.