Benchmarking performance for migrating a relational application to a parallel implementation

Affiliation:	1. Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai 200433, China; 2. Department of Computer Science and Technology, Tongji University, Shanghai 201804, China; 1. Department of Computer Engineering, King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia; 2. School of Computing, Queen’s University, Kingston, ON, Canada; 1. School of Computer and Information Science, Southwest University, ChongQing 400715 PR China; 2. School of Computer and Information, Hefei University of Technology, Hefei, China; 3. Academy of Broadcasting Science, SAPPRFT, Beijing, China

Abstract:	Many organizations rely on relational database platforms for OLAP-style querying (aggregation and filtering) in small- to medium-sized applications. We investigate the impact of scaling up the data sizes for such queries, with the aim of illustrating what performance an organization could expect if it migrated its current applications to a big data environment. This paper benchmarks the performance of Hive (Thusoo et al., 2009), a parallel data warehouse platform that is part of the Hadoop software stack. We set up a 4-node Hadoop cluster using Hortonworks HDP 1.3.2. We use the data generator provided by the TPC-DS benchmark (DSGen v1.1.0) to generate data at different scales. We compare the performance of loading data and querying with SQL on a relational database installation (MySQL) and with Hive Query Language (HiveQL) on a Hive cluster. We measure the speedup in query execution for three dataset sizes resulting from the scale-up. Hive loads the large datasets faster than MySQL, but is marginally slower than MySQL when loading the smaller datasets. Query execution in Hive is also faster. We also investigate executing Hive queries concurrently in workloads and conclude that serial execution of queries is a much better practice for clusters with limited resources.

Keywords:	Hive Hadoop Benchmarking Big data SQL Queries
This article is indexed in ScienceDirect and other databases.
