Analyzing the Query Performance of Hive-QL with ORCfile on Hadoop Cluster

Authors

  • พันธิการ์ วัฒนกุล Department of Business Computer, Faculty of Management Science, Nakhon Pathom Rajabhat University
  • กฤษณ์วรา รัตนโอภาส Computer Department, Faculty of Science and Technology, Songkhla Rajabhat University
  • สุรีรัตน์ แก้วคีรี Department of Business Computer, Faculty of Management Science, Songkhla Rajabhat University

DOI:

https://doi.org/10.14456/rmutlengj.2017.12

Keywords:

Big data, Hadoop, Hive, Performance Analysis

Abstract

Weather forecast data is one of the most important kinds of datasets in Big Data. Hive was the first relational database application to run on a Hadoop cluster. This paper presents a performance analysis of HiveQL on a Hadoop cluster with varying numbers of data nodes and data replications. The results show that the best-performing MapReduce configuration for distributed nodes in the Hadoop cluster is Map=5/Reduce=1. This ratio is consistent with the best query performance setup, which is 3 replications per 5 data nodes. Meanwhile, increasing the number of data nodes and replications did not affect the result in any way.

Published

2017-12-01

How to Cite

วัฒนกุล พ., รัตนโอภาส ก., & แก้วคีรี ส. (2017). Analyzing the Query Performance of Hive-QL with ORCfile on Hadoop Cluster. RMUTL Engineering Journal, 2(2), 43–52. https://doi.org/10.14456/rmutlengj.2017.12

Section

Research Article