research-article

Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware

Authors:
Maribel Yasmina Santos, Carlos Costa, João Galvão, Carina Andrade, Bruno Augusto Martinho, Francisca Vale Lima, Eduarda Costa
University of Minho, ALGORITMI Research Centre, Portugal

IDEAS '17: Proceedings of the 21st International Database Engineering & Applications Symposium, July 2017, Pages 242–252. https://doi.org/10.1145/3105831.3105842

Published: 12 July 2017

ABSTRACT

Big Data is currently conceptualized as data whose volume, variety or velocity imposes significant difficulties on traditional techniques and technologies. Big Data Warehousing is emerging as a new concept for Big Data analytics. In this context, SQL-on-Hadoop systems have gained prominence, providing Structured Query Language (SQL) interfaces and interactive queries on Hadoop. A benchmark based on a denormalized version of TPC-H is used to compare the performance of Hive on Tez, Spark, Presto and Drill. Key contributions of this work include: a direct comparison of a broad set of technologies; the connection of the SQL-on-Hadoop systems to Hive tables instead of raw files, unlike previous scientific works; and an understanding of how these systems behave in scenarios with ever-increasing requirements but not-so-good hardware. Besides the benchmark results, this paper also presents findings regarding the architecture and infrastructure of SQL-on-Hadoop for Big Data Warehousing, helping practitioners and fostering future research.
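As a rough illustration of what "denormalized" means in the benchmark description above, the following minimal sketch joins two tables into one flat table and checks that the same aggregate can then be answered without a join. It is not the authors' schema or setup: the two-table layout is a hypothetical, heavily reduced stand-in that only borrows a few TPC-H column names, and an in-memory SQLite database stands in for Hive tables purely so that the example is runnable.

```python
import sqlite3


def build_tables(conn):
    """Create a tiny normalized schema plus a flat, denormalized copy.

    Assumption: this two-table layout is a hypothetical reduction for
    illustration, not the denormalized TPC-H schema used in the paper.
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT)"
    )
    cur.execute(
        "CREATE TABLE orders ("
        "o_orderkey INTEGER PRIMARY KEY, o_custkey INTEGER, o_totalprice REAL)"
    )
    cur.executemany(
        "INSERT INTO customer VALUES (?, ?)", [(1, "Alice"), (2, "Bob")]
    )
    cur.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0)],
    )
    # Denormalization: perform the join once, at load time, and store
    # the customer attributes next to every order row.
    cur.execute(
        "CREATE TABLE orders_flat AS "
        "SELECT o.o_orderkey, o.o_totalprice, c.c_custkey, c.c_name "
        "FROM orders o JOIN customer c ON o.o_custkey = c.c_custkey"
    )
    conn.commit()


def revenue_normalized(conn):
    """Revenue per customer, computed with a join at query time."""
    return conn.execute(
        "SELECT c.c_name, SUM(o.o_totalprice) "
        "FROM orders o JOIN customer c ON o.o_custkey = c.c_custkey "
        "GROUP BY c.c_name ORDER BY c.c_name"
    ).fetchall()


def revenue_denormalized(conn):
    """The same aggregate, answered from the flat table without a join."""
    return conn.execute(
        "SELECT c_name, SUM(o_totalprice) "
        "FROM orders_flat GROUP BY c_name ORDER BY c_name"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_tables(connection)
    print(revenue_normalized(connection))
    print(revenue_denormalized(connection))
```

Both queries return the same rows. The flat table trades extra storage (customer attributes repeated on every order row) for join-free queries, which is the general trade-off a denormalized benchmark schema exercises.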

Index Terms

Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware
1. Information systems
  1. Data management systems
    1. Database administration
      1. Database performance evaluation
    2. Information integration
      1. Data warehouses
  2. Information systems applications
    1. Decision support systems
      1. Data analytics

Published in
IDEAS '17: Proceedings of the 21st International Database Engineering & Applications Symposium
July 2017
338 pages
ISBN: 9781450352208
DOI: 10.1145/3105831
Editors: Bipin C. Desai (Concordia University), Jun Hong (UWE, Bristol), Richard McClatchey (UWE, Bristol)
General Chair: Bipin C. Desai (Concordia University)
Program Chair: Jun Hong (UWE, Bristol)
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Benchmark
Big Data
Big Data Warehousing
Data Warehouse
Drill
Hadoop
Hive
Presto
SQL-on-Hadoop
Spark
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
IDEAS '17 Paper Acceptance Rate: 38 of 102 submissions, 37%. Overall Acceptance Rate: 74 of 210 submissions, 35%.