DOI: 10.1145/3105831.3105842
research-article

Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware

Published: 12 July 2017

ABSTRACT

Big Data is currently conceptualized as data whose volume, variety or velocity imposes significant difficulties on traditional techniques and technologies. Big Data Warehousing is emerging as a new concept for Big Data analytics. In this context, SQL-on-Hadoop systems have gained prominence, providing Structured Query Language (SQL) interfaces and interactive queries on Hadoop. A benchmark based on a denormalized version of TPC-H is used to compare the performance of Hive on Tez, Spark, Presto and Drill. Some key contributions of this work include: a direct comparison of a broad set of technologies; unlike in previous scientific works, the SQL-on-Hadoop systems were connected to Hive tables instead of raw files; and an understanding of how these systems behave in scenarios with ever-increasing requirements but not-so-good hardware. Besides the benchmark results, this paper also presents findings regarding the architecture and infrastructure of SQL-on-Hadoop for Big Data Warehousing, helping practitioners and fostering future research.

Published in

IDEAS '17: Proceedings of the 21st International Database Engineering & Applications Symposium
July 2017
338 pages
ISBN: 9781450352208
DOI: 10.1145/3105831

Copyright © 2017 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

IDEAS '17 Paper Acceptance Rate: 38 of 102 submissions, 37%
Overall Acceptance Rate: 74 of 210 submissions, 35%
