ABSTRACT
Big Data is currently conceptualized as data whose volume, variety or velocity impose significant difficulties on traditional techniques and technologies. Big Data Warehousing is emerging as a new concept for Big Data analytics. In this context, SQL-on-Hadoop systems have gained prominence, providing Structured Query Language (SQL) interfaces and interactive queries on Hadoop. A benchmark based on a denormalized version of TPC-H is used to compare the performance of Hive on Tez, Spark, Presto and Drill. Key contributions of this work include: a direct comparison of several technologies; unlike previous scientific works, the SQL-on-Hadoop systems were connected to Hive tables instead of raw files; and an understanding of how these systems behave in scenarios with ever-increasing requirements but not-so-good hardware. Besides the benchmark results, this paper also presents findings regarding architecture and infrastructure in SQL-on-Hadoop for Big Data Warehousing, helping practitioners and fostering future research.
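To make the idea of a denormalized TPC-H concrete, the following is a minimal sketch only, not the schema or tooling used in the paper. It assumes a heavily reduced subset of three TPC-H tables (customer, orders, lineitem) with a handful of made-up rows, and uses an in-memory SQLite database as a stand-in for Hive. The table name flat_lineitem and both function names are hypothetical.

```python
import sqlite3


def build_flat_table(conn):
    """Create a tiny normalized schema, then join it into one wide table."""
    conn.executescript(
        """
        -- Reduced subset of TPC-H tables; the rows are illustrative only.
        CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT);
        CREATE TABLE orders (
            o_orderkey INTEGER PRIMARY KEY,
            o_custkey INTEGER,
            o_orderdate TEXT
        );
        CREATE TABLE lineitem (
            l_orderkey INTEGER,
            l_linenumber INTEGER,
            l_quantity INTEGER,
            l_extendedprice REAL
        );
        INSERT INTO customer VALUES (1, 'Customer#1'), (2, 'Customer#2');
        INSERT INTO orders VALUES (10, 1, '1995-01-01'), (11, 2, '1995-02-01');
        INSERT INTO lineitem VALUES
            (10, 1, 5, 100.0), (10, 2, 3, 60.0), (11, 1, 2, 40.0);

        -- Denormalization: pre-join the tables once, so later analytical
        -- queries scan a single wide table instead of joining at query time.
        CREATE TABLE flat_lineitem AS
            SELECT l.l_orderkey, l.l_linenumber, l.l_quantity,
                   l.l_extendedprice, o.o_orderdate, c.c_name
            FROM lineitem AS l
            JOIN orders AS o ON o.o_orderkey = l.l_orderkey
            JOIN customer AS c ON c.c_custkey = o.o_custkey;
        """
    )
    conn.commit()


def revenue_by_customer(conn):
    """Aggregate over the flat table only; no joins are needed here."""
    return conn.execute(
        "SELECT c_name, SUM(l_extendedprice) "
        "FROM flat_lineitem GROUP BY c_name ORDER BY c_name"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_flat_table(connection)
    for name, revenue in revenue_by_customer(connection):
        print(name, revenue)
```

The trade-off illustrated here is that the join cost is paid once at load time and the flat table stores dimension attributes redundantly, in exchange for join-free analytical queries.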
Index Terms
- Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware