research-article

HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Authors:
Azza Abouzeid

Yale University

Yale University
View Profile

,
Kamil Bajda-Pawlikowski

Yale University

Yale University
View Profile

,
Daniel Abadi

Yale University

Yale University
View Profile

,
Avi Silberschatz

Yale University

Yale University
View Profile

,
Alexander Rasin

Brown University

Brown University
View Profile

Proceedings of the VLDB Endowment Volume 2 Issue 1pp 922–933https://doi.org/10.14778/1687627.1687731

Published:01 August 2009Publication History

Proceedings of the VLDB Endowment

Abstract

The production environment for analytical data management applications is rapidly changing. Many enterprises are shifting away from deploying their analytical databases on high-end proprietary machines, and moving towards cheaper, lower-end, commodity hardware, typically arranged in a shared-nothing MPP architecture, often in a virtualized environment inside public or private "clouds". At the same time, the amount of data that needs to be analyzed is exploding, requiring hundreds to thousands of machines to work in parallel to perform the analysis.

There tend to be two schools of thought regarding what technology to use for data analysis in such an environment. Proponents of parallel databases argue that the strong emphasis on performance and efficiency of parallel databases makes them well-suited to perform such analysis. On the other hand, others argue that MapReduce-based systems are better suited due to their superior scalability, fault tolerance, and flexibility to handle unstructured data. In this paper, we explore the feasibility of building a hybrid system that takes the best features from both technologies; the prototype we built approaches parallel databases in performance and efficiency, yet still yields the scalability, fault tolerance, and flexibility of MapReduce-based systems.

References

Hadoop. Web Page. hadoop.apache.org/core/.Google Scholar
HadoopDB Project. Web page. db.cs.yale.edu/hadoopdb/hadoopdb.html.Google Scholar
Vertica. www.vertica.com/.Google Scholar
D. Abadi. What is the right way to measure scale? DBMS Musings Blog. dbmsmusings.blogspot.com/2009/06/what-is-right-way-to-measure-scale.html.Google Scholar
P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proc. of SOSP, 2003. Google ScholarDigital Library
R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: Easy and efficient parallel processing of massive data sets. In Proc. of VLDB, 2008. Google ScholarDigital Library
G. Czajkowski. Sorting 1pb with mapreduce. googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html.Google Scholar
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004. Google ScholarDigital Library
D. DeWitt and M. Stonebraker. MapReduce: A major step backwards. DatabaseColumn Blog. www.databasecolumn. com/2008/01/mapreduce-a-major-step-back.html.Google Scholar
D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna. GAMMA - A High Performance Dataflow Database Machine. In VLDB '86, 1986. Google ScholarDigital Library
Facebook. Hive. Web page. issues.apache.org/jira/browse/HADOOP-3601.Google Scholar
S. Fushimi, M. Kitsuregawa, and H. Tanaka. An Overview of The System Software of A Parallel Relational Database Machine. In VLDB '86, 1986. Google ScholarDigital Library
Hadoop Project. Hadoop Cluster Setup. Web Page. hadoop.apache.org/core/docs/current/cluster_setup.html.Google Scholar
J. Hamilton. Cooperative expendable micro-slice servers (cems): Low cost, low power servers for internet-scale services. In Proc. of CIDR, 2009.Google Scholar
Hive Project. Hive SVN Repository. Accessed May 19th 2009. svn.apache.org/viewvc/hadoop/hive/.Google Scholar
J. N. Hoover. Start-Ups Bring Google's Parallel Processing To Data Warehousing. InformationWeek, August 29th, 2008.Google Scholar
S. Madden, D. DeWitt, and M. Stonebraker. Database parallelism choices greatly impact scalability. DatabaseColumn Blog. www.databasecolumn.com/2007/10/database-parallelism-choices.html.Google Scholar
Mayank Bawa. A $5.1M Addendum to our Series B. www.asterdata.com/blog/index.php/2009/02/25/a-51m-addendum-to-our-series-b/.Google Scholar
C. Monash. The 1-petabyte barrier is crumbling. www.networkworld.com/community/node/31439.Google Scholar
C. Monash. Cloudera presents the MapReduce bull case. DBMS2 Blog. www.dbms2.com/2009/04/15/cloudera-presents-the-mapreduce-bull-case/.Google Scholar
C. Olofson. Worldwide RDBMS 2005 vendor shares. Technical Report 201692, IDC, May 2006.Google Scholar
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In Proc. of SIGMOD, 2008. Google ScholarDigital Library
A. Pavlo, A. Rasin, S. Madden, M. Stonebraker, D. DeWitt, E. Paulson, L. Shrinivas, and D. J. Abadi. A Comparison of Approaches to Large Scale Data Analysis. In Proc. of SIGMOD, 2009. Google ScholarDigital Library
M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-Store: A column-oriented DBMS. In VLDB, 2005. Google ScholarDigital Library
D. Vesset. Worldwide data warehousing tools 2005 vendor shares. Technical Report 203229, IDC, August 2006.Google Scholar

Index Terms

HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads
1. Computer systems organization
  1. Architectures
    1. Distributed architectures
2. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs
  2. Information systems applications

Recommendations

HadoopDB in action: building real world applications
SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

HadoopDB is a hybrid of MapReduce and DBMS technologies, designed to meet the growing demand of analyzing massive datasets on very large clusters of machines. Our previous work has shown that HadoopDB approaches parallel databases in performance and ...
Read More
Big data analysis and query optimization improve HadoopDB performance
SEM '14: Proceedings of the 10th International Conference on Semantic Systems

High performance and scalability are two essentials requirements for data analytics systems as the amount of data being collected, stored and processed continue to grow rapidly. In this paper, we propose a new approach based on HadoopDB. Our main goal ...
Read More
Tradeoffs between parallel database systems, Hadoop, and HadoopDB as platforms for petabyte-scale analysis
SSDBM'10: Proceedings of the 22nd international conference on Scientific and statistical database management

As the market demand for analyzing data sets of increasing variety and scale continues to explode, the software options for performing this analysis are beginning to proliferate. No fewer than a dozen companies have launched in the past few years that ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 2, Issue 1
August 2009
1293 pages
ISSN:2150-8097
Editors:
Serge Abiteboul,
Tova Milo,
Jignesh Patel,
Philippe Rigaux
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2009
Published in pvldb Volume 2, Issue 1
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 257
  Total Citations
  View Citations
- 4,703
  Total Downloads
- Downloads (Last 12 months)107
- Downloads (Last 6 weeks)5
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

HadoopDB in action: building real world applications

Big data analysis and query optimization improve HadoopDB performance

Tradeoffs between parallel database systems, Hadoop, and HadoopDB as platforms for petabyte-scale analysis

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

HadoopDB in action: building real world applications

Big data analysis and query optimization improve HadoopDB performance

Tradeoffs between parallel database systems, Hadoop, and HadoopDB as platforms for petabyte-scale analysis

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media