skip to main content
article

Hadoop GIS: a high performance spatial data warehousing system over mapreduce

Authors Info & Claims
Published:01 August 2013Publication History
Skip Abstract Section

Abstract

Support of high performance queries on large volumes of spatial data becomes increasingly important in many application domains, including geospatial problems in numerous fields, location based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive scale spatial data is due to the proliferation of cost effective and ubiquitous positioning technologies, development of high resolution imaging technologies, and contribution from a large number of community users. There are two major challenges for managing and querying massive spatial data to support spatial queries: the explosion of spatial data, and the high computational complexity of spatial queries. In this paper, we present Hadoop-GIS - a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through spatial partitioning, customizable spatial query engine RESQUE, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling boundary objects. Hadoop-GIS utilizes global partition indexing and customizable on demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. Our experiments have demonstrated the high efficiency of Hadoop-GIS on query response and high scalability to run on commodity clusters. Our comparative experiments have showed that performance of Hadoop-GIS is on par with parallel SDBMS and outperforms SDBMS for compute-intensive queries. Hadoop-GIS is available as a set of library for processing spatial queries, and as an integrated software package in Hive.

References

  1. Geocouch. https://github.com/couchbase/geocouch/.Google ScholarGoogle Scholar
  2. neo4j/spatial. https://github.com/neo4j/spatial.Google ScholarGoogle Scholar
  3. The sloan digital sky survey project (sdss). http://www.sdss.org.Google ScholarGoogle Scholar
  4. Spatial index library. http://libspatialindex.github.com.Google ScholarGoogle Scholar
  5. Spatialhadoop. http://spatialhadoop.cs.umn.edu/.Google ScholarGoogle Scholar
  6. Geos. http://trac.osgeo.org/geos, 2013.Google ScholarGoogle Scholar
  7. Hadoop-GIS wiki. https://web.cci.emory.edu/confluence/display/hadoopgis, 2013.Google ScholarGoogle Scholar
  8. Openstreetmap. http://www.openstreetmap.org, 2013.Google ScholarGoogle Scholar
  9. Pathology analytical imaging standards. https://web.cci.emory.edu/confluence/display/PAIS, 2013.Google ScholarGoogle Scholar
  10. A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. Hadoopdb: an architectural hybrid of mapreduce and dbms technologies for analytical workloads. Proc. VLDB Endow., 2:922-933, August 2009. Google ScholarGoogle Scholar
  11. A. Aji. High performance spatial query processing for large scale scientific data. In Proceedings of the on SIGMOD/PODS 2012 PhD Symposium, pages 9-14, New York, NY, USA, 2012. ACM. Google ScholarGoogle Scholar
  12. A. Aji, F. Wang, and J. H. Saltz. Towards building a high performance spatial query system for large scale medical imaging data. In SIGSPATIAL/GIS, pages 309-318, 2012. Google ScholarGoogle Scholar
  13. A. Akdogan, U. Demiryurek, F. Banaei-Kashani, and C. Shahabi. Voronoi-based geospatial query processing with mapreduce. In CLOUDCOM, pages 9-16, 2010. Google ScholarGoogle Scholar
  14. N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger. The r*-tree: An efficient and robust access method for points and rectangles. In SIGMOD, 1990. Google ScholarGoogle Scholar
  15. J. V. d. Bercken and B. Seeger. An evaluation of generic bulk loading techniques. In VLDB, pages 461-470, 2001. Google ScholarGoogle Scholar
  16. S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A comparison of join algorithms for log processing in mapreduce. In SIGMOD, 2010. Google ScholarGoogle Scholar
  17. T. Brinkhoff, H.-P. Kriegel, and B. Seeger. Parallel processing of spatial joins using r-trees. In ICDE, 1996. Google ScholarGoogle Scholar
  18. A. Cary, Z. Sun, V. Hristidis, and N. Rishe. Experiences on processing spatial data with mapreduce. In SSDBM, pages 302-319, 2009. Google ScholarGoogle Scholar
  19. R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265-1276, 2008. Google ScholarGoogle Scholar
  20. J. Dean and S. Ghemawat. Mapreduce: a flexible data processing tool. Commun. ACM, 53(1):72-77, 2010. Google ScholarGoogle Scholar
  21. A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high level dataflow system on top of MapReduce: The Pig experience. PVLDB, 2(2):1414-1425, 2009. Google ScholarGoogle Scholar
  22. H. Gupta, B. Chawda, S. Negi, T. A. Faruquie, L. V. Subramaniam, and M. Mohania. Processing multi-way spatial joins on map-reduce. In EDBT, pages 113-124, 2013. Google ScholarGoogle Scholar
  23. I. Kamel and C. Faloutsos. Hilbert r-tree: An improved r-tree using fractals. In VLDB, pages 500-509, 1994. Google ScholarGoogle Scholar
  24. R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and X. Zhang. Ysmart: Yet another sql-to-mapreduce translator. In ICDCS, 2011. Google ScholarGoogle Scholar
  25. M.-L. Lo and C. V. Ravishankar. Spatial hash-joins. In SIGMOD, pages 247-258, 1996. Google ScholarGoogle Scholar
  26. M. A. Nieto-Santisteban, A. R. Thakar, and A. S. Szalay. Cross-matching very large datasets. In NSTC NASA Conference, 2007.Google ScholarGoogle Scholar
  27. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD, 2008. Google ScholarGoogle Scholar
  28. J. Patel, J. Yu, N. Kabra, K. Tufte, B. Nag, J. Burger, N. Hall, K. Ramasamy, R. Lueder, C. Ellmann, J. Kupsch, S. Guo, J. Larson, D. De Witt, and J. Naughton. Building a scaleable geo-spatial dbms: technology, implementation, and evaluation. In SIGMOD, SIGMOD '97, pages 336-347, 1997. Google ScholarGoogle Scholar
  29. A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD, pages 165-178, 2009. Google ScholarGoogle Scholar
  30. M. Stonebraker, D. J. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. Mapreduce and parallel dbmss: friends or foes? Commun. ACM, 53(1):64-71, 2010. Google ScholarGoogle Scholar
  31. A. S. Szalay, G. Bell, J. vandenBerg, A. Wonders, R. C. Burns, D. Fay, J. Heasley, T. Hey, M. A. Nieto-Santisteban, A. Thakar, C. v. Ingen, and R. Wilton. Graywulf: Scalable clustered architecture for data intensive computing. In HICSS, pages 1-10, 2009. Google ScholarGoogle Scholar
  32. A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow., 2(2):1626-1629, Aug. 2009. Google ScholarGoogle Scholar
  33. A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a map-reduce framework. volume 2, pages 1626-1629, August 2009. Google ScholarGoogle Scholar
  34. F. Wang, J. Kong, L. Cooper, T. Pan, K. Tahsin, W. Chen, A. Sharma, C. Niedermayr, T. W. Oh, D. Brat, A. B. Farris, D. Foran, and J. Saltz. A data model and database for high-resolution pathology analytical image informatics. J Pathol Inform, 2(1):32, 2011.Google ScholarGoogle Scholar
  35. F. Wang, J. Kong, J. Gao, D. Adler, L. Cooper, C. Vergara-Niedermayr, Z. Zhou, B. Katigbak, T. Kurc, D. Brat, and J. Saltz. A high-performance spatial database based approach for pathology imaging algorithm evaluation. J Pathol Inform, 4(5), 2013.Google ScholarGoogle Scholar
  36. K. Wang, Y. Huai, R. Lee, F. Wang, X. Zhang, and J. H. Saltz. Accelerating pathology image data cross-comparison on cpu-gpu hybrid systems. Proc. VLDB Endow., 5(11):1543-1554, July 2012. Google ScholarGoogle Scholar
  37. Y. Xu, P. Kostamaa, and L. Gao. Integrating hadoop and parallel dbms. In SIGMOD, pages 969-974, 2010. Google ScholarGoogle Scholar
  38. S. Zhang, J. Han, Z. Liu, K. Wang, and Z. Xu. Sjmr: Parallelizing spatial join with mapreduce on clusters. In CLUSTER, 2009.Google ScholarGoogle Scholar
  39. Y. Zhong, J. Han, T. Zhang, Z. Li, J. Fang, and G. Chen. Towards parallel spatial query processing for big spatial data. In IPDPSW, pages 2085-2094, 2012. Google ScholarGoogle Scholar
  40. X. Zhou, D. J. Abel, and D. Truffet. Data partitioning for parallel spatial join processing. Geoinformatica, 2:175-204, June 1998. Google ScholarGoogle Scholar

Index Terms

  1. Hadoop GIS: a high performance spatial data warehousing system over mapreduce
        Index terms have been assigned to the content through auto-classification.

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image Proceedings of the VLDB Endowment
          Proceedings of the VLDB Endowment  Volume 6, Issue 11
          August 2013
          237 pages

          Publisher

          VLDB Endowment

          Publication History

          • Published: 1 August 2013
          Published in pvldb Volume 6, Issue 11

          Qualifiers

          • article

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader