GeoSparkViz: a cluster computing system for visualizing massive-scale geospatial data

  • Regular Paper
The VLDB Journal

Abstract

In the last decade, geospatial data extracted from GPS traces and satellite images has become ubiquitous. GeoVisual analytics, abbreviated GeoViz, is the science of analytical reasoning assisted by geospatial map interfaces. GeoViz involves two phases: (1) spatial data processing, which loads spatial data and executes spatial queries to return the set of spatial objects to be visualized; and (2) map visualization, which applies a map visualization effect, e.g., a heat map, to the spatial objects produced in the first phase. Existing GeoViz system architectures decouple these two phases and thereby lose the opportunity to co-optimize the data processing and map visualization phases in the same cluster. To remedy this, the paper presents GeoSparkViz, a full-fledged system that allows the user to load, process, integrate and execute GeoViz tasks on spatial data at scale. GeoSparkViz extends a state-of-the-art distributed data management system to provide native support for general geospatial map visualization. The system encapsulates the main steps of the map visualization process, e.g., pixelizing spatial objects, aggregating pixels, and rendering map tiles, into a set of massively parallelized map building operators. This allows the system to co-optimize the spatial query operators and map building operators side by side. GeoSparkViz is also equipped with a GeoViz-aware spatial partitioning operator that balances the load of GeoViz workloads among all nodes in the cluster. Experiments based on an implementation in Spark show that GeoSparkViz achieves up to an order of magnitude less data-to-visualization time than its counterparts when running visual analytics tasks over large-scale spatial data extracted from the NYC taxi dataset and OpenStreetMap.

References

  1. Aji, A., Wang, F., Vo, H., Lee, R., Liu, Q., Zhang, X., Saltz, J.H.: Hadoop-GIS: a high performance spatial data warehousing system over MapReduce. Proc. VLDB Endow. PVLDB 6(11), 1009–1020 (2013)

  2. Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A., Zaharia, M.: Spark SQL: relational data processing in spark. In: Sellis, T.K., Davidson, S.B., Ives, Z.G. (eds.) Proceedings of the ACM International Conference on Management of Data, SIGMOD, pp. 1383–1394. ACM (2015)

  3. Baig, F., Mehrotra, M., Vo, H., Wang, F., Saltz, J.H., Kurç, T.M.: SparkGIS: efficient comparison and evaluation of algorithm results in tissue image analysis studies. In: Workshop on Biomedical Data Management and Graph Online Querying—VLDB, pp. 134–146 (2015)

  4. Battle, L., Stonebraker, M., Chang, R.: Dynamic reduction of query result sets for interactive visualizaton. In: Proceedings of International Conference on Big Data, BigData, pp. 1–8 (2013)

  5. Crotty, A., Galakatos, A., Zgraggen, E., Binnig, C., Kraska, T.: Vizdom: interactive analytics through pen and touch. Proc. VLDB Endow. PVLDB 8(12), 2024–2027 (2015)

  6. de Lara Pahins, C.A., Stephens, S.A., Scheidegger, C., Comba, J.L.D.: Hashedcubes: simple, low memory, real-time visual exploration of big data. IEEE Trans. Vis. Comput. Graph. TVCG 23(1), 671–680 (2017)

  7. Eldawy, A., Alarabi, L., Mokbel, M.F.: Spatial partitioning techniques in spatial hadoop. Proc. VLDB Endow. PVLDB 8(12), 1602–1605 (2015)

  8. Eldawy, A., Mokbel, M.F.: A demonstration of spatialhadoop: an efficient mapreduce framework for spatial data. Proc. VLDB Endow. PVLDB 6(12), 1230–1233 (2013)

  9. Eldawy, A., Mokbel, M.F., Alharthi, S., Alzaidy, A., Tarek, K., Ghani, S.: Shahed: a mapreduce-based system for querying and visualizing spatio-temporal satellite data. In: Proceedings of the International Conference on Data Engineering, ICDE, pp. 1585–1596. IEEE (2015)

  10. Eldawy, A., Mokbel, M.F., Jonathan, C.: Hadoopviz: a mapreduce framework for extensible visualization of big spatial data. In: Proceedings of the International Conference on Data Engineering, ICDE, pp. 601–612. IEEE (2016)

  11. Earthdata Cloud Evolution. http://www.naturalearthdata.com/downloads/

  12. Guo, T., Feng, K., Cong, G., Bao, Z.: Efficient selection of geospatial data on maps for interactive and visualized exploration. In: Proceedings of the ACM International Conference on Management of Data, SIGMOD, pp. 567–582. ACM (2018)

  13. Apache Hadoop. http://hadoop.apache.org/

  14. Hughes, J.N., Annex, A., Eichelberger, C.N., Fox, A., Hulbert, A., Ronquest, M.: Geomesa: a distributed architecture for spatio-temporal fusion. In: SPIE Defense+Security, pp. 94730F–94730F. International Society for Optics and Photonics (2015)

  15. Kefaloukos, P.K., Salles, M.A.V., Zachariasen, M.: Declarative cartography: in-database map generalization of geospatial datasets. In: Proceedings of the International Conference on Data Engineering, ICDE, pp. 1024–1035. IEEE (2014)

  16. Kini, A., Emanuele, R.: Geotrellis: adding geospatial capabilities to spark. Spark Summit (2014)

  17. Lins, L., Klosowski, J.T., Scheidegger, C.: Nanocubes for real-time exploration of spatiotemporal datasets. IEEE Trans. Vis. Comput. Graph. TVCG 19(12), 2456–2465 (2013)

  18. Liu, Z., Jiang, B., Heer, J.: immens: real-time visual querying of big data. In: Computer Graphics Forum, vol. 32, pp. 421–430. Wiley Online Library (2013)

  19. Apache Livy. https://livy.apache.org/

  20. Lu, J., Güting, R.H.: Parallel secondo: boosting database engines with hadoop. In: International Conference on Parallel and Distributed Systems, ICPADS, pp. 738–743. IEEE (2012)

  21. Mahdian, M., Schrijvers, O., Vassilvitskii, S.: Algorithmic cartography: placing points of interest and ads on maps. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining, SIGKDD, pp. 755–764 (2015)

  22. Mostak, T.: An overview of MAPD (massively parallel database). White paper, Massachusetts Institute of Technology (2013)

  23. OpenStreetMap. Map Zoom Level. http://wiki.openstreetmap.org/wiki/Zoom_levels

  24. Park, Y., Cafarella, M.J., Mozafari, B.: Visualization-aware sampling for very large databases. In: Proceedings of the International Conference on Data Engineering, ICDE, pp. 755–766. IEEE (2016)

  25. Rahman, S., Aliakbarpour, M., Kong, H., Blais, E., Karahalios, K., Parameswaran, A.G., Rubinfeld, R.: I’ve seen “enough”: incrementally improving visualizations to support rapid decision making. Proc. VLDB Endow. PVLDB 10(11), 1262–1273 (2017)

  26. Sarma, A.D., Lee, H., Gonzalez, H., Madhavan, J., Halevy, A.Y.: Efficient spatial sampling of large geographical tables. In: Proceedings of the ACM International Conference on Management of Data, SIGMOD, pp. 193–204 (2012)

  27. Satyanarayan, A., Moritz, D., Wongsuphasawat, K., Heer, J.: Vega-lite: a grammar of interactive graphics. IEEE Trans. Vis. Comput. Graph. TVCG 23(1), 341–350 (2017)

  28. Satyanarayan, A., Russell, R., Hoffswell, J., Heer, J.: Reactive vega: a streaming dataflow architecture for declarative interactive visualization. IEEE Trans. Vis. Comput. Graph. TVCG 22(1), 659–668 (2016)

  29. Scarsella, A., Stofega, W.: Worldwide Smartphone Forecast 2020–2024. Technical report, International Data Corporation (IDC) (2020). https://www.idc.com/getdoc.jsp?containerId=US46135620

  30. Apache Spark. http://spark.apache.org/

  31. Su, S., An, M., Perry, V., Jia, J., Kim, T., Chen, T., Li, C.: Visually analyzing A billion tweets: an application for collaborative visual analytics on large high-resolution display. In: Proceedings of International Conference on Big Data, BigData, pp. 3597–3606 (2018)

  32. Tang, M., Yu, Y., Malluhi, Q.M., Ouzzani, M., Aref, W.G.: LocationSpark: a distributed in-memory data management system for big spatial data. Proc. VLDB Endow. PVLDB 9(13), 1565–1568 (2016)

  33. Wang, L., Christensen, R., Li, F., Yi, K.: Spatial online sampling and aggregation. Proc. VLDB Endow. PVLDB 9(3), 84–95 (2015)

  34. Weibel, R., Dutton, G.: Generalising spatial data and dealing with multiple representations. Geograph. Inf. Syst. 1, 125–155 (1999)

  35. Wu, E., Battle, L., Madden, S.R.: The case for data visualization management systems. Proc. VLDB Endow. PVLDB 7(10), 903–906 (2014)

  36. Xie, D., Li, F., Yao, B., Li, G., Zhou, L., Guo, M.: Simba: efficient in-memory spatial analytics. In: Proceedings of the ACM International Conference on Management of Data, SIGMOD, pp. 1071–1085. ACM (2016)

  37. Yu, J., Sarwat, M.: Indexing the pickup and drop-off locations of NYC taxi trips in postgresql—lessons from the road. In: Proceedings of the International Symposium on Advances in Spatial and Temporal Databases, SSTD, pp. 145–162 (2017)

  38. Yu, J., Sarwat, M.: Turbocharging geospatial visualization dashboards via a materialized sampling cube approach. In: Proceedings of the International Conference on Data Engineering, ICDE, pp. 1165–1176. IEEE (2020)

  39. Yu, J., Zhang, Z., Sarwat, M.: Geosparkviz: a scalable geospatial data visualization framework in the apache spark ecosystem. In: Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM, pp. 15:1–15:12 (2018)

  40. Yu, J., Zhang, Z., Sarwat, M.: Spatial data management in apache spark: the GeoSpark perspective and beyond. GeoInformatica 23(1), 37–78 (2019)

Author information

Corresponding author

Correspondence to Jia Yu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional GeoViz SQL specification

This section gives additional specification details for GeoViz SQL, complementing the content in Sect. 4. It also provides more examples that demonstrate how to assemble map effects.

1.1 Type and function specification

GeoViz SQL allows declarative SQL-like queries over structured RDDs. Each RDD has a schema which consists of a number of attributes. Each attribute has a type in Spark.

1.1.1 Types

GeoSparkViz adds two new object types, Pixel and Image, so that Spark can understand and manipulate map data. In addition, GeoSpark itself adds a type called Geometry to Spark to represent geospatial data.

Geometry [40] This is a generic data type which internally represents a variety of spatial objects, such as points, line strings, and polygons. It has several fields such as coordinates.

Pixel This type extends the Geometry type to support pixels, so spatial query operators can process it directly. It is used by several map building operators: Pixelize, Pixel aggregate and Render. Besides the original fields of Geometry, it has two additional fields: (1) resolution and (2) tile ID. A pixel can be considered a point object.

Image This type is a serializable wrapper around the Java BufferedImage class and holds the actual map tile data. It adds serialization functions to BufferedImage. Each map tile in GeoSparkViz is an object of type Image.

1.1.2 Functions

ST_TileId Each pixel in GeoSparkViz has several internal attributes, one of which is its tile ID. This function returns that tile ID, which is used to partition the pixels properly.

  • Input The function takes as input a pixel attribute.

  • Output It returns the tile ID of this pixel. The ID is a string type object.
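
To make the role of the tile ID in partitioning concrete, here is a minimal Python sketch. It is not GeoSparkViz's implementation: the function names `tile_id` and `partition_by_tile`, the `"col-row"` string format, and the uniform tile grid are all assumptions chosen for illustration, since the text only states that the ID is a string and that it drives partitioning.

```python
from collections import defaultdict


def tile_id(px, py, res_x, res_y, tiles_x, tiles_y):
    """Return a hypothetical "col-row" tile ID for the pixel (px, py).

    Assumes a res_x * res_y pixel canvas cut into a uniform grid of
    tiles_x * tiles_y tiles. The ID format is illustrative only.
    """
    if not (0 <= px < res_x and 0 <= py < res_y):
        raise ValueError("pixel lies outside the canvas")
    tile_w = max(res_x // tiles_x, 1)
    tile_h = max(res_y // tiles_y, 1)
    # Clamp so leftover pixels on the right/bottom edge join the last tile.
    col = min(px // tile_w, tiles_x - 1)
    row = min(py // tile_h, tiles_y - 1)
    return f"{col}-{row}"


def partition_by_tile(pixels, res_x, res_y, tiles_x, tiles_y):
    """Group pixel coordinates by tile ID, as a partitioner might."""
    groups = defaultdict(list)
    for px, py in pixels:
        key = tile_id(px, py, res_x, res_y, tiles_x, tiles_y)
        groups[key].append((px, py))
    return dict(groups)


parts = partition_by_tile([(0, 0), (10, 10), (300, 10)], 512, 512, 2, 2)
```

In this sketch, pixels that share a tile ID end up in the same group, so each tile can later be rendered from a single partition.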

ST_EncodeImage This function returns the base64 string representation of an image. It targets server-client environments: for example, some clients, such as Apache Zeppelin, can display base64 images directly.

  • Input The function takes as input an image attribute.

  • Output It returns a base64 string of the image.
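
The idea behind base64 image encoding can be sketched in plain Python. This is an illustration under stated assumptions, not the GeoSparkViz code path: `solid_png` and `encode_image` are hypothetical helpers, and the tiny PNG is built by hand only so that the example needs no image library.

```python
import base64
import struct
import zlib


def _chunk(tag, data):
    """Build one PNG chunk: length, tag, data, CRC32 over tag + data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def solid_png(width, height, rgb):
    """Create a small solid-color PNG (8-bit RGB, no interlacing)."""
    header = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    row = b"\x00" + bytes(rgb) * width  # filter byte 0, then pixel data
    return (b"\x89PNG\r\n\x1a\n"
            + _chunk(b"IHDR", header)
            + _chunk(b"IDAT", zlib.compress(row * height))
            + _chunk(b"IEND", b""))


def encode_image(png_bytes):
    """Return the base64 string representation of encoded image bytes."""
    return base64.b64encode(png_bytes).decode("ascii")


tile = solid_png(2, 2, (255, 0, 0))
encoded = encode_image(tile)
# A client that renders HTML could embed the string in a data URI.
html = f'<img src="data:image/png;base64,{encoded}"/>'
```

Because the result is plain ASCII text, it can travel from the cluster to a browser-based client inside an ordinary query result.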

1.2 Additional GeoViz query examples

In this section, we provide more examples of how to assemble GeoViz queries. Another example, a scatter plot of taxi trip pickup points, can be found in Sect. 4.2.

Spatial dataset We use the NYC taxi trip dataset mentioned in Fig. 5 as the running example in this section. The dataset is loaded into a structured Spatial RDD.

Heat map of taxi trip pickup points This shows a heat map of the distribution of taxi trip pickup points. The color is proportional to the density of pickup points. The max weight is 100, which means that a place where more than 100 trips were picked up is shown in red. The initial weight in the Pixelize operator is 1 and the aggregation strategy is count().

Single-image map of taxi trip pickup points This shows a heat map of the distribution of taxi trip pickup points in a single map image that does not have any tiles. The other parameters are the same as in the previous example. This is similar to the queries shown in Fig. 4.

Heat map of trip fare This shows a heat map of trip fare: the more the trips picked up from a place cost, the closer that place is to red. The max weight is 30, which means that a place whose trips cost more than 30 dollars is shown in red. The initial weight in the Pixelize operator is the "trip fare" attribute and the aggregation strategy is avg().
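
The interplay of initial weight, aggregation strategy, and max weight in these examples can be sketched in plain Python. This is a single-machine illustration under stated assumptions, not GeoViz SQL or the GeoSparkViz operators: the function names `pixelize`, `aggregate`, and `colorize`, the toy coordinates, and the linear blue-to-red color ramp are all hypothetical.

```python
from collections import defaultdict


def pixelize(points, bounds, res_x, res_y):
    """Map (x, y, weight) records to ((px, py), weight) pairs."""
    min_x, min_y, max_x, max_y = bounds
    out = []
    for x, y, w in points:
        px = min(int((x - min_x) / (max_x - min_x) * res_x), res_x - 1)
        py = min(int((y - min_y) / (max_y - min_y) * res_y), res_y - 1)
        out.append(((px, py), w))
    return out


def aggregate(pixels, strategy):
    """Combine the weights of overlapping pixels with count() or avg()."""
    groups = defaultdict(list)
    for key, w in pixels:
        groups[key].append(w)
    if strategy == "count":
        return {k: float(len(v)) for k, v in groups.items()}
    if strategy == "avg":
        return {k: sum(v) / len(v) for k, v in groups.items()}
    raise ValueError(f"unknown strategy: {strategy}")


def colorize(weight, max_weight):
    """Assumed linear ramp: blue at 0, fully red at or above max_weight."""
    t = min(weight / max_weight, 1.0)
    return (round(255 * t), 0, round(255 * (1 - t)))


bounds = (0.0, 0.0, 10.0, 10.0)
trips = [(1.5, 1.5, 12.0), (1.7, 1.6, 48.0), (9.5, 9.5, 6.0)]  # (x, y, fare)

# Density heat map: initial weight 1, aggregation count(), max weight 100.
ones = [(x, y, 1) for x, y, _ in trips]
density = aggregate(pixelize(ones, bounds, 10, 10), "count")

# Fare heat map: initial weight is the fare, aggregation avg(), max weight 30.
fare = aggregate(pixelize(trips, bounds, 10, 10), "avg")
fare_colors = {k: colorize(w, 30) for k, w in fare.items()}
```

Here the two trips that fall into the same pixel contribute a count of 2 to the density map and an average fare of 30 to the fare map, which the assumed ramp already renders as full red.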

Cite this article

Yu, J., Sarwat, M. GeoSparkViz: a cluster computing system for visualizing massive-scale geospatial data. The VLDB Journal 30, 237–258 (2021). https://doi.org/10.1007/s00778-020-00645-2
