A fast scalable distributed kriging algorithm using Spark framework

  • Regular Paper
International Journal of Data Science and Analytics

Abstract

Environmental and climate models used for weather prediction require evenly spaced meteorological datasets at very high spatial and temporal resolution to facilitate the analysis of recent climatic changes. However, because few weather stations are available, the data collected from them are often scattered and inadequate for building such models. For this reason, very high-resolution gridded meteorological surfaces are developed by interpolating the available scattered data points to meet the needs of various ecological and climatic applications. Among the various interpolation techniques, Ordinary Kriging (OK) is one of the most popular and widely used gridding methods, with a sound statistical basis that makes highly accurate results possible. However, OK interpolation on large sets of unevenly spaced data points is computationally demanding: its cost scales as the cube of the number of data points, because it involves multiplying and inverting matrices too large to process on a single node. Additionally, its standard implementation involves complex model fitting and function minimization steps, which make automatic kriging analysis from raw data a considerable challenge. Meanwhile, Apache Spark has emerged as a large-scale data processing engine with a dedicated Machine Learning Library (MLlib) for processing large matrices, and can therefore be used for large-scale kriging analysis in a reasonable amount of time. In this paper, we present a new fast distributed OK algorithm on the Apache Spark framework and provide an efficient and simple distributed matrix inversion scheme to accelerate its execution. We employ Strassen’s direct method for matrix inversion, and the acceleration is achieved by exploiting the symmetry of the variance–covariance matrix of the OK equation during inversion. We show experimentally that our distributed inversion scheme inverts a \(16{,}000 \times 16{,}000\) matrix with 51% and 38% less wall clock time than the distributed Spark-based LU and Strassen’s inversion schemes, respectively.
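
To illustrate how symmetry can reduce the work in a Strassen-style block inversion, the following is a minimal single-node sketch in plain Python. It is not the distributed Spark scheme described in the paper. The function names (`matmul`, `sym_block_inverse`, and the other helpers) are hypothetical, and the sketch assumes a symmetric input whose leading blocks are invertible, as is the case for a positive definite covariance matrix. In the general block recursion, six block products are formed per level; in this sketch two of them are obtained by transposition instead.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def neg(a):
    return [[-x for x in row] for row in a]


def sym_block_inverse(a):
    """Invert a symmetric matrix by recursive 2x2 block partitioning.

    Assumes every leading block met during the recursion is invertible
    (true for symmetric positive definite input). No pivoting is done.
    """
    n = len(a)
    if n == 1:
        return [[1.0 / a[0][0]]]
    k = n // 2
    a11 = [row[:k] for row in a[:k]]
    a12 = [row[k:] for row in a[:k]]
    a22 = [row[k:] for row in a[k:]]
    a21 = transpose(a12)               # symmetry: A21 = A12^T

    inv11 = sym_block_inverse(a11)     # I   = A11^{-1}
    iii = matmul(inv11, a12)           # III = I * A12
    ii = transpose(iii)                # II  = A21 * I, free by symmetry
    iv = matmul(a21, iii)              # IV  = A21 * III
    v = sub(iv, a22)                   # V   = IV - A22 (still symmetric)
    vi = sym_block_inverse(v)          # VI  = V^{-1}
    c12 = matmul(iii, vi)              # C12 = III * VI
    c21 = transpose(c12)               # C21 = VI * II, free by symmetry
    vii = matmul(iii, c21)             # VII = III * C21
    c11 = sub(inv11, vii)              # C11 = I - VII
    c22 = neg(vi)                      # C22 = -VI

    top = [r1 + r2 for r1, r2 in zip(c11, c12)]
    bottom = [r1 + r2 for r1, r2 in zip(c21, c22)]
    return top + bottom


if __name__ == "__main__":
    # Small symmetric positive definite matrix, chosen only for illustration.
    cov = [[4.0, 1.0, 2.0],
           [1.0, 3.0, 0.0],
           [2.0, 0.0, 5.0]]
    for row in sym_block_inverse(cov):
        print(row)
```

In a distributed setting each block product would be a distributed matrix multiplication, so replacing two of them with transposes is where a saving of this kind would come from. The savings actually measured are those reported in the abstract, not anything this sketch demonstrates.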

Notes

  1. https://www.cpc.ncep.noaa.gov/.

Author information

Corresponding author

Correspondence to Chandan Misra.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 173 KB)

Cite this article

Misra, C., Bhattacharya, S. & Ghosh, S.K. A fast scalable distributed kriging algorithm using Spark framework. Int J Data Sci Anal 10, 249–264 (2020). https://doi.org/10.1007/s41060-020-00215-3

  • DOI: https://doi.org/10.1007/s41060-020-00215-3
