Abstract
Environmental and climate models used for weather prediction require evenly spaced meteorological datasets at very high spatial and temporal resolution to support the analysis of recent climatic changes. However, because weather stations are few, the data collected from them are often scattered and inadequate for building such models. For this reason, very high-resolution gridded meteorological surfaces are developed by interpolating the available scattered data points to meet the needs of various ecological and climatic applications. Among interpolation techniques, Ordinary Kriging (OK) is one of the most popular and widely used gridding methods, with a sound statistical basis that makes highly accurate results possible. However, OK interpolation on a large set of unevenly spaced data points is computationally demanding: its cost scales as the cube of the number of data points, because it involves multiplying and inverting matrices too large to process on a single node. Additionally, its standard implementation involves complex model fitting and function minimization steps, which make automatic kriging analysis from raw data a considerable challenge. Meanwhile, Apache Spark has emerged as a large-scale data processing engine with a dedicated Machine Learning Library (MLlib) for processing large matrices, and can therefore be used for large-scale kriging analysis in a practical amount of time. In this paper, we present a new fast distributed OK algorithm on the Apache Spark framework and provide an efficient and simple distributed matrix inversion scheme to accelerate its execution. We employ Strassen’s direct method for matrix inversion, and the acceleration is achieved by exploiting the symmetry of the variance–covariance matrix of the OK equation when inverting it.
We show experimentally that our distributed inversion scheme inverts a \(16{,}000 \times 16{,}000\) matrix with 51% and 38% less wall clock time than the distributed Spark-based LU and Strassen’s inversion schemes, respectively.
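The idea of exploiting symmetry in a Strassen-type block inversion can be illustrated on a single machine. The following is a minimal NumPy sketch, not the paper's Spark implementation: the function name `invert_symmetric` and the `leaf` cutoff are hypothetical, and the code assumes that every leading block and Schur complement met during the recursion is invertible (which holds, for example, for a symmetric positive definite matrix). Because the lower-left block equals the transpose of the upper-right block, it is never read, and the lower-left block of the inverse is obtained by transposition rather than by further matrix products.

```python
import numpy as np


def invert_symmetric(a, leaf=64):
    """Invert a symmetric matrix by recursive 2x2 block inversion.

    Illustrative sketch only. Assumes all leading blocks and Schur
    complements encountered are invertible (true for SPD input).
    """
    n = a.shape[0]
    if n <= leaf:
        # Small blocks are inverted directly.
        return np.linalg.inv(a)

    h = n // 2
    a11, a12, a22 = a[:h, :h], a[:h, h:], a[h:, h:]
    # By symmetry a21 == a12.T, so the lower-left block is never read.

    inv11 = invert_symmetric(a11, leaf)
    p = inv11 @ a12                  # A11^{-1} A12; p.T equals A21 A11^{-1}
    schur = a22 - a12.T @ p          # Schur complement of A11, also symmetric
    c22 = invert_symmetric(schur, leaf)

    c12 = -p @ c22                   # upper-right block of the inverse
    c11 = inv11 - c12 @ p.T          # A11^{-1} + p C22 p^T
    # The lower-left block of the inverse is c12.T, with no extra products.
    return np.block([[c11, c12], [c12.T, c22]])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m = rng.standard_normal((300, 300))
    cov = m @ m.T + 300 * np.eye(300)   # synthetic SPD stand-in for a covariance matrix
    inv = invert_symmetric(cov, leaf=32)
    print("max |inv @ cov - I| =", np.abs(inv @ cov - np.eye(300)).max())
```

In a general (non-symmetric) block inversion, the products involving the lower-left block must be computed separately; here they are replaced by transposes of quantities already available, which is the kind of saving the abstract attributes to symmetry. How the paper distributes these block operations across a Spark cluster is not shown here.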
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
About this article
Cite this article
Misra, C., Bhattacharya, S. & Ghosh, S.K. A fast scalable distributed kriging algorithm using Spark framework. Int J Data Sci Anal 10, 249–264 (2020). https://doi.org/10.1007/s41060-020-00215-3
DOI: https://doi.org/10.1007/s41060-020-00215-3