Divide and recombine (D&R) data science projects for deep analysis of big data and high computational complexity

Tung, Wen-wen; Barthur, Ashrith; Bowers, Matthew C.; Song, Yuying; Gerth, John; Cleveland, William S.

doi:10.1007/s42081-018-0008-4

Published: 15 May 2018

Volume 1, pages 139–156, (2018)

Japanese Journal of Statistics and Data Science

Wen-wen Tung ORCID: orcid.org/0000-0001-8627-1503¹,
Ashrith Barthur²,
Matthew C. Bowers¹,
Yuying Song⁴,
John Gerth³ &
William S. Cleveland⁴

Abstract

The focus of data science is data analysis. This article begins with a categorization of the data science technical areas that play a direct role in data analysis. Next, big data are addressed: the data size creates computational challenges, as does the computational complexity of many analytic methods. Divide and recombine (D&R) is a statistical approach whose goal is to meet these challenges. In D&R, the data are divided into subsets, an analytic method is applied independently to each subset, and the outputs are recombined. This enables a large component of embarrassingly parallel computation, the fastest parallel computation. DeltaRho open-source software implements D&R. At the front end, the analyst programs in R. The back end is the Hadoop distributed file system and parallel compute engine. The goals of D&R are the following: access to thousands of methods of machine learning, statistics, and data visualization; deep analysis of the data, which means analysis of the detailed data at their finest granularity; easy programming of analyses; and high computational performance. To succeed, D&R requires research in all of the technical areas of data science. Network cybersecurity and climate science are two subject-matter areas with big, complex data benefiting from D&R. We illustrate this by discussing two datasets, one from each area. The first consists of measurements of 13 variables for each of 10,615,054,608 queries to the Spamhaus IP address blacklisting service. The second has 50,632 3-hourly satellite rainfall estimates at 576,000 locations.
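
The divide, apply, and recombine steps described in the abstract can be illustrated with a minimal sketch. This is not the DeltaRho API, and the article's front end is R rather than Python. The function names (`divide`, `fit_line`, `recombine`), the synthetic data, the round-robin division, and the choice of averaging per-subset least-squares estimates as the recombination are all assumptions made for illustration.

```python
import random
from statistics import fmean


def divide(rows, n_subsets):
    """Divide: split the rows into n_subsets subsets (round-robin, an assumed scheme)."""
    return [rows[i::n_subsets] for i in range(n_subsets)]


def fit_line(subset):
    """Analytic method for one subset: least-squares slope and intercept."""
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in subset)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def recombine(outputs):
    """Recombine: average the per-subset estimates (one simple, assumed choice)."""
    slopes, intercepts = zip(*outputs)
    return fmean(slopes), fmean(intercepts)


def divide_and_recombine(rows, n_subsets):
    subsets = divide(rows, n_subsets)
    # Each call depends only on its own subset, so this map is
    # embarrassingly parallel and could be handed to a pool of workers.
    outputs = list(map(fit_line, subsets))
    return recombine(outputs)


# Synthetic data (hypothetical): y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
rows = []
for _ in range(10_000):
    x = rng.uniform(0.0, 10.0)
    rows.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.5)))

slope, intercept = divide_and_recombine(rows, n_subsets=20)
print(f"slope={slope:.3f} intercept={intercept:.3f}")
```

Because `fit_line` never looks outside its own subset, the middle step needs no communication between workers, which is the property the abstract calls embarrassingly parallel. The averaged estimate is generally close to, but not identical to, the fit on the undivided data.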

Acknowledgements

D&R and DeltaRho were supported by the NSF/DHS Visual Analytics Program Award 0937123, the NSF CDS&E Big Data Program Award 1228348, and the DARPA XDATA Big Data Program Contract FA8750-12-2-0343. WT and MCB were partially supported by the NASA Earth and Space Science Fellowship Grant NASA-NNX16AO62H. The authors are grateful to Doug Crabill for helping maintain the DeltaRho software stack and administering the Hadoop clusters. They thank Qi Liu for assisting with TRMM data ingestion. This research was supported in part through computational resources provided by Information Technology at Purdue, West Lafayette, Indiana.

Author information

Authors and Affiliations

Department of Earth, Atmospheric, and Planetary Sciences, Purdue University, West Lafayette, IN, 47907, USA
Wen-wen Tung & Matthew C. Bowers
CERIAS, Purdue University, West Lafayette, USA
Ashrith Barthur
Departments of Computer Science and Electrical Engineering, Stanford University, Stanford, USA
John Gerth
Department of Statistics, Purdue University, West Lafayette, USA
Yuying Song & William S. Cleveland

Corresponding author

Correspondence to Wen-wen Tung.

About this article

Cite this article

Tung, Ww., Barthur, A., Bowers, M.C. et al. Divide and recombine (D&R) data science projects for deep analysis of big data and high computational complexity. Jpn J Stat Data Sci 1, 139–156 (2018). https://doi.org/10.1007/s42081-018-0008-4

Received: 26 February 2018
Accepted: 18 April 2018
Published: 15 May 2018
Issue Date: June 2018
DOI: https://doi.org/10.1007/s42081-018-0008-4
