Divide and recombine (D&R) data science projects for deep analysis of big data and high computational complexity

Japanese Journal of Statistics and Data Science

Abstract

The focus of data science is data analysis. This article begins with a categorization of the data science technical areas that play a direct role in data analysis. Next, it addresses big data, whose size creates computational challenges, as does the computational complexity of many analytic methods. Divide and recombine (D&R) is a statistical approach whose goal is to meet these challenges. In D&R, the data are divided into subsets, an analytic method is applied independently to each subset, and the outputs are recombined. This enables a large component of embarrassingly parallel computation, the fastest parallel computation. DeltaRho open-source software implements D&R. At the front end, the analyst programs in R. The back end is the Hadoop distributed file system and parallel compute engine. The goals of D&R are the following: access to thousands of methods of machine learning, statistics, and data visualization; deep analysis of the data, which means analysis of the detailed data at their finest granularity; easy programming of analyses; and high computational performance. To succeed, D&R requires research in all of the technical areas of data science. Network cybersecurity and climate science are two subject-matter areas with big, complex data benefiting from D&R. We illustrate this by discussing two datasets, one from each area. The first consists of measurements of 13 variables for each of 10,615,054,608 queries to the Spamhaus IP address blacklisting service. The second has 50,632 3-hourly satellite rainfall estimates at 576,000 locations.
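
The divide, apply, and recombine steps described in the abstract can be illustrated with a minimal sketch. This is not the DeltaRho interface, which the analyst programs in R; it is a plain Python illustration under stated assumptions. The function names (`divide`, `fit_line`, `recombine`), the round-robin division, the choice of least-squares line fitting as the analytic method, the averaging recombination, and the synthetic data are all hypothetical choices made for this example.

```python
import random
from statistics import fmean


def divide(rows, n_subsets):
    """Divide: split the rows into n_subsets subsets by round-robin assignment."""
    return [rows[i::n_subsets] for i in range(n_subsets)]


def fit_line(subset):
    """Apply: ordinary least squares for y = intercept + slope * x on one subset."""
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in subset)
    slope = sxy / sxx
    return (y_bar - slope * x_bar, slope)


def recombine(outputs):
    """Recombine: average the per-subset estimates (one possible choice)."""
    intercepts, slopes = zip(*outputs)
    return (fmean(intercepts), fmean(slopes))


def divide_and_recombine(rows, n_subsets):
    subsets = divide(rows, n_subsets)
    # Each call below depends only on its own subset, so the calls could be
    # dispatched to separate workers with no communication between them.
    outputs = [fit_line(s) for s in subsets]
    return recombine(outputs)


# Synthetic data for illustration only: y = 2 + 3x plus Gaussian noise.
rng = random.Random(0)
rows = []
for _ in range(10_000):
    x = rng.uniform(0.0, 10.0)
    rows.append((x, 2.0 + 3.0 * x + rng.gauss(0.0, 0.1)))

intercept, slope = divide_and_recombine(rows, n_subsets=20)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

The per-subset fits are the embarrassingly parallel component: no subset's computation needs the result of another. In this sketch they run sequentially in a list comprehension; a parallel back end would distribute the same independent calls across workers.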

Acknowledgements

D&R and DeltaRho were supported by the NSF/DHS Visual Analytics Program Award 0937123, the NSF CDS&E Big Data Program Award 1228348, and the DARPA XDATA Big Data Program Contract FA8750-12-2-0343. WT and MCB were partially supported by the NASA Earth and Space Science Fellowship Grant NASA-NNX16AO62H. The authors are grateful to Doug Crabill for helping maintain the DeltaRho software stack and administrating the Hadoop clusters. They thank Qi Liu for assisting with TRMM data ingestion. This research was supported in part through computational resources provided by Information Technology at Purdue, West Lafayette, Indiana.

Corresponding author

Correspondence to Wen-wen Tung.

About this article

Cite this article

Tung, Ww., Barthur, A., Bowers, M.C. et al. Divide and recombine (D&R) data science projects for deep analysis of big data and high computational complexity. Jpn J Stat Data Sci 1, 139–156 (2018). https://doi.org/10.1007/s42081-018-0008-4

  • DOI: https://doi.org/10.1007/s42081-018-0008-4
