Big Data Science

Abstract

In ever more disciplines, science is driven by data, making data analytics a primary skill for researchers. This covers the complete process, from data acquisition at the sensors through pre-processing and feature extraction to the application of machine learning. Sensors often produce large volumes of data that must be handled in near real time, which requires combining hardware-level implementations with the high-level design of data flows. In this paper we outline two use cases that span this range of data analysis for science, drawn from a real-world example in astroparticle physics. We also describe a high-level design approach that is capable of defining the complete data flow from sensor hardware to final analysis.

Notes

  1. The term has been used at Silicon Graphics by John Mashey since 1998, but it became popular only in 2012, with a trending peak in 2016 according to Google Trends.

  2. Project C3 by Wolfgang Rhode, Katharina Morik, and Tim Ruhe investigates astrophysical data from the IceCube project and Cherenkov telescopes. Project C5 by Bernhard Spaan and Jens Teubner deals with the data of the LHCb experiment at the Large Hadron Collider (LHC) facility in Geneva.

  3. TU Dortmund University has offered studies in data science within the statistics faculty since 2002. Within the computer science faculty, students may specialize in data science.

Acknowledgements

This work has been supported by the DFG, Collaborative Research Center SFB 876 (http://sfb876.tu-dortmund.de/), projects C3 and A1.

Author information

Corresponding author

Correspondence to Christian Bockermann.

About this article

Cite this article

Morik, K., Bockermann, C. & Buschjäger, S. Big Data Science. Künstl Intell 32, 27–36 (2018). https://doi.org/10.1007/s13218-017-0522-8

  • DOI: https://doi.org/10.1007/s13218-017-0522-8

Keywords

  • Big data
  • Data science