KI - Künstliche Intelligenz

Volume 32, Issue 1, pp 27–36

Big Data Science

  • Katharina Morik
  • Christian Bockermann
  • Sebastian Buschjäger
Technical Contribution

Abstract

In ever more disciplines, science is driven by data, making data analytics a primary skill for researchers. This spans the complete process from data acquisition at the sensors, through pre-processing and feature extraction, to the application of machine learning. The sensors often produce a plethora of data that must be handled in near real time, which requires a combined effort ranging from hardware-level implementations to the high-level design of data flows. In this paper, we illustrate this wide span of data analysis for science with two use cases from a real-world example in astroparticle physics. We outline a high-level design approach that is capable of defining the complete data flow from the sensor hardware to the final analysis.

Keywords

Big data · Data science

Notes

Acknowledgements

This work has been supported by the DFG, Collaborative Research Center SFB 876 (http://sfb876.tu-dortmund.de/), projects C3 and A1.

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2017

Authors and Affiliations

  1. TU Dortmund, Dortmund, Germany