Open Data Science

  • Leo Lahti
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11191)

Abstract

The increasing openness of data, methods, and collaboration networks has created new opportunities for research, citizen science, and industry. Whereas openly licensed scientific, governmental, and institutional data sets can now be accessed through programmatic interfaces, compressed archives, and downloadable spreadsheets, realizing the full potential of open data streams depends critically on the availability of targeted data analytical methods, and on user communities that can derive value from these digital resources. Interoperable software libraries have become a central element in modern statistical data analysis, bridging the gap between theory and practice, while open developer communities have emerged as a powerful driver of research software development. Drawing insights from a decade of community engagement, I propose the concept of open data science, which refers to the new forms of research enabled by open data, open methods, and open collaboration.

Keywords

Algorithmic data analysis; Open data science; Open collaboration; Open research software

Notes

Acknowledgements

I am grateful to the rOpenGov contributors, in particular Joona Lehtomäki, Markus Kainu, and Juuso Parkkinen, and to our close collaborator Mikko Tolonen. The work has been partially funded by the Academy of Finland (decisions 295741, 307127).

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. University of Turku, Turku, Finland
