Abstract
The increasing openness of data, methods, and collaboration networks has created new opportunities for research, citizen science, and industry. Whereas openly licensed scientific, governmental, and institutional data sets can now be accessed through programmatic interfaces, compressed archives, and downloadable spreadsheets, realizing the full potential of open data streams depends critically on the availability of targeted data analytical methods, and on user communities that can derive value from these digital resources. Interoperable software libraries have become a central element in modern statistical data analysis, bridging the gap between theory and practice, while open developer communities have emerged as a powerful driver of research software development. Drawing insights from a decade of community engagement, I propose the concept of open data science, which refers to the new forms of research enabled by open data, open methods, and open collaboration.
Notes
- 5. The gisfin and helsinki packages; see http://ropengov.github.io.
Acknowledgements
I am grateful to the rOpenGov contributors, in particular Joona Lehtomäki, Markus Kainu, and Juuso Parkkinen, and to our close collaborator Mikko Tolonen. This work was partially funded by the Academy of Finland (decisions 295741, 307127).
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Lahti, L. (2018). Open Data Science. In: Duivesteijn, W., Siebes, A., Ukkonen, A. (eds) Advances in Intelligent Data Analysis XVII. IDA 2018. Lecture Notes in Computer Science, vol 11191. Springer, Cham. https://doi.org/10.1007/978-3-030-01768-2_3
DOI: https://doi.org/10.1007/978-3-030-01768-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01767-5
Online ISBN: 978-3-030-01768-2
eBook Packages: Computer Science, Computer Science (R0)