Abstract
Research in information science and scholarly communication relies heavily on openly accessible datasets of metadata about scholarly entities and, where possible, their related payloads. Because this metadata is scattered across diverse, freely accessible online resources (e.g. Crossref, ORCID), researchers in this domain must struggle with (meta)data integration problems in order to produce custom datasets, often of undocumented and rather obscure provenance. This practice wastes time, duplicates effort, and typically violates the open science best practices of transparency and reproducibility. In this article, we describe how to generate DOIBoost, a metadata collection that enriches Crossref with inputs from Microsoft Academic Graph, ORCID, and Unpaywall, with the aim of supporting high-quality and robust research experiments, saving researchers time, and enabling the comparison of such experiments. To this end, we describe the value of the dataset and its schema, analyse its actual content, and share the software toolkit and experimental workflow required to reproduce it. The DOIBoost dataset and software toolkit are openly available via Zenodo.org. DOIBoost will become an input source to the OpenAIRE information graph.
Keywords
- Scholarly communication
- Open science
- Data science
- Data integration
- Crossref
- ORCID
- Unpaywall
- Microsoft Academic Graph
Notes
1. Crossref APIs, https://www.crossref.org/services/metadata-delivery/rest-api.
2. Microsoft Academic Graph, https://aka.ms/msracad.
3. ORCID, http://orcid.org.
4. Unpaywall, http://unpaywall.org.
5. OpenAIRE EXPLORE, http://explore.openaire.eu.
6. GRID database, https://www.grid.ac.
7. The field “access-rights” can assume the values OPEN, EMBARGO, RESTRICTED, CLOSED, UNKNOWN.
8. Apache Oozie, http://oozie.apache.org.
9. Affero General Public License, https://en.wikipedia.org/wiki/Affero_General_Public_License.
10. Crossref REST API - GitHub, https://github.com/Crossref/rest-api-doc.
11. MAG Schema, https://microsoftdocs.github.io/MAG/Mag-ADLS-Schema.
12. Unpaywall data format, https://unpaywall.org/data-format.
13. Levenshtein Distance, https://en.wikipedia.org/wiki/Levenshtein_distance.
References
Chawla, D.S.: Unpaywall finds free versions of paywalled papers. Nature News (2017)
Sinha, A., et al.: An overview of Microsoft Academic Service (MAS) and applications. In: Proceedings of the 24th International Conference on World Wide Web (WWW 2015 Companion), pp. 243–246. ACM, New York (2015)
Haak, L.L., Fenner, M., Paglione, L., Pentz, E., Ratner, H.: ORCID: a system to uniquely identify researchers. Learn. Publish. 25, 259–264 (2012). https://doi.org/10.1087/20120404
Manghi, P., Bolikowski, L., Manold, N., Schirrwagen, J., Smith, T.: OpenAIREplus: the European scholarly communication data infrastructure. D-Lib Mag. 18(9), 1 (2012)
Fortunato, S., et al.: Science of science. Science 359(6379), eaao0185 (2018)
La Bruzzo, S., Manghi, P., Mannocci, A.: DOIBoost Dataset Dump (Version 1.0) [Data set]. Zenodo (2018). http://doi.org/10.5281/zenodo.1438356
La Bruzzo, S.: DOIBoost Software Toolkit (Version 1.0). Zenodo, 1 October 2018. http://doi.org/10.5281/zenodo.1441058
Acknowledgements
This work was made possible by the Open Science policies enacted by Microsoft, Unpaywall, ORCID, and Crossref, which allow researchers to openly collect their metadata records for research purposes under CC-0 and CC-BY licenses. The MAG dataset is available under the ODC-BY license thanks to the Azure4research sponsorship signed between Microsoft Research and KMi. This work was partially funded by the EU projects OpenAIRE2020 (H2020-EINFRA-2014-1, grant agreement: 643410) and OpenAIRE-Advance (H2020-EINFRA-2017, grant agreement: 777541) [4].
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
La Bruzzo, S., Manghi, P., Mannocci, A. (2019). OpenAIRE’s DOIBoost - Boosting Crossref for Research. In: Manghi, P., Candela, L., Silvello, G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, Cham. https://doi.org/10.1007/978-3-030-11226-4_11
DOI: https://doi.org/10.1007/978-3-030-11226-4_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11225-7
Online ISBN: 978-3-030-11226-4
eBook Packages: Computer Science, Computer Science (R0)