Abstract
The advent of open science calls for open data platforms with high data quality. OpenAlex, a fully open catalog of the global research system launched in January 2022, offers two main advantages, easy data accessibility and broad data coverage, and has been widely used in quantitative science studies. Notably, OpenAlex has been adopted as an important data source for the CWTS Leiden Ranking. However, OpenAlex has a severe data quality problem: institutions are missing from the metadata of many journal articles. This study investigates the possible reasons for the problem, its consequences, and possible solutions by defining three types of institutional information: full institutional information (FII), partially missing institutional information (PMII), and completely missing institutional information (CMII). Our results show that the problem of missing institutions occurs in more than 60% of the journal articles in OpenAlex. The problem is particularly widespread in metadata from the early years and in the social sciences and humanities. Using sub-samples of the data, we further explore the possible reasons for the problem, the risk of distorted results that it poses, and possible solutions. The aim is to raise awareness of the importance of data quality improvements in open resources, and thus to support the responsible use of open resources in quantitative science studies and in broader contexts.
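As an illustration of the three categories, the following minimal sketch labels a single work record as FII, PMII, or CMII. It is not the authors' procedure. It assumes that a record follows the OpenAlex work layout, with an `authorships` list whose items each carry an `institutions` list, and that the categories are decided at the paper level by how many authorships list at least one institution. The function names `classify_work` and `missing_share` are hypothetical, and counting both PMII and CMII toward the share of affected articles is likewise an assumption.

```python
from typing import Any, Iterable


def classify_work(work: dict[str, Any]) -> str:
    """Label one OpenAlex-style work record as 'FII', 'PMII', or 'CMII'.

    Assumption: the label depends only on how many authorships
    list at least one institution.
    """
    authorships = work.get("authorships") or []
    # Count authorships that carry at least one institution entry.
    with_institution = sum(1 for a in authorships if a.get("institutions"))
    if with_institution == 0:
        # No authorship has an institution (or there are no authorships).
        return "CMII"
    if with_institution == len(authorships):
        return "FII"
    return "PMII"


def missing_share(works: Iterable[dict[str, Any]]) -> float:
    """Share of works labelled PMII or CMII (assumed definition of 'missing')."""
    labels = [classify_work(w) for w in works]
    if not labels:
        return 0.0
    return sum(label != "FII" for label in labels) / len(labels)


# Hypothetical records, for illustration only.
sample = [
    {"authorships": [{"institutions": [{"id": "I1"}]}]},
    {"authorships": [{"institutions": [{"id": "I1"}]}, {"institutions": []}]},
    {"authorships": [{"institutions": []}]},
]
print([classify_work(w) for w in sample])
```

Treating a record without any authorships as CMII is a design choice of this sketch; a real analysis might prefer to report such records separately.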
Notes
OpenAlex API overview: https://docs.openalex.org/how-to-use-the-api/api-overview.
OpenAlex snapshot: https://docs.openalex.org/download-all-data/openalex-snapshot.
The CWTS Leiden Ranking 2023: https://www.leidenmadtrics.nl/articles/the-cwts-leiden-ranking-2023.
Information about the CWTS Leiden Ranking: https://www.leidenranking.com/information/general.
We have also analysed missing institutions at the authorship level. The 121,872,819 journal articles cover 366,851,172 authors in total, of whom about 47% have missing institutions. In particular, 112,510,937 first authors are identified, of whom about 53% have missing institutions. The data deficiency is thus prominent at the authorship level as well. However, since our main focus is on missing institutions at the paper level, the author level is not discussed further in this study.
The missing institutions we refer to are a phenomenon in the metadata of journal articles. OpenAlex, however, also provides a list of the institutions indexed in the database as entities, with information on each of them. We can therefore complete the PMII by matching against the institution entities in this list.
Clarivate InCites Help - Citation Topics: https://incites.help.clarivate.com/Content/Research-Areas/citation-topics.htm.
OpenAlex technical documentation - Concepts: https://docs.openalex.org/api-entities/concepts
OpenAlex technical documentation - Institutions: https://docs.openalex.org/api-entities/institutions
Web of Science Core Collection Help: https://images.webofknowledge.com/WOKRS535R52/help/WOS/hs_organizations_enhanced.html.
How affiliation profiles work in Scopus: https://service.elsevier.com/app/answers/detail/a_id/36052/supporthub/scopus/.
References
Aksnes, D. W., & Sivertsen, G. (2019). A criteria-based assessment of the coverage of Scopus and Web of Science. Journal of Data and Information Science, 4(1), 1–21.
AlShebli, B. K., Rahwan, T., & Woon, W. L. (2018). The preeminence of ethnic diversity in scientific collaboration. Nature Communications, 9(1), 5163. https://doi.org/10.1038/s41467-018-07634-8
Boulton, G. (2012). Open your minds and share your results. Nature, 486(7404), 441–441. https://doi.org/10.1038/486441a
Cao, Z., Zhang, L., Shang, Y., & Huang, Y. (2023). Missing-institutions in OpenAlex: Possible reasons, impact and solutions of data deficiency. In ISSI 2023, 19th International Conference on Scientometrics and Informetrics, Bloomington, 2 July 2023.
Chawla, D. S. (2022). Massive open index of scholarly papers launches. Nature. https://www.nature.com/articles/d41586-022-00138-y
DITOs Consortium. (2018). Citizen Science & Open Science: Synergies & Future Areas of Work. Retrieved from https://discovery.ucl.ac.uk/id/eprint/10043574
Garfield, E. (1979). Citation indexing—Its theory and application in science, technology, and humanities. Wiley.
Han, P., Shi, J., Li, X., Wang, D., Shen, S., & Su, X. (2014). International collaboration in LIS: Global trends and networks at the country and institution level. Scientometrics, 98(1), 53–72. https://doi.org/10.1007/s11192-013-1146-x
Hodson, S., Jones, S., Collins, S., Genova, F., Harrower, N., Mietchen, D., Petrauskaité, R., & Wittenburg, P. (2018). FAIR Data Action Plan: Interim recommendations and actions from the European Commission Expert Group on FAIR data. https://doi.org/10.5281/zenodo.1285290
Huang, S., Yang, B., Yan, S., & Rousseau, R. (2014). Institution name disambiguation for research assessment. Scientometrics, 99, 823–838. https://doi.org/10.1007/s11192-013-1214-2
Lammey, R. (2020). Solutions for identification problems: A look at the Research Organization Registry. Science Editing, 7(1), 65–69. https://doi.org/10.6087/kcse.192
Li, W., Zhang, S., Zheng, Z., Cranmer, S. J., & Clauset, A. (2022). Untangling the network effects of productivity and prominence among scientists. Nature Communications, 13(1), 4907. https://doi.org/10.1038/s41467-022-32604-6
Mirowski, P. (2018). The future(s) of open science. Social Studies of Science, 48(2), 171–203. https://doi.org/10.1177/0306312718772086
Molinari, J.-F., & Molinari, A. (2008). A new methodology for ranking scientific institutions. Scientometrics, 75(1), 163–174.
Ndungu, M. W. (2021). Scholarly journal publishing standards, policies and guidelines. Learned Publishing, 34(4), 612–621. https://doi.org/10.1002/leap.1410
OpenAlex. (2022). Work—OpenAlex documentation. https://docs.openalex.org/about-the-data/work#type
Priem, J., Piwowar, H., & Orr, R. (2022). OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. Preprint retrieved from https://arxiv.org/abs/2205.01833.
Scheidsteger, T., & Haunschild, R. (2022). Comparison of metadata with relevance for bibliometrics between Microsoft Academic Graph and OpenAlex until 2020. Preprint retrieved from https://arxiv.org/abs/2206.14168
Scheidsteger, T., Haunschild, R., Hug, S. E., & Bornmann, L. (2018). The concordance of field-normalized scores based on Web of Science and Microsoft Academic data: A case study in computer sciences. Materials Today Chemistry, 29, 101417.
Tang, X., Li, X., & Ma, F. (2022). Internationalizing AI: Evolution and impact of distance factors. Scientometrics, 127(1), 181–205. https://doi.org/10.1007/s11192-021-04207-3
UNESCO. (2021). UNESCO Recommendation on Open Science. https://unesdoc.unesco.org/ark:/48223/pf0000379949.locale=en
Waltman, L., Calero Medina, C., Kosten, J., Noyons, E., Tijssen, R., van Eck, N. J., Van Leeuwen, T., Raan, T., Visser, M., & Wouters, P. (2012). The Leiden ranking 2011/2012: Data collection, indicators, and interpretation. Journal of the American Society for Information Science and Technology. https://doi.org/10.1002/asi.22708
Wang, D., Hu, L., Cheng, Q., & Bu, Y. (2022). Inequality of authors’ reference reuse. Journal of Information Science. https://doi.org/10.1177/01655515221111062
Woelfle, M., Olliaro, P., & Todd, M. H. (2011). Open science is a research accelerator. Nature Chemistry, 3(10), 745–748. https://doi.org/10.1038/nchem.1149
Wolszczak-Derlacz, J., & Parteka, A. (2011). Efficiency of European public higher education institutions: A two-stage multicountry approach. Scientometrics, 89(3), 887–917.
Zhao, Z., Bu, Y., & Li, J. (2021). Characterizing scientists leaving science before their time: Evidence from mathematics. Information Processing & Management, 58(5), 102661. https://doi.org/10.1016/j.ipm.2021.102661
Acknowledgements
The present study is an extended version of a paper presented at the 19th International Conference on Scientometrics and Informetrics, Bloomington (USA), 2–5 July (Cao et al., 2023). This work was supported by the National Natural Science Foundation of China (Grant nos. 71974150, 72374160, 72004169) and the National Laboratory Center for Library and Information Science at Wuhan University.
Ethics declarations
Conflict of interest
The first author (Lin Zhang) is Editor-in-Chief of Scientometrics.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, L., Cao, Z., Shang, Y. et al. Missing institutions in OpenAlex: possible reasons, implications, and solutions. Scientometrics (2024). https://doi.org/10.1007/s11192-023-04923-y
DOI: https://doi.org/10.1007/s11192-023-04923-y