In evaluative bibliometrics, one is frequently confronted with the task of comparing focal institutions with similar institutions (e.g., Frietsch et al., 2022). Institutional performance cannot be assessed in absolute terms but is meaningful only relative to other, similar institutions: the comparison with similar institutions reveals the standing of the focal institution. Although these comparisons are central to research evaluation, only a few attempts to identify and select similar institutions based on bibliometric data have been published. The most recent paper, by Wang and Jeppsson (2022), appeared in Scientometrics: the authors “define an appropriate benchmark as a well-connected research environment, in which researchers investigate similar topics and publish a similar number of publications compared to a given research organization during the same period. Four essential attributes for the evaluation of benchmarks are research topics, output, connectedness, and scientific impact” (p. 7557). The approach by Wang and Jeppsson (2022) builds on earlier studies by Noyons et al. (1999), Carayol et al. (2012), and Andersen et al. (2017). Complementarily, higher education scholars have extensively discussed how to identify and compare similar Higher Education Institutions (HEIs) using multicriteria approaches (Daraio et al., 2014), since it has been shown that comparisons of university performance are systematically affected by institutional characteristics such as subject profiles (Bornmann et al., 2013). A tool that identifies similar HEIs by combining multiple criteria has been developed by the U-MULTIRANK project (van Vught et al., 2012).

Although these approaches may be valid for identifying and selecting benchmark institutions, they are complex and require access to well-prepared (bibliometric and institutional) data (Lepori et al., 2022). In this Letter to the Editor, another approach is introduced that may be simpler and produce helpful results by pointing to possible benchmark institutions. van Eck and Waltman (2024) used the GPT 3.5 Turbo LLM (OpenAI) to label the research areas of a set of papers based on their titles. Since GPT produced promising results in their study, we asked ChatGPT 3.5 (the freely available variant) to provide a list of similar institutions for a focal institution. We asked ChatGPT to explain itself (“Please explain what ChatGPT is”) and received the following answer: “ChatGPT is a variant of the Generative Pre-trained Transformer (GPT) model developed by OpenAI, specifically tailored for conversational tasks. It’s designed to engage in dialogue with users in a manner that feels natural and human-like. Like other GPT models, ChatGPT is trained on a vast amount of text data to understand language patterns and generate coherent responses” (date of reply: March 19, 2024). The potential of ChatGPT for science studies has been the object of several studies in the informetric community (see Farhat et al., 2024; Lund & Wang, 2023; Ray, 2023 for reviews), and a debate has emerged on its advantages and problems (Gordijn & ten Have, 2023; Korzynski et al., 2023; Rahman & Watanobe, 2023; Sandnes, 2024).
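For readers who wish to reproduce such queries programmatically rather than through the web interface, the following sketch shows one possible way to submit a benchmarking prompt via the OpenAI Python client. This is an illustrative assumption on our part: we used the free ChatGPT web interface, and the model name, client usage, and placeholder prompt shown here are not part of the original queries reported below.

```python
# Minimal sketch: submitting a benchmarking prompt to a GPT model through
# the OpenAI Python client (assumes the `openai` package, version >= 1.0,
# and an OPENAI_API_KEY environment variable). We actually used the free
# ChatGPT web interface; this programmatic route is only an illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical placeholder prompt; replace <focal institution> and the
# restrictions with the wording of the actual queries reported below.
prompt = (
    "In a benchmarking study, I would like to compare the research "
    "performance of <focal institution> with similar institutions. "
    "List only the five most similar institutions, active in 2022 "
    "and located in Europe."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```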

The goal of this letter is to open the debate on the potential of ChatGPT in identifying benchmark institutions at the level of institutes, research centers, and HEIs. As examples in this study, we selected the Centre for Science and Technology Studies (CWTS) at Leiden University as the institute and Leiden University itself as the HEI. In both cases, we can assess the results delivered by ChatGPT for their appropriateness through direct knowledge of the field and of the institution. The authors have worked in quantitative science studies with a focus on research evaluation for around 20 years.

At the institute level, the request to ChatGPT was as follows:

The Centre for Science and Technology Studies at the Leiden University is active in science of science studies using bibliometric data in many research projects. In a benchmarking study, I would like to compare the research performance of the Centre for Science and Technology Studies with similar institutions. These institutions should be similar with respect to two characteristics: (1) They should be active in the same research areas in recent years. (2) They should have similar resources as the CWTS such as the number of researchers. Which institutions would you recommend for this benchmarking task? Please consider the following restrictions in your answer: (1) Only point to institutions other than the Centre for Science and Technology Studies. (2) List only the five most similar institutions. (3) The institutions should be active in 2022. (4) The institutions should be located in Europe. (5) The list should not contain programs at universities.

ChatGPT provided the following results (March 2024):

Based on the criteria provided, here are five institutions in Europe that could be suitable for benchmarking the Centre for Science and Technology Studies at Leiden University:

1. INGENIO (CSIC-UPV)—Institute for Innovation and Knowledge Management, Spain

2. Institute for Advanced Studies (IHS), Austria

3. Institute for Managing Sustainability (IMS), Vienna University of Economics and Business, Austria

4. Institute for Future Studies (IFF), Sweden

5. Fraunhofer Institute for Systems and Innovation Research ISI, Germany.

Against the backdrop of our experience in the science of science field, the institutions provided by ChatGPT seem reasonable. Depending on the focus of the evaluation study, the institutions are more or less suitable. We note, however, the absence of institutions such as the German Centre for Higher Education Research and Science Studies (DZHW) or the Expertise Centre for Research and Development Monitoring (ECOOM) in Belgium.

At the HEI level, the request to ChatGPT was as follows:

The University of Leiden is a medium-size research university covering most subject domains and with a very good position in international rankings. It also includes a large medical center. In a benchmarking study, I would like to compare the research performance of the University of Leiden with similar institutions. These should be similar with respect of two characteristics: (1) they should be active in similar research domains, (2) they should have similar resources at the University of Leiden such as the number of professors or the budget. Please consider the following restrictions in your answer: (1) Only point to institutions other than the University of Leiden. (2) List only the five most similar institutions. (3) The institutions should be active in 2022. (4) The institutions should be located in Europe.

ChatGPT provided the following results (March 2024):

Certainly, considering your restrictions, here are five European universities similar to the University of Leiden in terms of research domains and resources:

1. University of Amsterdam (Netherlands): The University of Amsterdam is also a leading research university in the Netherlands with a wide range of subject domains covered. It has notable strengths in fields such as social sciences, humanities, and natural sciences. It has a comparable size and budget to the University of Leiden.

2. University of Utrecht (Netherlands): Utrecht University is another prominent research university in the Netherlands with strengths across various disciplines, including social sciences, natural sciences, and humanities. It boasts a similar research profile and resources to the University of Leiden.

3. University of Groningen (Netherlands): The University of Groningen is known for its strong research programs across different domains, including science, engineering, social sciences, and humanities. It is comparable to the University of Leiden in terms of research output and resources.

4. University of Copenhagen (Denmark): The University of Copenhagen is one of the top universities in Europe and offers a broad spectrum of research areas, including natural sciences, social sciences, and humanities. It is similar to the University of Leiden in terms of research diversity and resources.

5. University of Edinburgh (United Kingdom): The University of Edinburgh is a prestigious research-intensive institution in the UK, with strengths across various disciplines, including medicine, sciences, social sciences, and humanities. While not geographically in mainland Europe, it matches the criteria of being similar in research domains and resources.

These institutions are renowned for their research activities, cover a wide range of subject domains, and have resources comparable to the University of Leiden.

A cursory look and a comparison with data from the European Tertiary Education Register (Lepori et al., 2023) suggest that these are indeed good peers of the University of Leiden, with a similar position in international rankings, similar size and subject concentration, and a medical school (like the focal university). ChatGPT was also effective in excluding highly ranked universities that do not match some of the indicated criteria, such as technical universities or universities without medical schools. Of course, we could point to some additional peers, such as Ghent University or the University of Gothenburg.
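Such a plausibility check against registry data can also be organized in a simple script. The sketch below is purely our illustration: the file name, column names, and the size band are hypothetical assumptions, not the actual ETER schema or a procedure we applied.

```python
# Hedged sketch of cross-checking ChatGPT's suggestions against
# registry-style HEI data. "eter_export.csv", all column names, and the
# +/-30% size band are hypothetical assumptions for illustration.
import pandas as pd

eter = pd.read_csv("eter_export.csv")  # hypothetical export of HEI indicators

focal = eter.loc[eter["institution"] == "Leiden University"].iloc[0]

# Keep institutions that, like the focal university, have a medical
# school and fall within a +/-30% band of its student numbers.
peers = eter[
    (eter["institution"] != "Leiden University")
    & eter["has_medical_school"]
    & eter["total_students"].between(
        focal["total_students"] * 0.7, focal["total_students"] * 1.3
    )
]

suggested = {"University of Amsterdam", "Utrecht University",
             "University of Groningen", "University of Copenhagen",
             "University of Edinburgh"}
print(peers.loc[peers["institution"].isin(suggested), "institution"].tolist())
```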

Our analysis using ChatGPT has been purely illustrative, but it shows that, if a sensible querying strategy is adopted, ChatGPT might be able to suggest reasonable peers. We found that one cannot expect valid results from ChatGPT in all cases. Another issue is limited reliability: the same query leads to different results at different time points (Thelwall, 2024). These experiences make clear that it is necessary to involve an expert in the search who inspects and assesses the results. We strongly recommend that the institutions suggested by ChatGPT be used only as hints to possible benchmark institutions. ChatGPT opens several interesting questions, such as how to best design querying strategies, for which types of institutions results are (un)reliable, and whether approaches could be developed that combine data-based benchmarking tools with generative AI by exploiting the respective complementarities. A simple reliability check along these lines is sketched below.
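One possible design for such a reliability check, under our own assumptions (the model name, prompt wording, and number of repetitions are illustrative choices, not taken from our queries), is to submit the identical query several times and compare the returned lists:

```python
# Illustrative reliability check: send the identical benchmarking prompt
# several times and compare the answers. Assumes the `openai` package
# (>= 1.0) and an OPENAI_API_KEY environment variable; the prompt text
# and the number of runs are hypothetical choices.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send one chat completion request and return the text of the answer."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

prompt = ("List the five European institutions most similar to "
          "<focal institution> in research profile and size.")
answers = [ask(prompt) for _ in range(5)]

# Identical answers would indicate stable output; in practice the lists
# differ between runs and over time (Thelwall, 2024), which is why expert
# inspection of every suggested list remains necessary.
for i, answer in enumerate(answers, start=1):
    print(f"--- run {i} ---\n{answer}\n")
```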

Some informetric scholars have expressed concerns about the use of ChatGPT and highlighted its limitations, arguing that it will never reach the precision of the quantitative approaches developed by our field (Gordijn & ten Have, 2023; Sandnes, 2024). While we respect this argument, it should be considered, first, that generative AI tools are developing rapidly and might provide much better results in the future; for the task of identifying peers, their results are already reasonably good. The output of ChatGPT can be improved by giving (human) feedback “to produce more consistently useful or correct results. Feedback can also help it learn to avoid controversial or illegal responses” (Thelwall, 2024). Second, generative AI tools are much easier to use than data-based benchmarking tools and have attractive interactive features that allow the user to easily dialogue with the tool; hence, they might turn out to be a serious competitor, for evaluative purposes, to the tools developed by our community, particularly for students, institutional managers, and policy analysts.