1 Introduction

The beginning of the Coronavirus disease 2019 (COVID-19) pandemic was declared by the World Health Organization (WHO) on March 11, 2020. Later, new evidence showed that the first cases of human transmission of the virus occurred in late 2019 and were mainly located in Wuhan, China (Guan et al. 2020). To date, the number of cases reported by WHO worldwide has exceeded 700 million, with nearly 7 million deaths related to the virus (World Health Organization 2022). After more than three years, on May 5, 2023, the global health emergency caused by COVID-19 was declared over. Almost all governments in the world have taken countermeasures against the spread of this highly infectious virus, ranging from quarantine to mask wearing and social distancing. The consequences of the pandemic have been felt in many aspects of our lives. It is therefore not surprising that a very large number of researchers in all specialties (both medical and non-medical) started contributing to research on the topic. In less than three years, the number of articles published on this topic has grown exponentially, and researchers from all over the globe have contributed to the study of different aspects related to COVID-19 (the disease) or SARS-CoV-2 (the virus) (Fassin 2021). Given the sheer number of individual studies that focus on innumerable aspects of COVID-19, we performed a scientometric analysis that aims to give a comprehensive picture of this exceptional phenomenon. Many authors have tried to define scientometric analysis and to highlight its similarities to and differences from bibliometric or informetric analysis (Tague-Sutcliffe 1992; Van Raan 1997; Hood and Wilson 2001). In this paper, we consider a scientometric analysis as the quantitative study of the disciplines of science based on published literature and communication (The Thompson Corporation 2008).
Some of the main results of a bibliometric analysis are the identification of new or emerging areas of scientific research, their development and trends over time, or the geographical and organizational distribution of research (The Thompson Corporation 2008). When a new topic of study arises, it is necessary to collect and analyze, after an adequate period of time, the literature related to that topic in order to have a general idea of how scholars from all over the world are analyzing that event or phenomenon. The main aims are to portray a comprehensive picture of the growth of the scientific literature, to illustrate the structure and the relationships between e.g. articles, authors, or institutions in a research field or in a specific topic. For all these reasons, a scientometric analysis is one of the most important approaches for evaluating the scientific production.

Several researchers have studied the literature on COVID-19 from different points of view using a scientometric approach. Some authors focused on the repercussions of the pandemic on a specific aspect of society, such as the economic impact or the effects on education. For example, Hashemi and colleagues (Hashemi et al. 2022) evaluated the impact in the management field. They retrieved all articles on COVID-19 from Web of Science or Scopus and selected only those related to business, management, or accounting, showing the main themes for 2020 and 2021 at the individual, organizational and societal levels. They found that in 2020 researchers were more focused on experiences of and coping with COVID-19, while in 2021 studies were mostly about the acceptance of new rules in the workplace and business environment. Su and colleagues (Su et al. 2022) analyzed articles related to the impact that COVID-19 had on financial, operational, and other aspects of enterprises’ management. Financial liquidity, market channel expansion, supply chain stability, and efficiency were all aspects affected by COVID-19. The authors also highlighted the importance of leadership capable of adapting to change and exploiting opportunities. Zhang and colleagues (Zhang et al. 2022) identified 1,061 documents that evaluated the effects of COVID-19 on online higher education in 103 countries. The authors focused on the challenges of online education (in particular for students with impairments), innovative pedagogies in online learning, and the distribution of literature, suggesting that open access represents a tool to reduce barriers to spreading knowledge. The study reported that, due to the pandemic, a higher number of scholars are exploring a wide range of topics related to the changes in online higher education. Gómez-Domínguez and colleagues (Gómez-Domínguez et al. 2022) focused on the analysis of the scientific production on teachers’ stress during the COVID-19 outbreak.
This study showed a high interest in the topic of stress and burnout, highlighting that many studies were related to mental health, coping strategies and other measures to mitigate the effects of the pandemic and improve teachers’ well-being.

Other authors narrowed the analysis geographically, for example analyzing only a specific country, a group of countries, a (broad) area, or a continent. Shamsi and colleagues (Shamsi et al. 2020) analyzed publications from 3,450 researchers retrieved from Web of Science, PubMed, and Scopus. They mainly used graphical representations of networks of countries, authors, and words to describe Iranian publications and reported that, for example, Iran had larger research teams than other countries. Other authors focused on Asian countries such as India, the United Arab Emirates and Korea, or regions such as Southeast Asia. Raju and Patil (Patil et al. 2020) reported that the most cited Indian articles in 2020 were those related to virology, diagnosis, treatment or clinical features, while general studies on epidemiology or the pandemic received fewer citations. Al-Omari and colleagues (Al-Omari et al. 2022) described the research activity of researchers affiliated with the United Arab Emirates from 2020 to 2022. They reported that most of these authors collaborated with colleagues from the same country and that the main international collaborations were with the United States of America (USA) and England. Kim and Jeong (Kim and Jeong 2021) reported results from a bibliometric analysis conducted on Korean articles to identify the collaborations between Korean and international authors and explore clusters of institutions, journals, and topics. They found that Korean authors mainly collaborated with USA authors. Tantengco (Tantengco 2021) analyzed more than 700 articles from Southeast Asia in 2020 and reported that Malaysia was the most active country with respect to the number of publications, while Singapore-affiliated authors received a higher number of citations.
Chiu and Ho (Chiu and Ho 2021) collected all studies based on Latin America published in 2020 and found that Brazil was one of the most active countries, while Corrales-Reyes and colleagues (Corrales-Reyes et al. 2021) focused only on Cuba and found that publications with Cuban leadership were less likely to have a major impact. Another very recent article (Chatterjee et al. 2022) also focused on countries that are in the same geographical area. The authors described the scientific production of all the English-speaking Caribbean countries and observed that more than 50% of the research based on the Caribbean originated from Trinidad and Tobago or Jamaica. Stojanovic (Stojanovic 2021) assessed the main areas of research in Canada during the first five months of 2020. The author reported that the main topics included infection, prevention, and therapeutics, and that, at that time, there was a gap in the literature about diagnostics and vaccines. In 2021, Turatto and colleagues (Turatto et al. 2021) analyzed the articles retrieved from PubMed and Scopus relating to the early phase of COVID-19 in Italy, focusing mainly on those published by authors with an Italian affiliation but also on papers analyzing Italian data. The authors reported that many Italian articles were about the management aspect of the pandemic, while the non-Italian ones were mostly epidemiological studies. This study provided relevant information on the main topics covered in scientific publications in the first phases of the COVID-19 outbreak based on co-occurrence analysis of terms. It did not report any analysis aimed at identifying scientific collaboration clusters or citation trends.

Few articles analyzed the global literature. Some of these were about the early stages of the COVID-19 outbreak (Aviv-Reuven and Rosenfeld 2021; Haghani et al. 2020; Furstenau et al. 2021; Tran et al. 2020). They all highlight the fact that in a few months the number of articles published on this issue increased exponentially, with hundreds or thousands of publications in a very short timeframe. Another relevant scientometric effect is described by Ioannidis and colleagues (Ioannidis et al. 2022), who evaluated the citation impact of COVID-19 publications on all scientific works published in 2020–21 and assessed the repercussions in terms of citations on scientist profiles. The authors found that, across all scientific fields, 98 out of the 100 most cited papers published in 2020–21 were related to COVID-19. In 2020, another bibliometric analysis by Hamidah and colleagues (Hamidah et al. 2020) reported that, since the beginning of the pandemic, China, UK and USA were among the top contributors to the COVID-19 literature, and also that the main topics were related to public health and laboratory studies. In 2021, Wang and Tian (Wang and Tian 2021) analyzed data referring to 2020 from Web of Science and several preprint platforms (bioRxiv, medRxiv, Preprints, and SSRN) in order to show the global trends in COVID-19 research. They reported the USA to be the most active country (followed by China) with respect to the number of contributions.

More recently, an important contribution was provided by a bibliometric analysis conducted by Damaševičius and Zailskaitė-Jakštė (2023). The authors investigated whether the worldwide emergency of the COVID-19 pandemic had an influence on national and international research collaboration by examining research cooperation before and after the COVID-19 outbreak. This analysis was specifically focused on studies in the business area and included 14,824 articles published from 2019 until November 15, 2020. Results showed that cultures' adaptation to change and coping with uncertainty have significantly influenced the collaboration networks in the field of business and economics (Damaševičius and Zailskaitė-Jakštė 2023). In another study, Zyoud and Al-Jabi (Zyoud and Al-Jabi 2020) also proposed a preliminary analysis of the first wave of publications about COVID-19. They used Scopus to retrieve 19,044 documents from the beginning of the outbreak to June 2020, and showed that almost half of the outputs were articles and that USA, China, Italy and UK were the leading countries in terms of the number of scientific contributions. Many journals, especially those with a high Journal Impact Factor (JIF), published thematic issues on COVID-19 or gave priority to the release of articles related to this topic, also offering free access to them. Finally, Yu and colleagues (Yu et al. 2020) presented a bibliometric analysis of global literature on COVID-19 published between 2019 and 2020, collected using the Web of Science database. They identified a total of 3,626 publications, and found that China, USA, England, and Italy were the most active countries. The analysis suggests an expected increase in the number of COVID-19 publications, with future hotspots potentially revolving around disease treatment, spike protein, and vaccines.

In our study, we followed a glocal approach. First, we present a comprehensive overview of the global literature on COVID-19; next, we focus on the Italian case, as Italy was one of the first Western countries to be severely affected by COVID-19. The aims of this study were the following: 1) to conduct an extensive and up-to-date scientometric analysis of studies investigating different aspects related to the COVID-19 pandemic outbreak; 2) to investigate the main topics studied in these articles; 3) to identify the most active countries and institutions, as well as their relationships; 4) to conduct a case study on articles with authors affiliated with an Italian institution; and 5) to investigate the associations between the number of citations and different characteristics of the selected articles.

2 Methods

2.1 Data collection and processing

We conducted a literature search on the Web of Science Core Collection (WoS) online database, updated to May 31, 2022, to retrieve any scientific article or review on COVID-19. Among the different databases available for scientometric studies, WoS offers functions such as Keywords Plus and research areas to better identify the content of retrieved articles. We used the following search strategy: covid* OR “corona$virus disease *19” OR sars-cov-2. We searched for studies mentioning these terms in the Topic field (title, abstract, author keywords and keywords plus). Only scientific articles and reviews written in English and published in 2020, 2021 or 2022 were retained. For each research output, we extracted the following characteristics: title, abstract, keywords plus, authors’ affiliations, year of publication, type of publication (article or review), journal title, journal category or categories based on the classification from the Journal Citation Reports (JCR), and the number of citations. The JCR classification includes 254 research categories, which are further assigned to 21 groups. Each journal can be assigned to one or more research categories and each research category can be part of one or more groups. We linked each article to one or more of the 21 JCR groups based on the JCR research categories of the journal in which it was published. In addition, we retrieved the Journal Impact Factor based on the JCR 2021.
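The linkage between articles and JCR groups described above amounts to a two-step lookup (journal to research categories, research categories to groups). The following Python sketch illustrates the idea only; it is not the procedure actually used in the study, and the function name `groups_for_article`, the journal name, and the category and group labels are hypothetical placeholders rather than real JCR entries.

```python
def groups_for_article(journal, journal_to_categories, category_to_groups):
    """Return the set of groups an article inherits from its journal.

    A journal may belong to several categories, and a category to several
    groups, so the result is the union over all of the journal's categories.
    """
    groups = set()
    for category in journal_to_categories.get(journal, ()):
        groups.update(category_to_groups.get(category, ()))
    return groups


# Hypothetical lookup tables (placeholder names, not real JCR entries).
journal_to_categories = {"Journal X": ["Category A", "Category B"]}
category_to_groups = {
    "Category A": ["Group 1"],
    "Category B": ["Group 1", "Group 2"],
}

print(sorted(groups_for_article("Journal X", journal_to_categories, category_to_groups)))
```

An article published in a journal that is absent from the lookup table receives an empty set of groups in this sketch.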

2.2 Scientometric analysis

The scientometric analyses were conducted with the Bibliometrix package (Aria and Cuccurullo 2017) version 3.1 in R version 4.1.2 (R Core Team 2021) and the Biblioshiny shiny app (Aria and Cuccurullo 2017). Bibliometrix is an open-source R package that is widely used in bibliometric analysis because it includes a variety of functions and methods. It allows users to import and process articles downloaded from the main databases and to build data matrices for scientific collaboration and word co-occurrence analyses. We identified countries and institutions with the highest number of published articles and/or citations (based on the affiliation of the authors). The collaboration networks of countries and institutions were generated using the Louvain clustering algorithm implemented in Bibliometrix, setting the number of nodes to 20 for clarity of representation. The Louvain algorithm is a hierarchical clustering technique that recursively merges communities into single nodes and performs modularity clustering on the condensed graphs. It is a community detection approach for large networks that computes a modularity score (a value between \(-\)0.5 and 1) and seeks the assignment of nodes to communities that yields the highest score. This method is based on the hypothesis that nodes within a community are more densely connected than would be expected by chance, i.e. in comparison with a random network. For a single community, the modularity score can be computed by summing over all pairs of nodes i and j in that community:

$$\begin{aligned} Q = \frac{1}{2m} \sum _{ij} \left[ d_{ij} - \frac{d_id_j}{2m} \right] \end{aligned}$$

where \(d_{ij}\) is the edge weight between nodes i and j, \(d_i\) and \(d_j\) are the sums of the weights of the edges attached to nodes i and j, respectively, and m is the sum of all the edge weights in the graph. The term \(\frac{d_id_j}{2m}\) therefore represents an approximate expression for the expected weight of the edges between nodes i and j in a random network. A co-occurrence network of keywords plus was also generated using the 20 most frequent words, after excluding words present in the search query. Next, we conducted a case study on a subset of the data set including only articles with at least one author affiliated with an Italian institution. To do this, we selected only articles with at least one author for whom the word Italy was reported in the address field. For these authors, we collected additional information on the disciplinary scientific sector (SSD, in Italian “Settore Scientifico Disciplinare”) from the CINECA official database of employees’ website (Cineca 2022). This database contains information only for authors affiliated with an Italian university. In the Italian Higher Education system every researcher and professor is, in fact, necessarily classified in one of 383 different sectors that describe their research domain. For each article we computed the total number of authors as well as the number of authors with or without an Italian affiliation.
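To make the modularity score of Sect. 2.2 concrete, the sketch below evaluates it for a given partition of a small weighted, undirected graph and sums the per-community contributions. It is a plain Python illustration under stated assumptions (each undirected edge is listed once, there are no self-loops, and the function name `modularity` and the toy graph are ours); it does not reproduce the Louvain optimization itself or the Bibliometrix implementation.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity of a partition of a weighted, undirected graph.

    edges: iterable of (i, j, weight), each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = 0.0                         # sum of all edge weights in the graph
    strength = defaultdict(float)   # d_i: total weight of edges attached to node i
    internal = defaultdict(float)   # weight of edges with both ends in a community
    for i, j, w in edges:
        m += w
        strength[i] += w
        strength[j] += w
        if community[i] == community[j]:
            internal[community[i]] += w

    total = defaultdict(float)      # sum of d_i over the nodes of each community
    for node, d in strength.items():
        total[community[node]] += d

    # For one community, (1/2m) * sum_ij [d_ij - d_i d_j / 2m]
    # simplifies to internal/m - (total/2m)^2.
    return sum(internal[c] / m - (total[c] / (2 * m)) ** 2 for c in total)


# Toy graph: two triangles joined by a single edge, unit weights.
edges = [(0, 1, 1), (1, 2, 1), (0, 2, 1), (3, 4, 1), (4, 5, 1), (3, 5, 1), (2, 3, 1)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

print(modularity(edges, two_groups))  # 5/14, the two triangles form clear communities
print(modularity(edges, one_group))   # 0.0, a single community carries no structure
```

Placing all nodes in one community gives a score of zero, which is why the algorithm looks for partitions with a strictly positive score.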

2.3 Multiple correspondence analysis

Multiple Correspondence Analysis (MCA) is a type of factor analysis that is particularly useful for analyzing data with many categorical variables. It is similar to principal component analysis, but is specifically designed for categorical data. MCA is used to identify patterns and relationships in the data, and to reduce the dimensions of the data by projecting it onto a lower-dimensional space.

In the present study, MCA is used to study the association between the rate of citations per month, the 21 JCR research groups, JIF, Number of Authors, and Type of Article. The rate of citations was computed as the ratio between the number of citations received by an article and the number of months from the publication date reported in the WoS database to the date on which the data were downloaded. For this purpose, the analyses were conducted on the subset of articles for which at least one citation was recorded. We decided to use MCA because, although our dataset includes both qualitative and numerical variables, the numerical variables can be categorized in a meaningful manner without significant loss of information. Furthermore, we needed a direct and simple method that could give a graphical representation of the relationship between these variables. For this reason, we preferred MCA to multiple factor analysis, since the latter, by combining quantitative and qualitative variables, can add complexity to the interpretation. Another alternative could have been principal component analysis which, however, requires all variables to be numerical. The first step in conducting MCA is to create a multi-way contingency table of the categorical variables; the table is then transformed into an indicator matrix or a Burt matrix and, finally, a simple Correspondence Analysis is applied to one of them (Benzécri 1969). This method allows one to represent the transformed data graphically in a bi-plot where each point represents a category, and the position of the points reflects the relationship between the categories. This reduces the dimensionality of the data and helps to analyze the pattern of relationships among a multitude of categorical dependent variables.
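The two data structures mentioned above, the indicator matrix and the Burt matrix, can be illustrated with a few lines of Python. This is a sketch of the encoding step only (the correspondence analysis itself, performed in the study with an R package, is not reproduced); the function names, the three toy records, and the category labels are illustrative assumptions.

```python
def indicator_matrix(rows, variables):
    """One-hot encode categorical records.

    rows: list of dicts, one per article; variables: ordered list of keys.
    Returns (Z, columns), where Z[r][k] is 1 if record r falls in category k.
    """
    columns = []
    for var in variables:
        for level in sorted({row[var] for row in rows}):
            columns.append((var, level))
    Z = [[1 if row[var] == level else 0 for var, level in columns] for row in rows]
    return Z, columns


def burt_matrix(Z):
    """Burt matrix B = Z^T Z: cross-tabulation of every pair of categories."""
    n_cols = len(Z[0])
    return [
        [sum(row[a] * row[b] for row in Z) for b in range(n_cols)]
        for a in range(n_cols)
    ]


# Toy records with illustrative (assumed) category labels.
rows = [
    {"JIF": "LowJIF", "Type": "Article"},
    {"JIF": "HighJIF", "Type": "Review"},
    {"JIF": "HighJIF", "Type": "Article"},
]
Z, columns = indicator_matrix(rows, ["JIF", "Type"])
B = burt_matrix(Z)

for column, counts in zip(columns, B):
    print(column, counts)
```

Each row of the indicator matrix sums to the number of variables, and the diagonal of the Burt matrix holds the frequency of each category, while the off-diagonal blocks are the two-way contingency tables between variables.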

2.4 Analysis of citations

We conducted regression analyses to identify variables associated with the number of citations. In order to take into account the time elapsed from the publication of each article to the date of its retrieval, the number of citations was divided by the number of months from the publication date; this rate of citations was used as the response in the estimated models. Predictors included the total number of authors, the JIF (Journal Impact Factor) based on the JCR Report 2021, as well as the 21 JCR research groups of the journals in which articles were published. Given our interest in understanding citation dynamics and their impact, articles published in journals with no JIF, articles published less than one month before retrieval, and articles with missing or no citations were not included in this analysis.
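The response variable and the exclusion rules described above can be sketched as follows. This is a minimal Python illustration, not the code used in the study: the helper names are hypothetical, and counting elapsed time in whole calendar months is our assumption, since the text does not state how fractions of a month were handled.

```python
from datetime import date


def months_between(published, retrieved):
    """Whole calendar months elapsed between two dates (assumed convention)."""
    months = (retrieved.year - published.year) * 12 + (retrieved.month - published.month)
    if retrieved.day < published.day:
        months -= 1
    return months


def citation_rate(citations, jif, published, retrieved):
    """Citations per month, or None when the article is excluded.

    Excluded: no JIF, published less than one month before retrieval,
    or missing/zero citations.
    """
    months = months_between(published, retrieved)
    if jif is None or not citations or months < 1:
        return None
    return citations / months


retrieved = date(2022, 5, 31)
print(citation_rate(30, 4.2, date(2021, 5, 31), retrieved))   # 12 months elapsed
print(citation_rate(30, None, date(2021, 5, 31), retrieved))  # excluded: no JIF
print(citation_rate(3, 4.2, date(2022, 5, 15), retrieved))    # excluded: under one month
```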

From the beginning of the pandemic, it was easy to foresee that the spread of COVID-19 would have an impact on medical and related research fields. However, it was less predictable that research in other fields would also be highly affected by this phenomenon. With the aim of understanding the citation dynamics in the non-medical fields, we removed from the dataset the articles related to the Medical and Biological domains, which accounted for 77% of the entire production. We also decided not to consider articles classified in the more general Multidisciplinary group when no other research group was specified. The inclusion of research groups in the set of predictors is useful for determining different citation trends across different domains. The classification within the Multidisciplinary group was mainly associated with papers simultaneously classified into multiple areas; their inclusion would therefore not have allowed us to disentangle the magnitude of field-specific effects. Articles classified in research domains that usually do not have a scientometric nature and for which there were not enough observations, such as history, literature and philosophy, were also removed. After the data cleaning, the number of research groups decreased from 21 to 15. The final dataset consisted of 21,848 articles. The same analyses were repeated for Italian publications. In this case, after removing the papers related to the fields of Biology and Medicine, those from the Multidisciplinary group and those from fields with fewer than 20 observations, we were left with a dataset of 765 articles and 15 research groups.

The asymmetry of the response suggested that, instead of focusing on its conditional mean through the estimation of a standard linear regression, it was more appropriate to analyze some of the quantiles of its distribution. This was done by means of linear quantile regression models (Koenker and Bassett 1978; Davino et al. 2013), which allow a different effect of the predictors to be estimated for each selected quantile of the response. If we denote by \(\textbf{X}\) the set of predictors and by RC (rate of citations) the response variable, a linear quantile regression model at the \(\tau\)-th quantile (\(\tau \in (0,1)\)) can be formally expressed as

$$\begin{aligned} RC = \textbf{X} \mathbf{\beta }_{\tau } +\mathbf{\epsilon } \end{aligned}$$

where the conditional \(\tau\)-th quantile of the response is

$$\begin{aligned} Q_{RC}(\tau |\textbf{X})=\textbf{X} \mathbf{\beta }_{\tau } \end{aligned}$$

and \(\mathbf{\beta }_{\tau }\) is the vector of parameters associated with the \(\tau\)-th quantile. We estimated a separate model for each of the three quartiles of the rate of citations (i.e. we considered \(\tau =\{0.25, 0.50, 0.75\}\)); this can give an understanding of possibly different behaviours for poorly, moderately or highly cited papers.

The models were fitted through the quantreg package of R (Koenker 2009).
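Quantile regression estimates \(\mathbf{\beta }_{\tau }\) by minimizing the sum of the check (pinball) losses of the residuals (Koenker and Bassett 1978). The Python sketch below illustrates this loss in the simplest case, a model with an intercept only, where the minimizer is a sample quantile of the response. It is an illustration of the principle rather than of the fitting routine used in the study, and the function names and toy data are assumptions.

```python
def check_loss(residual, tau):
    """Check function: rho_tau(u) = u * (tau - 1[u < 0])."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def constant_quantile_fit(y, tau):
    """Intercept-only quantile regression.

    Returns the value q minimizing sum_i rho_tau(y_i - q). A minimizer
    always exists among the observed values, so scanning them suffices.
    """
    return min(sorted(y), key=lambda q: sum(check_loss(v - q, tau) for v in y))


# Toy, strongly asymmetric response (one very highly cited paper).
rates = [1, 2, 3, 4, 100]
for tau in (0.25, 0.50, 0.75):
    print(tau, constant_quantile_fit(rates, tau))
```

On this toy response the three fitted values are 2, 3 and 4, whereas the mean (22) is dominated by the single extreme observation, which illustrates why quantiles were preferred to the conditional mean for an asymmetric response.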

3 Results

3.1 Scientometric analysis on the global data set

Our search retrieved 209,124 articles and reviews. After removing documents that did not satisfy our inclusion criteria (publication year equal to 2020, 2021 or 2022 and written in English), analyses were conducted on a total of 184,098 documents. Table 1 shows the number of documents retrieved for each country based on the affiliation of the corresponding author. The main countries with respect to the number of articles are USA (40,060, 22%), China (19,938, 11%), United Kingdom (UK) (11,334, 6%) and Italy (11,232, 6%). A similar scenario is represented in Fig. 1, which shows the number of documents for each country based on the affiliation of any author.

For each of the top ten countries based on the number of retrieved documents, we computed the number of single country publications (SCP, i.e. articles with no international collaborations), the number of multiple country publications (MCP, i.e. articles with international collaborations) as well as the ratio between MCP and the total number of documents. Among the top ten countries, Australia, UK and Germany showed the highest MCP ratio (i.e. a higher percentage of documents for these countries included international collaborations). The USA ranked first by number of articles, but only a small percentage of them included international collaborations (17.8% compared to, e.g., 37.5% for Australia).
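The SCP, MCP and MCP ratio defined above can be computed as in the following Python sketch. It assumes that each document is attributed to the country of its corresponding author (as in Table 1) and that the set of countries of all its authors is known; the function name and the toy records are hypothetical.

```python
from collections import Counter


def collaboration_summary(documents):
    """documents: list of (corresponding_country, set_of_all_author_countries)."""
    scp, mcp = Counter(), Counter()
    for corresponding, countries in documents:
        if len(set(countries) | {corresponding}) == 1:
            scp[corresponding] += 1   # all authors from a single country
        else:
            mcp[corresponding] += 1   # at least one international co-author
    summary = {}
    for country in set(scp) | set(mcp):
        total = scp[country] + mcp[country]
        summary[country] = {
            "total": total,
            "SCP": scp[country],
            "MCP": mcp[country],
            "MCP_ratio": mcp[country] / total,
        }
    return summary


# Toy records (not taken from the data set).
documents = [
    ("Italy", {"Italy"}),
    ("Italy", {"Italy", "USA"}),
    ("USA", {"USA"}),
    ("USA", {"USA"}),
    ("USA", {"USA", "UK", "Italy"}),
    ("USA", {"USA"}),
]
for country, stats in sorted(collaboration_summary(documents).items()):
    print(country, stats)
```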

Fig. 1 Countries with a higher number of documents based on the affiliation of any author are shown in darker blue. In this representation a document might be counted multiple times, once for each author

Table 1 Total number of articles retrieved in the search, SCP, MCP, and percentage of articles with international collaborations.

Fig. 2 Country collaboration network. In red a cluster consisting of the following countries: Australia, Canada, China, Korea, India, Iran, Japan, Pakistan, Saudi Arabia, Turkey, US; in blue a cluster consisting of the following countries: Brazil, France, Germany, Italy, Netherlands, Poland, Spain, Switzerland, UK

Fig. 3 Institution collaboration network. In red: institutions based in the USA; in blue: institutions from United Kingdom, Canada, Italy, Brazil, India, Hong Kong; in green: Chinese institutions; in purple: Iranian institutions. (Color figure online)

Studies were carried out by authors from 88,011 institutions located in 165 countries. We created a country (Fig. 2) and an institution (Fig. 3) collaboration network, based on the approach implemented in Bibliometrix (Aria and Cuccurullo 2017). In the graphical representation of these networks, each circle represents a country or an institution, the size of the circle is proportional to the number of documents and the thickness of the lines represents the strength of the relationship between two countries or institutions. With respect to the country collaboration network, we identified two main clusters shown in Fig. 2. One (reported in red in Fig. 2) included the USA, Australia, Canada and different Asian countries (e.g., China, Japan, India and Korea). The other one (reported in blue in Fig. 2) included UK, Brazil and different European countries (e.g., Italy, Germany, France and Spain). When constructing the institution collaboration network, we identified four clusters (Fig. 3). The cluster with the highest number of institutions (reported in red in Fig. 3) only included those based in the USA (e.g., Harvard, University of Washington, Stanford). The second largest cluster (reported in blue in Fig. 3) included a variety of institutions from United Kingdom (University of Oxford and Imperial College London), Canada, Italy, Brazil, India and Hong Kong, which are among the top countries based on the number of COVID-19 articles (Table 1). The two smaller clusters contained institutions from Asian countries (colored in purple and green in Fig. 3). Both Figs. 2 and 3 were created by limiting the number of displayed nodes to 20.

Table 2 shows the 10 most relevant sources in terms of number of documents on the topic. The top three were the “International Journal of Environmental Research and Public Health”, “PLOS ONE” and “Sustainability”. All were open access journals.

Table 2 Top ten sources based on the number of published articles

In the top twenty documents of the collection of COVID-19 articles, the most cited document was a study from Zhou and colleagues (Zhou et al. 2020). This study analyzed retrospectively data from 191 adult COVID-19 patients admitted to two hospitals in Wuhan, China and found that more than 28% of patients died during hospitalization. The risk of death was higher in patients who were older or showed higher Sequential Organ Failure Assessment (SOFA) score (a diagnostic marker for sepsis and septic shock). The study was among the first to emphasize the relevance of early detection, rapid treatment, and thorough monitoring of COVID-19 patients.

As a last step, using the 50 most frequent keywords plus, we created a word co-occurrence network shown in Fig. 4.

Fig. 4 Co-occurrence network constructed using the 50 most frequent keywords plus. In red: words related to the impact of the infection; in green: epidemiological or clinical aspects; in blue: words related to biological mechanisms

Each word is represented with a circle, and the size of the circles is proportional to the number of documents including the word. The degree of the relatedness between two words is indicated by the thickness of the line connecting the circles. As shown in Fig. 4, three main clusters of words were identified. One mostly included words related to the impact of the infection (e.g. risk, impact, care, health, mortality, outcomes, management) and in particular to the impact on mental health (depression, anxiety, stress, mental health, represented in red in Fig. 4). A second cluster included words related to epidemiological (e.g. transmission, outbreak, Wuhan, China) or clinical (pneumonia, acute respiratory syndrome) aspects and is represented in green in Fig. 4. The last cluster included words related to biological mechanisms (e.g. protein, expression, ace2, receptor, cells, replication and inflammation, represented in blue in Fig. 4).

3.2 Case study on articles with authors affiliated with an Italian institution

Among the 184,098 documents, 14,916 included at least one author with an Italian affiliation. Table 3 shows the countries for which we identified the highest number of collaborations among articles with Italian-affiliated authors. The USA was the country for which the highest number of collaborations was identified, followed by UK and Spain (Table 3).

Table 3 International collaborations between authors with Italian affiliation and authors from other countries

Figure 5 shows the institution collaboration network based on authors with an Italian affiliation. Two main clusters of institutions were identified: one including mostly universities from northern Italy (shown in red) and one including universities from either northern, central or southern parts of Italy (shown in blue). The co-occurrence network constructed using the 50 most frequent keywords plus is shown in Fig. 6.

Fig. 5 Network of institutions among articles with authors affiliated with Italian institutions

Fig. 6 Co-occurrence network constructed using the 50 most frequent keywords plus among articles with authors affiliated with Italian institutions

The network included three clusters and was similar to the one constructed using the whole data set (Fig. 4). However, in the network constructed using the Italian subset, the two clusters previously identified as related to epidemiological aspects and biological mechanisms were merged in a single cluster (shown in green in Fig. 6). The third cluster included words related to the management and consequences of the infection (e.g. risk, mortality, management, outcomes, diagnosis) and is shown in blue in Fig. 6.

The inclusion of the scientific sector (SSD) of authors affiliated with Italian institutions gave a precise indication of the research interests of the Italian scientific community during this historical period, which has strongly affected the country. As shown in Fig. 7, medicine (MED) was by far the most represented sector, followed by biology (BIO). The legislative (IUS) and the economic (SECS-P) sectors were right behind in terms of the number of researchers dedicated to studying the effects of the pandemic on society. Figure 7 also shows that other important sectors involved were those related to chemistry, engineering, mathematics, and physics, with several studies proposing models explaining the spread of infection. For the complete list of the SSDs with their description (both in English and Italian) see the CUN (Italian National University Council) website.

Fig. 7 Number of researchers in each SSD. Abbreviations: SSD, Settore Scientifico Disciplinare (disciplinary scientific sector)

3.3 Multiple correspondence analysis

In this section, we use MCA to investigate how the rate of citations per month and the 21 JCR research groups are associated with JIF, Number of Authors, and Type of Article. To this end, we construct a factorial plane from the active variables (i.e. JIF, Number of Authors, and Type of Article), project the supplementary variables (i.e. Rate of Citations and Research Groups) onto it, and interpret the results. We performed MCA using the FactoMineR package (Lê et al. 2008), version 2.7, in R version 4.1.2 (R Core Team 2021).

First, we categorized the variables (Table 4). The cutoffs for JIF and Rate of Citations are based on their quartiles. For Number of Authors, the first cutoff separates single-author papers, since they are the most desirable publications in most areas (Thatje 2016), whilst the second cutoff is the median.
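The categorization described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the analysis; the function names, the toy values, and the treatment of values that fall exactly on a cutoff (assigned to the lower class) are assumptions, and the actual class boundaries are those reported in Table 4.

```python
from bisect import bisect_left
from statistics import median, quantiles


def quartile_class(values):
    """Return a class index 0..3 for each value, cutting at the sample quartiles."""
    # method="inclusive" interpolates linearly between observed values
    cuts = quantiles(values, n=4, method="inclusive")  # [Q1, Q2, Q3]
    # Assumption: a value equal to a cutoff is assigned to the lower class
    return [bisect_left(cuts, v) for v in values]


def author_class(n_authors):
    """0 = single author, 1 = more than one author up to the median, 2 = above the median."""
    med = median(n_authors)
    return [0 if n == 1 else (1 if n <= med else 2) for n in n_authors]


# Hypothetical toy data, not taken from the study
toy_jif = [1, 2, 3, 4, 5, 6, 7, 8]
toy_authors = [1, 1, 3, 5, 8, 12]

print(quartile_class(toy_jif))
print(author_class(toy_authors))
```

The same quartile rule would apply to Rate of Citations; only Number of Authors uses the mixed rule (single author first, then the median).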

Table 4 Active and supplementary variables used for Multiple Correspondence Analysis

Table 6, included in the Appendix, shows how much inertia is explained by each of the six dimensions extracted from the active variables. We therefore used the first two dimensions for our analysis, which together explain 38.3% of the total inertia. Table 7, included in the Appendix, shows the contribution of the active variables to the dimensions considered. Type of Article contributes almost nothing to dimension 1, whilst it has the highest contribution to dimension 2. On the other hand, JIF and Number of Authors have a high relative contribution to both dimensions.
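To illustrate where the number of dimensions and the inertia shares come from, the sketch below computes MCA eigenvalues as a correspondence analysis of the indicator (one-hot) matrix, using NumPy. It is not the FactoMineR code used in the paper. The random toy data and the split of the active variables into 4, 3, and 2 categories are assumptions made only for the example (Table 4 gives the actual categories). With Q variables and J categories overall, this formulation yields J - Q non-trivial dimensions and a total inertia of J/Q - 1.

```python
import numpy as np


def mca_inertia(variables):
    """Eigenvalues and inertia shares of an MCA on the indicator matrix.

    variables: list of 1-D arrays of category codes, one per active variable.
    """
    # One-hot encode each variable and stack the blocks side by side
    blocks = []
    for col in variables:
        col = np.asarray(col)
        levels = np.unique(col)
        blocks.append((col[:, None] == levels[None, :]).astype(float))
    Z = np.hstack(blocks)
    n_categories = Z.shape[1]
    n_variables = len(variables)

    # Correspondence analysis of Z: standardized residuals, then SVD
    P = Z / Z.sum()
    r = P.sum(axis=1)  # row masses
    c = P.sum(axis=0)  # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)
    singular_values = np.linalg.svd(S, compute_uv=False)

    # Keep the J - Q non-trivial dimensions; the remaining ones are numerically zero
    eig = singular_values[: n_categories - n_variables] ** 2
    return eig, eig / eig.sum()


# Hypothetical toy data: 200 papers, three categorical variables
rng = np.random.default_rng(0)
toy = [rng.integers(0, 4, 200), rng.integers(0, 3, 200), rng.integers(0, 2, 200)]
eig, share = mca_inertia(toy)
print(len(eig), "dimensions; share of the first two:", share[:2].sum())
```

Under this assumed 4 + 3 + 2 split, the sketch returns six dimensions, which matches the count reported in Table 6; the inertia shares of random toy data are of course unrelated to the 38.3% found in the study.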

Figure 8 shows the bi-plot of the dimensions analyzed. The active variables JIF and Number of Authors appear strongly associated, since the sorted categories of the latter interleave with the sorted categories of the former, that is, LowAuth is in between LowJIF and MediumIF, and so on. Consequently, scientific publications in journals with high impact factors tend to be authored by a large number of authors and vice versa; this is consistent with the previous literature on the topic (Thatje 2016; Uthman et al. 2013).

The first dimension can be interpreted as the degree of complexity of the publication, in terms of the work needed both to organize the contributions of many authors and to structure the work for a journal with a high impact factor. For the second dimension, instead, the main information is provided by Type of Article, in particular by the Review category. High values of this dimension correspond to review-type publications in journals with high impact factor, whilst its lowest values correspond to articles published in low impact factor journals by just one author.

Onto this plane, we projected the two supplementary variables (Rate of Citations and Research Groups) to investigate their association with the active variables.

As the first dimension reflects the degree of complexity of the publication, the cloud of modalities of the Research Groups variable shows that, moving from the negative half-plane to the positive one, we pass from areas (Arts, History, Literature) whose articles usually have few authors and appear in low impact factor journals to areas such as Biology, Medicine or Chemistry, which are known for articles with many authors published in journals with high impact factors (Abramo and D’Angelo 2015).

Finally, the modalities of Rate of Citations lie mostly near the origin of the axes; however, the highest rates are mostly associated with a high number of authors and high impact factor journals.

Fig. 8 Multiple correspondence analysis bi-plot

3.4 Analysis of citations

We consider a regression model to investigate how the rate of citations per month of each paper relates to the number of authors, the scientometric impact of the journal of publication in terms of impact factor, and the JCR research groups. For this analysis, only articles with at least one citation and published in journals with an impact factor were considered.

Three separate linear quantile regression models were estimated for the three quartiles (\(\tau \in \{0.25, 0.50, 0.75\}\)). Standard errors were estimated by means of 100 bootstrap replicates. These models allow us to investigate citation behaviour across the quartiles of the response, distinguishing the parameter estimates for highly cited papers from those for less cited ones. Furthermore, such models are well suited to the asymmetry of the response (rate of citations per month): its values range from a minimum of 0.034 to a maximum of 93, with a median of 0.47 but a mean of 0.94.
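As a minimal illustration of the estimation principle, quantile regression at level \(\tau\) minimizes the sum of check losses \(\rho_\tau(u) = u\,(\tau - \mathbb{1}\{u<0\})\) over the residuals \(u\). The Python sketch below is not the model fitted in the paper: it uses an intercept-only model, for which the minimizer is a sample quantile, together with a bootstrap standard error based on 100 resamples. The toy rates, the function names, and the random seed are assumptions.

```python
import random
from statistics import mean, median


def pinball_loss(u, tau):
    """Check loss: tau * u for u >= 0 and (tau - 1) * u for u < 0."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def fit_constant(y, tau):
    """Intercept-only quantile regression; a minimizer is always attained at a data point."""
    return min(y, key=lambda c: sum(pinball_loss(v - c, tau) for v in y))


def bootstrap_se(y, tau, n_boot=100, seed=0):
    """Standard deviation of the estimate over n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    estimates = [fit_constant(rng.choices(y, k=len(y)), tau) for _ in range(n_boot)]
    m = mean(estimates)
    return (sum((e - m) ** 2 for e in estimates) / (n_boot - 1)) ** 0.5


# Hypothetical right-skewed citation rates (citations per month)
rates = [0.1, 0.2, 0.4, 0.5, 0.9, 3.0, 20.0]

for tau in (0.25, 0.50, 0.75):
    print(tau, fit_constant(rates, tau), bootstrap_se(rates, tau))
print("mean:", mean(rates), "median:", median(rates))
```

In this toy sample the mean is pulled far above the median by a single highly cited item, which mirrors the asymmetry reported for the real data and shows why quartile-specific estimates can be more informative than a single mean regression. With covariates, the constant is replaced by a linear predictor and the same loss is minimized.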

The analysis of the number of citations, as is well known, is useful for studying how research diffuses in scientific communities. Our analysis confirms that, also for the literature related to the SARS-CoV-2 pandemic, citations are strongly and positively associated with the number of collaborators and the impact of the journal of publication. Table 5 shows that, in agreement with the literature in the field (see for instance Abramo and D’Angelo 2015), the effect of the number of authors tends to become negligible above certain levels of citations, i.e. the effect of this variable was not statistically significant for articles in the third quartile of the response distribution.

Even more interesting, after controlling for the number of authors and the JIF, the distribution of citations differs across the JCR research groups, and this difference evolves across the quartiles. Being classified in a specific discipline area can thus affect the rate of citations over time. In particular, for some areas the estimated effect is positive and increases with higher quartiles (see Fig. 9 and Table 5); among them, Physics, Ecology, Economics, Psychiatry and Social Sciences are the most influential. In contrast, for areas such as Arts, Chemistry, Engineering and Materials (see Fig. 10), we observe a decrease in the citation indicator. We also observe differences in the magnitude of the estimates.

A similar analysis was run for the Italian case study. Again, a linear quantile regression was estimated for the three quartiles. Some of the effects changed with respect to those observed in the overall data set, in both sign and magnitude. For instance, the effect of the number of authors on the rate of citations is negligible even for articles with few citations. What remains similar is the significant positive effect of the Physics group for articles with a median rate of citations, as well as the negative effect of the Materials group, with the same decreasing trend as citations increase. It is also interesting to observe that, in the Computer Science community, the literature on COVID-19 seems to be negatively associated with the rate of citations. In the overall data set we observed estimates of similar magnitude, but they were not statistically significant. The estimates obtained are reported in Table 8 in the Appendix.

Table 5 Quantile regression model results for the global data set. Estimates and related inference are reported separately for each of the three quartiles considered: \(\tau =0.25\), \(\tau =0.50\) and \(\tau =0.75\)

Fig. 9 Trend of the estimated effects of the JCR research groups across the quartiles. Positive trends

Fig. 10 Trend of the estimated effects of the JCR research groups across the quartiles. Negative trends

4 Discussion and conclusions

In this paper we have examined, from different points of view, the impact of the outbreak of the COVID-19 pandemic on the global scientific literature available in the scientometric archive of Web of Science. We presented an updated picture of the COVID-19 literature at both the global and the local level, with a specific case study on Italy, which was highly affected in the early phases of the pandemic. As expected, the research domains most affected by this phenomenon were associated with the medical fields, epidemiology, and health in general. Globally, the scientific production is led by the US and China (in accordance with previous literature). While the US was the country with the highest number of articles published, we found its rate of collaboration with other countries to be below 18%. Countries with fewer published articles were observed to be more prone to collaboration (e.g. the UK, Canada, Germany and Australia have collaboration rates ranging from 32% to 37%). This aspect is confirmed by Fig. 3, which shows that the institutions in the US are all grouped in the same cluster. A similar trend can be observed in the local cluster analysis focused on Italy, where the institutions are divided into two clusters that closely reflect their geographical position. In fact, Fig. 5 shows that the red cluster is formed only by institutions from northern Italy, while the blue cluster includes institutions from central-southern Italy.

Finally, using MCA and quantile regression, we evaluated the relationships between the number of citations and other variables (mainly the journal impact factor, the number of authors, and the topics of the journal). Using MCA, we observed that the highest citation rates are associated with a high number of authors and high impact factor journals. In addition, we showed that the modalities of the variable JCR research group ranged from topics that usually have a low number of authors and are published in low impact factor journals (such as Arts, History, Literature) to areas often characterized by articles with a large number of authors and published in journals with high impact factor (such as Biology, Medicine or Chemistry). Quantile regression analysis confirmed the strong relationship of citations with the number of authors and the JIF in non-medical fields, but highlighted that above a certain threshold, i.e. for highly cited articles, the effect of the number of authors becomes almost irrelevant. The analysis also showed that the spread of citations on the topic varies according to the research domain, with distinct magnitudes across the quartiles of the response.

Our contribution provides a comprehensive picture of the global scientific production on COVID-19, which allowed us to observe trends in country and institutional collaborations and to identify factors associated with a higher number of citations. To the best of our knowledge, no updated contributions have investigated research collaborations within the frame of COVID-19 by considering simultaneously the worldwide geographic networks and the impact on the global scientific community without focusing on specific research topics. In the early phase of the outbreak, similar work was done by Yu et al. (2020) and Zyoud and Al-Jabi (2020); however, their results were limited to the first wave of the COVID-19 outbreak and did not include the analysis of long-term repercussions on research production. The application of statistical models, such as association analysis through MCA and quantile regression, was completely new in scientometric analyses of the SARS-CoV-2 scientific literature. This offered a broader perspective on the wide and unpredictable influence that this phenomenon had on research domains other than the medical ones. From the local perspective, previous studies involving the entire Italian research community were restricted to the early phase of the epidemic spread and thus analyzed a limited number of scientific contributions. Furthermore, this was the first time that the Italian SSD scientific classification was linked to the investigation of the impact of COVID-19 on research production.

This work presents some limitations. The exact publication date of some articles (e.g. articles published with an early publication procedure) might differ from the one available in WoS (which only reports the publication date of the issue). This might result in an underestimation of the lifespan of a scientific article and therefore in an inflated rate of citations per month. Moreover, regarding the analyses of the data set of authors with an Italian affiliation, it was not possible to obtain the SSD for all authors, as some do not work in universities (e.g. doctors, engineers) or their first or last names were misspelled or abbreviated. As a future development, we plan to further study the role of the impact factor in predicting the number of citations for articles focused on problems whose consequences might be as relevant as, or even more relevant than, those of the COVID-19 pandemic, but which unfold over a longer timespan, such as the consequences of climate change.