Study outline and study outcomes
The types of datasets used and their purposes in the present study are outlined in Fig. 1. We defined three study objectives ordered by relevance (objective 1 the most relevant, objective 3 the least) and by the anticipated susceptibility to bias/confounding (objective 1 the least susceptible, objective 3 the most), given the observational nature of the study and the type of data (metadata). The 1st objective was to investigate whether COVID-19-related preprints were favored over non-related ones for publication in peer-reviewed journals. Preprints were used as proxies for (hypothetical) “all generated manuscripts on the topics”, since they were the only manuscripts whose existence could be clearly verified and whose subsequent development could be prospectively followed. We considered the probability of publishing to be the most informative outcome in this respect, with further insight obtained from analysis of submission-to-acceptance time. We therefore defined three outcomes of interest for the 1st study objective: (i) the primary outcome was the probability of publishing within 120 days of the deposition of the first preprint version. The time constraint was imposed to reduce the risk of bias arising from unequal “time-at-risk” (for publishing); hence, the analysis was restricted to preprints deposited up to June 29, so that all preprints had 120 days “available” for publication between the deposition date and the date of data collection/analysis. Preprints that remained unpublished by day 120 were censored. We considered 120 days a reasonable period for the submission and review process to take place; (ii) the secondary outcome was the probability of publishing over the entire observed period (i.e., up to November 01, 2020), assessed in the subset of preprints deposited before September 27, 2020. Manuscripts that remained unpublished by November 01, 2020, were censored. We considered this outcome complementary to the primary outcome, which was susceptible to bias arising from the possibility that not all preprinted manuscripts were actually submitted to journals within the same/similar timeframe, i.e., that some might have been purposely left in preprint form for a longer period of time. To further reduce bias/confounding arising from unequal “time-at-risk”, preprints in both datasets were stratified into 15-day strata with respect to the date of the first preprint version. Due to the limited number of COVID-19-related preprints, the first two strata (Jan. 01–15 and Jan. 16–30) were merged into a single stratum; (iii) the tertiary outcome was submission-to-acceptance time, considered a proxy for the length of the peer-review process, assessed in the subset of preprints that were published in peer-reviewed journals during the observed period and submitted to journals after January 01, 2020. Here, we anticipated potential confounding arising from varying interest in COVID-19-related and non-related topics over time, so manuscripts were stratified into 15-day strata with respect to the journal submission dates (Fig. 1).

The 2nd objective was to illustrate the preprinting trends of COVID-19-related and non-related manuscripts on bioRxiv and medRxiv and their usage statistics, and to estimate the extent of public peer review (i.e., pre-submission peer review) using the number of posted comments and Altmetric data as proxies. This was assessed using all preprints deposited on the two platforms between January 01 and December 05, 2020 (Fig. 1).
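The 15-day stratification and the 120-day primary outcome can be expressed, in simplified form, as follows. This is a minimal sketch in R assuming a data frame named preprints with hypothetical columns first_version_date and published_date (dates of the first preprint version and of journal publication); the actual variable names and data structures may differ.

```r
library(dplyr)

preprints <- preprints %>%
  mutate(
    # 15-day strata counted from January 01, 2020
    stratum = as.integer(first_version_date - as.Date("2020-01-01")) %/% 15 + 1,
    # merge the first two strata (Jan. 01-15 and Jan. 16-30) into a single stratum
    stratum = ifelse(stratum == 2L, 1L, stratum),
    # primary outcome: published in a journal within 120 days of the first version;
    # preprints unpublished by day 120 are counted as not published (censored)
    published_120 = as.integer(!is.na(published_date) &
                                 as.integer(published_date - first_version_date) <= 120)
  )
```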
The 3rd objective was to evaluate a possible association between the publication topic (COVID-19-related or non-related) and quality issues related to the published papers, using notifications of retraction, expressions of concern, or corrections as proxies. This was assessed using data on all published papers indexed in PubMed and all notifications issued in the Retraction Watch Database (Retraction Watch Database, n.d.) between January 01 and December 05, 2020 (Fig. 1).
Data collection
All data retrieval and management were done in R (version 4.0.2) (R Core Team, 2020). The rbiorxiv and medrxivr packages were used to access the bioRxiv and medRxiv application programming interfaces (APIs) and to collect metadata and usage statistics on the preprinted manuscripts (January 01–December 05, 2020). COVID-19 preprints were identified using the search terms COVID-19 OR SARS-CoV-2 OR "Coronavirus disease 19" OR 2019-nCoV. Furthermore, the CORD-19 dataset was used to identify additional COVID-19-related preprints. All other preprints were classified as non-COVID-19 preprints. Publication status was retrieved from the bioRxiv and medRxiv services. Publication dates of articles with bioRxiv preprints were provided by the bioRxiv API. The rcrossref package was used to gather publication dates for journal-published preprints initially deposited on medRxiv. The DisqusR and rAltmetric packages were used to access the Disqus and Altmetric APIs, to identify the number of comments, and to retrieve Altmetric data for each preprint deposited on the bioRxiv and medRxiv servers. Submission and acceptance dates for published preprints were retrieved from PubMed with the RISmed package. The Retraction Watch Database was used to retrieve the number of COVID-19-related articles with an issued retraction notice, expression of concern, or correction during 2020 (up to December 05, 2020). Controls for COVID-19-related manuscripts in this analysis were manuscripts pertaining to four different viruses and their associated diseases: human immunodeficiency virus, hepatitis virus (any), herpes virus (any), and influenza virus. Four search phrases were constructed to retrieve the number of retraction notices, expressions of concern, or corrections pertaining to these topics; two search phrases were used to retrieve the numbers pertaining to two related topics (immunology and epidemiology of viral infectious diseases); and one search phrase was used to retrieve the number of retraction notices, expressions of concern, or corrections issued for all COVID-19-unrelated articles (obtained as the difference between the total number of retrieved items and the number retrieved for COVID-19-related papers).
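In outline, the preprint metadata retrieval and topic classification can be sketched as follows. The function and argument names are those of the medrxivr and rbiorxiv packages and are shown for illustration; exact arguments may differ between package versions, and the CORD-19 cross-check described above is omitted for brevity.

```r
library(medrxivr)
library(rbiorxiv)

# metadata for preprints deposited between January 01 and December 05, 2020
medrxiv_data <- mx_api_content(server = "medrxiv",
                               from_date = "2020-01-01",
                               to_date   = "2020-12-05")
biorxiv_data <- biorxiv_content(server = "biorxiv",
                                from = "2020-01-01",
                                to   = "2020-12-05",
                                format = "df")
# (the bioRxiv API is paginated; in practice the request is repeated with the
#  skip argument until all records are retrieved)

# classify preprints by topic using the search terms listed above
covid_regex <- "COVID-19|SARS-CoV-2|Coronavirus disease 19|2019-nCoV"
medrxiv_data$covid <- grepl(covid_regex,
                            paste(medrxiv_data$title, medrxiv_data$abstract),
                            ignore.case = TRUE)
```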
The exact search methodology is presented in Table 3. All numbers are expressed as a proportion of the total number of articles indexed by PubMed. To minimize the “time-at-risk” bias, only articles published during 2020 were compared. However, some risk of bias still remains due to the possibility of different publishing rates of articles during 2020. To avoid bias due to possible over- or underrepresentation of COVID-19 articles in PubMed compared with the Retraction Watch Database, we repeated the analysis using only data provided by the PubMed database. The search was conducted using the search term (Retracted Publication[PT]) AND ("2020/01/01"[Date - Publication] : "2020/12/05"[Date - Publication]), with and without AND (COVID-19 OR SARS-CoV-2 OR "Coronavirus disease 19" OR 2019-nCoV), to identify the number of retractions of COVID-19 and non-COVID-19 articles during 2020.
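As an illustration, the PubMed counts can be retrieved programmatically with the RISmed package using the same query strings; the object names below are arbitrary.

```r
library(RISmed)

q_all   <- '(Retracted Publication[PT]) AND ("2020/01/01"[Date - Publication] : "2020/12/05"[Date - Publication])'
q_covid <- paste(q_all,
                 'AND (COVID-19 OR SARS-CoV-2 OR "Coronavirus disease 19" OR 2019-nCoV)')

n_all      <- QueryCount(EUtilsSummary(q_all,   type = "esearch", db = "pubmed"))
n_covid    <- QueryCount(EUtilsSummary(q_covid, type = "esearch", db = "pubmed"))
n_noncovid <- n_all - n_covid   # retractions of non-COVID-19 articles
```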
Data analysis
All data visualization and analysis were performed in R (version 4.0.2) (R Core Team, 2020). Data on preprinting trends over time, preprint usage statistics, Altmetric data and Disqus comments on the manuscripts preprinted during the observed period, and Retraction Watch Database data on the published papers were summarized by preprinting platform and topic (COVID-19-related or not).
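A minimal sketch of such a summary is shown below, assuming a combined data frame named preprints with hypothetical columns for platform, topic, usage statistics, Altmetric score, and Disqus comment counts; the actual variables summarized followed the data returned by the respective APIs.

```r
library(dplyr)

preprints %>%
  group_by(platform, covid) %>%
  summarise(
    n_preprints      = n(),
    median_views     = median(abstract_views,  na.rm = TRUE),
    median_downloads = median(pdf_downloads,   na.rm = TRUE),
    median_altmetric = median(altmetric_score, na.rm = TRUE),
    median_comments  = median(disqus_comments, na.rm = TRUE),
    .groups = "drop"
  )
```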
The probability of publishing within 120 days of the first preprint version and the probability of publishing over the entire observed period were analyzed by fitting stratified (with respect to preprint deposition date) logistic regression (package survival, function clogit). Submission-to-acceptance time was analyzed by fitting a hierarchical (mixed) model with the submission date stratum as a random effect. Articles with submission dates before 2020 were excluded from the dataset used for the analysis of submission-to-acceptance time. Fixed effects in all analyses were topic (COVID-19-related or non-related), preprinting platform (bioRxiv or medRxiv), and the number of preprinted versions (dichotomized as one or ≥ 2). The latter adjustment was introduced to account for potential bias arising from different intentions of the preprinting authors: for example, a preprint might have been submitted to a journal at the time of preprinting, or it might have been a work in progress with several versions and purposely kept (only) as a preprint over a longer period of time; additionally, more preprinted versions might have improved the quality of the final submitted version, and hence the peer-review process might have been shorter.
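In schematic form, and using the hypothetical variable names introduced earlier, the two models can be written as follows. The conditional (stratified) logistic regression uses the clogit function named above; the mixed model is shown with lme4::lmer as one possible implementation, since the text does not specify the mixed-model package.

```r
library(survival)
library(lme4)

# probability of publishing, conditioned on the 15-day deposition stratum
fit_publish <- clogit(published_120 ~ covid + platform + multiple_versions +
                        strata(stratum),
                      data = preprints)

# submission-to-acceptance time, with the 15-day submission-date stratum
# modeled as a random intercept
fit_time <- lmer(sub_to_acc_days ~ covid + platform + multiple_versions +
                   (1 | submission_stratum),
                 data = published_papers)
```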
We conducted a supplemental analysis of preprinting/publishing trends and other aspects of bioRxiv and medRxiv preprints over a longer period of time that we found informative for discussion of the main study results (see Supplemental Material, Supplemental methods).
Submission-to-acceptance dataset validation
After all articles with submission dates before January 01, 2020, had been removed, we identified and excluded one article with a submission-to-acceptance time of -1 days (an obvious error). To validate the dataset, submission and acceptance dates were checked for 597 (10% of all) randomly selected articles. For 12 articles, we could not find the submission and acceptance dates on the journal website or in the published PDF of the article. Five articles with erroneous date values were identified. We considered that the number and size of the errors were acceptable and unlikely to affect the conclusions of the study.
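The cleaning and validation-sampling steps can be sketched as follows, again with hypothetical column and object names.

```r
library(dplyr)

published_papers <- published_papers %>%
  filter(submission_date >= as.Date("2020-01-01")) %>%           # drop pre-2020 submissions
  mutate(sub_to_acc_days = as.integer(acceptance_date - submission_date)) %>%
  filter(sub_to_acc_days >= 0)                                   # removes the single -1-day record

set.seed(1)                                                      # seed value is arbitrary
validation_sample <- slice_sample(published_papers, prop = 0.10) # ~10% of articles checked manually
```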