This section introduces the algorithm we designed for the day-to-day discovery of preprint–publication links. We first consider the links already established by medRxiv and gather knowledge about the most successful features for matching publications to preprints. These features inform our original ‘search and prune’ strategy leveraging the Crossref API as a third-party academic search engine. The source code of the linker is released as supplementary material (Appendix 1) so that readers can replicate our results or seek new preprint–publication links in medRxiv or in any other preprint server.
Collecting the medRxiv-established preprint–publication links
The Cold Spring Harbor Laboratory, which operates both bioRxiv and medRxiv, offers an Application Programming Interface (API) for programmatic access to the data hosted on both servers. We used it to collect the preprint–publication pairs for all medRxiv preprints. Figure 3 shows an excerpt of the resulting records: one per preprint version. As of July 14, 2020, medRxiv hosted 10,560 preprint versions corresponding to 8,214 unique preprints. Filtering these records on the published field, we found 741 preprints with one linked publication.
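The following Python sketch illustrates this collection step. It is not the released linker code: the endpoint layout, the cursor-based paging, and the collection and published field names follow the public documentation of the CSHL API, and the date range is given for illustration only.

```python
import requests

# Cursor-paged 'details' route of the CSHL API serving medRxiv records
API = "https://api.biorxiv.org/details/medrxiv/2019-06-01/2020-07-14/{cursor}"

def collect_medrxiv_records() -> list[dict]:
    """Collect one record per preprint version, 100 records per call."""
    records, cursor = [], 0
    while True:
        resp = requests.get(API.format(cursor=cursor), timeout=30)
        resp.raise_for_status()
        batch = resp.json().get("collection", [])
        if not batch:
            break
        records.extend(batch)
        cursor += len(batch)
    return records

records = collect_medrxiv_records()
# The 'published' field holds the publication DOI, or 'NA' when unlinked
linked = [r for r in records if r.get("published", "NA") != "NA"]
print(len(records), "versions,", len(linked), "linked to a publication")
```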
We then retrieved preprint and publication metadata by querying the Crossref API with the DOIs listed earlier. Crossref provides full bylines, including each author’s complete identity and, when available, their ORCID. First names are given in full, which is more precise than the initials provided for some preprints (see Costa, D. in Fig. 3). One DOI failed to resolve (https://doi.org/10.34171/mjiri.34.62) and we excluded the associated pair from the collection, which thus comprises 740 preprint–publication links.
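Metadata retrieval itself reduces to one call per DOI on the Crossref works route, as in the following minimal sketch; the author, given, family, and ORCID field names are those of the Crossref record format.

```python
import requests

def crossref_metadata(doi: str) -> dict:
    """Fetch the Crossref record for a given DOI."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    resp.raise_for_status()  # a DOI unknown to Crossref yields a 404
    return resp.json()["message"]

meta = crossref_metadata("10.1101/2020.05.02.20086231")
for author in meta.get("author", []):
    # 'given' carries the full first name; 'ORCID' is present only when deposited
    print(author.get("family"), author.get("given"), author.get("ORCID"))
```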
Designing features to match publications with preprints
Based on the retrieved metadata for the 740 preprint–publication pairs, we designed three features to be used as criteria to match a candidate publication to a given preprint. The next sections detail the rationale and implementation of these features: timeline, title, and byline matching.
Timeline matching
According to the FAQ (medRxiv 2020), the first version of a preprint should predate the acceptance date of the linked publication.
Among the 740 collected medRxiv preprints, less than one percent (\(N=5\)) do not comply with this requirement (Table 1). This observation suggests that searching for publications accepted on or after the preprint’s submission date works in most cases.
Table 1 Five outlying medRxiv preprints posted after the acceptance date of the linked publication

Title matching
We hypothesised that the title of a preprint (in its latest version) and the title of its associated publication are likely to be very similar. Running through the 740 paired titles, we noticed that minor variations often occur. Some typographic markers differ between preprint and publication versions: hyphens get typeset as em- or en-dashes, for instance. In addition, acronyms in preprint titles are sometimes expanded in the publication counterparts: the strings USA and US were expanded to the United States of America, and a few occurrences of SARS-CoV-2 were changed to severe acute respiratory syndrome coronavirus 2.
We used a 3-step method to measure the similarity between a preprint’s title and its associated publication’s title. First, both titles were pre-processed to expand acronyms and uniformise typographic markers. Second, the resulting titles were tokenised using whitespace as the delimiter. Third, the Jaccard distance between the two resulting token sets was computed (Levandowsky and Winter 1971) to reflect the share of words in common compared to all words occurring in the preprint and publication titles. The resulting similarity value is the one-complement of this distance.
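A minimal Python sketch of this 3-step measure follows; the acronym table is a small illustrative subset of the expansions mentioned above, and the lowercasing of tokens is our assumption rather than a documented detail of the implementation.

```python
import re

# Illustrative subset of the expansions applied during pre-processing
ACRONYMS = {
    "USA": "the United States of America",
    "US": "the United States of America",
    "SARS-CoV-2": "severe acute respiratory syndrome coronavirus 2",
}

def tokenise(title: str) -> set[str]:
    """Steps 1 and 2: expand acronyms, uniformise dashes, split on whitespace."""
    for short, long in ACRONYMS.items():
        title = re.sub(rf"\b{re.escape(short)}\b", long, title)
    title = title.replace("\u2013", "-").replace("\u2014", "-")  # en/em dashes
    return set(title.lower().split())  # lowercasing is our assumption

def sim_titles(preprint_title: str, publication_title: str) -> float:
    """Step 3: one-complement of the Jaccard distance between token sets."""
    a, b = tokenise(preprint_title), tokenise(publication_title)
    return len(a & b) / len(a | b) if a | b else 1.0
```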
Perfect similarity occurred for 81% (\(N=600\)) of the 740 preprint–publication pairs. A similarity of 80% or more characterises 90% (\(N=666\)) of the pairs. A small fraction of 8% (\(N=58\)) of the pairs shows a [0.5, 0.8[ similarity. Only one pair has a similarity below 10%: the preprint title was recast before submission to the British Medical Journal. This example of a 5% inter-title similarity features very few words in common:
- The preprint https://doi.org/10.1101/2020.05.02.20086231 in its latest version was titled: Trends in excess cancer and cardiovascular deaths in Scotland during the COVID-19 pandemic 30 December 2019 to 20 April 2020. (We note in passing that the metadata differs slightly from the title given in the PDF version of the preprint.)
- The subsequent publication https://doi.org/10.1136/bmj.m2377 was titled: Distinguishing between direct and indirect consequences of covid-19.
These tests suggest that most preprint–publication pairs show high to perfect similarity. Setting a 10% lower bound on inter-title similarity should filter out irrelevant pairs.
Byline matching
We hypothesised that the first author of a preprint (in its latest version) remains the first author of the published paper. There is only one counterexample among the 740 pairs: the first author of preprint https://doi.org/10.1101/2020.03.03.20030593 becomes the third author of the associated publication https://doi.org/10.1001/jama.2020.6130, with the preprint authors ranked 10th and 2nd promoted to the first two positions.
Comparing the ORCIDs of the preprint and publication first authors is the most reliable approach when ORCIDs are provided, which was the case for 30% (\(N=219\)) of all pairs. As a fallback solution, we compared the identities (i.e., last name and first name) of paired authors. We noted several discrepancies hindering any matching based on strict string equality, such as differences in accents, hyphenation, and middle initials.
We designed an author-matcher algorithm that compares two authors’ ORCIDs or, when these are not available, their identities. Hyphens and accents were removed to uniformise the strings. Then, the family names and at most the first three letters of the first names were compared, as a way to overcome changes in middle initials. Tested on the 740 pairs, this approach showed a 97% (\(N=721\)) success rate, suggesting that first-author comparison is effective for preprint–publication matching.
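The sketch below illustrates such an author matcher, assuming authors are given as Crossref-style dictionaries with family, given, and optional ORCID entries.

```python
import unicodedata

def clean(name: str) -> str:
    """Uniformise a name: strip accents and hyphens, lowercase."""
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    return name.replace("-", " ").lower().strip()

def match_first_authors(pre: dict, pub: dict) -> bool:
    """Compare ORCIDs when both are available, identities otherwise."""
    if pre.get("ORCID") and pub.get("ORCID"):
        return pre["ORCID"] == pub["ORCID"]
    if clean(pre["family"]) != clean(pub["family"]):
        return False
    # Compare at most the first three letters of the first names so that
    # changes in middle initials (e.g., 'John A.' vs 'John') do not hinder a match
    return clean(pre["given"])[:3] == clean(pub["given"])[:3]
```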
We tested another criterion that proved less effective: the number of preprint vs. publication authors. It appeared that 95% (\(N=708\)) of the pairs validate the following hypothesis: the number of publication authors is equal to or greater than the number of preprint authors. We disregarded this criterion when combining the more effective ones presented in the next section.
Feature benchmarking on the medRxiv gold collection of ‘published preprints’
We combined these features to form a burden of proof, which is used to decide when a preprint–publication pair should be reported. The boolean function \( match (p,j)\in {\mathbb {B}}\) is true when a journal paper j is likely to be linked to a preprint p, as follows (a code sketch of this decision rule appears after the definitions below):
$$\begin{aligned} match (p,j) ={} & simTitles (p,j)\geqslant 0.8\\ & \vee \Big ( simTitles (p,j)\geqslant 0.1\\ & \quad \wedge \, matchDates (p,j)\\ & \quad \wedge \big ( matchORCIDs (p,j) \vee matchFirstAuthors (p,j)\big )\Big ) \end{aligned}$$
(1)
where:
- \( simTitles (p,j)\in [0,1]\) is the one-complement of the Jaccard distance between the titles.
- \( matchDates (p,j)\in {\mathbb {B}}\) is true when the date of p is earlier than or equal to the date of j.
- \( matchORCIDs (p,j)\in {\mathbb {B}}\) is true when the ORCIDs of the first authors are identical.
- \( matchFirstAuthors (p,j)\in {\mathbb {B}}\) is true when the identities of the first authors match.
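Assuming the sim_titles and match_first_authors helpers sketched earlier (the latter folds matchORCIDs and matchFirstAuthors together by trying ORCIDs first), Eq. 1 translates directly into code; dates are assumed to be ISO-8601 strings, which compare chronologically as plain strings.

```python
def match(preprint: dict, publication: dict) -> bool:
    """Decision rule of Eq. 1: strong title evidence, or weaker title
    evidence backed by compatible dates and matching first authors."""
    sim = sim_titles(preprint["title"], publication["title"])
    if sim >= 0.8:                                   # simTitles(p, j) >= 0.8
        return True
    return (
        sim >= 0.1                                   # simTitles(p, j) >= 0.1
        # matchDates: preprint posted on or before the acceptance date
        and preprint["date"] <= publication["accepted_date"]
        # matchORCIDs or matchFirstAuthors (ORCID tried first, then identity)
        and match_first_authors(preprint["first_author"],
                                publication["first_author"])
    )
```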
Titles showing an 80% or higher similarity were found to be excellent evidence. This criterion circumvents the aforementioned timeline issues for the five problematic cases of Table 1, as well as 18 of the 19 preprint–publication cases with non-matching first authors.
For titles with less than 80% similarity, candidate pairs must show a title similarity of at least 10%, compatible dates (i.e., the preprint must be posted before the acceptance date of its journal counterpart), and matching first authors (based on either ORCID or identity comparison).
We applied Eq. 1 to the 740 preprint–publication pairs of the medRxiv gold collection. The matching is almost perfect, with 99% of the pairs validated (\(N=738\)); we analysed the two missed pairs to understand these failures.
The next section discusses the implementation of the tested search features as input parameters to the Crossref API and post-processing filters.
Implementation of the search features using the Crossref API
As a reminder, we tackle the following information retrieval task: for a given preprint, find all subsequently published articles. This requires combing the most comprehensive and up-to-date record of the scholarly literature for publications matching the features of the preprint under consideration. This section describes the preprint–publication linker we designed: it searches the daily-updated Crossref bibliographic source, which comprised 117 million records as of October 2020.
We designed a two-step ‘search and prune’ process to retrieve any publication likely to be a follow-up of a given preprint.
First, the program queries the Crossref REST API with the parameters in Table 2. These reflect the features that we established and tested against the medRxiv gold collection of ‘published’ preprints. Exclusion filters delineate the search space based on two criteria: the publication’s date must be on or after the date of the preprint’s first version, and the publication’s type must correspond to materials published in journals, proceedings, or books. Crossref’s search engine uses a ‘best match’ approach to retrieve up to 20 records based on title and byline similarity. Each returned record comes with a score reflecting the similarity between the query (i.e., the preprint) and the matching publication.
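The sketch below illustrates this first step. The authoritative parameter set is the one listed in Table 2; the query.bibliographic, query.author, from-pub-date, and type filters used here are an approximation based on the public Crossref documentation, and the first_author_family and first_version_date keys are hypothetical names for the preprint metadata.

```python
import requests

def search_candidates(preprint: dict) -> list[dict]:
    """Query Crossref for up to 20 'best match' candidate publications."""
    params = {
        "query.bibliographic": preprint["title"],
        "query.author": preprint["first_author_family"],
        "rows": 20,
        # Exclusion filters: dated on/after the preprint's first version,
        # and restricted to journal, proceedings, and book materials
        "filter": ",".join([
            f"from-pub-date:{preprint['first_version_date']}",
            "type:journal-article",
            "type:proceedings-article",
            "type:book-chapter",
        ]),
    }
    resp = requests.get("https://api.crossref.org/works",
                        params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["message"]["items"]  # each item carries a 'score'
```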
Table 2 Searching the literature for publications matching a given preprint: invocation of the Crossref REST API at https://api.crossref.org with the parametrised works resource (see https://github.com/CrossRef/rest-api-doc#parameters)

Second, the program prunes the publication records that are unlikely to be preprint follow-ups. Equation 1 is applied to discard publications whose titles and bylines fail to match those of the preprint under consideration. A final filter rejects records from Elsevier’s Social Science Research Network (SSRN) preprint server: their DOIs, starting with 10.2139/ssrn, were incorrectly deposited with the journal-article type despite being preprints (Lin and Ram 2018). The surviving record(s) are shown to the user, who is expected to validate the preprint–publication pair(s) tabulated by decreasing matching likelihood (Fig. 4).
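The pruning step then reduces to a filter over the candidate list, as sketched below; we assume here that each Crossref record has been mapped to a dictionary keeping its doi and score fields alongside the title, acceptance date, and first author expected by the match function of Eq. 1.

```python
def prune(preprint: dict, candidates: list[dict]) -> list[dict]:
    """Discard records unlikely to be follow-ups of the preprint."""
    kept = [
        c for c in candidates
        # Reject SSRN preprints deposited with the journal-article type
        if not c["doi"].lower().startswith("10.2139/ssrn")
        and match(preprint, c)  # Eq. 1 over titles, dates, and first authors
    ]
    # Tabulate by decreasing matching likelihood for manual validation
    return sorted(kept, key=lambda c: c["score"], reverse=True)
```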