There is a long tradition of evaluation in Information Retrieval, the field concerned with the design of search engines. So-called ‘test collections’ were designed to enable the automatic, systematic, and reproducible evaluation of search engines (Voorhees 2007). A test collection comprises:
- Input of the system under study: a textual corpus.
- Ground truth: the expected output of the system under study.
- Metrics to assess the quality of a given output against the expected output.
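For illustration, such a test collection can be represented as a small set of records pairing the corpus with the expected output for each item. The following Python sketch is purely illustrative: the class and field names are ours, not part of any existing benchmark.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GroundTruthEntry:
    """Expected output for one item of the corpus (hypothetical structure)."""
    item_id: str          # identifier of the document or extracted item
    expected_output: str  # what a perfect system should return for this item


@dataclass
class TestCollection:
    """Input corpus plus ground truth; metrics are computed over system outputs."""
    corpus: List[str]                                    # the input documents
    ground_truth: List[GroundTruthEntry] = field(default_factory=list)
```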
This framework enables one to:
1. Benchmark the quality of a system compared with other systems, with a fixed input.
2. Check the performance gain/loss of a new system configuration compared to a baseline configuration, with a fixed input.
3. Benchmark the quality of a fixed system fed with varying inputs.
The ERC error-detection benchmark that we introduce in this section addresses these three purposes. In the subsequent experiments section, we report results according to point (3) above as we assess the quality of Seek&Blastn regarding varying inputs.
To the best of our knowledge, there is no existing benchmark dealing with error detection in the biomedical literature. Our original benchmark depicted in Fig. 1 aims to address this issue. Let us introduce a handful of biomedical concepts prior to explaining this figure; they appear in italics in the following. We consider scientific papers in the field of life sciences reporting gene-related experiments. These papers mention reagents: sequences of nucleotides, each nucleotide being represented as a letter among A, T, C, G, and U. Each reagent may or may not bind to a target within the genome or transcriptome. The presence or absence of binding depends on the homology between the reagent and its target, which is typically a defined place/localisation in the genome, such as a named gene. A gene is identified with a standard name (e.g., NOB1 and TPD52L2) and the reagent–gene homology can be assessed using the BLASTN software (Altschul et al. 1997). A reagent is said to be targeting when it binds a target in the genome or transcriptome (i.e., it shows significant homology with a gene) and non-targeting when it binds no such target.
The rule-based algorithm illustrated in Fig. 2 of the “Appendix” (Reagent–gene assessment of nucleotide sequence homology) assesses the significance of homology, leading to predictions of whether reagents are likely to be targeting or non-targeting. These rules were determined and validated on the basis of laboratory practice (BF, JAB, and TG) and of common siRNA and shRNA design rules (Yu et al. 2002). Deciding on the (non-)targeting nature of a reagent requires feeding this rule-based algorithm with the output of the BLASTN software (run with the reagent given as input). An endorsed fact (Computed Endorsed fact in Fig. 1) reflects the current knowledge provided by BLASTN and refined by the rule-based algorithm.
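To give an idea of how a reagent can be submitted to BLASTN before the rule-based algorithm is applied, the sketch below shells out to the BLAST+ command-line tool. It assumes BLAST+ is installed with a locally formatted nucleotide database; the database name, the function name, and the use of the short-query task are our assumptions, and none of the homology rules of Fig. 2 are encoded here.

```python
import subprocess
import tempfile


def blastn_hits(reagent: str, db: str = "refseq_rna"):
    """Run BLASTN on a single reagent and return its tabular hits.

    Assumes the BLAST+ `blastn` executable is on the PATH and that `db`
    names a locally formatted nucleotide database (placeholder name).
    """
    # Write the reagent as a one-record FASTA file; RNA letters (U) are
    # converted to DNA (T) since BLASTN searches nucleotide databases.
    with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as fasta:
        fasta.write(">reagent\n" + reagent.upper().replace("U", "T") + "\n")
        query_path = fasta.name

    # -task blastn-short is intended for short queries such as siRNA/shRNA
    # reagents; -outfmt 6 returns one tab-separated hit per line.
    result = subprocess.run(
        ["blastn", "-task", "blastn-short", "-query", query_path,
         "-db", db, "-outfmt", "6"],
        capture_output=True, text=True, check=True,
    )
    return [line.split("\t") for line in result.stdout.splitlines()]
```

The returned hits would then be passed to the rule-based algorithm of Fig. 2 to decide whether the homology is significant, i.e., whether the reagent is predicted to be targeting or non-targeting.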
In biomedical publications, nucleotide sequence reagents are claimed to be (non-)targeting a gene. For various reasons a claim may be wrong (Labbé et al. 2019), for instance because of typographical errors, copy-and-paste errors, or a limited understanding of the experiments described. Invalidating a claim requires comparing an automatically extracted claimed fact with the BLASTN endorsed fact. This fact-checking process tags each sequence with one of the four following classes:
- \(\mathrm {Class}_0\): supported claim. The nucleotide sequence in the text and its associated claim are valid according to current knowledge.
- \(\mathrm {Class}_6\): unsupported claim of targeting status. A nucleotide sequence is said to be targeting but is predicted to be non-targeting according to current knowledge.
- \(\mathrm {Class}_7\): unsupported claim of non-targeting status. A sequence is said to be non-targeting but is predicted to be targeting according to current knowledge.
- \(\mathrm {Class}_8\): targeting claim supported but incorrect target. A stated targeting nucleotide sequence is predicted to target a different gene or nucleotide sequence from the one claimed.
Papers with unspecific targets or an unspecified claimed status were removed from the ground truth.
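The comparison between the claimed fact and the endorsed fact can be sketched as follows; this is a minimal illustration of the four-class decision, not the actual Seek&Blastn implementation, and the function and parameter names are ours.

```python
from typing import Optional


def assign_class(claimed_targeting: bool, claimed_gene: Optional[str],
                 predicted_targeting: bool, predicted_gene: Optional[str]) -> int:
    """Map a (claimed fact, endorsed fact) pair to Class 0, 6, 7, or 8."""
    if claimed_targeting and not predicted_targeting:
        return 6  # unsupported claim of targeting status
    if not claimed_targeting and predicted_targeting:
        return 7  # unsupported claim of non-targeting status
    if claimed_targeting and claimed_gene != predicted_gene:
        return 8  # targeting claim supported but incorrect target
    return 0      # supported claim


# Example: the text claims the sequence targets TPD52L2, but BLASTN endorses NOB1.
assert assign_class(True, "TPD52L2", True, "NOB1") == 8
```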
The ERC benchmark aims to measure the quality of a fact-checking system by comparing its output to a test collection. The test collection stores nucleotide sequences that human experts tagged with a \(\mathrm {Class}_{\{0, 6, 7, 8\}}\). Benchmarking a fact-checking system consists of comparing its output to the expected answer for each sequence. The metrics we defined in ERC are run on the output of a fact-checking system: one of these classes (0, 6, 7, or 8) or a ‘no-decision’ answer for each sequence. This latter case occurs when the system is not able to provide an answer and reports this to end-users.
Metrics are defined to assess the overall performance of the system, which relies on three chained processes (CP1 to CP3) that need to be evaluated separately. Each metric combines quantities from the set of variables introduced in Table 1. For each CP, we distinguish three metrics: the system is successful (metric OK), it fails to extract the information to be checked (metric KO1), or it checks the extracted information incorrectly (metric KO2: a wrong decision is made). These metrics, all in the [0, 1] range, are computed as follows (a worked sketch follows the list):
- CP1. Sequence:
  - Sequence_OK = \(\nicefrac {c}{s}\) is recall-oriented and reflects the ability to extract all the nucleotide sequences from the corpus.
  - Sequence_KO1 is not computed, as a nucleotide sequence is either correctly extracted (i.e., Sequence_OK) or missed (i.e., Sequence_KO2).
  - Sequence_KO2 = \(\nicefrac {f}{(f+c)}\) is precision-oriented and reflects the trust that the user can place in the results of the sequence extraction task.
- CP2. Status for each correctly extracted nucleotide sequence:
  - Status_OK = \(\nicefrac {a}{c}\) reflects the ability to automatically assign the claimed status (targeting vs non-targeting) to a nucleotide sequence reagent.
  - Status_KO1 = \(\nicefrac {n}{c}\) measures the proportion of nucleotide sequences for which the fact-checking tool failed to assign a specific status.
  - Status_KO2 = \(\nicefrac {w}{c}\) measures the proportion of nucleotide sequences for which the fact-checking tool misassigned the claimed status.
- CP3. Targeted gene or sequence for each targeting nucleotide sequence:
  - Gene_OK = \(\nicefrac {a'}{c}\) measures the proportion of correctly extracted sequences to which the claimed gene identifier was associated (none for non-targeting sequences).
  - Gene_KO1 = \(\nicefrac {n'}{c}\) measures the proportion of correctly extracted nucleotide sequences to which no gene identifier was associated (value unknown).
  - Gene_KO2 = \(\nicefrac {w'}{c}\) measures the proportion of correctly extracted sequences to which a wrong gene identifier was associated (e.g., the text mentions TPD52L2 whereas the fact-checking tool extracted a different identifier).
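As announced above, here is a small worked sketch of the CP1–CP3 metrics computed from the raw counts of Table 1. Parameter names follow the table (a2, n2, w2 stand for a', n', w'); the counts in the usage example are invented for illustration only.

```python
def cp_metrics(s, c, f, a, n, w, a2, n2, w2):
    """CP1-CP3 metrics of the ERC benchmark, computed from raw counts.

    s: sequences in the ground truth; c/f: correctly/falsely extracted (CP1);
    a/n/w: claimed status assigned correctly / not at all / wrongly (CP2);
    a2/n2/w2: gene identifier associated correctly / not at all / wrongly (CP3).
    """
    return {
        "Sequence_OK":  c / s,
        "Sequence_KO2": f / (f + c),
        "Status_OK":    a / c,
        "Status_KO1":   n / c,
        "Status_KO2":   w / c,
        "Gene_OK":      a2 / c,
        "Gene_KO1":     n2 / c,
        "Gene_KO2":     w2 / c,
    }


# Invented counts, for illustration only (a + n + w = c and a2 + n2 + w2 = c):
print(cp_metrics(s=100, c=90, f=10, a=80, n=6, w=4, a2=75, n2=9, w2=6))
```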
Table 1 Variables used to define the metrics of the ERC fact-checking benchmark

The fact-checking system compares (1) the claimed fact stated in the text and automatically extracted from it with (2) the endorsed fact, to produce an output \(\mathrm {Class}_{\{0, 6, 7, 8\}}\) for a given nucleotide sequence. An error while performing CP1, CP2, or CP3 results in a false output from the fact-checking system. We measure the end-to-end performance of the benchmarked tool with the following metrics (a short sketch follows the list):
- Fact-check_OK = \(\nicefrac {o}{s}\) measures the proportion of correctly checked nucleotide sequences.
- Fact-check_KO1 = \(\nicefrac {p}{s}\) measures the proportion of nucleotide sequences for which the fact-checking process did not produce a decision.
- Fact-check_KO2 = \(\nicefrac {q}{s}\) measures the proportion of nucleotide sequences for which a wrong decision was made (e.g., \(\mathrm {Class}_0\) instead of \(\mathrm {Class}_8\)).
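A matching sketch for the end-to-end metrics, again using the counts of Table 1 (o, p, q over the s ground-truth sequences); the function name is ours.

```python
def fact_check_metrics(s: int, o: int, p: int, q: int) -> dict:
    """End-to-end ERC metrics over the s ground-truth sequences.

    o: correctly checked sequences; p: sequences with no decision;
    q: sequences with a wrong decision.
    """
    return {
        "Fact-check_OK":  o / s,   # proportion of correctly checked sequences
        "Fact-check_KO1": p / s,   # proportion with no decision produced
        "Fact-check_KO2": q / s,   # proportion with a wrong decision
    }
```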
At this point, the metrics defined above can be used to answer the main question of this paper: what is the performance decay (if any) when inputs are provided in PDF format rather than in other, more structured formats? We answer this question by comparing Fact-check_OK across all tested input formats. Running through the whole processing chain (CP1, CP2, and CP3) indicates where performance decreases, which helps system designers decide where to focus their future efforts.
People who employ the fact-checking system include biologists, journal staff, and text miners. Detecting errors in the papers they read is crucial to avoid trusting erroneous literature. From the end-users’ perspective, flagging errors automatically can prove risky, as no system is perfect. Reporting a trust level for each output of the detector (i.e., \(\mathrm {Class}_{\{0, 6, 7, 8\}}\)) can help end-users assess the trustworthiness of the system result. This is why the benchmark provides the following extra metrics regarding the success rate per error class.
As a reminder, each nucleotide sequence in the ground truth is tagged with one expected output (\(\mathrm {Class}_{\{0, 6, 7, 8\}}\)). Due to this partition, the number of sequences s is the sum of the number of sequences in each class, that is \(s = \sum _{i\in \{0, 6, 7, 8\}} s_i\) where \(s_i\) is the number of sequences for \(\mathrm {Class}_i\) in the test collection. The following numbers are useful to measure the accuracy of a system under test:
- \(o_i\) is the number of nucleotide sequences of \(\mathrm {Class}_i\) that were correctly reported by the system.
- \(u_i\) is the number of nucleotide sequences of \(\mathrm {Class}_i\) for which no decision was taken by the system.
- \(m_i\) is the number of nucleotide sequences of \(\mathrm {Class}_i\) that were incorrectly reported by the system.
For each \(\mathrm {Class}_i\), we define the following metrics:
- \(\mathrm {Class}_i\)_OK = \(\nicefrac {o_i}{s_i}\) is the proportion of nucleotide sequences for which the system output matches the expected class given in the ground truth.
- \(\mathrm {Class}_i\)_KO1 = \(\nicefrac {u_i}{s_i}\) is the proportion of nucleotide sequences for which the system was unable to associate a class (‘no-decision’ was reported).
- \(\mathrm {Class}_i\)_KO2 = \(\nicefrac {m_i}{s_i}\) is the proportion of nucleotide sequences for which the system produced a wrong class. The distribution of these wrong decisions among the different classes can then be computed to identify the most frequent erroneous pairs of \(\mathrm {Class}_i\) and \(\mathrm {Output}_i\).
\(\mathrm {Class}_i\)_KO2 is crucial to highlight situations where the fact-checking software confuses one class with another. For example, \(\mathrm {Class}_0\) sequences (endorsed facts) may be confused with \(\mathrm {Class}_8\) sequences (gene mismatch). This measure reflects the likelihood of misclassifying a nucleotide sequence of a given class and informs end-users about the level of confidence they may place in each type of output.
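The per-class metrics, and the distribution of wrong decisions behind each \(\mathrm {Class}_i\)_KO2, can be derived from a list of (expected class, system output) pairs, as in the sketch below; the ‘no-decision’ label and the function name are placeholders of ours.

```python
from collections import Counter, defaultdict


def per_class_metrics(pairs):
    """Compute per-class OK/KO1/KO2 rates and the confusion counts behind
    each Class_i_KO2. `pairs` is an iterable of (expected_class, output)
    where output is 0, 6, 7, 8, or the string 'no-decision'."""
    pairs = list(pairs)
    totals = Counter(expected for expected, _ in pairs)  # s_i per class
    ok, no_decision = Counter(), Counter()               # o_i and u_i per class
    confusion = defaultdict(Counter)                     # m_i broken down by wrong output

    for expected, output in pairs:
        if output == expected:
            ok[expected] += 1
        elif output == "no-decision":
            no_decision[expected] += 1
        else:
            confusion[expected][output] += 1

    return {
        cls: {
            "OK":  ok[cls] / s_i,
            "KO1": no_decision[cls] / s_i,
            "KO2": sum(confusion[cls].values()) / s_i,
            "confused_with": dict(confusion[cls]),
        }
        for cls, s_i in totals.items()
    }


# Invented toy example: one Class 0 sequence mistaken for Class 8, one undecided Class 6.
print(per_class_metrics([(0, 0), (0, 8), (8, 8), (6, "no-decision")]))
```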
The next section reports the results of the benchmark that we ran on the ERC_H_v2 test collection, which we built and distribute for reproducibility purposes.