CsFEVER and CTKFacts: acquiring Czech data for fact verification

In this paper, we examine several methods of acquiring Czech data for automated fact-checking, a task commonly modeled as classification of textual claim veracity w.r.t. a corpus of trusted ground truths. We attempt to collect sets of data in the form of a factual claim, evidence within the ground truth corpus, and its veracity label (supported, refuted or not enough info). As a first attempt, we generate a Czech version of the large-scale FEVER dataset built on top of the Wikipedia corpus. We take a hybrid approach of machine translation and document alignment; the approach and the tools we provide can be easily applied to other languages. We discuss its weaknesses, propose a future strategy for their mitigation and publish the 127k resulting translations, as well as a version of this dataset reliably applicable to the Natural Language Inference task: CsFEVER-NLI. Furthermore, we collect a novel dataset of 3,097 claims, annotated using the corpus of 2.2M articles of the Czech News Agency. We present an extended dataset annotation methodology based on the FEVER approach, and, as the underlying corpus is proprietary, we also publish a standalone version of the dataset for the task of Natural Language Inference, which we call CTKFactsNLI. We analyze both acquired datasets for spurious cues, i.e., annotation patterns leading to model overfitting. CTKFacts is further examined for inter-annotator agreement, thoroughly cleaned, and a typology of common annotator errors is extracted. Finally, we provide baseline models for all stages of the fact-checking pipeline and publish the NLI datasets, as well as our annotation platform and other experimental data.


Introduction
In the current highly connected online society, the ever-growing information influx eases the spread of false or misleading news. The omnipresence of fake news has motivated the formation of fact-checking organizations such as AFP Fact Check,1 International Fact-Checking Network,2 PolitiFact,3 Poynter,4 Snopes,5 and many others. At the same time, many tools for fake news detection and fact-checking are being developed: ClaimBuster [1], ClaimReview 6 or CrowdTangle;7 see [2] for more examples. Many of these are based on machine learning technologies aimed at image recognition, speech to text, or Natural Language Processing (NLP). This article deals with the last of these, focusing on automated fact-checking (hereinafter also referred to as fact verification).
Automated fact verification is a complex NLP task [3] in which the veracity of a textual claim is evaluated with respect to a ground truth corpus. The output of a fact-checking system is a classification of the claim, conventionally one of supported, refuted and not enough information available in the corpus. For the supported and refuted outcomes it further supplies the evidence, i.e., a list of documents that explain the verdict. Fact-checking systems typically work in two stages [4]. In the first stage, based on the input claim, the Document Retrieval (DR) module selects the evidence. In the second stage, the Natural Language Inference (NLI) module matches the evidence with the claim and provides the final verdict. Table 1 shows an example of data used to train fact-checking systems of this type.
Current state-of-the-art methods applied to the domain of automated fact-checking are typically based on large-scale neural language models [5], which are notoriously data-hungry. While there is a reasonable number of quality datasets available for high-profile world languages [2], the situation for most other languages is significantly less favorable. Also, most available large-scale datasets are built on top of Wikipedia [4, 6-8]. While encyclopedic corpora are convenient for dataset annotation, they are hardly the only eligible sources of the ground truth.
We argue that corpora of verified news articles used as claim verification datasets are a relevant alternative to encyclopedic corpora. The advantages are clear: the amount and detail of information covered by news reports are typically higher. Furthermore, news articles typically inform on recent events attracting public attention, which also inspire new fake or misleading claims spreading throughout the online space.
On the other hand, news articles address a more varied range of issues and have a more complex structure from the NLP perspective. While encyclopedic texts are typically concise and focused on facts, the style of news articles can vary wildly between different documents or even within a single article. For example, it is common that a report-style article is intertwined with quotations and informative summaries. Also, claim validity might be obscured by complex temporal or personal relationships: a past quotation like "Janet Reno will become a member of the Cabinet." may or may not support the claim "Janet Reno was a member of the Cabinet." 8 This depends, firstly, on the date with respect to which we verify the claim, and secondly, on who the quotation's author was and what their competence was. Note that similar problems are less likely in encyclopedia-based datasets like FEVER [4].
The contributions of this paper are as follows:
1. FEVER localization scheme (and CsFEVER case study): We propose an experimental localization scheme of the large-scale FEVER [4]
8 And the veracity of such a claim may be further nuanced by the affiliation and bias of the speaker.
In Section 4, we introduce the novel CTKFacts dataset. We describe its annotation methodology, data cleaning, and postprocessing, as well as analysis of the inter-annotator agreement. Section 5 analyzes spurious cues for both CsFEVER and CTKFacts. In Section 6, we present the baseline models.
Section 7 concludes with an overall discussion of the results and with remarks for future research.

Related work
This section describes datasets and models related to the task of automated fact-checking of textual claims. A more general overview of the state of the art can be found in [2] or [9]. The Emergent [10] dataset is based on news; it contains 300 claims and 2k+ articles, but it is limited to headlines. Due to the dataset size, only simple models classifying into three classes (for, against, and observing) are presented. The described models are fed BoW vectors and feature-engineered attributes.
Wang in [11] presents another dataset of 12k+ claims, working with 6 classes (pants-fire, false, barely-true, half-true, mostly-true, and true). Each verdict includes a justification. However, evidence sources are missing. The models presented in the paper are claim-only, i.e., they deal with surface-level linguistic cues only. The author further experiments with speaker-related meta-data.
Fact Extraction and VERification (FEVER) [4] is a large dataset of 185k+ claims covering the overall fact-checking pipeline. It is based on abstracts of the 50k most visited pages of English Wikipedia. The authors present a complex annotation methodology that involves two stages. In the first stage, claim generation, annotators create a true initial claim supported by a random Wikipedia source article, with context extended by a dictionary constructed from pages linked from the source article. The initial claim is then mutated by rephrasing, negating and other operations. The second stage, claim labeling, provides the evidence as well as the final verdict: SUPPORTS, REFUTES or NEI, where the last stands for the "not enough information" label. Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS) [6] adds 87k+ claims including evidence based on Wikipedia table cells. The size of the FEVER data facilitates modern deep learning NLP methods. The FEVER authors host annual workshops involving competitions, with results described in [5] and [12].
MultiFC [13] is a 34k+ claim dataset sourcing its claims from 26 fact-checking sites. The evidence documents are retrieved via the Google Search API as the ten highest-ranking results. This approach significantly deviates from the FEVER-like datasets, as the ground truth is not limited by a closed-world corpus, which limits the trustworthiness of the retrieved evidence. Also, such data cannot be used to train DR models.
WikiFactCheck-English [8] is another recent Wikipedia-based large dataset of 124k+ claims and a further 34k+ claims refuted by the same evidence. The claims are accompanied by context. The evidence is based on Wikipedia articles as well as on the linked documents.
For fact-checking datasets in languages other than English, the situation is less favorable. Recently, Gupta et al. [14] released a multilingual (25 languages) dataset of 31k+ claims annotated with seven veracity classes. Similarly to MultiFC, evidence is retrieved via the Google Search API. The experiments with the multilingual BERT [15] model show that the gain from including the evidence is rather limited when compared to claim-only models. FakeCovid [16] is a multilingual (40 languages) dataset of 5k+ news articles. The dataset focuses strictly on the COVID-19 topic. Also, it does not supply evidence in a raw form; human fact-checker argumentation is provided instead. Kazemi et al. [17] released two multilingual (5 languages) datasets; these are, however, aimed at claim detection (5k+ examples) and claim matching (2k+ claim pairs).
In the Czech locale, the most significant machine-learnable dataset is the Demagog dataset [18] based on the fact-checks of the Demagog 10 organisation. The dataset contains 9k+ claims in Czech (and 15k+ in Slovak and Polish) labeled with a veracity verdict and speaker-related metadata, such as name and political affiliation. The verdict justification is given in natural language, often providing links from social networks, government-operated webpages, etc. While the metadata is appropriate for statistical analyses, the justification does not come from a closed knowledge base that could be used in an automated scheme.
The work most related to ours was presented by the authors of [19, 20], who published a Danish version of EnFEVER called DanFEVER. Unlike our CsFEVER dataset, DanFEVER was annotated by humans. Given the limited number of annotators, it includes significantly fewer claims than EnFEVER (6k+ as opposed to 185k+).

CsFEVER
In this section, we introduce a developmental CsFEVER dataset intended as a Czech localization of the large-scale English EnFEVER dataset. It consists of claims and veracity labels justified with pointers to data within the Czech Wikipedia dump.
A typical approach to automatically building such a dataset from the EnFEVER data would be to employ machine translation (MT) methods for both claims and Wikipedia articles. While MT methods have recently been reaching maturity [21, 22], the problem lies in the high computational cost of such translation. While using state-of-the-art MT methods to translate the claims (2.2M words) is a feasible way of acquiring data, the translation of all Wikipedia articles is a much costlier task, as their abstracts alone have a total of 513M words corresponding to 452k pages (measured on the June 2017 dump used in [4]).
However, in NLP research, Wikipedia localizations are often considered a comparable corpus [23-27], that is, a corpus of texts that share a domain and properties. Furthermore, partial alignment is often revealed between Wikipedia locales, either on the level of article titles [26] or of specific sentences [23], much like in parallel corpora. We hypothesize there may be a sufficient document-level alignment between the Czech and English Wikipedia abstracts that were used to annotate the EnFEVER dataset, as in both languages the abstracts are used to summarize basic facts about the same real-world entity.
In order to validate this hypothesis, and to obtain experimental large-scale data for our task, we proceed to localize the EnFEVER dataset using such an alignment derived from the Wikipedia interlanguage linking available on MediaWiki.11 In the following sections, we discuss the output quality and information loss, and we outline possible uses of the resulting dataset.

Method
Our approach to generating CsFEVER from the openly available EnFEVER dataset can be summarized by the following steps:
1. Fix a version of the Wikipedia dump in the target language to be the verified corpus.
2. Map each Wikipedia article referred to in the evidence sets to a corresponding localized article using the MediaWiki API.12 If no localization is available for an article, remove all evidence sets in which it occurs.
3. Remove all SUPPORTS and REFUTES data points having empty evidence.
4. Apply the MT method of choice to all claims.
5. Re-split the dataset into train, dev, and test so that the dev and test veracity labels are balanced.
Before we explore the data, let us discuss the caveats of the scheme itself. Firstly, the evidence sets are not guaranteed to be exhaustive: no human annotations in the target language were made to detect whether there are new ways of verifying the claims using the target version of Wikipedia (in fact, exhaustiveness does not hold for EnFEVER either, as its evidence recall was estimated to be 72.36% [4]).
Secondly, even if our document-alignment hypothesis is valid on the level of abstracts, sentence-level alignment is not guaranteed. Its absence invalidates the EnFEVER evidence format, where evidence is an array of Wikipedia sentence identifiers. The problem could, however, be addressed by altering the evidence granularity of the dataset, i.e., using whole documents to prove or refute the claim, rather than sentences. Recent research on long-input processing language models [28-30] is likely to make this simplification less significant. Lastly, step 5 might allow a slight leakage of information between the splits: while it is guaranteed that no claim appears in two splits simultaneously, two claims extracted from the same Wikipedia article may. Such claims are, however, typically different mutations, independent of each other's veracity. This cannot be fixed using publicly available data if we are aiming for balanced dev and test splits.
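Steps 2 and 3 of the scheme above can be illustrated by a minimal sketch. It assumes document-level evidence (article titles rather than sentence identifiers) and a precomputed interlanguage title mapping; the record layout, the `title_map` dictionary and the function name `localize_claim` are hypothetical and are not taken from our released tools.

```python
from typing import Dict, List, Optional

VERIFIABLE = ("SUPPORTS", "REFUTES")


def localize_claim(claim: dict, title_map: Dict[str, Optional[str]]) -> Optional[dict]:
    """Apply steps 2 and 3 to one claim; return None if it must be dropped.

    Hypothetical input format (document-level evidence):
    {"claim": str, "label": str, "evidence": [[en_title, ...], ...]}
    """
    localized_sets: List[List[str]] = []
    for evidence_set in claim["evidence"]:
        mapped = [title_map.get(title) for title in evidence_set]
        # Step 2: keep an evidence set only if every article has a localization.
        if mapped and all(mapped):
            localized_sets.append(mapped)
    # Step 3: a verifiable claim without any remaining evidence is removed.
    if claim["label"] in VERIFIABLE and not localized_sets:
        return None
    return {**claim, "evidence": localized_sets}


# Toy interlanguage mapping; None marks an article without a Czech version.
title_map = {"Prague": "Praha", "Some Film": None}
claim = {
    "claim": "Prague is a city.",
    "label": "SUPPORTS",
    "evidence": [["Prague"], ["Prague", "Some Film"]],
}
localized = localize_claim(claim, title_map)
```

In this toy case the second evidence set is discarded because one of its articles has no localization, while the claim itself survives thanks to the first set.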

Results
Following our scheme from Section 3.1, we used the June 2020 Czech Wikipedia dump parsed into a database of plain text articles using the wikiextractor 13 package and only kept their abstracts.
In order to translate the claims, we have empirically tested three available state-of-the-art English-Czech machine translation engines (data not shown here). Namely, these were: Google Cloud Translation API,14 CUBBITT [22] and DeepL.15 As of March 17th, 2021, we observed DeepL to give the best results. Most importantly, it turned out to be robust w.r.t. homographs and faithful to the conventional translation of named entities (such as movie titles, which are very common amongst the 50k most popular Wikipedia articles used in [4]).
Finally, during the localization process, we have been able to locate Czech versions of 6,578 out of 12,633 Wikipedia articles needed to infer the veracity of all EnFEVER claims. Omitting the evidence sets that are not fully contained in the Czech Wikipedia and omitting SUPPORTS/REFUTES claims with empty evidence, we arrive at 127,328 claims that can hypothetically be fully (dis-)proven in at least one way using the Czech Wikipedia abstracts corpus, which is 69% of the total 185,445 EnFEVER claims.
We release the resulting dataset publicly in the HuggingFace datasets repository.16 In Table 2 we show the dataset class distribution. It is roughly proportional to that of EnFEVER. Similarly to [4], we have opted for label-balanced dev and test splits, in order to ease evaluation of biased predictors.

Validity
In order to validate our hypothesis that the Czech Wikipedia abstracts support and refute the same claims as their English counterparts, we have sampled 1% (1257) verifiable claim-evidence pairs from the CsFEVER dataset and annotated their validity.
Overall, we have measured a 66% transduction precision with a confusion distribution visualised in Figure 1. We therefore claim that the localization method, while yielding mostly valid datapoints, needs further refinement, and that CsFEVER as-is is noisy and mostly appropriate for experimental benchmarking of model recall in the document-level retrieval task. In Section 4.4 we proceed to use this data for training Czech retrieval models for the task of dictionary generation vital for our CTKFacts annotations. With caution, it may also be used for NLI experiments.
We conclude that while the large scale of the obtained data may find its use, the collection of a novel Czech-native dataset is desirable for finer tasks, and we proceed to annotate the CTKFacts dataset for our specific application case. As Figure 1 shows a common problem with NEI mislabeling, the dataset could also be further cleaned by a well-performing NLI model at an appropriate level of confidence.

CsFEVER-NLI
An alternative way to look at the EnFEVER data is to view them as context-query pairs, where the query is a claim and the context is a concatenation of the full texts of its evidence. This was examined for the NLI task in [31], and released as the FEVER-NLI dataset. Where no context was given (NEI datapoints without evidence), the authors uniformly sampled 3-5 sentences from the top-ranked Wikipedia abstract according to their retrieval model.
This interpretation of the EnFEVER data reduces the size of Wikipedia content that needs to be translated alongside the claims to 15M words, at the cost of the ability to use the data for retrieval tasks. Therefore, we also publish a dataset we call CsFEVER-NLI that was generated independently of the scheme from Section 3.1 by directly translating 228k FEVER-NLI pairs published in [31] using DeepL. We conclude that by only using the relevant parts of English Wikipedia and translating these, we mitigate the problems found in Section 3.2.1 and provide a solid dataset for the NLI task in the fact-checking pipeline.

CTKFacts
In this section, we address the collection and analysis of the CTKFacts dataset, our novel dataset for fact verification in Czech. The overall approach to the annotation is based on FEVER [4]. Unlike other FEVER-inspired datasets [6, 7, 20], which deal with corpora of encyclopedic language style, CTKFacts uses a ground truth corpus extracted from an archive of press agency articles.
As the CTK archive is proprietary and kept as a trade secret, the full domain of all possible evidence may not be disclosed.Nevertheless, we provide public access to the derived NLI version of the CTKFacts dataset we call CTKFactsNLI.CTKFactsNLI is described in Section 4.8.

CTK corpus
For the ground truth corpus, we have obtained a proprietary archive of the Czech News Agency,17 also referred to as CTK, which is a public service press agency providing news reports and data in Czech to subscribed news organizations. Due to the character of the service (that is, providing raw reports that are yet to be interpreted by the commercial media), we hypothesize that such a corpus suffers from significantly less noise in the form of sensational headlines, political bias, etc.
Using a news corpus as a ground-truth database might be (rightfully) considered controversial. We stress that it is important to select only highly reliable sources for this purpose. Specifically, in the Czech media environment, the CTK is known to keep a high standard of news verification.18 The full extent of data provided to our research is 3.3M news reports published between 1 January 2000 and 6 March 2019. We reduce this number by neglecting redundancies and articles formed around tables (e.g., sport results or stock prices). Ultimately, we arrive at a corpus of 2M articles with a total of 11M paragraphs. Hereinafter, we refer to it as the CTK corpus, and it is to be used as the verified text database for our annotation experiments.

Paragraph-level documents
The FEVER shared task proposed a two-level retrieval model: first, a set of documents (i.e., Wikipedia abstracts) is retrieved. These are then fed to the sentence retrieval system, which provides the evidence on the sentence level. This two-stage approach, however, does not match the properties of news corpora: in most cases, news sentences are significantly less self-contained than those of encyclopedic abstracts, which disqualifies the sentence-level granularity.
On the other hand, news articles tend to be too long for many of the state-of-the-art document retrieval methods. FEVER addresses a similar issue by trimming the articles to their short abstracts only. Such trimming cannot be easily applied to our data, as the news reports often come without abstracts or summaries and scatter the information across their whole length.
In order to achieve a reasonable document length, as well as to make use of all the information available in our corpus, we opt to work with our full data on the paragraph level of granularity, using a single-stage retrieval. From this point onwards, we also refer to the CTK paragraphs as documents.
We store meta-data for each paragraph, identifying the article it comes from, its order 19 and a timestamp of publication.

Source Document Preselection
In FEVER, every claim is derived from a random sentence of a Wikipedia article abstract sampled from the fifty thousand most popular articles [4].
With the news report archive in its place, the approach does not work well, as most paragraphs do not contain any check-worthy information. In our case, we were forced to include an extra manual preselection task (denoted $T_0$, see Section 4.5) to deal with this problem.

Dictionary Generation
In EnFEVER Claim Extraction as well as in the annotation of DanFEVER [19], the annotator is provided with a source Wikipedia abstract and a dictionary composed of the abstracts of pages hyperlinked from the source. The aim of such a dictionary is to 1) introduce more information on entities covered by the source, 2) extend the context in which the new claim is extracted in order to establish more complex relations to other entities.
With the exception of the claim mutation task (see below), annotators are instructed to disregard their own world knowledge.The dictionary is essential to ensure that the annotators limit themselves to the facts (dis-)provable using the corpus while still having access to higher-level, more interesting entity relations.
As the CTK corpus (and news corpora in general) does not follow any rules for internal linking, it becomes a significant challenge to gather reasonable dictionaries.The aim is to select a relatively limited set of documents to avoid overwhelming the annotators.These documents should be highly relevant to the given knowledge query 20 while covering as diverse topics as possible at the same time to allow complex relations between entities.
Our approach to generating dictionaries combines a NER-augmented keyword-based document retrieval method and a semantic search followed by clustering to promote diversity.
The keyword-based search uses the TF-IDF DrQA [32] document retrieval method, which is the designated baseline for EnFEVER [4]. Our approach makes multiple calls to DrQA, successively representing the query q by all possible pairs of named entities extracted from q. As an example, consider the query q = "Both Obama and Biden visited Germany.": N = {"Obama", "Biden", "Germany"} is the extracted set of top-level named entities. DrQA is then called $\binom{|N|}{2} = 3$ times for the keyword queries $q_1$ = "Obama, Biden", $q_2$ = "Obama, Germany", $q_3$ = "Biden, Germany".
Czech Named Entity Recognition is handled by the model of [33]. In the end, we select at most $n_{KW}$ (we use $n_{KW} = 4$) documents having the highest score for the dictionary. This iterative approach aims to select documents describing mutual relations between pairs of named entities. It is also a way to promote diversity between the dictionary documents. Our initial experiments with a naïve method of simply retrieving documents based on the original query q (or simple queries constructed from all named entities in N) were unsuccessful, as journalists often rephrase, and the background knowledge can be found in multiple articles. Hence, the naïve approach often reduces to a search for these rephrased but redundant textual segments.
The second part of the dictionary is constructed by means of semantic document retrieval. We use the M-Bert [15] model finetuned on CsFEVER (see Section 6.1), which initially retrieves a rather large set of $n_{PRE} = 1024$ top-ranking documents for the query q. In the next step, we cluster the $n_{PRE}$ documents based on their [CLS] embeddings using k-means. Each of the k (k = 2 in our case) clusters then represents a semantically diverse set of documents (paragraphs) $P_i$ for $i \in \{1, \dots, k\}$. Finally, we cyclically iterate through the clusters, always extracting a single document $p \in P_i$ closest to q by means of the cosine similarity, until the target number of $n_{SEM}$ (we used $n_{SEM} = 4$) documents is reached. The final dictionary is then a union of the $n_{KW}$ and $n_{SEM}$ documents selected by both described methods.
During all steps of dictionary construction, we make sure that all the retrieved documents have an older timestamp than the source. Simply put, to each query we assign a date of its formulation, and only verify it using the news reports published before that date. The combination of the keyword and semantic search, as well as the meta-parameters involved, are a result of empirical experiments. They are intended to provide a minimum necessary context on the key actors of the claim and its nearest semantic neighbourhood.
In the following text, we denote a dictionary computed for a query q as d(q). In the annotation tasks, it is often desirable to combine dictionaries of two different queries (a claim and its source document) or to include the source paragraph itself. For clarity, we use the term knowledge scope to refer to the entire body of such information.
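The pairwise querying described above can be sketched in a few lines. The snippet assumes the named entities have already been extracted; the function names and the rule for merging the scored hits of the individual calls (keeping the best score per document) are illustrative assumptions, not a description of the DrQA interface.

```python
from itertools import combinations


def pairwise_keyword_queries(entities):
    """Build one keyword query per unordered pair of distinct named entities."""
    unique = list(dict.fromkeys(entities))  # drop duplicates, keep order
    return [", ".join(pair) for pair in combinations(unique, 2)]


def merge_top_documents(hits_per_query, n_kw=4):
    """Keep at most n_kw documents with the highest score over all calls.

    hits_per_query: one list of (doc_id, score) tuples per keyword query.
    Assumption: a document returned by several calls keeps its best score.
    """
    best = {}
    for hits in hits_per_query:
        for doc_id, score in hits:
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:n_kw]


queries = pairwise_keyword_queries(["Obama", "Biden", "Germany"])
# Toy retrieval output standing in for the three retriever calls.
toy_hits = [[("d1", 0.9), ("d2", 0.4)], [("d2", 0.7), ("d3", 0.1)], [("d4", 0.5)]]
top = merge_top_documents(toy_hits, n_kw=3)
```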
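The cyclic selection over clusters can be sketched as follows. The snippet assumes the embeddings and the k-means cluster assignments are already computed elsewhere and passed in as plain lists; the function name `select_diverse` and the order in which clusters are visited are hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_diverse(query_vec, doc_vecs, cluster_ids, n_sem=4):
    """Round-robin over clusters, always taking the remaining document closest to the query."""
    clusters = {}
    for idx, cid in enumerate(cluster_ids):
        clusters.setdefault(cid, []).append(idx)
    # Sort each cluster by decreasing similarity to the query.
    for members in clusters.values():
        members.sort(key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    queues = [clusters[cid] for cid in sorted(clusters)]
    selected, position = [], 0
    while len(selected) < n_sem and any(position < len(q) for q in queues):
        for queue in queues:
            if position < len(queue) and len(selected) < n_sem:
                selected.append(queue[position])
        position += 1
    return selected


# Toy 2-D embeddings with k = 2 clusters.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
picked = select_diverse([1.0, 0.0], docs, cluster_ids=[0, 0, 1, 1], n_sem=3)
```

Alternating between clusters means that near-duplicate paragraphs from one cluster cannot fill the whole dictionary on their own.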

Annotation Workflow
The overall workflow is depicted in Figure 2 and described in the following list:
1. Source Document Preselection ($T_0$) is the preliminary annotation step as described in Section 4.3, managed by the authors of this paper.
2. Claim Extraction ($T_{1a}$):
• The system samples a random paragraph p from the set of paragraphs preselected in the $T_0$ stage.
• The system generates a dictionary d(p), querying for the paragraph p and its publication timestamp (see Section 4.4).
• The annotator A is further allowed to augment the knowledge scope K with other paragraphs published in the same article as some paragraph already in K, in case the provided knowledge needs reinforcement.
• A outputs a simple true initial claim c supported by K while disregarding their own world knowledge.
3. Claim Mutation ($T_{1b}$): The claim c is fed back to A, who outputs a set of claim mutations $M = \{m_1, \dots, m_n\}$. These involve the mutation types defined in [4]: rephrase, negate, substitute similar, substitute dissimilar, generalize, and specify. We use the term final claim interchangeably with claim mutation in the following paragraphs. This is the only stage where A can employ their own world knowledge, although annotators are advised to preferably introduce knowledge that is likely to be covered in the corpus. To catch up with the additional knowledge introduced by A, the system precomputes dictionaries $d(m_1), \dots, d(m_n)$.
4. Claim Labeling ($T_2$):
• The annotation environment randomly samples a final claim m and presents it to A with a knowledge scope K containing the original source paragraph p, its $T_{1a}$ dictionary d(p), as well as the additional dictionary d(m) retrieved for m in $T_{1b}$. The order of K is randomized (except for p, which is always first) so as not to bias the time-constrained A.
• A is further allowed to augment K with other paragraphs published in the same article as some paragraph already in K, in case the provided knowledge needs reinforcement.
• A is asked to spend ≤ 3 minutes looking for minimum evidence sets $E^m_1, \dots, E^m_n$ sufficient to infer the veracity label, which is expected to be the same for each set.
• If none is found, A may also label m as NEI.
Note that FEVER defines two subtasks only: Claim Generation and Claim Labeling. Claim Generation corresponds to our $T_1$, while Claim Labeling is covered by $T_2$.
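The composition of the knowledge scope presented in the labeling task can be illustrated by a short sketch. Paragraphs are represented by plain identifiers here, and the function name `build_knowledge_scope` as well as the de-duplication of paragraphs shared by both dictionaries are assumptions made for the example, not a description of our platform code.

```python
import random


def build_knowledge_scope(source, dict_source, dict_claim, rng=None):
    """Return K: the source paragraph first, then d(p) and d(m) in random order."""
    rng = rng or random.Random()
    seen = {source}
    rest = []
    for doc in list(dict_source) + list(dict_claim):
        # Assumption: a paragraph present in both dictionaries is shown once.
        if doc not in seen:
            seen.add(doc)
            rest.append(doc)
    rng.shuffle(rest)  # randomized order so that ranking does not bias the annotator
    return [source] + rest


scope = build_knowledge_scope("p", ["a", "b", "c"], ["c", "d"], rng=random.Random(0))
```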

Annotation Platform
Due to notable differences in experiment design, we have built our own annotation platform, rather than reusing that of [4]. The annotations were collected using a custom-built web interface. Our implementation of the interface and backend for the annotation workflow described in Section 4.5 is distributed under the MIT license and may be inspected online.21 We provide further information on our annotation platform in Appendix A.

Annotators
Apart from $T_0$, the annotation tasks were assigned to groups of bachelor and master students of Journalism from the Faculty of Social Sciences at the Charles University in Prague. We have engaged a total of 163 participants who signed up for courses in AI Journalism and AI Ethics during the academic year 2020/2021. We used the resulting data, trained models and the annotation experiment itself to introduce various NLP mechanisms, as well as to obtain valuable feedback on the task feasibility and pitfalls.
The annotations were made in several waves, i.e., instances of the annotation experiment performed with different groups of students. This design allowed us to adjust the tasks, fulfilment quotas and the interface after each wave, iteratively removing the design flaws.

Cross-annotations
In the FEVER annotation labeling task [4], the annotators were advised to spend no more than 2-3 minutes finding as many evidence sets as possible within Wikipedia, so that the dataset can later be considered almost exhaustive. With our CTK corpus, the exhaustivity property is unrealistic, as news corpora commonly contain many copies of a single ground truth. For example, the claim "Miloš Zeman is the Czech president" can be supported using any "'...', said the Czech president Miloš Zeman." clause occurring in the corpus.
Therefore, we propose a different scheme: the annotator is advised to spend 2-3 minutes finding as many distinct evidence sets as possible within the time needed for good reading comprehension. Furthermore, we have collected an average of 2 cross-annotations for each claim. This allowed us to merge the evidence sets across different $T_2$ annotations of the same claim, and it resulted in a high coverage of our cross-validation experiments in Section 4.6.1.

Dataset analysis and postprocessing
After completing the annotation runs, we have extracted a total of 3,116 multi-annotated claims. 47% were SUPPORTed by the majority of their annotations, while the REFUTES and NEI labels were approximately even; the full distribution of labels is listed in Table 3. Of all the annotated claims, 1,776, that is 57%, had at least two independent labels assigned by different annotators. This sample was given by the intrinsic randomness of $T_2$ claim sampling. In this section, we use it to assess the quality of our data and the ambiguity of the task, as well as to propose annotation cleaning methods used to arrive at our final cleaned CTKFacts dataset.

Inter-Annotator Agreement
Due to our cross-annotation design (Section 4.5.3), we had a generously sized sample of independently annotated labels in our hands. As the total number of annotators was greater than 2, and as we allowed missing observations, we have used Krippendorff's alpha measure [34], which is the standard for this case [35]. For comparison with [4] and [20], we also list a 4-way Fleiss' κ-agreement [36] calculated on a sample of 7.5% of claims.
We have calculated the resulting Krippendorff's alpha agreement to be 56.42% and Fleiss' κ to be 63%.We interpret this as an adequate result that testifies to the complexity of the task of news-based fact verification within a fixed knowledge scope.It also encourages a round of annotation cleaning experiments that would exploit the number of cross-annotated claims to remove common types of noise.
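For illustration, the nominal variant of Krippendorff's alpha with missing observations can be computed from a coincidence matrix as in the sketch below. This is a textbook formulation written for this example, not the script used to obtain the figures above; the function name and the toy label lists are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    units: one list of labels per claim; claims with fewer than two labels
    carry no pairable information and are skipped (missing observations).
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of labels within a unit contributes 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("at least one unit with two or more labels is required")
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return 1.0  # only one label value occurs, no disagreement is possible
    return 1.0 - observed / expected


perfect = krippendorff_alpha_nominal([["S", "S"], ["R", "R"], ["N"], ["N", "N", "N"]])
chance = krippendorff_alpha_nominal([["S", "S"], ["S", "R"]])
```

The first toy call yields full agreement, while the second sits at chance level, which brackets the range in which an intermediate value such as the one reported above should be read.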

Manual annotation cleaning
We have dedicated a significant amount of time to manually traversing every conflicting pair of annotations to see if one or both violate the annotation guidelines. The idea was that this should be a common case for such annotations, as the CTK corpus does not commonly contain a conflicting pair of paragraphs except for the case of temporal reasoning explained in Section 4.7.
After separating out 14% (835) erroneously formed annotations, we have been able to resolve every conflict, ultimately achieving a full agreement between the annotations. We discuss the main noise patterns in Section 4.7.

Model-based annotation cleaning
Upon evaluating our NLI models (Section 6.2), we have observed that model misclassifications frequently occur at T_2 annotations that are counterintuitive for a human, but easier to predict for a neural model.
Therefore, we have performed a series of experiments in model-assisted human-in-the-loop data cleaning similar to [37] in order to catch and manually purge outliers, involving an expertly trained annotator working without a time constraint:
1. A fold of the dataset is produced using the current up-to-date annotation database, sampling a stratified test split from all untraversed claims; the rest of the data is then divided into dev and train stratified splits, so that the overall train-dev-test ratio is roughly 8:1:1.
2. The test claims are marked as traversed.
3. A round of NLI models (Section 6.2) is trained using the current train split to obtain the strongest veracity classifier for the current fold. The individual models are optimized w.r.t. the dev split, while the strongest one is finally selected using test.
4. The test misclassifications of this model are then presented to an expert annotator along with the model suggestion and an option to remove an annotation violating the rules and to propose a new one in its place.
5. New annotations propagate into the working database, and while there are untraversed claims, we proceed to step 1.
Despite allowing several inconsistencies with the scheme above during the first two folds (that were largely experimental), this led to a discovery of another 846 annotations conflicting with the expert annotator's labeling and a proposal of 463 corrective annotations (step 4).
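The fold construction in steps 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `make_fold`, the per-label rounding, and the use of claim ids as dictionary keys are all hypothetical and not taken from our annotation platform; model training and the expert review (steps 3 to 5) are left out.

```python
import random
from collections import defaultdict


def make_fold(labels, traversed, seed=0, test_frac=0.1, dev_frac=0.1):
    """Build one stratified train/dev/test fold (hypothetical helper).

    `labels` maps claim id -> veracity label; `traversed` is the set of
    claim ids that already served as test data in an earlier fold.
    Test claims are drawn only from untraversed claims, per label.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for claim_id in sorted(labels):
        by_label[labels[claim_id]].append(claim_id)

    train, dev, test = [], [], []
    for label in sorted(by_label):
        ids = by_label[label]
        fresh = [c for c in ids if c not in traversed]
        rng.shuffle(fresh)
        picked = fresh[: round(test_frac * len(ids))]
        picked_set = set(picked)
        # Everything not picked for test (traversed or not) feeds train/dev.
        rest = [c for c in ids if c not in picked_set]
        rng.shuffle(rest)
        n_dev = round(dev_frac * len(ids))
        test += picked
        dev += rest[:n_dev]
        train += rest[n_dev:]
    return train, dev, test
```

Calling `make_fold` repeatedly and adding each returned test split to `traversed` visits every claim exactly once as a test item, which is the stopping condition of step 5.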

Common annotation problems
In this section, we give an overview of common misannotation archetypes as encountered in the cleaning stage (Sections 4.6.2 and 4.6.3). These should be considered when designing annotation guidelines for similar tasks in the future. The following list is sorted by decreasing appearance in our data.
1. Exclusion misassumption is by far the most prevalent type of misannotation. The annotator wrongly assumes that an event connected to one entity implies that it cannot be connected to the other entity. E.g., the evidence "Prague opened a new cinema." leads to the claim "Prague opened a new museum." being refuted. In reality, there is no textual entailment between the claims, nor between their negations. We attribute this error to confusing the T_2 with a reading comprehension 22 task common for the field of humanities.
2. General misannotation: we were unable to find exact explanations for a large part of the mislabelled claims. We traced the cause of this noise to both unclearly formulated claims and UI-based user errors.
3. Reasoning errors cover failures in assessment of the claim logic, e.g., confusing "less" for "more", etc. Also, this often involved errors in temporal reasoning, where an annotator submits a dated evidence that contradicts the latest news w.r.t. the timestamp.
4. Extending minimal evidence: a larger than minimal set of evidence paragraphs was selected. This type of error typically does not lead to misannotation; nevertheless, it was common in the sample of the dataset we were analyzing.
5. Insufficient evidence, where the given evidence misses vital details on entities. As an example: the evidence "A new opera house has been opened in Copenhagen." does not automatically support the claim "Denmark has a new opera house." if another piece of evidence connecting Copenhagen and Denmark is not available. This type of error indicates that the annotator of the claim extended the allowed knowledge scope with their own world knowledge.

CTKFactsNLI Dataset
Finally, we publish the resulting cleaned CTKFacts dataset consisting of 3,097 manually labeled claims, with label distribution as displayed in Table 3.
We opt for stratified splits due to the relatively small size of our data and make sure that no CTK source paragraph was used to generate claims in two different splits, so as to avoid any data leakage.
The full CTK corpus unfortunately cannot be released publicly. Nevertheless, we extract all of our 3,911 labeled claim-evidence pairs to form the CTKFactsNLI dataset. Claim-wise, it follows the same splitting as our DR dataset, and the NEI evidence is augmented by the paragraph that was used to derive the claim to enable inference experiments.
We have acquired the authorization from CTK to publish all evidence plaintexts, which we include in CTKFactsNLI and open for public usage. The dataset is released publicly on HuggingFace dataset hub 23 and provides its standard usage API to encourage further experiments.

Spurious Cue Analysis
In the claim generation phase T_1b, annotators are asked to create mutations of the initial claim. These mutations may have a different truth label than the initial claim or even be non-verifiable with the given knowledge database. During trials in [4], the authors found that a majority of annotators had difficulty creating non-trivial negation mutations beyond adding "not" to the original. Similar spurious cues may lead to models cheating instead of performing proper semantic analysis.
In [19], the authors investigated the impact of the trivial negations on the quality of the EnFEVER and DanFEVER datasets. Here we present a similar analysis based on the cue productivity and coverage measures derived from the work of [38].
In our case, the cues extracted from the claims have the form of unigrams and bigrams. The definition of productivity assumes a dataset balanced with respect to labels. The productivity of a cue k is calculated as follows:

$$\pi_k = \frac{\max_{c \in C} \left| A_{cue=k} \cap A_{class=c} \right|}{\left| A_{cue=k} \right|},$$

where C denotes the set of possible labels C = {SUPPORTS, REFUTES, NEI}, A is the set of all claims, $A_{cue=k}$ is the set of claims containing cue k and $A_{class=c}$ is the set of claims annotated with label c. Based on this definition, the range of productivity is limited to $\pi_k \in [1/|C|, 1]$ for a balanced dataset. The coverage of a cue is defined as the ratio

$$\xi_k = \frac{\left| A_{cue=k} \right|}{\left| A \right|}.$$

We take the same approach as [19] to deal with the dataset imbalance: the resulting metrics are obtained by averaging over ten versions of the data based on random subsampling. We compute the metrics for both CsFEVER and CTKFacts datasets. Similarly to [19], we also provide the harmonic mean of productivity and coverage, which reflects the overall effect of the cue on the dataset.
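The cue metrics can be sketched in a few lines. The sketch below assumes the reading of the definitions given above: productivity is the share of the most frequent label among the claims containing a cue, and coverage is the share of all claims containing it. The function name `cue_stats`, whitespace-style pre-tokenized input, and the toy claims are hypothetical, and the averaging over ten label-balanced subsamples is omitted.

```python
from collections import Counter, defaultdict


def cue_stats(claims):
    """Compute (productivity, coverage, harmonic mean) per cue.

    `claims` is a list of (tokens, label) pairs; cues are the unigrams
    (strings) and bigrams (token pairs) occurring in a claim. Each cue is
    counted at most once per claim.
    """
    label_counts = defaultdict(Counter)
    for tokens, label in claims:
        cues = set(tokens) | set(zip(tokens, tokens[1:]))
        for cue in cues:
            label_counts[cue][label] += 1

    total = len(claims)
    stats = {}
    for cue, counts in label_counts.items():
        n_cue = sum(counts.values())          # |A_cue=k|
        productivity = max(counts.values()) / n_cue
        coverage = n_cue / total
        harmonic = 2 * productivity * coverage / (productivity + coverage)
        stats[cue] = (productivity, coverage, harmonic)
    return stats


toy_claims = [
    (["praha", "není", "město"], "REFUTES"),
    (["brno", "není", "vesnice"], "REFUTES"),
    (["praha", "je", "město"], "SUPPORTS"),
    (["brno", "je", "město"], "NEI"),
]
toy_stats = cue_stats(toy_claims)
```

In this toy input, the cue "není" occurs only in refuted claims, so its productivity is maximal, while "město" is spread evenly over the three labels and carries no label information.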
The results in Table 4 show that the cue bias detected in EnFEVER claims [19] propagates to the translated CsFEVER, where the words "není" ("is not") and "pouze" ("only") showed high productivity of 0.57 and 0.55 and ended in the first 20 cues sorted by the harmonic mean. However, their impact on the quality of the entire dataset is limited as their coverage is not high, which is illustrated by their absence in the top-5 most influential cues. Similar results for the CTKFacts are presented in Table 5.

Baseline models
In this section, we explore the applicable models for both CsFEVER and CTKFacts. We train a round of currently best-performing document retrieval (DR) and natural language inference (NLI) models to examine the difficulty of the task and to establish a baseline for our datasets, mainly CsFEVER-NLI and CTKFactsNLI. We also give results for EnFEVER, providing a point of reference to the well-established dataset.

Document Retrieval
We provide four baseline models for the document retrieval stage: DrQA and Anserini represent classical keyword-search approaches, while multilingual BERT (M-Bert) and ColBert models are based on Transformer neural architectures. In line with FEVER [4], we employ the document retrieval part of the DrQA [32] model. The model was originally used for answering questions based on the Wikipedia corpus, which is relatively close to the task of fact-checking. The DR part itself is based on TF-IDF weighting of BoW vectors, optimized by using hashing. We calculated the TF-IDF index using the DrQA implementation for all unigrams and bigrams with 2^24 buckets.
Inspired by the criticism of choosing weak baselines presented in [39], we decided to validate our TF-IDF baseline against the proposed Anserini toolkit implemented by Pyserini [40].
We computed the index and then fine-tuned the k_1 and b hyper-parameters using grid search on a defined grid k_1 ∈ [0.6, 1.2], b ∈ [0.5, 0.9], both with step 0.1. On a sample of 10,000 training claims, we selected the best performing parameter values: for CsFEVER these were k_1 = 0.9 and b = 0.9, while for EnFEVER and CTKFacts we proceed with k_1 = 0.6 and b = 0.5.
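The grid itself is small (7 × 5 = 35 settings), which the following sketch makes explicit. The names `bm25_grid`, `grid_search` and `score_fn` are hypothetical: `score_fn` stands in for whatever retrieval metric is measured on the 10,000-claim sample for a given (k_1, b) pair, and the toy scoring function at the bottom is only there to show the call.

```python
from itertools import product


def bm25_grid():
    """All (k1, b) pairs with k1 in 0.6..1.2 and b in 0.5..0.9, step 0.1."""
    k1_values = [round(0.6 + 0.1 * i, 1) for i in range(7)]
    b_values = [round(0.5 + 0.1 * i, 1) for i in range(5)]
    return list(product(k1_values, b_values))


def grid_search(score_fn):
    """Return the (k1, b) pair maximising the hypothetical score_fn(k1, b)."""
    return max(bm25_grid(), key=lambda pair: score_fn(*pair))


# Toy stand-in for a retrieval metric, peaking at k1 = 0.9, b = 0.9.
best_toy = grid_search(lambda k1, b: -((k1 - 0.9) ** 2 + (b - 0.9) ** 2))
```

Rounding each grid value to one decimal avoids floating-point drift such as 0.8999999999999999 when the step is accumulated.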
Another model we tested is the M-Bert [15], which is a representative of Transformer architecture models. We used the same setup as in [41] with an added linear layer consolidating the output into an embedding of the required dimension 512. In the fine-tuning phase, we used the claims and their evidence as relevant (positive) passages. For multi-hop claims, based on combinations of documents, we split the combined evidence, so the queries are always constructed to relate to a single evidence document only. Unlike in [41], we used a smaller training batch size of 128 and learning rates 10^-5 for the ICT+BFS tuning and 5 × 10^-6 for the fine-tuning stage.
We used this fine-tuned model to generate 512-dimensional embeddings of the whole document collection. In the retrieval phase, we used the FAISS library [42] and constructed a PCA384 Flat index for CTKFacts and a Flat index for CsFEVER data. 24
The last tested model was the recent ColBert, which provides the benefits of both cross-attention and two-tower paradigms [43]. We have employed the implementation as provided by the authors, 25 changing the backbone model to M-Bert and adjusting for the special tokens. The training batch size was 32 and the learning rate 3 × 10^-6; we have used masked punctuation tokens, mixed precision and the L2 similarity metric.
The model was trained using triplets (query, positive paragraph, negative paragraph) with the objective to correctly classify paragraphs using a cross-entropy loss function. We constructed the training triplets so that the claim created by a human annotator was taken as a query, a paragraph containing evidence as a positive and a random paragraph from a randomly selected document as a negative sample.
As already stated, for the CTKFacts, the number of claims is significantly lower than for CsFEVER. Therefore, we increased the number of CTK training triplets: instead of selecting negative paragraphs from a random document, we selected them from an evidence document with the condition that the paragraph must not be used directly in the evidence. The number of training triplets was still low, so we also generated synthetic triplets as follows. We generated a synthetic query by extracting a random sentence from a random paragraph. The set of the remaining sentences of this paragraph was designated a positive paragraph. The negative paragraph was, once again, selected as a random paragraph of a random document. Then the title was used as a query instead of a random sentence, and a random paragraph from the article was used as a positive. The negative paragraph was selected in the same way as above.
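The two CTK-specific triplet sources described above can be sketched as follows. Everything here is an illustrative assumption rather than our training code: the helper names, the corpus layout (document id mapped to a list of paragraph strings), and the naive sentence splitting on full stops are hypothetical, and the title-based variant is left out.

```python
import random


def in_document_negative(paragraphs, evidence_idxs, rng):
    """Pick a negative from the evidence document itself, skipping any
    paragraph used in the evidence. Returns None if none is left."""
    candidates = [p for i, p in enumerate(paragraphs) if i not in evidence_idxs]
    return rng.choice(candidates) if candidates else None


def synthetic_triplet(corpus, rng):
    """Build (query, positive, negative) from raw text only.

    A random sentence of a random paragraph is the query, the remaining
    sentences form the positive, and a random paragraph of a random
    document is the negative (it is not checked against the positive).
    """
    doc_id = rng.choice(sorted(corpus))
    paragraph = rng.choice(corpus[doc_id])
    # Naive sentence splitting, an assumption made for brevity.
    sentences = [s.strip() for s in paragraph.split(".") if s.strip()]
    if len(sentences) < 2:
        return None  # nothing would remain as the positive
    q_idx = rng.randrange(len(sentences))
    query = sentences[q_idx]
    positive = ". ".join(s for i, s in enumerate(sentences) if i != q_idx) + "."
    negative = rng.choice(corpus[rng.choice(sorted(corpus))])
    return query, positive, negative
```

Sampling the negative from the evidence document yields harder negatives than a random document would, since such paragraphs share the topic of the claim without containing the evidence.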
We tried two setups here, 32 and 128 dimensional term representation (denoted ColBert32 and ColBert128) with document trimming to a maximum of 180 tokens on FEVER datasets. The results are shown in Table 6. Methods are compared by means of Mean Reciprocal Rank (MRR) given k ∈ {1, 5, 10, 20} retrieved documents. For CsFEVER, the neural network models achieve significantly better results, with ColBert taking the lead. In the case of CTKFacts, both Anserini and ColBert are the best performers. Interestingly, M-Bert fails in this task. We found that this is mainly caused by the M-Bert preference for shorter documents (including headings). As expected, the results for EnFEVER are comparable to those of CsFEVER: the Anserini performance improved, while ColBert performed slightly worse for the English corpus. Note that the EnFEVER corpus is more than ten times larger (5.4M pages) than the CsFEVER one (452k pages).
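For reference, MRR at a cut-off k can be computed as in the sketch below. It assumes the usual definition (reciprocal rank of the first relevant document within the top k, zero if none is found, averaged over queries); the function name and the toy rankings are hypothetical.

```python
def mrr_at_k(ranked_lists, gold_sets, k):
    """Mean Reciprocal Rank over queries, looking only at the top-k results.

    `ranked_lists[i]` is the ranked list of document ids retrieved for
    query i, `gold_sets[i]` the set of document ids counted as relevant.
    """
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


toy_rankings = [["a", "b", "c"], ["x", "y", "z"]]
toy_gold = [{"b"}, {"q"}]
```

With these toy inputs, the first query finds its evidence at rank 2 and the second never does, so the score is zero at k = 1 and rises once the cut-off admits rank 2.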

Natural Language Inference
The aim of the final stage of the fact-checking pipeline, the NLI task, is to classify the veracity of a claim based on the previously retrieved evidence. We have fine-tuned several different Czech-compatible Transformer models on our data in order to provide a strong baseline for this task.
From the multilingual models, we have experimented with SlavicBERT and Sentence M-Bert [44] models in their cased defaults, provided by the DeepPavlov library [45], as well as with the original M-Bert from [15,46].
We have further examined two pretrained XLM-RoBERTa-large models, one fine-tuned on an NLI-related SQuAD2 [47] downstream task, the other on the crosslingual XNLI [48] task. These were provided by Deepset 26 and HuggingFace 27.
Finally, we have performed a round of experiments with a pair of recently published Czech monolingual models. RobeCzech [49] was pretrained on a set of curated Czech corpora using the RoBERTa-base architecture. FERNET-C5 [50] was pretrained on a large crawled dataset, using the Bert-base architecture.
We have fine-tuned the listed models using four fact-checking datasets adapted for the NLI task using their gold sets of evidence. CTKFactsNLI uses our data collected in Chapter 4 as pairs of strings containing the claim and its concatenated evidence (or the text of its source paragraph for NEI claims). FEVER-NLI was extracted from the original EnFEVER dataset in [31], obtaining the single-string context by concatenating the evidence sentences for each claim, or a sample of 3-5 results of the proposed DR model for the NEI claims. CsFEVER-NLI is its full Czech translation (see Section 3.3). CsFEVER (NearestP) was obtained from the experimental dataset extracted in Chapter 3 using the entire Wiki abstracts for evidence; non-verifiable claim evidence was supplemented with the top result of our DrQA model established in Section 6.1.
For each dataset, we have fine-tuned all listed models using the sentence transformers implementation of the Cross-encoder [44] with two texts on input (concat-evidence, claim) and a single output value. We have experimented with multiple batch sizes for each model (varying from 2 to 10) and trained each using the Adam optimizer with a 2 × 10^-5 default learning rate and weight decay of 0.01. The number of linear warmup steps was determined by 10-30% of the train size, varying per experiment. Ultimately, we kept the best performing models in terms of dev accuracy for each model-dataset pair.
We then evaluated all models using the test splits of the dataset they were fine-tuned with. Results are presented in Table 7 and show dominance of the double-fine-tuned XLM-RoBERTa models. For reference, we also compare our results with previous research on English datasets: in the case of FEVER-NLI (and its Czech translation), our models achieved superiority over the NSMNs published in [31], which scored an overall 69.5 macro-F1. Our baselines for CTKFactsNLI and CsFEVER (NearestP) achieved an F-score of 76.9 and 83.2 percent, respectively. This is comparable with the [4] baseline, which scored 80.82% accuracy in a sentence-level NearestP setting on EnFEVER, using the Decomposable Attention model. Interestingly, the CsFEVER (NearestP) experiment results are consistently higher than those of CsFEVER-NLI. We consider this estimate overly optimistic and attribute it to a possible partial information leakage discussed in Section 3.1 and the CsFEVER noise examined in Section 3.2.1.

Full Pipeline Results
Similarly to [4], we give baseline results for the full fact verification pipeline. The pipeline is evaluated as follows: given a mutated claim m from the test set, k evidence paragraphs (documents) P = {p_1, . . ., p_k} are selected using document retrieval models as described in Section 6.1. Note that documents in P are ordered by decreasing relevancy. The paragraphs are subsequently fed to an NLI model of choice (see details below), and accuracy (for CsFEVER and EnFEVER) or F1 macro score (for the unbalanced CTKFacts) is evaluated. In case of supported and refuted claims, we analyze two cases: 1) for Score Evidence (SE), P must fully cover at least one gold evidence set, 2) for No Score Evidence (NSE), no such condition applies. No condition applies to NEI claims either. 28
While our paragraph-oriented pipeline eliminates the need for sentence selection, we have to deal with the maximum input size of the NLI models (512 tokens in all cases), which gets easily exceeded for larger k. Our approach is to iteratively partition P into n consecutive splits S = {s_1, . . ., s_l}, where l ≤ n. Each split s_i itself is a concatenation of successive documents s_i = {p_s, . . ., p_e}, where 1 ≤ s ≤ e ≤ n. A new split is created for any new paragraph that would cause input overflow. If any single tokenized evidence document is longer than the maximum input length, it gets represented by a single split and truncated 29. Moreover, each split is limited to at most k_s successive evidence documents (k_s = 2 for CsFEVER and EnFEVER, k_s = 3 for CTKFacts), so the overall average input length is more akin to data used to train the NLI models. In the prediction phase, all split documents p_s, . . ., p_e are concatenated, and, together with the claim m, fed to the NLI model getting predictions y_s, . . ., y_e, where each y_i = (y_i^SUPPORTS, y_i^REFUTES, y_i^NEI). The per-split predictions y_i^c are then combined by a weighted average for c ∈ {SUPPORTS, REFUTES, NEI}. This weighted average (we use λ = 1/2 in all cases) assigns higher importance to the higher-ranked documents.
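The splitting and rank-weighted averaging can be sketched as below. This is only an illustration under explicit assumptions: the exact weighting formula is not reproduced in this text, so the sketch assumes geometric weights λ^(i-1) normalised to sum to one, which is one way to give higher-ranked splits more weight. The function names are hypothetical, document lengths are given as token counts, and the tokens taken up by the claim itself are ignored.

```python
def make_splits(doc_lengths, max_tokens=512, max_docs=2):
    """Greedily group ranked documents into consecutive splits.

    `doc_lengths` holds token counts in ranking order. A new split starts
    when adding the next document would overflow `max_tokens` or exceed
    `max_docs` documents; an over-long document is truncated to
    `max_tokens` and therefore ends up alone in its split.
    """
    splits, current, used = [], [], 0
    for idx, length in enumerate(doc_lengths):
        length = min(length, max_tokens)
        if current and (used + length > max_tokens or len(current) >= max_docs):
            splits.append(current)
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        splits.append(current)
    return splits


def aggregate(split_predictions, lam=0.5):
    """Rank-weighted average of per-split class scores.

    ASSUMPTION: weights are lam**i for the i-th split (0-based),
    normalised to sum to one; the source formula may differ.
    """
    weights = [lam ** i for i in range(len(split_predictions))]
    norm = sum(weights)
    labels = split_predictions[0].keys()
    return {
        c: sum(w * p[c] for w, p in zip(weights, split_predictions)) / norm
        for c in labels
    }
```

For example, with `max_docs=2`, token counts `[300, 300, 100, 100, 600]` are grouped into the splits `[[0], [1, 2], [3], [4]]`: the second document overflows the first split, the fourth hits the document limit, and the last one is truncated into its own split.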
The results are presented in Table 8. We evaluate Anserini and ColBert DR models followed by the overall best-performing XLM-RoBERTa @ SQuAD2 NLI model (RoBERTa-large, described in Section 6.2) for all datasets. For both FEVER-based datasets, ColBert document retrieval yields significantly better results. For CTKFacts, Anserini and ColBert perform similarly, with Anserini giving slightly better results overall, which mirrors the DR results described in Section 6.1. Note that the SE to NSE difference is more pronounced for CTKFacts, which can be explained by the high redundancy of CTKFacts paragraphs w.r.t. CsFEVER. Comparing the results of the EnFEVER baseline to [4], our models perform better in spite of discarding the sentence selection stage: the best reported EnFEVER accuracies for 5 retrieved documents were 52.09% for NSE and 32.57% for SE [4]. The improvement can be explained by more recent and sophisticated models used in our study (the authors of [4] used Decomposable Attention [51] at the NLI stage). Note that the EnFEVER full pipeline significantly outperforms CsFEVER, which we explain by the information leakage and noise of the dataset used to train the CsFEVER (NearestP) model as discussed in the previous section.

Conclusion
With this article, we examined two major ways to acquire Czech data for automated fact-checking.
Firstly, we localized the EnFEVER dataset, using a document alignment between Czech and English Wikipedia abstracts extracted from the interlingual links. We obtain and publish the CsFEVER dataset of 127k machine-translated claims with evidence enclosed within the Czech Wikipedia dump. We then validate our alignment scheme and measure a 66% precision using hand annotations over a 1% sample of obtained data. Therefore, we recommend the data for models less sensitive to noise and we proceed to utilize CsFEVER for training non-critical retrieval models for our annotation experiments and for recall estimation of our baseline models. Furthermore, we publish a CsFEVER-NLI dataset of 228k context-query pairs directly translated from English to Czech that bypass its issue with noise for the subroutine task of Natural Language Inference.
Secondly, we executed a series of human annotation runs with 163 students of journalism to acquire a novel dataset in Czech. As opposed to similar annotations that extracted claims and evidence from Wikipedia [4,6,20], we annotated our dataset on top of a CTK corpus extracted from a news agency archive to explore this different relevant language form. We collected a raw dataset of 3,116 labeled claims, 57% of which have at least two independent cross-annotations. From these, we calculate Krippendorff's alpha to be 56.42% and 4-way Fleiss' κ to be 63%. We proceed with manual and human-and-model-in-the-loop annotation cleaning to remove conflicting and malformed annotations, arriving at the thoroughly cleaned CTKFacts dataset of 3,097 claims and their veracity annotations complemented with evidence from the CTK corpus. We release its version for NLI called CTKFactsNLI to maintain corpus trade secrecy.
Finally, we use our datasets to train baseline models for the full fact-checking pipeline composed of Document Retrieval and Natural Language Inference tasks.

Future work
• The fact-checking pipeline is to be augmented by check-worthiness estimation [52], that is, a model that classifies which sentences of a given text in Czech are appropriate for fact verification. We are currently working on models that detect claims within the Czech Twitter, and a strong predictor for this task would also strengthen our annotation scheme from Section 4.5 that currently relies on hand-picked check-worthy documents.
• While the SUPPORTS, REFUTES and NEI classes offer a finer classification w.r.t. evidence than binary true/false, it is a good convention of fact-checking services to use additional labels such as MISINTERPRETED, which could be integrated into the common automated fact verification scheme if well formalised.
• The claim extraction schemes like that from [4] or Section 4.5 do not necessarily produce organic claims capturing the real-world complexity of fact-checking. For example, just the EnFEVER train set contains hundreds of claims of the form "X is a person.". This problem does not have a trivial solution, but we suggest integrating real-world claim sources, such as Twitter, into the annotation scheme.
• While the FEVER localization scheme from Section 3.1 yielded a rather noisy dataset, its size and document precision encourage deployment of a model-based cleaning scheme like that from [53] to further refine its results. I.e., a well-performing NLI model could do well in pruning the invalid datapoints of CsFEVER without further annotations.
-28% of our CsFEVER sample pairs were invalid due to NOT ENOUGH INFO in the proposed Czech Wikipedia abstracts, 5% sample claims were invalidated by an inadequate translation.

Fig. 1
Confusion matrix of the CsFEVER localization scheme

Table 2
Label distribution in the CsFEVER dataset as opposed to EnFEVER. The test split of EnFEVER is not public.

Table 3
Label distribution in CTKFacts splits before and after cleaning.

Table 4
Productivity, coverage and their harmonic mean calculated on CsFEVER dataset claims sorted by the decreasing harmonic mean.

Table 5
Productivity, coverage and their harmonic mean calculated on ČTK dataset claims sorted by the harmonic mean.

Table 7
F1 macro score (%) comparison of our Bert-like models fine-tuned for the NLI task on CTKFactsNLI, CsFEVER, and CsFEVER-NLI datasets. Gold evidence was used as the NLI context for each claim. FEVER-NLI from [31] listed for comparison with other research. B stands for Bert architecture, R for RoBERTa.

Table 8
Full pipeline results. Accuracy (%) shown for CsFEVER and EnFEVER, F1 macro score (%) for CTKFacts. Unlike NSE (No Score Evidence), SE (Score Evidence) demands correct evidence to be retrieved.