Crowdsourcing Truthfulness: The Impact of Judgment Scale and Assessor Bias
News content can sometimes be misleading and influence users’ decision-making processes (e.g., voting decisions). Quantitatively assessing the truthfulness of content becomes key, but it is often challenging and thus done by experts. In this work we look at how experts and non-experts assess the truthfulness of content, focusing on the effect of the adopted judgment scale and of assessors’ own bias on the judgments they perform. Our results indicate a clear effect of the assessors’ political background on their judgments: they tend to trust content that is aligned with their own beliefs, even if experts have marked it as false. Crowd assessors also seem to have a preference towards coarse-grained scales, as they tend to use a few extreme values rather than the full breadth of fine-grained scales.
The credibility of information available online varies, and the presence of untrustworthy information has major implications for our safety online [5, 12, 15]. The recent increase in online misinformation can be attributed to technologies that have enabled a new level of strategic political propaganda. Social media platforms and their data allow for extreme personalization of content, which makes it possible to customise information for individuals. Given that the majority of people access news from social media platforms [13], such strategies can be used to influence decision-making processes [1, 14].
In this constantly evolving scenario, it is key to understand how people perceive the truthfulness of information presented to them. To this end, in this paper we collect data from US-based crowd workers and compare it with expert annotation data generated by fact-checkers such as PolitiFact. Our dataset contains multiple judgments of truthfulness of information collected from several non-expert assessors to measure agreement levels and to identify controversial content. We also collect judgments over two different judgment scales and collect information about assessors’ background that allows us to analyse assessment bias. The dataset we created is publicly available at https://github.com/KevinRoitero/crowdsourcingTruthfulness.
The results of our analysis indicate that: (1) crowd judgments can be aggregated to approximate expert judgments; (2) there is a political bias in crowd-generated truthfulness labels, where crowd assessors tend to believe statements more when they come from speakers of the same political party they voted for in the last election; and (3) there seems to be a preference for coarse-grained scales, where crowd assessors tend to use the extreme values in the scale more often than other values.
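Finding (1) depends on combining the several crowd judgments collected for each statement into a single score. The text here does not say which aggregation function is used, so the following is only a minimal sketch that assumes a plain mean (with the median as an alternative). The function name `aggregate` and the toy judgments are hypothetical and are not taken from the dataset.

```python
from statistics import mean, median


def aggregate(judgments, how=mean):
    """Combine per-worker scores into one score per statement.

    `judgments` is an iterable of (statement_id, score) pairs.
    `how` is the aggregation function (assumed here: mean or median).
    """
    by_statement = {}
    for statement_id, score in judgments:
        by_statement.setdefault(statement_id, []).append(score)
    return {sid: how(scores) for sid, scores in by_statement.items()}


# Toy S100 judgments (made up for illustration only).
toy_judgments = [
    ("s1", 90), ("s1", 100), ("s1", 20),
    ("s2", 0), ("s2", 10), ("s2", 50),
]

mean_scores = aggregate(toy_judgments)            # mean per statement
median_scores = aggregate(toy_judgments, median)  # median per statement
print(mean_scores)
print(median_scores)
```

The median is less sensitive to a single outlying worker than the mean, which matters when workers favour the extreme values of the scale.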
2 Related Work
Crowdsourcing has previously been used as a methodology in information credibility research. For example, Zubiaga and Heng looked at how tweet credibility can be assessed by Amazon MTurk workers in the context of disaster management. Their results show that it is difficult for crowd workers to properly assess the truthfulness of tweets in this context, but that the reliability of the source is a good indicator of trusted information. Kriplean et al. [6] analyse how volunteer crowdsourcing can be used for fact-checking by simulating the democratic process. The Fact-checking Lab at CLEF [3, 9] looks at this problem by defining the task of ranking sentences according to their need to be fact-checked. Maddalena et al. [7] focus on the ability of the crowd to assess news quality along eight different quality dimensions. Roitero et al. [10] use crowdsourcing to study user perception of fake news statements. In contrast to previous studies on crowdsourcing for information credibility tasks, we look at bias in the data due to the assessor and to the rating scale used to collect labels, in the context of the truthfulness of statements by US politicians.
3.1 Dataset Description
In our study we use the PolitiFact dataset constructed by Wang [16]. This dataset contains 12800 statements by politicians with truth labels produced by expert fact-checkers on a 6-level scale: True, Mostly True, Half True, Barely True, False, and Lie. For this work, we randomly sampled a subset of 120 statements from the PolitiFact dataset, ensuring that the sample included a balanced number of statements per class and per political party.
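A balanced random sample of this kind can be drawn by stratifying on truth label and party. The sketch below assumes an equal split of 10 statements for each of the 12 (label, party) combinations, which is one way to reach 120 but is not stated in the text. The field names `label` and `party`, the function `balanced_sample`, and the synthetic pool are all hypothetical.

```python
import random
from collections import defaultdict

LABELS = ["True", "Mostly True", "Half True", "Barely True", "False", "Lie"]
PARTIES = ["Dem", "Rep"]


def balanced_sample(statements, per_stratum, seed=0):
    """Draw `per_stratum` statements at random from every (label, party) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for statement in statements:
        strata[(statement["label"], statement["party"])].append(statement)

    sample = []
    for label in LABELS:
        for party in PARTIES:
            pool = strata[(label, party)]
            if len(pool) < per_stratum:
                raise ValueError(f"not enough statements for {(label, party)}")
            sample.extend(rng.sample(pool, per_stratum))
    return sample


# Synthetic stand-in for the full dataset: 100 statements per stratum.
pool = [
    {"id": i, "label": LABELS[i % 6], "party": PARTIES[(i // 6) % 2]}
    for i in range(1200)
]

subset = balanced_sample(pool, per_stratum=10)
print(len(subset))  # 12 strata x 10 statements = 120
```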
3.2 Crowdsourcing Setup
We crowdsourced judgments for the 120 statements, each judged by 10 distinct crowd workers, across 400 HITs on Amazon MTurk, asking US-based workers to label the truthfulness of statements from the dataset. Each HIT, rewarded with $1.20 (i.e., $0.15 for each statement), consisted of 8 statements for which we asked for an assessment using either the original 6-level scale (S6) or a 100-level scale (from 0 to 100) with a slider set by default at 50 (S100). The 8 statements included 2 gold questions used to quality check the workers’ responses, which had to be consistent with the expert ground truth. Apart from the gold questions, each HIT contained 3 statements by Republican party speakers and 3 by Democratic party speakers. In addition to the judgments, crowd workers were also asked to provide a justification for each of their judgments and a URL pointing to the source of information supporting their judgment. At the beginning of the HIT each worker was asked to complete a demographics questionnaire; it also included questions about their political orientation, used to classify crowd assessors as aligned to the US Democratic party (Dem) or the US Republican party (Rep).
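The figures in this setup can be checked with a short calculation. The sketch below assumes that every statement received 10 judgments on each of the two scales, which is the reading consistent with 400 HITs. The total reward it prints is derived from the stated numbers, is not a figure reported in the text, and excludes any platform fees.

```python
# Values stated in the setup description.
STATEMENTS = 120
JUDGMENTS_PER_STATEMENT = 10   # assumed to hold for each scale separately
STATEMENTS_PER_HIT = 8
GOLD_PER_HIT = 2
SCALES = 2                     # S6 and S100
REWARD_PER_HIT_CENTS = 120     # $1.20, kept in cents to avoid float rounding

# Derived quantities.
non_gold_per_hit = STATEMENTS_PER_HIT - GOLD_PER_HIT            # 6 (3 Rep + 3 Dem)
judgments_per_scale = STATEMENTS * JUDGMENTS_PER_STATEMENT      # 1200
hits_per_scale = judgments_per_scale // non_gold_per_hit        # 200
total_hits = hits_per_scale * SCALES                            # 400
reward_per_statement_cents = REWARD_PER_HIT_CENTS // STATEMENTS_PER_HIT  # 15
total_reward_cents = total_hits * REWARD_PER_HIT_CENTS          # 48000

print(f"HITs per scale: {hits_per_scale}, total HITs: {total_hits}")
print(f"Reward per statement: ${reward_per_statement_cents / 100:.2f}")
print(f"Total worker reward (derived): ${total_reward_cents / 100:.2f}")
```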
Table: Most frequently used support URLs over both scales, with and without gold questions (columns: S6 + Gold, S6 − Gold, S100 + Gold, S100 − Gold).
4.1 Judgment Distributions
4.2 Crowd vs. Experts
4.3 Crowd Assessor Bias
Figure 4 shows how crowd assessors labelled statements, compared to ground truth expert labels, based on their political background. We can see that crowd assessors who voted for the Rep party tend to assign higher truthfulness scores, especially for the Lie and False ground truth labels, showing that, on average, they believe content more than crowd assessors who voted for the Dem party.
When comparing how crowd workers assess statements differently based on who the speaker is, we can observe that True statements obtain higher scores from crowd assessors who voted for the speaker’s party. That is, Dem workers assigned an average score of 84.54 on S100 and 5.48 on S6 to True statements by Dem speakers and only 81.83 on S100 and 5.00 on S6 to True statements by Rep speakers. Rep workers assigned an average score of 81.89 on S100 and 5.35 on S6 to True statements by Rep speakers and only 73.24 on S100 and 4.73 on S6 to True statements by Dem speakers. While this is an expected behaviour, we also notice that Dem crowd assessors appear to be more skeptical than Rep crowd assessors by showing a lower average judgment score for untrue statements (e.g., Fig. 4, top row).
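The size of the in-party preference can be read off the averages reported above by subtracting the out-party score from the in-party score. The sketch below does only that arithmetic on the four reported pairs; the dictionary layout and the name `in_party_gap` are hypothetical. Gaps on S100 and S6 are on different ranges and are not directly comparable.

```python
# Mean scores for True statements, as reported in the text.
# Key: (worker party, scale); value: mean score by speaker party.
MEAN_SCORES = {
    ("Dem", "S100"): {"Dem": 84.54, "Rep": 81.83},
    ("Dem", "S6"): {"Dem": 5.48, "Rep": 5.00},
    ("Rep", "S100"): {"Dem": 73.24, "Rep": 81.89},
    ("Rep", "S6"): {"Dem": 4.73, "Rep": 5.35},
}


def in_party_gap(worker_party, scale):
    """Mean score for same-party speakers minus mean score for other-party speakers."""
    scores = MEAN_SCORES[(worker_party, scale)]
    out_party = "Rep" if worker_party == "Dem" else "Dem"
    return round(scores[worker_party] - scores[out_party], 2)


for worker_party in ("Dem", "Rep"):
    for scale in ("S100", "S6"):
        gap = in_party_gap(worker_party, scale)
        print(f"{worker_party} workers, {scale}: in-party gap = {gap}")
```

On these numbers the gap is positive for both groups and on both scales, and it is larger for Rep workers (8.65 on S100, 0.62 on S6) than for Dem workers (2.71 on S100, 0.48 on S6).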
In this paper we presented a dataset of crowdsourced truthfulness judgments for political statements and compared the collected judgments across different crowd assessors and judgment scales, and with expert judgments.
Our results show that: (1) crowd judgments, if properly aggregated, are comparable to expert ones; (2) crowd assessors’ political background has an impact on how they label political statements: they show a tendency to be more lenient towards statements by politicians of the same political orientation as their own; and (3) crowd assessors seem to have a preference towards coarse-grained judgment scales for truthfulness judgments.
This work is partially supported by an Australian Research Council Discovery Project (DP190102141) and a Facebook Research award.
- 1.Bittman, L., Godson, R.: The KGB and Soviet Disinformation: An Insider’s View. Pergamon-Brassey’s (1985)
- 2.Cuzzocrea, A., Bonchi, F., Gunopulos, D. (eds.): Proceedings of the CIKM 2018 Workshops co-located with 27th ACM International Conference on Information and Knowledge Management (CIKM 2018), Torino, Italy, 22 October 2018, CEUR Workshop Proceedings, vol. 2482. CEUR-WS.org (2019). http://ceur-ws.org/Vol-2482
- 4.Han, L., Roitero, K., Maddalena, E., Mizzaro, S., Demartini, G.: On transforming relevance scales. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM) (2019)
- 6.Kriplean, T., Bonnar, C., Borning, A., Kinney, B., Gill, B.: Integrating on-demand fact-checking with public dialogue. In: Proceedings of CSCW, pp. 1188–1199 (2014)
- 7.Maddalena, E., Ceolin, D., Mizzaro, S.: Multidimensional news quality: a comparison of crowdsourcing and nichesourcing. In: Cuzzocrea et al. [2]. http://ceur-ws.org/Vol-2482/paper17.pdf
- 8.Maddalena, E., Mizzaro, S., Scholer, F., Turpin, A.: On crowdsourcing relevance magnitudes for information retrieval evaluation. ACM Trans. Inf. Syst. 35(3), 19:1–19:32 (2017). https://doi.org/10.1145/3002172
- 10.Roitero, K., Demartini, G., Mizzaro, S., Spina, D.: How many truth levels? Six? One hundred? Even more? Validating truthfulness of statements via crowdsourcing. In: Cuzzocrea et al. [2]. http://ceur-ws.org/Vol-2482/paper38.pdf
- 11.Roitero, K., Maddalena, E., Demartini, G., Mizzaro, S.: On fine-grained relevance scales. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2018, pp. 675–684. ACM, New York (2018). https://doi.org/10.1145/3209978.3210052
- 12.Self, C.C.: Credibility. In: An Integrated Approach to Communication Theory and Research, pp. 449–470. Routledge (2014)
- 13.Shearer, E., Gottfried, J.: News use across social media platforms 2017. Pew Research Center 7 (2017)
- 15.Viviani, M., Pasi, G.: Credibility in social media: opinions, news, and health information—a survey. Wiley Interdis. Rev.: Data Min. Knowl. Discov. 7(5), e1209 (2017)
- 16.Wang, W.Y.: “Liar, Liar Pants on Fire”: a new benchmark dataset for fake news detection. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 422–426 (2017)