To perform repeated observations of the crowd annotating misinformation (i.e., a longitudinal study) with different sets of workers, we re-launched the HITs of the main experiment three further times, each one month apart. In this section we detail how the data was collected (Section 6.1) and the findings derived from our analysis to answer RQ6 (Section 6.2), RQ7 (Section 6.3), and RQ8 (Section 6.4).
The longitudinal study is based on the same dataset (see Section 4.1) and experimental setting (see Section 4.2) as the main experiment. The crowdsourced judgments were collected as follows. The data for the main experiment (from now on denoted Batch1) was collected in May 2020. In June 2020 we re-launched the HITs from Batch1 with a new set of workers (i.e., we prevented the workers of Batch1 from performing the experiment again); we denote this set of data Batch2. In July 2020 we collected an additional batch: we re-launched the HITs from Batch1 with novice workers (i.e., we prevented the workers of Batch1 and Batch2 from performing the experiment again); we denote this set of data Batch3. Finally, in August 2020, we re-launched the HITs from Batch1 for the last time, preventing workers of the previous batches from performing the experiment, and collected the data for Batch4. We then considered an additional set of experiments: for a given batch, we contacted the workers from previous batches, sending them a $0.01 bonus and asking them to perform the task again. We obtained the datasets detailed in Table 2, where BatchXfromY denotes the subset of workers who performed BatchX and had previously participated in BatchY. Note that an experienced (returning) worker who does the task for the second time generally gets a new HIT assigned, i.e., a HIT different from the one performed originally; we have no control over this, since HITs are assigned to workers by the MTurk platform. Finally, we also considered the union of the data from Batch1, Batch2, Batch3, and Batch4; we denote this dataset Batchall.
RQ6: Repeating the experiment with novice workers
Worker background, behavior, bias, and abandonment
We first studied the variation in the composition of the worker population across batches. To this aim, we used a generalized linear mixed model (GLMM) together with the analysis of variance (ANOVA) to analyze how worker behavior changes across batches, and measured the impact of such changes. In more detail, we considered the ANOVA effect size ω², an unbiased index used to provide insight into the population-wide relationship between a set of factors and the studied outcomes [14,15,16, 50, 70]. With this setting, we fitted a linear model to measure the effect of age, school, and all other possible answers to the questions in the questionnaire (B) on individual judgment quality, measured as the absolute distance between the worker judgments and the expert ones using the mean absolute error (MAE). By inspecting the ω² index, we found that while all the effects are either small or absent, the largest effects are provided by workers' answers to the taxes and southern border questions. We also found that the effect of the batch is small but not negligible, and is of the same order of magnitude as the effects of the other factors. We also computed the interaction plots considering the variation of the factors from the previous analysis across the different batches. Results suggest a small or non-significant interaction between the batch and all other factors. This analysis suggests that, while differences among batches are present, the population of workers which performed the task is homogeneous, and thus the different datasets (i.e., batches) are comparable.
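As an illustration, ω² can be derived from a standard ANOVA table. The following is a minimal sketch in Python, with synthetic data and hypothetical column names (mae, taxes, southern_border, batch) standing in for the real per-worker table:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical data: in the real analysis this would be the questionnaire
# answers of each worker joined with that worker's MAE.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "mae": rng.normal(1.4, 0.4, n),
    "taxes": rng.choice(["agree", "disagree"], n),
    "southern_border": rng.choice(["agree", "disagree"], n),
    "batch": rng.choice(["Batch1", "Batch2", "Batch3", "Batch4"], n),
})

# Linear model of judgment quality (MAE) on worker factors and batch.
model = ols("mae ~ C(taxes) + C(southern_border) + C(batch)", data=df).fit()
anova = sm.stats.anova_lm(model, typ=2)

# Omega squared: omega^2 = (SS_eff - df_eff * MS_err) / (SS_tot + MS_err)
ms_error = anova.loc["Residual", "sum_sq"] / anova.loc["Residual", "df"]
ss_total = anova["sum_sq"].sum()
anova["omega_sq"] = (anova["sum_sq"] - anova["df"] * ms_error) / (ss_total + ms_error)
print(anova["omega_sq"].drop("Residual"))
```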
Table 3 shows the abandonment data for each batch of the longitudinal study, indicating the number of workers who completed, abandoned, or failed the task (due to failing the quality checks). Overall, the abandonment ratio is well balanced across batches, with the only exception of Batch3, which shows a small increase in the number of workers who failed the task; nevertheless, this small variation is not significant and might be caused by a slightly lower quality of the workers who started Batch3. On average, Table 3 shows that 31% of the workers completed the task, 50% abandoned it, and 19% failed the quality checks; these values are aligned with previous studies.
Agreement across batches
We now turn to the quality of both individual and aggregated judgments across the different batches. Measuring the correlation between individual judgments, we found rather low values: between Batch1 and Batch2, ρ = 0.33 and τ = 0.25; between Batch1 and Batch3, ρ = 0.20 and τ = 0.14; between Batch1 and Batch4, ρ = 0.10 and τ = 0.074; between Batch2 and Batch3, ρ = 0.21 and τ = 0.15; between Batch2 and Batch4, ρ = 0.10 and τ = 0.085; finally, between Batch3 and Batch4, ρ = 0.08 and τ = 0.06.
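For reference, a minimal sketch of how these pairwise correlations can be computed with SciPy, assuming the judgments of two batches are aligned per statement (the vectors below are illustrative placeholders):

```python
from itertools import combinations
from scipy.stats import spearmanr, kendalltau

# Hypothetical aligned judgment vectors: one value per statement.
batches = {
    "Batch1": [0, 1, 3, 5, 2, 4],
    "Batch2": [1, 1, 2, 5, 3, 4],
}

for a, b in combinations(batches, 2):
    rho, _ = spearmanr(batches[a], batches[b])
    tau, _ = kendalltau(batches[a], batches[b])
    print(f"{a} vs {b}: rho={rho:.2f}, tau={tau:.2f}")
```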
Overall, the most recent batch (Batch4) achieves the lowest correlation values w.r.t. the other batches, followed by Batch3. The highest correlation is achieved between Batch1 and Batch2. This preliminary result suggests that the time-span in which we collected the judgments of the different batches has an impact on judgment similarity across batches: batches launched close to each other in time tend to be more similar than the others.
We now turn to the aggregated judgments, to study whether this relationship still holds when individual judgments are aggregated. Figure 7 shows the agreement between the aggregated judgments of Batch1, Batch2, Batch3, and Batch4. The plot shows on the diagonal the distribution of the aggregated judgments, in the lower triangle the scatterplots between the aggregated judgments of the different batches, and in the upper triangle the corresponding ρ and τ correlation values. The plots show that the correlation values of the aggregated judgments are greater than those measured for individual judgments. This is consistent across all batches. In more detail, the agreement between Batch1 and Batch2 (ρ = 0.87, τ = 0.68) is greater than the agreement between any other pair of batches; we also see that the correlation values between Batch1 and Batch3 are similar to those between Batch2 and Batch3. Furthermore, it is again the case that Batch4 achieves lower correlation values with all the other batches.
Overall, these results show that (i) individual judgments differ across batches, but become more consistent when aggregated; (ii) the correlation shows a trend of degradation, as early batches are more consistent with each other than more recent batches; and (iii) batches which are closer in time are also more similar.
Crowd accuracy: external agreement
We now analyze the external agreement, i.e., the agreement between the crowd collected labels and the expert ground truth. Figure 8 shows the agreement between the PolitiFact experts (x-axis) and the crowd judgments (y-axis) for Batch1, Batch2, Batch3, Batch4, and Batchall; the judgments are aggregated using the mean.
Focusing on the plots, we can see that, overall, the individual judgments are in agreement with the expert labels, as shown by the median values of the boxplots, which increase as the ground truth truthfulness level increases. Nevertheless, Batch1 and Batch2 show a clearly higher agreement level with the expert labels than Batch3 and Batch4. Furthermore, as already noted in Fig. 1, it is again the case that for all the aggregation functions the pants-on-fire and false categories are perceived in a very similar way by the workers; this again suggests that workers have clear difficulties in distinguishing between the two categories. Within each plot, the median values of the boxplots increase when going from pants-on-fire to true (i.e., from left to right on the x-axis), with the exception of Batch3 and, more evidently, Batch4. This indicates that, overall, the workers are in agreement with the PolitiFact ground truth, and that this holds when repeating the experiment at different time-spans. Nevertheless, there is an unexpected behavior. The data for the batches were collected across different time-spans; thus, it seems intuitive that the more time passes, the more the workers should be able to recognize the true category of each statement (for example, by seeing it online or reported in the news). Figure 8, however, tells a different story: it appears that the more time passes, the less agreement we find between the crowd-collected labels and the expert ground truth. This behavior can be caused by many factors, which are discussed in the next sections. Finally, the last plot of Fig. 8 shows that Batchall behaves similarly to Batch1 and Batch2: apart from the pants-on-fire and false categories, the median values of the boxplots increase going from left to right on the x-axis, indicating that also in this case the workers are in agreement with the PolitiFact ground truth.
The previous analyses showed differences in how the statements are evaluated across batches; to investigate whether the same statements are ordered consistently over the different batches, we computed the ρ, τ, and rank-biased overlap (RBO) correlation coefficients between the scores aggregated using the mean, among batches, for the PolitiFact categories. We set the RBO parameter so that the top-5 results get about 85% of the weight of the evaluation. Table 4 shows these correlation values. The upper part of Table 4 shows the ρ and τ correlation scores, while the bottom part shows the bottom- and top-heavy RBO correlation scores. Given that statements are sorted by their aggregated score in decreasing order, the top-heavy version of RBO emphasizes the agreement on the statements which are mis-judged for the pants-on-fire and false categories; conversely, the bottom-heavy version of RBO emphasizes the agreement on the statements which are mis-judged for the true category.
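A minimal sketch of the RBO computation (after Webber et al.), assuming each ranking is a list of statement IDs sorted by aggregated score; the persistence parameter p shown below is an assumed value, chosen only to make the measure top-heavy:

```python
def rbo(ranking_a, ranking_b, p=0.6):
    """Rank-biased overlap of two rankings (finite, non-extrapolated form).

    Smaller p makes the measure more top-heavy; in the paper p is tuned so
    that the top-5 ranks receive about 85% of the total weight.
    """
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b, score = set(), set(), 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        score += (p ** (d - 1)) * len(seen_a & seen_b) / d  # agreement at depth d
    return (1 - p) * score

def rbo_bottom_heavy(ranking_a, ranking_b, p=0.6):
    """Bottom-heavy variant: apply RBO to the reversed rankings."""
    return rbo(ranking_a[::-1], ranking_b[::-1], p)
```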
As we can observe by inspecting Tables 4 and 5, there is rather low agreement in how the same statements are judged across different batches, both when considering the absolute values (i.e., ρ) and their relative ranking (i.e., both τ and RBO). Focusing on the RBO metric, we see that in general the statements which are mis-judged differ across batches, with the exception of those in the false category for Batch1 and Batch2 (RBO top-heavy = 0.85), and those in the true category, again for the same two batches (RBO bottom-heavy = 0.92). This behavior also holds for statements which are correctly judged by workers: in fact, we observe an RBO bottom-heavy correlation value of 0.81 for false and an RBO top-heavy correlation value of 0.5 for true. This is another indication of the similarity between Batch1 and Batch2.
Crowd accuracy: internal agreement
We now analyze the quality of the work of the crowd by computing the internal agreement (i.e., the agreement among workers) for the different batches. Table 6 shows the correlation between the agreement values measured with α and Φ for the different batches. The lower triangular part of the table shows the correlation measured using ρ, the upper triangular part the correlation obtained with τ. To compute the correlation values we considered the α and Φ values on all PolitiFact categories; for the purpose of computing the correlations on Φ we considered only the mean value, not the upper 97% and lower 3% confidence intervals. As we can see from Table 6, the highest correlation values are obtained between Batch1 and Batch3 when considering α, and between Batch1 and Batch2 when considering Φ. Furthermore, Φ generally leads to lower correlation values, especially for Batch4, which shows a correlation of almost zero with the other batches. This is an indication that Batch1 and Batch2 are the two most similar batches (at least according to Φ), and that the other two batches (i.e., Batch3 and especially Batch4) are composed of judgments made by workers with different internal agreement levels.
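As an illustration, assuming α denotes Krippendorff's alpha, it can be computed with the open-source krippendorff package (an assumption: the paper does not state which implementation was used). Rows are workers, columns are statements, and NaN marks statements a worker did not judge:

```python
import numpy as np
import krippendorff

# Hypothetical judgment matrix: 3 workers x 4 statements, ordinal scale.
reliability_data = np.array([
    [0,      1, np.nan, 3],
    [1,      1, 2,      np.nan],
    [np.nan, 0, 2,      3],
])
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```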
Worker behavior: time and queries
Analyzing the amount of time spent by the workers on each statement position in the task, we found a confirmation of what was already found in Section 5.5: the average time spent by the workers on the first statements is considerably higher than the time spent on the last statements, for all batches. This confirms a learning effect: workers learn to assess truthfulness faster as they spend time performing the task. We also found that as the batch number increases, the average time spent on all documents decreases substantially: for the four batches, the average time spent on each document is respectively 222, 168, 182, and 140 s. Moreover, we performed a statistical test between each pair of batches and found that each comparison is significant, with the only exception of Batch2 compared against Batch3. This decreasing time might indeed be a cause of the degradation in quality observed as the batch number increases: if workers spend on average less time on each document, it is plausible to assume that they spend less time thinking before assessing the truthfulness of each document, or less time searching for an appropriate and relevant source of evidence before judging the statement.
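The text does not name the statistical test used; the sketch below assumes a two-sided Mann-Whitney U test on the per-document times of each pair of batches, with synthetic data standing in for the real measurements:

```python
import numpy as np
from itertools import combinations
from scipy.stats import mannwhitneyu

# Synthetic per-document times (seconds), centered on the reported means.
rng = np.random.default_rng(0)
times = {name: rng.normal(mean, 50, 300)
         for name, mean in [("Batch1", 222), ("Batch2", 168),
                            ("Batch3", 182), ("Batch4", 140)]}

for a, b in combinations(times, 2):
    u, p = mannwhitneyu(times[a], times[b], alternative="two-sided")
    flag = " (significant)" if p < 0.05 else ""
    print(f"{a} vs {b}: U={u:.0f}, p={p:.4f}{flag}")
```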
To investigate more deeply the cause of this quality decrease in recent batches, we now inspect the queries issued by the workers in the different batches. We found that the trend of issuing fewer queries as the statement position increases is still present, although less evident (but not significantly so) for Batch2 and Batch3. Thus, we can still say that the tendency of workers to issue fewer queries the more time they spend on the task holds, probably due to fatigue, boredom, or learning effects.
Furthermore, it is again the case that on average, for all statement positions, each worker issues more than one query: workers often reformulate their initial query. This provides further evidence that they put effort into performing the task and suggests an overall high quality of the collected judgments.
We also found that only a small fraction of queries (i.e., less than 2% for all batches) correspond to the statement itself. This suggests that the vast majority of workers put significant effort into writing queries, which we might take as an indication of their willingness to perform high quality work.
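The exact matching criterion is not specified; a simple sketch, assuming a case- and punctuation-insensitive exact match between query and statement:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def is_statement_query(query: str, statement: str) -> bool:
    """True if the worker's query is (a normalized copy of) the statement."""
    return normalize(query) == normalize(statement)

# Hypothetical usage on one statement's queries.
queries = ["covid vaccine timeline?", "Covid vaccine ready by fall 2020"]
statement = "COVID vaccine ready by fall 2020."
share = sum(is_statement_query(q, statement) for q in queries) / len(queries)
print(f"{share:.0%} of queries match the statement")
```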
Sources of information: URL analysis
Figure 9 shows the rank distributions of the URLs selected as evidence by the workers when performing each judgment. As in Fig. 5, URLs selected less than 1% of the time are filtered out. As we can see from the plots, the trend is similar for Batch1 and Batch2, while Batch3 and Batch4 display a different behavior. For Batch1 and Batch2, about 40% of workers select the first result retrieved by the search engine, and select results further down the ranking less frequently; in contrast, about 30% of workers from Batch3 and less than 20% of workers from Batch4 select the first result retrieved by the search engine. We also note that the behavior of workers from Batch3 and Batch4 is closer to a model where the user clicks randomly on the retrieved list of results; moreover, the spike at ranks 8, 9, and 10 for Batch4 might be caused by workers from that batch scrolling directly down the user interface with the aim of finishing the task as fast as possible, without putting any effort into providing meaningful sources of evidence.
To provide further insight into the observed change in worker behavior associated with the usage of the custom search engine, we now investigate the sources of information provided by the workers as justification for their judgments. Investigating the top 10 websites from which the workers choose the URL to justify their judgments, we found that, similarly to Fig. 5, there are many fact-checking websites among the top 10 URLs: snopes is always the top ranked website, and factcheck is always present in the ranking. The only exception is Batch4, in which each fact-checking website appears at lower rank positions. Furthermore, we found that medical websites such as cdc are present in only two batches out of four (i.e., Batch1 and Batch2), and that the Raleigh area news website wral is present in the top positions in all batches apart from Batch3; this is probably caused by the workers' locations differing among batches, leading them to use different sources of information. Overall, this analysis confirms that workers tend to use various kinds of sources as URLs from which they take information, again suggesting that they put effort into finding evidence to provide reliable truthfulness judgments.
As a further analysis, we investigated the amount of change in the URLs retrieved by our custom search engine, focusing in particular on the inter- and intra-batch similarity. To do so, we proceeded as follows. We selected the subset of judgments for which the statement is used as a query; we cannot consider the rest of the judgments because the difference in the URLs retrieved would be caused by the different queries issued. To be sure that we selected a representative and unbiased subset of workers, we measured the MAE of the two populations of workers (i.e., those who used the statement as a query and those who did not); in both cases the MAE is almost the same: 1.41 for the former and 1.46 for the latter. Then, for each statement, we considered all possible pairs of workers which used the statement as a query. For each pair we measured, considering the top 10 URLs retrieved, the overlap between the lists of results; to do so, we considered three different metrics: the rank-based fraction of documents which are the same in the two lists, the number of elements in common between the two lists, and RBO. We obtained a number in the [0,1] range, indicating the fraction of overlapping URLs between the two workers. Note that since the query issued is the same for both workers, the change in the ranked list returned can only be caused by some internal policy of the search engine (e.g., considering the IP of the worker who issued the query, or load balancing policies). When measuring the similarity between the lists, we considered either the complete URL or the domain only; we focus on the latter option: in this way, if an article moved, for example, from the landing page of a website to another section of the same website, we are able to capture such behavior. The findings are consistent also when considering the full URL. Then, in order to normalize for the fact that the same query can be issued by a different number of workers, we computed the average of the similarity scores for each statement among all the workers. Note that this normalization step is optional and the findings do not change. After that, we computed the average similarity score for the three metrics; we found that the similarity of lists from the same batch is greater than the similarity of lists from different batches: in the former case we have similarity scores of respectively 0.45, 0.64, and 0.72, while in the latter 0.14, 0.42, and 0.49.
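A sketch of the first two list-similarity measures over two top-10 result lists (the third, RBO, is the function sketched earlier), assuming lists are reduced to domains beforehand; function names are illustrative:

```python
from urllib.parse import urlparse

def domains(urls):
    """Reduce full URLs to domains, the variant the analysis focuses on."""
    return [urlparse(u).netloc for u in urls]

def positional_overlap(a, b):
    """Rank-based fraction of positions holding the same item in both lists."""
    return sum(x == y for x, y in zip(a, b)) / min(len(a), len(b))

def common_elements(a, b):
    """Fraction of items the two lists share, regardless of rank."""
    return len(set(a) & set(b)) / min(len(set(a)), len(set(b)))

# Hypothetical result lists returned to two workers for the same query.
list_a = domains(["https://snopes.com/x", "https://cdc.gov/y"])
list_b = domains(["https://snopes.com/z", "https://wral.com/w"])
print(positional_overlap(list_a, list_b), common_elements(list_a, list_b))
```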
We now turn to the effect of using different kinds of justifications on worker accuracy, as done in the main analysis. We analyze the textual justifications provided, their relation with the web pages at the selected URLs, and their link with worker quality.
Figure 10 shows the relation between different kinds of justifications and worker accuracy, as done for Fig. 6. The plots show the prediction error for each batch, calculated at each point of difference between expert and crowd judgments, distinguishing whether the text inserted by the worker was copied or not from the selected web page. While Batch1 and Batch2 are very similar, Batch3 and Batch4 present important differences. Statements on which workers make fewer errors (i.e., where x-axis = 0) tend to use text copied from the selected web page. Conversely, statements on which workers make more errors (i.e., values close to ±5) tend to use text not copied from the selected web page. Overall, workers of Batch3 and Batch4 tend to make more errors than workers from Batch1 and Batch2. As for Fig. 6, the differences between the two groups of workers are small, but this might be an indication that workers of higher quality tend to read the text of the selected web page and report it in the justification box. Looking at the plots, we can see that the distribution of the prediction error is not symmetric, as the frequency of errors is higher on the positive side of the x-axis ([0,5]) for Batch1, Batch2, and Batch3; Batch4 shows a different behavior. These errors correspond to workers overestimating the truthfulness of the statements. The right part of the plot is considerably higher for Batch3 with respect to Batch1 and Batch2, confirming that workers of Batch3 are of lower quality.
RQ7: Analysis of returning workers
In this section we study the effect of returning workers on the dataset; in particular, we investigate whether workers who performed the task more than once are of higher quality than workers who performed the task only once.
To investigate the quality of returning workers, we proceeded as follows. We considered each possible pair of datasets where the former contains returning workers and the latter contains workers who performed the task only once. For each pair, we considered only the subset of HITs performed by returning workers. For this set of HITs, we compared the MAE and CEM scores of the two sets of workers.
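A minimal sketch of the comparison on the shared HITs, using MAE (CEM would follow the same pattern; its implementation is omitted here). Data values are hypothetical:

```python
import numpy as np

def mae(crowd, expert):
    """Mean absolute error between crowd and expert labels."""
    return float(np.mean(np.abs(np.asarray(crowd) - np.asarray(expert))))

# Hypothetical judgments on the same subset of HITs.
expert    = [0, 2, 5, 3, 1]
returning = [1, 2, 4, 3, 1]   # workers who performed the task before
one_time  = [2, 4, 5, 1, 0]   # workers who performed the task once

# Positive difference => returning workers are closer to the expert labels.
print(mae(one_time, expert) - mae(returning, expert))
```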
Figure 11 shows on the x-axis the four batches and on the y-axis the batch containing returning workers (“2f1” denotes Batch2from1, and so on), with each value representing the difference in MAE (first plot) and CEM (second plot); a cell is colored green if the returning workers have higher quality than the workers who performed the task once, red otherwise. As we can see from the plot, the behavior is consistent across the two metrics. Apart from a few cases involving Batch4 (and with a small difference), returning workers always have similar or higher quality than the other workers; this is more evident when the reference batch is Batch3 or Batch4 and the returning workers are from Batch1 or Batch2, indicating the high quality of the data collected in the first two batches. This is a somewhat expected result and reflects the fact that people gain experience by doing the same task over time; in other words, they learn from experience. At the same time, we believe that such behavior is not to be taken for granted, especially in a crowdsourcing setting. Returning workers could instead have focused on passing the quality checks in order to get the reward, without caring about performing the task well; our findings show that this is not the case and that our quality checks are well designed.
We also investigated the average time spent on each statement position across batches. We found that the average time spent for Batch2from1 is 190 s (vs. 169 s for Batch2), 199 s for Batch3from1or2 (vs. 182 s for Batch3), and 213 s for Batch4from1or2or3 (vs. 140 s for Batch4). Overall, returning workers spent more time on each document than the novice workers of the corresponding batch. We also performed a statistical test between each pair of batches of new and returning workers and found statistical significance (p < 0.05) in 12 tests out of 24.
RQ8: Qualitative analysis of misjudged statements
To investigate whether the statements which are mis-judged by the workers are the same across batches, we performed the following analysis. For each PolitiFact category, we sorted the statements according to their MAE (i.e., the absolute difference between the expert and the worker label) and investigated whether this ordering is consistent across batches; in other words, we investigated whether the most mis-judged statements are the same across different batches. Figure 12 shows, for each PolitiFact category, the relative ordering of its statements sorted by decreasing MAE (the statement with rank 1 is the one with the highest MAE). From the plots we can manually identify, for each PolitiFact category, some statements which are consistently mis-judged across batches. In more detail, these statements are the following (sorted by MAE): for pants-on-fire: S2, S8, S7, S5, S1; for false: S18, S14, S11, S12, S17; for mostly-false: S21, S22, S25; for half-true: S31, S37, S33; for mostly-true: S41, S44, S42, S46; for true: S60, S53, S59, S58.
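A sketch of the per-category ordering underlying Fig. 12, with illustrative column names and values:

```python
import pandas as pd

# Hypothetical per-judgment absolute errors.
df = pd.DataFrame({
    "batch":     ["Batch1"] * 4 + ["Batch2"] * 4,
    "category":  ["false", "false", "true", "true"] * 2,
    "statement": ["S18", "S14", "S60", "S53"] * 2,
    "abs_error": [4.1, 3.8, 2.9, 2.5, 3.0, 4.2, 2.0, 3.1],
})

# Per-statement MAE within each (batch, category) pair.
per_stmt = (df.groupby(["batch", "category", "statement"])["abs_error"]
              .mean().rename("mae").reset_index())
# Rank 1 = highest MAE within each (batch, category) pair.
per_stmt["rank"] = (per_stmt.groupby(["batch", "category"])["mae"]
                      .rank(ascending=False, method="first").astype(int))
print(per_stmt.sort_values(["batch", "category", "rank"]))
```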
We manually inspected the justifications for the 24 selected statements to investigate the causes of failure.
For all statements analyzed, most of the errors in Batch3 and Batch4 are caused by workers who answered randomly, generating noise. Answers were categorized as noise when the following two criteria were met: (1) the chosen URL is unrelated to the statement (e.g., a Wikipedia page defining the word “Truthfulness” or a website to create flashcards online); and (2) the justification text provides no explanation for the truthfulness value chosen (neither personal nor copied from a URL different from the selected one). We found that noisy answers become more frequent with every new batch and account for almost all the errors in Batch4: the number of judgments with a noisy answer for the four batches is respectively 27, 42, 102, and 166, while the number of non-noisy answers is respectively 159, 166, 97, and 54. The non-noise errors in Batch1, Batch2, and Batch3 seem to depend on the statement. By manually inspecting the justifications provided by the workers, we identified the following main reasons for failing to identify the correct label.
In four cases (S53, S41, S25, S14), the statements were objectively difficult to evaluate: they either required extreme attention to detail in the medical terms used (S14), concerned highly debated points (S25), or required knowledge of legislation (S53).
In four cases (S42, S46, S59, S60), the workers were not able to find relevant information, so they decided to guess. The difficulty in finding information was justified: some statements were too vague to find useful information (S59), others had little official data on the matter (S46), or the issue had already been resolved and other news on the same topic had taken its place, making the web search more difficult (S60, S59, S42) (e.g., truck drivers had trouble getting food at fast food restaurants, but the issue was solved and news outlets started covering the new problem of the lack of truck drivers to restock supermarkets and fast food chains).
In four cases (S33, S37, S59, S60), the workers retrieved information which covered only part of the statement. Sometimes this happened by accident (S60, information on Mardi Gras 2021 instead of Mardi Gras 2020) or because the workers retrieved information from generic websites, which allowed them to prove only part of the claim (S33, S37).
In four cases (S2, S8, S7, S1), pants-on-fire statements were labeled as true, probably because they had actually been stated by the person in question. In these cases the workers used a fact-checking site as the selected URL, sometimes even explicitly writing in the justification that the statement was false, but still selected true as the label.
In thirteen cases (S7, S8, S2, S18, S22, S21, S33, S37, S31, S42, S44, S58, S60), the statements were deemed more true (or more false) than they actually were, because workers focused on part of the statement or reasoned about how plausible it sounded. In most cases the workers found a fact-checking website which reported the ground truth label, but decided to modify their answer based on their personal opinion. True statements from politics were doubted (S60, about nobody suggesting to cancel Mardi Gras) and false statements were excused as exaggerations used to convey the gravity of the moment (S18, about church services not resuming until everyone is vaccinated).
In five cases (S1, S5, S17, S12, S11), the statements were difficult to prove or disprove (for lack of trusted articles or test data) and reported concerning information (mainly on how the coronavirus can be transmitted and how long it can survive). Most of the workers retrieved fact-checking articles which labeled the statements as false or pants-on-fire, but chose an intermediate rating. In these cases, the written justifications contained personal opinions or excerpts from the selected URL which instilled some doubt (e.g., tests being not definitive enough, lack of knowledge about the behavior of the virus) or suggested it is safe to act under the assumption of a worst-case scenario (e.g., avoiding buying products from China, leaving packages in the sunlight to try and kill the virus).
Following the results of the failure analysis, we removed the worst individual judgments (i.e., the noisy ones); we found that the effect on the aggregated judgments is minimal, and the resulting boxplots are very similar to those obtained in Fig. 1 without removing the judgments.
We investigated how the correctness of the judgments correlates with the attributes of the statement (namely position, claimant, and context) and with the passing of time. We computed the absolute distance from the correct truthfulness value for each judgment in a batch and then aggregated the values by statement, obtaining the mean absolute error (MAE) and standard deviation (STD) for each statement. For each batch we sorted the statements in descending order of MAE and STD, selected the top-10 statements, and analyzed their attributes. When considering the position (of the statement in the task), the mis-judged statements are spread across all positions, for all batches; thus, this attribute does not have any particular effect. When considering the claimant and the context, we found that most of the mis-judged statements have Facebook User as claimant, which is also the most frequent source of statements in our dataset. To investigate the effect of time, we plotted the MAE of each statement against the time passed from the day the statement was made to the day it was evaluated by the workers. This was done for all the batches of novice workers (Batch1 to Batch4) and returning workers (Batch2from1, Batch3from1or2, Batch4from1or2or3). As Fig. 13 shows, the trend of MAE (dotted lines) is similar for all batches: statements made in April (the leftmost ones for each batch) have more errors than those made at the beginning of March and in February (the rightmost ones for each batch), regardless of how much time has passed since the statement was made. Looking at the top part of Fig. 13, we can also see that the MAE tends to grow with each new batch of workers (black dashed trend line). The previous analyses suggest that this is probably not an effect of time, but of the decreasing quality of the workers. This is also suggested by the lower part of the figure, which shows that the MAE tends to remain stable over time for returning workers (who were shown to be of higher quality). We can also see that the trend of every batch remains the same for returning workers: statements made in April and at the end of March remain the most difficult to assess. Overall, the time elapsed since the statement was made seems to have no impact on the quality of the workers' judgments.
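A sketch of the elapsed-time computation behind Fig. 13, with hypothetical dates and values; a correlation near zero between elapsed days and MAE would support the conclusion that elapsed time has no impact on judgment quality:

```python
import pandas as pd

# Hypothetical statement dates and per-statement MAE for one batch.
stmts = pd.DataFrame({
    "statement": ["S1", "S2", "S3"],
    "made_on":   pd.to_datetime(["2020-02-10", "2020-03-05", "2020-04-01"]),
    "mae":       [1.1, 1.4, 2.3],
})
batch_date = pd.Timestamp("2020-05-15")  # assumed launch date of the batch

# Days between the statement being made and its evaluation by the workers.
stmts["days_elapsed"] = (batch_date - stmts["made_on"]).dt.days
print(stmts["days_elapsed"].corr(stmts["mae"]))
```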