1 Introduction

1.1 Social media risks

Although social media can be an excellent platform for knowledge sharing and co-creation, it also exposes users, especially adolescents, to several threats and risks. These threats include fake news, fake identities, and conspiracy theories that can alter users’ perceptions of current events and news. On social networks, news posts often consist of a mixture of images and text, and the images are particularly influential in persuading users: they appear to convey and justify the accompanying information. When users see an image supporting a claim, they are more likely to believe the textual information it accompanies. Today, images can easily be manipulated with AI and then used to spread false information (Zannettou et al., 2019).

The use of social media as a source of information has increased, especially among the younger generation (Shearer et al., 2018). Individuals use social media both to seek information and to share it (Jin & Liu, 2010). According to a study released by the European Commission, more than four out of five respondents (83%) in the European Union believe that fake news on the Internet is a problem for the union and its democracy. The survey also underlined the importance of quality media: respondents perceive traditional media as the most trusted sources of news (radio 70%, TV 66%, and print 63%). Online news sources and video hosting websites, by contrast, receive the lowest trust ratings, with only 26% and 27% of respondents trusting them, respectively (Eurobarometer, 2018).

According to a report by the Pew Research Center and the Knight Foundation, about two-thirds of American adults (62%) get news from social networks, and around 20 percent do so regularly (Shearer & Gottfried, 2017). Silverman and Singer note that social media plays a significant role in the news ecosystem. Different types of false information circulate online, especially on social media, including fake news, rumors, hoaxes, clickbait, and other shenanigans (Zannettou et al., 2019). A crisis situation, in which there is a high level of uncertainty and a high demand for public information, is likely to lead to more widespread dissemination of misinformation through social media than through other media channels (Spence et al., 2016).

Deep fakes, generated through machine learning-based artificial intelligence, can potentially deceive the human eye and scam millions of users. Interestingly, the most successful viral deep fake image was not created by disinformation warriors of a dictatorship or political consultants, but rather by a 31-year-old construction worker who enjoyed experimenting with the AI image generator Midjourney.Footnote 1 This particular deep fake was an image of Pope Francis wearing an ankle-length, belted white puffer coat, which was later revealed to be a fake (Fetters Maloy & Branigin, 2023).

One effect of crises such as COVID-19 on society was the rise of cybercrime, including but not limited to disinformation campaigns and the spread of fake news, which undermine the social fabric, cause civil unrest, and amplify the emotional consequences: fear, anxiety, and insecurity.

On social media, users who are not exposed to opposing opinions, especially in crises, may gather in separate biased groups, also called “filter bubbles”. A filter bubble is a state of intellectual isolation in which users are surrounded by ideas similar to their own. Filter bubbles can result from people’s own actions, such as interacting only with people who share their beliefs and values, or from algorithmic curation performed automatically by social media platforms; in the long term, this can lead to biased beliefs about phenomena and information (Pariser, 2011). Unfortunately, social platforms present and filter posts that align with prevailing opinions in order to increase user engagement and exposure to advertisements (Bucher, 2016).

People are more likely to seek information that supports their beliefs rather than information that may challenge or contradict them, because they anticipate finding information that confirms their preexisting beliefs (Nickerson, 1998). Confirmation bias can therefore prevent individuals from learning new things.

Given the threats originating from misinformation and fake news, the development of “social media literacy” has been identified as an important social and pedagogical goal (Livingstone, 2014). With our VLC system, we work towards this objective by providing a safe environment for dealing with misinformation: detecting it and counteracting it, e.g., by providing different resources and opposing opinions (cross-referencing), especially among junior school students, to avoid the formation of biased poles in a community.

In this line of work, we provide a solution targeting media sources that use pictorial information. In this approach, we use the information surrounding an image in a simulated social media mini-environment to assist young learners and guide them in making an adequate judgment based on the different information sources they can find on the web.

1.2 The Courage project

The Courage project is an ongoing European research partnership aiming to create a learning environment that assists and educates adolescents in dealing with discrimination and harmful material found on social media. This includes addressing issues such as the spread of false information and disinformation, the promotion of conspiracy theories, and harmful practices such as hate speech, bullying, and cyber-mobbing. The ability to protect oneself and to remain resilient is imperative for reducing the harmful effects of such content. In dealing with fake news, Courage emphasizes fostering comprehension rather than resorting to avoidance or relying on external safeguards such as censorship or filtering. The project focuses on nurturing critical thinking abilities in individuals. Courage provides a modular virtual learning companion (VLC) as a chatbot that is adaptable to different environments and focuses on addressing toxic content (Aprin et al., 2023).

As part of this study, we introduce a VLC that works in an environment containing manipulated images of the kind used as misinformation on social media. By combining “Reverse Image Search” (RIS) with images as cues, the VLC can provide comparative contextual information. From an educational standpoint, we perceive the concept of learning from diverse contexts as a fresh approach to tackling the issue of fake news. Rather than simply combating it, the project aims to provide an educational framework that enables individuals to gain valuable insights and knowledge from various sources and contexts. RIS provides a specific type of support by retrieving and giving access to other instances of the same image in different contexts. Using Natural Language Processing (NLP), the VLC delivers keywords and phrases from these other websites. The companion also asks knowledge activation questions and motivates the learners to fact-check and remain skeptical about social media posts (Aprin et al., 2022).

General human-oriented approaches to identifying and combating fake news on social media are crucial for tackling misinformation. Such human-oriented strategies are to be distinguished from current AI-based techniques for automatically detecting fake news. Given the educational orientation of our project, human-oriented procedures are particularly relevant. We also comment on systems that support learners/users in fake news detection and increase awareness. Finally, we elaborate on the roots of our approach in the Intelligent Tutoring System (ITS) tradition of learning companion systems. In this approach, we investigate the role of recommendations delivered via interaction with the VLC based on the data logged during the experiment. We analyzed the data extracted from the school trials, which were guided by the instructor in person and by the companion through a chat platform. We are particularly interested in the level of disagreement before and after a recommendation (Johnson & Johnson, 1979).

1.3 Specific research questions

The work reported on in this paper, and especially the design and evaluation of our classroom experiment, was guided by the following research questions:

  • RQ1: Was the VLC effective for engaging learners in a high school classroom in meaningful interactions with stimulus material in a simulated social media environment?

  • RQ2: Was the accuracy of the judgment positively influenced by the time spent on visiting or reading web pages related to the target picture?

  • RQ3: Did a revision of the initial judgment guided by the VLC lead to a higher agreement with the expert opinion and with other participants’ judgments (convergence)?

In our classroom evaluation, we use stimulus material containing authentic and manipulated images whose status (real or fake) was previously known based on provenance information and expert judgments.

1.4 Scientific contribution and relevance

  • Critical thinking across disciplines: The VLC nurtures skills essential for interpreting scientific and other claims propagated in social media.

  • Information / media literacy: In the digital age, the VLC approach offers a robust method for students to verify the credibility and relevance of data and develop media literacy skills.

  • Empowering independent learning: The VLC solution supports modern pedagogy's emphasis on self-directed learning, aiding students in information verification during independent study.

  • Ethical and responsible behavior: The tool equips students to be responsible digital citizens, facilitating ethical decision-making and informed debates.

  • Interdisciplinary applications: The VLC can be adapted for specific academic requirements, enhancing the rigor of student research across disciplines.

In summary, the research cultivates an informed, critically-thinking learner, fulfilling both media literacy objectives and broader educational goals across various subjects.

2 Background

In this chapter, we elaborate on the human perception of misinformation from the perspective of social science, its reasons, and its solutions. We will describe intelligent tutoring systems and learning companion systems, which are fundamental to our work. Finally, we will explore similar pedagogical approaches and tools that are comparable to our VLC system.

2.1 Perception and (Human) judgment of “Fake or Real”

Susceptible Host, Virulent Pathogen, and Conducive Environment are the three vertices in Victoria L. Rubin’s conceptual triangle model for disinformation and misinformation (Rubin, 2019). The model states that fake news only spreads when all three causative conditions occur simultaneously. She proposed three treatments to stop the interplay of the components mentioned: automation to neutralize the aggressive pathogen, education to disarm the susceptible host, and regulation to create a secure environment for behavior.

Automated, AI-based methods can identify harmful characteristics and label information items. The LiT.RL News Verification Browser, a search engine for newsreaders, reporters, editors, and information specialists, is an illustration of this strategy. This tool examines the language used on digital news websites to evaluate whether they contain clickbait (with 94% precision on a test set of 5670 texts in an asynchronous experiment), satirical news, or fake news. LiT.RL displays the information in color-coded groups to visualize the results. LiT.RL’s categorization is not always precise and may not be suitable for use by the general public, and multimedia content is not supported (Rubin et al., 2019).

Automated news verification systems based on NLP methodologies can help content creators by assisting them in efficiently verifying common features of misinformation. By filtering out and highlighting suspect storylines, such strategies might assist information experts in reducing information overload for news consumers or assist schools in teaching critical content evaluation skills (Chen et al., 2015).

The current social media classification method is sensitive to any keyword that might represent fake news (e.g., COVID-19 or vaccine). However, it is difficult to trace and identify fraudulent information. For instance, extensive false image manipulation makes it challenging for both machines and people to recognize fake information (Nguyen et al., n.d.).

The second treatment is “Education”, but we would like to know which barriers can impact the efficiency of education and why learners resist changing their beliefs even when rational facts contradict them.

Badke contends that rather than carefully examining news and facts, people merely pay attention to what they expect or wish to see. He claimed that this results from confirmation bias, the psychological tendency for people to look for information that supports and verifies their beliefs rather than critically analyzing all the available data (Badke, 2018).

The Dunning-Kruger effect refers to a cognitive bias that occurs when individuals hold a belief that they possess greater intelligence and competence than they do. Individuals with lower abilities lack the necessary skills to recognize their incompetence. This combination of limited self-awareness and lower cognitive capacity results in overestimating their capabilities (Pennycook et al., 2017).

The illusion of explanatory depth is another factor contributing to the difficulty of altering beliefs. Research shows that a significant proportion (over 40%) of non-expert participants made errors when asked to depict and answer questions about bicycles in an abstract manner. This suggests that individuals possess a vague, incomplete, and frequently erroneous comprehension of everyday objects (Lawson, 2006).

A related approach to enhancing learners’ critical thinking skills involves providing a “fact-checking awareness tool”. These tools are specifically designed to involve learners in the process of fact-checking. The American Library Association (ALA) suggests that unrestricted access to accurate information from various contexts, without censorship or filtering, is the most impactful strategy for combating disinformation and media manipulation (McDonald & Levine-Clark, 2017).

Kyza and colleagues introduce several strategies to combat misinformation, including critically and reflectively reviewing the information, acting appropriately, refusing to share or like content that contains misinformation, flagging posts for review, and looking carefully at a post’s justification and its potential for misinformation (Kyza et al., 2020). Throughout their study, policymakers frequently mentioned that citizens should also be included in the effort to weed out false information. Their study also proposed several strategies and software methods with the potential to increase regular users’ resilience to false information. For instance, when inaccurate information is corrected, such as a social media post containing a link to an inaccurate news article, the post should be accompanied by fact-checked statistics and access to the officially corrected content.

The final aspect within the triangle is the “Regulatory” component. A comprehensive legislative effort is needed as the global Internet Society (ISOC)Footnote 2 seeks to eradicate these pathogenic “Fakes” by disrupting their ability to reach and “contaminate” susceptible hosts in a digital environment conducive to them. The European Commission has warned that social media companies will face new regulations unless they tackle fake news urgently.Footnote 3 Some programs have been implemented under the leadership of the EU and international organizations to address disinformation related to the pandemic.Footnote 4

In this work, we focus on finding credible techniques for the target group that belong to the field of ‘Education’, i.e., disarming the vulnerable host rather than filtering content on social media.

2.2 Learning companion systems

Learning Companion Systems (LCS) are a specific type of Intelligent Tutoring Systems (ITS). According to an extensive definition, a learning companion can be described as a kind of educational agent who plays a non-authoritative role in a social learning environment. LCS are distinguished by the provision of tailored support and adaptive feedback via an explicit agent or partner (Chou et al., 2003). Learning companions stimulate student learning through competition and collaboration (Chan & Baskin, 1990). LCS implementations can use multimedia elements, chatbot dialogue techniques, speech input/output, animation, virtual reality, or other interaction techniques.

Regarding intelligent support, LCS often incorporate machine learning and NLP techniques to facilitate communication between the LCS and the learners. Logging and tracking the student’s interactions with an LCS is used for student modeling. A practical approach could involve prompting the learner to articulate the reasoning behind their answers through reflective questions, particularly when it comes to assessing the potential authenticity of news articles. By incorporating self-explanation techniques, each task step would contribute to a more profound and comprehensive learning experience. Drawing on self-regulated learning strategies, learners would be encouraged to generate multiple responses and provide explanations that enhance their understanding and clarify any misconceptions they may have (Chi et al., 1989).

According to the definitions of adaptive systems, adaptability is one of the most crucial challenges for a teaching system incorporating an LCS. Depending on the context and the learner’s prior replies, the system should adapt and react to the learner’s responses and activities. The agent’s feedback to the learner’s text input can be a conversational response, providing guidance or further discussion. Alternatively, the agent can offer a visual response in various formats, such as recommending an instructive video or presenting relevant visual content that aligns with the learner’s needs and objectives (Aleven et al., 2013). For students with both low and high prior knowledge, modeling students as a basis for adaptive feedback in LCS lesson dialogs can significantly accelerate learning (Katz et al., 2021).

In an instructional situation, an LCS can play a variety of roles, for instance, the role of a critic who questions learners’ proposals or of a leader who proposes new ideas (Goodman et al., 2016).

Hietala and Niemirepo observed that learners could lose motivation if they constantly interact with a highly knowledgeable and flawless companion. This raises the question of how knowledgeable the learning companion agent should be to meet the learner’s expectations and maintain their motivation to engage with the agent. Interestingly, a companion that initially makes mistakes, similar to humans, can be more beneficial when facing a challenging task or addressing a novel problem (Hietala & Niemirepo, 1998).

Our Virtual Learning Companion (VLC) provides functions such as role-playing with the learner and offers adaptive feedback based on prior interactions and the learner’s answers. In addition, based on RIS links, it delivers educational recommendations and analytical artifacts for evaluating the images in the environment, together with knowledge activation questions. The VLC system’s primary functions are displaying learning material and processing input.

2.3 Similar approaches in educational contexts

With a similar rationale, Kyza et al. (2021) present Co-Inform,Footnote 5 a toolset that addresses issues of misinformation on the web and social media. The approach includes detection, raising awareness, linking fact-checked articles and corrections, and supporting resistance against misinformation. It uses a Chrome plugin that works on Twitter and combines artificial intelligence models with human input. The authors examined how technological, contextual, and personal factors affect whether people “like” or “share” misinformation.

The plugin helps address misinformation by examining tweets with AI and a rule engine to assess their credibility. It then presents the credibility measurements and explanations to the user via a blurring mechanism (‘blurry’), as shown in Fig. 1, which discourages engagement with non-credible posts and which the user can turn off.

Fig. 1

Left: non-credible text is blurred to draw the user’s attention. Right: credibility scale for the user to add a picture source (Image from Co-inform, an H2020 project that received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 770302 H2020-EU.3.6.—SOCIETAL CHALLENGES—Europe In A Changing World—Inclusive, Innovative And Reflective Societies)

In their evaluation, eighty participants (n = 80) were divided into an experimental group (n = 40) with credibility labels and a control group (n = 40) without credibility labels. Each group was confronted with a curated Twitter timeline containing credible and non-credible posts. The results indicated the significance of the technological intervention: the absence of the plugin was strongly correlated with the acceptance of misinformation (by “liking”). Participants with stronger trust profiles (who trusted the software plugin) were less likely to spread misinformation, and there was a correlation between trust in the technology and technological acceptance. The results show that social media users can be prevented from spreading misinformation in an authentic social media environment through a co-created technological solution. A dashboard was also developed for fact-checking journalists and policymakers, indicating which misinformation has been identified, where it originates, how it is expected to spread, what the public’s current and expected impression is, and what major remarks have been made about it (Zschache, 2022).

Instead of filtering fake news on social media, there are games and tools to increase learners’ awareness. By playing the role of a Twitter editor in the Harmony SquareFootnote 6 and Bad NewsFootnote 7 games (Fig. 2), learners learn about six common techniques involved in spreading disinformation: impersonation, emotion, polarization, conspiracy, discredit, and trolling.

Fig. 2

In the "Bad News" game the user's Twitter behavior is simulated

In these games, players are exposed to fake news techniques and win by attracting the most followers with their headlines. The games award the player a badge for each technique they learn. The outcome of an empirical study of Bad News shows that learners’ ability to recognize deception tactics improves compared to a gamified control group. The game also raises players’ confidence in their own judgment and can confer psychological resistance to typical online deception tactics across nations with different background cultures (Basol et al., 2020).

To help teenagers critically reflect on digital advertising, Media Smarts (old name MNet)Footnote 8 has already prepared and suggested a serious mini-game platform and a booklet (De Jans et al., 2019). Some of Media Smarts’ free games for teenage learners are designed specifically to address prejudices and biases resulting from a lack of information. Critical thinking skills are promoted through these games. The organizers of Media Smarts explain in their agenda that their objective is to raise awareness among teenagers by encouraging them to examine the information and seek alternative viewpoints (Titley et al., 2015).

Media Smarts introduced the Reality CheckFootnote 9 game to increase users’ awareness of the sources of online news. Players discover how to find evidence, such as where a story originated, and learn how to compare a news source with others. The use of fact-checking websites, tools, and reverse image searches is described for the users in narrative form. Like the Reality Check game, our method focuses on teaching fact-checking skills and motivating learners to fact-check by comparing different resources in order to identify dubious news and reduce bias. The Bad News and Harmony Square games, in contrast, focus on improving learners’ ability to judge content before engaging with it.

As reported by Gabriel (2021), the TV/radio channel Südwestrundfunk (SWR) offers games on misinformation to foster discussion and education. To address deficits in judging the credibility of information in public media, two games were developed, one for kids and one for a mature audience, to enhance critical thinking. The children's game Reality CheckFootnote 10 focuses on "picture tricks," "advertisement or not," and chain letters, guided by virtual companions. Feedback and examples are provided to deepen understanding (Gabriel, 2021).

In the mature version of fake finder,Footnote 11 users interact with predefined social media posts to determine their authenticity. They select a screenshot and an avatar, then respond to a character's query about whether a post is fake or real. Users often need to conduct research before making a judgment. Results and feedback are provided. Educators use this game to assess students' understanding of misinformation by observing their choices and analyzing responses to post-game questions. Students use pseudonyms for privacy and link them to their real names later.

Our focus in the Courage project is to test our Virtual Learning Companion (VLC) in classrooms with the supervision of teachers and provide and record the different stages of the learner inside the environment before and after interaction with external resources.

2.4 Measures of agreement/disagreement

In a prior study, we have looked at the categorization of social media items as potentially being of a “toxic” type using labels such as “hate speech” or “discrimination” (Malzahn et al., 2023). Here, we were particularly interested in controversial judgments as a potential trigger for classroom discussions. This led us to select an adequate measure of agreement or disagreement between raters (again, young learners) regarding their categorizations. In this context, the actual ratings (labels or tags) were defined on a nominal scale, which excluded the use of most dispersion measures from descriptive statistics. Among the remaining options were the “dispersion index” described by Walker (1999) and the measure of “group disagreement” that was originally conceived from the perspective of collaboration research (Whitworth, 2007). There is a direct correspondence of measures of disagreement (D) with measures of agreement (A). If these measures are normalized on a scale from 0 to 1, this correspondence is expressed by the equation D = 1 – A. This suggests that known measures of agreement, such as those used to calculate inter-rater reliability, could be used inversely. Accordingly, we also considered Fleiss’ kappa (Fleiss, 1971) as a possible option.

Mathematical analysis and comparison of these measures led to the insight that Fleiss’ kappa is the measure that exactly corresponds to Whitworth’s group disagreement (GD), i.e., it equals 1 – GD. The dispersion index DI (Walker, 1999) differs from this only in the normalization factor, yet shows better limiting behavior for small numbers of categories. The measure is calculated in the following way:

$$DI=\frac{K\left(N^{2}-\sum_{k=1}^{K}f_{k}^{2}\right)}{N^{2}\left(K-1\right)}$$

where N is the number of raters, K is the number of categories, and f_k is the number of ratings falling into category k (its frequency), so that \(\sum_{k} f_{k}^{2}\) is the sum of the squared frequencies.

Based on this preliminary analysis, we have applied the same measure to quantify the agreement/disagreement between the learners in our experiment regarding their ratings.
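To make the computation concrete, the following TypeScript sketch (illustrative only, not part of the deployed system) calculates the dispersion index DI and the corresponding agreement value 1 − DI from a list of nominal ratings; the example labels mirror the five credibility categories used later in the paper.

```typescript
// Dispersion index (Walker, 1999) over nominal ratings; agreement = 1 - DI.
// `ratings` holds one label per rater, `categories` lists all K possible labels.
function dispersionIndex(ratings: string[], categories: string[]): number {
  const N = ratings.length;    // number of raters
  const K = categories.length; // number of categories
  // f_k: frequency of ratings in category k
  const freqs = categories.map((c) => ratings.filter((r) => r === c).length);
  const sumSquaredFreqs = freqs.reduce((acc, f) => acc + f * f, 0);
  return (K * (N * N - sumSquaredFreqs)) / (N * N * (K - 1));
}

// Example: 5 raters, 5 credibility labels
const labels = ["Fake", "Probably Fake", "Not Sure", "Probably Real", "Real"];
const votes = ["Fake", "Fake", "Probably Fake", "Fake", "Not Sure"];
const di = dispersionIndex(votes, labels);
console.log(`DI = ${di.toFixed(3)}, agreement = ${(1 - di).toFixed(3)}`);
```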

3 Approach

In this chapter, we discuss the conceptual and technical architecture of the developed VLC system, provide an example scenario, and explain the stimuli used in the designed controlled social media environment.

3.1 System architecture

We developed InstaCour, a simplified version of Instagram that serves as a controlled social network environment. This environment allows content such as images and their captions to be delivered in a controlled manner and enables basic social media interactions (Fig. 3). It works with predefined images, captions, and comments, and administrators and researchers can add new Instagram-like content as required for their scenario. In the following scenario, “Fake or Real”, we did not implement the functionality of the like, share, and comment buttons.

Fig. 3

Conceptual architecture of the Virtual Learning Companion system and InstaCour environment

The conceptual architecture of the Virtual Learning Companion system and the InstaCour environment is depicted in Fig. 3. The learner, as the actor, sees the VLC interface and InstaCour in a single browser tab. The VLC was developed as a Chrome browser plugin interacting with InstaCour’s artifacts. It contains a chatbot that reacts to user input when users select or hover over images in InstaCour. While the VLC system interacts through chat, providing queries and suggestions via the chatbot, learners are able to engage with the content in the environment.

The companion system is divided into two principal technical components: in the frontend, the Chrome extension interacts with the InstaCour environment, and in the backend, internal and cloud-based microservices communicate with a middleware (Node.js/ExpressFootnote 12) that handles analysis and the combination of different APIs. The middleware facilitates frontend-to-backend communication via a REST API (Aprin et al., 2023).
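As an illustration of this middleware role, the following minimal Express sketch receives interaction events from the Chrome extension via REST and stores them in MongoDB. Route names, database names, and fields are hypothetical placeholders, not the project's actual code.

```typescript
// Minimal middleware sketch: receives frontend events via REST and stores them in MongoDB.
import express from "express";
import { MongoClient } from "mongodb";

const app = express();
app.use(express.json());

const mongo = new MongoClient("mongodb://localhost:27017");
const events = mongo.db("vlc").collection("events");

// The Chrome extension posts interaction events (chat turns, tab activity, votes).
app.post("/api/events", async (req, res) => {
  const event = { ...req.body, receivedAt: new Date() };
  await events.insertOne(event);
  res.status(201).json({ stored: true });
});

mongo.connect().then(() =>
  app.listen(3000, () => console.log("Middleware listening on port 3000"))
);
```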

A cloud-based Wit.aiFootnote 13 service makes it possible to extend the chatbot interaction with an AI-powered function that recognizes the user’s intent in free-text dialogue, based on a trained model. In contrast to the predefined underlying script, this feature was not used for the chatbot dialogue in the current experiment.

InstaCour handles the content (image links and captions) as a static JSON-formatted file. We developed a function that shuffles the content randomly whenever the user refreshes or opens InstaCour.
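A minimal sketch of such a shuffle is given below, assuming a Fisher–Yates style reordering of the JSON post list; the post fields are illustrative and the actual implementation may differ.

```typescript
// Post structure as loaded from the static JSON content file (illustrative fields).
interface InstaPost {
  imageUrl: string;
  caption: string;
}

// Fisher-Yates shuffle: returns the posts in random order each time the feed loads.
function shufflePosts(posts: InstaPost[]): InstaPost[] {
  const shuffled = [...posts];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  return shuffled;
}
```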

The middleware manages these tabs and their corresponding time slots, filters tabs according to the analysis goal (e.g., recording unique tabs in a specific time slot in a JSON structure), and saves them to (MongoDBFootnote 14) collections. We produce the metadata information for each image and store it with other data, such as user credentials, chat histories, and models, in MongoDB’s document-oriented database. The RIS module manages the communication with the Google reverse image search API and sends similar image links back to the middleware module. The middleware module stores the harvested RIS links in the database.

The local text analysis module accesses the related RIS list, scrapes the remote websites’ text content, and analyzes it. The output is a keyword set containing particular predefined keywords (e.g., fake, fact, credible, evidence) and keywords extracted with the TF-IDFFootnote 15 algorithm. To record the action logs in the standard Experience API (xAPI)Footnote 16 structure, we use Learning Locker as the Learning Record Store (LRS).
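The keyword extraction can be sketched as follows: a simplified TF-IDF over the scraped page texts, where one page is scored against the corpus of all retrieved RIS pages. Tokenization, thresholds, and the topN parameter are illustrative assumptions rather than the project's actual settings.

```typescript
// Simplified TF-IDF keyword extraction over scraped RIS page texts.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-zäöüß]+/g) ?? [];
}

function topKeywords(pageText: string, corpus: string[], topN = 10): string[] {
  const tokens = tokenize(pageText);
  const tf = new Map<string, number>();
  tokens.forEach((t) => tf.set(t, (tf.get(t) ?? 0) + 1));

  const docs = corpus.map(tokenize);
  // Smoothed inverse document frequency of a term across all scraped pages.
  const idf = (term: string) => {
    const containing = docs.filter((d) => d.includes(term)).length;
    return Math.log((1 + docs.length) / (1 + containing)) + 1;
  };

  return [...tf.entries()]
    .map(([term, count]) => ({ term, score: (count / tokens.length) * idf(term) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map((e) => e.term);
}
```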

Since the Chrome extension is completely independent of the InstaCour environment, maintenance and development of InstaCour become easier, and modularity increases. More importantly, the Chrome extension allows us to monitor users’ browsing behavior. In addition to tracking when they open and close other tabs, the number of tabs they open, the unique tab names and titles, and the amount of time they spend on each tab are also monitored. The information handled by these functions is programmed in the Chrome extension’s Background.js, which communicates with the middleware directly from the browser via the REST API.
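A minimal sketch of this tab monitoring in a Background.js-style script is shown below, using the standard chrome.tabs events; the reporting endpoint and payload fields are hypothetical, and reading tab URLs/titles assumes the "tabs" permission.

```typescript
// Background-script sketch: track which tab is active and for how long,
// then report the dwell time to the middleware.
let activeTabId: number | null = null;
let activeSince = Date.now();

function reportTabTime(tabId: number, milliseconds: number): void {
  chrome.tabs.get(tabId, (tab) => {
    fetch("http://localhost:3000/api/events", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ type: "tab-time", url: tab.url, title: tab.title, milliseconds }),
    });
  });
}

// Fires whenever the user switches tabs; close out the previous tab's time slot.
chrome.tabs.onActivated.addListener(({ tabId }) => {
  if (activeTabId !== null) reportTabTime(activeTabId, Date.now() - activeSince);
  activeTabId = tabId;
  activeSince = Date.now();
});
```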

We developed a data structure in the standard xAPI statement format for storing the chatbot dialogue, the actions related to the user’s interactions with the VLC and InstaCour, and the other browsed websites, and we store these statements in Learning Locker and MongoDB simultaneously. Using a Chrome extension also has drawbacks, such as restricting the trial environment to the Chrome browser and requiring the extension to be set up and configured on every PC before each school trial (Aprin et al., 2023).
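For illustration, an xAPI statement recording a learner's vote could look roughly as follows; the actor account, object identifiers, and result field are hypothetical placeholders and not the project's actual vocabulary.

```typescript
// Illustrative xAPI statement for a learner's first vote on a stimulus.
const voteStatement = {
  actor: {
    objectType: "Agent",
    account: { homePage: "https://courage.example", name: "user-token-123" },
  },
  verb: {
    id: "http://adlnet.gov/expapi/verbs/answered",
    display: { "en-US": "answered" },
  },
  object: {
    id: "https://courage.example/instacour/stimulus/mari",
    definition: { name: { "en-US": "Mari stimulus - first vote" } },
  },
  result: { response: "Probably Fake" },
  timestamp: new Date().toISOString(),
};
```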

4 Scenario

Each learner is associated with an anonymous user token stored by the browser plugin, and they can choose to sit at the computer whenever they want. The Chrome browser is opened on each PC, and learners can access the learning environment.

The companion instructs users to right-click on an image in the simulated social media environment (InstaCour) and choose the icon for the companion tool. The images (stimulus material) depicted on InstaCour may be manipulated or real. Images are shuffled and arranged like Instagram posts from top to bottom, so users can scroll up and down through the posts.

Based on previous empirical evidence, our research has focused on controversial cases (manipulated or real images) where it is challenging, even for adults, to detect image manipulation without checking them via search engines (Aprin et al., 2021, 2022).

4.1 Stimulus material

We have chosen the following five cases/stimuli (images and their captions) with the statuses and concrete categorization given in Table 1, based on expert judgment, as Fake or Real:

Table 1 Top: stimulus name (image and its caption) with its color code. Middle: representative images on InstaCour. Bottom: category based on expert judgment (label)

Fake refers to images that have been manipulated with software (like Adobe Photoshop), and Real implies the image is not manipulated and is obtained directly from the camera lens.

In the following calculations, we use the terms image or stimulus to refer to an image together with its caption.

The “Magazine” case focuses on stereotypes (the “perfect woman”); the woman depicted in the image does not exist in reality.Footnote 17 Software was used to modify the model’s face (removing wrinkles and spots, enhancing colors, etc.).

The “Dali” case focuses on biased human judgment: at first glance the image looks impossible, but it is a real image; the building is under construction, and a painted cover was used to conceal it.Footnote 18

The “John Lennon” case suggests that the musician worked with the revolutionary Che Guevara.Footnote 19 In reality, they never met. This manipulated photo is meant to trap followers of these figures, and people interested in them, in a particular filter bubble on social media (left-leaning and liberal political currents of the 1970s and 1980s). According to our school trials, the students had no background knowledge about this case, and it was considered the most complicated case to decide.

The “Flying House” case depicts an image that seems unrealistic, but it was an actual National Geographic project.Footnote 20 The team managed to attach a considerable number of helium balloons to a wooden house and lift it with two passengers, as in the movie Up.

The “Mari” case claims that Chris Hadfield, on the International Space Station, endorsed the use of marijuana for the first two weeks after arrival to cope with the side effects of being in space, such as disturbed sleep. In reality, he was simply holding colorful eggs in his hands in preparation for an Easter ceremony.Footnote 21

4.2 Scenario description

The teacher, acting as conductor, explains the scenario in a short video and shows students how to use the environment and the VLC tool. The opened plugin tab shows how to activate the companion by right-clicking an image in the simulated Instagram environment.

After right-clicking on an image and selecting the VLC button, the companion automatically begins a chatbot-style conversation (e.g., “Hey, you made it!”) and asks some demographic questions (gender, age, …), followed by knowledge activation questions that the learner must answer in free text, e.g., asking the learner to express their opinion about the selected image. In the subsequent stage, the companion asks the student whether the InstaCour image is Fake or Real, and the learner must vote for one of the five labels (Fake, Probably Fake, Not Sure, Probably Real, Real) without checking search engines or asking their classmatesFootnote 22 (first vote). The student must then respond to questions that require them to justify their answer.

The companion then reveals a “Recommended” tab with links from Reverse Image Search (RIS) and the Google Lens option of the Google Chrome browser. Learners can click on these links to compare the keywords, metadata, bold phrases, and concise descriptions of each RIS link for the chosen post, and they can even open the complete sources from the RIS list (Fig. 4).

Fig. 4

The InstaCour environment with an image and its caption (left side) and the VLC add-on with contextual information and a summary of the keywords from other resources based on RIS, in sorted format (right side)

In this stage, the companion encourages and directs students to use an external tool to perform a free Internet search or to use the provided reverse image tools to get further context on the subject. The image is shown in various contexts and on various websites in the plugin’s sidebar (Fig. 5), and the companion guides them step by step. After engaging with each image’s metadata and returning to the chatbot dialogue, the companion may issue a warning or feedback as an educational intervention. If students evaluate or respond quickly without consulting the suggested RIS, they receive a message in the conversation such as “Beware of prejudice and bias, you answered quickly!” After the learner has explored the RIS links, the companion delivers additional (modified) input and asks whether the user wants to change their mind and vote again (revision) if they are not confident in their first judgment. If the learner changes their vote or decides not to participate in the second vote (revision), the companion requests an explanation and justification. The main steps of the companion and the learner are depicted in Fig. 6.

Fig. 5

The InstaCour environment with an image and its caption (left side). The companion chat and the retrieved images from Google Lens, each of which represents a website (right side)

Fig. 6

Flowchart for main points of the companion scenario for the Fake/Real image game

In the next stage, the companion poses contemplative questions (e.g., “How confident are you?” or “Have you examined the other resources provided?”). After this conversation, the VLC system asks the learner to select the next stimulus on InstaCour. Finally, the VLC awards badges and presents the expert judgment alongside credible sources for each case. At the end, the VLC asks a few evaluation questions about whether the user enjoyed the system and whether they have any suggestions for scenarios or improvements.

During the experiment, the VLC system records the learner’s responses, the companion’s chat responses, the learner’s interactions with objects in the environment, the tabs that the learner accessed in the browser, and the number of visited tabs with their timeline, in a standard xAPI format and in MongoDB simultaneously. After the VLC trial, the conductor explains the truth behind each stimulus, and the researchers and organizers conduct a short oral examination with a few questions.

4.3 Future vision: Using image watermarking and blockchain-based metadata to identify manipulated content

We suggest employing blockchain technology to store the metadata of images on the web. Each image can be assigned a unique code as an ID and embedded in the image as a watermark before it is shared on social media. Essential metadata such as the time of capture, the photographer's name, and the image's pixel information in condensed form can be stored on the blockchain immediately after the image is captured. Keeping this metadata on a blockchain enables us to find altered copies of photos and to trace ownership changes.
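The idea can be sketched conceptually as follows: a hash of the image bytes serves as the unique ID and watermark payload, and the resulting metadata record is what would be anchored on a blockchain. No specific chain, API, or watermarking scheme is implied; the field names are illustrative.

```typescript
import { createHash } from "crypto";

// Conceptual provenance record to be anchored on-chain and referenced by the watermark.
interface ImageProvenance {
  imageId: string;      // hash of the image bytes, used as the watermark payload
  capturedAt: string;   // time of capture
  photographer: string; // photographer's name
  pixelDigest: string;  // condensed pixel information
}

function buildProvenanceRecord(
  imageBytes: Buffer,
  capturedAt: string,
  photographer: string
): ImageProvenance {
  const digest = createHash("sha256").update(imageBytes).digest("hex");
  return { imageId: digest, capturedAt, photographer, pixelDigest: digest.slice(0, 16) };
}
```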

Through this feature, media users will have access to similar images along with their metadata and history. This allows them to evaluate the image, find it in different contexts, and better estimate its credibility (independent from social media and search engine’s ranking). As a result, it also promotes awareness when consuming social media content. VLC, acting as a moderator, can play a role in educating users, especially young adults, about this metadata system.

In the near future, implementing blockchain-based metadata and watermarking will be crucial in combating human- and AI-manipulated content online and in fostering a more informed online community. Of course, this feature needs to be backed by regulation, ensuring that phone and camera manufacturers commit to storing image metadata at creation time before the image can be shared.

5 Evaluation

This chapter outlines the design of the school trial experiment, including the classroom conditions, the data collection, the types of collected data, the analytical measures, and the terminology. We then use the logs and mathematical measures to answer the research questions.

5.1 Classroom experiment

A one-day school trial using the “FakeOrReal” setting of the InstaCour environment was conducted in June 2022 at the Wolfskuhle school in Germany. The 22 participating students, 15 males and 7 females, were all minors in the age range of 13 to 15 years. These students were admitted on the basis of a data privacy agreement signed by their parents and received individual access tokens for anonymous participation. An additional five students of the same course had to be excluded because of a missing signature. Based on the later data analyses, three more participants had to be excluded because of deviations from the standard instructions.

The InstaCour environment presents the stimuli in randomized order in order to encourage learners to work on the task individually: a neighboring PC is likely to show a different stimulus order because the images are shuffled. Randomization also helps to distribute attention equally and avoids always placing a specific stimulus at the top. The experiment resulted in 95 recorded conversations with the VLC and 2056 unique user interactions recorded in the chatbot collection database. In total, 228 unique tabs were opened, with an average of 12 tabs per user; the experiment lasted approximately 50 min, i.e., approximately 10 min per image.

The data collected included chatbot responses, opened tab usage, timeline, and metadata on user interactions, which were analyzed to identify meaningful relationships, as described in the following chapters.

5.2 Feedback from learners

On completion of a scenario run, the companion asked users to provide feedback on the system’s functionality and possible enhancements. Nine learners responded to the first question about whether they learned something from the system, with six answering “yes” and three answering “no.” Nine learners also responded to the written question about system enhancements or criticisms, with seven giving positive feedback such as “All fine, you are a very helpful companion”, “No, I have no suggestion,” and “Not at all, you are perfect!” Two learners criticized the VLC’s chat procedure for unnecessarily repeating the same questions for each stimulus; the conversation could be shorter in each round, especially from the second to the fifth round. Learners also mentioned that the companion window popped up with the same conversation each time, hindering quick research among the RIS links and responses. These suggestions will be considered in our following school trials. After the experiment, we had an oral session with the learners, in which they provided positive feedback on the VLC and the scenario, finding it entertaining and informative. Learners had difficulty determining the credibility of the John Lennon image and caption, as the RIS content lacked textual explanations. Despite the provided information, learners found it difficult to find the truth since they had no historical background on the Che Guevara and John Lennon case.

5.3 Analytic measures

This section details the analytical measures used to assess the school trial experiment. We start by defining the essential terms and concepts needed for the analysis, followed by a description of the methods and calculations required.

5.3.1 Scores of the Vote

Scores are the predetermined numerical values assigned to each label, as shown in Table 2.

Table 2 Credibility classes and their scores

5.3.2 Expert Score (ES)

The Expert Score is the score assigned to social media images reviewed and categorized by experts; we take the expert judgments as ground truth. For instance, the NASA-marijuana image (the Mari stimulus) has been labeled fake by experts, and its expert score is ES = -2.

5.3.3 Vote Tuples and judgments

We assign user tokens and votes to distinct groups based on their participation in image evaluation (stimulus(index)), creating a unique tuple (a record) as follows:

$$Record/Tuple:\;\{stimulus\;(index),\;user-token,\;vote\}$$

To answer the defined research questions, we must consider the number of users and their votes for each stimulus. Users participate in a first round of voting (Vote1), which takes place before the learners have seen the recommendations provided by the VLC. The second round of voting (Vote2 or revision) takes place after interaction with the provided RIS, recommendations, and metadata for each stimulus through the companion conversation.

As mentioned, 22 students participated in the experiment, with 19 individuals included in the analysis after removing outliers. In total, we have 129 unique records or tuples after combining the stimulus, user-token, and vote information. Most users participated in the first vote for each stimulus, resulting in 92 records. For the revision vote, we have 37 records. The number of records (tuples) for each stimulus and their judgments are shown in Table 3 below:

Table 3 Number of records(tuples) in the first vote and revision

5.3.4 Expert distance (ED)

Expert distance is the absolute difference between the expert evaluation score and the learner’s vote score for a given stimulus.

$$ED_{in} = \left| ES_{i} - VoteScore_{in} \right|$$

where i is the index of the image (stimulus) and n is the index of the learner’s token. Based on the defined scores, Max ED = 4 and Min ED = 0.
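As a worked example, assuming the Table 2 scores range from -2 (Fake) to +2 (Real): if the expert score for the Mari stimulus is ES = -2 and a learner votes “Probably Real” (+1), then ED = |-2 - 1| = 3.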

5.3.5 Improvement/deterioration per user

IPU is a measure per user and per image that can be positive (improvement) or negative (deterioration) for each stimulus. It is obtained by subtracting the expert distance of the second vote (\(ED_{2}\)) from the expert distance of the first vote (\(ED_{1}\)):

$$IPU_{ni} = ED_{1,ni} - ED_{2,ni}, \qquad -4 \le IPU_{ni} \le 4$$

where n is the index of the learner’s token and i is the index of the image (stimulus 1–5).

5.3.6 Improvement per Image for all Users (\(IPI_{i}\))

Improvement per image is the sum of the IPU across all participants for that target stimulus.

$$IPI_{i}=\sum\limits_{n=1}^{N}{IPU}_{ni}$$

where i is the index of the image (stimulus), n is the index of the learner’s token, and N is the total number of participants in the simulated environment (in this experiment, Nmax = 19).

5.3.7 The relative improvement per image (RIPI)

The Relative improvement per image is based on the sum of improvements per image, also considering the number of participants and votes for each stimulus. It is calculated as:

$$RIPI_{i}=IPI_{i}\cdot\frac{r_{i}}{R}$$

where i is the index of the image (stimulus), r_i is the number of records (tuples) per stimulus in the revision vote, and R is the total number of records in the revision vote (R = 37; see Table 3).

5.3.8 Improvement for all Images for a User (\(IAI_{n}\))

Improvement for all images for a user is the sum of the IPU values of that user over all images.

$$IAI_{n}=\sum\limits_{i=1}^{5}{IPU}_{ni}$$

where i is the index of the image (stimulus) and n is the index of the learner’s token.

5.3.9 Total Improvement (TI)

Total Improvement (TI) can be calculated in several ways that yield the same result, as demonstrated in the following equation: as the sum of the Improvements Per Image (IPI) over the five images, as the sum of the Improvement Per User (IPU) over all 92 votes, or as the sum of the Improvement for All Images (IAI) over all 19 users.

$$TI= \sum\limits_{i=1}^{5}{IPI}_{i} =\sum\limits_{v=1}^{92}{IPU}_{v}= \sum\limits_{n=1}^{N}{IAI}_{n}$$

where v is the index of the learner’s vote, i is the index of the image (stimulus), n is the index of the user, and N is the total number of participants in the simulated environment (in this experiment, Nmax = 19).
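The following TypeScript sketch (illustrative only; record fields and the expert-score mapping are hypothetical) shows how these measures can be computed from the vote tuples, using IPU = ED1 - ED2 so that positive values indicate improvement, and treating records without a revision as unchanged (IPU = 0).

```typescript
// One vote tuple: {stimulus (index), user-token, vote}, with scores from -2 to +2.
interface VoteRecord {
  stimulus: number;       // image index i
  userToken: string;      // learner token n
  firstScore: number;     // vote score in the first round
  revisionScore?: number; // vote score in the revision, if the learner revised
}

// Illustrative expert scores ES per stimulus index (not the paper's actual mapping).
const expertScore: Record<number, number> = { 1: -2, 2: 2, 3: -2, 4: 2, 5: -2 };

// Expert distance ED for a given stimulus and vote score.
const ed = (stimulus: number, voteScore: number): number =>
  Math.abs(expertScore[stimulus] - voteScore);

// IPU per record: ED of the first vote minus ED of the revision (0 if not revised).
const ipu = (r: VoteRecord): number =>
  r.revisionScore === undefined
    ? 0
    : ed(r.stimulus, r.firstScore) - ed(r.stimulus, r.revisionScore);

// IPI per image: sum of IPU over all learners for that stimulus.
function improvementPerImage(records: VoteRecord[], stimulus: number): number {
  return records.filter((r) => r.stimulus === stimulus).reduce((acc, r) => acc + ipu(r), 0);
}

// TI: sum of IPU over all records (equivalently, sum of IPI over all images).
function totalImprovement(records: VoteRecord[]): number {
  return records.reduce((acc, r) => acc + ipu(r), 0);
}
```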

5.3.10 Correlation measures

Our analysis uses the Pearson correlation coefficient (Pearson, 1895) as the primary method for calculating the correlation between our indexes. However, the Spearman correlation coefficient (Spearman, 1987) is better suited to small sample sizes; therefore, we employ it alongside the Pearson coefficient in some cases (Vargha et al., 2000).
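For reference, a minimal sketch of the Pearson coefficient as used here, over two paired numeric samples (e.g., IPU and time per user); this is a standard textbook formula, not project-specific code.

```typescript
// Pearson correlation coefficient between two equally long numeric samples.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const meanX = x.reduce((a, b) => a + b, 0) / n;
  const meanY = y.reduce((a, b) => a + b, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    cov += (x[i] - meanX) * (y[i] - meanY);
    varX += (x[i] - meanX) ** 2;
    varY += (y[i] - meanY) ** 2;
  }
  return cov / Math.sqrt(varX * varY);
}
```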

5.4 Conclusions from the analysis

The total number of records in the first judgment/vote was 92, and in the second judgment (revision), after the VLC recommended checking other sources, there were 37 votes. Out of the 37 revision votes, in 12 cases learners answered that they changed their mind because of the metadata provided by the Google Lens sidebar. Another 22 votes were based on the RIS, metadata, and keywords provided by the companion. For the remaining three votes, the learners did not specify the reason for participating in the revision.

5.4.1 IPI calculation and results

The total improvement over all five images is TI = 71; the average improvement per image is 71/5 = 14.2.

We consider RIPI the main measure for ranking the improvement per image. Since RIPI also takes the number of participants per stimulus into account, we use RIPI instead of the IPI index (Table 4).

Table 4 Number of votes per image, IPI, and Relative IPI (RIPI)

According to the oral questions after the classroom experiment, most students admitted that the “Lennon” case was the most challenging image to judge. This is because the image yielded less credible textual and visual information through both the RIS and Google Lens, which is consistent with the calculated RIPI value (Fig. 7).

Fig. 7

Relative improvements per image Index (RIPI) as a bar chart

Another similar approach to calculating improvement is to consider the average of expert distances for each stimulus:

$$AED_{i}= \frac{1}{N} \sum\limits_{n=1}^{N}{ED}_{in}$$

where i is the index of the image (stimulus) and N is the total number of participants for the target stimulus.

We consider the AED of the first vote as a measure of the difficulty of a stimulus. The analysis indicates that the ordering of the AED differences over the 92 tuples is consistent with the improvement ordering obtained from the RIPI analysis, i.e., the two approaches yield the same result in terms of improvement.

This section reports the results of analyses related to the participants’ judgments and their interaction with the web content guided by the VLC in the context of image classification tasks.

The REV = 0 subgroup consists of participants who were certain about their first judgment (vote) and did not revise it for the given stimulus; it contains 55 unique records.

The REV = 1 subgroup consists of participants who were unsure about their first judgment and revised it after interacting with the VLC recommendations; it contains 37 unique records.

For the column AED (revision) with revision status “All records” in Table 5, we keep the number of tuples constant for the AED calculation by using the first vote as the revision vote for those who did not participate in the revision, so that the total number of votes remains 92.

Table 5 Average expert distance per image; the REV = 1 subgroup participated in the revision, the REV = 0 subgroup decided not to participate in the revision; and other related indexes from the analysis

In column AED with “REV = 1” in Table 5, we considered the group of records that participated in revision in the calculation (REV = 1, 37 unique records).

Analyzing the results for the participants with REV = 1, it is evident that for the cases “Mari” (six votes) and “Magazine” (nine votes), all second choices matched the expert vote, which means the average expert distance is zero (AED = 0). For the other stimuli, the ED value is also reduced (improvement).

The average expert distance score for five stimuli was also calculated based on the total number of records.

Considering their first votes, those who participated in the revision (37 votes) had a higher average AED (2.392) than the group with 55 records (REV = 0), whose average was 1.734.

Considering their second votes (revision), those who participated in the revision (37 votes) had a lower average AED (0.632) than the group with 55 records who only participated in the first vote (1.734). This result indicates that participants who changed their vote based on the VLC interaction ended up closer to the expert judgment than participants who decided not to change their first vote.

If we apply the same formula, the value for the REV = 0 subgroup is 1.43, and for the second vote of the REV = 1 subgroup it is 0.45; this corresponds to almost one step of improvement (0.98) toward the expert judgment for those who participated in the revision.

Total Improvement (TI) for learners who only participated in the first vote was 1.43 (REV = 0), whereas for those who participated in the revision (REV = 1), the TI based on their first vote was 2.37.

Our analysis indicates a statistically significant positive correlation between the revision participation status (REV = 0 or 1) of all records and their Expert Distance (ED) in the initial judgment (first vote), with a Pearson correlation coefficient of 0.313 (p = 0.002351, sample size = 92). This suggests that learners in the REV = 0 group, who were confident enough not to participate in the revision (after the VLC RIS recommendations) for a specific stimulus, had a better initial judgment than the group that later participated in the revision, as reflected in the expert distances (2.37 > 1.43). Conversely, users in the REV = 1 group, who on average had a higher expert distance in the first round, reassessed the quality of their judgment after the VLC recommendations and the reverse image search, in contrast to users who tended not to participate in the revision for a specific stimulus.

The difference in average expert distance between the first vote and the revision is shown in dark green in Fig. 8. It also indicates that in two cases, “Flying House” and “Dali”, it was difficult to determine that they were “real” before receiving information from the RIS and metadata from the companion, compared with the ground truth (ES). However, because of the rich metadata on the web for these cases (textual stories from credible resources), most of the learners who participated in the revision could find the correct answer, which was “Real” in both cases.

Fig. 8

Average of Expert Distances (AED) for all records (92), 0 ≤ AED ≤ 4, for the learners before interaction with the VLC recommendations and instruction (first vote) and afterwards in the revision, for each of the five images, together with their difference

The statistical data for each image also shows that the VLC, which recommends RIS from the web, improves the learners’ judgments by reducing the distance from the expert judgment.

The following observations were made in the context of this analysis:

  • In 35 of the 37 revision votes (REV = 1), the vote moved toward the expert judgment (improvement).

  • On average, each user’s revisions improved their score by 3.84 according to the expert judgments in the experiment.

  • 17 of the 19 users participated in the revision for at least one image. The maximum was two users who revised four images; the minimum was five users who revised one image. On average, 2.1 of the 5 images were revised per user.

Deterministic vote counts in the different groupings: The last row of Table 5 shows how often users chose deterministic categories (Fake and Real) compared to doubtful, non-deterministic categories (Probably Fake, Not Sure, Probably Real). In the revision subgroup (REV = 1), this share rose from 24% to 81%, a 57-percentage-point increase in the use of deterministic votes in the revision.
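The share of deterministic votes can be computed with a small helper like the following sketch; the vote lists are illustrative placeholders, not the experiment’s data.

```python
# Share of deterministic labels ("Fake", "Real") vs. non-deterministic ones
# ("Probably Fake", "Not Sure", "Probably Real").
DETERMINISTIC = {"Fake", "Real"}

def deterministic_share(votes):
    """Fraction of votes that use a deterministic category."""
    return sum(v in DETERMINISTIC for v in votes) / len(votes)

first_votes    = ["Probably Fake", "Not Sure", "Real", "Probably Real"]
revision_votes = ["Fake", "Real", "Real", "Probably Real"]

print(f"first vote: {deterministic_share(first_votes):.0%}")
print(f"revision:   {deterministic_share(revision_votes):.0%}")
```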

Relation between RIPI and time consumption: The case with the lowest RIPI score, “Lennon”, also has the lowest total time spent (2696 s), as depicted in Fig. 9. According to our measurements, “Flying House” had the highest total time spent (4295 s, Table 6) and also the highest improvement. This may indicate that spending more time on each stimulus leads to a more precise answer, as measured by the distance from the expert score (ES) in our study.
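As an illustration, the total time per stimulus can be aggregated from the interaction logs roughly as follows; the log schema (one duration per interaction row) is an assumption for this sketch, since the actual durations are derived from xAPI statements in the Learning Locker store.

```python
import pandas as pd

# Hypothetical log rows: one duration (in seconds) per learner interaction with a stimulus.
logs = pd.DataFrame({
    "stimulus": ["Lennon", "Lennon", "Flying House", "Flying House", "Dali"],
    "seconds":  [120, 95, 300, 240, 180],
})

# Total time spent per stimulus, sorted from lowest to highest.
time_per_stimulus = logs.groupby("stimulus")["seconds"].sum().sort_values()
print(time_per_stimulus)
```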

Fig. 9

Time consumption based on xAPI statement logs in the Learning Locker data store

Table 6 Correlation between IPU and time per user based on user logs. Green-labeled p-values pass the significance test and reject the null hypothesis

If we consider the five images and calculate the correlation between improvement per user (IPU) and time per user (sample size = 19), we obtain the results reported in Table 6.

When considering all learners’ logs (sample size of 92 records, Fig. 10), the correlation between improvement (IPU) and time is moderate and positive, with a coefficient of 0.48 (t = 5.1985, p = 0.000001247).
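A hedged sketch of this correlation test, including the t statistic derived from Pearson’s r, is shown below; the arrays are illustrative only, whereas the study’s values over 92 records are r = 0.48, t = 5.1985, p = 0.000001247.

```python
import numpy as np
from scipy import stats

# Illustrative arrays: improvement per user record and the corresponding time spent.
ipu          = np.array([0, 1, 2, 0, 3, 1, 2, 4])
time_seconds = np.array([90, 150, 200, 80, 400, 160, 260, 500])

r, p = stats.pearsonr(ipu, time_seconds)
n = len(ipu)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)   # t statistic for Pearson's r
print(f"r = {r:.2f}, t = {t:.4f}, p = {p:.6g}")
```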

Fig. 10

Regression charts of improvement versus time. The left plot shows the 92 per-user-log records (IPU); the right plot shows the improvement per image (IPI) versus the total time per image

Overall, the correlation result suggests that as the time spent on a task increases, the improvement rate also tends to increase, and this correlation is statistically significant.

Put differently, the more time a user spends on each stimulus, including interaction with the VLC’s RIS-based instructions, the greater the improvement in the revision.

Regarding the agreement level, the results show that interaction with web content guided by the VLC leads to more consistent judgments and fewer divergent decisions (higher agreement); the difference in agreement is shown in the right section of Fig. 11. This was observed for each stimulus and, on average, for first votes and revision votes (REV = 0, 1):

Fig. 11

Left: agreement (1 - dispersion index) for all records (REV = 0, 1). Right: agreement for the first and second votes, together with their difference

An interesting finding from this calculation is that for the most challenging task, “Lennon”, the agreement index decreased by 0.11 after interaction with the RIS. The other tasks, for which credible content was easier to find through the RIS, showed an increase in the agreement index: the “Flying House” case gained +0.213 in agreement level and “Mari” gained +0.41.

Furthermore, examining the average agreement per subgroup, the population that participated in the revision (REV = 1, 37 records) has a considerably higher agreement (0.6) in comparison to the subgroup that did not participate (REV = 0, 55 records), whose agreement is 0.15, as shown in Table 7 below:

Table 7 Agreement measure and agreement level differences for different subgroups
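To illustrate the agreement measure (1 - dispersion index), the sketch below uses the index of qualitative variation as one plausible dispersion index for categorical votes; the exact index used in the study is not reproduced here, and the vote lists are illustrative.

```python
from collections import Counter

def agreement(votes, n_categories=5):
    """Agreement as 1 minus a normalized dispersion index of categorical votes.

    This sketch uses the index of qualitative variation (IQV) as an assumed
    dispersion index: 0 when all votes are identical, 1 when maximally dispersed.
    """
    counts = Counter(votes)
    n = len(votes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    iqv = (n_categories / (n_categories - 1)) * (1 - sum_sq)
    return 1 - iqv

first_votes    = ["Fake", "Probably Fake", "Not Sure", "Real", "Probably Real"]
revision_votes = ["Real", "Real", "Real", "Probably Real", "Real"]
print(agreement(first_votes), agreement(revision_votes))
```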

The correlation between the number of opened tabs and the time spent on each tab was analyzed using the data from 92 votes. The results indicate no significant correlation between the two variables (correlation coefficient = 0.13). This may be because some tabs opened from the reverse image search contain mostly textual content, whereas others contain mostly visual content without any text, so the time a learner needs to reach a decision differs from case to case.

As a result, it is recommended that the total time spent and the total number of opened tabs be considered separately for each stimulus in the analyses.

A total of 214 unique tabs were opened by our participants, an average of 11.26 per user, and the distribution of tabs was similar for male and female participants. If learners discover reputable and credible textual resources in the first RIS or Google Lens tab they open, they tend not to open further tabs to find the solution. Conversely, the results indicate that when the RIS content is more ambiguous and less textual, more external tabs are opened.
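Counting unique opened tabs per user and per stimulus can be done with a simple aggregation like the following sketch; the tab-event schema is assumed for illustration and does not reflect the study’s actual log format.

```python
import pandas as pd

# Hypothetical tab-opening events extracted from the interaction logs.
tab_events = pd.DataFrame({
    "user":     [1, 1, 2, 2, 2, 3],
    "stimulus": ["Lennon", "Lennon", "Flying House", "Dali", "Dali", "Lennon"],
    "tab_url":  ["u1", "u2", "u3", "u4", "u4", "u5"],
})

# Unique opened tabs (OT) per user and per stimulus.
tabs_per_user     = tab_events.groupby("user")["tab_url"].nunique()
tabs_per_stimulus = tab_events.groupby("stimulus")["tab_url"].nunique()
print(tabs_per_user.mean())   # average number of unique tabs per user
print(tabs_per_stimulus)
```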

As shown in Table 8, the “Lennon” case had the highest number of opened tabs and the lowest total time and improvement rate (RIPI), as most of the RIS results for this stimulus contained insufficient textual information. For the “Flying House” stimulus, where credible resources were recommended, the number of opened tabs was the lowest, while both the total time spent and the improvement rate were the highest.

Table 8 Comparison of the time, improvement index, and number of opened tabs (OT) per stimulus

Looking at the values for each image, we observe a negative relation between the changes in the number of opened tabs and the time spent for all stimuli except “Dali”.

Our findings suggest that the number of opened tabs is a useful indicator of the richness and quality of online information available for a particular stimulus. Opening more tabs may indicate insufficient textual information in the RIS, which could lead to a lower improvement rate.

5.5 Summary of results

In summary, the empirical findings allow us to answer the initial research questions in the following way:

  • RQ1: Was the VLC effective for engaging learners in a high school classroom in meaningful interactions with stimulus material in a simulated social media environment?

    The instructor and the learners indicated that the trial worked as planned and expected. Based on the oral discussion, they were engaged with the companion and the procedure. In addition, interaction with the stimuli in the provided InstaCour and with the VLC worked as planned in the school lab. According to the log files, all participants completed the initial judgment for the chosen stimuli, and the total and average time consumption indicate that each learner spent about 16 min on VLC conversations and judgments. The system successfully managed all the VLC instances and learners simultaneously and stored their interaction logs in dedicated databases, a cloud-based Learning Locker and MongoDB, in the different predefined structures.

  • RQ2: Was the accuracy of the judgment positively influenced by the time spent on visiting or reading web pages related to the target picture?

    Before considering the effect of time consumption on improvement, we need to determine the improvement based on the expert judgment score (ES), which serves as a ground truth. This is especially relevant for challenging stimuli.

    To determine the difficulty of a stimulus based on the learners' classifications, we looked at their initial judgment scores and compared them to the ES. The difference between the ES and the learners' initial votes was calculated to represent the "difficulty per image" (see the sketch after this list). This metric offers insight into how challenging it is for learners to correctly assess an image's credibility.

    According to these assumptions and calculations, we found that spending more time evaluating results from the RIS feature offered by the VLC can improve judgment, because credible resources often contain more detailed and richer textual content. This is supported by our findings, which showed a strong correlation between time spent and improvement in judgment for the challenging tasks with the highest difficulty indices, such as the "Dali" and "Flying House" stimuli. Additionally, an analysis of 92 logs revealed a moderately positive correlation (0.48) between improvement per user (IPU) and the time users spent on the task. Our findings suggest that taking extra time with the provided recommendations before deciding leads to more accurate outcomes.

  • RQ3: Did a revision of the initial judgment guided by the VLC lead to a higher agreement with the expert opinion and with other participants’ judgments (convergence)?

    Yes, according to the results, using the VLC to guide interactions with web content not only resulted in more consistent assessments and fewer divergent choices among participants but also led to a substantial increase in the use of deterministic votes such as “Fake” or “Real.” This shift away from more ambiguous, probabilistic choices such as “Probably Real,” “Probably Fake,” or “Not Sure” indicates higher levels of classification agreement among participants.
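As referenced under RQ2 above, the “difficulty per image” metric can be sketched as the average absolute gap between the expert score (ES) and the learners’ initial votes on the 0–4 scale; the data frame below is purely illustrative, not the study’s data.

```python
import pandas as pd

# Illustrative first votes and expert scores per stimulus, both on the 0-4 scale.
votes = pd.DataFrame({
    "stimulus": ["Dali", "Dali", "Flying House", "Lennon", "Mari"],
    "vote1":    [1, 0, 1, 2, 3],
    "es":       [4, 4, 4, 0, 4],
})

# Difficulty per image = mean |ES - first vote|; higher values = harder to judge correctly.
difficulty = (votes["es"] - votes["vote1"]).abs().groupby(votes["stimulus"]).mean()
print(difficulty.sort_values(ascending=False))
```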

5.6 Transferability and limitations of the approach

The analytical methodology, rooted in mathematical measures, that we employed to evaluate improvement and agreement against established ground-truth benchmarks can be readily adapted to other investigations meeting analogous criteria. For example, the approach accommodates similar scenarios in which users must classify diverse stimuli across multiple evaluation points, denoted vote1, vote2 (revision), through voteX.

The limited number of participants, with only 19 users contributing 2,056 logs and 129 classification answers in 95 conversations with the VLC, is a constraint of this study. To corroborate our findings, we conducted a supplementary experiment in Italian at LUMSA University in Italy. This additional trial involved 32 active participants and yielded outcomes largely in line with those from our initial experiment conducted at the Wolfskuhle school in Germany. A detailed analysis of these findings, including a side-by-side comparison, will be presented in a separate report within the same Courage project.

6 Conclusions and perspectives

We conducted school trials for our designed scenario, which included key elements such as InstaCour, a simulated version of Instagram featuring real and manipulated images, and a virtual learning companion (VLC) in the form of a browser extension. The VLC recommended reverse image search (RIS) to learners, summarized the recommendations, and guided learners through the scenario. The scenario was divided into two phases: in the first, learners labeled images before receiving recommendations from the companion, and in the second, they decided whether to participate in the revision after interacting with the suggestions and recommendations from the VLC. The trial was conducted as anticipated, and the interaction statements were successfully collected and stored in cloud-based databases.

Our data analysis showed that providing RIS through the companion improved students’ judgment of image manipulation. The revised assessments were more precise than the initial ones, measured by their proximity to the expert’s judgments. In cases where a recommended image had insufficient related textual information on the web, the revision showed less improvement, and users found it more challenging to arrive at the correct answer. We observed that for these challenging cases users opened more tabs in a short period to find credible content on the web.

Our small-scale study revealed that learners exposed to images from the Internet with varying contexts and credible textual content were more likely to agree on the image’s credibility and reach a consensus opinion. This also led to more deterministic responses based on the labels and classes of our categories.

Based on the results of our school trials, we intend to expand our research and focus on further high schools in Germany. This expansion will allow us to evaluate the efficacy of our virtual learning companion (VLC) and InstaCour in a different linguistic and cultural setting.

We hope to collect more comprehensive data and validate our tools’ efficacy across various student populations by conducting larger-scale trials. We will refine the companion’s recommendation system based on the lessons learned from previous trials, focusing on enhancing accuracy and direction.

Through these initiatives, we hope to equip students worldwide to critically evaluate images’ veracity and confidently navigate the digital landscape.

We intend to continue refining and adapting our tools, considering the specific requirements of various educational settings. By doing so, we hope to equip students with the skills and resources necessary for informed digital consumption.

A hypothesis that requires further investigation with a larger population, such as everyday internet users, is that engaging in online fact-checking based on reverse image search (RIS) could lead to biased groups that systematically favor different interpretations of the images, regardless of whether the content is actually fake or real. Agreement on the authenticity or falsehood of the content depends on the richness and factual basis of the retrieval context obtained from similar image resources. If this hypothesis holds, the selection of alternative sources by the search engines will significantly impact users’ beliefs about the content and their understanding of this information.