1 Introduction

The web has brought remarkable changes to society, transforming how we study and learn, work and do business, travel, and spend our free time. Social media, with around 3.6 billion usersFootnote 1 in 2020, have had a massive impact on how we communicate and share information, images, photos, ideas, and innovations. These changes affect everyone, including people with different forms of disabilities. This became particularly evident during the COVID-19 pandemic lockdowns, when many activities and businesses were forced to move online.

Tim Berners-Lee's famous 1997 quote, “The power of the web is in its universality. Access by everyone regardless of disability is an essential aspect”, is even more relevant today: web accessibility remains of paramount importance, since poor design choices or careless use of existing authoring tools can create barriers that exclude people with disabilities from today's digital society.

Much work has been done—and is still being done—by the World Wide Web Consortium to make sure that everyone can participate on equal terms in the digital society. The Web Accessibility InitiativeFootnote 2 aims to spread the basic principles of accessibility and to promote best practices among developers, online content publishers, and social media contributors. Accessibility standards and guidelines have been published for website and mobile app developers, namely the Web Content Accessibility Guidelines [25] and the Mobile Web Initiative [23], respectively. Focusing on the accessibility of images, a good starting point is the tutorial [24], which explains how to provide appropriate text alternatives based on the purpose of the image itself, which can be informative, decorative, functional (e.g., an icon), simple, or complex (e.g., a bar chart, a line graph, a diagram).

According to the Web Accessibility Initiative, “Making a website accessible means allowing access to the information contained in the website also to people with different types of physical disabilities and those with limited hardware and software tools.” Different forms of disabilities exist, from visual and auditory impairments to mobility and cognitive difficulties, and some people have multiple concurrent disabilities. Moreover, almost everyone experiences temporary disability at some point in their lives: a period of fragility after an accident, during an illness, or simply because of declining eyesight or other ailments due to aging.

In this paper, we concentrate on visual impairments (for some recent data see, for example, [17]), a type of disability that needs special attention. Indeed, developers should guarantee access to online services and content to people who have low vision or cannot see. Many assistive technologies exist to help them, e.g., screen readers, audio descriptions, magnification tools, and refreshable braille displays, which can also be used in combination. For example, users with a combined hearing and vision loss can use a screen reader in conjunction with a braille display. Despite this, if the content itself is not accessible, reading can be difficult or impossible, even with the help of these assistive technologies.

People who cannot see cannot understand the message conveyed by an image, so an accurate description of its content is essential. To give blind users access to an image, it is necessary to include an alternative text that can be read aloud by a screen reader or rendered by a braille display. The HTML markup language provides attributes and elements to add such text, for example the alt and longdescFootnote 3 attributes of the <img> element used to include images in web pages, or the <figcaption> element used within the <figure> element. In this study, we use the alt attribute and often adopt the term alt-text to denote image descriptions.

Describing online images and photos is time-consuming, and, as a consequence, online content often fails to satisfy this essential accessibility requirement. If a picture is worth a thousand words, visually impaired users could be missing thousands of words of context.

Fortunately, in recent years, researchers in the AI community have developed algorithms and tools to automatically create natural language descriptions of images. This ability is essential for tasks such as organizing extensive image collections or indexing and retrieving them in response to user queries, and many algorithms have been proposed with these goals in mind. As a positive side effect, they can be used for accessibility purposes if the quality of the sentences they generate is good enough to serve as alt-text. It is also worth noting that alt-texts are indexed by search engines and provide quick summaries that web crawlers use to understand the content of the images they crawl. These descriptions are also shown in place of images when images are turned off, for example when they are disabled to save mobile data while roaming.

This paper evaluates the perceived quality of the descriptions generated by some tools developed for automatic image captioning. We limited the analysis to web images and collected a set of pictures from Wikipedia; their alternative texts were considered human-authored and formed the set of ground truth captions, i.e., the reference against which to compare the results provided by the various tools. We then formulated and answered the following Research Questions:

  • RQ1 Is there any difference in the perceived correctness among the descriptions generated by the considered tools?

  • RQ2 Is there any difference in the perceived correctness between the ground truth descriptions provided by humans and those provided by the tools?

To answer these questions, we queried the considered tools for each image. We then evaluated the perceived correctness of their results through a survey administered to computer science university students attending Web and Mobile development courses (where accessibility is an important topic).

This paper is organized as follows. Section 2 describes some related work and Sect. 3 introduces the tools chosen for the experiment. The survey administered to the students is presented in Sect. 4. The results are discussed in Sect. 5. Finally, Sect. 6 concludes this study suggesting possible future directions.

2 Related work

Generating high-quality descriptions from images is a challenging task that requires interdisciplinary competencies at the intersection of computer vision, natural language processing, and machine learning. Details on different approaches can be found, for example, in [2, 9, 18, 21] where the authors present different algorithms, data sets, and evaluation metrics which have been proposed to generate image descriptions and to assess their quality with different levels of confidence.

AI researchers have developed algorithms that provide excellent results in the field of image classification, but automatically extracting a fluent description from a picture is much more complex. The task is not limited to object detection in a scene: it involves recognizing faces and facial expressions, landmarks or specific points of interest, interpersonal relations, etc., and it requires sophisticated natural language generation techniques. Moreover, evaluating the quality of the results often requires human intervention, which is costly and does not always scale.

Data sets are needed for training, validating, and testing AI algorithms, and researchers have created many of them. We recall here the Microsoft COCO (Common Objects in Context) data set [11], which collects non-iconic images, i.e., images containing objects in their natural context. This data set was designed for the detection and segmentation of objects in a scene. Its images are annotated with 91 object categories (for example, person, bicycle, bus, dog), and five captions are associated with each image, so the collection can also be used to train models that generate descriptions of new images. As the authors point out, building such a data set required a lot of work, amounting to over 70,000 hours of work by annotators recruited through the Amazon Mechanical TurkFootnote 4 crowdsourcing marketplace.

Another data set, called Conceptual Captions, is presented in [20]. This data set was built by harvesting the web looking for a wide variety of \(\langle image,caption \rangle\) pairs, including not only natural images (like in Microsoft COCO) but also other categories like products, cartoons, drawings, etc. The original alt-texts gathered from the web were transformed to obtain conceptual captions. This was achieved by removing proper nouns, dates, locations, etc., from the original captions or by replacing them, when possible, with the corresponding hyperonym, i.e., a word whose meaning includes the meaning of a more specific word. For example, “Lady Gaga” can be replaced by “singer”, “Tom Hanks” can be replaced by “actor”. This data set, which is much larger than other popular data sets, provides clean captions with fewer details but is still informative for training image captioning models.
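As a rough illustration of this substitution step, the following toy sketch replaces a few hard-coded entities with hypernyms; the actual pipeline in [20] is far more sophisticated, relying on automatic entity recognition rather than a hand-written map.

```python
# Toy illustration of turning a raw alt-text into a "conceptual" caption
# by replacing specific entities with more generic hypernyms.
# The entity/hypernym map below is invented for this example.
HYPERNYMS = {
    "Lady Gaga": "singer",
    "Tom Hanks": "actor",
    "New York": "city",
}

def conceptualize(caption: str) -> str:
    """Replace known proper nouns with their hypernyms."""
    for entity, hypernym in HYPERNYMS.items():
        caption = caption.replace(entity, hypernym)
    return caption

print(conceptualize("Tom Hanks strolling in New York"))
# -> "actor strolling in city"
```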

Finally, we also mention the VizWiz data setFootnote 5 which, unlike the previous examples, is populated with pictures taken by people who are visually impaired. The pictures in this data set are of lower quality than those chosen by sighted users and can optionally have an associated recorded question about the picture itself. The ultimate goal of this collection is to increase awareness of the technological needs of people who are blind and to give researchers new opportunities to develop assistive technologies that eliminate accessibility barriers.

We conclude the first part of this section by citing two viewpoints. In a comment by Chiarella et al. [4] that appeared in Nature Communications, the authors observe that scientists increasingly post natural science images and photos on social media, but this content may be inaccessible to those with visual impairments when the alt-text is missing. They suggest that action should be taken to guarantee access to these images and other multimedia objects in order to maximize and broaden education and research experiences.

Morris, in a recent article published in the Communications of the ACM [14], raises some ethical considerations touching, among other issues, on the problem of errors in AI algorithms. Many people with disabilities need to trust and rely on the output of an AI system without being able to verify that output themselves. Still, errors do occur, even though the popular press and advertising material sometimes wrongly state that these translation systems have reached “human parity”. The author concludes by saying that “Educating our next generation of innovators is of paramount importance [...] As technologists, it is our responsibility to proactively address these issues to ensure people with disabilities are not left behind by the AI revolution.” We agree, and we believe that education on accessibility is paramount for current and future developers, so that they understand the social value of appropriate image descriptions and get into the habit of providing them.

2.1 Evaluating alternative texts

In what follows, we introduce some works that are closer to our study, since they specifically describe experiments on the evaluation of image descriptions to be used as alt-texts.

Automated tests on web pages can be performed with validation tools to check whether they are accessible. These tools help to assess a minimal accessibility level, which is better than nothing. However, expert judgment is always required to capture subtler accessibility issues, as discussed for example in [22] or in a more recent study [3], where four commercial accessibility monitoring systems are compared.

In the case of images, the validation tools usually check whether alternative text is present or not. Unfortunately, the mere presence of such text does not guarantee accurate descriptions. The alternative text should indeed serve the same purpose as the non-text content: it should be descriptive and provide enough information without being too long. It should describe the content of an image without dwelling on visual details.

To give some examples, a common mistake is to use alt=“company logo” or, even worse, alt=“logo.png” to describe the logo of a company. These undescriptive captions provide little information when accessed with a screen reader and thus constitute a potential accessibility barrier. Other examples of poor alt-texts found on major websites are alt=“Image”, alt=“No photo description available”, or alt=“Insert alternative text here”, which can be automatically added by authoring software. Another common mistake is to include the words “image of” in the alt-text, since screen readers are already programmed to announce “image” when they encounter images on a page.

A discussion on descriptive vs undescriptive alt-texts can be found in [16], where the authors propose two approaches for automatically detecting undescriptive alt-texts in web pages using pattern recognition algorithms. To obtain the data used for the classification, they analyzed the home pages of more than 400 Norwegian municipalities. By manually classifying the collected alt-texts as either descriptive or undescriptive, they found that 80% of the alternative texts in their data set were undescriptive, thus failing to correctly describe the corresponding content.
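A minimal sketch of a pattern-based check of this kind is shown below; it is far simpler than the approaches proposed in [16], and the patterns merely encode the common mistakes listed above.

```python
import re
from html.parser import HTMLParser

# Patterns that typically signal an undescriptive alt-text
# (based on the common mistakes discussed above; the list is illustrative).
UNDESCRIPTIVE = [
    r"^\s*$",                        # empty alt
    r"^(image|photo|picture|logo)$", # a single generic word
    r"\.(png|jpe?g|gif|svg)$",       # a file name used as description
    r"^image of\b",
    r"insert alternative text",
    r"no photo description available",
]

class AltTextChecker(HTMLParser):
    """Collect <img> tags whose alt attribute is missing or undescriptive."""

    def __init__(self):
        super().__init__()
        self.issues = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        alt = attrs.get("alt")
        if alt is None:
            self.issues.append((attrs.get("src"), "missing alt"))
        elif any(re.search(p, alt, re.IGNORECASE) for p in UNDESCRIPTIVE):
            self.issues.append((attrs.get("src"), f"undescriptive alt: {alt!r}"))

checker = AltTextChecker()
checker.feed('<img src="logo.png" alt="logo.png"> <img src="cat.jpg">')
print(checker.issues)
# [('logo.png', "undescriptive alt: 'logo.png'"), ('cat.jpg', 'missing alt')]
```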

Along the same lines is the work in [1], where the authors analyze a set of images collected from some university websites and compare the results of human evaluation with those of automatic evaluation performed with the well-known AChecker validatorFootnote 6. In the university context, it is essential to carefully describe the complex images used for education purposes (for example, bar or pie charts, diagrams, scientific models of atoms or molecules, maps) so that blind students can access the same content as their sighted peers. Educators should be aware that the lack of such descriptions may constitute an accessibility barrier for students with disabilities. Moreover, for such complex images, assessing the quality of the corresponding descriptive text still requires human evaluation.

The work in [13] compares two annotation methods for employing novice web workers to manually author descriptions for images in the STEMFootnote 7 category, making them accessible to individuals with visual and print-reading disabilities. The first method introduced accessibility guidelines to the workers and left them free to write image descriptions in an empty text box. The second method was more structured: templates were provided to the web workers to elicit the proper information. The captions generated with the two approaches are compared in terms of word counts, inclusion of specific terms or categories, inclusion of units and data trends (when applicable), presence of syntactic errors, etc. The results show that guidelines alone are not sufficient for novice web workers to produce quality image descriptions, and it is better to generate such descriptions using templates. Moreover, the workers themselves preferred the use of templates and found the task easier.

The following two papers report on two different experiments carried out on the social media platform Twitter. The first [19] presents a browser extension, Twitter A11yFootnote 8, which dynamically adds alt-text to the images posted by users. The alt-text is generated server-side using a pipeline of different methods, returning a result as soon as one succeeds. The pipeline consists of optical character recognition, scene recognition, and reverse image search, plus two additional methods specific to Twitter. If none of these methods produces a satisfactory alt-text, the extension asks a crowd worker on Amazon Mechanical Turk to describe the image according to a set of guidelines.

The authors present the results of experiments designed to measure the quality of the captions returned by the different methods as well as user satisfaction. They show that with Twitter A11y blind users were able to follow many more images. However, they observe that work still needs to be done to make content accessible on Twitter. There is also a need to educate users to describe their images, since most photos lack an alt-text or have the default one, and thus remain inaccessible.

The second experiment on Twitter is described in [7], where the authors present a Conversational Assistant workflow that uses TweetTalk, a scalable conversational platform connecting visually impaired users and human assistants to find out about visual content. Analyses of the conversations collected through TweetTalk helped define canonical questions such as “Where is this picture taken?” or “What action is happening in the image?” These questions might be useful for human captioners to describe the concepts visually impaired users most need to better understand a scene. Some questions covered subjective issues, for example, “What emotion is evoked by the scene, or by the people in it?” Detecting emotions is currently an unsolved problem in AI.

The authors of [28] describe their Automatic Alt-Text system that applies computer vision technology to identify faces, objects, and themes from photos. The main goal is to present a useful, fast, free alt-text generation system for blind users of Facebook to enhance their online experience. The alt-text is constructed in the form of “Image may contain...”, followed by a list of objects recognized by the computer vision engine. The primary design decisions include the selection of object tags, the structure of information, and the integration of machine-generated descriptions with the existing Facebook photo experience. Again, selecting the right tags is not an easy task, and the authors ended up with a list of 97 concepts that provide different sets of information about the image, including people, objects (e.g., car, building, tree, cloud, food), settings (e.g., inside a restaurant, outdoor, nature), and other image properties.
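As a toy sketch of how such a template-based description could be assembled from a tagger's output (the concept names, confidence threshold, and wording below are invented for illustration and do not reflect the actual system described in [28]):

```python
def compose_alt_text(tags, threshold=0.8):
    """Build an "Image may contain: ..." string from (concept, confidence) pairs."""
    selected = [concept for concept, confidence in tags if confidence >= threshold]
    if not selected:
        return "Image"  # nothing recognized with enough confidence
    return "Image may contain: " + ", ".join(selected)

# Hypothetical output of a computer vision engine
tags = [("2 people", 0.95), ("outdoor", 0.90), ("tree", 0.85), ("food", 0.40)]
print(compose_alt_text(tags))
# -> "Image may contain: 2 people, outdoor, tree"
```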

Good feedback was provided during (1) lab interview sessions with a few blind users and (2) a large-scale experiment with thousands of visually impaired Facebook users, split into test and control groups. Users in the test group used the automatic alt-text system while those in the control group did not and, as expected, the former had an easier time understanding the content of photos. However, several design challenges also emerged, the major one being the quality of the tags. It is indeed possible to return more tags at the cost of lower accuracy, but would blind users still trust the system in that case?

Along the same lines is the work presented in [12], where blind users evaluate the captions of Twitter images; the results show that blind and visually impaired people trust incorrect AI-generated captions and fill in details to reconcile discrepancies rather than suspecting that the captions may be wrong. Another interesting point discussed in this work is the framing of the captions, considering the effect of positive vs negative framing. Results show that negatively framed captions encourage more distrust of low-confidence captions. Machine-generated captions can contain errors, and sometimes the algorithms hallucinate objects [20]. While sighted users can easily ignore or correct wrong captions, blind users cannot, and incorrect captions can convey misleading messages.

We could not find papers that compare well-known commercial tools for generating image descriptions to be used as alternative texts. A study similar to ours, but focused on tag-based descriptions, is that of researchers at Perficient Digital, whose goal was to identify the best image recognition engine [6]. They looked at Microsoft Azure Computer Vision (Sect. 3.1), Amazon Rekognition (Sect. 3.2), Google VisionFootnote 9, and IBM WatsonFootnote 10.

For their study, the authors selected 2,000 images in the four categories charts, landscapes, people, and products; three users tagged them manually. Then they evaluated the accuracy of the tags returned by the recognition engines and how well the results matched the human expectations.

Regarding accuracy, the results show that a tag could be judged accurate even if a human would not have chosen it to describe the image. For example, a picture of an outdoor scene might be tagged by an engine as “panorama” and be perfectly accurate, yet still not be one of the tags a user would think of to describe the image.

Regarding the match with human expectations, for each image the manual tags and the five highest-confidence tags from each engine were presented without revealing their source. Users had to select and rank the five tags that they felt best described the images. Results show that the tags written by humans score far higher than those of any engine. This is to be expected, as there is a clear difference between a tag being accurate and a tag being what a human would use to describe something. Among the engines, the winner was Google Vision (after the human captioners, of course).

Before ending this section, it is important to keep in mind that captions of different types exist. Alt-text can be written as a list of tags or keywords describing the objects detected in the image; it can be a conceptual description, i.e., a fluent sentence in which more generic words replace specific data such as proper names; at the opposite extreme, it can contain details about places or individuals, for example in the case of celebrities, political figures, scientists, etc.

3 Tools for the automatic generation of image descriptions

This section briefly introduces the four tools selected for the experiment; we chose them because they have many online reviews and seem to be among the most relevant for generating image descriptions. Moreover, some of them are offered by big players like Microsoft and Amazon. They produce different kinds of results, from sequences of tags to structured sentences. With this experiment, we could assess the perceived correctness of their outputs when used as alt-texts.

In the remainder of this section, as a reference example, we will use a Wikipedia imageFootnote 11 showing four small piles of different kinds of sugar, similar in size but different in color. In particular, the image shows, clockwise from top left: white refined, unrefined, unprocessed cane, and brown sugar.

3.1 Azure Computer Vision Engine

Microsoft AzureFootnote 12 is a cloud platform that provides services for software development. The Azure Computer Vision EngineFootnote 13 (Azure CVE for short) is one of these services; it grants access to advanced AI algorithms focused on image processing. It is part of the Azure Cognitive ServicesFootnote 14, a group of services that allow developers to easily add cognitive features to their applications without having AI or data science skills.

One of the most appreciated features of Azure CVE is facial recognition, which provides the ability to recognize famous people around the world. According to the Microsoft blogFootnote 15, “Microsoft researchers have built an artificial intelligence system that can generate captions for images that are, in many cases, more accurate than the descriptions people write. The breakthrough in a benchmark challenge is a milestone in Microsoft’s push to make its products and services inclusive and accessible to all users.”

A web developer willing to try Azure CVE can call a REST API which is available onlineFootnote 16. Once an image is uploaded as input, the AI algorithms process it and return a JSON response, i.e., a description composed of tags and complete sentences with associated confidence levels (a portion is shown in Table 1).

Table 1 Portion of the JSON file returned by Azure CVE

The resulting text for the reference image is “a close up of a teddy bear”, with a confidence level of 0.709: clearly a wrong description and an example of object hallucination (the teddy bear).
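For illustration, a minimal Python sketch of such a call using the requests library follows; the endpoint host and subscription key are placeholders, and the API version and response fields may differ from those used in our experiment.

```python
import requests

# Placeholders: replace with your own Azure resource endpoint and key.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<subscription-key>"

def describe_image(image_url: str) -> tuple[str, float]:
    """Ask the Describe operation for a one-sentence caption of a remote image."""
    response = requests.post(
        f"{ENDPOINT}/vision/v3.2/describe",
        params={"maxCandidates": 1, "language": "en"},
        headers={"Ocp-Apim-Subscription-Key": KEY},
        json={"url": image_url},
        timeout=30,
    )
    response.raise_for_status()
    caption = response.json()["description"]["captions"][0]
    return caption["text"], caption["confidence"]

text, confidence = describe_image("https://example.org/sugar.jpg")
print(f"{text} (confidence {confidence:.3f})")
# e.g., "a close up of a teddy bear (confidence 0.709)"
```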

3.2 Amazon Rekognition

Amazon RekognitionFootnote 17 (Amazon Rek for short) is an image analysis service available in the Amazon AI suiteFootnote 18. Like Microsoft Azure, Amazon offers different AI algorithms that can be easily integrated into users’ applications: advanced text analytics, automated code reviews, and chatbots are just some of the many available services.

This tool employs deep learning technology that requires no AI expertise to add labels to images and videos. The service allows users to identify objects, people, text, scenes, and activities in images and videos, as well as to detect inappropriate content.

Unlike the Microsoft engine, the Amazon Rek service does not generate a sentence describing the content of an image; instead, it returns a list of tags describing the objects detected in it. To access this vision engine, we used a third-party serviceFootnote 19 which offers REST APIs. For the sugar test image, the returned tags are the words “sugar, food”.
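For reference, a minimal sketch of an equivalent direct call with the official boto3 SDK (not the third-party wrapper we used) might look as follows; AWS credentials are assumed to be configured in the environment, and the post-processing of the labels is only illustrative.

```python
import boto3

# Assumes AWS credentials and region are configured in the environment.
rekognition = boto3.client("rekognition")

def label_image(path: str, min_confidence: float = 70.0) -> list[str]:
    """Return the label names Rekognition detects in a local image file."""
    with open(path, "rb") as image_file:
        response = rekognition.detect_labels(
            Image={"Bytes": image_file.read()},
            MaxLabels=10,
            MinConfidence=min_confidence,
        )
    return [label["Name"] for label in response["Labels"]]

print(", ".join(label_image("sugar.jpg")).lower())
# e.g., "sugar, food"
```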

3.3 Cloudsight

CloudsightFootnote 20 is a company specializing in image captioning and understanding. Its on-device computer vision model runs directly on users’ devices: it uses the device’s camera to photograph objects and identify them aloud for users with low vision.

This solution, announced in 2019 [5], drastically decreased the execution time for object recognition. Indeed, according to the announcement, the algorithm can describe a picture in less than 250 ms: “The process happens so fast on-device that the users may not even have to take a photo. Real-time streaming essentially means you can scan your phone around, and whenever you stop on an object and your phone focuses, the technology will immediately recognize the object.”

The description of the sugar reference image obtained with the Cloudsight algorithm is “brown powder on white ceramic plate”. Again, the description contains a hallucinated object (the white ceramic plate).

3.4 Auto Alt-Text for Google Chrome

Auto Alt-Text for Google ChromeFootnote 21 (Auto Alt-Text for short) is a browser extension that can generate descriptive captions for pictures on the fly. With this extension installed, a screen reader can read aloud the captions of the images in the currently loaded web page, if available. When a caption is missing, one can be generated by an AI algorithm transparently called by the extension.

To use this extension, users can right-click on any foreground image element, select “Get Image Info” from the context menu, and obtain the caption. The caption for the sugar reference image is “a close up of a cake on a plate”, which, again, hallucinates objects (cake and plate) not present in the image.

3.5 Summary

Table 2 summarizes the descriptions obtained for the sugar reference image. Since this image seems to confuse some of the algorithms, we also consider another simple image from the Wikipedia milk pageFootnote 22, a glass of milk on a uniform blue background, to highlight the differences in the results (see Table 3).

Table 2 Captions for the image showing different kinds of sugars
Table 3 Captions for the image with a glass of milk on a uniform blue background

Notice that even for these two simple images, with only one main object in the foreground and nothing in the background, some descriptions contain errors that sighted users can easily detect but might constitute a problem for users with visual impairments.

Table 4 Overview of the experiment

4 Experiment

This section reports the experiment in a structured way, following the guidelines by Wohlin et al. [27].

The goal of the study is to analyze and compare the quality of the descriptions generated by the tools described in Sect. 3, with the purpose of evaluating the possible benefits of adopting them for automatic image captioning. The results of this study can be interpreted from multiple perspectives: (1) researchers, interested in empirically assessing the quality of the descriptions generated by state-of-the-art AI solutions; (2) practitioners, willing to understand whether image descriptions can be automatically generated and thus adopted in their applications. The context of this study consists of university students (the participants) scoring the textual descriptions of images (the objects) by means of online questionnaires. Since many different descriptions can be considered correct for a single image, we believe that relying on humans to evaluate the effectiveness of the selected tools is a fundamental choice, because it allows us to understand whether the generated descriptions are also perceived as “correct” or of “good quality” by humans. Moreover, such tools have been developed to provide image descriptions to humans (e.g., visually impaired users), so the evaluation provided by humans is valuable (and fundamental). Table 4 summarizes the main elements of the experiment.

4.1 Images and textual descriptions (Objects)

The experiment took place in the last quarter of 2020, when we selected 60 images from Wikipedia belonging to three categories: Human, Landmark, and General. The Human category was further split into three sub-categories: Paintings of famous people (Famous Paintings), Famous people of the 20th century (Famous 1900), and Famous people of the 21st century (Famous 2000). This choice helped us to evaluate the different tools against categories of images frequently present on the web.

Famous Paintings and Famous 1900 pictures were collected from the first section of the Vital Articles Wikipedia pageFootnote 23, a page containing a list of subjects (and related links) for which the English version of Wikipedia has the most important articles. Landmark and General photos were also collected from the Vital Articles page, in the Geography and Everyday life sections, respectively. Finally, Famous 2000 images were chosen from the first five articles of the Forbes Celebrity 100 ranking, available in the corresponding Wikipedia sectionFootnote 24.

The images used in the experiment are those reachable by clicking on the small image in the top right part of each Wikipedia article (i.e., the main representative image of the Wikipedia entry). With each image we associated a set of five textual descriptions: the first is the alternative text written by the human contributors to the online encyclopediaFootnote 25, while the other four are the descriptions generated by the tools described in Sect. 3.

The complete replication package containing the 60 images and the corresponding descriptions is available at:

https://sepl.dibris.unige.it/2021-AltText.php

4.2 Questionnaires

From a pilot experiment conducted with two students, we noticed that evaluating the five descriptions of a single image required about 60-90 seconds. Thus, to limit the effort required of the participants, we decided to split the 60 images into two groups, and we prepared two questionnaires—to be completed in about 30-45 minutes—containing 30 images each.

The questionnaires were implemented using the Feedback module of the educational platform based on MoodleFootnote 26 and hosted by our university. To avoid habituation effects, the five descriptions were presented in random order, so that respondents could not associate a description with a particular tool based on its position. Moreover, respondents were not aware of the origin of the descriptions.

For each image, we asked: “Evaluate each of the following five descriptions for the picture above. In your opinion, are they good descriptions of the picture? (evaluate aspects like the correctness and the precision of the description).”

The possible answers used a standard 5-point Likert scale with values: Strongly disagree (1), Disagree (2), Neither agree nor disagree (3), Agree (4), and Strongly agree (5).

4.3 Participants (Subjects)

We advertised the questionnaires among students attending two courses in the 1st semester of the academic year 2020-21: Web development, offered in the 3rd and final year of the Bachelor degree in Computer Science, and Mobile development, offered in the 1st year of the Master degree in Computer Science. These two courses are meant to introduce the issue of accessibility, so that future developers are at least aware of the problems that some categories of users might face while surfing the web or using mobile apps.

We used the Moodle forum of each course, asking interested students to participate and providing the links to the questionnaires. They were invited to answer questionnaire number 1 or number 2 depending on whether their matriculation number was odd or even, respectively. We did not promise any reward to respondents.

In the instructions accompanying the announcement, we suggested that participants, if necessary, use a dictionary or translator to avoid language issues, such as a word in a description whose meaning was not clear or known (e.g., an unusual word describing a specific object). The English skills of the participants are good, being on average between levels B2 and C1 of the CEFRFootnote 27. Many of them attend(ed) university courses taught in English, and they also take oral exams in that language. In our opinion, their skills are more than adequate to evaluate rather simple descriptions like those generated by the considered tools and proposed in the experiment. Indeed, the median length of the descriptions (including stop-words) varies from 4 words for Wikipedia (manual) to 9 for Auto Alt-Text for Google Chrome; only about 10% of the descriptions consist of ten or more words, and the maximum description length is 18 words.

We also assigned a specific interpretation to each value of the Likert scale, suggesting that students select one of the following options:

  1. Strongly disagree: if you think this is a totally wrong description;

  2. Disagree: if you think this is a wrong description, but with some objects or aspects rightly recognized;

  3. Neither agree nor disagree: if you think this is a quite vague description;

  4. Agree: if you think this is a correct description (but not 100% precise about what you can see in the image);

  5. Strongly agree: if you think this is a precise description.

We collected 76 completed questionnaires, for a total of 2280 evaluations for each kind of description and 11,400 evaluations overall (since each image has five associated descriptions); 50 of the 76 respondents were bachelor students, and the remaining 26 were master students.

Fig. 1 Overall results of the experiment, partitioned for each tool. The colors for the bars are chosen from the IBM design library “color blind safe” color palette [10]

4.4 Variables and hypotheses formulation

Our experiment has one independent variable (also called “main factor” or treatment) with five possible levels:

  1. Wikipedia (manual);

  2. Azure Computer Vision Engine;

  3. Amazon Rekognition;

  4. Cloudsight;

  5. Auto Alt-Text for Google Chrome.

The experiment has one dependent variable, on which treatments are compared: the perceived Correctness of the description with respect to the corresponding image. We can state the null hypotheses in this schematic way:

$$\begin{aligned} H_{0}: {Correctness} \text { (treat-A) } = {Correctness}\text { (treat-B) } \end{aligned}$$

where treat-A and treat-B range over the 20 different combinations of the five treatment levels; we clearly exclude the five cases where the two treatments (A and B) coincide, since they are of no interest. Since we could not find any previous empirical evidence pointing to a clear advantage of one treatment over the others, we formulated \(H_{0}\) as non-directional hypotheses. The objective of the statistical analysis is to reject the null hypotheses above, thus accepting the corresponding alternative ones, \(H_{a}\):

$$\begin{aligned} H_{a}: {Correctness}\text { (treat-A) }\ne {Correctness} \text { (treat-B)} \end{aligned}$$

4.5 Analysis procedure

After computing descriptive statistics, we used a paired Wilcoxon test [26] to compare the effects of the treatments on each subject. In all the statistical tests performed, we decided, as is customary, to accept a 5% probability of committing a Type I error (\(\alpha\)) [27], i.e., of rejecting the null hypothesis when it is actually true. While statistical tests allow checking for the presence of significant differences, they do not provide any information about the magnitude of such differences. Therefore, we also used the nonparametric Cliff’s delta (|d|) effect size [8]. The effect size is considered small for \(0.148 \le |d| < 0.33\), medium for \(0.33 \le |d| < 0.474\) and large for \(|d| \ge 0.474\).
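As an illustration, this analysis step for a single pair of treatments can be sketched in Python as follows; the scores below are invented, whereas in the actual analysis each array contains the paired Likert answers of the respondents.

```python
import numpy as np
from scipy.stats import wilcoxon

def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all pairs of observations."""
    a, b = np.asarray(a), np.asarray(b)
    greater = (a[:, None] > b[None, :]).sum()
    less = (a[:, None] < b[None, :]).sum()
    return (greater - less) / (len(a) * len(b))

# Hypothetical paired Likert scores (1-5) for the same images and respondents.
scores_tool_a = [5, 4, 4, 3, 5, 2, 4, 4, 3, 5]
scores_tool_b = [3, 4, 2, 3, 4, 1, 3, 4, 2, 4]

stat, p_value = wilcoxon(scores_tool_a, scores_tool_b)  # paired signed-rank test
d = cliffs_delta(scores_tool_a, scores_tool_b)

print(f"Wilcoxon p-value = {p_value:.4f}, Cliff's |d| = {abs(d):.2f}")
# Reject H0 at alpha = 0.05 when p_value < 0.05; |d| >= 0.474 is a large effect.
```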

5 Results

In this section, we first provide an overview of the results by analyzing charts that summarize the distributions of the answers for each category of images. In this way, it is possible to understand how the respondents rated the descriptions generated by each tool. Then, in the second part of the section, we compare the results obtained by the various tools (and Wikipedia) using statistical tests to understand whether the differences among the ratings assigned to their descriptions (if any) are statistically significant.

5.1 General overview

Figure 1 shows five histograms representing the distributions of the answers provided by the 76 participants for each kind of description. Each histogram summarizes 2280 answers since each student evaluated 30 images over a sample of 60 (30 × 76 = 2280). In addition to the histograms, Table 5 reports the summary statistics of the answers provided.

Table 5 Statistics of all answers, partitioned for each tool
Table 6 Descriptions and answers for the image of the Manhattan skyline

From the histograms and the values reported in the table, it is evident that the human-authored captions received by far the highest scores. Out of a total of 2280 evaluations, in only 20 cases (<1%) did the participants strongly disagree with the proposed descriptions. On the contrary, in more than half of the cases, the participants gave the descriptions the maximum score. This is clear evidence that the alt-texts written by Wikipedia contributors have a high perceived quality.

The results of the four AI algorithms are quite different. Amazon Rek and Cloudsight generated the best descriptions according to the experiment participants, as is evident from their histograms, which are shifted toward the right of the scale (i.e., toward positive evaluations on the 5-point Likert scale). It is interesting to note that, differently from Wikipedia, in the case of Amazon Rek and Cloudsight the most popular evaluation is Agree (40.6% and 33.8% of the cases, respectively), while the top evaluation (i.e., Strongly agree) was selected in only 14.9% and 13.5% of the cases, respectively. The third tool in terms of mean evaluation is Azure CVE: even if its average score is only slightly lower than that of Cloudsight, the distribution of the answers is very different; indeed, Azure CVE shows a rather uniform distribution. As we will see in the following, this tool was generally quite accurate on specific categories of images but performed rather badly on others. Finally, participants gave the descriptions generated by Auto Alt-Text the lowest grades. In this case, the distribution is clearly shifted toward the left (i.e., toward negative evaluations): in 38.7% and 26.5% of the cases, participants strongly disagreed or disagreed with the automatic captions, respectively.

To get a more detailed view of the evaluations, we consider as an example an image representing the Manhattan skylineFootnote 28. Table 6 reports for each description the number of answers on the Likert scale. Alternative text should describe the information contained in a picture and not the picture itself, and the results show that the precision of the description is an important factor during the evaluation process. Almost all descriptions provide a correct representation of the image (except for an incorrect part in the description generated by Auto Alt-Text), and the evaluations reflect their accuracy.

We observed very similar results when considering (1) the two questionnaires Q1 and Q2 (see Table 7) and (2) the two groups of students from the Mobile and Web development courses (see Table 8). For this reason, in the following analyses we do not partition the results along these two dimensions.

Table 7 Statistics of all answers, partitioned for each tool and for each questionnaire
Table 8 Statistics of all answers, partitioned for each tool and for each course

5.2 Wikipedia

Figure 2 shows the distributions of the answers provided by the participants while evaluating the quality of the Wikipedia alt-texts. Each distribution summarizes 760 answers, since the 30 images were split into three categories with 10 images each. The Wikipedia alt-texts of images representing Human beings and Landmark scenarios obtained very high evaluations: in these two cases, more than half of the total answers were marked as “Strongly agree”. The results for the General category are also very good, although slightly less positive.

Fig. 2 Evaluation of Wikipedia descriptions, partitioned for category of images

5.3 Azure Computer Vision Engine

Figure 3 shows the distributions of the answers provided for Azure CVE. This AI engine gets good results for pictures of Human beings. Indeed, facial recognition is one of the most powerful features provided by this tool: almost all the famous people of the 20th and 21st centuries included in the experiment were successfully recognized. For example, the picture of John LennonFootnote 29 was labeled as “John Lennon in glasses looking at the camera” (see Table 9).

Fig. 3 Evaluation of Azure CVE descriptions, partitioned for category of images

Table 9 Captions for the picture of John Lennon

For the other two categories, the histograms show worse performance: both in the Landmark category and in the General category there is a considerable number of “Strongly disagree” and “Disagree” answers.

5.4 Amazon Rekognition

Figure 4 shows the distributions of the answers for Amazon Rek. The results returned by this online service are different in nature, since they are composed of sequences of tags.

Fig. 4 Evaluation of Amazon Rek descriptions, partitioned for category of images

Comparing different types of alt-text representation is another aspect of this experiment. From the results (see Fig. 4), it is evident that most of the answers are centered around good values of the scale. In particular, “Agree” is the most selected option, followed by “Neither agree nor disagree”. This could be explained by the fact that the majority of participants appreciated the quality of this tag-based representation, even though previous research by Enge [6] suggests that they would not have used the same tags to describe the images.

5.5 Cloudsight

Figure 5 shows the distributions of the answers for Cloudsight.

Fig. 5 Evaluation of Cloudsight descriptions, partitioned for category of images

This tool could not recognize famous people, but its descriptions were almost always faithful to reality. For example, the resulting caption for the picture of John Lennon is “man in black framed eyeglasses”. Students rated these alternative texts in different ways, sometimes with good grades (Fig. 5).

With this tool, the best results are obtained for the images in the Landmark category. Most of the time, the algorithm could not recognize important cities or famous places, but it provided acceptable descriptions with many details.

For example, for the main Wikipedia figure representing the Palacio de Bellas ArtesFootnote 30, a prominent cultural centre in Mexico City, the description generated by Cloudsight is “people walking near beige concrete building under blue sky during daytime”; the tool could not recognize the palace. All the captions for this image are reported in Table 10.

Table 10 Captions for the image of the Palacio de Bellas Artes, Mexico City

Concerning the General category, sometimes the descriptions are good and appreciated, but often they are perceived as inaccurate or totally wrong. These outcomes were unexpected because the purpose of Cloudsight is object recognition, and we therefore expected good performance in this generalist category.

5.6 Auto Alt-Text for Google Chrome

Finally, Fig. 6 shows the distributions of the answers provided by the participants while evaluating the quality of the descriptions generated by Auto Alt-Text.

Fig. 6 Evaluation of Auto Alt-Text descriptions, partitioned for category of images

This is the tool that scored worst: its outputs are confusing and inaccurate, and objects or people in the images are often not correctly recognized. The results of the experiment clearly show this, with “Strongly disagree” being the most popular value (see the left-most bars in the histograms in Fig. 6).

Only for some General pictures were there acceptable outcomes, such as the description generated for the main Wikipedia figure of the Sports team entryFootnote 31, i.e., “a group of young men playing a game of soccer”. All the captions for this image are reported in Table 11.

Table 11 Captions for the image of children playing soccer

5.7 Answers to the research questions

The analyses performed to answer the two Research Questions introduced in Sect. 1 are discussed in the following sections. Since we observed that the quality of the descriptions varies across the three categories of images, we performed the analysis separately for each category.

5.7.1 Human category

Table 12 reports the key statistics for the Human category. Azure CVE is the best tool, with Amazon Rekognition ranking second at −0.05 points. This means that both tools are able to generate descriptions of images with human beings with a high perceived quality. Cloudsight takes third position, with a reduction of −0.79 points. Finally, Auto Alt-Text is the worst among the four tools, even if it achieves a slightly better score than its global mean (+0.06, see Table 5).

Table 12 Human category: statistics

Comparing these results with the perceived correctness of the ground truth descriptions from Wikipedia, it is evident that the numbers heavily favor the latter. Indeed, Wikipedia scored +0.69 points with respect to Azure CVE and nearly doubles the score of the worst tool. It is also worth noting that the median value of the answers for Wikipedia is 5 out of 5.

A more detailed view of the distributions can be seen in the boxplots shown in Fig. 7a. From them, it is evident that the results for Azure CVE and Amazon Rek are almost identical, while the other two tools obtained different and lower scores. The boxplot of the Wikipedia alt-texts clearly shows a very high evaluation.

Fig. 7 Boxplots summarizing the distributions of the answers provided by the study participants. The straight black line represents the median value while the crossed square represents the average value. Each answer has a value in a 5-point Likert scale. The colors for the various boxplots are chosen from the IBM design library “color blind safe” color palette [10]

Table 13 reports the Wilcoxon test used to compare the effects of the treatments (i.e., the tools and Wikipedia) on each subject with a paired analysis where, for each image, each possible pair of descriptions is compared. From the table, it is evident that for all pairs of treatments the p-value is negligible (i.e., < 0.01), except for the pair \(\langle\)Azure CVE, Amazon Rek\(\rangle\), which achieved a p-value of 0.27. Thus, excluding this specific case, the differences in terms of perceived correctness of the descriptions are statistically significant in all the other cases. Therefore, we can reject the null hypotheses \(H_{0}\) and accept \(H_{a}\) for every pair of treatments except \(\langle\)Azure CVE, Amazon Rek\(\rangle\). This means that, with the exception of this single case, the correctness scores assigned by the participants differ in a statistically significant way depending on the considered tool.

Concerning Wikipedia, the difference is always statistically significant, as expected, given the evident difference in the distribution of the answers.

Table 13 Human category: Wilcoxon test statistics, each cell reports the p-value computed by using the Wilcoxon paired test among the two corresponding distributions of answers

To better analyze the magnitude of the significant differences determined with the Wilcoxon statistics, we report in Table 14 the Cliff’s delta effect size computed for all pairs of treatments. We can observe that, for all pairs where the difference is statistically significant, |d| assumes values from 0.31 to 0.59 (i.e., from small to large). These extremes correspond to the smallest difference, i.e., the pair \(\langle\)Cloudsight, Auto Alt-Text\(\rangle\), and the largest one, i.e., the pair \(\langle\)Amazon Rek, Auto Alt-Text\(\rangle\). The magnitude of these differences is also clearly visible in Fig. 7a. Concerning Wikipedia, the magnitude of the differences varies from 0.27 to a very high 0.82, and the results confirm what can be observed by analyzing the distributions shown in the boxplots of Fig. 7a.

Table 14 Human category: Cliff’s delta statistics, the effect size is considered small for \(0.148 \le |d| < 0.33\), medium for \(0.33 \le |d| < 0.474\) and large for \(|d| \ge 0.474\), see [8]

RQ1 (Human) To summarize the results achieved by the four tools on the Human category, two of them—Azure CVE and Amazon Rek—performed similarly and obtained the top evaluations concerning the correctness of the descriptions they generate. Cloudsight is in third position, with a mean score about 0.8 points lower (on a 5-point scale) than the top two. Finally, Auto Alt-Text obtained the lowest evaluation, with a mean score about 1.4 points lower. All differences are statistically significant except for the pair \(\langle\)Azure CVE, Amazon Rek\(\rangle\), which obtained very similar results (a difference of only 0.05 points).

RQ2 (Human) To summarize the results achieved by Wikipedia on the Human category, the numbers show that there is always a statistically significant difference in perceived correctness between the ground truth descriptions provided by the online encyclopedia and those provided by the various tools. Although there is a big difference among the tools (as seen for RQ1), even the top two are clearly outdistanced by Wikipedia.

5.7.2 Landmark category

Table 15 reports the key statistics for the Landmark category. In this case, Cloudsight is the best tool, closely followed by Amazon Rek (−0.14). Azure CVE occupies the third position (−0.78), and Auto Alt-Text is outdistanced with a score close to half that of the best tool (−1.57). Cloudsight is able to generate descriptions for Landmark images with a perceived quality that is higher than for the other categories.

Comparing the results with the perceived correctness of the human-authored descriptions, it is again evident that the numbers are heavily in favor of Wikipedia. Indeed, the Wikipedia alt-texts scored +0.94 points with respect to the best tool and more than double the score obtained by the worst tool. Also in this case, the median value of the answers for Wikipedia is 5.

Table 15 Landmark category: statistics

A more detailed view of the distributions can be seen in the boxplots of Fig. 7b, which show similar results for Cloudsight and Amazon Rek. Azure CVE has a strong variability, with some descriptions scoring very well and others far lower, and Auto Alt-Text consistently achieves the lowest results. Also for this category, the boxplot of Wikipedia shows the very high evaluations obtained by its alt-texts.

Table 16 reports the results of the Wilcoxon test: for all pairs of treatments the p-value is negligible (i.e., <0.01). Thus the differences in terms of perceived correctness of the captions are statistically significant in all cases, and we can reject the null hypotheses \(H_{0}\) and accept \(H_{a}\) for every pair of treatments. The correctness scores assigned by the respondents differ significantly depending on the considered tool. Concerning the Wikipedia alt-texts, the difference is always statistically significant.

Table 16 Landmark category: Wilcoxon test statistics

As before, we report in Table 17 the Cliff’s delta effect size. For all pairs of treatments |d| assumes values from 0.11 to 0.65 (i.e., from small to large). Such values correspond to the smallest difference, i.e., the case of the pair \(\langle\)Cloudsight, Amazon Rek \(\rangle\), and the largest ones, for \(\langle\)Auto Alt-Text, Amazon Rek \(\rangle\) and \(\langle\)Auto Alt-Text, Cloudsight \(\rangle\). The magnitude of such differences is also visible by observing Fig. 7b.

With the exception of the comparison with Cloudsight, the magnitude of the differences between Wikipedia and the tools is higher for Landmark images than for Human images. Indeed, |d| varies from 0.50 (already above the large-effect threshold) to a very high 0.88. Also in this case, the results confirm what can be observed by analyzing the distributions in the boxplots of Fig. 7b.

Table 17 Landmark category: Cliff’s delta statistics

RQ1 (Landmark) To summarize the results achieved by the four tools on the Landmark category, two of them—Cloudsight and Amazon Rek—performed very similarly and obtained the top evaluations concerning the correctness of the descriptions they generate. Azure CVE is in third position, with a mean score about 0.8 points lower than the top two. Finally, Auto Alt-Text got the lowest evaluation, with a mean score about 1.6 points lower. All the differences are statistically significant.

RQ2 (Landmark) To summarize the results achieved by Wikipedia on the Landmark category, the numbers show that there is always a statistically significant difference in perceived quality between the ground truth descriptions provided in the online encyclopedia and those provided by the various tools. Although there is a big difference among the tools (as seen for RQ1), even the top two are clearly outdistanced by Wikipedia.

5.7.3 General category

Table 18 reports the key statistics for the General category. Amazon Rek is the best tool, closely followed by Cloudsight (− 0.29). Azure CVE is in third position (− 0.77), and Auto Alt-Text is outdistanced in fourth position (− 1.00).

Table 18 General category: statistics

Comparing these results with those of Wikipedia, it is evident that the latter still wins, even if the difference is smaller. For this category, the human-authored alt-texts scored +0.61 points with respect to the best tool, Amazon Rek, and outdistanced (+1.61) the score obtained by the worst tool, Auto Alt-Text. However, in this case, the median value of the answers is 4 (as for Amazon Rek).

The boxplots in Fig. 7c show that Amazon Rek performed best among the tools. Cloudsight and Azure CVE have similar distributions, but with different mean values (see the crossed squares in the figure). Auto Alt-Text consistently scored worst, but with a rather spread-out distribution (StDev = 1.46): in some cases, the correctness of its descriptions was not perceived as so bad. Concerning Wikipedia, the boxplot clearly shows high evaluations, but this time the distribution is more spread out: respondents rated the correctness of the alt-texts lower for several images.

Table 19 reports the last Wilcoxon test used to compare the effects of the treatments on each subject. Again, all the p-values are <0.01 and, therefore, the differences in terms of perceived correctness of the generated captions are statistically significant, for all tools and Wikipedia, and we can reject the null hypotheses \(H_{0}\) and accept \(H_{a}\) for each pair of treatments.

Table 19 General category: Wilcoxon test statistics

The Cliff’s delta effect size is shown in Table 20. For all pairs, |d| assumes values from 0.11 to 0.41 (i.e., from small to medium); these extremes correspond to the smallest differences, i.e., the pairs \(\langle\)Azure CVE, Amazon Rek\(\rangle\) and \(\langle\)Azure CVE, Cloudsight\(\rangle\), and the largest one, i.e., the pair \(\langle\)Amazon Rek, Auto Alt-Text\(\rangle\) (see also Fig. 7c).

For Wikipedia, the magnitude of the differences is slightly lower for this category: |d| varies from 0.34 (just above the medium threshold) to 0.59 (large), confirming what can be observed in the boxplots of Fig. 7c, where the distribution of Wikipedia overlaps more with those of the tools.

Table 20 General category: Cliff’s delta statistics

RQ1 (General) For the General category of images, Amazon Rek reached the best performance, followed by Cloudsight. Azure CVE is in third position, with a mean score about 0.8 points lower than the top one. Finally, Auto Alt-Text obtained the lowest evaluation, although its score is not far from that of Azure CVE and exactly 1 point below that of the best tool. All the differences are statistically significant.

RQ2 (General) To summarize the results achieved by Wikipedia on the General category, the numbers show that there is always a statistically significant difference in perceived quality between the ground truth descriptions provided by the online encyclopedia and those produced by the tools. For this particular category of images, even if the Wikipedia alt-texts are perceived as the most correct, the gap with respect to the best tool is not large. In some cases, the evaluations are very similar: the distribution of Wikipedia has a non-negligible overlap with the others, in particular with that of Amazon Rek.

5.8 Discussion

From the analyses of the previous sections, it is clear that the descriptions generated by the various tools have a lower perceived correctness than those written by Wikipedia contributors. Among the tools, some differences could be observed. Amazon Rek has the highest perceived correctness followed by Cloudsight and Azure CVE. The last position is always occupied by the Chrome extension Auto Alt-Text. It is also interesting to note that the correctness of the automatic captions varies depending on the category of the images. Table 21 summarizes the ranking of the various categories.

Table 21 Summary of the comparison: ranking of Wikipedia (always the best) and the four tools

We found that some tools show good performance on a particular category (see the number 2 in bold in Table 21) but do not perform as well on others (e.g., Azure CVE on images in the Landmark and General categories). In contrast, other tools perform consistently quite well (Amazon Rek) or quite badly (Auto Alt-Text). This behavior probably depends on the characteristics of the AI algorithms used to produce the descriptions and on the data sets used to train them. Some solutions may be specialized in recognizing people with a high level of precision while performing less well on other categories of subjects.

The differences observed in the perceived correctness of the descriptions can also depend on the process employed by the tools. Indeed, some of them can be used to generate descriptions in batch mode, for instance, when a web page is about to be published. In such cases, the tool can employ more sophisticated solutions since execution time is not critical (e.g., obtaining a description in a few minutes is acceptable). On the other hand, browser add-ons like Auto Alt-Text, designed to support the web navigation of visually impaired users by generating captions on the fly, must be very fast. This requirement can partly compensate for the lower quality of their results.

Given our findings, we believe that producers of web content should, as a first choice, add image descriptions manually whenever possible. Our study clearly shows that, with the current technology, the correctness (and thus the perceived quality) of human-authored descriptions is simply not yet reachable by automated tools. As a second choice, state-of-the-practice solutions make it possible to produce image descriptions of good quality: in such cases, web content producers should investigate and experiment with which of them performs better on the specific types of images they have to publish since, as highlighted, the quality of the descriptions can vary depending on the image type. Finally, following best practices, we do not recommend leaving the alt-text empty, also because the on-the-fly solution we evaluated provided the worst results.
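As a minimal illustration of how content producers can spot images published without a description, the sketch below scans an HTML fragment for <img> elements with a missing or empty alt attribute. The markup and the use of BeautifulSoup are illustrative assumptions, not tooling used in this study.

```python
# Minimal sketch: flag <img> elements with a missing or empty alt attribute.
# The HTML fragment is illustrative; in practice the page would be loaded
# from the CMS or the file system before publication.
from bs4 import BeautifulSoup

html = """
<figure>
  <img src="colosseum.jpg">
  <img src="logo.png" alt="">
  <img src="chart.png" alt="Bar chart of monthly visitors">
</figure>
"""

soup = BeautifulSoup(html, "html.parser")
for img in soup.find_all("img"):
    alt = img.get("alt")
    if alt is None or not alt.strip():
        # Note: a deliberately empty alt marks a decorative image, so flagged
        # items should be reviewed by an author rather than filled blindly.
        print(f"Missing or empty alt-text: {img.get('src')}")
```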

Moreover, besides not being effective from a quality point of view, on-the-fly tools are not efficient from an energy-consumption point of view. Popular pages containing images without alt-texts would require calling remote APIs to generate their captions a very large number of times. In contrast, this task could be performed only once, server-side, when a new web page is first published, resulting in more sustainable websites.
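As an illustration of this once-per-publication, server-side approach, the sketch below calls Amazon Rekognition’s label-detection API (the tag-based tool in our comparison) for an image that lacks a manual description. The file name, label limits, and the assumption of already-configured AWS credentials are ours; the study does not prescribe this exact pipeline.

```python
# Minimal sketch of a server-side captioning step executed once, at publication
# time, for images that lack a manual alt-text (AWS credentials are assumed to
# be configured in the environment; file name and thresholds are illustrative).
import boto3

rekognition = boto3.client("rekognition")

def generate_alt_text(image_path: str) -> str:
    """Return a tag-based alt-text for an image without a manual description."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    response = rekognition.detect_labels(
        Image={"Bytes": image_bytes},
        MaxLabels=5,
        MinConfidence=80,
    )
    return ", ".join(label["Name"] for label in response["Labels"])

# Called once when the page is first published, so readers' browsers never
# need to invoke a remote API on the fly.
print(generate_alt_text("colosseum.jpg"))
```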

5.9 Threats to validity

This section discusses the threats to validity that could affect our results: internal, conclusion and external validity.

Internal validity threats concern factors that may affect the dependent variable (in our case, Correctness). To avoid any bias in the evaluation, the descriptions were presented in random positions, so participants could not associate a position with a particular tool by noticing a repetitive ordering. Moreover, since the participants had to evaluate five descriptions for each of the 30 images, a fatigue effect may have intervened. However, since for each image we required the participants to evaluate all five descriptions, we can exclude a fatigue effect on the results (it cannot affect the descriptions of the various tools differently). Another threat concerns the English skills of the participants, which, even if they can be judged as very good, are not at the level of a native speaker (see Sect. 4.3). To reduce this threat, we suggested that participants use, if necessary, a dictionary or translator to avoid language troubles, such as a word in a description whose meaning was not clear or known (e.g., a peculiar word describing a specific object). Note that, in our opinion, the simple descriptions generated by the tools make the participants’ skills more than adequate for the evaluation. Finally, three tools return sentences while only one (Amazon Rek) returns a list of tags. We asked participants to evaluate each description with respect to a picture, considering aspects like the correctness and the precision of the description. We do not think that having a list of tags rather than a complete sentence can change, per se, the evaluation provided by the participants. However, since we did not have two versions of the same descriptions (tag-based vs. sentence-based) to compare in an experiment, we cannot be sure of this.

Threats to conclusion validity concern issues that may affect the ability to draw correct conclusions. They can be due to the sample size of the experiment (76 participants, 2280 evaluations), which may limit the ability of the statistical tests to reveal an effect, and to the chosen statistical tests themselves. In this experiment, we decided to use nonparametric tests for testing the effect of the main factor due to the size of the sample and because we could not safely assume normal distributions [15].

Threats to external validity can be related to: (i) the choice of the images and (ii) the use of students as experimental participants. For the choice of the images, we devised the procedure described in Sect. 4.1 to avoid any bias in the selection and to guarantee transparency in the experiment. The three categories of images, although they do not cover all types of images, are certainly of great interest as they represent a significant fraction of the images found on standard websites. As far as the participants are concerned, being computer science students, they have a good knowledge of the English language (they all have an English language course in their study plan and are used to reading technical documentation in English). For this reason, even if they cannot have the same level of comprehension as a native speaker, on simple descriptions composed of just a few words their comprehension level is probably not very different (we also suggested that they access the web in case they needed clarification on the meaning of specific words).

6 Conclusion and future work

Alternative text is one of the most effective ways to help visually impaired people understand what a picture on a web page represents. However, web developers often underestimate this critical information, making their websites not equally accessible to all users.

During the last few years, thanks to the advent of advanced machine learning techniques, scientific research has made significant steps forward in this field by proposing different approaches to automatic image captioning. At the same time, companies like Microsoft and Amazon have developed services, usually available through APIs, that are able to process images and return textual descriptions. However, to the best of our knowledge, these tools had never been evaluated to understand whether their outputs could be used as a substitute for manually defined alt-texts in the web context. For this reason, the goal of our experiment was to evaluate the perceived correctness of the generated descriptions by comparing them with human-defined references.

Specifically, we selected four well-known tools with heterogeneous characteristics: Azure Computer Vision Engine, Amazon Rekognition, Cloudsight, and Auto Alt-Text for Google Chrome. Using 60 images taken from Wikipedia, we asked 76 survey participants to evaluate the resulting descriptions without knowing their source.

The overall outcomes show that, on average, people still prefer human-authored texts even when the tools’ descriptions are accurate. Indeed, the generated descriptions have not yet reached a level of precision comparable to that of descriptions written by humans. Experts in the field may find these results quite predictable. However, a valuable contribution of the present study is that it provides an estimate of the magnitude of the differences, with a detailed analysis from multiple perspectives. Indeed, we analyzed different categories of images (Human, Landmark and General) and the descriptions generated for them by different tools. In this way, it was possible to evaluate how the quality of the descriptions varies depending on multiple factors. The analysis also highlighted the unreliability of some algorithms: they generate descriptions that are sometimes not precise enough or even wrong. Among the tools examined, there is no algorithm mature enough to replace the manual writing of alternative texts. However, some algorithms have shown their ability to generate good descriptions for specific categories of images. Therefore, they could be used as support tools for the usual manual work of web developers.

Possible future extensions for this study include: (1) considering the descriptions generated by other tools available on the market; (2) extending the number of participants to obtain a more precise judgment of the quality of the tools’ outputs; (3) considering different languages in order to understand whether the results differ depending on the localization of websites; and (4) defining a procedure for including visually impaired users in the evaluation process. Moreover, we hope that other experts can use our work as a starting point for more detailed analyses considering, for instance: (1) the technical reasons influencing the results provided by the considered tools (i.e., machine learning aspects), and (2) the linguistic aspects influencing the results (e.g., given the same semantic content, which kind of description is perceived as more effective in describing an image, and why).

Finally, in this paper we described an experiment on images belonging to generic categories. As future work, we plan to organize a second experiment to investigate what happens in the case of STEM images. These are generally complex images, showing for example graphs, charts, diagrams, and maps, which contain substantial information that cannot be conveyed in a short phrase or a sequence of tags. Labeling these images is a complex task even for humans. However, as educators, we must guarantee that visually impaired students have access to the same content as their sighted peers, and we therefore find it extremely useful to pursue research in this direction.