1 Introduction

PDF is the predominant format used for scientific publications. Yet scientific PDFs frequently lack enough structural information for proper interpretation by screen readers, making them inaccessible for many visually impaired users [1, 2]. Accessible PDFs require tags and metadata. Tags serve as labels providing semantic information about elements (e.g., heading, figure, formula, table, link, list) in a PDF. Metadata such as the title and default language enables screen readers to read documents aloud in the correct language and with a meaningful, clearly marked title. Documents lacking tags and metadata can be impractical or nearly unusable for screen-reader users.

This study poses two research questions: 1) To what extent are PDFs in prominent academic repositories accessible? 2) To what extent are accessibility issues in academic articles known and addressed by repositories?

2 Related Work

Previous analyses of international repositories, such as Semantic Scholar and Web of Science, have reported that only 2 to 15% of papers meet certain minimal accessibility criteria [1, 2]. Nganji [1] manually analysed articles related to disability, while Wang et al. [2] used Adobe Acrobat to automatically assess PDFs from various scientific fields. Like Wang et al. [2], we took an automated approach to investigate the accessibility of PDFs across subjects. Instead of examining digital archives like Web of Science, we focused on the repositories of prominent publishers (Springer, Elsevier, ACM, and Wiley). In addition to analysing the accessibility of the PDFs within these major repositories, we investigated and compared their respective author guidelines with regard to accessibility requirements.

The existing studies looked at articles published between 2014 and 2018 [1] and between 2010 and 2019 [2]. This study provides an assessment of recently published articles, as 86% of our collected PDFs were published in 2023 or 2024.

Wang et al. [2] focused on five accessibility compliance criteria: alternative text (also called “alt text”), table headers, tagged PDFs (metadata), default language, and tab order. This study investigated title metadata and the presence of tags in greater detail by looking at the percentage of tagged content and the presence of semantically meaningful tags for headings, figures, tables, links, and lists. Based on our results as well as our analysis of author guidelines, we have created a set of recommendations to improve the accessibility of PDFs.

3 Methods

To address our two research questions, we conducted two quantitative content analyses: one performed automatically to assess the accessibility of PDFs contained in four prominent repositories, and another conducted manually to identify mentions of accessibility measures in the author submission guidelines from the same repositories.

3.1 Automatic Analysis of PDFs in Repositories

Our analysis focused on four academic repositories: Springer, Elsevier, ACM, and Wiley. Before the study, we examined the top journals from each of Google Scholar’s eight subject categories to assess where research is most frequently published. This preliminary survey enabled us to determine which repositories and keywords were more likely to cover a diverse set of subjects.

Although detailed accessibility testing requires manual checks, automated analysis makes it possible to quickly identify inherent issues in a large number of PDFs. Thus, we developed Python crawlers to gather the 50 most recent papers using 40 keywords in each of the four repositories, totalling 8,000 articles. For Elsevier, only open-access articles could be retrieved. Publication dates range from 1971 to 2024, but about 86% of the PDFs were published in 2023 or 2024. We analysed the papers using the PAVE engine [3]. The PAVE engine can generate an accessibility report based on the Matterhorn Protocol’s automatically checkable criteria [4]. We logged a selection of 10 criteria from the Matterhorn Protocol which are essential for accessible PDFs. These include 7 tag-based criteria (the proportion of tagged content and the use of tags for headings, figures, formulae, lists, tables, and links) and 3 metadata-based criteria (appropriately marked title data, appropriate language setting, and whether the document is marked as “tagged” in its metadata).
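The sampling design and the logged criteria described above can be summarised in a few lines of Python. This is only a minimal sketch: the identifier names and the short criterion labels are illustrative assumptions and are not taken from the actual crawlers or from the PAVE engine.

```python
# Sampling design as described in the text; all identifier names are illustrative.
REPOSITORIES = ("Springer", "Elsevier", "ACM", "Wiley")
KEYWORDS_PER_REPOSITORY = 40
PAPERS_PER_KEYWORD = 50

# The 10 logged criteria, grouped as in the text (labels are shorthand, not
# the official Matterhorn Protocol wording).
TAG_CRITERIA = (
    "proportion of tagged content",
    "heading tags",
    "figure tags",
    "formula tags",
    "list tags",
    "table tags",
    "link tags",
)
METADATA_CRITERIA = (
    "title metadata",
    "language setting",
    "marked as tagged",
)


def planned_sample_size() -> int:
    """Return the number of articles targeted by the crawl."""
    return len(REPOSITORIES) * KEYWORDS_PER_REPOSITORY * PAPERS_PER_KEYWORD


def criteria_count() -> int:
    """Return the total number of logged criteria."""
    return len(TAG_CRITERIA) + len(METADATA_CRITERIA)


if __name__ == "__main__":
    print(planned_sample_size())  # 4 * 40 * 50 = 8000
    print(criteria_count())       # 7 + 3 = 10
```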

Table 1. Accessibility criteria used to assess the PDFs based on Matterhorn Protocol

3.2 Manual Analysis of Author Submission Guidelines

To analyse the author submission guidelines from the four repositories, we looked for mentions of twelve important accessibility elements: structured content with tagged headings, defined lists, meaningful links, alternative text for images, accessible mathematical formulae, colour contrast, font size, accessible font type, proper reading order, accessible tables, plain language, and an explicit mention of accessibility or disability (Table 1). The criteria were based on the Web Content Accessibility Guidelines (WCAG) 2.1 [5] and other accessibility recommendations [6, 7].

4 Results

4.1 Tags in PDFs

Fig. 1. Distribution of PDFs with at least one tag and marked as tagged (in metadata) in the repositories (N = 8,000).

Figure description: A horizontal stacked bar graph displays the percentage of tagged PDFs. Elsevier has the highest percentage, with 78.85% of PDFs having at least one tag and 16.2% of these being marked as tagged. Springer follows with 67.2% tagged and 1.6% marked as tagged.

We considered two criteria to check whether a PDF is tagged: one relates to the metadata and the other to the content. The “tagged PDF” metadata is formal information entered by the document creator that indicates to assistive technologies that the document contains tags. The content criterion is based on whether at least one tag was positively identified by the PAVE engine. Documents collected from Elsevier were more frequently tagged than those from the other publishers, with about 79% of PDFs containing at least one tag, compared to 67% for Springer, 20% for Wiley, and 13% for ACM (Fig. 1). It should be noted, however, that the presence of a single tag is a very low threshold for accessibility. Across all repositories, about 88% of PDFs had more than 80% untagged content, i.e. most documents were completely or largely untagged. Moreover, only a small proportion of PDFs were also marked as “tagged” in their metadata. Only 16.25% of PDFs in Elsevier fulfil both conditions, whereas fewer than 3% do in the other repositories. This finding concurs with the analysis of Wang et al. [2], which found that just 13.4% of the analysed PDFs were marked as “tagged” in metadata.
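The two criteria can be combined into a simple classification per PDF. The sketch below assumes a hypothetical per-document record (`TagStatus`) holding the metadata flag and the number of identified tags; the record, its field names, and the category labels are assumptions for illustration and do not reflect the PAVE report format. The example records are invented and are not study data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TagStatus:
    """Hypothetical per-PDF record; field names are assumptions."""
    marked_in_metadata: bool  # the "tagged PDF" flag set by the document creator
    tag_count: int            # number of tags identified in the content


def classify(status: TagStatus) -> str:
    """Combine the content criterion (>= 1 tag) with the metadata criterion."""
    has_tag = status.tag_count >= 1
    if has_tag and status.marked_in_metadata:
        return "tagged and marked"
    if has_tag:
        return "tagged only"
    if status.marked_in_metadata:
        return "marked only"
    return "untagged"


def shares(statuses: Iterable[TagStatus]) -> Dict[str, float]:
    """Return the percentage of PDFs falling into each category."""
    counts = Counter(classify(s) for s in statuses)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100 * n / total for label, n in counts.items()}


if __name__ == "__main__":
    # Invented example records, not taken from the study.
    sample = [
        TagStatus(marked_in_metadata=True, tag_count=120),
        TagStatus(marked_in_metadata=False, tag_count=3),
        TagStatus(marked_in_metadata=False, tag_count=0),
        TagStatus(marked_in_metadata=False, tag_count=0),
    ]
    print(shares(sample))
```

Keeping the two criteria separate makes it visible when a document contains tags but does not announce them to assistive technologies, which is the most common case reported above.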

Figure 2 illustrates the distribution of semantically meaningful tags within the 3,578 PDFs containing at least one tag. None of the PDFs included a tagged mathematical formula, and only two PDFs across all repositories had any tagged tables. In Wiley, only figure tags were identified, while in Springer, fewer than 3% of PDFs contained a tagged figure, heading, or list. In ACM, a small number of papers contained at least one tagged list (8%), figure (13%), or heading (2%). In Elsevier, about 68% had at least one tagged figure, but only 11% included heading tags. Considering that these PDFs are academic articles, it is unlikely that the tag distribution reflects a real absence of formulae, headings, links, lists, tables, or figures in the documents. The distribution therefore indicates that most PDFs are not tagged properly.

Fig. 2. Distribution of semantically meaningful tags among the 3,578 PDFs with at least one tagged operator.

Figure description: A horizontal grouped bar graph plots the distribution of tags in the four academic repositories. PDFs with at least one tagged image form the largest group in all repositories, with Elsevier recording the highest share at 74%, followed by ACM at 20%.

Among the 1,161 PDFs containing at least one figure, about 77% included an alt text. However, the chosen method of investigation only checks whether the alt text field is non-empty, not whether its content is meaningful. A manual analysis suggests that the identified alt texts were largely not meaningful, an issue also pointed out by Nganji [1]. Moreover, the high proportion of PDFs with an alt text can be explained by the fact that the percentage is calculated on the smaller subsample of PDFs containing at least one figure. When considering all 8,000 PDFs analysed, the overall percentage of papers containing alt text drops to 11%, a number similar to those reported by Wang et al. (7.5%) [2] and Nganji (10.5%) [1].
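The drop from 77% to 11% follows directly from changing the denominator. The short calculation below reproduces it from the figures given in the text; because 77% is itself a rounded value, the result is approximate, and the variable names are illustrative.

```python
# Figures taken from the text; 77% is a rounded share, so results are approximate.
PDFS_WITH_FIGURE = 1161
ALT_TEXT_SHARE_IN_SUBSAMPLE = 0.77
TOTAL_PDFS = 8000

# Approximate number of PDFs with a non-empty alt text field.
pdfs_with_alt_text = PDFS_WITH_FIGURE * ALT_TEXT_SHARE_IN_SUBSAMPLE

# Same numerator, but relative to the full sample instead of the subsample.
overall_share = pdfs_with_alt_text / TOTAL_PDFS

print(f"{overall_share:.1%}")  # 11.2%
```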

4.2 Metadata in PDFs

No PDF claimed to fulfil the PDF/UA standard.

Fig. 3. Distribution of title metadata in repositories (N = 8,000).

Figure description: A horizontal stacked bar graph compares the four academic repositories by the percentage of PDFs fulfilling each title requirement. Wiley, Springer, and Elsevier show a high proportion of PDFs with the document title in XMP. ACM shows a high proportion of PDFs fulfilling the display document title requirement.

All 8,000 PDFs contained a title in their metadata. The information was frequently stored correctly in the dc:title field of the XMP metadata, except in ACM (Fig. 3). Nevertheless, the PDFs usually did not have the “DisplayDocTitle” viewer preference set, which guarantees that screen readers read the title of the document and not the file name. The exception was ACM, whose PDFs usually fulfilled only the DisplayDocTitle requirement.

Fig. 4. Distribution of PDFs with a set language metadata in repositories (N = 8,000).

Figure description: A horizontal bar graph shows, for each of the four academic repositories, the percentage of PDFs with a set language: Wiley 20%, Springer 94%, Elsevier 75%, and ACM 67%.

Figure 4 indicates that in Springer, Elsevier, and ACM, the PDFs usually have a set language, but this is not systematic, as none of the repositories reaches 100%. Wiley lags behind, with only 20% of its PDFs containing language metadata. In their sample of papers dating from 2010 to 2019, Wang et al. [2] found that this setting was on an upward trend, with “10% compliance in 2010 to more than 25% in 2019” (p.11). Our results suggest that this upward trend has continued.
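The title and language criteria discussed in this section correspond to a small number of entries in a PDF file: the dc:title field in the XMP metadata, and the `/Lang` and `/ViewerPreferences` → `/DisplayDocTitle` entries of the document catalog. The sketch below illustrates such checks under the assumption that a PDF library has already loaded the catalog into a plain dictionary and extracted the XMP title as a string. The function name and the result keys are hypothetical, and this is not how the PAVE engine implements its checks.

```python
from typing import Any, Dict, Mapping, Optional


def check_metadata(catalog: Mapping[str, Any], xmp_dc_title: Optional[str]) -> Dict[str, bool]:
    """Check title and language metadata on a catalog-like mapping.

    `catalog` is assumed to mirror the PDF document catalog as a plain dict;
    `xmp_dc_title` is assumed to be the dc:title value read from the XMP packet.
    """
    viewer_prefs = catalog.get("/ViewerPreferences") or {}
    language = catalog.get("/Lang") or ""
    return {
        # A non-empty dc:title gives the document a meaningful title.
        "title_in_xmp": bool(xmp_dc_title and xmp_dc_title.strip()),
        # DisplayDocTitle asks viewers to show the title instead of the file name.
        "display_doc_title": viewer_prefs.get("/DisplayDocTitle") is True,
        # A default language lets screen readers pick the right pronunciation.
        "language_set": bool(str(language).strip()),
    }


if __name__ == "__main__":
    # Invented example: language and XMP title present, DisplayDocTitle missing.
    example_catalog = {"/Lang": "en-US", "/ViewerPreferences": {}}
    print(check_metadata(example_catalog, "An example article title"))
```

Reporting the three checks separately matches the pattern observed above, where a document frequently meets one title requirement but not the other.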

4.3 Repository Author Guidelines

All publishers but Wiley included references to disabilities in their general author submission guidelines (Table 2). Nevertheless, the mentions were usually brief and superficial, with ACM being the exception. ACM had the most extensive coverage of accessibility measures, meeting half of the investigated criteria. However, based on our PDF analysis, this commitment did not translate into more accessible documents. This suggests that the accessibility requirements are not implemented correctly and that compliance would need to be checked before publication.

Table 2. Content analysis of the general author guidelines of the four analysed repositories

5 Discussion and Conclusion

Results indicate that even where general author guidelines include accessibility requirements, scientific PDFs are still largely not accessible. In particular, this study highlights that inaccessibility arises from a lack of tagging, which is a fundamental requirement for accessible documents.

Even though all publishers state that authors must follow the guidelines of their specific journals, the publisher’s general guidelines set a standard for all journals. For that reason, publishers can make accessibility a requirement for publication. Along with ensuring that articles follow visual standards, publishers could check that accessibility measures are implemented. As recent research has explored the potential of artificial intelligence for document accessibility [8, 9], much of the remediation work could be automated. In particular, a tool like PAVE or PAC3 could enable publishers to request or require authors to check the accessibility of the final version of a PDF before submission. This could be especially interesting for fields that use LaTeX more frequently than Microsoft Word, as the latter is associated with greater accessibility compliance than the former [2].

Although checking the final version of a PDF is crucial to ensure all tags are included, PDF accessibility must be considered from the beginning, both in templates and in author guidelines, because certain design choices (e.g. tables and colour contrast) cannot be easily remediated. Moreover, the correct use of heading styles can ensure that PDFs are structured correctly from the start. Favouring sans-serif fonts (e.g. Arial), as well as a minimum of 12-point size for body text, are also easy measures to integrate into guidelines. Finally, authors should provide alt text for their images and formulae.

Additionally, this study shows that many PDFs fulfil the default language setting, except in Wiley. Nevertheless, this setting is easy to correct and should be applied systematically. Similarly, all PDFs included title metadata, but it was usually not configured completely, as most PDFs met only one of the two title requirements (dc:title in XMP or DisplayDocTitle). While correcting metadata is not the most crucial of accessibility measures (unlike tagging), it is an easy step to implement.

5.1 Limitations

This study has some limitations. First, PDFs were analysed automatically. While this enabled an assessment of a large sample of articles, criteria such as reading order could not be tested. However, as the results indicate that PDFs usually lack tags, a more detailed analysis would not have changed the interpretation of the results significantly. A PDF cannot be correctly tagged if it is never tagged to begin with.

Additionally, this article focused primarily on the accessibility of PDFs for screen-reader users and therefore did not investigate further accessibility criteria such as plain language. Nevertheless, the analysis of the guidelines checked for recommendations regarding plain language and indicates that no publishers made it a requirement or recommendation. Further research could investigate how this is reflected in articles and identify best practices.

Finally, the second analysis focused on author submission guidelines to assess how they could have influenced the accessibility of the resulting PDF documents; PDF was chosen because of its widespread use. It is important to note that these results do not imply that publishers are completely neglecting accessibility services. Future studies could investigate how frequently alternative formats like HTML or ePub are available and assess their accessibility.