Abstract
Academic articles are commonly published in Portable Document Format (PDF). However, for many people with visual impairments, the PDF format presents significant accessibility issues. This study addresses two research questions: 1) To what extent are PDFs in prominent academic repositories accessible? and 2) To what extent are accessibility issues in academic articles known and addressed by repositories? To answer these questions, 8,000 PDFs from four prominent repositories (Springer, Elsevier, ACM, and Wiley) were retrieved and automatically analysed according to accessibility criteria based on the Matterhorn Protocol. Additionally, a quantitative content analysis was performed on the submission guidelines of the repositories to determine the degree to which accessibility is considered in document creation. Results suggest that most PDFs were not tagged, even though some repositories included accessibility in their general author guidelines. This paper concludes with recommendations to improve the accessibility of papers in academic repositories.
1 Introduction
PDF is the predominant format used for scientific publications. Yet scientific PDFs frequently lack the structural information that screen readers need to interpret them properly, making them inaccessible for many visually impaired users [1, 2]. Accessible PDFs require tags and metadata. Tags serve as labels providing semantic information about elements (e.g., heading, figure, formula, table, link, list) in a PDF. Metadata such as the title and default language enables screen readers to read documents aloud in the correct language and with a meaningful, clearly marked title. Documents lacking tags and metadata can be impractical or nearly unusable for screen-reader users.
This study poses two research questions: 1) To what extent are PDFs in prominent academic repositories accessible? 2) To what extent are accessibility issues in academic articles known and addressed by repositories?
2 Related Work
Previous analyses of international repositories, such as Semantic Scholar and Web of Science, have reported that only 2 to 15% of papers meet certain minimal accessibility criteria [1, 2]. Nganji [1] manually analysed articles related to disability, while Wang et al. [2] used Adobe Acrobat to automatically assess PDFs from various scientific fields. Like Wang et al. [2], we took an automated approach to investigate the accessibility of PDFs across subjects. Instead of examining digital archives like Web of Science, we focused on the repositories of prominent publishers (Springer, Elsevier, ACM, and Wiley). In addition to analysing the accessibility of the PDFs within these major repositories, we investigated and compared their respective author guidelines with regard to accessibility requirements.
The existing studies looked at articles published between 2014 and 2018 [1] and 2010 and 2019 [2]. This study provides an assessment of recently published articles, as 86% of our collected PDFs were published in 2023 or 2024.
Wang et al. [2] focused on five accessibility compliance criteria: alternative text (also called “alt text”), table headers, tagged PDFs (metadata), default language, and tab order. This study investigated title metadata and the presence of tags in greater detail by looking at the percentage of tagged content and the presence of semantically meaningful tags for headings, figures, tables, links, and lists. Based on our results as well as our analysis of author guidelines, we have created a set of recommendations to improve the accessibility of PDFs.
3 Methods
To address our two research questions, we conducted two quantitative content analyses: one performed automatically to assess the accessibility of PDFs contained in four prominent repositories, and another conducted manually to identify mentions of accessibility measures in the author submission guidelines from the same repositories.
3.1 Automatic Analysis of PDFs in Repositories
Our analysis focused on four academic repositories: Springer, Elsevier, ACM, and Wiley. Before the study, the top journals from each of Google Scholar’s eight subject categories were investigated to assess where research is most frequently published. This preliminary survey enabled us to determine which repositories and keywords were most likely to cover a diverse set of subjects.
Although detailed accessibility testing requires manual checks, automated analysis makes it possible to quickly identify inherent issues in a large number of PDFs. Thus, we developed Python crawlers to gather the 50 most recent papers for each of 40 keywords in each of the four repositories, totalling 8,000 articles. For Elsevier, only open-access articles could be retrieved. Publication dates range from 1971 to 2024, but about 86% of the PDFs were published in 2023 or 2024. We analysed the papers using the PAVE engine [3], which can generate an accessibility report based on the Matterhorn Protocol’s automatically checkable criteria [4]. We logged a selection of 10 criteria from the Matterhorn Protocol which are essential for accessible PDFs. These include 7 tag-based criteria (the proportion of tagged content and the use of tags for headings, figures, formulae, lists, tables, and links) and 3 metadata-based criteria (appropriately marked title data, appropriate language setting, and whether the document is marked as “tagged” in its metadata).
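The sample size and the grouping of the logged criteria can be restated as a short Python sketch. The variable names and criterion labels are illustrative and are not taken from the study's crawlers; the total also assumes that every keyword query contributed 50 papers.

```python
# Sampling design as described in the text: 50 papers per keyword,
# 40 keywords per repository, four repositories.
REPOSITORIES = ["Springer", "Elsevier", "ACM", "Wiley"]
PAPERS_PER_KEYWORD = 50
KEYWORDS_PER_REPOSITORY = 40

per_repository = PAPERS_PER_KEYWORD * KEYWORDS_PER_REPOSITORY
total = per_repository * len(REPOSITORIES)

# The 10 logged criteria, grouped as in the text (labels are illustrative).
CRITERIA = {
    "tags": [
        "proportion of tagged content",
        "headings", "figures", "formulae", "lists", "tables", "links",
    ],
    "metadata": ["title", "language", "marked as tagged"],
}
criteria_count = sum(len(group) for group in CRITERIA.values())

print(per_repository)   # 2000
print(total)            # 8000
print(criteria_count)   # 10
```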
3.2 Manual Analysis of Author Submission Guidelines
To analyse the author submission guidelines from the four repositories, we looked for mentions of twelve important accessibility elements: structured content with tagged headings, defined lists, meaningful links, alternative text for images, accessible mathematical formulae, colour contrast, font size, accessible font type, proper reading order, accessible tables, plain language, and an explicit mention of accessibility or disability (Table 1). The criteria were based on Web Content Accessibility Guidelines (WCAG) 2.1 [5] and other accessibility recommendations [6, 7].
4 Results
4.1 Tags in PDFs
We considered two criteria to check whether a PDF is tagged: one relates to the metadata and the other to the content. The “tagged PDF” metadata is formal information entered by the document creator that indicates to assistive technologies that the document contains tags. The content criterion is based on whether at least one tag was positively identified by the PAVE engine. Documents collected from Elsevier were more frequently tagged than those from the other publishers, with about 79% of PDFs containing at least one tag, compared to 67% for Springer, 20% for Wiley, and 13% for ACM (Fig. 1). It should be noted, however, that the presence of a single tag is a very low threshold for accessibility. Across all repositories, about 88% of PDFs had more than 80% untagged content, i.e. most documents were completely or largely untagged. Moreover, only a small proportion of PDFs were also marked as “tagged” in their metadata. Only 16.25% of PDFs from Elsevier fulfil both conditions, whereas less than 3% do in the other repositories. This finding concurs with the analysis of Wang et al. [2], which found that just 13.4% of the analysed PDFs were marked as “tagged” in metadata.
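The two conditions can be illustrated with a minimal Python sketch. It assumes that the “tagged” flag corresponds to the `/Marked` entry of the `/MarkInfo` dictionary in the PDF document catalog, which is where the PDF specification places it; whether the PAVE engine reads exactly this entry is an assumption. The `catalog` dictionary and the `tags_found` list are hypothetical inputs that a PDF library or checker would have to supply.

```python
def tagging_status(catalog, tags_found):
    """Evaluate the two 'tagged' criteria for a single document.

    catalog    -- hypothetical dict mirroring the PDF document catalog
    tags_found -- hypothetical list of structure tags reported by a checker
    """
    mark_info = catalog.get("/MarkInfo", {})
    marked = mark_info.get("/Marked") is True   # metadata criterion
    has_tags = len(tags_found) > 0              # content criterion
    return {
        "marked_in_metadata": marked,
        "has_tags": has_tags,
        "both": marked and has_tags,
    }


# A document with tags in its content but no "tagged" flag in its metadata.
print(tagging_status({}, ["Figure"]))
# A document that fulfils both conditions.
print(tagging_status({"/MarkInfo": {"/Marked": True}}, ["H1", "P", "Figure"]))
```

Keeping the two criteria separate makes it possible to count documents that satisfy only one of them, which is the situation reported for most repositories.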
Figure 2 illustrates the distribution of semantically meaningful tags within the 3,578 PDFs containing at least one tag. None of the PDFs included a tagged mathematical formula, and only two PDFs across all repositories had any tagged tables. In Wiley, only figure tags were identified, while in Springer, less than 3% of PDFs contained a tagged figure, heading, or list. In ACM, a small number of papers contained at least one tagged list (8%), figure (13%), or heading (2%). In Elsevier, about 68% had at least one tagged figure, but only 11% included heading tags. Considering that these PDFs are academic articles, it is unlikely that the tag distribution reflects a real absence of formulae, headings, links, lists, tables, or figures in the documents. This means that most PDFs are not tagged properly.
Among the 1,161 PDFs containing at least one figure, about 77% included an alt text. However, the chosen method of investigation only checks whether the alt text field is empty. A manual analysis suggests that the identified alt texts were largely not meaningful, an issue that was pointed out by Nganji [1]. Moreover, the high proportion of PDFs with an alt text can be explained by the fact that the percentage is calculated based on the smaller subsample of PDFs containing at least one figure. When considering all 8,000 PDFs analysed, the overall percentage of papers containing alt text drops to 11%, a number similar to those reported by Wang et al. (7.5%) [2] and Nganji (10.5%) [1].
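The drop from 77% to 11% follows from the change of denominator, as the following Python sketch shows. Because the text reports “about 77%”, the intermediate count is an approximation rather than a figure from the study.

```python
pdfs_with_figure = 1161
share_with_alt_text = 0.77   # "about 77%", so the count below is approximate
total_pdfs = 8000

with_alt_text = round(pdfs_with_figure * share_with_alt_text)
overall_share = with_alt_text / total_pdfs

print(with_alt_text)            # 894
print(f"{overall_share:.0%}")   # 11%
```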
4.2 Metadata in PDFs
No PDF claimed to fulfil the PDF/UA standard.
All 8,000 PDFs contained a title in their metadata. The title was frequently stored correctly in the dc:title field of the XMP metadata, except in ACM (Fig. 3). Nevertheless, most PDFs did not have the “DisplayDocTitle” viewer preference set, which ensures that screen readers announce the title of the document rather than the file name. The exception was ACM, whose PDFs usually fulfilled only the DisplayDocTitle requirement.
Figure 4 indicates that in Springer, Elsevier, and ACM, the PDFs usually have a language set, but not systematically, as no repository reaches 100%. Wiley lags behind, with only 20% of its PDFs containing language metadata. In their sample of papers dating from 2010 to 2019, Wang et al. [2] found that this setting was on an upward trend, with “10% compliance in 2010 to more than 25% in 2019” (p. 11). This upward trend thus appears to have continued.
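The three settings discussed in this section (title in dc:title, DisplayDocTitle, and document language) can be checked along the following lines. This is a minimal sketch, not the PAVE implementation: it assumes that the XMP packet is available as a string and that the document catalog has been converted into a plain dictionary by some PDF library. The function names, the `catalog` dictionary, and the sample XMP packet are hypothetical.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"


def xmp_title(xmp_xml):
    """Return the first non-empty dc:title entry of an XMP packet, or None."""
    root = ET.fromstring(xmp_xml)
    title = root.find(f".//{DC}title")
    if title is None:
        return None
    item = title.find(f".//{RDF}li")
    text = item.text if item is not None else None
    return text.strip() if text and text.strip() else None


def metadata_checks(catalog, xmp_xml):
    """Check title, DisplayDocTitle, and language for one document."""
    prefs = catalog.get("/ViewerPreferences", {})
    return {
        "title_in_xmp": xmp_title(xmp_xml) is not None,
        "display_doc_title": prefs.get("/DisplayDocTitle") is True,
        "language_set": bool(catalog.get("/Lang")),
    }


# Hypothetical inputs for illustration only.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Sample title</rdf:li></rdf:Alt></dc:title>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""

sample_catalog = {"/Lang": "en-GB", "/ViewerPreferences": {"/DisplayDocTitle": True}}
print(metadata_checks(sample_catalog, SAMPLE_XMP))
```

A document that stores its title correctly but lacks the viewer preference would return `True` for `title_in_xmp` and `False` for `display_doc_title`, which is the pattern described for most repositories other than ACM.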
4.3 Repository Author Guidelines
All publishers but Wiley included references to disabilities in their general author submission guidelines (Table 2). Nevertheless, the mentions were usually brief and superficial, with ACM being the exception. ACM had the most extensive coverage of accessibility measures, meeting half of the investigated criteria. However, based on our PDF analysis, this commitment did not translate into more accessible documents. This suggests that the accessibility requirements are not implemented correctly and that compliance would need to be checked before publication.
5 Discussion and Conclusion
Results indicate that even where general author guidelines include accessibility requirements, scientific PDFs are still largely not accessible. In particular, this study highlights that inaccessibility arises from a lack of tagging, which is a fundamental requirement for accessible documents.
Even though all publishers state that authors must follow the guidelines of their specific journals, the publisher’s general guidelines set a standard for all journals. For that reason, publishers can make accessibility a requirement for publication. Along with ensuring that articles follow visual standards, publishers could check that accessibility measures are implemented. As recent research has explored the potential of artificial intelligence for document accessibility [8, 9], much of the remediation work could be automated. In particular, a tool like PAVE or PAC3 could enable publishers to request or require authors to check the accessibility of the final version of a PDF before submission. This could be especially relevant for fields that use LaTeX more frequently than Microsoft Word, as the latter is associated with greater accessibility compliance than the former [2].
Although checking the final version of a PDF is crucial to ensure all tags are included, PDF accessibility must be considered from the beginning, both in templates and in author guidelines, because certain design choices (e.g. tables and colour contrast) cannot be easily remediated. Moreover, the correct use of heading styles can ensure that PDFs are structured correctly from the start. Favouring sans-serif fonts (e.g. Arial) and requiring a minimum of 12-point size for body text are also easy measures to integrate into guidelines. Finally, authors should provide alt text for their images and formulae.
Additionally, this study shows that many PDFs have a default language set, except in Wiley. Nevertheless, this setting is easy to correct and should be applied systematically. Similarly, all PDFs included title metadata, but it was not always stored correctly or set to be displayed. While correcting metadata is not the most crucial of accessibility measures (unlike tagging), it is an easy step to implement.
5.1 Limitations
This study has some limitations. First, PDFs were analysed automatically. While this enabled an assessment of a large sample of articles, criteria such as reading order could not be tested. However, as the results indicate that PDFs usually lack tags, a more detailed analysis would not have changed the interpretation of the results significantly. A PDF cannot be correctly tagged if it is never tagged to begin with.
Additionally, this article focused primarily on the accessibility of PDFs for screen-reader users and therefore did not investigate further accessibility criteria such as plain language. Nevertheless, the analysis of the guidelines checked for recommendations regarding plain language and indicates that no publishers made it a requirement or recommendation. Further research could investigate how this is reflected in articles and identify best practices.
Finally, because of the widespread use of the PDF format, the second analysis focused on author submission guidelines to assess how they could have influenced the accessibility of the resulting PDF documents. It is important to note that these results do not imply that publishers are completely neglecting accessibility. Future studies could investigate how frequently alternative formats like HTML or ePub are available and assess their accessibility.
References
1. Nganji, J.T.: An assessment of the accessibility of PDF versions of selected journal articles published in a WCAG 2.0 era (2014–2018). Learned Publishing 31, 391–401 (2018). https://doi.org/10.1002/leap.1197
2. Wang, L.L., et al.: Improving the accessibility of scientific documents: current state, user needs, and a system solution to enhance scientific PDF accessibility for blind and low vision users. arXiv:2105.00076 (2021)
3. Doblies, L., Stolz, D., Darvishy, A., Hutter, H.-P.: PAVE: a Web application to identify and correct accessibility problems in PDF documents. In: Miesenberger, K., Fels, D., Archambault, D., Peňáz, P., Zagler, W. (eds.) ICCHP 2014. LNCS, vol. 8547, pp. 185–192. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08596-8_29
4. PDF Association: Matterhorn Protocol 1.1 PDF/UA Conformance Testing Model (2021). https://pdfa.org/wp-content/uploads/2021/04/Matterhorn-Protocol-1-1.pdf
5. World Wide Web Consortium (W3C): Web Content Accessibility Guidelines (WCAG) 2.1 (2023). https://www.w3.org/TR/WCAG21/
6. British Dyslexia Association: Dyslexia Style Guide (2023). https://cdn.bdadyslexia.org.uk/uploads/documents/Advice/style-guide/BDA-Style-Guide-2023.pdf?v=1680514568
7. Coolidge, A., Doner, S., Robertson, T., Gray, J.: Accessibility toolkit. BCcampus (2018)
8. Schmitt-Koopmann, F.M., Huang, E.M., Darvishy, A.: Accessible PDFs: applying artificial intelligence for automated remediation of STEM PDFs. In: Proceedings of the 24th International ACM SIGACCESS Conference on Computers and Accessibility, Athens, Greece, pp. 1–6. ACM (2022). https://doi.org/10.1145/3517428.3550407
9. Darvishy, A., Nevill, M., Hutter, H.-P.: Automatic paragraph detection for accessible PDF documents. In: Miesenberger, K., Bühler, C., Penaz, P. (eds.) ICCHP 2016. LNCS, vol. 9758, pp. 367–372. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41264-1_50
Ethics declarations
The authors have no competing interests to declare that are relevant to the content of this article.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)
About this paper
Cite this paper
Pierrès, O., Schmitt-Koopmann, F., Darvishy, A. (2024). PDF Accessibility in International Academic Publishers. In: Miesenberger, K., Peňáz, P., Kobayashi, M. (eds) Computers Helping People with Special Needs. ICCHP 2024. Lecture Notes in Computer Science, vol 14750. Springer, Cham. https://doi.org/10.1007/978-3-031-62846-7_5
DOI: https://doi.org/10.1007/978-3-031-62846-7_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-62845-0
Online ISBN: 978-3-031-62846-7
eBook Packages: Computer Science, Computer Science (R0)