Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Conference paper in: Research and Advanced Technology for Digital Libraries (TPDL 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9316)

Abstract

Humanities scholars increasingly rely on digital archives for their research instead of time-consuming visits to physical archives. This shift in research method has a hidden cost, that of working with digitally processed historical documents: how much trust can a scholar place in noisy representations of source texts? In a series of interviews with historians about their use of digital archives, we found that scholars are aware that optical character recognition (OCR) errors may bias their results. They were, however, unable to quantify this bias or to indicate what information they would need to estimate it, even though such an estimate is important for assessing whether results are publishable. Based on the interviews and a literature study, we provide a classification of scholarly research tasks that accounts for their susceptibility to specific OCR-induced biases and for the data required for uncertainty estimations. We conducted a use case study on a national newspaper archive with example research tasks. From this we learned what data is typically available in digital archives and how it could be used to reduce and/or assess the uncertainty in result sets. We conclude that the current state of knowledge on the users’ side as well as on the tool makers’ and data providers’ side is insufficient and needs to be improved.

Notes

  1. http://voyant-tools.org/.

  2. www.delpher.nl/kranten.

  3. See http://lab.kbresearch.nl for examples.

  4. http://www.delpher.nl/nl/platform/pages/?title=kwaliteit+(ocr).

  5. http://www.loc.gov/standards/alto/.

  6. http://resolver.kb.nl/resolve?urn=ddd:010633906:mpeg21:p002:alto.

  7. http://lab.kbresearch.nl/static/html/impact.html.

  8. Available on http://dx.doi.org/10.6084/m9.figshare.1448810.

  9. http://www.delpher.nl/nl/platform/pages/?title=zoekhulp.

Acknowledgements

We would like to thank our interviewees for their contributions, the National Library of The Netherlands for their support and the reviewers for their helpful feedback. This research is funded by the Dutch COMMIT/ program.

Author information

Corresponding author

Correspondence to Myriam C. Traub.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Traub, M.C., van Ossenbruggen, J., Hardman, L. (2015). Impact Analysis of OCR Quality on Research Tasks in Digital Archives. In: Kapidakis, S., Mazurek, C., Werla, M. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2015. Lecture Notes in Computer Science, vol 9316. Springer, Cham. https://doi.org/10.1007/978-3-319-24592-8_19

  • DOI: https://doi.org/10.1007/978-3-319-24592-8_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24591-1

  • Online ISBN: 978-3-319-24592-8

  • eBook Packages: Computer Science, Computer Science (R0)
