Automated Processing of Digitized Historical Newspapers beyond the Article Level: Sections and Regular Features

  • Robert B. Allen
  • Catherine Hall
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6102)


Millions of pages of historical newspapers have been digitized, but in most cases access to them is supported only by basic search services. We are exploring interactive services that would improve access to these collections, including automatic categorization of articles. Such categorization is difficult because of the uneven quality of the OCR text, but many clues can be used to improve its accuracy. Here, we describe observations of several historical newspapers to determine the characteristics of their sections. We then explore how to identify those sections automatically and how to detect serialized feature articles that recur across days and weeks. The goal is not to introduce new algorithms but to develop practical and robust techniques. For both analyses we find substantial success for some categories and articles, but others prove very difficult.


Access, Classification, Digital Humanities, Historian’s Workbench, Newspapers, Text Processing




Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Robert B. Allen (1)
  • Catherine Hall (1)
  1. The iSchool at Drexel University, Philadelphia, USA
