A Framework for Text Processing and Supporting Access to Collections of Digitized Historical Newspapers

  • Robert B. Allen
  • Andrea Japzon
  • Palakorn Achananuparp
  • Ki Jung Lee
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4558)


Large quantities of historical newspapers are being digitized and processed with OCR. We describe a framework for processing the OCR text to identify articles and extract metadata for them. We describe the article schema and give examples of features that facilitate automatic indexing of the articles. For this processing, we employ lexical semantics, structural models, and community content. We also describe visualization and summarization techniques that can be used to present the extracted events.


Digital Library, News Article, Historical Newspaper, News Story, Topic Detection
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Robert B. Allen (1)
  • Andrea Japzon (1)
  • Palakorn Achananuparp (1)
  • Ki Jung Lee (1)

  1. College of Information Science and Technology, Drexel University, Philadelphia, PA
