Abstract.
This paper proposes a document image analysis system that extracts newspaper headlines from microfilm images, with a view to providing automatic indexing for news articles on microfilm. A major challenge is the poor image quality of microfilm: most images are inadequately illuminated and considerably dirty. To overcome this problem, we propose a new and effective method for separating characters from a noisy background, since conventional threshold selection techniques are inadequate for this kind of image. A run-length smoothing algorithm is then applied to extract the headlines. Experimental results confirm the validity of the approach.
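The abstract does not spell out the smoothing step, so the following is only a minimal sketch of run-length smoothing in its general form, not the system described in the paper. It assumes a binary image stored as nested lists with 1 for black and 0 for white. The function names `smooth_row` and `rlsa` and both threshold parameters are hypothetical. This sketch fills every white run no longer than the threshold, including runs that touch the row ends; some implementations fill only gaps bounded by black pixels on both sides.

```python
def smooth_row(row, threshold):
    """Return a copy of `row` (1 = black, 0 = white) in which every
    white run of length <= threshold has been turned black."""
    out = list(row)
    n = len(row)
    i = 0
    while i < n:
        if row[i] == 0:
            # Find the end of the current white run.
            j = i
            while j < n and row[j] == 0:
                j += 1
            # Fill the run only if it is short enough.
            if j - i <= threshold:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out


def rlsa(image, h_threshold, v_threshold):
    """Smooth rows and columns independently, then combine the two
    results with a pixel-wise logical AND."""
    horizontal = [smooth_row(r, h_threshold) for r in image]
    # Transpose, smooth each column as if it were a row, transpose back.
    columns = [list(c) for c in zip(*image)]
    smoothed_columns = [smooth_row(c, v_threshold) for c in columns]
    vertical = [list(r) for r in zip(*smoothed_columns)]
    return [
        [h & v for h, v in zip(h_row, v_row)]
        for h_row, v_row in zip(horizontal, vertical)
    ]


print(smooth_row([1, 0, 0, 1, 0, 0, 0, 0, 1], 2))
# prints [1, 1, 1, 1, 0, 0, 0, 0, 1]
```

Smoothing merges characters that lie close together into solid blocks, so a text line becomes a single connected region whose size can then be examined. Suitable thresholds depend on scan resolution and font size and would need to be tuned for the images at hand.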
Additional information
Received: 15 November 2002, Accepted: 19 May 2003, Published online: 30 January 2004
Correspondence to: Chew Lim Tan
Cite this article
Tan, C.L., Liu, Q.H. Extraction of newspaper headlines from microfilm for automatic indexing. IJDAR 6, 201–210 (2003). https://doi.org/10.1007/s10032-003-0111-2
DOI: https://doi.org/10.1007/s10032-003-0111-2