Efficient Word Retrieval by Means of SOM Clustering and PCA

  • Simone Marinai
  • Stefano Faini
  • Emanuele Marino
  • Giovanni Soda
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3872)


We propose an approach for efficient word retrieval from printed documents belonging to Digital Libraries. The approach combines word image clustering (based on Self-Organizing Maps, SOM) with Principal Component Analysis (PCA). The combination of these methods allows us to efficiently retrieve the matching words from large document collections without the need for a direct comparison of the query word with each indexed word.
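
The idea of combining clustering with PCA to avoid comparing the query with every indexed word can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the word representation, the map size, or how PCA is applied, so the random feature vectors, the 3x3 map, the per-cluster PCA, and the function names `train_som`, `assign`, `build_index`, and `search` are all assumptions made for illustration. The sketch also searches only the query's own cluster, which is a simplification.

```python
import numpy as np

def train_som(data, rows=3, cols=3, epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small SOM; returns prototypes of shape (rows * cols, dim)."""
    rng = np.random.default_rng(seed)
    # Initialise prototypes from randomly chosen training vectors.
    weights = data[rng.choice(len(data), rows * cols, replace=True)].astype(float)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    total, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / total
            lr = lr0 * (1.0 - frac)                  # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3     # shrinking neighbourhood
            bmu = np.argmin(np.sum((weights - x) ** 2, axis=1))
            d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))     # Gaussian neighbourhood
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

def assign(weights, x):
    """Index of the SOM unit whose prototype is closest to x."""
    return int(np.argmin(np.sum((weights - x) ** 2, axis=1)))

def build_index(data, weights, n_components=2):
    """Group vectors by SOM unit and fit a PCA for each non-empty cluster
    (per-cluster PCA is an assumption of this sketch)."""
    labels = np.array([assign(weights, x) for x in data])
    index = {}
    for unit in np.unique(labels):
        ids = np.flatnonzero(labels == unit)
        members = data[ids]
        mean = members.mean(axis=0)
        # PCA via SVD of the centred data; rows of vt are principal directions.
        _, _, vt = np.linalg.svd(members - mean, full_matrices=False)
        basis = vt[:n_components]
        index[int(unit)] = (ids, mean, basis, (members - mean) @ basis.T)
    return index

def search(query, weights, index, k=3):
    """Ids of the k nearest indexed vectors inside the query's cluster,
    ranked by distance in that cluster's reduced PCA space."""
    unit = assign(weights, query)
    if unit not in index:
        return []  # the query fell on a unit with no indexed words
    ids, mean, basis, projected = index[unit]
    q = (query - mean) @ basis.T
    order = np.argsort(np.sum((projected - q) ** 2, axis=1))[:k]
    return [int(ids[i]) for i in order]
```

With an array `data` of word feature vectors, one would call `weights = train_som(data)`, then `index = build_index(data, weights)`, and finally `search(query_vector, weights, index)`. Only the members of one cluster are compared with the query, and the comparison happens in a low-dimensional space.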


Keywords: Principal Component Analysis; Digital Library; Word Image; Word Representation; Query Word



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Simone Marinai (1)
  • Stefano Faini (1)
  • Emanuele Marino (1)
  • Giovanni Soda (1)
  1. Dipartimento di Sistemi e Informatica, Università di Firenze, Firenze, Italy
