Abstract
Finding useful information in large multimodal document collections such as the WWW without encountering numerous false positives is a challenge for multimedia information retrieval (MMIR) systems. This research addresses the problem of finding pictures. It exploits the fact that images do not appear in isolation but with accompanying, collateral text. Taken independently, existing text-based and image-based techniques for picture retrieval have several limitations. This research presents a general model for multimodal information retrieval that addresses the following issues: (i) users' information needs, (ii) expressing an information need through composite, multimodal queries, and (iii) determining the most appropriate weighted combination of indexing techniques to best satisfy that need. A machine learning approach is proposed for the last of these. The focus is on improving precision and recall in an MMIR system by optimally combining text and image similarity. Experiments are presented that demonstrate the utility of individual indexing systems in improving overall average precision.
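The idea of a weighted combination of text and image similarity, evaluated by average precision, can be illustrated with a small sketch. This is not the model or the learning method of the paper, which the abstract does not specify. It assumes a simple linear blend with a single weight `w`, substitutes a grid search for the machine learning step, and uses invented similarity scores. All function names and data here are hypothetical.

```python
from typing import Dict, List, Set


def combined_score(text_sim: float, image_sim: float, w: float) -> float:
    """Linear blend (an assumption): w weights text, (1 - w) weights image."""
    return w * text_sim + (1.0 - w) * image_sim


def rank(text_sims: Dict[str, float], image_sims: Dict[str, float], w: float) -> List[str]:
    """Order document ids by descending combined score, ties broken by id."""
    return sorted(
        text_sims,
        key=lambda d: (-combined_score(text_sims[d], image_sims[d], w), d),
    )


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank holding a relevant item."""
    hits = 0
    total = 0.0
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def best_weight(
    text_sims: Dict[str, float],
    image_sims: Dict[str, float],
    relevant: Set[str],
    steps: int = 10,
) -> float:
    """Grid search over w in [0, 1]; a stand-in for a learned weighting."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(
        candidates,
        key=lambda w: average_precision(rank(text_sims, image_sims, w), relevant),
    )


# Invented scores for four pictures; "a" and "b" are the relevant ones.
text_sims = {"a": 0.9, "b": 0.3, "c": 0.8, "d": 0.1}
image_sims = {"a": 0.3, "b": 0.9, "c": 0.1, "d": 0.8}
relevant = {"a", "b"}

w_star = best_weight(text_sims, image_sims, relevant)
print(w_star, average_precision(rank(text_sims, image_sims, w_star), relevant))
```

In this toy data, text similarity alone ranks the irrelevant picture "c" above "b", and image similarity alone ranks "d" above "a". A mixed weight places both relevant pictures first, which is the effect the combination is meant to achieve.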
Cite this article
Srihari, R.K., Zhang, Z. & Rao, A. Intelligent Indexing and Semantic Retrieval of Multimodal Documents. Information Retrieval 2, 245–275 (2000). https://doi.org/10.1023/A:1009962928226
DOI: https://doi.org/10.1023/A:1009962928226