Improving the retrieval of information from external sources

Abstract

A major barrier to successful retrieval from external sources (e.g., electronic databases) is the tremendous variability in the words that people use to describe objects of interest. The fact that different authors use different words to describe essentially the same idea means that relevant objects will be missed; conversely, the fact that the same word can be used to refer to many different things means that irrelevant objects will be retrieved. We describe a statistical method called latent semantic indexing, which models the implicit higher-order structure in the association of words and objects and improves retrieval performance by up to 30%. Additional large performance improvements of 40% and 67% can be achieved through the use of differential term weighting and iterative retrieval methods.
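The vocabulary mismatch described above can be illustrated with a small sketch. This is not the author's experimental setup: the five-term, four-document matrix, the choice of k = 2, and the helper names `lsi_scores` and `keyword_scores` are all assumptions made for illustration. The sketch uses a truncated singular value decomposition to project documents and a query into a shared low-dimensional space, then ranks by cosine similarity.

```python
import numpy as np

# Hypothetical toy collection: rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0, 0],   # car:        doc 0 only
    [0, 1, 0, 0],   # automobile: doc 1 only
    [1, 1, 0, 0],   # engine:     docs 0 and 1
    [0, 0, 1, 1],   # flower:     docs 2 and 3
    [0, 0, 1, 0],   # petal:      doc 2 only
], dtype=float)


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def keyword_scores(A, q):
    """Baseline: compare the query with raw term vectors of each document."""
    return [cosine(q, A[:, j]) for j in range(A.shape[1])]


def lsi_scores(A, q, k):
    """Compare query and documents after projection onto the top-k
    left singular vectors of the term-document matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]            # basis of the k-dimensional latent space
    docs_k = Uk.T @ A        # document coordinates, shape (k, n_docs)
    q_k = Uk.T @ q           # query coordinates, shape (k,)
    return [cosine(q_k, docs_k[:, j]) for j in range(A.shape[1])]


# One-word query: "car".
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0

raw = keyword_scores(A, q)
lsi = lsi_scores(A, q, k=2)
for j, (r, l) in enumerate(zip(raw, lsi)):
    print(f"doc {j}: keyword={r:.3f}  lsi={l:.3f}")
```

In this toy collection, document 1 never contains the word "car", so the keyword baseline scores it at zero. Because "car" and "automobile" both co-occur with "engine", the reduced space places document 1 close to the query, while the two documents about flowers stay near zero.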

Dumais, S.T. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments, & Computers 23, 229–236 (1991). https://doi.org/10.3758/BF03203370

  • DOI: https://doi.org/10.3758/BF03203370

Keywords

  • Singular Value Decomposition
  • Relevant Document
  • Relevance Feedback
  • Latent Semantic Analysis
  • Test Collection