Naive (Bayes) at forty: The independence assumption in information retrieval

  • David D. Lewis
Invited Papers
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1398)

Abstract

The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval. We review some of the variations of naive Bayes models used for text retrieval and classification, focusing on the distributional assumptions made about word occurrences in documents.


Copyright information

© Springer-Verlag 1998

Authors and Affiliations

  • David D. Lewis (1)
  1. AT&T Labs - Research, Florham Park, USA
