Information Retrieval

, Volume 5, Issue 1, pp 5–40 | Cite as

Asymmetric Missing-data Problems: Overcoming the Lack of Negative Data in Preference Ranking

  • Aleksander Kołcz
  • Joshua Alspector


In certain classification problems there is a strong a asymmetry between the number of labeled examples available for each of the classes involved. In an extreme case, there may be a complete lack of labeled data for one of the classes while, at the same time, there are adequate labeled examples for the others, accompanied by a large body of unlabeled data. Since most classification algorithms require some information about all classes involved, label estimation for the un-represented class is desired. An important representative of this group of problems is that of user interest/preference modeling where there may be a large number of examples of what the user likes with essentially no counterexamples.

Recently, there has been much interest in applying the EM algorithm to incomplete data problems in the area of text retrieval and categorization. We adapt this approach to the asymmetric case of modeling user interests in news articles, where only labeled positive training data are available, with access to a large corpus of unlabeled documents. User modeling is here equivalent to that of user-specific document ranking. EM is used in conjunction with the Naive Bayes model while its output is also utilized by a Support Vector Machine and Rocchio's technique.

Our findings demonstrate that the EM algorithm can be quite effective in modeling the negative class under a number of different initialization schemes. Although primarily just the negative training examples are needed, a natural question is whether using all of the estimated labels (i.e., positive and negative) would be more (or less) beneficial. This is important considering that, in this context, the initialization of the negative class for EM is likely not to be very accurate. Experimental results suggest that EM output should be limited to negative label estimates only.

incomplete data problems imbalanced training data user modeling personalization information retrieval 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. Billsus D and Pazzani M (1999) A hybrid user model for news story classification. In: Seventh International Conference on User Modeling (UM '99). Scholar
  2. Blum A and Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of the 1998 Conference on Computational Learning Theory.Google Scholar
  3. Buckley C and Salton G (1995) Optimization of relevance feedback weights. In: Proceedings of 18th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 351–357.Google Scholar
  4. Buckley C, Salton G, Allan J and Singhal A (1995) Automatic query expansion using SMART: TREC-3. In: Harman DK, Ed., Proceedings of the 3rd Text Retrieval Conference (TREC-3) (NIST SP 500-225).Google Scholar
  5. Claypool M, Gokhale A, Miranda T, Murnikov P, Netes D and Sartin M (1999) Combining contentbased and collaborative filters in an online newspaper. ACM SIGIR Recommender Systems Workshop. Scholar
  6. Cohn D, Atlas Land Ladner R(1994) Improving generalization with active learning. Machine Learning, 15(2):201–221.Google Scholar
  7. Croft W and Harper D (1979) Using probabilistic models of document retrieval without relevance information.Journal of Documentation, 35:285–295.Google Scholar
  8. Dempster AP, Laird NM and Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm.Journal of the Royal Statistical Society, Series B, 39:1–38.Google Scholar
  9. Domingos P (1999) MetaCost: A general method for making classifiers cost-sensitive. In: Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD-99).Google Scholar
  10. Foltz PW and Dumais ST (1992) Personalized information delivery: An analysis of information filtering methods.Communications of the ACM, 35(12):51–60. Scholar
  11. Frakes W Band Baeza-Yates R (1992) Eds., Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Boston, MA.Google Scholar
  12. Harman D (1992) Ranking algorithms. In: Frakes WB and Baeza-Yates R, Eds., Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Boston, MA, pp. 363–392.Google Scholar
  13. Iyengar V, Apte C and Zhang T (2000) Active learning using adaptive resampling. In: Proceedings of ACM SIGKDD 2000.Google Scholar
  14. Japkowicz N (2000) The class imbalance problem: Significance and strategies. In: Proceedings of the 2000 International Conference on Artificial Intelligence. Scholar
  15. Jennings A, Higichi H and Liu H (1993) A user model neural network for a personal news service. Australian Telecommunication Research, 27(1):1–12.Google Scholar
  16. Joachims T (1997) Text categorization with support vector machines: Learning with many relevant features.Technical Report LS-8/23, University of Dortmund.Google Scholar
  17. Joachims T (1999) Making large-scale svm learning practical. In: Schoelkopf B, Burges C and Smola A, Eds., Advances in Kernel Methods-Support Vector Learning. MIT Press.Google Scholar
  18. Joachims T, Freitag D and Mitchell T (1997)Webwatcher: A tour guide for the world wide web. In: Proceedings of the International Joint Conference on Artificial Intelligence. www/ Scholar
  19. Kamba T, Sahagami H and Koseki Y (1997) ANATANAGONOMY: A personalized newspaper on the world wide web. Int. J. Human-Compuer Studies, 46:789–803.Google Scholar
  20. Kubat M, Holte R and Matwin S (1997) Learning when negative examples abound. In: Proceedings of the European Conference on Machine Learning, ECML'97, pp. 146–153. mkubat/publications/ Scholar
  21. Kubat M and Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the 14th International Conference on Machine Learning, ICML'97, pp. 179–186. Scholar
  22. Lang K (1995) NewsWeeder: Learning to filter NetNews. In: Proceedings of the 12th International Conference on Machine Learning: ICML-95, pp. 331–339.Google Scholar
  23. Lewis DD (1998) Naive (bayes) at forty: The independence assumption in information retrieval. In: Proceedings of the 10th European Conference on Machine Learning, pp. 4–15. ?lewis/papers/ Scholar
  24. Lewis DDand Gale WA(1994)Asequential algorithm for training text classifiers. In: Croft WBand van Rijsbergen CJ, Eds., SIGIR 94: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag, Dublin, Ireland, pp. 3–12.Google Scholar
  25. Lieberman H (1995) Letizia: An agent that assists web browsing. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 924–929. Letizia/Letizia-AAAI/Letizia.html.Google Scholar
  26. Losee RM (1998) Text Retrieval & Filtering: Analytic Models of Performance. Kluwer Academic Publishers, New York.Google Scholar
  27. McCallum AK and Nigam K (1997) Employing EM in pool-based active learning for text classification.In: Proceedings of the 1998 International Machine Learning Conference, pp. 25–32. Scholar
  28. McCallum AK and Nigam K (1998) A comparison of event models for naive bayes text classification. AAAI-98 Workshop on Learning for Text Categorization. Scholar
  29. McLachlan GJ and Krishnan T (1996) The EM Algorithm and Extensions. JohnWiley & Sons, Philadelphia, PA.Google Scholar
  30. Mitra M, Singhal Aand Buckley C(1997) Learning queries in a query zone. In: Proceedings of the 20th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 25–32.Google Scholar
  31. Morik K, Imboff M, Brockhausen P, Joachims T and Gather U (2000) Knowledge discovery and knowledge validation in intensive care. Artificial Intelligence in Medicine, 19(3):225–249.Google Scholar
  32. Morita Mand Shinoda Y (1994) Information filtering based on user behavior analysis and best match text retrieval.In: Proceedings of 17th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 272–281.Google Scholar
  33. Nickerson A, Japkowicz N and Milios E (2001) Using unsupervised learning to guide resampling in imbalanced data sets. In: Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics.Google Scholar
  34. Nigam K and Ghani R (2000) Analyzing the effectiveness and applicability of co-training. In: Proceedings of the Ninth International Conference on Information and Knowledge Management.Google Scholar
  35. Nigam K, McCallum AK, Thrun S and Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2):103–134.Google Scholar
  36. Pazzani M and Billsus D (1997) Learning and revising user profiles: The identification of interesting web sites.Machine Learning, 27:313–331. Scholar
  37. Pazzani M, Merz C, Murphy, P, Ali K, Hume T and Brunk C (1994) Reducing misclassification costs. In: 11th International Conference of Machine Learning, pp. 217–225. MLC94.pdf.Google Scholar
  38. Pednault E, Rosen BK and Apte C (2000) Handling imbalanced data sets in insurance risk modeling. Technical Report RC-21731, IBM.Google Scholar
  39. Platt J (2000) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.In: Smola A, Bartlett P, Scholkopf B and Schuurmans D, Eds., Advances in Large-Margin Classifiers (Neural Information Processing), MIT PressGoogle Scholar
  40. Porter M (1980) An algorithm for suffix stripping. Program (Automated Library and Information Systems), 14(3):130–137.Google Scholar
  41. Provost F (2000) Machine learning from imbalanced data sets 101. In: Japkowicz N, Ed., Learning from Imbalanced Data Sets-Papers from the AAAI Workshop. AAAI Press, Austin, TX, pp. 1–3.Google Scholar
  42. Robertson SE and Sparck-Jones K (1976) Relevance weighting of search terms. JASIS pp. 129–176.Google Scholar
  43. Rocchio J (1971) Relevance feedback in information retrieval. In: Salton G, Ed., The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, pp. 313–323.Google Scholar
  44. Salton G (1971) The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Melbourne, Australia.Google Scholar
  45. Schapire RE, Singer Yand Singhal, A(1998) Boosting and Rocchio applied to text filtering. In: Proceedings of 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 215–223. Scholar
  46. Schohn G and Cohn D (2000) Less is more: Active learning with support vector machines. In: Proceedings of the Seventeenth International Conference on Machine Learning.Google Scholar
  47. Seung HS, Opper M and Sompolinsky H (1992) Query by committee. In: Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pp. 287–294.Google Scholar
  48. Stevens C (1992) Automating the creation of information filters. Communications of the ACM, 35(12):48.Google Scholar
  49. van Rijsbergen CJ (1979) Information Retrieval. Buttersworth, London.Google Scholar
  50. Vapnik VN (1998) Statistical Learning Theory. John Wiley, New York.Google Scholar
  51. Yang Y and Liu X (1999) A re-examination of text categorization methods. In: Proceedings of the 22nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42–49. Scholar

Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Aleksander Kołcz
    • 1
  • Joshua Alspector
    • 2
  1. 1.Personalogy, Inc.Colorado SpringsUSA
  2. 2.Department of Electrical and Computer EngineeringUniversity of Colorado at Colorado SpringsColorado SpringsUSA

Personalised recommendations