Information Retrieval, Volume 3, Issue 2, pp 127–163

Automating the Construction of Internet Portals with Machine Learning

  • Andrew Kachites McCallum
  • Kamal Nigam
  • Jason Rennie
  • Kristie Seymore


Domain-specific Internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, one such portal allows complex queries over summer camps by age, location, cost and specialty. This functionality is not possible with general, Web-wide search engines. Unfortunately, these portals are difficult and time-consuming to maintain. This paper advocates the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific Internet portals. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies. Using these techniques, we have built a demonstration system: a portal for computer science research papers. It already contains over 50,000 papers and is publicly available. These techniques are widely applicable to portal creation in other domains.
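As a rough illustration of the text classification step (placing documents into topic categories), the following is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. It is not the paper's method as such: it is plain supervised naive Bayes, without expectation-maximization over unlabeled data or a topic hierarchy. The function names, the toy training documents and the two topic labels are all hypothetical, chosen only for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents per class and words per class from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(tokens, model):
    """Return the class with the highest posterior log-probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing of the per-class word probability
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data: tokenized snippets with topic labels.
TRAIN = [
    ("reinforcement learning reward policy".split(), "machine learning"),
    ("hidden markov model learning".split(), "machine learning"),
    ("compiler parser grammar language".split(), "programming languages"),
    ("type system language semantics".split(), "programming languages"),
]
MODEL = train_naive_bayes(TRAIN)
```

Calling `classify("reward policy".split(), MODEL)` picks the topic whose training documents make those words most probable. A real portal would need far more training data per topic than this toy set.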

spidering, crawling, reinforcement learning, information extraction, hidden Markov models, text classification, naive Bayes, expectation-maximization, unlabeled data





Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • Andrew Kachites McCallum (1)
  • Kamal Nigam (2)
  • Jason Rennie (3)
  • Kristie Seymore (4)

  1. Just Research and Carnegie Mellon University, USA
  2. Carnegie Mellon University, USA
  3. Massachusetts Institute of Technology, USA
  4. Carnegie Mellon University, USA
