Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning

  • Chaitanya Chemudugunta
  • America Holloway
  • Padhraic Smyth
  • Mark Steyvers
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5318)


Human-defined concepts are fundamental building blocks in constructing knowledge bases such as ontologies. Statistical learning techniques provide an alternative, automated approach to concept definition, driven by data rather than prior knowledge. In this paper we propose a probabilistic modeling framework that combines human-defined concepts and data-driven topics in a principled manner. The methodology is based on applications of statistical topic models (also known as latent Dirichlet allocation models). We demonstrate the utility of this general framework in two ways. We first illustrate how the methodology can be used to automatically tag Web pages with concepts from a known set of concepts, without any need for labeled documents. We then perform a series of experiments that quantify how combining human-defined semantic knowledge with data-driven techniques leads to better language models than can be obtained with either alone.


ontologies, tagging, unsupervised learning, topic models



Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Chaitanya Chemudugunta (1)
  • America Holloway (1)
  • Padhraic Smyth (1)
  • Mark Steyvers (2)
  1. Department of Computer Science, University of California, Irvine, Irvine
  2. Department of Cognitive Science, University of California, Irvine, Irvine
