Ontology-Driven Information Extraction with OntoSyphon

  • Luke K. McDowell
  • Michael Cafarella
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4273)


The Semantic Web’s need for machine-understandable content has led researchers to attempt to automatically acquire such content from a number of sources, including the web. To date, such research has focused on “document-driven” systems that individually process a small set of documents, annotating each with respect to a given ontology. This paper introduces OntoSyphon, an alternative that strives to more fully leverage existing ontological content while scaling to extract comparatively shallow content from millions of documents. OntoSyphon operates in an “ontology-driven” manner: taking any ontology as input, OntoSyphon uses the ontology to specify web searches that identify possible semantic instances, relations, and taxonomic information. Redundancy in the web, together with information from the ontology, is then used to automatically verify these candidate instances and relations, enabling OntoSyphon to operate in a fully automated, unsupervised manner. A prototype of OntoSyphon is fully implemented, and we present experimental results that demonstrate substantial instance learning in a variety of domains based on independently constructed ontologies. We also introduce new methods for improving instance verification and demonstrate that they improve upon previously known techniques.
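The ontology-driven loop described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy ontology, the in-memory corpus standing in for web search, the single "<class plural> such as" phrase, and the scoring ratio (pattern hits divided by total hits, loosely in the spirit of pointwise mutual information). The abstract does not specify the actual search phrases or verification measures that OntoSyphon uses.

```python
import re
from collections import Counter

# Hypothetical toy ontology: class name -> plural form used in search phrases.
ONTOLOGY = {"mammal": "mammals", "bird": "birds"}

# Hypothetical stand-in for a web-scale corpus or search index.
CORPUS = [
    "mammals such as dogs are common pets",
    "mammals such as dogs and whales nurse their young",
    "birds such as sparrows nest in hedges",
    "mammals such as sparrows is an obvious mistake",
    "sparrows and other birds migrate in autumn",
    "dogs bark",
    "sparrows sing and sparrows fly",
]


def find_candidates(plural):
    """Count the words that directly follow '<plural> such as' in the corpus."""
    pattern = re.compile(r"\b" + re.escape(plural) + r" such as (\w+)")
    counts = Counter()
    for sentence in CORPUS:
        counts.update(pattern.findall(sentence))
    return counts


def total_hits(word):
    """Count sentences that mention the word at all (a stand-in for a hit count)."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return sum(1 for sentence in CORPUS if pattern.search(sentence))


def score_candidates(class_name):
    """Rank candidates by pattern hits / total hits (an assumed PMI-style ratio)."""
    plural = ONTOLOGY[class_name]
    scores = {}
    for word, pattern_hits in find_candidates(plural).items():
        scores[word] = pattern_hits / total_hits(word)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for class_name in ONTOLOGY:
        print(class_name, score_candidates(class_name))
```

In this toy setup, a candidate that appears with a class phrase in a large share of its mentions ranks above one that appears there only occasionally, which is one simple way redundancy could separate plausible instances from noise.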


Keywords: Assessment Technique, Candidate Pair, Root Class, Pointwise Mutual Information, Unsupervised Manner



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Luke K. McDowell (1)
  • Michael Cafarella (2)
  1. Computer Science Department, U.S. Naval Academy, Annapolis, USA
  2. Dept. of Computer Science and Engineering, University of Washington, Seattle, USA
