Constructing a Focused Taxonomy from a Document Collection

  • Olena Medelyan
  • Steve Manion
  • Jeen Broekstra
  • Anna Divoli
  • Anna-Lan Huang
  • Ian H. Witten
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7882)


We describe a new method for constructing custom taxonomies from document collections. It involves identifying relevant concepts and entities in text; linking them to knowledge sources like Wikipedia, DBpedia, Freebase, and any supplied taxonomies from related domains; disambiguating conflicting concept mappings; and selecting semantic relations that best group them hierarchically. An RDF model supports interoperability of these steps, and also provides a flexible way of including existing NLP tools and further knowledge sources. From 2,000 news articles we construct a custom taxonomy with 10,000 concepts and 12,700 relations, similar in structure to manually created counterparts. Evaluation by 15 human judges shows the precision to be 89% and 90% for concepts and relations respectively; recall was 75% with respect to a manually generated taxonomy for the same domain.
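The four steps named above (identify, link, disambiguate, group) can be illustrated with a toy sketch. This is not the authors' implementation: the `KNOWLEDGE` dictionary is a hypothetical stand-in for sources such as Wikipedia or Freebase, the word-overlap score is an assumed placeholder for the paper's disambiguation, and all function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy knowledge source: surface form -> candidate concepts.
# Each candidate has assumed context words and one broader (parent) concept.
KNOWLEDGE = {
    "apple": [
        {"id": "Apple_Inc.", "context": {"iphone", "company", "shares"},
         "broader": "Technology companies"},
        {"id": "Apple_(fruit)", "context": {"orchard", "juice", "harvest"},
         "broader": "Fruit"},
    ],
    "microsoft": [
        {"id": "Microsoft", "context": {"windows", "company", "software"},
         "broader": "Technology companies"},
    ],
    "pear": [
        {"id": "Pear", "context": {"orchard", "fruit", "harvest"},
         "broader": "Fruit"},
    ],
}


def extract_mentions(text):
    """Step 1: find tokens that match a surface form in the knowledge source."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return [t for t in tokens if t in KNOWLEDGE], set(tokens)


def disambiguate(mention, doc_tokens):
    """Steps 2-3: look up candidates, keep the one sharing most words with the document."""
    candidates = KNOWLEDGE[mention]
    return max(candidates, key=lambda c: len(c["context"] & doc_tokens))


def build_taxonomy(documents):
    """Step 4: group each chosen concept under its broader concept."""
    taxonomy = defaultdict(set)
    for text in documents:
        mentions, doc_tokens = extract_mentions(text)
        for mention in mentions:
            concept = disambiguate(mention, doc_tokens)
            taxonomy[concept["broader"]].add(concept["id"])
    return {parent: sorted(children) for parent, children in taxonomy.items()}


docs = [
    "Apple shares rose after the company unveiled a new iPhone.",
    "Microsoft said its Windows software sales grew.",
    "The orchard reported a record pear harvest.",
]
taxonomy = build_taxonomy(docs)
for parent, children in sorted(taxonomy.items()):
    print(parent, "->", children)
```

In this sketch the ambiguous mention "Apple" resolves to the company because the first document shares the words "shares", "company" and "iphone" with that candidate. A real system would replace the dictionary with queries against the knowledge sources and would store the resulting concepts and relations in an RDF model.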


Keywords: Concept Mapping; Document Collection; News Article; Knowledge Source; Related Taxonomy



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Olena Medelyan (1)
  • Steve Manion (1)
  • Jeen Broekstra (1)
  • Anna Divoli (1)
  • Anna-Lan Huang (2)
  • Ian H. Witten (2)

  1. Pingar Research, Auckland, New Zealand
  2. University of Waikato, Hamilton, New Zealand
