Extracting Enterprise Vocabularies Using Linked Open Data

  • Julian Dolby
  • Achille Fokoue
  • Aditya Kalyanpur
  • Edith Schonberg
  • Kavitha Srinivas
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5823)


A common vocabulary is vital to smooth business operation, yet codifying and maintaining an enterprise vocabulary is an arduous, manual task. We describe a process to automatically extract a domain-specific vocabulary (terms and types) from unstructured enterprise data, guided by term definitions in Linked Open Data (LOD). We validate our techniques by applying them to the IT (Information Technology) domain, using 58 Gartner analyst reports and two specific LOD sources: DBpedia and Freebase. We also present initial findings on how well these techniques generalize to vocabulary extraction in new domains, such as the energy industry.
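The abstract does not spell out the extraction process, so the following is only a minimal sketch of the general idea of LOD-guided vocabulary extraction, not the authors' method. It assumes a simple n-gram match of document text against a label-to-type table. The table `TOY_LOD`, the function names, and the `max_ngram` parameter are all hypothetical; a real pipeline would load term definitions from DBpedia or Freebase data rather than from a hard-coded dictionary.

```python
import re

# Hypothetical stand-in for LOD term definitions (lowercased label -> types).
# Assumption: a real system would populate this from DBpedia or Freebase.
TOY_LOD = {
    "cloud computing": {"Technology"},
    "virtualization": {"Technology"},
    "ibm": {"Company"},
    "service oriented architecture": {"Technology", "ArchitecturalStyle"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_vocabulary(documents, lod, max_ngram=3):
    """Collect every n-gram (up to max_ngram tokens) that has an entry in lod.

    Returns a dict mapping each matched term to its LOD types and the
    number of times it occurred across all documents.
    """
    vocab = {}
    for doc in documents:
        tokens = tokenize(doc)
        for n in range(1, max_ngram + 1):
            for i in range(len(tokens) - n + 1):
                candidate = " ".join(tokens[i:i + n])
                if candidate in lod:
                    entry = vocab.setdefault(
                        candidate, {"types": set(lod[candidate]), "count": 0}
                    )
                    entry["count"] += 1
    return vocab


if __name__ == "__main__":
    docs = [
        "IBM invests in cloud computing and virtualization.",
        "Cloud computing adoption grows.",
    ]
    for term, info in sorted(extract_vocabulary(docs, TOY_LOD).items()):
        print(term, sorted(info["types"]), info["count"])
```

Exact string matching is the simplest possible choice here; it ignores term variants, ambiguity, and terms that are absent from the LOD source, all of which a practical system would need to handle.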


Linked Data, Vocabulary Extraction



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Julian Dolby (1)
  • Achille Fokoue (1)
  • Aditya Kalyanpur (1)
  • Edith Schonberg (1)
  • Kavitha Srinivas (1)

  1. IBM Watson Research Center, Yorktown Heights, USA
