Language Resources and Evaluation, Volume 42, Issue 4, pp 395–408


Cleaning noisy data using semantic technology
  • Chris Welty
  • J. William Murdock
  • James Fan


In our research on using information extraction (IE) to help populate semantic web resources, we have encountered significant obstacles to interoperability between the technologies. We believe these obstacles to be endemic to the basic paradigms and not quirks of the specific implementations we have worked with. In particular, we identify five dimensions of interoperability that must be addressed to successfully employ information extraction systems to populate semantic web resources that are suitable for reasoning. We call the task of transforming IE data into knowledge-based resources "knowledge integration", and we report results of experiments in which the knowledge integration process uses the deeper semantics of OWL ontologies to improve the precision of relation extraction from text by between 8% and 13%.
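The idea of using ontology semantics to clean extracted relations can be illustrated with a minimal sketch. It is not the system described in the article: it assumes a toy ontology with only domain, range, and disjointness constraints, and every class name, relation name, entity, and gold label below is hypothetical. Extracted triples whose argument types are disjoint with a relation's declared domain or range are discarded, and precision is compared before and after.

```python
from dataclasses import dataclass

# Hypothetical toy ontology: relation -> (domain class, range class).
DOMAIN_RANGE = {
    "locatedIn": ("Place", "Place"),
    "employedBy": ("Person", "Organization"),
}

# Hypothetical disjointness axioms between classes.
DISJOINT = {
    frozenset(("Person", "Place")),
    frozenset(("Person", "Organization")),
    frozenset(("Organization", "Place")),
}


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str
    correct: bool  # gold label, used only to measure precision


def violates(triple, types):
    """True if an argument's type is disjoint with the relation's domain or range."""
    domain, range_ = DOMAIN_RANGE[triple.relation]
    subject_type = types[triple.subject]
    object_type = types[triple.obj]
    return (
        frozenset((subject_type, domain)) in DISJOINT
        or frozenset((object_type, range_)) in DISJOINT
    )


def clean(triples, types):
    """Keep only triples that are consistent with the toy ontology."""
    return [t for t in triples if not violates(t, types)]


def precision(triples):
    """Fraction of triples marked correct; 0.0 for an empty list."""
    return sum(t.correct for t in triples) / len(triples) if triples else 0.0


# Made-up entity types and extractor output, for illustration only.
TYPES = {"alice": "Person", "acme": "Organization",
         "paris": "Place", "france": "Place"}
EXTRACTED = [
    Triple("alice", "employedBy", "acme", True),
    Triple("paris", "locatedIn", "france", True),
    Triple("alice", "locatedIn", "acme", False),   # Person cannot fill a Place slot
    Triple("paris", "employedBy", "acme", False),  # Place cannot fill a Person slot
    Triple("france", "locatedIn", "paris", False), # wrong, but type-consistent
]

if __name__ == "__main__":
    cleaned = clean(EXTRACTED, TYPES)
    print("precision before:", precision(EXTRACTED))
    print("precision after: ", precision(cleaned))
```

The last triple shows the limit of this kind of filtering: an error that satisfies every constraint survives, so consistency checking can raise precision without guaranteeing correctness. The figures produced by this toy data say nothing about the 8% to 13% improvement reported above.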


Information extraction, OWL reasoning, Ontologies



This work was supported in part by the DTO (née ARDA) NIMD program.



Copyright information

© Springer Science+Business Media B.V. 2009

Authors and Affiliations

  1. IBM Watson Research Center, Hawthorne, USA
