SemantiClean

Cleaning noisy data using semantic technology

Language Resources and Evaluation

Abstract

In our research on using information extraction (IE) to help populate semantic web resources, we have encountered significant obstacles to interoperability between the two technologies. We believe these obstacles are endemic to the underlying paradigms rather than quirks of the specific implementations we have worked with. In particular, we identify five dimensions of interoperability that must be addressed in order to use information extraction systems to populate semantic web resources that are suitable for reasoning. We call the task of transforming IE data into knowledge-based resources "knowledge integration", and we report results of experiments in which the knowledge integration process uses the deeper semantics of OWL ontologies to improve the precision of relation extraction from text by between 8% and 13%.
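To make the idea of using ontology semantics to raise precision concrete, here is a minimal sketch. It is not the authors' system: the toy ontology, the entity names, and the helpers `conflicts`, `is_consistent`, and `precision` are all hypothetical, and the data is invented purely for illustration. The sketch assumes that a relation's domain and range imply a type for its arguments, and that an extracted triple can be discarded when that implied type is declared disjoint with the type the extractor assigned.

```python
from dataclasses import dataclass

# Hypothetical toy ontology (an assumption, not taken from the paper):
# pairwise-disjoint classes plus a domain and range for each relation.
DISJOINT = {
    frozenset({"Person", "Organization"}),
    frozenset({"Person", "Location"}),
    frozenset({"Organization", "Location"}),
}
DOMAIN_RANGE = {
    "employedBy": ("Person", "Organization"),
    "locatedIn": ("Organization", "Location"),
}


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str


def conflicts(type_a: str, type_b: str) -> bool:
    """True when the two classes are declared disjoint."""
    return frozenset({type_a, type_b}) in DISJOINT


def is_consistent(triple: Triple, entity_types: dict) -> bool:
    """Reject a triple whose argument types clash with the relation's domain or range."""
    domain, range_ = DOMAIN_RANGE[triple.relation]
    return not conflicts(entity_types[triple.subject], domain) and not conflicts(
        entity_types[triple.obj], range_
    )


def precision(triples, gold) -> float:
    """Fraction of the given triples that appear in the gold set."""
    return sum(t in gold for t in triples) / len(triples) if triples else 0.0


# Invented extractor output: two correct triples and two type-violating ones.
entity_types = {"alice": "Person", "acme": "Organization", "paris": "Location"}
extracted = [
    Triple("alice", "employedBy", "acme"),
    Triple("paris", "employedBy", "acme"),   # a Location cannot be employed
    Triple("acme", "locatedIn", "paris"),
    Triple("acme", "locatedIn", "alice"),    # a Person is not a Location
]
gold = {extracted[0], extracted[2]}

kept = [t for t in extracted if is_consistent(t, entity_types)]
print(precision(extracted, gold), precision(kept, gold))
```

On this invented data the filter removes both type-violating triples, so precision rises. The size of the gain here is an artifact of the toy example and says nothing about the 8% to 13% reported in the abstract. A real setup would delegate the consistency check to an OWL reasoner rather than a hand-written lookup.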

Acknowledgment

This work was supported in part by the DTO (née ARDA) NIMD program.

Author information

Corresponding author

Correspondence to Chris Welty.

About this article

Cite this article

Welty, C., Murdock, J.W. & Fan, J. SemantiClean. Lang Resources & Evaluation 42, 395–408 (2008). https://doi.org/10.1007/s10579-009-9080-5

  • DOI: https://doi.org/10.1007/s10579-009-9080-5
