SemantiClean

Welty, Chris; Murdock, J. William; Fan, James

doi:10.1007/s10579-009-9080-5

Cleaning noisy data using semantic technology

Published: 27 January 2009

Volume 42, pages 395–408 (2008)

Language Resources and Evaluation

Chris Welty¹,
J. William Murdock¹ &
James Fan¹

Abstract

In our research on using information extraction (IE) to help populate semantic web resources, we have encountered significant obstacles to interoperability between the two technologies. We believe these obstacles are endemic to the basic paradigms rather than quirks of the specific implementations we have worked with. In particular, we identify five dimensions of interoperability that must be addressed to successfully employ information extraction systems to populate semantic web resources that are suitable for reasoning. We call the task of transforming IE data into knowledge-based resources knowledge integration, and we report results of experiments in which the knowledge integration process uses the deeper semantics of OWL ontologies to improve the precision of relation extraction from text by between 8% and 13%.

Acknowledgment

This work was supported in part by the DTO (née ARDA) NIMD program.

Author information

Authors and Affiliations

IBM Watson Research Center, Hawthorne, NY, USA
Chris Welty, J. William Murdock & James Fan

Corresponding author

Correspondence to Chris Welty.

About this article

Cite this article

Welty, C., Murdock, J.W. & Fan, J. SemantiClean. Lang Resources & Evaluation 42, 395–408 (2008). https://doi.org/10.1007/s10579-009-9080-5

Received: 20 December 2008
Accepted: 06 January 2009
Published: 27 January 2009
Issue Date: December 2008
DOI: https://doi.org/10.1007/s10579-009-9080-5
