iASA: Learning to Annotate the Semantic Web

Tang, Jie; Li, Juanzi; Lu, Hongjun; Liang, Bangyong; Huang, Xiaotong; Wang, Kehong

doi:10.1007/11603412_4

Part of the book series: Lecture Notes in Computer Science (JODS, volume 3730)

Abstract

With the advent of the Semantic Web, there is a great need to upgrade existing web content to semantic web content. This can be accomplished through semantic annotation. Unfortunately, manual annotation is tedious, time-consuming, and error-prone. In this paper, we propose a tool, called iASA, that learns to annotate web documents automatically according to an ontology. iASA combines information extraction (specifically, the Similarity-based Rule Learner, SRL) with machine learning techniques. Using linguistic knowledge and an optimal dynamic window size, SRL produces annotation rules of better quality than comparable semantic annotation systems. Similarity-based learning reduces the search space efficiently by avoiding pseudo rule generalization. In the annotation phase, iASA exploits ontology knowledge to refine the annotations it proposes. Moreover, our annotation algorithm uses machine learning methods to select instances correctly and to predict missing instances. Finally, iASA provides an explanation component that explains the behavior of the learner and the annotator to the user. These explanations help users understand the rule induction and annotation process, so that they can correct rules and annotations quickly. Experimental results show that iASA can reach high accuracy quickly.
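
The idea of similarity-based generalization mentioned above can be illustrated with a small sketch. This is not the SRL algorithm from the paper, whose details are not given in the abstract. It assumes, purely for illustration, that a rule is a fixed-length tuple of context tokens, that similarity is the fraction of aligned positions that match, and that only the most similar pair above a threshold is generalized. The names `similarity`, `generalize`, `most_similar_pair`, and `WILDCARD` are hypothetical.

```python
from itertools import combinations

# Hypothetical wildcard symbol: matches any token at that position.
WILDCARD = "*"


def similarity(rule_a, rule_b):
    """Fraction of aligned positions whose tokens are equal.

    Assumption for this sketch: rules of different length (or empty
    rules) are treated as completely dissimilar.
    """
    if len(rule_a) != len(rule_b) or not rule_a:
        return 0.0
    matches = sum(1 for a, b in zip(rule_a, rule_b) if a == b)
    return matches / len(rule_a)


def generalize(rule_a, rule_b):
    """Merge two equal-length rules, replacing mismatches with a wildcard."""
    return tuple(a if a == b else WILDCARD for a, b in zip(rule_a, rule_b))


def most_similar_pair(rules, threshold=0.5):
    """Return the most similar pair of rules, or None if no pair
    reaches the threshold. Generalizing only this pair, rather than
    every pair, is what keeps the search space small in this sketch."""
    best, best_score = None, 0.0
    for a, b in combinations(rules, 2):
        score = similarity(a, b)
        if score > best_score:
            best, best_score = (a, b), score
    return best if best_score >= threshold else None


# Toy rules: left-context token patterns that precede a price value.
rules = [("price", ":", "$"), ("cost", ":", "$"), ("call", "us", "at")]
pair = most_similar_pair(rules)
if pair is not None:
    merged = generalize(*pair)
```

In this toy setting the two price-like rules are merged into one pattern with a wildcard in the first position, while the unrelated third rule is left alone, so dissimilar rules are never generalized together.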

Supported by the National Natural Science Foundation of China under Grant No. 60443002.

Author information

Authors and Affiliations

Department of Computer Science, Tsinghua University, Beijing, 100084, P.R. China
Jie Tang, Juanzi Li, Hongjun Lu, Bangyong Liang, Xiaotong Huang & Kehong Wang

Editor information

Editors and Affiliations

EPFL-IC-IIF-LBD, Station 14 - INJ 236, 1015, Lausanne, Switzerland
Stefano Spaccapietra

About this paper

Cite this paper

Tang, J., Li, J., Lu, H., Liang, B., Huang, X., Wang, K. (2005). iASA: Learning to Annotate the Semantic Web. In: Spaccapietra, S. (eds) Journal on Data Semantics IV. Lecture Notes in Computer Science, vol 3730. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11603412_4

DOI: https://doi.org/10.1007/11603412_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31001-3
Online ISBN: 978-3-540-31447-9
eBook Packages: Computer Science, Computer Science (R0)
