A Semi-supervised Learning Algorithm for Web Information Extraction with Tolerance Rough Sets

Sengoz, Cenker; Ramanna, Sheela

doi:10.1007/978-3-319-09912-5_1

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8610)

Abstract

In this paper, we propose a semi-supervised learning algorithm (TPL) to extract categorical noun phrase instances from unstructured web pages based on the tolerance rough sets model (TRSM). TRSM has been successfully employed for document representation, retrieval and classification tasks. However, instead of the vector-space model, our model uses noun phrases which are described in terms of sets of co-occurring contextual patterns. The categorical information that we employ is derived from the Never Ending Language Learner System (NELL) [3]. The performance of the TPL algorithm is compared with the Coupled Bayesian Sets (CBS) algorithm. Experimental results show that TPL is able to achieve precision comparable to that of CBS.
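The abstract does not spell out how TPL works, so the following is only a minimal sketch of the general tolerance rough set idea it builds on, not the authors' algorithm. It assumes each noun phrase is described by a set of contextual patterns, that two phrases are "tolerant" when the Sørensen-Dice overlap of their pattern sets reaches a threshold, and that lower and upper approximations of a seed set are taken in the usual way. The toy phrases, patterns, threshold value and function names are all hypothetical.

```python
from typing import Dict, Set, Tuple


def dice(a: Set[str], b: Set[str]) -> float:
    """Sørensen-Dice overlap between two sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def tolerance_class(phrase: str, contexts: Dict[str, Set[str]],
                    threshold: float) -> Set[str]:
    """Phrases whose pattern sets overlap `phrase`'s by at least `threshold`."""
    own = contexts[phrase]
    return {other for other, ctx in contexts.items()
            if dice(own, ctx) >= threshold}


def approximations(seeds: Set[str], contexts: Dict[str, Set[str]],
                   threshold: float) -> Tuple[Set[str], Set[str]]:
    """Lower and upper approximations of `seeds` under the tolerance relation."""
    lower, upper = set(), set()
    for phrase in contexts:
        cls = tolerance_class(phrase, contexts, threshold)
        if cls <= seeds:   # tolerance class lies entirely inside the seed set
            lower.add(phrase)
        if cls & seeds:    # tolerance class touches the seed set
            upper.add(phrase)
    return lower, upper


# Hypothetical toy data: noun phrases mapped to co-occurring contextual patterns.
contexts = {
    "Toronto": {"mayor of _", "city of _", "_ downtown"},
    "Winnipeg": {"mayor of _", "city of _"},
    "Vancouver": {"city of _", "_ downtown"},
    "hockey": {"_ league", "played _"},
}
seeds = {"Toronto", "Winnipeg"}  # assumed seed instances of a "city" category

lower, upper = approximations(seeds, contexts, threshold=0.6)
candidates = upper - seeds       # unlabeled phrases that resemble some seed
```

In this toy setting, phrases in the upper approximation that are not already seeds are natural candidates for promotion into the category; how TPL actually scores and promotes candidates is described in the full paper, not here.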

This research has been supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery grant 194376. We are very grateful to Prof. Estevam R. Hruschka Jr. and to Saurabh Verma for the NELL dataset and for discussions regarding the NELL project. Special thanks to Prof. James F. Peters for helpful suggestions.

References

[1] Callan, J., Hoy, M.: Clueweb09 data set (2009), http://lemurproject.org/clueweb09/
[2] Carlson, A., Betteridge, J., Wang, R.C., Hruschka Jr., E.R., Mitchell, T.M.: Coupled semi-supervised learning for information extraction. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 101–110 (2010)
[3] Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka, E.R., Mitchell, T.M.: Toward an architecture for never-ending language learning. In: Proceedings of the Twenty-Fourth Conference on Artificial Intelligence, AAAI 2010 (2010)
[4] Carlson, A.: All-pairs data set (2010)
[5] Curran, J.R., Murphy, T., Scholz, B.: Minimising semantic drift with mutual exclusion bootstrapping. In: Proc. of PACLING, pp. 172–180 (2007)
[6] Etzioni, O., Fader, A., Christensen, J., Soderland, S., Mausam: Open information extraction: The second generation. In: International Joint Conference on Artificial Intelligence, pp. 3–10 (2011)
[7] Ghahramani, Z., Heller, K.A.: Bayesian sets. Advances in Neural Information Processing Systems 18 (2005)
[8] Ho, T.B., Nguyen, N.B.: Nonhierarchical document clustering based on a tolerance rough set model. International Journal of Intelligent Systems 17, 199–212 (2002)
[9] Kawasaki, S., Nguyen, N.B., Ho, T.-B.: Hierarchical document clustering based on tolerance rough set model. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 458–463. Springer, Heidelberg (2000)
[10] Ngo, C.L.: A tolerance rough set approach to clustering web search results. Master’s thesis, Warsaw University (2003)
[11] Pawlak, Z.: Rough sets. International Journal of Computer & Information Sciences 11(5), 341–356 (1982), http://dx.doi.org/10.1007/BF01001956
[12] Peters, J., Wasilewski, P.: Tolerance spaces: Origins, theoretical aspects and applications. Information Sciences 195(1-2), 211–225 (2012)
[13] Shi, L., Ma, X., Xi, L., Duan, Q., Zhao, J.: Rough set and ensemble learning based semi-supervised algorithm for text classification. Expert Syst. Appl. 38(5), 6300–6306 (2011)
[14] Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundam. Inf. 27(2,3), 245–253 (1996), http://dl.acm.org/citation.cfm?id=2379560.2379571
[15] Sørensen, T.: A Method of Establishing Groups of Equal Amplitude in Plant Sociology Based on Similarity of Species Content and Its Application to Analyses of the Vegetation on Danish Commons. Biologiske skrifter, I kommission hos E. Munksgaard (1948), http://books.google.co.in/books?id=rpS8GAAACAAJ
[16] Verma, S., Hruschka Jr., E.R.: Coupled bayesian sets algorithm for semi-supervised learning and information extraction. In: Flach, P.A., De Bie, T., Cristianini, N. (eds.) ECML PKDD 2012, Part II. LNCS, vol. 7524, pp. 307–322. Springer, Heidelberg (2012)
[17] Virginia, G., Nguyen, H.S.: Lexicon-based document representation. Fundam. Inform. 124(1-2), 27–46 (2013)

Author information

Authors and Affiliations

Department of Applied Computer Science, University of Winnipeg, Winnipeg, Manitoba, R3B 2E9, Canada
Cenker Sengoz & Sheela Ramanna

Editor information

Editors and Affiliations

University of Warsaw and Infobright Inc., Poland
Dominik Ślȩzak
Department of Computer Science, Loughborough University, Loughborough, U.K.
Gerald Schaefer
Computer Science Department, University of British Columbia, 2366 Main Mall, P.O. Box, Vancouver, B.C., Canada
Son T. Vuong
Department of Information & Communication Engineering, Inha University, Korea
Yoo-Sung Kim

Cite this paper

Sengoz, C., Ramanna, S. (2014). A Semi-supervised Learning Algorithm for Web Information Extraction with Tolerance Rough Sets. In: Ślȩzak, D., Schaefer, G., Vuong, S.T., Kim, YS. (eds) Active Media Technology. AMT 2014. Lecture Notes in Computer Science, vol 8610. Springer, Cham. https://doi.org/10.1007/978-3-319-09912-5_1

DOI: https://doi.org/10.1007/978-3-319-09912-5_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09911-8
Online ISBN: 978-3-319-09912-5
eBook Packages: Computer Science, Computer Science (R0)
