Abstract
In this paper, we propose a semi-supervised learning algorithm (TPL), based on the tolerance rough sets model (TRSM), for extracting categorical noun phrase instances from unstructured web pages. TRSM has been successfully employed for document representation, retrieval and classification tasks. However, instead of the vector-space model, our model represents each noun phrase by the set of contextual patterns with which it co-occurs. The categorical information that we employ is derived from the Never Ending Language Learner System (NELL) [3]. The performance of the TPL algorithm is compared with that of the Coupled Bayesian Sets (CBS) algorithm. Experimental results show that TPL achieves precision comparable to that of CBS.
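As an illustration of the representation described above, the following is a minimal sketch of the standard TRSM construction (tolerance classes from co-occurrence counts, then lower and upper approximations) applied to noun phrases described by sets of contextual patterns. It is not the TPL algorithm itself, which the abstract does not specify. The function names, the threshold `theta`, and the toy data are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def tolerance_classes(phrase_patterns, theta):
    """Map each pattern to the patterns co-occurring with it in >= theta phrases.

    Hypothetical helper: the relation is reflexive and symmetric, as a
    tolerance relation requires, but the thresholding rule is an assumption.
    """
    cooccurrence = Counter()
    all_patterns = set()
    for patterns in phrase_patterns.values():
        all_patterns |= patterns
        # Sort so that each unordered pair is always counted under one key.
        for a, b in combinations(sorted(patterns), 2):
            cooccurrence[(a, b)] += 1

    classes = {p: {p} for p in all_patterns}  # reflexivity
    for (a, b), count in cooccurrence.items():
        if count >= theta:
            classes[a].add(b)  # symmetry: add in both directions
            classes[b].add(a)
    return classes


def upper_approximation(patterns, classes):
    """Patterns whose tolerance class overlaps the given pattern set."""
    return {p for p, cls in classes.items() if cls & patterns}


def lower_approximation(patterns, classes):
    """Patterns whose tolerance class lies entirely inside the given set."""
    return {p for p, cls in classes.items() if cls <= patterns}


# Toy data (invented for illustration): noun phrase -> contextual patterns.
phrase_patterns = {
    "Toronto": {"city of _", "mayor of _", "lives in _"},
    "Winnipeg": {"city of _", "mayor of _"},
    "Ottawa": {"city of _", "lives in _"},
    "Canada": {"lives in _", "president of _"},
}

classes = tolerance_classes(phrase_patterns, theta=2)
upper = upper_approximation(phrase_patterns["Winnipeg"], classes)
lower = lower_approximation(phrase_patterns["Winnipeg"], classes)
```

In this toy example the upper approximation of "Winnipeg" gains the pattern "lives in _" because that pattern co-occurs with "city of _" in two phrases, which is how a tolerance-based model can enrich a sparse description of a noun phrase.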
This research has been supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant 194376. We are very grateful to Prof. Estevam R. Hruschka Jr. and to Saurabh Verma for the NELL dataset and for discussions regarding the NELL project. Special thanks to Prof. James F. Peters for helpful suggestions.
References
Callan, J., Hoy, M.: Clueweb09 data set (2009), http://lemurproject.org/clueweb09/
Carlson, A., Betteridge, J., Wang, R.C., Hruschka Jr., E.R., Mitchell, T.M.: Coupled semi-supervised learning for information extraction. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 101–110 (2010)
Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka, E.R., Mitchell, T.M.: Toward an architecture for never-ending language learning. In: Proceedings of the Twenty-Fourth Conference on Artificial Intelligence, AAAI 2010 (2010)
Carlson, A.: All-pairs data set (2010)
Curran, J.R., Murphy, T., Scholz, B.: Minimising semantic drift with mutual exclusion bootstrapping. In: Proc. of PACLING, pp. 172–180 (2007)
Etzioni, O., Fader, A., Christensen, J., Soderland, S., Mausam: Open information extraction: The second generation. In: International Joint Conference on Artificial Intelligence, pp. 3–10 (2011)
Ghahramani, Z., Heller, K.A.: Bayesian sets. Advances in Neural Information Processing Systems 18 (2005)
Ho, T.B., Nguyen, N.B.: Nonhierarchical document clustering based on a tolerance rough set model. International Journal of Intelligent Systems 17, 199–212 (2002)
Kawasaki, S., Nguyen, N.B., Ho, T.-B.: Hierarchical document clustering based on tolerance rough set model. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 458–463. Springer, Heidelberg (2000)
Ngo, C.L.: A tolerance rough set approach to clustering web search results. Master’s thesis, Warsaw University (2003)
Pawlak, Z.: Rough sets. International Journal of Computer & Information Sciences 11(5), 341–356 (1982), http://dx.doi.org/10.1007/BF01001956
Peters, J., Wasilewski, P.: Tolerance spaces: Origins, theoretical aspects and applications. Information Sciences 195(1-2), 211–225 (2012)
Shi, L., Ma, X., Xi, L., Duan, Q., Zhao, J.: Rough set and ensemble learning based semi-supervised algorithm for text classification. Expert Syst. Appl. 38(5), 6300–6306 (2011)
Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundam. Inf. 27(2,3), 245–253 (1996), http://dl.acm.org/citation.cfm?id=2379560.2379571
Sørensen, T.: A Method of Establishing Groups of Equal Amplitude in Plant Sociology Based on Similarity of Species Content and Its Application to Analyses of the Vegetation on Danish Commons. Biologiske skrifter, I kommission hos E. Munksgaard (1948), http://books.google.co.in/books?id=rpS8GAAACAAJ
Verma, S., Hruschka Jr., E.R.: Coupled bayesian sets algorithm for semi-supervised learning and information extraction. In: Flach, P.A., De Bie, T., Cristianini, N. (eds.) ECML PKDD 2012, Part II. LNCS, vol. 7524, pp. 307–322. Springer, Heidelberg (2012)
Virginia, G., Nguyen, H.S.: Lexicon-based document representation. Fundam. Inform. 124(1-2), 27–46 (2013)
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Sengoz, C., Ramanna, S. (2014). A Semi-supervised Learning Algorithm for Web Information Extraction with Tolerance Rough Sets. In: Ślęzak, D., Schaefer, G., Vuong, S.T., Kim, YS. (eds) Active Media Technology. AMT 2014. Lecture Notes in Computer Science, vol 8610. Springer, Cham. https://doi.org/10.1007/978-3-319-09912-5_1
DOI: https://doi.org/10.1007/978-3-319-09912-5_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09911-8
Online ISBN: 978-3-319-09912-5
eBook Packages: Computer Science, Computer Science (R0)