A Novel Approach to Web Information Extraction

Reina Quintero, Antonia M.; Jiménez, Patricia; Corchuelo, Rafael

doi:10.1007/978-3-319-19027-3_13

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 208)

Included in the following conference series:

International Conference on Business Information Systems

Abstract

Business Intelligence requires acquiring and aggregating key pieces of knowledge from multiple sources in order to provide valuable information to customers. The Web is currently the largest source of information. Unfortunately, the information it provides is available in semi-structured, human-friendly formats, which makes it difficult for automated business processes to handle. Classical propositional and ILP machine-learning techniques have been applied for this purpose. However, the former lack sufficient expressive power, whereas the latter are more expressive but intractable on large datasets. Propositionalisation was devised as a means to give propositional techniques more expressive power: it enables them to exploit structural information in a propositional form, which keeps them efficient. In this paper, we present a proposal to extract information from semi-structured web documents that uses this approach. It leverages a classical propositional machine-learning technique and enhances it with the ability to learn from an unbounded context, which helps increase its precision and recall. Our experiments show that our proposal outperforms other state-of-the-art techniques in the literature.
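To make the idea of propositionalisation more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy tree of nodes and flattens each node, together with some of its ancestors, into a single attribute-value row that a propositional learner could consume. The names `Node`, `own_features`, `propositionalise`, and `radius`, as well as the three features used, are hypothetical and chosen only for illustration. The context here is limited by `radius`, which is a simplification of the unbounded context mentioned in the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A hypothetical, minimal stand-in for a DOM node (assumption)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    # Excluded from repr/compare to avoid infinite recursion through parents.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def own_features(node: Node) -> Dict[str, object]:
    """Attribute-value features describing one node in isolation."""
    text = node.text.strip()
    return {
        "tag": node.tag,
        "length": len(text),
        # True for plain digits with at most one decimal point.
        "is_numeric": text.replace(".", "", 1).isdigit(),
    }


def propositionalise(node: Node, radius: int) -> Dict[str, object]:
    """Flatten a node and up to `radius` ancestors into one flat row."""
    row = dict(own_features(node))
    current = node.parent
    for depth in range(1, radius + 1):
        if current is None:
            break  # reached the root; no more context to add
        for name, value in own_features(current).items():
            row[f"up{depth}_{name}"] = value
        current = current.parent
    return row


# Toy document: <div><span>Price</span><b>19.99</b></div>
root = Node("div")
label = root.add(Node("span", "Price"))
price = root.add(Node("b", "19.99"))

row = propositionalise(price, radius=1)
```

In this sketch, structural information (the enclosing `div`) ends up as ordinary columns such as `up1_tag`, so a standard propositional classifier could use it without any relational machinery. Increasing `radius` widens the context that each row captures.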

Acknowledgements

This work was funded by the Spanish and Andalusian R&D&I programmes through grants TIN2007-64119, P07-TIC-2602, P08-TIC-4100, TIN2008-04718-E, TIN2010-21744, TIN2010-09809-E, TIN2010-10811-E, TIN2010-09988-E, TIN2011-15497-E, and TIN2013-40848-R, which received funds from the European FEDER programme.

Author information

Authors and Affiliations

ETSI Informática, Avda. Reina Mercedes, s/n., 41012, Sevilla, Spain
Antonia M. Reina Quintero, Patricia Jiménez & Rafael Corchuelo

Corresponding author

Correspondence to Antonia M. Reina Quintero.

Editor information

Editors and Affiliations

Department of Information Systems, Poznań University of Economics, Poznań, Poland
Witold Abramowicz

Cite this paper

Reina Quintero, A.M., Jiménez, P., Corchuelo, R. (2015). A Novel Approach to Web Information Extraction. In: Abramowicz, W. (eds) Business Information Systems. BIS 2015. Lecture Notes in Business Information Processing, vol 208. Springer, Cham. https://doi.org/10.1007/978-3-319-19027-3_13

DOI: https://doi.org/10.1007/978-3-319-19027-3_13
Published: 16 June 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19026-6
Online ISBN: 978-3-319-19027-3
eBook Packages: Computer Science, Computer Science (R0)
