Abstract
Web pages contain a huge amount of information, but they are designed for human readers, which makes their automatic processing difficult. Moreover, web pages are live: their content keeps changing. A page that has been downloaded and processed may have different content a few seconds later. Many scraping frameworks and extraction mechanisms have been proposed and implemented; their common task is to download pages and extract the required data. Nevertheless, developing such applications is highly complex, since the nature of the data does not conform to common programming paradigms. Moreover, the changing content of web pages often forces the whole data set to be extracted repeatedly.
This paper describes the LinqToWeb framework for web data extraction. Its design allows a strongly typed object model to be defined that transparently reflects data on the living web. This mechanism provides access to raw web data in a completely object-oriented way using Language Integrated Query (LINQ). With this framework, the development of web-based applications such as data semantization tools is more efficient and type-safe, and the resulting product is easily maintainable and extensible.
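The idea of a strongly typed object model queried lazily over live data can be illustrated with a small sketch. This is not the LinqToWeb API (which targets .NET and LINQ); it is a rough Python analogue under stated assumptions: the names `Product`, `ProductSource`, `cheap_names`, and the sample markup are all hypothetical, and the page is an in-memory string rather than a real download.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Iterator, List, Optional

# Hypothetical stand-in for a live page; a real tool would download it.
SAMPLE_PAGE = """
<ul>
  <li class="item" data-price="12.50">Tea</li>
  <li class="item" data-price="3.20">Milk</li>
  <li class="item" data-price="25.00">Coffee</li>
</ul>
"""


@dataclass(frozen=True)
class Product:
    """Typed view of one item on the page (illustrative model)."""
    name: str
    price: float


class _ItemParser(HTMLParser):
    """Collects Product objects from <li class="item" data-price="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.products: List[Product] = []
        self._price: Optional[float] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "li" and attributes.get("class") == "item":
            self._price = float(attributes["data-price"])

    def handle_data(self, data):
        # The first non-blank text after an item tag is taken as its name.
        if self._price is not None and data.strip():
            self.products.append(Product(data.strip(), self._price))
            self._price = None


class ProductSource:
    """Iterable whose content is fetched only when iteration actually starts."""

    def __init__(self, fetch: Callable[[], str]) -> None:
        self._fetch = fetch
        self.fetch_count = 0

    def __iter__(self) -> Iterator[Product]:
        # Generator body: nothing here runs until the first item is requested.
        self.fetch_count += 1
        parser = _ItemParser()
        parser.feed(self._fetch())
        parser.close()
        yield from parser.products


def cheap_names(source: ProductSource, limit: float) -> List[str]:
    """Query in the spirit of LINQ: filter, project, and sort typed objects."""
    query = (p.name for p in source if p.price < limit)  # still lazy here
    return sorted(query)


if __name__ == "__main__":
    source = ProductSource(lambda: SAMPLE_PAGE)
    print(cheap_names(source, 20.0))
```

Because each iteration re-invokes the fetch callable, a query that is run again sees whatever the page contains at that moment, which loosely mirrors the "living web" behaviour described above.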
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Misek, J., Zavoral, F. (2010). High-Level Web Data Abstraction Using Language Integrated Query. In: Essaaidi, M., Malgeri, M., Badica, C. (eds) Intelligent Distributed Computing IV. Studies in Computational Intelligence, vol 315. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15211-5_2
DOI: https://doi.org/10.1007/978-3-642-15211-5_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15210-8
Online ISBN: 978-3-642-15211-5
eBook Packages: Engineering, Engineering (R0)