Flint: From Web Pages to Probabilistic Semantic Data

Blanco, Lorenzo; Bronzi, Mirco; Crescenzi, Valter; Merialdo, Paolo; Papotti, Paolo

doi:10.1007/978-3-642-25008-8_13

Part of the book series: Data-Centric Systems and Applications (DCSA)

Abstract

A large and growing number of web sites publish structured data about recognizable concepts (such as stock quotes, movies, and restaurants). The opportunity to build applications on the huge amount of data available from these sites has been discussed for more than a decade, but in practice only a small fraction of this information is currently used. The main reason is that extracting and integrating good-quality web data is an expensive task that often requires human intervention. In this chapter, we present the main results of the Flint project, which aims to develop automatic, domain-independent tools for all the steps required to benefit from web data: discovering data-intensive web sites that contain information about entities of interest, extracting and integrating the published data, and performing a probabilistic analysis to characterize the imprecision of the data and the accuracy of the sources. The result of this processing is semantically annotated data that can be used to populate a probabilistic database and to develop novel applications.

Notes

1. The distance between an attribute and a mapping is measured from the centroid of the mapping.
2. The names of the models presented in this chapter are inspired by those introduced by Dong et al. in [31].
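
The centroid-based distance in note 1 can be illustrated with a minimal sketch. It assumes, purely for illustration, that each attribute is represented as a numeric vector and that the distance is Euclidean; the chapter preview does not state either choice, and the names centroid, euclidean and distance_to_mapping are hypothetical.

```python
from math import sqrt
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty collection of equal-length vectors."""
    if not vectors:
        raise ValueError("a mapping must contain at least one attribute")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance (an assumed metric, not taken from the chapter)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_mapping(attribute: Vector, mapping: Sequence[Vector]) -> float:
    """Distance between an attribute and a mapping, taken from the mapping's centroid."""
    return euclidean(attribute, centroid(mapping))


# Hypothetical mapping made of two attribute vectors; its centroid is [1.0, 0.0].
mapping = [[0.0, 0.0], [2.0, 0.0]]
d = distance_to_mapping([1.0, 3.0], mapping)
```

Measuring from the centroid gives a single distance per attribute and mapping, instead of one distance per attribute already in the mapping.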

References

Agichtein, E., Gravano, L.: Snowball: extracting relations from large plain-text collections. DL ’00, pp. 85–94 (2000)
Amento, B., Terveen, L.G., Hill, W.C.: Does “authority” mean quality? predicting expert quality ratings of web documents. SIGIR, pp. 296–303 (2000)
Arasu, A., Garcia-Molina, H.: Extracting structured data from web pages. ACM SIGMOD international conference on management of data (SIGMOD’2003), San Diego, California, pp. 337–348 (2003)
Banko, M., Cafarella, M., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the web. IJCAI (2007)
Batini, C., Scannapieco, M.: Data Quality: Concepts, Methodologies, and Techniques. Springer, Berlin, Heidelberg, New York (2008)
Bilke, A., Naumann, F.: Schema matching using duplicates. ICDE, pp. 69–80 (2005)
Blanco, L., Bronzi, M., Crescenzi, V., Merialdo, P., Papotti, P.: Exploiting information redundancy to wring out structured data from the web. In: Rappa, M., Jones, P., Freire, J., Chakrabarti, S. (eds.) WWW, pp. 1063–1064. ACM, New York (2010)
Blanco, L., Bronzi, M., Crescenzi, V., Merialdo, P., Papotti, P.: Redundancy-driven web data extraction and integration. WebDB (2010)
Blanco, L., Bronzi, M., Crescenzi, V., Merialdo, P., Papotti, P.: Automatically building probabilistic databases from the web. WWW (Companion Volume), pp. 185–188 (2011)
Blanco, L., Crescenzi, V., Merialdo, P.: Efficiently locating collections of web pages to wrap. WEBIST (2005)
Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P.: Supporting the automatic construction of entity aware search engines. WIDM, pp. 149–156 (2008)
Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P.: Probabilistic models to reconcile complex data from inaccurate data sources. CAiSE, pp. 83–97 (2010)
Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P.: Contextual data extraction and instance-based integration. International workshop on searching and integrating new web data sources (VLDS) (2011)
Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P.: Wrapper generation for overlapping web sources. Web Intelligence (WI) (2011)
Blanco, L., Dalvi, N.N., Machanavajjhala, A.: Highly efficient algorithms for structural clustering of large websites. WWW, pp. 437–446 (2011)
Brin, S.: Extracting patterns and relations from the World Wide Web. Proceedings of the First Workshop on the Web and Databases (WebDB’98) (in conjunction with EDBT’98), pp. 102–108 (1998)
Cafarella, M.J., Etzioni, O., Suciu, D.: Structured queries over web text. IEEE Data Eng. Bull. 29(4), 45–51 (2006)
Cafarella, M.J., Halevy, A.Y., Khoussainova, N.: Data integration for the relational web. PVLDB 2(1), 1090–1101 (2009)
Cafarella, M.J., Halevy, A.Y., Wang, D.Z., Wu, E., Zhang, Y.: Webtables: exploring the power of tables on the web. PVLDB 1(1), 538–549 (2008)
Chakrabarti, S., van den Berg, M., Dom, B.: Focused crawling: a new approach to topic-specific Web resource discovery. Comput. Networks (Amsterdam, Netherlands) 31(11–16), 1623–1640 (1999)
Chang, K.C.C., Bin, H., Zhen, Z.: Toward large scale integration: building a metaquerier over databases on the web. CIDR 2005, pp. 44–66 (2005)
Chuang, S.L., Chang, K.C.C., Zhai, C.X.: Context-aware wrapping: synchronized data extraction. VLDB, pp. 699–710 (2007)
Clemen, R.T., Winkler, R.L.: Combining probability distributions from experts in risk analysis. Risk Anal. 19(2), 187–203 (1999)
Crescenzi, V., Mecca, G., Merialdo, P.: RoadRunner: towards automatic data extraction from large Web sites. International conference on very large data bases (VLDB 2001), Roma, Italy, 11–14 September 2001, pp. 109–118
Dalvi, N.N., Suciu, D.: Management of probabilistic data: foundations and challenges. PODS, pp. 1–12 (2007)
Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., Rajagopalan, S., Tomkins, A., Tomlin, J.A., Zien, J.Y.: Semtag and seeker: bootstrapping the semantic web via automated semantic annotation. WWW ’03: proceedings of the 12th International Conference on World Wide Web, pp. 178–186. ACM, New York, NY, USA (2003). http://doi.acm.org/10.1145/775152.775178
Do, H.H., Rahm, E.: Matching large schemas: approaches and evaluation. Inf. Syst. 32(6), 857–885 (2007)
Doan, A., Madhavan, J., Domingos, P., Halevy, A.: Learning to map between ontologies on the semantic web. WWW ’02, pp. 662–673 (2002)
Doan, A., Ramakrishnan, R., Chen, F., DeRose, P., Lee, Y., McCann, R., Sayyadian, M., Shen, W.: Community information management. IEEE Data Eng. Bull. 29(1), 64–72 (2006)
Dong, X., Berti-Equille, L., Hu, Y., Srivastava, D.: Global detection of complex copying relationships between sources. PVLDB 3(1), 1358–1369 (2010)
Dong, X.L., Berti-Equille, L., Srivastava, D.: Integrating conflicting data: the role of source dependence. PVLDB 2(1), 550–561 (2009)
Dong, X.L., Berti-Equille, L., Srivastava, D.: Truth discovery and copying detection in a dynamic world. PVLDB 2(1), 562–573 (2009)
Downey, D., Etzioni, O., Soderland, S.: A probabilistic model of redundancy in information extraction. IJCAI, pp. 1034–1041 (2005)
Florescu, D., Koller, D., Levy, A.Y.: Using probabilistic information in data integration. VLDB, pp. 216–225 (1997)
Galland, A., Abiteboul, S., Marian, A., Senellart, P.: Corroborating information from disagreeing views. Proceedings of WSDM, New York, USA (2010)
Guha, R., McCool, R.: Tap: a semantic web platform. Comput. Networks 42(5), 557–577 (2003)
Madhavan, J., Bernstein, P.A., Doan, A., Halevy, A.Y.: Corpus-based schema matching. ICDE, pp. 57–68 (2005)
Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008). http://www.informationretrieval.org
Pant, G., Srinivasan, P.: Learning to crawl: comparing classification schemes. ACM Trans. Inf. Syst. 23(4), 430–462 (2005)
Sarma, A.D., Dong, X., Halevy, A.Y.: Bootstrapping pay-as-you-go data integration systems. SIGMOD conference, pp. 861–874 (2008)
Shen, W., DeRose, P., McCann, R., Doan, A., Ramakrishnan, R.: Toward best-effort information extraction. SIGMOD conference, pp. 1031–1042 (2008)
Shen, W., DeRose, P., Vu, L., Doan, A., Ramakrishnan, R.: Source-aware entity matching: a compositional approach. ICDE, pp. 196–205. IEEE Computer Society, Silver Spring, MD (2007)
Sizov, S., Biwer, M., Graupmann, J., Siersdorfer, S., Theobald, M., Weikum, G., Zimmer, P.: The bingo! system for information portal generation and expert web search. CIDR 2003, First Biennial conference on innovative data systems research, Asilomar, CA, USA, 2003
Vidal, M.L.A., da Silva, A.S., de Moura, E.S., Cavalcanti, J.M.B.: Structure-driven crawler generation by example. In: Efthimiadis, E.N., Dumais, S.T., Hawking, D., Järvelin, K. (eds.) SIGIR, pp. 292–299. ACM, New York (2006)
Wu, M., Marian, A.: Corroborating answers from multiple web sources. WebDB (2007)
Yin, X., Han, J., Yu, P.S.: Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng. 20(6), 796–808 (2008)

Author information

Authors and Affiliations

Dipartimento di Informatica e Automazione, Università degli Studi Roma Tre, via della Vasca Navale 79, Rome, Italy
Lorenzo Blanco, Mirco Bronzi, Valter Crescenzi, Paolo Merialdo & Paolo Papotti

Corresponding author

Correspondence to Paolo Merialdo.

Editor information

Editors and Affiliations

Dipartimento di Informatica, Università degli Studi Roma Tre, Via della Vasca Navale 79, Roma, 00146, Italy
Roberto De Virgilio
Dipartimento di Economia Aziendale, Università degli Studi di Modena e Reggio Emilia, Viale Berengario 51, Modena, 41100, Italy
Francesco Guerra
Università degli Studi di Trento, Via Sommarive 14, Trento, 38123, Italy
Yannis Velegrakis

About this chapter

Cite this chapter

Blanco, L., Bronzi, M., Crescenzi, V., Merialdo, P., Papotti, P. (2012). Flint: From Web Pages to Probabilistic Semantic Data. In: De Virgilio, R., Guerra, F., Velegrakis, Y. (eds) Semantic Search over the Web. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25008-8_13

DOI: https://doi.org/10.1007/978-3-642-25008-8_13
Published: 28 January 2012
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25007-1
Online ISBN: 978-3-642-25008-8
eBook Packages: Computer Science, Computer Science (R0)