Flint: From Web Pages to Probabilistic Semantic Data

  • Paolo Merialdo
  • Paolo Papotti
Part of the Data-Centric Systems and Applications book series (DCSA)


A large and growing number of web sites publish structured data about recognizable concepts (such as stock quotes, movies, and restaurants). The opportunity to build applications on the huge amount of data available from these sites has been discussed for more than a decade, but in practice only a small fraction of this information is currently used. The main reason is that extracting and integrating good-quality web data is an expensive task that often requires human intervention. In this chapter, we present the main results of the Flint project, which aims to develop automatic, domain-independent tools for all the steps required to benefit from web data: discovering data-intensive web sites that contain information about entities of interest, extracting and integrating the published data, and performing a probabilistic analysis to characterize the imprecision of the data and the accuracy of the sources. The result of this processing is semantically annotated data that can be used to populate a probabilistic database and to develop novel applications.
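
To make the probabilistic analysis step more concrete, the following is a minimal sketch of one generic way to estimate value probabilities and source accuracies together when several sites report conflicting values. It is not the Flint project's actual model. The function name `reconcile`, the site names, the item identifiers, the initial accuracy of 0.8, and the assumption that a wrong source spreads its error uniformly over the other observed values are all illustrative choices made here.

```python
from collections import defaultdict


def reconcile(claims, rounds=10, initial_accuracy=0.8):
    """Jointly estimate value probabilities and source accuracies.

    claims maps source -> {item: claimed value}. All names and the
    error model are illustrative assumptions, not taken from the chapter.
    """
    accuracy = {source: initial_accuracy for source in claims}

    # Collect the distinct candidate values observed for each item.
    candidates = defaultdict(set)
    for per_item in claims.values():
        for item, value in per_item.items():
            candidates[item].add(value)

    probs = {}
    for _ in range(rounds):
        # Step 1: score each candidate value under the current accuracies.
        for item, values in candidates.items():
            wrong_options = max(len(values) - 1, 1)
            scores = {}
            for candidate in values:
                score = 1.0
                for source, per_item in claims.items():
                    if item not in per_item:
                        continue  # this source says nothing about the item
                    a = accuracy[source]
                    if per_item[item] == candidate:
                        score *= a
                    else:
                        # Assumed: errors are uniform over the other candidates.
                        score *= (1.0 - a) / wrong_options
                scores[candidate] = score
            total = sum(scores.values())
            probs[item] = {v: s / total for v, s in scores.items()}

        # Step 2: a source's accuracy is the mean probability of its claims.
        for source, per_item in claims.items():
            claimed = [probs[item][value] for item, value in per_item.items()]
            if claimed:
                mean = sum(claimed) / len(claimed)
                # Clamp so that no score can collapse to exactly zero.
                accuracy[source] = min(max(mean, 0.01), 0.99)
    return probs, accuracy


# Hypothetical input: three sites reporting birth years for three players.
claims = {
    "site_a": {"p1": "1987", "p2": "1990", "p3": "1985"},
    "site_b": {"p1": "1987", "p2": "1990", "p3": "1984"},
    "site_c": {"p1": "1978", "p2": "1990", "p3": "1985"},
}
probs, accuracy = reconcile(claims)
```

In this toy input, `site_a` always agrees with the majority, so its estimated accuracy ends up higher than that of the other two sites, and the majority values receive most of the probability mass. The resulting per-item distributions are the kind of output that could populate a probabilistic database.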


Keywords: Soccer Player, Extraction Rule, Probabilistic Database, Target Entity, Weak Rule



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  1. Dipartimento di Informatica e Automazione, Università degli Studi Roma Tre, Rome, Italy
