Quality of Web Data and Quality of Big Data: Open Problems

  • Monica Scannapieco
  • Laure Berti
Chapter
Part of the Data-Centric Systems and Applications book series (DCSA)

Abstract

In this chapter we discuss open issues concerning two types of information sources that are particularly significant today: Web data and Big Data.

Keywords

Sensor Network; Sensor Node; National Statistical Institute; Twitter Data; Provenance Information

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Monica Scannapieco (1)
  • Laure Berti
  1. Istituto Nazionale di Statistica - Istat, Rome, Italy
