Quality of Web Data and Quality of Big Data: Open Problems

Part of the book series: Data-Centric Systems and Applications (DCSA)

Abstract

In this chapter, we discuss some open issues related to two types of information sources that are particularly significant today, namely Web data and Big Data.

An erratum to this chapter can be found at http://dx.doi.org/10.1007/978-3-319-24106-7_15

Notes

  1. http://www.w3.org/DesignIssues/UI.html

  2. http://open-biomed.sourceforge.net/opmv/ns.html

  3. http://www.w3.org/TR/prov-o/

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Scannapieco, M., Berti, L. (2016). Quality of Web Data and Quality of Big Data: Open Problems. In: Data and Information Quality. Data-Centric Systems and Applications. Springer, Cham. https://doi.org/10.1007/978-3-319-24106-7_14

  • DOI: https://doi.org/10.1007/978-3-319-24106-7_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24104-3

  • Online ISBN: 978-3-319-24106-7

  • eBook Packages: Computer Science (R0)
