Quality of Web Data and Quality of Big Data: Open Problems

Scannapieco, Monica; Berti, Laure

doi:10.1007/978-3-319-24106-7_14

Part of the book series: Data-Centric Systems and Applications (DCSA)

Abstract

In this chapter we discuss some open issues related to two types of information sources that are particularly significant today: Web data and Big Data.

An erratum to this chapter can be found at http://dx.doi.org/10.1007/978-3-319-24106-7_15

Author information

Authors and Affiliations

Istituto Nazionale di Statistica-Istat, Rome, Italy
Monica Scannapieco

About this chapter

Cite this chapter

Scannapieco, M., Berti, L. (2016). Quality of Web Data and Quality of Big Data: Open Problems. In: Data and Information Quality. Data-Centric Systems and Applications. Springer, Cham. https://doi.org/10.1007/978-3-319-24106-7_14

DOI: https://doi.org/10.1007/978-3-319-24106-7_14
Published: 24 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24104-3
Online ISBN: 978-3-319-24106-7
eBook Packages: Computer Science, Computer Science (R0)
