Data and Information Quality pp 421-449
Quality of Web Data and Quality of Big Data: Open Problems
Abstract
In this chapter we discuss open issues concerning two types of information source that are particularly significant today: Web data and Big Data.
Keywords
Sensor Network; Sensor Node; National Statistical Institute; Twitter Data; Provenance Information
Copyright information
© Springer International Publishing Switzerland 2016