Accessible and reusable datasets are a necessity for repeatable research. This requirement poses a problem particularly for web science, since scraped data comes in various formats and can change over time due to the dynamic character of the web. Furthermore, the usage of web data is typically restricted by copyright protection or privacy regulations, which hinder the publication of datasets.
To alleviate these problems and achieve what we define as "partial data repeatability", we present a process that consists of multiple components. Researchers need to distribute only a scraper, not the data itself, to comply with legal limitations. If a dataset is re-scraped for repeatability after some time, the integrity of the different versions can be checked based on fingerprints. Moreover, the fingerprints suffice to identify which parts of the data have changed and to what extent.
We evaluate an implementation of this process on a dataset of 250 million online comments collected from five different news discussion platforms. We re-scraped the dataset after one year and show that less than ten percent of the data had actually changed. These experiments demonstrate that providing a scraper together with fingerprints enables recreating a dataset and supports the repeatability of web science experiments.
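The fingerprint-based comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each scraped record has a stable identifier and uses a SHA-256 digest as the per-record fingerprint; the `fingerprint` and `changed_fraction` names are hypothetical.

```python
import hashlib


def fingerprint(record: str) -> str:
    # A cryptographic digest serves as a compact fingerprint of one
    # scraped record; the raw text never needs to be distributed.
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def changed_fraction(old: dict, new: dict) -> float:
    # old and new map record IDs to fingerprints (e.g., from the
    # original scrape and a later re-scrape). A record counts as
    # changed if its fingerprint differs or it has disappeared.
    changed = sum(1 for rid, fp in old.items() if new.get(rid) != fp)
    return changed / len(old)


# Two scrapes of the same two comments, one of which was edited:
v1 = {"c1": fingerprint("First comment"), "c2": fingerprint("Second comment")}
v2 = {"c1": fingerprint("First comment"), "c2": fingerprint("Edited comment")}
print(changed_fraction(v1, v2))  # 0.5
```

Publishing only such fingerprints alongside the scraper lets a later researcher verify the integrity of a re-scraped version and quantify how much of the dataset has drifted, without sharing the protected content itself.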
Risch, J., Krestel, R. Measuring and Facilitating Data Repeatability in Web Science. Datenbank Spektrum 19, 117–126 (2019). https://doi.org/10.1007/s13222-019-00316-9
Keywords:
- Web science
- Web scraping