Detecting Duplicate Records in Scientific Workflow Results

Belhajjame, Khalid; Missier, Paolo; Goble, Carole A.

doi:10.1007/978-3-642-34222-6_10


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7525)

Included in the following conference series:

International Provenance and Annotation Workshop


Abstract

Scientific workflows are often data intensive. The datasets obtained by enacting scientific workflows have several applications: for example, they can be used to identify data correlations or to understand phenomena, and are therefore worth storing in repositories for future analyses. Our experience suggests that such datasets often contain duplicate records. Indeed, scientists tend to enact the same workflow multiple times using the same or overlapping datasets, which gives rise to duplicates in workflow results. The presence of duplicates may complicate the interpretation and analysis of workflow results. Moreover, it unnecessarily increases the size of the datasets held in workflow results repositories. In this paper, we present an approach in which duplicate detection is guided by workflow provenance traces. The hypothesis that we explore and exploit is that the operations that compose a workflow are likely to produce the same (or overlapping) datasets when given the same (or overlapping) datasets as input. A preliminary analytic and empirical validation shows the effectiveness and applicability of the proposed method.
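
The hypothesis stated in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not spell out. It assumes a hypothetical trace format in which each entry records an operation name, its input values, and its output, and it only flags candidate duplicates: outputs of invocations that ran the same operation on identical inputs. The `Invocation` type, the `candidate_duplicates` function, and the sample values are all invented for illustration, and overlapping (rather than identical) inputs are not handled.

```python
from collections import defaultdict
from typing import Hashable, NamedTuple


class Invocation(NamedTuple):
    """One hypothetical provenance-trace entry (assumed format, not from the paper)."""
    operation: str
    inputs: tuple[Hashable, ...]
    output: Hashable


def candidate_duplicates(trace: list[Invocation]) -> list[list[Invocation]]:
    """Group invocations that ran the same operation on the same inputs.

    Under the hypothesis, the outputs within each returned group are likely
    duplicates of one another and are worth comparing first.
    """
    groups: dict[tuple, list[Invocation]] = defaultdict(list)
    for inv in trace:
        groups[(inv.operation, inv.inputs)].append(inv)
    # Keep only groups with more than one invocation.
    return [group for group in groups.values() if len(group) > 1]


# Illustrative trace: the first two entries repeat the same call.
trace = [
    Invocation("align", ("seq_A", "seq_B"), "result_1"),
    Invocation("align", ("seq_A", "seq_B"), "result_2"),
    Invocation("align", ("seq_A", "seq_C"), "result_3"),
]
dupes = candidate_duplicates(trace)
```

Grouping by operation and inputs narrows the comparison to records that the trace already suggests are related, instead of comparing every pair of output records. Whether the grouped outputs really are duplicates still has to be checked, since an operation may be non-deterministic or depend on external state.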



Author information

Authors and Affiliations

School of Computer Science, University of Manchester, Oxford Road, Manchester, UK
Khalid Belhajjame & Carole A. Goble
School of Computer Science, Newcastle University, Newcastle upon Tyne, UK
Paolo Missier


Editor information

Editors and Affiliations

Free University Amsterdam, De Boelelaan 1105, 1081HV, Amsterdam, The Netherlands
Paul Groth
Bren School of Environmental Science and Management, University of California, 2400 Bren Hall, 93106-5131, Santa Barbara, CA, USA
James Frew


About this paper

Cite this paper

Belhajjame, K., Missier, P., Goble, C.A. (2012). Detecting Duplicate Records in Scientific Workflow Results. In: Groth, P., Frew, J. (eds) Provenance and Annotation of Data and Processes. IPAW 2012. Lecture Notes in Computer Science, vol 7525. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34222-6_10


DOI: https://doi.org/10.1007/978-3-642-34222-6_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34221-9
Online ISBN: 978-3-642-34222-6
eBook Packages: Computer Science, Computer Science (R0)
