Detecting Duplicate Records in Scientific Workflow Results

  • Khalid Belhajjame
  • Paolo Missier
  • Carole A. Goble
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7525)


Scientific workflows are often data intensive. The datasets obtained by enacting scientific workflows have several applications, e.g., they can be used to identify data correlations or to understand phenomena, and are therefore worth storing in repositories for future analyses. Our experience suggests that such datasets often contain duplicate records. Indeed, scientists tend to enact the same workflow multiple times using the same or overlapping datasets, which gives rise to duplicates in workflow results. The presence of duplicates may complicate the interpretation and analysis of workflow results. Moreover, it unnecessarily increases the size of the datasets held in workflow result repositories. In this paper, we present an approach whereby duplicate detection is guided by workflow provenance traces. The hypothesis that we explore and exploit is that the operations that compose a workflow are likely to produce the same (or overlapping) output dataset given the same (or overlapping) input dataset. A preliminary analytic and empirical validation shows the effectiveness and applicability of the proposed method.
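The hypothesis stated in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm; it is one possible reading, under the simplifying assumption that a provenance trace is a flat list of operation invocations and that records are hashable values. The names `Invocation`, `candidate_groups`, and `duplicate_records` are hypothetical. The sketch only covers the case of identical inputs; the overlapping case mentioned in the abstract would need a record-level comparison instead of an exact match on the whole input.

```python
from collections import defaultdict
from typing import NamedTuple


class Invocation(NamedTuple):
    """One operation invocation in a hypothetical provenance trace."""
    run_id: str       # workflow enactment that performed the invocation
    operation: str    # name of the workflow operation
    inputs: tuple     # input records (assumed hashable)
    outputs: tuple    # output records (assumed hashable)


def candidate_groups(trace):
    """Group invocations that applied the same operation to the same inputs.

    Under the hypothesis, only these groups need to be searched for
    duplicates, so no comparison is made across unrelated invocations.
    """
    groups = defaultdict(list)
    for inv in trace:
        groups[(inv.operation, inv.inputs)].append(inv)
    return {key: invs for key, invs in groups.items() if len(invs) > 1}


def duplicate_records(trace):
    """Return (operation, record, run_ids) for records produced by several runs."""
    found = []
    for (operation, _inputs), invs in candidate_groups(trace).items():
        producers = defaultdict(set)
        for inv in invs:
            for record in inv.outputs:
                producers[record].add(inv.run_id)
        for record, runs in producers.items():
            if len(runs) > 1:
                found.append((operation, record, sorted(runs)))
    return found


# Illustrative trace: two runs apply "align" to the same input record.
trace = [
    Invocation("run1", "align", ("seqA",), ("hit1", "hit2")),
    Invocation("run2", "align", ("seqA",), ("hit1", "hit2")),
    Invocation("run2", "align", ("seqB",), ("hit3",)),
]
duplicates = duplicate_records(trace)
```

Grouping by operation and input first keeps the number of output comparisons small. The outputs are still compared within each group, because the hypothesis says only that identical outputs are likely, not guaranteed.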


Keywords (machine-generated, not supplied by the authors): Record Linkage, Operation Parameter, Analysis Operation, False Match, Identical Record



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Khalid Belhajjame (1)
  • Paolo Missier (2)
  • Carole A. Goble (1)
  1. School of Computer Science, University of Manchester, Manchester, UK
  2. School of Computer Science, Newcastle University, Newcastle upon Tyne, UK
