Using High Performance Computing for Detecting Duplicate, Similar and Related Images in a Large Data Collection

Chapter in: Conquering Big Data with High Performance Computing

Abstract

The detection of duplicate and related content is a critical data curation task in the context of digital research collections. This task can be challenging, if not impossible, to perform manually in large, unstructured, and noisy collections. While there are many automated solutions for deduplicating data that contain large numbers of identical copies, it can be particularly difficult to find a solution for identifying redundancy within image-heavy collections that have evolved over a long span of time or have been created collaboratively by large groups. These types of collections, especially in academic research settings in which the datasets are used for a wide range of publication, teaching, and research activities, can be characterized by (1) large numbers of heterogeneous file formats, (2) repetitive photographic documentation of the same subjects in a variety of conditions, (3) multiple copies or subsets of images with slight modifications (e.g., cropping or color-balancing), and (4) complex file structures and naming conventions that may not be consistent throughout. In this chapter, we present a scalable and automated approach for detecting duplicate, similar, and related images, along with subimages, in digital data collections. Our approach can assist in efficiently managing redundancy in any large image collection on High Performance Computing (HPC) resources. While we illustrate the approach with a large archaeological collection, it is domain-neutral and widely applicable to image-heavy collections on any HPC platform with general-purpose processors.
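The chapter describes the full pipeline in detail; as a rough, self-contained illustration of the general technique (not the authors' implementation), the following Python sketch pairs byte-level checksums for exact duplicates with a perceptual average hash for near duplicates, and fans the per-file work out across local cores. The Pillow dependency, the 64-bit average hash, the Hamming-distance threshold of 5, and the "collection/" directory are all illustrative assumptions.

```python
# Minimal sketch of two-stage duplicate detection (illustrative only,
# not the chapter's implementation). Assumes: pip install Pillow
import hashlib
from itertools import combinations
from multiprocessing import Pool
from pathlib import Path

from PIL import Image

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}

def file_checksum(path: Path) -> str:
    """Exact duplicates: identical bytes yield identical MD5 digests."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def average_hash(path: Path) -> int:
    """Near duplicates: a 64-bit 'average hash' that survives resizing,
    recompression, and mild edits such as color balancing."""
    img = Image.open(path).convert("L").resize((8, 8))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    return sum(1 << i for i, p in enumerate(pixels) if p > mean)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit hashes."""
    return bin(a ^ b).count("1")

def scan(root: str, threshold: int = 5):
    paths = sorted(p for p in Path(root).rglob("*")
                   if p.suffix.lower() in IMAGE_EXTS)
    # Per-file hashing is embarrassingly parallel; on an HPC system this
    # local pool would be replaced by the site's task-parallel framework.
    with Pool() as pool:
        checksums = pool.map(file_checksum, paths)
        hashes = pool.map(average_hash, paths)

    exact, near = [], []
    # O(n^2) comparison suffices for a sketch; at scale one would bucket
    # hashes before comparing.
    for i, j in combinations(range(len(paths)), 2):
        if checksums[i] == checksums[j]:
            exact.append((paths[i], paths[j]))
        elif hamming(hashes[i], hashes[j]) <= threshold:
            near.append((paths[i], paths[j]))
    return exact, near

if __name__ == "__main__":
    exact, near = scan("collection/")  # hypothetical directory
    print(f"{len(exact)} exact pairs, {len(near)} near-duplicate pairs")
```

Detecting subimages and related (rather than merely similar) images, which the chapter also addresses, would require feature-based matching with local keypoints (e.g., SURF-style descriptors) instead of whole-image hashing.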



Acknowledgments

We are grateful to ICA, XSEDE, TACC, and the STAR Scholar Program for providing the resources to conduct this research. We also thank our colleague Antonio Gomez for his help installing dupeGuru on Stampede.

Author information

Corresponding author

Correspondence to Ritu Arora.


Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Arora, R., Trelogan, J., Ba, T.N. (2016). Using High Performance Computing for Detecting Duplicate, Similar and Related Images in a Large Data Collection. In: Arora, R. (ed.) Conquering Big Data with High Performance Computing. Springer, Cham. https://doi.org/10.1007/978-3-319-33742-5_13

  • DOI: https://doi.org/10.1007/978-3-319-33742-5_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-33740-1

  • Online ISBN: 978-3-319-33742-5

  • eBook Packages: Computer Science (R0)
