Using High Performance Computing for Detecting Duplicate, Similar and Related Images in a Large Data Collection

Chapter in: Conquering Big Data with High Performance Computing

Abstract

The detection of duplicate and related content is a critical data curation task in the context of digital research collections. This task can be challenging, if not impossible, to perform manually in large, unstructured, and noisy collections. While there are many automated solutions for deduplicating data that contain large numbers of identical copies, it can be particularly difficult to find a solution for identifying redundancy within image-heavy collections that have evolved over a long span of time or have been created collaboratively by large groups. These types of collections, especially in academic research settings in which the datasets are used for a wide range of publication, teaching, and research activities, can be characterized by (1) large numbers of heterogeneous file formats, (2) repetitive photographic documentation of the same subjects in a variety of conditions, (3) multiple copies or subsets of images with slight modifications (e.g., cropping or color-balancing), and (4) complex file structures and naming conventions that may not be consistent throughout. In this chapter, we present a scalable and automated approach for detecting duplicate, similar, and related images, along with subimages, in digital data collections. Our approach can assist in efficiently managing redundancy in any large image collection on High Performance Computing (HPC) resources. While we illustrate the approach with a large archaeological collection, it is domain-neutral and widely applicable to image-heavy collections on any HPC platform with general-purpose processors.
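The chapter describes the full pipeline in detail; as a rough, self-contained illustration of the general technique (not the authors' implementation), the following Python sketch pairs byte-level checksums for exact duplicates with a perceptual average hash for near duplicates, and fans the per-file work out across local cores. The Pillow dependency, the 64-bit average hash, the Hamming-distance threshold of 5, and the "collection/" directory are all illustrative assumptions.

```python
# Minimal sketch of two-stage duplicate detection (illustrative only,
# not the chapter's implementation). Assumes: pip install Pillow
import hashlib
from itertools import combinations
from multiprocessing import Pool
from pathlib import Path

from PIL import Image

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}

def file_checksum(path: Path) -> str:
    """Exact duplicates: identical bytes yield identical MD5 digests."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def average_hash(path: Path) -> int:
    """Near duplicates: a 64-bit 'average hash' that survives resizing,
    recompression, and mild edits such as color balancing."""
    img = Image.open(path).convert("L").resize((8, 8))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    return sum(1 << i for i, p in enumerate(pixels) if p > mean)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit hashes."""
    return bin(a ^ b).count("1")

def scan(root: str, threshold: int = 5):
    paths = sorted(p for p in Path(root).rglob("*")
                   if p.suffix.lower() in IMAGE_EXTS)
    # Per-file hashing is embarrassingly parallel; on an HPC system this
    # local pool would be replaced by the site's task-parallel framework.
    with Pool() as pool:
        checksums = pool.map(file_checksum, paths)
        hashes = pool.map(average_hash, paths)

    exact, near = [], []
    # O(n^2) comparison suffices for a sketch; at scale one would bucket
    # hashes before comparing.
    for i, j in combinations(range(len(paths)), 2):
        if checksums[i] == checksums[j]:
            exact.append((paths[i], paths[j]))
        elif hamming(hashes[i], hashes[j]) <= threshold:
            near.append((paths[i], paths[j]))
    return exact, near

if __name__ == "__main__":
    exact, near = scan("collection/")  # hypothetical directory
    print(f"{len(exact)} exact pairs, {len(near)} near-duplicate pairs")
```

Detecting subimages and related (rather than merely similar) images, which the chapter also addresses, would require feature-based matching with local keypoints (e.g., SURF-style descriptors) instead of whole-image hashing.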



Acknowledgments

We are grateful to ICA, XSEDE, TACC, and the STAR Scholar Program for providing the resources to conduct this research. We also thank our colleague Antonio Gomez for his help installing dupeGuru on Stampede.

Author information

Corresponding author

Correspondence to Ritu Arora.


Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Arora, R., Trelogan, J., Ba, T.N. (2016). Using High Performance Computing for Detecting Duplicate, Similar and Related Images in a Large Data Collection. In: Arora, R. (ed.) Conquering Big Data with High Performance Computing. Springer, Cham. https://doi.org/10.1007/978-3-319-33742-5_13

  • DOI: https://doi.org/10.1007/978-3-319-33742-5_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-33740-1

  • Online ISBN: 978-3-319-33742-5

  • eBook Packages: Computer Science (R0)
