Abstract
The detection of duplicate and related content is a critical data curation task in the context of digital research collections. This task can be challenging, if not impossible, to do manually in large, unstructured, and noisy collections. While there are many automated solutions for deduplicating data that contain large numbers of identical copies, it can be particularly difficult to find a solution for identifying redundancy within image-heavy collections that have evolved over a long span of time or have been created collaboratively by large groups. Such collections, especially in academic research settings where the datasets are used for a wide range of publication, teaching, and research activities, can be characterized by (1) large numbers of heterogeneous file formats, (2) repetitive photographic documentation of the same subjects in a variety of conditions, (3) multiple copies or subsets of images with slight modifications (e.g., cropping or color-balancing), and (4) complex file structures and naming conventions that may not be consistent throughout. In this chapter, we present a scalable and automated approach for detecting duplicate, similar, and related images, along with subimages, in digital data collections. Our approach can assist in efficiently managing redundancy in any large image collection on High Performance Computing (HPC) resources. While we illustrate the approach with a large archaeological collection, it is domain-neutral and applies to image-heavy collections on any HPC platform that has general-purpose processors.
Acknowledgments
We are grateful to ICA, XSEDE, TACC, and the STAR Scholar Program for providing us with resources to conduct this research. We are also grateful to our colleague Antonio Gomez for help with installing dupeGuru on Stampede.
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Arora, R., Trelogan, J., Ba, T.N. (2016). Using High Performance Computing for Detecting Duplicate, Similar and Related Images in a Large Data Collection. In: Arora, R. (eds) Conquering Big Data with High Performance Computing. Springer, Cham. https://doi.org/10.1007/978-3-319-33742-5_13
DOI: https://doi.org/10.1007/978-3-319-33742-5_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-33740-1
Online ISBN: 978-3-319-33742-5
eBook Packages: Computer Science, Computer Science (R0)