Process Design to Self-extract Text from Images for Similarity Check

Sharmeen, Adiba; Agarwal, Nidhi; Suman, Arpit; Agarwal, Sumanshu; Kumar, Kundan

doi:10.1007/978-981-19-0825-5_11

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 430)

Abstract

In professional writing, plagiarism is a serious offence and a breach of academic ethics. A similarity check is one of the preliminary steps taken to restrain plagiarism, but most similarity-check tools fail to detect text that appears inside images. This loophole can be exploited by embedding plagiarized text as an image to bypass the similarity-check software. To overcome this problem, we propose an approach, intended as a supplement to existing software, that extracts machine-readable text from the images in scientific documents. The approach accepts a document in portable document format (PDF) and uses the metadata retrieved from the document to automatically detect and localize its images. The text in those images is then converted to electronic form using optical character recognition (OCR). The proposed framework is validated on a document in scientific-article format that contains a block of text and flow diagrams in the form of images. The results indicate that the proposed method correctly detects and localizes the text present in the images and returns it in machine-readable form.
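
The pipeline described in the abstract (locate the images embedded in a PDF, run OCR on each, and pass the recovered text to a similarity checker) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the third-party packages pypdf, Pillow and pytesseract (with the Tesseract binary installed), and the function names `ocr_images_in_pdf` and `combine_for_similarity_check` are hypothetical.

```python
import io
import re
from typing import Iterable, List


def ocr_images_in_pdf(pdf_path: str) -> List[str]:
    """Return the OCR text of every image embedded in the PDF, in page order."""
    # Third-party imports are kept local so the helper below runs without them.
    from pypdf import PdfReader   # assumption: pypdf is installed
    from PIL import Image         # assumption: Pillow is installed
    import pytesseract            # assumption: Tesseract binary is on PATH

    texts = []
    for page in PdfReader(pdf_path).pages:
        # page.images lists the image objects found in the page resources.
        for embedded in page.images:
            picture = Image.open(io.BytesIO(embedded.data))
            texts.append(pytesseract.image_to_string(picture))
    return texts


def combine_for_similarity_check(body_text: str, image_texts: Iterable[str]) -> str:
    """Append non-empty OCR text to the body text, with whitespace collapsed."""
    cleaned = [re.sub(r"\s+", " ", text).strip() for text in image_texts]
    parts = [body_text.strip()] + [text for text in cleaned if text]
    return "\n".join(parts)
```

In use, the output of `ocr_images_in_pdf("paper.pdf")` (a hypothetical file name) would be passed to `combine_for_similarity_check` together with the ordinary body text, so that an existing similarity checker also sees the text that was hidden in images.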

Author information

Authors and Affiliations

Department of Electronics and Communication Engineering, ITER, Siksha ‘O’ Anusandhan (Deemed To Be University), Bhubaneswar, 751030, India
Adiba Sharmeen, Nidhi Agarwal, Arpit Suman, Sumanshu Agarwal & Kundan Kumar

Corresponding author

Correspondence to Kundan Kumar.

Editor information

Editors and Affiliations

Department of Electronics and Communication Engineering, Institute of Technical Education and Research (ITER), Siksha ‘O’ Anusandhan Deemed to be University, Bhubaneswar, Odisha, India
Mihir Narayan Mohanty
Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata, West Bengal, India
Swagatam Das

About this paper

Cite this paper

Sharmeen, A., Agarwal, N., Suman, A., Agarwal, S., Kumar, K. (2022). Process Design to Self-extract Text from Images for Similarity Check. In: Mohanty, M.N., Das, S. (eds) Advances in Intelligent Computing and Communication. Lecture Notes in Networks and Systems, vol 430. Springer, Singapore. https://doi.org/10.1007/978-981-19-0825-5_11

DOI: https://doi.org/10.1007/978-981-19-0825-5_11
Published: 17 May 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-0824-8
Online ISBN: 978-981-19-0825-5
eBook Packages: Intelligent Technologies and Robotics (R0)