Process Design to Self-extract Text from Images for Similarity Check

  • Conference paper
Advances in Intelligent Computing and Communication

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 430)

Abstract

In professional writing, plagiarism is a serious fraud and a breach of academic ethics. Similarity checking is one of the preliminary steps taken to curb plagiarism, but most similarity check tools fail to detect text embedded in images. This loophole allows plagiarized text to be inserted as an image and so bypass the similarity check software. To close it, we propose an approach, a supplement to the existing software, that extracts machine-readable text from the images in scientific documents. The approach accepts a document in portable document format (PDF) and uses metadata retrieved from the document to automatically detect and localize its images. The text is then extracted from those images in electronic form using optical character recognition (OCR). The proposed framework is validated on a document in scientific article format containing a block of text and flow diagrams in the form of images. The results indicate that the proposed method correctly detects and localizes the text present in the images and returns it in machine-readable form.
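
The pipeline described in the abstract (locate the images in a PDF, run OCR on them, pass the recovered text to a similarity check) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract names no toolchain, so the use of the third-party `pypdf`, `Pillow`, and `pytesseract` packages is an assumption, and the word-shingle Jaccard score is a stand-in for whatever measure a real similarity checker applies. All function names are hypothetical.

```python
import io
import re
from typing import Callable, Iterable, Iterator, Set, Tuple


def pdf_image_bytes(pdf_path: str) -> Iterator[bytes]:
    """Yield the raw bytes of every image embedded in a PDF.

    Assumption: the third-party `pypdf` package is installed.
    """
    from pypdf import PdfReader  # imported lazily; not part of the stdlib

    for page in PdfReader(pdf_path).pages:
        for image in page.images:
            yield image.data


def tesseract_ocr(image_bytes: bytes) -> str:
    """Run OCR on one image.

    Assumption: `Pillow`, `pytesseract`, and the Tesseract binary are installed.
    """
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(io.BytesIO(image_bytes)))


def text_from_images(images: Iterable[bytes], ocr: Callable[[bytes], str]) -> str:
    """Apply an OCR callable to each image and join the results, one per line."""
    return "\n".join(ocr(image).strip() for image in images)


def word_shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of overlapping n-word sequences in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 3) -> float:
    """Illustrative similarity score: Jaccard overlap of word shingles."""
    shingles_a, shingles_b = word_shingles(a, n), word_shingles(b, n)
    if not shingles_a or not shingles_b:
        return 0.0
    return len(shingles_a & shingles_b) / len(shingles_a | shingles_b)
```

With the optional packages installed, a call such as `text_from_images(pdf_image_bytes("paper.pdf"), tesseract_ocr)` would return the text recovered from the images, which can then be scored against a candidate source with `jaccard_similarity`. The file name `paper.pdf` is a placeholder. Passing the OCR step in as a callable keeps the rest of the pipeline independent of any particular OCR engine.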

Author information

Corresponding author

Correspondence to Kundan Kumar.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Sharmeen, A., Agarwal, N., Suman, A., Agarwal, S., Kumar, K. (2022). Process Design to Self-extract Text from Images for Similarity Check. In: Mohanty, M.N., Das, S. (eds) Advances in Intelligent Computing and Communication. Lecture Notes in Networks and Systems, vol 430. Springer, Singapore. https://doi.org/10.1007/978-981-19-0825-5_11
