Reproducible Research in Document Analysis and Recognition

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 738)

Abstract

With reproducible research becoming a de facto standard in computational sciences, many approaches have been explored to enable researchers in other disciplines to adopt this standard. In this paper, we explore the importance of reproducible research in the field of document analysis and recognition and in the Computer Science field as a whole. First, we report on the difficulties that one can face in trying to reproduce research in the current publication standards. These difficulties for a large percentage of research may include missing raw or original data, a lack of tidied up version of the data, no source code available, or lacking the software to run the experiment. Furthermore, even when we have all these tools available, we found it was not a trivial task to replicate the research due to lack of documentation and deprecated dependencies. In this paper, we offer a solution to these reproducible research issues by utilizing container technologies such as Docker. As an example, we revisit the installation and execution of OCRSpell which we reported on and implemented in 1994. While the code for OCRSpell is freely available on github, we continuously get emails from individuals who have difficulties compiling and using it in modern hardware platforms. We walk through the development of an OCRSpell Docker container for creating an image, uploading such an image, and enabling others to easily run this program by simply downloading the image and running the container.

Keywords

Reproducible research Containers Docker OCRSpell Document analysis and recognition 

References

  1. 1.
    R.D. Peng, Reproducible research in computational science. Science 334(6060), 1226–1227 (2011)CrossRefGoogle Scholar
  2. 2.
    G.K. Sandve, A. Nekrutenko, J. Taylor, E. Hovig, Ten simple rules for reproducible computational research. PLoS Comput. Biol. 9(10), e1003285 (2013)CrossRefGoogle Scholar
  3. 3.
    K. Ram, Git can facilitate greater reproducibility and increased transparency in science. Source Code Biol. Med. 8(1) 7 (2013)CrossRefGoogle Scholar
  4. 4.
    H. Wickham et al., Tidy data. J. Stat. Softw. 59(10), 1–23 (2014)Google Scholar
  5. 5.
    C. Collberg, T. Proebsting, G. Moraila, A. Shankaran, Z. Shi, A.M. Warren, Measuring reproducibility in computer systems research, Technical report, 2014Google Scholar
  6. 6.
    N. Barnes, Publish your computer code: it is good enough. Nature 467(7317), 753 (2010)Google Scholar
  7. 7.
    J.P. Ioannidis, Why most published research findings are false. PLos Med 2(8), e124 (2005)CrossRefGoogle Scholar
  8. 8.
    T.H. Vines, R.L. Andrew, D.G. Bock, M.T. Franklin, K.J. Gilbert, N.C. Kane, J.-S. Moore, B.T. Moyers, S. Renaut, D.J. Rennison et al., Mandated data archiving greatly improves access to research data. FASEB J 27(4), 1304–1308 (2013)CrossRefGoogle Scholar
  9. 9.
    Testimony on scientific integrity & transparency. https://www.gpo.gov/fdsys/pkg/CHRG-113hhrg79929/pdf/CHRG-113hhrg79929.pdf. Accessed 2017-03-01
  10. 10.
    J.T. Leek, R.D. Peng, Opinion: reproducible research can still be wrong: Adopting a prevention approach. Proc. Natl. Acad. Sci. 112(6), 1645–1646 (2015)CrossRefGoogle Scholar
  11. 11.
    G. Marcus, E. Davis, Eight (no, nine!) problems with big data. New York Times 6(04), 2014 (2014)Google Scholar
  12. 12.
    C. Boettiger, An introduction to docker for reproducible research. ACM SIGOPS Oper. Syst. Rev. 49(1), 71–79 (2015)CrossRefGoogle Scholar
  13. 13.
    I. Jimenez, C. Maltzahn, A. Moody, K. Mohror, J. Lofstead, R. Arpaci-Dusseau, A. Arpaci-Dusseau, The role of container technology in reproducible computer systems research, in 2015 IEEE International Conference on Cloud Engineering (IC2E) (IEEE, New York, 2015), pp. 379–385Google Scholar
  14. 14.
    L.-H. Hung, D. Kristiyanto, S.B. Lee, K.Y. Yeung, Guidock: using docker containers with a common graphics user interface to address the reproducibility of research. PloS One 11(4), e0152686 (2016)CrossRefGoogle Scholar
  15. 15.
    P. Di Tommaso, E. Palumbo, M. Chatzou, P. Prieto, M.L. Heuer, C. Notredame, The impact of docker containers on the performance of genomic pipelines. PeerJ 3, e1273 (2015)CrossRefGoogle Scholar
  16. 16.
    D. Hládek, J. Staš, S. Ondáš, J. Juhár, L. Kovács, Learning string distance with smoothing for OCR spelling correction. Multimedia Tools and Applications 76(22), 24549–24567 (2017)CrossRefGoogle Scholar
  17. 17.
    K. Taghva, E. Stofsky, Ocrspell: an interactive spelling correction system for OCR errors in text. Int. J. Doc. Anal. Recogn. 3(3), 125–137 (2001)CrossRefGoogle Scholar
  18. 18.
    K. Taghva, T. Nartker, J. Borsack, Information access in the presence of OCR errors, in Proceedings of the 1st ACM Workshop on Hardcopy Document Processing (ACM, New York, 2004), pp. 1–8Google Scholar
  19. 19.
    P. Belmann, J. Dröge, A. Bremges, A.C. McHardy, A. Sczyrba, M.D. Barton, Bioboxes: standardised containers for interchangeable bioinformatics software. Gigascience 4(1), 47 (2015)Google Scholar
  20. 20.
    A. Hosny, P. Vera-Licona, R. Laubenbacher, T. Favre, Algorun, a docker-based packaging system for platform-agnostic implemented algorithms. Bioinformatics 32, btw120 (2016)CrossRefGoogle Scholar

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. 1.Department of Computer ScienceUniversity of NevadaLas VegasUSA

Personalised recommendations