Efficient Search in Hidden Text of Large DjVu Documents

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 6699)

Abstract

The paper describes an open-source tool that makes the results of advanced language technologies available to end users. It relies on the DjVu format, which for some applications is still superior to other modern formats, including PDF/A. The GPL-licensed DjVu tools are not limited to the DjVuLibre library; they are being supplemented by various new programs, such as pdf2djvu, developed by Jakub Wilk. In particular, pdf2djvu converts the PDF output of popular OCR programs such as FineReader to DjVu while preserving the hidden text layer and some other features.

The tool in question was conceived by the present author and consists of a modification of the Poliqarp corpus query tool, used for the National Corpus of Polish; the author's ideas have been very successfully implemented by Jakub Wilk. The new system, called here simply Poliqarp for DjVu, inherits from its origin not only the powerful search facilities based on two-level regular expressions, but also the ability to represent low-level ambiguities and other linguistic phenomena. Although at present the tool is used mainly to facilitate access to the results of dirty OCR, it is also ready to handle more sophisticated output of linguistic technologies.
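The idea of search based on two-level regular expressions can be illustrated with a small sketch. It assumes one reading of "two-level": an inner regular expression is matched against the characters of a single token, and an outer expression with quantifiers is matched against the sequence of tokens. This is not Poliqarp's query syntax or implementation; the names `Query`, `match_at` and `search` and the quantifier codes are hypothetical and chosen only for illustration.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical query representation (an assumption, not Poliqarp syntax):
# a list of (token_pattern, quantifier) pairs.
#   inner level: token_pattern is an ordinary regex matched against one whole token
#   outer level: quantifier is "1", "?", "*" or "+" and counts consecutive tokens
Query = List[Tuple[str, str]]


def match_at(tokens: List[str], start: int, query: Query) -> Optional[int]:
    """Return the end index of a match starting at tokens[start], or None."""
    if not query:
        return start
    pattern, quant = query[0]
    rest = query[1:]
    lo = 0 if quant in ("?", "*") else 1                    # minimum repetitions
    hi = 1 if quant in ("1", "?") else len(tokens) - start  # maximum repetitions
    # Count how many consecutive tokens satisfy the inner (character-level) regex.
    n = 0
    while n < hi and start + n < len(tokens) and re.fullmatch(pattern, tokens[start + n]):
        n += 1
    # Greedy outer match with backtracking over the repetition count.
    for k in range(n, lo - 1, -1):
        end = match_at(tokens, start + k, rest)
        if end is not None:
            return end
    return None


def search(tokens: List[str], query: Query) -> List[Tuple[int, int]]:
    """Return (start, end) token spans of all non-empty matches."""
    hits = []
    for i in range(len(tokens)):
        end = match_at(tokens, i, query)
        if end is not None and end > i:
            hits.append((i, end))
    return hits


tokens = "the old house and the houses stood near a household".split()
# "the", then an optional arbitrary token, then a token starting with "hous"
query = [("the", "1"), (".*", "?"), ("hous.*", "1")]
print(search(tokens, query))
```

In this toy example the query finds the token spans for "the old house" and "the houses". A production corpus engine would typically use indexes rather than scanning every token position, which this sketch does not attempt.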

Keywords

  • Regular Expression
  • National Corpus
  • Portable Document Format
  • Electronic Edition
  • Preceding Element

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bień, J.S. (2011). Efficient Search in Hidden Text of Large DjVu Documents. In: Bernardi, R., Chambers, S., Gottfried, B., Segond, F., Zaihrayeu, I. (eds) Advanced Language Technologies for Digital Libraries. NLP4DL 2009, AT4DL 2009. Lecture Notes in Computer Science, vol 6699. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23160-5_1

  • DOI: https://doi.org/10.1007/978-3-642-23160-5_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23159-9

  • Online ISBN: 978-3-642-23160-5

  • eBook Packages: Computer Science, Computer Science (R0)