Digital weight watching: reconstruction of scanned documents

Marx, Maarten; Gielissen, Tim

doi:10.1007/s10032-010-0135-3

Original Paper
Open access
Published: 31 October 2010

Volume 14, pages 229–239, (2011)

International Journal on Document Analysis and Recognition (IJDAR)

Maarten Marx¹ &
Tim Gielissen¹

Abstract

A web portal providing access to over 250,000 scanned and OCRed cultural heritage documents is analyzed. The collection consists of the complete Dutch Hansard from 1917 to 1995. Each document consists of facsimile images of the original pages plus hidden OCRed text. Including the images yields large files, of which less than 2% is the actual text. The search user interface of the portal ranks results poorly and offers uninformative document summaries (snippets), so users must weed out non-relevant results themselves. For that, they must assess the complete documents, which is time-consuming and frustrating because the large files take long to download and process. Instead of using the scanned images for relevance assessment, we propose to use a reconstruction of the original document from a purely semantic representation. Evaluation on the Dutch dataset shows that these reconstructions are two orders of magnitude smaller and still resemble the original to a high degree. In addition, they are easier to speed-read and evaluate for relevance, due to added hyperlinks and a presentation optimized for reading from a terminal. We describe the reconstruction process and evaluate the costs, the benefits, and the quality.
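As a rough illustration of the idea in the abstract, the sketch below renders a small semantic XML record as lightweight HTML with a hyperlinked speaker index, and computes how many times smaller such a reconstruction is than a given original file. It is a minimal sketch under stated assumptions: the element and attribute names (`meeting`, `topic`, `speech`, `speaker`), the sample text, and the helper functions are all hypothetical and are not taken from the authors' schema or implementation.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical semantic representation of one meeting. Element names,
# attributes, and text are illustrative assumptions, not the authors' schema.
SEMANTIC_XML = """<meeting date="1955-03-01">
  <topic title="Budget">
    <speech speaker="Mr. Jansen" id="s1">
      <p>Mr. Chairman, I thank the minister.</p>
      <p>The proposal needs more detail.</p>
    </speech>
    <speech speaker="Minister De Vries" id="s2">
      <p>I will answer in writing.</p>
    </speech>
  </topic>
</meeting>"""


def reconstruct_html(xml_text: str) -> str:
    """Render the semantic XML as lightweight HTML with a linked speaker index."""
    root = ET.fromstring(xml_text)
    index, body = [], []
    for topic in root.iter("topic"):
        body.append(f"<h2>{escape(topic.get('title', ''))}</h2>")
        for speech in topic.iter("speech"):
            sid = escape(speech.get("id", ""))
            speaker = escape(speech.get("speaker", ""))
            # Each index entry links to the anchor of the matching speech.
            index.append(f'<li><a href="#{sid}">{speaker}</a></li>')
            body.append(f'<h3 id="{sid}">{speaker}</h3>')
            for p in speech.iter("p"):
                body.append(f"<p>{escape(p.text or '')}</p>")
    date = escape(root.get("date", ""))
    return (f"<html><body><h1>Meeting of {date}</h1>"
            f"<ul>{''.join(index)}</ul>{''.join(body)}</body></html>")


def size_ratio(original_bytes: int, reconstructed: str) -> float:
    """Return how many times smaller the reconstruction is than the original."""
    return original_bytes / len(reconstructed.encode("utf-8"))


html = reconstruct_html(SEMANTIC_XML)
```

Calling `size_ratio` with the byte size of a scanned file and the string returned by `reconstruct_html` gives the reduction factor; a value near 100 would correspond to the two orders of magnitude reported in the abstract.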



Acknowledgments

Maarten Marx acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under the FET-Open grant agreement FOX, number FP7-ICT-233599.

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Author information

Authors and Affiliations

ISLA, University of Amsterdam, Science Park 107, 1098 XG, Amsterdam, The Netherlands
Maarten Marx & Tim Gielissen

Corresponding author

Correspondence to Maarten Marx.

Rights and permissions

This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

Marx, M., Gielissen, T. Digital weight watching: reconstruction of scanned documents. IJDAR 14, 229–239 (2011). https://doi.org/10.1007/s10032-010-0135-3

Received: 07 December 2009
Revised: 02 July 2010
Accepted: 14 October 2010
Published: 31 October 2010
Issue Date: June 2011
DOI: https://doi.org/10.1007/s10032-010-0135-3
