Abstract
We analyze a web portal that provides access to over 250,000 scanned and OCRed cultural heritage documents. The collection consists of the complete Dutch Hansard from 1917 to 1995. Each document consists of facsimile images of the original pages plus hidden OCRed text. Including the images yields large files, of which less than 2% is actual text. The portal's search interface ranks results poorly and offers uninformative document summaries (snippets), so users must weed out non-relevant results themselves. To do that, they must assess the complete documents, a time-consuming and frustrating process because the large files take long to download and process. Instead of using the scanned images for relevance assessment, we propose to use a reconstruction of the original document from a purely semantic representation. Evaluation on the Dutch dataset shows that these reconstructions are two orders of magnitude smaller and still resemble the original to a high degree. In addition, they are easier to speed-read and evaluate for relevance, thanks to added hyperlinks and a presentation optimized for reading from a terminal. We describe the reconstruction process and evaluate its costs, benefits, and quality.
Acknowledgments
Maarten Marx acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under the FET-Open grant agreement FOX, number FP7-ICT-233599.
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Marx, M., Gielissen, T. Digital weight watching: reconstruction of scanned documents. IJDAR 14, 229–239 (2011). https://doi.org/10.1007/s10032-010-0135-3
DOI: https://doi.org/10.1007/s10032-010-0135-3