Digitization, Coded Character Sets, and Optical Character Recognition for Multi-script Information Resources: The Case of the Letopis’ Zhurnal’nykh Statei

Spencer, George Andrew

doi:10.1007/3-540-44796-2_36

Conference paper
First Online: 01 January 2001

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2163)

Abstract

Multilingual information resources whose texts span more scripts than a single 8-bit encoding scheme can represent are currently best represented with the Unicode multi-byte character-encoding scheme. However, using Unicode could decrease the accuracy of Optical Character Recognition (OCR) software, because certain scripts share similar glyphs. Such a decrease can dramatically increase the time needed to proofread the resulting electronic texts. An Indiana University Digital Library Program project to digitize a 20-year portion of the Letopis’ Zhurnal’nykh Statei is presented as an example of a digital library project that deals with a multi-script information resource and uses Unicode.
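
The two problems named in the abstract can be illustrated with a short Python sketch. It is not the project's method, only an illustration under stated assumptions: the function names `can_encode`, `scripts_in`, and `mixed_script_tokens` are hypothetical, and the sample strings are invented for the example. The first function shows that a single 8-bit code page cannot hold text mixing Cyrillic with accented Latin letters; the other two flag words in which look-alike glyphs from different scripts have been mixed, which is the kind of OCR error a proofreader would otherwise have to find by eye.

```python
import unicodedata


def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def scripts_in(token):
    """Return the scripts of the letters in token.

    Assumption: the first word of a character's Unicode name
    (e.g. "CYRILLIC" in "CYRILLIC SMALL LETTER A") is used as a
    rough stand-in for its script.
    """
    found = set()
    for ch in token:
        if ch.isalpha():
            found.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return found


def mixed_script_tokens(text):
    """Return the whitespace-separated tokens that mix more than one script."""
    return [tok for tok in text.split() if len(scripts_in(tok)) > 1]


# Invented sample: Cyrillic text next to an accented Latin word.
mixed_text = "\u0421\u0442\u0430\u0442\u0435\u0439 caf\u00e9"

# Invented OCR-style error: Latin "C" (U+0043) in place of Cyrillic "С" (U+0421).
misread = "\u0043\u0442\u0430\u0442\u0435\u0439"
correct = "\u0421\u0442\u0430\u0442\u0435\u0439"

if __name__ == "__main__":
    print(can_encode(mixed_text, "koi8_r"))   # 8-bit Cyrillic code page
    print(can_encode(mixed_text, "utf-8"))    # Unicode encoding form
    print(mixed_script_tokens(misread + " " + correct + " Statei"))
```

A check like `mixed_script_tokens` only catches substitutions inside a word that also contains unambiguous letters; a word made up entirely of look-alike glyphs would pass unnoticed and would still need dictionary lookup or manual proofreading.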

Author information

Authors and Affiliations

Digital Library Program, Indiana University, Bloomington, IN, 47405, USA
George Andrew Spencer

Editor information

Editors and Affiliations

Department of Computer Science, University of Crete, Leof. Knossou, P.O. Box 1470, 71409, Heraklion, Greece
Panos Constantopoulos
Foundation for Research and Technology - Hellas, Institute of Computer Science, Vassilika Vouton, P.O. Box 1385, 71110, Heraklion, Greece
Panos Constantopoulos
Department of Computer and Information Science, The Norwegian University of Science and Technology, 7491, Trondheim, Norway
Ingeborg T. Sølvberg

About this paper

Cite this paper

Spencer, G.A. (2001). Digitization, Coded Character Sets, and Optical Character Recognition for Multi-script Information Resources: The Case of the Letopis’ Zhurnal’nykh Statei. In: Constantopoulos, P., Sølvberg, I.T. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2001. Lecture Notes in Computer Science, vol 2163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44796-2_36

DOI: https://doi.org/10.1007/3-540-44796-2_36
Published: 30 August 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42537-3
Online ISBN: 978-3-540-44796-2
eBook Packages: Springer Book Archive