Abstract
Multilingual information resources whose texts span more scripts than a single 8-bit encoding scheme can represent are currently best represented with the Unicode multi-byte character-encoding scheme. However, the use of Unicode could decrease the accuracy of Optical Character Recognition (OCR) software, because glyphs in certain scripts closely resemble glyphs in others. This decrease in OCR accuracy can dramatically increase the time needed to proofread the resulting electronic texts. An Indiana University Digital Library Program project to digitize a 20-year portion of the Letopis’ Zhurnal’nykh Statei is presented as an example of a digital library project that deals with a multi-script information resource for which Unicode has been used.
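The glyph-similarity problem can be illustrated with a minimal Python sketch. It is not taken from the paper: the helper names `script_of` and `mixed_script_words` are hypothetical, and inferring a script from the first word of a character's Unicode name is a simplifying assumption. The sketch shows that the Latin and Cyrillic capital A are distinct code points even though they look alike, and flags words that mix scripts, which is one way a proofreader might spot OCR confusions.

```python
import unicodedata
from typing import List


def script_of(ch: str) -> str:
    """Rough script label: the first word of the Unicode character name.

    Assumption for illustration only; a production tool would use the
    Unicode Script property instead.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def mixed_script_words(text: str) -> List[str]:
    """Return the whitespace-separated words whose letters span several scripts."""
    flagged = []
    for word in text.split():
        scripts = {script_of(ch) for ch in word if ch.isalpha()}
        if len(scripts) > 1:
            flagged.append(word)
    return flagged


latin_a = "\u0041"      # LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"   # CYRILLIC CAPITAL LETTER A

# The two letters render almost identically but are different characters.
print(latin_a == cyrillic_a)                     # False
print(script_of(latin_a), script_of(cyrillic_a))

# Hypothetical OCR output: the second word begins with a Latin "C"
# followed by Cyrillic letters, as a recognizer might mistakenly produce.
correct = "\u0421\u0442\u0430\u0442\u0435\u0439"   # all Cyrillic
confused = "C\u0442\u0430\u0442\u0435\u0439"       # Latin C + Cyrillic
print(mixed_script_words(correct + " " + confused))
```

A check of this kind only narrows down where to look; words made entirely of look-alike letters from the wrong script would pass it unnoticed.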
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Spencer, G.A. (2001). Digitization, Coded Character Sets, and Optical Character Recognition for Multi-script Information Resources: The Case of the Letopis’ Zhurnal’nykh Statei. In: Constantopoulos, P., Sølvberg, I.T. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2001. Lecture Notes in Computer Science, vol 2163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44796-2_36
DOI: https://doi.org/10.1007/3-540-44796-2_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42537-3
Online ISBN: 978-3-540-44796-2
eBook Packages: Springer Book Archive