Automatic Alphabet Recognition

Geffet, Maayan; Wiseman, Yair; Feitelson, Dror

doi:10.1023/B:INRT.0000048495.64628.ea

Published: January 2005

Volume 8, pages 25–40 (2005)

Information Retrieval

Maayan Geffet¹,
Yair Wiseman¹ &
Dror Feitelson¹

Abstract

The last step of the Information Retrieval process is to display the retrieved documents to the user, and difficulties can arise at that point. English texts are usually encoded in ASCII. Many other languages, by contrast, have their own character sets and no single encoding standard. This plurality of standards causes problems, especially on the web, where one may download a document whose encoding is unknown. This paper proposes a fully automatic way of identifying the standard used by the document's writer, based on the statistical distribution of letters in the language. We developed a vector-space-based method that creates frequency vectors for each letter of the language and then matches a new document's vectors to pre-computed templates. The algorithm was applied to various types of corpora in Hebrew, Russian and English, and provides an efficient solution to the stated problem in most cases.
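The template-matching idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the per-letter vectors are built, so this simplified version assumes a single byte-frequency vector per document and cosine similarity as the matching measure. The candidate encoding list, the sample text, and all function names are illustrative assumptions.

```python
import math
from collections import Counter

# Assumed candidate set: four single-byte Cyrillic encodings known to Python.
CANDIDATES = ["koi8_r", "cp1251", "cp866", "iso8859_5"]

# Illustrative reference text; a real template would use a large corpus.
SAMPLE = (
    "Последний шаг процесса информационного поиска состоит в том, "
    "чтобы показать найденные документы пользователю. "
    "Однако на этом этапе могут возникнуть трудности."
)


def byte_frequencies(data: bytes) -> list:
    """Relative frequency of each non-ASCII byte value (0x80..0xFF)."""
    counts = Counter(b for b in data if b >= 0x80)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * 128
    return [counts.get(b, 0) / total for b in range(0x80, 0x100)]


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_templates(reference_text: str) -> dict:
    """Pre-compute one frequency vector per candidate encoding."""
    return {
        enc: byte_frequencies(reference_text.encode(enc)) for enc in CANDIDATES
    }


def guess_encoding(data: bytes, templates: dict) -> str:
    """Return the candidate whose template is closest to the document."""
    doc_vector = byte_frequencies(data)
    return max(templates, key=lambda enc: cosine(doc_vector, templates[enc]))


if __name__ == "__main__":
    templates = build_templates(SAMPLE)
    unknown = "Многие языки используют разные наборы символов.".encode("koi8_r")
    print(guess_encoding(unknown, templates))
```

The sketch works because each encoding maps the same letters to different byte values, so the same language statistics produce a differently permuted vector under each standard. Encodings that place letters at identical byte positions cannot be separated this way and would need extra evidence.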

Author information

Authors and Affiliations

Hebrew University, Jerusalem
Maayan Geffet, Yair Wiseman & Dror Feitelson

About this article

Cite this article

Geffet, M., Wiseman, Y. & Feitelson, D. Automatic Alphabet Recognition. Information Retrieval 8, 25–40 (2005). https://doi.org/10.1023/B:INRT.0000048495.64628.ea

Issue Date: January 2005
DOI: https://doi.org/10.1023/B:INRT.0000048495.64628.ea
