Abstract
The last step of the Information Retrieval process is to display the retrieved documents to the user, and difficulties can arise at that point. English texts are usually encoded in ASCII. Many other languages, by contrast, use different character sets and have no single encoding standard. This plurality of standards causes problems, especially in a web environment, where one may download a document whose encoding is unknown. This paper proposes a fully automatic way of identifying the standard used by the document's writer, based on the statistical distribution of letters in the language. We developed a vector-space-based method that creates frequency vectors for each letter of the language and then matches a new document's vectors against the pre-computed templates. The algorithm was applied to various types of corpora in Hebrew, Russian and English, and provides an efficient solution to the stated problem in most cases.
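The idea of matching frequency vectors against pre-computed templates can be illustrated with a minimal sketch. It is not the method of the paper: it assumes a single byte-frequency vector per candidate encoding (rather than one vector per letter), cosine similarity as the matching measure, and a small hand-written Russian sample as the training text. The names `CANDIDATE_ENCODINGS`, `REFERENCE_TEXT`, `byte_frequency_vector`, `build_templates` and `guess_encoding` are hypothetical and chosen only for this illustration.

```python
import math
from collections import Counter

# Assumption: four common single-byte Cyrillic encodings (Python codec names).
CANDIDATE_ENCODINGS = ["cp1251", "koi8_r", "iso8859_5", "cp866"]

# Assumption: a tiny stand-in for a real training corpus.
REFERENCE_TEXT = (
    "Последний этап информационного поиска состоит в том, чтобы показать "
    "найденные документы пользователю. Тексты на русском языке могут быть "
    "записаны в нескольких разных кодировках, и программа не всегда знает, "
    "какая из них была использована. Частоты букв в языке позволяют "
    "определить кодировку автоматически."
)


def byte_frequency_vector(data):
    """Relative frequencies of the 128 high bytes (0x80-0xFF).

    ASCII bytes are ignored because they are identical in every
    candidate encoding and carry no distinguishing information.
    """
    counts = Counter(b for b in data if b >= 0x80)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * 128
    return [counts.get(b, 0) / total for b in range(0x80, 0x100)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_templates(text, encodings):
    """Encode the reference text in each encoding and store its vector."""
    return {enc: byte_frequency_vector(text.encode(enc)) for enc in encodings}


def guess_encoding(data, templates):
    """Return the encoding whose template is most similar to the document."""
    vec = byte_frequency_vector(data)
    return max(templates, key=lambda enc: cosine(vec, templates[enc]))


TEMPLATES = build_templates(REFERENCE_TEXT, CANDIDATE_ENCODINGS)

# Usage: pretend the bytes below arrived without any encoding label.
unknown = "Мы скачали документ из сети.".encode("koi8_r")
print(guess_encoding(unknown, TEMPLATES))
```

Restricting the vector to high bytes is a design choice of this sketch: each encoding places the same letters at different byte values, so the same letter distribution yields a differently permuted vector per encoding, and the template in the correct encoding lines up best with the document.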
Cite this article
Geffet, M., Wiseman, Y. & Feitelson, D. Automatic Alphabet Recognition. Information Retrieval 8, 25–40 (2005). https://doi.org/10.1023/B:INRT.0000048495.64628.ea
DOI: https://doi.org/10.1023/B:INRT.0000048495.64628.ea