Disentangling from Babylonian Confusion – Unsupervised Language Identification

  • Chris Biemann
  • Sven Teresniak
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3406)

Abstract

This work presents an unsupervised solution to language identification. The method sorts the sentences of a multilingual text corpus into the languages the corpus contains and makes no assumptions about the number or size of the monolingual fractions. Evaluation on 7-lingual and bilingual corpora shows that the classification quality is comparable to that of supervised approaches and that the method works almost error-free from 100 sentences per language upward.
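The abstract does not spell out the algorithm, so the following is only an illustrative sketch of one way sentences could be grouped by language without training data. It assumes a word co-occurrence graph clustered by a simple deterministic label propagation; the function name `cluster_sentences`, the `iterations` parameter, the tie-breaking rule and the toy sentences are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cluster_sentences(sentences, iterations=10):
    """Group sentences by propagating labels over a word co-occurrence graph.

    Assumption: words of the same language co-occur in sentences far more
    often than words of different languages, so each language forms a
    densely connected region of the graph.
    """
    tokenized = [s.lower().split() for s in sentences]

    # Edge weight = number of sentences in which two words appear together.
    weights = defaultdict(Counter)
    for tokens in tokenized:
        for a, b in combinations(sorted(set(tokens)), 2):
            weights[a][b] += 1
            weights[b][a] += 1

    # Every word starts in its own class; labels are arbitrary cluster ids.
    labels = {word: word for word in weights}

    for _ in range(iterations):
        changed = False
        for word in sorted(weights):  # fixed order keeps the run deterministic
            votes = Counter()
            for neighbour, weight in weights[word].items():
                votes[labels[neighbour]] += weight
            # Strongest label wins; ties go to the alphabetically first label.
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if best != labels[word]:
                labels[word] = best
                changed = True
        if not changed:
            break

    # A sentence receives the majority label of its words (None if no word
    # of the sentence occurs in the graph, e.g. a one-word sentence).
    result = []
    for tokens in tokenized:
        votes = Counter(labels[t] for t in tokens if t in labels)
        if votes:
            result.append(min(votes, key=lambda lab: (-votes[lab], lab)))
        else:
            result.append(None)
    return result


mixed = [
    "the cat sat", "der hund lief",
    "the dog sat", "der hund kam",
    "the cat ran", "der mann lief",
]
for sentence, label in zip(mixed, cluster_sentences(mixed)):
    print(label, "|", sentence)
```

In this toy input the two vocabularies are disjoint, so the graph splits into two components and the English and German sentences end up with different labels. Real corpora share tokens such as names and numbers across languages, which this sketch does not handle.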

Keywords

Class Label, Language Identification, Graph Algorithm, Language Pair, German Corpus
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Chris Biemann (1)
  • Sven Teresniak (1)
  1. Computer Science Institute, NLP Dept., Leipzig University, Leipzig, Germany
