CICLing 2005: Computational Linguistics and Intelligent Text Processing, pp. 773-784
Disentangling from Babylonian Confusion – Unsupervised Language Identification
Conference paper
Abstract
This work presents an unsupervised solution to language identification. The method sorts multilingual text corpora sentence by sentence into the languages they contain and makes no assumptions about the number or size of the monolingual fractions. Evaluations on 7-lingual and bilingual corpora show that the classification quality is comparable to that of supervised approaches and that the method works almost error-free once at least 100 sentences per language are available.
Keywords
Class Label, Language Identification, Graph Algorithm, Language Pair, German Corpus
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© Springer-Verlag Berlin Heidelberg 2005