Characterization of Written Languages Using Structural Features from Common Corpora

  • Younis Al Rozz
  • Harith Hamoodat
  • Ronaldo Menezes
Conference paper
Part of the Springer Proceedings in Complexity book series (SPCOM)


For more than 5,000 years, we have been communicating using some form of written language. For many scholars, the advent of written language contributed to the development of societies because it enabled knowledge to be passed to future generations without considerable loss of information or ambiguity. Today, it is estimated that we use about 7,000 languages to communicate, but the majority of these do not have a written form; in fact, there are no reliable estimates of how many written languages exist today. There are three main families of written languages: Afro-Asiatic, Indo-European, and Turkic. These families are based on historical family trees. However, with the amount of data available today, one can start looking at language classification using regularities extracted from text corpora. This paper focuses on regularities of 10 languages from these families. To find features for these languages we use (1) Heaps’ law, which models the number of distinct words in a corpus as a function of the total number of words in the same corpus, and (2) structural properties of networks created from word co-occurrence in large corpora for different languages. Using clustering approaches, we show that, despite differences arising from years of use in separate countries, the clusters still seem to respect some historical organization of families.
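The two feature sources named in the abstract can be illustrated with a minimal sketch. It assumes the usual power-law form of Heaps’ law, V(n) = K·n^β, where V(n) is the number of distinct words among the first n tokens; the abstract itself does not state the formula. The log-log least-squares fit, the adjacency window (`max_distance`), and all function names below are illustrative assumptions, not the authors' procedure.

```python
import math
from collections import defaultdict


def vocabulary_growth(tokens):
    """Return (n, V(n)) pairs: distinct words seen among the first n tokens."""
    seen = set()
    growth = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        growth.append((n, len(seen)))
    return growth


def fit_heaps(growth):
    """Fit log V = log K + beta * log n by ordinary least squares.

    Returns (K, beta). Assumes at least two points with distinct n.
    """
    xs = [math.log(n) for n, _ in growth]
    ys = [math.log(v) for _, v in growth]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


def cooccurrence_network(tokens, max_distance=1):
    """Build an undirected, weighted word co-occurrence graph.

    Two distinct words are linked when they appear within `max_distance`
    positions of each other; the edge weight counts how often that happens.
    Edges are stored as {(word_a, word_b): weight} with word_a < word_b.
    """
    edges = defaultdict(int)
    for i, word in enumerate(tokens):
        stop = min(i + 1 + max_distance, len(tokens))
        for j in range(i + 1, stop):
            other = tokens[j]
            if other != word:  # skip self-loops
                edges[tuple(sorted((word, other)))] += 1
    return dict(edges)


def average_degree(edges):
    """Mean number of distinct neighbours per node (ignores edge weights)."""
    nodes = {word for edge in edges for word in edge}
    return 2 * len(edges) / len(nodes)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the rug".split()
    k, beta = fit_heaps(vocabulary_growth(sample))
    network = cooccurrence_network(sample)
    print(f"K={k:.3f}, beta={beta:.3f}, average degree={average_degree(network):.2f}")
```

In a setup like this, the fitted exponent and network statistics such as the average degree could be collected into one feature vector per language and passed to a clustering method; which features and which clustering method the paper actually uses is not stated in the abstract.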


Co-occurrence networks, Language classification, Heaps’ law, Clustering



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Younis Al Rozz (1)
  • Harith Hamoodat (1)
  • Ronaldo Menezes (1)
  1. BioComplex Laboratory, School of Computing, Florida Institute of Technology, Melbourne, USA
