Syntactic Structures and the General Markov Models

Mathematics in Computer Science

Abstract

We study the phylogenetic signal present in syntactic information by considering the syntactic structures data from Longobardi (Linguist Anal 41:517–557, 2017), Collins (Syntactic structures of the world’s language: a cross-linguistic database. Colloquium, 27 September 2010: https://ling.yale.edu/syntactic-structures-worlds-language-cross-linguistic-database, 2010), Ceolin et al. (Front Psychol 11:2384, 2020) and Koopman (SSWL syntactic structures of the world’s languages: an open-ended database for the linguistic community and by the linguistic community. mit 50, 12. http://sswl.railsplayground.net/, 2011). Focusing first on the general Markov models, we explore how well the syntactic structures data conform to the hypotheses required by these models. We do this by comparing derived phylogenetic trees against trees agreed on by the linguistics community. We then interpret the methods of Ceolin et al. (2020) as an infinite sites evolutionary model and compare the consistency of the data with this alternative. The ideas and methods discussed in the present paper apply beyond the specific setting of syntactic structures and can be used in other contexts when analyzing the consistency of data with hypothesized evolutionary models.

Code/Data availability

The code and data used in this paper are available at https://github.com/minorllama/syntactic_structures_phylogenetics.

Notes

  1. The slight issue with negative determinants in Lake’s definition can be sidestepped using a constant scaling of the metric and moving it inside the logarithm.

  2. A rate matrix is any matrix in which each row sums to zero, the off-diagonal entries are non-negative, and the diagonal entries are non-positive; each edge is thought of as a continuous-time Markov chain associated with the rate matrix.

  3. In the updated SSWL, Hittite has “11 Adposition Noun Phrase” set to value 0 and Armenian (Western Armenian) has “Neg 01 Standard Negation is Particle that Precedes the Verb” set to value 1.

  4. 01 Subject Verb, 06 Subject Object Verb, 11 Adposition Noun Phrase, 13 Adjective Noun, 15 Numeral Noun, 17 Demonstrative Noun, 19 Possessor Noun, 21 Pronominal Possessor Noun, Neg 03 Standard Negation is Prefix, Neg 08 Standard Negation is Tone plus Other Modification, Neg 10 Standard Negation is Infix, Neg 12 Distinct Negation of identity, Neg 13 Distinct Negation of Existence, Neg 14 Distinct Negation of Location, Order N3 01 Demonstrative Adjective Noun, Neg 04 Standard Negation is Suffix, 12 Noun Phrase Adposition.

  5. This convergence can be quantified with the Berry–Esseen theorem; see Durrett [13].
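
Note 1 can be illustrated with a minimal sketch. It assumes one plausible reading of the note, namely that the paralinear distance of Lake [23] is multiplied by the constant 2 so that the determinant of the joint frequency matrix appears squared inside the logarithm, which keeps the expression defined when that determinant is negative. The function names, the binary encoding of the characters, and the choice of scaling constant are assumptions made for illustration and are not taken from the paper or its repository.

```python
import numpy as np


def joint_frequencies(x, y, k=2):
    """Return the k-by-k matrix of joint state frequencies of two aligned sequences."""
    J = np.zeros((k, k))
    for a, b in zip(x, y):
        J[a, b] += 1.0
    return J / len(x)


def scaled_paralinear_distance(x, y, k=2):
    """Twice the paralinear distance, written so that det(J) enters squared.

    Computes -log(det(J)^2 / (det(D_x) * det(D_y))), where D_x and D_y are the
    diagonal matrices of marginal frequencies. Squaring the determinant is the
    illustrative assumption here; it avoids the logarithm of a negative number.
    """
    J = joint_frequencies(x, y, k)
    fx = J.sum(axis=1)  # marginal frequencies of x (diagonal of D_x)
    fy = J.sum(axis=0)  # marginal frequencies of y (diagonal of D_y)
    det_j = np.linalg.det(J)
    denom = np.prod(fx) * np.prod(fy)
    if np.isclose(det_j, 0.0) or np.isclose(denom, 0.0):
        return float("inf")  # the distance is undefined (infinite) in this case
    return float(-np.log(det_j ** 2 / denom))


if __name__ == "__main__":
    x = [0, 0, 0, 0, 1, 1, 1, 1]
    y = [0, 0, 0, 1, 1, 1, 1, 0]
    flipped = [1 - v for v in x]  # relabeling of states: det(J) is negative
    print(scaled_paralinear_distance(x, y))
    print(scaled_paralinear_distance(x, flipped))
```

In this sketch a pure relabeling of the two states gives a negative determinant but a distance of zero, which is the case in which the unscaled formula would require the logarithm of a negative number.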

References

  1. Allman, E., Rhodes, J.: Phylogenetic ideals and varieties for general Markov models. Adv. Appl. Math. 40, 127–148 (2008)

  2. Allman, E.S., Rhodes, J.A., Sullivant, S.: When do phylogenetic mixture models mimic other phylogenetic models? Syst. Biol. 61(6), 1049–1059 (2012)

  3. Baker, M.C.: The Atoms of Language. Basic Books, New York (2002)

  4. Biberauer, T.: The Limits of Syntactic Variation. John Benjamins Publishing, Amsterdam (2008)

  5. Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S.J., Alekseyenko, A.V., Drummond, A.J., Gray, R.D., Suchard, M.A., Atkinson, Q.D.: Mapping the origins and expansion of the Indo-European language family. Science 337(6097), 957–960 (2012)

  6. Bouckaert, R., Heled, J., Kühnert, D., Vaughan, T., Wu, C.H., Xie, D., Suchard, M.A., Rambaut, A., Drummond, A.J.: BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 10(4), e1003537 (2014)

  7. Ceolin, A., Guardiano, C., Irimia, M.A., Longobardi, G.: Formal syntax and deep history. Front. Psychol. 11, 2384 (2020)

  8. Ceolin, A., Guardiano, C., Longobardi, G., Irimia, M.A., Bortolussi, L., Sgarro, A.: At the boundaries of syntactic prehistory. Philos. Trans. R. Soc. B 376(1824), 20200197 (2021)

  9. Chomsky, N.: Lectures on Government and Binding. Walter de Gruyter, Basel (1981)

  10. Chomsky, N., Lasnik, H.: The theory of principles and parameters. In: Jacobs, J., von Stechow, A., Sternefeld, W., Vennemann, T. (eds.) Syntax: An International Handbook of Contemporary Research, pp. 506–569. Walter de Gruyter, Basel (1993)

  11. Collins, C.: Syntactic Structures of the World’s Language: A Cross-linguistic Database. Colloquium, 27 September (2010). https://ling.yale.edu/syntactic-structures-worlds-language-cross-linguistic-database

  12. Dryer, M.S., Haspelmath, M.: WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig (2013). https://wals.info/

  13. Durrett, R.: Probability: Theory and Examples, vol. 49. Cambridge University Press, Cambridge (2019)

  14. Eriksson, N.K.: Algebraic Combinatorics for Computational Biology. PhD thesis, University of California, Berkeley (2006)

  15. Felsenstein, J.: Inferring Phylogenies, vol. 2. Sinauer Associates, Sunderland (2004)

  16. Gascuel, O., Steel, M.: Neighbor-joining revealed. Mol. Biol. Evol. 23(11), 1997–2000 (2006)

  17. Gray, R.D., Drummond, A.J., Greenhill, S.J.: Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science 323(5913), 479–483 (2009)

  18. Guardiano, C., Michelioudakis, D., Ceolin, A., Irimia, M., Longobardi, G., Radkevich, N., Sitaridou, I., Silvestri, G.: South by southeast. A syntactic approach to Greek and Romance microvariation. L’Ital. Dialett. 77, 95–166 (2016)

  19. Hoffmann, K., Bouckaert, R., Greenhill, S.J., Kühnert, D.: Bayesian phylogenetic analysis of linguistic data using BEAST. J. Lang. Evol. 6(2), 119–135 (2021)

  20. Karimi, S., Piattelli-Palmarini, M.: Special issue on parameters. Linguist. Anal. 41, 3–4 (2017)

  21. Kazakov, D.L., Cordoni, G., Algahtani, E., Ceolin, A., Irimia, M.A., Kim, S.S., Michelioudakis, D., Radkevich, N., Guardiano, C., Longobardi, G.: Learning implicational models of universal grammar parameters. In: Cuskley, C., Flaherty, M., Little, H., McCrohon, L., Ravignani, A., Verhoef, T. (eds.) The Evolution of Language: Proceedings of the 12th International Conference (EVOLANGXII). NCU Press (2018). https://doi.org/10.12775/3991-1.048. http://evolang.org/torun/proceedings/papertemplate.html?p=176

  22. Koopman, H.: SSWL Syntactic Structures of the World’s Languages: An Open-ended Database for the Linguistic Community and by the Linguistic Community. mit 50, 12 (2011). http://sswl.railsplayground.net/

  23. Lake, J.A.: Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl. Acad. Sci. 91(4), 1455–1459 (1994)

  24. Longobardi, G.: Convergence in parametric phylogenies. Homoplasy or principled explanation? In: Galves, C., Cyrino, S., Lopes, R., Sandalo, F., Avelar, J. (eds.) Parameter Theory and Linguistic Change. Oxford University Press, Oxford (2012). https://doi.org/10.1093/acprof:oso/9780199659203.001.0001

  25. Longobardi, G.: Convergence in parametric phylogenies: homoplasy or principled explanation? In: Galves, C., Cyrino, S., Lopes, R., Sandalo, F., Avelar, J. (eds.) Parameter Theory and Language Change, pp. 304–319. Oxford University Press, Oxford (2012)

  26. Longobardi, G.: Principles, parameters, and schemata: a constructivist UG. Linguist. Anal. 41(3–4), 517–556 (2017)

  27. Longobardi, G.: Principles, parameters, and schemata. A constructivist UG. Linguist. Anal. 41, 517–557 (2017)

  28. Longobardi, G., Guardiano, C.: Evidence for syntax as a signal of historical relatedness. Lingua 119, 1679–1706 (2009)

  29. Longobardi, G., Guardiano, C., Silvestri, G., Boattini, A., Ceolin, A.: Toward a syntactic phylogeny of modern Indo-European languages. J. Hist. Linguist. 3(1), 122–152 (2013)

  30. Ma, J., Ratan, A., Raney, B.J., Suh, B.B., Miller, W., Haussler, D.: The infinite sites model of genome evolution. Proc. Natl. Acad. Sci. 105(38), 14254–14261 (2008)

  31. Marcolli, M.: Syntactic parameters and a coding theory perspective on entropy and complexity of language families. Entropy 18(4), 110 (2016). https://doi.org/10.3390/e18040110

  32. Matsen, F.A., Steel, M.: Phylogenetic mixtures on a single tree can mimic a tree of another topology. Syst. Biol. 56(5), 767–775 (2007)

  33. Murawaki, Y.: Analyzing correlated evolution of multiple features using latent representations. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4371–4382 (2018)

  34. Nicholls, G.K., Gray, R.D.: Dated ancestral trees from binary trait data and their application to the diversification of languages. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 70, 545–566 (2008)

  35. Niyogi, P.: The Computational Nature of Language Learning and Evolution, Volume 43 of Current Studies in Linguistics. MIT Press, Cambridge (2006)

  36. Niyogi, P., Berwick, R.C.: A dynamical systems model for language change. Complex Syst. 11(3), 161–204 (1997)

  37. Nurbakova, D., Rusakov, S., Alexandrov, V.: Quantifying uncertainty in phylogenetic studies of the Slavonic languages. Procedia Comput. Sci. 18, 2269–2277 (2013)

  38. O’Donnell, R.: Analysis of Boolean Functions. Cambridge University Press, Cambridge (2014)

  39. Ortegaray, A., Berwick, R.C., Marcolli, M.: Heat Kernel Analysis of Syntactic Structures. CoRR (2018). arXiv:1803.09832

  40. Pachter, L., Sturmfels, B.: Algebraic Statistics for Computational Biology, vol. 13. Cambridge University Press, Cambridge (2005)

  41. Pachter, L., Sturmfels, B.: The mathematics of phylogenomics. SIAM Rev. 49(1), 3–31 (2007)

  42. Pagel, M., Meade, A.: A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. 53(4), 571–581 (2004)

  43. Park, J.J., Boettcher, R., Zhao, A., Mun, A., Yuh, K., Kumar, V., Marcolli, M.: Prevalence and recoverability of syntactic parameters in sparse distributed memories. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Structures of Information 2017. Lecture Notes in Computer Science, vol. 10589, pp. 1–8. Springer, Cham (2017)

  44. Pereltsvaig, A., Lewis, M.W.: The Indo-European Controversy: Facts and Fallacies in Historical Linguistics. Cambridge University Press, Cambridge (2015)

  45. Piispanen, P.: The Uralic–Yukaghiric connection revisited: sound correspondences of geminate clusters. Suom.-Ugr. Seuran Aikakauskirja 2013(94), 165–197 (2013). https://doi.org/10.33340/susa.82515

  46. Port, A., Gheorghita, I., Guth, D., Clark, J.M., Liang, C., Dasu, S., Marcolli, M.: Persistent topology of syntax. Math. Comput. Sci. 12(1), 33–50 (2018). https://doi.org/10.1007/s11786-017-0329-x

  47. Port, A., Karidi, T., Marcolli, M.: Topological Analysis of Syntactic Structures. CoRR (2019). arXiv:1903.05181

  48. Rexová, K., Frynta, D., Zrzavý, J.: Cladistic analysis of languages: Indo-European classification based on lexicostatistical data. Cladistics 19(2), 120–127 (2003)

  49. Ringe, D., Warnow, T., Taylor, A.: Indo-European and computational cladistics. Trans. Philol. Soc. 100, 59–129 (2002)

  50. Rizzi, L.: On the format and locus of parameters: the role of morphosyntactic features. Linguist. Anal. 41(3–4), 159–191 (2017)

  51. Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4), 406–425 (1987)

  52. Semple, C., Steel, M.: Phylogenetics, vol. 24. Oxford University Press, Oxford (2003)

  53. Shu, K., Marcolli, M.: Syntactic structures and code parameters. Math. Comput. Sci. 11(1), 79–90 (2017). https://doi.org/10.1007/s11786-017-0298-0

  54. Shu, K., Ortegaray, A., Berwick, R.C., Marcolli, M.: Phylogenetics of Indo-European Language Families Via an Algebro-Geometric Analysis of Their Syntactic Structures. CoRR (2017). arXiv:1712.01719

  55. Shu, K., Aziz, S., Huynh, V.-L., Warrick, D., Marcolli, M.: Syntactic phylogenetic trees. In: Kouneiher, J. (ed.) Foundations of Mathematics and Physics One Century After Hilbert, pp. 417–441. Springer, Cham (2018)

  56. Štefankovič, D., Vigoda, E.: Phylogeny of mixture models: robustness of maximum likelihood and non-identifiable distributions. J. Comput. Biol. 14(2), 156–189 (2007)

  57. Štefankovič, D., Vigoda, E.: Pitfalls of heterogeneous processes for phylogenetic reconstruction. Syst. Biol. 56(1), 113–124 (2007)

  58. Stumpf, P.S., Smith, R.C., Lenz, M., Schuppert, A., Müller, F.J., Babtie, A., Chan, T.E., Stumpf, M.P., Please, C.P., Howison, S.D., et al.: Stem cell differentiation as a non-Markov stochastic process. Cell Syst. 5(3), 268–282 (2017)

  59. Warnow, T.: Computational Phylogenetics. Cambridge University Press, Cambridge (2017)

  60. Zou, L., Susko, E., Field, C., Roger, A.J.: The parameters of the Barry and Hartigan general Markov model are statistically nonidentifiable. Syst. Biol. 60(6), 872–875 (2011). https://doi.org/10.1093/sysbio/syr034

Acknowledgements

We would like to thank the anonymous referees and Andrea Ceolin for thoughtful feedback on the previous version of this paper, which motivated this revision. The second author is partially supported by NSF grant DMS-2104330, by FQXi grant FQXi-RFP-1 804, and by Caltech’s Center for Evolutionary Science.

Author information

Corresponding author

Correspondence to Matilde Marcolli.

About this article

Cite this article

Gakkhar, S., Marcolli, M.: Syntactic Structures and the General Markov Models. Math. Comput. Sci. 18, 4 (2024). https://doi.org/10.1007/s11786-023-00575-6

  • DOI: https://doi.org/10.1007/s11786-023-00575-6
