Abstract
We study the phylogenetic signal present in syntactic information by considering the syntactic structures data from Longobardi (Linguist Anal 41:517–557, 2017), Collins (Syntactic structures of the world’s language: a cross-linguistic database. 27 September (2010), Colloquium: https://ling.yale.edu/syntactic-structures-worlds-language-cross-linguistic-database, 2010), Ceolin et al. (Front Psychol 11:2384, 2020) and Koopman (SSWL syntactic structures of the world’s languages: an open-ended database for the linguistic community and by the linguistic community. mit 50, 12. http://sswl.railsplayground.net/, 2011). Focusing first on the general Markov models, we explore how well the syntactic structures data conform to the hypotheses required by these models. We do this by comparing derived phylogenetic trees against trees agreed on by the linguistics community. We then interpret the methods of Ceolin et al. (2020) as an infinite sites evolutionary model and compare the consistency of the data with this alternative. The ideas and methods discussed in the present paper apply beyond the specific setting of syntactic structures, and can be used in other contexts when analyzing the consistency of data with hypothesized evolutionary models.
Code/Data availability
The code and data used in this paper are available at https://github.com/minorllama/syntactic_structures_phylogenetics.
Notes
The slight issue with negative determinants in Lake’s definition can be sidestepped using a constant scaling of the metric and moving it inside the logarithm.
A rate matrix is any matrix in which each row sums to zero, all off-diagonal entries are positive, and all diagonal entries are non-positive; each edge is thought of as a continuous-time Markov chain associated to the rate matrix.
In the updated SSWL, Hittite has “11 Adposition Noun Phrase” set to value 0 and Armenian (Western Armenian) has “Neg 01 Standard Negation is Particle that Precedes the Verb” set to value 1.
01 Subject Verb, 06 Subject Object Verb, 11 Adposition Noun Phrase, 13 Adjective Noun, 15 Numeral Noun, 17 Demonstrative Noun, 19 Possessor Noun, 21 Pronominal Possessor Noun, Neg 03 Standard Negation is Prefix, Neg 08 Standard Negation is Tone plus Other Modification, Neg 10 Standard Negation is Infix, Neg 12 Distinct Negation of identity, Neg 13 Distinct Negation of Existence, Neg 14 Distinct Negation of Location, Order N3 01 Demonstrative Adjective Noun, Neg 04 Standard Negation is Suffix, 12 Noun Phrase Adposition.
This convergence can be quantified with the Berry–Esseen theorem, see Durrett [13].
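The rate-matrix note above can be illustrated with a minimal sketch for the two-state (binary) case, which is the natural setting for binary syntactic features. The function names `is_rate_matrix` and `two_state_transition` and the rates `a`, `b` are illustrative assumptions, not taken from the paper or its repository. The sketch checks the defining conditions of a rate matrix as stated in the note, and uses the standard closed form of exp(tQ) for Q = [[-a, a], [b, -b]] to obtain the transition matrix along an edge of length t.

```python
import math


def is_rate_matrix(q, tol=1e-12):
    """Check the conditions in the note: each row sums to zero,
    off-diagonal entries are positive, diagonal entries are non-positive."""
    n = len(q)
    for i, row in enumerate(q):
        if len(row) != n or abs(sum(row)) > tol:
            return False
        for j, x in enumerate(row):
            if i == j and x > 0:
                return False
            if i != j and x <= 0:
                return False
    return True


def two_state_transition(a, b, t):
    """Transition matrix exp(tQ) for Q = [[-a, a], [b, -b]] with a, b > 0.

    Uses the closed form exp(tQ) = Pi + exp(-(a + b) t) (I - Pi),
    where every row of Pi is the stationary distribution (b, a) / (a + b).
    """
    s = a + b
    decay = math.exp(-s * t)
    pi0, pi1 = b / s, a / s
    return [
        [pi0 + pi1 * decay, pi1 - pi1 * decay],
        [pi0 - pi0 * decay, pi1 + pi0 * decay],
    ]


# Hypothetical rates, chosen only for illustration.
Q = [[-1.0, 1.0], [2.0, -2.0]]
P = two_state_transition(1.0, 2.0, 0.5)
```

Each row of `P` sums to one, `P` is the identity at t = 0, and for large t both rows approach the stationary distribution, which is the behaviour expected of a continuous-time Markov chain on an edge.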
References
Allman, E., Rhodes, J.: Phylogenetic ideals and varieties for general Markov models. Adv. Appl. Math. 40, 127–148 (2008)
Allman, E.S., Rhodes, J.A., Sullivant, S.: When do phylogenetic mixture models mimic other phylogenetic models? Syst. Biol. 61(6), 1049–1059 (2012)
Baker, M.C.: The Atoms of Language. Basic Books, New York (2002)
Biberauer, T.: The Limits of Syntactic Variation. John Benjamins Publishing, Amsterdam (2008)
Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S.J., Alekseyenko, A.V., Drummond, A.J., Gray, R.D., Suchard, M.A., Atkinson, Q.D.: Mapping the origins and expansion of the Indo-European language family. Science 337(6097), 957–960 (2012)
Bouckaert, R., Heled, J., Kühnert, D., Vaughan, T., Wu, C.H., Xie, D., Suchard, M.A., Rambaut, A., Drummond, A.J.: BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 10(4), e1003537 (2014)
Ceolin, A., Guardiano, C., Irimia, M.A., Longobardi, G.: Formal syntax and deep history. Front. Psychol. 11, 2384 (2020)
Ceolin, A., Guardiano, C., Longobardi, G., Irimia, M.A., Bortolussi, L., Sgarro, A.: At the boundaries of syntactic prehistory. Philos. Trans. R. Soc. B 376(1824), 20200197 (2021)
Chomsky, N.: Lectures on Government and Binding. Walter de Gruyter, Basel (1981)
Chomsky, N., Lasnik, H.: The theory of principles and parameters. In: Jacobs, J., von Stechow, A., Sternefeld, W., Vennemann, T. (eds.) Syntax: An International Handbook of Contemporary Research, pp. 506–569. Walter de Gruyter, Basel (1993)
Collins, C.: Syntactic Structures of the World’s Language: A Cross-linguistic Database. 2010. 27 September (2010), Colloquium: https://ling.yale.edu/syntactic-structures-worlds-language-cross-linguistic-database
Dryer, M.S., Haspelmath, M.: WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig (2013). https://wals.info/
Durrett, R.: Probability: Theory and Examples, vol. 49. Cambridge University Press, Cambridge (2019)
Eriksson, N.K.: Algebraic Combinatorics for Computational Biology. PhD thesis, University of California, Berkeley (2006)
Felsenstein, J.: Inferring Phylogenies, vol. 2. Sinauer Associates, Sunderland (2004)
Gascuel, O., Steel, M.: Neighbor-joining revealed. Mol. Biol. Evol. 23(11), 1997–2000 (2006)
Gray, R.D., Drummond, A.J., Greenhill, S.J.: Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science 323(5913), 479–483 (2009)
Guardiano, C., Michelioudakis, D., Ceolin, A., Irimia, M., Longobardi, G., Radkevich, N., Sitaridou, I., Silvestri, G.: South by southeast. A syntactic approach to Greek and Romance microvariation. L’Ital. Dialett. 77, 95–166 (2016)
Hoffmann, K., Bouckaert, R., Greenhill, S.J., Kühnert, D.: Bayesian phylogenetic analysis of linguistic data using BEAST. J. Lang. Evol. 6(2), 119–135 (2021)
Karimi, S., Piattelli-Palmarini, M.: Special issue on parameters. Linguist. Anal. 41, 3–4 (2017)
Kazakov, D.L., Cordoni, G., Algahtani, E., Ceolin, A., Irimia, M.A., Kim, S.S., Michelioudakis, D., Radkevich, N., Guardiano, C., Longobardi, G.: Learning implicational models of universal grammar parameters. In: Cuskley, C., Flaherty, M., Little, H., McCrohon, L., Ravignani, A., Verhoef, T. (eds.) The Evolution of Language: Proceedings of the 12th International Conference (EVOLANGXII). NCU Press (2018). https://doi.org/10.12775/3991-1.048. http://evolang.org/torun/proceedings/papertemplate.html?p=176
Koopman, H.: SSWL Syntactic Structures of the World’s Languages: An Open-ended Database for the Linguistic Community and by the Linguistic Community. mit 50, 12 (2011). http://sswl.railsplayground.net/
Lake, J.A.: Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl. Acad. Sci. 91(4), 1455–1459 (1994)
Longobardi, G.: Convergence in parametric phylogenies. Homoplasy or principled explanation? In: Galves, C., Cyrino, S., Lopes, R., Sandalo, F., Avelar, J. (eds.) Parameter Theory and Linguistic Change. Oxford University Press, Oxford (2012). https://doi.org/10.1093/acprof:oso/9780199659203.001.0001
Longobardi, G.: Convergence in parametric phylogenies: homoplasy or principled explanation? In: Galves, C., Cyrino, S., Lopes, R., Sandalo, F., Avelar, J. (eds.) Parameter Theory and Language Change, pp. 304–319. Oxford University Press, Oxford (2012)
Longobardi, G.: Principles, parameters, and schemata: a constructivist UG. Linguist. Anal. 41(3–4), 517–556 (2017)
Longobardi, G.: Principles, parameters, and schemata. A constructivist UG. Linguist. Anal. 41, 517–557 (2017)
Longobardi, G., Guardiano, C.: Evidence for syntax as a signal of historical relatedness. Lingua 119, 1679–1706 (2009)
Longobardi, G., Guardiano, C., Silvestri, G., Boattini, A., Ceolin, A.: Toward a syntactic phylogeny of modern Indo-European languages. J. Hist. Linguist. 3(1), 122–152 (2013)
Ma, J., Ratan, A., Raney, B.J., Suh, B.B., Miller, W., Haussler, D.: The infinite sites model of genome evolution. Proc. Natl. Acad. Sci. 105(38), 14254–14261 (2008)
Marcolli, M.: Syntactic parameters and a coding theory perspective on entropy and complexity of language families. Entropy 18(4), 110 (2016). https://doi.org/10.3390/e18040110
Matsen, F.A., Steel, M.: Phylogenetic mixtures on a single tree can mimic a tree of another topology. Syst. Biol. 56(5), 767–775 (2007)
Murawaki, Y.: Analyzing correlated evolution of multiple features using latent representations. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4371–4382 (2018)
Nicholls, G.K., Gray, R.D.: Dated ancestral trees from binary trait data and their application to the diversification of languages. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 70, 545–566 (2008)
Niyogi, P.: The Computational Nature of Language Learning and Evolution, Volume 43 of Current Studies in Linguistics. MIT Press, Cambridge (2006)
Niyogi, P., Berwick, R.C.: A dynamical systems model for language change. Complex Syst. 11(3), 161–204 (1997)
Nurbakova, D., Rusakov, S., Alexandrov, V.: Quantifying uncertainty in phylogenetic studies of the Slavonic languages. Procedia Comput. Sci. 18, 2269–2277 (2013)
O’Donnell, R.: Analysis of Boolean Functions. Cambridge University Press, Cambridge (2014)
Ortegaray, A., Berwick, R.C., Marcolli, M.: Heat Kernel Analysis of Syntactic Structures. CoRR (2018). arXiv:1803.09832
Pachter, L., Sturmfels, B.: Algebraic Statistics for Computational Biology, vol. 13. Cambridge University Press, Cambridge (2005)
Pachter, L., Sturmfels, B.: The mathematics of phylogenomics. SIAM Rev. 49(1), 3–31 (2007)
Pagel, M., Meade, A.: A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. 53(4), 571–581 (2004)
Park, J.J., Boettcher, R., Zhao, A., Mun, A., Yuh, K., Kumar, V., Marcolli, M.: Prevalence and recoverability of syntactic parameters in sparse distributed memories. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Structures of Information 2017. Lecture Notes in Computer Science, vol. 10589, pp. 1–8. Springer, Cham (2017)
Pereltsvaig, A., Lewis, M.W.: The Indo-European Controversy: Facts and Fallacies in Historical Linguistics. Cambridge University Press, Cambridge (2015)
Piispanen, P.: The Uralic–Yukaghiric connection revisited: sound correspondences of geminate clusters. Suom.-Ugr. Seuran Aikakauskirja 2013(94), 165–197 (2013). https://doi.org/10.33340/susa.82515
Port, A., Gheorghita, I., Guth, D., Clark, J.M., Liang, C., Dasu, S., Marcolli, M.: Persistent topology of syntax. Math. Comput. Sci. 12(1), 33–50 (2018). https://doi.org/10.1007/s11786-017-0329-x
Port, A., Karidi, T., Marcolli, M.: Topological Analysis of Syntactic Structures. CoRR (2019). arXiv:1903.05181
Rexová, K., Frynta, D., Zrzavý, J.: Cladistic analysis of languages: Indo-European classification based on lexicostatistical data. Cladistics 19(2), 120–127 (2003)
Ringe, D., Warnow, T., Taylor, A.: Indo-European and computational cladistics. Trans. Philol. Soc. 100, 59–129 (2002)
Rizzi, L.: On the format and locus of parameters: the role of morphosyntactic features. Linguist. Anal. 41(3–4), 159–191 (2017)
Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4), 406–425 (1987)
Semple, C., Steel, M.: Phylogenetics, vol. 24. Oxford University Press, Oxford (2003)
Shu, K., Marcolli, M.: Syntactic structures and code parameters. Math. Comput. Sci. 11(1), 79–90 (2017). https://doi.org/10.1007/s11786-017-0298-0
Shu, K., Ortegaray, A., Berwick, R.C., Marcolli, M.: Phylogenetics of Indo-European Language Families Via an Algebro-Geometric Analysis of Their Syntactic Structures. CoRR (2017). arXiv:1712.01719
Shu, K., Aziz, S., Huynh, V.-L., Warrick, D., Marcolli, M.: Syntactic phylogenetic trees. In: Kouneiher, J. (ed.) Foundations of Mathematics and Physics One Century After Hilbert, pp. 417–441. Springer, Cham (2018)
Štefankovič, D., Vigoda, E.: Phylogeny of mixture models: robustness of maximum likelihood and non-identifiable distributions. J. Comput. Biol. 14(2), 156–189 (2007)
Štefankovič, D., Vigoda, E.: Pitfalls of heterogeneous processes for phylogenetic reconstruction. Syst. Biol. 56(1), 113–124 (2007)
Stumpf, P.S., Smith, R.C., Lenz, M., Schuppert, A., Müller, F.J., Babtie, A., Chan, T.E., Stumpf, M.P., Please, C.P., Howison, S.D., et al.: Stem cell differentiation as a non-Markov stochastic process. Cell Syst. 5(3), 268–282 (2017)
Warnow, T.: Computational Phylogenetics. Cambridge University Press, Cambridge (2017)
Zou, L., Susko, E., Field, C., Roger, A.J.: The parameters of the Barry and Hartigan general Markov model are statistically nonidentifiable. Syst. Biol. 60(6), 872–875 (2011). https://doi.org/10.1093/sysbio/syr034
Acknowledgements
We would like to thank the anonymous referees and Andrea Ceolin for their thoughtful feedback on the previous version of this paper, which motivated this revision. The second author is partially supported by NSF grant DMS-2104330, by FQXi grant FQXi-RFP-1 804, and by Caltech’s Center for Evolutionary Science.
About this article
Cite this article
Gakkhar, S., Marcolli, M. Syntactic Structures and the General Markov Models. Math.Comput.Sci. 18, 4 (2024). https://doi.org/10.1007/s11786-023-00575-6
DOI: https://doi.org/10.1007/s11786-023-00575-6