Abstract
In light of recent controversies surrounding the use of computational methods for reconstructing phylogenetic trees of language families (especially the Indo-European family), an approach based on syntactic information, complementing other linguistic methods, has emerged as a promising possibility, developed largely in recent years in Longobardi’s Parametric Comparison Method. In this paper we identify several serious problems that arise when syntactic data from the SSWL database are used for computational phylogenetic reconstruction. We show that the most naive approach fails to produce reliable linguistic phylogenetic trees. We identify some of the sources of the observed problems and discuss how they may be, at least partly, corrected by using additional information, such as a prior subdivision into language families and subfamilies, and by making better use of the information about ancient languages. We also describe how phylogenetic algebraic geometry can help estimate to what extent the probability distribution at the leaves of the phylogenetic tree obtained from the SSWL data can be considered reliable, by testing it on phylogenetic trees established by other forms of linguistic analysis. In simple examples, we find that, after restricting to smaller language subfamilies and considering only those SSWL parameters that are fully mapped for the whole subfamily, the SSWL data match reliable phylogenetic trees extremely well, according to the evaluation of phylogenetic invariants. This is a promising sign for the use of SSWL data in linguistic phylogenetics. We also argue that dependencies and nontrivial geometry/topology in the space of syntactic parameters need to be taken into consideration in phylogenetic reconstructions based on syntactic data. A more detailed analysis of syntactic phylogenetic trees and their algebro-geometric invariants will appear elsewhere [33].
References
E. Allman, J. Rhodes, Phylogenetic ideals and varieties for general Markov models. Adv. Appl. Math. 40, 127–148 (2008)
M. Baker, The Atoms of Language (Basic Books, USA, 2001)
F. Barbançon, S.N. Evans, L. Nakhleh, D. Ringe, T. Warnow, An experimental study comparing linguistic phylogenetic reconstruction methods. Diachronica 30(2), 143–170 (2013)
C. Bocci, Topics in phylogenetic algebraic geometry. Expo. Math. 25, 235–259 (2007)
A. Bouchard-Côté, D. Hall, T.L. Griffiths, D. Klein, Automated reconstruction of ancient languages using probabilistic models of sound change. Proc. Natl. Acad. Sci. (PNAS) 110(11), 4224–4229 (2013)
R. Bouckaert, P. Lemey, M. Dunn, S.J. Greenhill, A.V. Alekseyenko, A.J. Drummond, R.D. Gray, M.A. Suchard, Q.D. Atkinson, Mapping the origins and expansion of the Indo-European language family. Science 337, 957–960 (2012)
N. Chomsky, Lectures on Government and Binding (Foris Publications, Dordrecht, 1982)
N. Chomsky, H. Lasnik, The theory of Principles and Parameters, in “Syntax: An International Handbook of Contemporary Research” (de Gruyter, 1993), pp. 506–569
M. DeGiorgio, J.H. Degnan, Robustness to divergence time underestimation when inferring species trees from estimated gene trees. Syst. Biol. 63(1), 66–82 (2014)
A. Delmestri, N. Cristianini, Linguistic phylogenetic inference by PAM-like matrices. J. Quant. Linguist. 19, 95–120 (2012)
N. Eriksson, K. Ranestad, B. Sturmfels, S. Sullivant, Phylogenetic algebraic geometry, in “Projective Varieties with Unexpected Properties” (Walter de Gruyter, 2005), pp. 237–255
P. Forster, C. Renfrew, Phylogenetic Methods and the Prehistory of Language (McDonald Institute Monographs, 2006)
C. Galves (ed.), Parameter Theory and Linguistic Change (Oxford University Press, Oxford, 2012)
D. Gusfield, ReCombinatorics (MIT Press, Cambridge, 2014)
D.H. Huson, R. Rupp, C. Scornavacca, Phylogenetic Networks: Concepts, Algorithms and Applications (Cambridge University Press, Cambridge, 2010)
P. Kanerva, Sparse Distributed Memory (MIT Press, Cambridge, 1988)
G. Longobardi, Methods in parametric linguistics and cognitive history. Linguist. Var. Yearb. 3, 101–138 (2003)
G. Longobardi, L. Bortolussi, M.A. Irimia, N. Radkevich, A. Ceolin, C. Guadagno, D. Michelioudakis, A. Sgarro, Mathematical modeling of grammatical diversity supports the historical reality of formal syntax. in “Proceedings of the Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics” (2016)
G. Longobardi, S. Ghirotto, C. Guardiano, F. Tassi, A. Benazzo, A. Ceolin, G. Barbujani, Across language families: genome diversity mirrors linguistic variation within Europe. Am. J. Phys. Anthropol. 157(4), 630–640 (2015)
G. Longobardi, C. Guardiano, Evidence for syntax as a signal of historical relatedness. Lingua 119, 1679–1706 (2009)
G. Longobardi, C. Guardiano, G. Silvestri, A. Boattini, A. Ceolin, Towards a syntactic phylogeny of modern Indo-European languages. J. Hist. Linguist. 3(1), 122–152 (2013)
M. Marcolli, Syntactic parameters and a coding theory perspective on entropy and complexity of language families. Entropy 18, 110 [17 pages] (2016)
L. Nakhleh, D. Ringe, T. Warnow, Perfect phylogenetic networks: a new methodology for reconstructing the evolutionary history of natural languages. Language 81(2), 382–420 (2005)
L. Pachter, B. Sturmfels, The mathematics of phylogenomics. SIAM Rev. 49(1), 3–31 (2007)
L. Pachter, B. Sturmfels, Tropical geometry of statistical models. Proc. Natl. Acad. Sci. (PNAS) 101(46), 16132–16137 (2004)
J.J. Park, R. Boettcher, A. Zhao, A. Mun, K. Yuh, V. Kumar, M. Marcolli, Prevalence and recoverability of syntactic parameters in sparse distributed memories. in Geometric Science of Information. Third International Conference GSI 2017. Lecture Notes in Computer Science, vol. 10589 (Springer, 2017), pp. 265–272
A. Pereltsvaig, M.W. Lewis, The Indo-European Controversy: Facts and Fallacies in Historical Linguistics (Cambridge University Press, Cambridge, 2015)
F. Petroni, M. Serva, Language distance and tree reconstruction. J. Stat. Mech. 2008, P08012 [16 pages] (2008)
A. Port, I. Gheorghita, D. Guth, J.M. Clark, C. Liang, S. Dasu, M. Marcolli, Persistent Topology of Syntax. Math. Comput. Sci. 12(1), 33–50 (2018)
L. Rizzi, On the Format and Locus of Parameters: The Role of Morphosyntactic Features, preprint, 2016
N. Saitou, M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4), 406–425 (1987)
K. Shu, M. Marcolli, Syntactic Structures and Code Parameters. Math. Comput. Sci. 11(1), 79–90 (2017)
K. Shu, A. Ortegaray, R. C. Berwick, M. Marcolli, Phylogenetics of Indo-European Language families via an Algebro-Geometric Analysis of their Syntactic Structures. arXiv:1712.01719
K. Siva, J. Tao, M. Marcolli, Syntactic Parameters and Spin Glass Models of Language Change. Linguist. Anal. 41(3–4), 559–608 (2017)
B. Sturmfels, S. Sullivant, Toric ideals of phylogenetic invariants. J. Comput. Biol. 12(2), 204–228 (2005)
T. Warnow, S.N. Evans, D. Ringe, L. Nakhleh, Stochastic Models of Language Evolution and an Application to the Indo-European Family of Languages. Available at http://www.stat.berkeley.edu/users/evans/659.pdf
SSWL Database of Syntactic Parameters: http://sswl.railsplayground.net/
Acknowledgements
The first author is supported by a Summer Undergraduate Research Fellowship at Caltech. This work was performed in part within the activities of the last author’s Mathematical and Computational Linguistics lab and CS101/Ma191 class at Caltech. The last author is partially supported by NSF grants DMS-1201512, PHY-1205440, and DMS-1707882.
Appendix: The SSWL Parameters of the Latin languages
The phylogenetic invariants for the tree of Latin languages of Fig. 11 are evaluated at the probability distribution \(p_{i_1,i_2,i_3,i_4,i_5}\) at the leaves, based on the SSWL parameters for this group of languages. There are 106 parameters in the SSWL database that are completely mapped for all five of these languages. We have excluded from the list all SSWL parameters that are mapped for some but not all of the languages in this group. With the notation \(\ell _1=\) French, \(\ell _2=\) Italian, \(\ell _3 =\) Latin, \(\ell _4=\) Spanish, and \(\ell _5=\) Portuguese, the syntactic parameters are given by the following list. The column on the left lists the SSWL parameters P as labeled in the database [37].
One can see by inspecting the different groups of parameters in this list that several parameters within the “same group” tend to behave in the same way (e.g. all the Neg parameters), or in a more highly correlated way than parameters taken across groups. This is consistent with the more general dependencies observed through the Kanerva networks method in [26]. Thus, in order to better fit this set of binary variables to the hypothesis of independent identically distributed variables in Markov processes, it may be better to select a subset of the SSWL parameters that cuts across the various groups of more closely correlated variables. We will discuss this aspect in more detail elsewhere.
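One simple way to look for parameters that behave alike across the five languages is to compare their binary vectors pairwise. The following is a minimal sketch, assuming each parameter is stored as a 5-tuple of 0/1 values in the order \(\ell_1,\ldots,\ell_5\). The labels `A1`, `A2`, `B1`, `B2` and their values are hypothetical toy data, not SSWL entries, and the agreement fraction is only a crude stand-in for a proper dependency measure such as the one used in [26].

```python
from itertools import combinations


def agreement(u, v):
    """Fraction of languages on which two binary parameter vectors agree."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a == b for a, b in zip(u, v)) / len(u)


def similar_pairs(params, threshold=1.0):
    """Return the pairs of parameter labels whose agreement is >= threshold."""
    return [
        (name_u, name_v)
        for (name_u, u), (name_v, v) in combinations(params.items(), 2)
        if agreement(u, v) >= threshold
    ]


# Hypothetical toy parameters (NOT taken from the SSWL database).
toy_params = {
    "A1": (0, 0, 1, 0, 0),
    "A2": (0, 0, 1, 0, 0),
    "B1": (1, 1, 1, 1, 1),
    "B2": (1, 0, 1, 1, 1),
}

identical = similar_pairs(toy_params)        # pairs that agree on every language
close = similar_pairs(toy_params, 0.8)       # pairs that agree on at least 4 of 5
```

Keeping one representative from each set of identical or near-identical vectors is one possible way to obtain a subset that cuts across correlated groups; whether that is the right selection criterion is left open here.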
The probability \(p_{i_1,i_2,i_3,i_4,i_5}\) is then computed by counting the frequencies of occurrence of binary vectors \([i_1,i_2,i_3,i_4,i_5] \in \{0,1\}^5\) among the 106 vectors of SSWL parameters above. The only nonzero frequencies are
Note how these frequencies confirm some well-known facts about the Latin languages. Syntactic parameters (as recorded in SSWL) are very likely to have remained the same across all five languages in the family, with a higher probability of a feature not allowed in Latin remaining not allowed in the other languages (31/106) than of a feature allowed in Latin remaining allowed in the other languages (21/106). It is also very likely that a feature is the same in all the modern languages but different from Latin, with a much higher incidence of a feature allowed in Latin becoming disallowed in all the other languages (23/106) than the other way around (8/106). Among the remaining possibilities, we see cases where French has a feature allowed that is disallowed in the other languages (5/106), or disallowed where the others allow it (3/106), and cases where Latin and Portuguese have the same feature allowed, which is disallowed in the other languages (3/106). All other nonzero entries have two or fewer occurrences. The resulting matrices for the edge flattenings of the tree of Fig. 11 are then as computed in Sect. 5.
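The counting step and the passage from the leaf distribution to a flattening matrix can be sketched as follows. This is a minimal illustration under stated assumptions: the input vectors are hypothetical toy data rather than the 106 SSWL vectors, the function names are ours, and the split used in the example (Latin against the four modern languages) is chosen only for illustration and is not claimed to be an edge of the tree of Fig. 11.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def leaf_distribution(vectors):
    """Empirical distribution over {0,1}^n: frequency of each binary pattern."""
    counts = Counter(tuple(v) for v in vectors)
    total = len(vectors)
    return {pattern: Fraction(c, total) for pattern, c in counts.items()}


def flattening(dist, side_a, n):
    """Flatten the n-leaf distribution along the split side_a | complement.

    Rows are indexed by the states of the leaves in side_a (0-based indices),
    columns by the states of the remaining leaves.
    """
    side_b = [i for i in range(n) if i not in side_a]
    rows = list(product((0, 1), repeat=len(side_a)))
    cols = list(product((0, 1), repeat=len(side_b)))
    matrix = []
    for r in rows:
        row = []
        for c in cols:
            pattern = [0] * n
            for i, state in zip(side_a, r):
                pattern[i] = state
            for i, state in zip(side_b, c):
                pattern[i] = state
            row.append(dist.get(tuple(pattern), Fraction(0)))
        matrix.append(row)
    return matrix


# Hypothetical toy data (NOT the SSWL vectors); order: French, Italian,
# Latin, Spanish, Portuguese.
toy_vectors = (
    [(0, 0, 0, 0, 0)] * 3
    + [(1, 1, 1, 1, 1)] * 2
    + [(0, 0, 1, 0, 0)]
)
dist = leaf_distribution(toy_vectors)

# Illustrative split: Latin (index 2) against the four modern languages.
latin_vs_rest = flattening(dist, [2], 5)   # a 2 x 16 matrix
```

Exact fractions are used so that the entries can be compared directly with ratios of the form k/106 when the same functions are applied to the actual parameter vectors.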
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Shu, K., Aziz, S., Huynh, VL., Warrick, D., Marcolli, M. (2018). Syntactic Phylogenetic Trees. In: Kouneiher, J. (eds) Foundations of Mathematics and Physics One Century After Hilbert. Springer, Cham. https://doi.org/10.1007/978-3-319-64813-2_14
DOI: https://doi.org/10.1007/978-3-319-64813-2_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64812-5
Online ISBN: 978-3-319-64813-2
eBook Packages: Physics and Astronomy, Physics and Astronomy (R0)