Syntactic Phylogenetic Trees

Shu, Kevin; Aziz, Sharjeel; Huynh, Vy-Luan; Warrick, David; Marcolli, Matilde

doi:10.1007/978-3-319-64813-2_14

Abstract

Computational methods for reconstructing phylogenetic trees of language families, especially the Indo-European family, have recently been the subject of controversy. An approach based on syntactic information, complementing other linguistic methods, has emerged as a promising alternative, largely developed in recent years in Longobardi’s Parametric Comparison Method. In this paper we identify several serious problems that arise when syntactic data from the SSWL database are used for computational phylogenetic reconstruction. We show that the most naive approach fails to produce reliable linguistic phylogenetic trees. We identify some of the sources of the observed problems and discuss how they may be corrected, at least in part, by using additional information, such as a prior subdivision into language families and subfamilies, and by making better use of the information about ancient languages. We also describe how phylogenetic algebraic geometry can help estimate how reliable the probability distribution at the leaves of a phylogenetic tree obtained from the SSWL data is, by testing it on phylogenetic trees established by other forms of linguistic analysis. In simple examples we find that, after restricting to smaller language subfamilies and considering only those SSWL parameters that are fully mapped for the whole subfamily, the SSWL data match reliable phylogenetic trees extremely well, as measured by the evaluation of phylogenetic invariants. This is a promising sign for the use of SSWL data in linguistic phylogenetics. We also argue that dependencies and nontrivial geometry/topology in the space of syntactic parameters would have to be taken into account in phylogenetic reconstructions based on syntactic data. A more detailed analysis of syntactic phylogenetic trees and their algebro-geometric invariants will appear elsewhere [33].

References

[1] E. Allman, J. Rhodes, Phylogenetic ideals and varieties for general Markov models. Adv. Appl. Math. 40, 127–148 (2008)
[2] M. Baker, The Atoms of Language (Basic Books, USA, 2001)
[3] F. Barbançon, S.N. Evans, L. Nakhleh, D. Ringe, T. Warnow, An experimental study comparing linguistic phylogenetic reconstruction methods. Diachronica 30(2), 143–170 (2013)
[4] C. Bocci, Topics in phylogenetic algebraic geometry. Expo. Math. 25, 235–259 (2007)
[5] A. Bouchard-Côté, D. Hall, T.L. Griffiths, D. Klein, Automated reconstruction of ancient languages using probabilistic models of sound change. Proc. Natl. Acad. Sci. (PNAS) 110(11), 4224–4229 (2013)
[6] R. Bouckaert, P. Lemey, M. Dunn, S.J. Greenhill, A.V. Alekseyenko, A.J. Drummond, R.D. Gray, M.A. Suchard, Q.D. Atkinson, Mapping the origins and expansion of the Indo-European language family. Science 337, 957–960 (2012)
[7] N. Chomsky, Lectures on Government and Binding (Foris Publications, Dordrecht, 1982)
[8] N. Chomsky, H. Lasnik, The theory of Principles and Parameters, in “Syntax: An International Handbook of Contemporary Research” (de Gruyter, 1993), pp. 506–569
[9] M. DeGiorgio, J.H. Degnan, Robustness to divergence time underestimation when inferring species trees from estimated gene trees. Syst. Biol. 63(1), 66–82 (2014)
[10] A. Delmestri, N. Cristianini, Linguistic phylogenetic inference by PAM-like matrices. J. Quant. Linguist. 19, 95–120 (2012)
[11] N. Eriksson, K. Ranestad, B. Sturmfels, S. Sullivant, Phylogenetic algebraic geometry, in “Projective Varieties with Unexpected Properties” (Walter de Gruyter, 2005), pp. 237–255
[12] P. Forster, C. Renfrew, Phylogenetic Methods and the Prehistory of Language (McDonald Institute Monographs, 2006)
[13] C. Galves (ed.), Parameter Theory and Linguistic Change (Oxford University Press, Oxford, 2012)
[14] D. Gusfield, ReCombinatorics (MIT Press, Cambridge, 2014)
[15] D.H. Huson, R. Rupp, C. Scornavacca, Phylogenetic Networks: Concepts, Algorithms and Applications (Cambridge University Press, 2010)
[16] P. Kanerva, Sparse Distributed Memory (MIT Press, Cambridge, 1988)
[17] G. Longobardi, Methods in parametric linguistics and cognitive history. Linguist. Var. Yearb. 3, 101–138 (2003)
[18] G. Longobardi, L. Bortolussi, M.A. Irimia, N. Radkevich, A. Ceolin, C. Guardiano, D. Michelioudakis, A. Sgarro, Mathematical modeling of grammatical diversity supports the historical reality of formal syntax, in “Proceedings of the Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics” (2016)
[19] G. Longobardi, S. Ghirotto, C. Guardiano, F. Tassi, A. Benazzo, A. Ceolin, G. Barbujani, Across language families: genome diversity mirrors linguistic variation within Europe. Am. J. Phys. Anthropol. 157(4), 630–640 (2015)
[20] G. Longobardi, C. Guardiano, Evidence for syntax as a signal of historical relatedness. Lingua 119, 1679–1706 (2009)
[21] G. Longobardi, C. Guardiano, G. Silvestri, A. Boattini, A. Ceolin, Towards a syntactic phylogeny of modern Indo-European languages. J. Hist. Linguist. 3(1), 122–152 (2013)
[22] M. Marcolli, Syntactic parameters and a coding theory perspective on entropy and complexity of language families. Entropy 18, 110 [17 pages] (2016)
[23] L. Nakhleh, D. Ringe, T. Warnow, Perfect phylogenetic networks: a new methodology for reconstructing the evolutionary history of natural languages. Language 81(2), 382–420 (2005)
[24] L. Pachter, B. Sturmfels, The mathematics of phylogenomics. SIAM Rev. 49(1), 3–31 (2007)
[25] L. Pachter, B. Sturmfels, Tropical geometry of statistical models. Proc. Natl. Acad. Sci. (PNAS) 101(46), 16132–16137 (2004)
[26] J.J. Park, R. Boettcher, A. Zhao, A. Mun, K. Yuh, V. Kumar, M. Marcolli, Prevalence and recoverability of syntactic parameters in sparse distributed memories, in Geometric Science of Information. Third International Conference GSI 2017. Lecture Notes in Computer Science, vol. 10589 (Springer, 2017), pp. 265–272
[27] A. Pereltsvaig, M.W. Lewis, The Indo-European Controversy: Facts and Fallacies in Historical Linguistics (Cambridge University Press, Cambridge, 2015)
[28] F. Petroni, M. Serva, Language distance and tree reconstruction. J. Stat. Mech. 2008, P08012 [16 pages] (2008)
[29] A. Port, I. Gheorghita, D. Guth, J.M. Clark, C. Liang, S. Dasu, M. Marcolli, Persistent Topology of Syntax. Math. Comput. Sci. 12(1), 33–50 (2018)
[30] L. Rizzi, On the Format and Locus of Parameters: The Role of Morphosyntactic Features, preprint, 2016
[31] N. Saitou, M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4), 406–425 (1987)
[32] K. Shu, M. Marcolli, Syntactic Structures and Code Parameters. Math. Comput. Sci. 11(1), 79–90 (2017)
[33] K. Shu, A. Ortegaray, R.C. Berwick, M. Marcolli, Phylogenetics of Indo-European Language families via an Algebro-Geometric Analysis of their Syntactic Structures. arXiv:1712.01719
[34] K. Siva, J. Tao, M. Marcolli, Syntactic Parameters and Spin Glass Models of Language Change. Linguist. Anal. 41(3–4), 559–608 (2017)
[35] B. Sturmfels, S. Sullivant, Toric ideals of phylogenetic invariants. J. Comput. Biol. 12(2), 204–228 (2005)
[36] T. Warnow, S.N. Evans, D. Ringe, L. Nakhleh, Stochastic Models of Language Evolution and an Application to the Indo-European Family of Languages. Available at http://www.stat.berkeley.edu/users/evans/659.pdf
[37] SSWL Database of Syntactic Parameters: http://sswl.railsplayground.net/

Acknowledgements

The first author is supported by a Summer Undergraduate Research Fellowship at Caltech. Part of this work was carried out within the activities of the last author’s Mathematical and Computational Linguistics lab and CS101/Ma191 class at Caltech. The last author is partially supported by NSF grants DMS-1201512, PHY-1205440, and DMS-1707882.

Author information

Authors and Affiliations

Division of Physics, Mathematics, and Astronomy, California Institute of Technology, 1200 E. California Blvd, Pasadena, CA, 91125, USA
Kevin Shu, Sharjeel Aziz, Vy-Luan Huynh, David Warrick & Matilde Marcolli

Corresponding author

Correspondence to Matilde Marcolli.

Editor information

Editors and Affiliations

Nice and Sophia Antipolis University, Nice, France
Joseph Kouneiher

Appendix: The SSWL Parameters of the Latin languages

The phylogenetic invariants for the tree of Latin languages of Fig. 11 are evaluated at the probability distribution $p_{i_1,i_2,i_3,i_4,i_5}$ at the leaves, based on the SSWL parameters for this group of languages. There are 106 parameters in the SSWL database that are completely mapped for all five of these languages. We have excluded from the list all SSWL parameters that are mapped for only some of the languages in this group. With the notation $\ell _1=$ French, $\ell _2=$ Italian, $\ell _3 =$ Latin, $\ell _4=$ Spanish, and $\ell _5=$ Portuguese, the syntactic parameters are given by the following list. The column on the left lists the SSWL parameters P as labeled in the database [37].

Inspecting the different groups of parameters in this list, one sees that parameters within the same group (e.g. all the Neg parameters) tend to behave in the same way, or at least in a more highly correlated way than parameters from different groups. This is consistent with the more general dependencies observed through the Kanerva networks method in [26]. Thus, in order to better fit this set of binary variables to the hypothesis of independent identically distributed variables in Markov processes, it may be better to select a subset of the SSWL parameters that cuts across the various groups of more closely correlated variables. We will discuss this aspect in more detail elsewhere.

The probability $p_{i_1,i_2,i_3,i_4,i_5}$ is then computed by counting the frequencies of occurrence of the binary vectors $[i_1,i_2,i_3,i_4,i_5] \in \{0,1\}^5$ among the 106 vectors of SSWL parameters above. The only nonzero frequencies are

$$ p_{0,0,0,0,0}=\frac{31}{106}, \ \ \ p_{0,0,0,0,1}=\frac{1}{106}, \ \ \ p_{0,0,0,1,0}=\frac{1}{106}, \ \ \ p_{0,0,1,0,0}=\frac{23}{106}, $$

$$ p_{0,0,1,0,1}= \frac{3}{106}, \ \ \ p_{0,0,1,1,1}=\frac{2}{106}, \ \ \ p_{0,1,0,0,0}=\frac{1}{106}, \ \ \ p_{0,1,0,1,1}=\frac{1}{106}, $$

$$ p_{0,1,1,0,1}= \frac{1}{106}, \ \ \ p_{0,1,1,1,1}=\frac{3}{106}, \ \ \ p_{1,0,0,0,0} = \frac{5}{106}, \ \ \ p_{1,0,0,1,0}=\frac{2}{106}, $$

$$ p_{1,1,0,1,0}=\frac{1}{106}, \ \ \ p_{1,1,0,0,0}=\frac{2}{106}, \ \ \ p_{1,1,0,1,1}=\frac{8}{106}, \ \ \ p_{1,1,1,1,1}=\frac{21}{106}. $$
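The counting step described above can be sketched in Python. This is only an illustration: the parameter table is not reproduced here, so the sketch rebuilds a list of 106 vectors from the reported counts, and the names `empirical_distribution` and `REPORTED_COUNTS` are hypothetical, not part of SSWL or of the original computation.

```python
from collections import Counter
from fractions import Fraction


def empirical_distribution(vectors):
    """Map each observed binary vector to its relative frequency."""
    counts = Counter(tuple(v) for v in vectors)
    total = sum(counts.values())
    return {pattern: Fraction(n, total) for pattern, n in counts.items()}


# Counts reported in the text, keyed by
# (French, Italian, Latin, Spanish, Portuguese).
REPORTED_COUNTS = {
    (0, 0, 0, 0, 0): 31, (0, 0, 0, 0, 1): 1, (0, 0, 0, 1, 0): 1, (0, 0, 1, 0, 0): 23,
    (0, 0, 1, 0, 1): 3, (0, 0, 1, 1, 1): 2, (0, 1, 0, 0, 0): 1, (0, 1, 0, 1, 1): 1,
    (0, 1, 1, 0, 1): 1, (0, 1, 1, 1, 1): 3, (1, 0, 0, 0, 0): 5, (1, 0, 0, 1, 0): 2,
    (1, 1, 0, 1, 0): 1, (1, 1, 0, 0, 0): 2, (1, 1, 0, 1, 1): 8, (1, 1, 1, 1, 1): 21,
}

# Stand-in for the parameter table: one vector per parameter, with the
# multiplicities above (the order of the vectors does not matter).
vectors = [pattern for pattern, n in REPORTED_COUNTS.items() for _ in range(n)]

p = empirical_distribution(vectors)
```

Using exact fractions rather than floats keeps the check that the frequencies sum to one free of rounding error.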

Note how these frequencies confirm some well-known facts about the Latin languages. Syntactic parameters (as recorded in SSWL) are very likely to have remained the same across all five languages in the family, with a higher probability that a feature not allowed in Latin remains not allowed in the other languages (31/106) than that a feature allowed in Latin remains allowed in the other languages (21/106). It is also very likely that a feature is the same in all the modern languages but different from Latin, with a much higher incidence of a feature allowed in Latin becoming disallowed in all the other languages (23/106) than the other way around (8/106). Among the remaining possibilities, we see cases where French allows a feature that is disallowed in the other languages (5/106) or disallows a feature that is allowed in the other languages (3/106), and cases where Latin and Portuguese both allow a feature that is disallowed in the other languages (3/106). All other nonzero entries have two or fewer occurrences. The resulting matrices for the edge flattenings of the tree of Fig. 11 are then as computed in Sect. 5.
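As a rough illustration of how an edge flattening is formed from these frequencies, the sketch below arranges the distribution into a matrix for one split of the leaves and evaluates its 3×3 minors. Two assumptions are made here and are not taken from the text: Fig. 11 is not reproduced, so the split {Spanish, Portuguese} | {French, Italian, Latin} is only a hypothetical example of an edge; and the invariants are assumed to be the 3×3 minors of edge flattenings, as for the binary general Markov model. The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product

# Counts from the text; leaves are indexed 0..4 as
# (French, Italian, Latin, Spanish, Portuguese).
COUNTS = {
    (0, 0, 0, 0, 0): 31, (0, 0, 0, 0, 1): 1, (0, 0, 0, 1, 0): 1, (0, 0, 1, 0, 0): 23,
    (0, 0, 1, 0, 1): 3, (0, 0, 1, 1, 1): 2, (0, 1, 0, 0, 0): 1, (0, 1, 0, 1, 1): 1,
    (0, 1, 1, 0, 1): 1, (0, 1, 1, 1, 1): 3, (1, 0, 0, 0, 0): 5, (1, 0, 0, 1, 0): 2,
    (1, 1, 0, 1, 0): 1, (1, 1, 0, 0, 0): 2, (1, 1, 0, 1, 1): 8, (1, 1, 1, 1, 1): 21,
}
TOTAL = sum(COUNTS.values())


def flattening(side_a, side_b):
    """Rows are indexed by the states of the leaves in side_a, columns by side_b."""
    n = len(side_a) + len(side_b)
    matrix = []
    for row_states in product((0, 1), repeat=len(side_a)):
        row = []
        for col_states in product((0, 1), repeat=len(side_b)):
            pattern = [0] * n
            for leaf, state in zip(side_a, row_states):
                pattern[leaf] = state
            for leaf, state in zip(side_b, col_states):
                pattern[leaf] = state
            row.append(Fraction(COUNTS.get(tuple(pattern), 0), TOTAL))
        matrix.append(row)
    return matrix


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def max_abs_3x3_minor(matrix):
    """Largest absolute value among all 3x3 minors of the matrix."""
    best = Fraction(0)
    for rows in combinations(range(len(matrix)), 3):
        for cols in combinations(range(len(matrix[0])), 3):
            sub = [[matrix[r][c] for c in cols] for r in rows]
            best = max(best, abs(det3(sub)))
    return best


# Hypothetical split: {Spanish, Portuguese} | {French, Italian, Latin}.
F = flattening((3, 4), (0, 1, 2))
print(float(max_abs_3x3_minor(F)))
```

Under the stated assumption, a value close to zero would indicate that the flattening is close to rank at most 2, which is the condition these minors test.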

Cite this chapter

Shu, K., Aziz, S., Huynh, VL., Warrick, D., Marcolli, M. (2018). Syntactic Phylogenetic Trees. In: Kouneiher, J. (eds) Foundations of Mathematics and Physics One Century After Hilbert. Springer, Cham. https://doi.org/10.1007/978-3-319-64813-2_14

DOI: https://doi.org/10.1007/978-3-319-64813-2_14
Published: 27 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64812-5
Online ISBN: 978-3-319-64813-2
eBook Packages: Physics and Astronomy (R0)
