Phylogenetics beyond biology

Evolutionary processes have been described not only in biology but also for a wide range of human cultural activities including languages and law. In contrast to the evolution of DNA or protein sequences, the detailed mechanisms giving rise to the observed evolution-like processes are not or only partially known. The absence of a mechanistic model of evolution implies that it remains unknown how the distances between different taxa have to be quantified. Considering distortions of metric distances, we first show that poor choices of the distance measure can lead to incorrect phylogenetic trees. Based on the well-known fact that phylogenetic inference requires additive metrics, we then show that the correct phylogeny can be computed from a distance matrix D if there is a monotonic, subadditive function ζ such that ζ⁻¹(D) is additive. The required metric-preserving transformation ζ can be computed as the solution of an optimization problem.
This result shows that the problem of phylogeny reconstruction is well defined even if a detailed mechanistic model of the evolutionary process remains elusive.


Introduction
At the most abstract level, evolution can be seen as a consequence of the generation of variation and selection. Since selection acts to remove entities from the system, the system will eventually "die out" unless selection is counteracted by some form of reproduction. Sustained evolution thus necessarily operates on populations of entities. The history of an evolutionary process can be recorded in the form of a directed graph: Dress et al. (2010b) considered the set Y comprising "all organisms that ever lived on earth" arranged into a graph G with arcs (directed edges) connecting two nodes u and v whenever u was a "parent" of v, defined in a rather loose sense as having contributed directly to the genetic makeup of v. These arcs encode not only father and mother in sexually reproducing populations, but also horizontal gene transfer, hybridization, the incorporation of retroviruses into the genome, etc. Since arcs encode ancestry, G is acyclic.
The very same construction applies to many other systems that are perceived as evolutionary. For example, in the evolution of languages one may consider the mutual influences of speakers or, even more fine grained, individual utterances as the basic entities (Croft 2000;Pagel 2009). The same is true for the transmission of cultural techniques, designs, and conventions (Mesoudi et al. 2006). Well-studied cases include the transmission of texts (Greg 1950), in particular manuscripts, and text reuse, i.e., the borrowing of parts of a corpus, with or without modifications, in the process of creating a new text, see, e.g., Seo and Croft (2008). Similarly, the revisions of the law as dissenting interpretations can be seen in this manner (Roe 1996). The common ground of these and presumably many other systems is that a limited set of entities at some point or interval in time "informs" limited sets of entities in their (usually immediate) future.
The key result of Dress et al. (2010b) is that several types of clusters on the subset U ⊂ Y of organisms that are currently alive can be defined from the structure of the graph G. Many of these form hierarchies and therefore define a tree. These clusters naturally take on the role of taxa, and the corresponding trees consequently are meaningful representations of the phylogenetic relationships among these taxa. The same interpretation is meaningful, as we argued above, also for many (but presumably not all) aspects of human cultural endeavors. Notions of cultural evolution (see, e.g., Flannery (1972), Mesoudi et al. (2006)) are therefore more than a convenient metaphor. Instead, for a given system of interest, one has to ask whether or not the corresponding graph G shares key features with the one obtained from conceptualizing biological evolution. There is no a priori reason to assume, for instance, that G always gives rise to the tree-like abstraction that is at the heart of biological evolution. This is an inherently empirical question that needs to be answered for each "evolutionary" system under consideration. Human languages, for instance, are a prime example of an aspect of human activity that closely conforms to biological evolution.
The key point here is that a phylogenetic structure is an emergent phenomenon of the underlying evolutionary process; it requires that there exists a level of aggregation in G that produces clusters adhering to an (essentially) hierarchical structure. Although Dress et al. (2010b) provide a formal justification for phylogenetic reconstruction with their analysis of the graph G , their work does not attempt to provide a practical procedure to identify the relevant clusters, i.e., the taxa. After all, these are defined in terms of the graph G , which of course is not directly observable. In fact, usually not even the set U of extant entities will be known completely, as we will have to be content with a subset of available data.
In general, neither the "true nature" of the elementary entities nor a complete description for each of them is available to us. Instead, we have to be content with measured representations. For instance, in molecular phylogenetics, it is customary to represent a taxon by a set of sequences (usually representing single copy protein coding genes) obtained from one or more individuals. Morphological approaches in phylogenetics use a list of characters such as features of bones or organs to represent a typical individual. The impact of the choice of representation on the results of phylogenetic reconstructions has long been recognized in morphological phylogenetics and has been the subject of a long-standing debate, see, e.g., Wiens (2001).
The fundamental assumption that is made in any type of similarity-based phylogenetic analysis is that similarity of representations reflects evolutionary relatedness, i.e., proximity in G , and therefore also makes it possible to identify the hierarchical cluster systems that are defined in terms of G . This is well established, of course, in the case of molecular phylogenetics, where a detailed model of sequence evolution is available (Jukes and Cantor 1969;Tavaré 1986;Arenas 2015). Similarly, permutation distances directly count genomic rearrangement events (Hannenhalli and Pevzner 1995). The connection is much less clear for morphological phylogenetics, where choice and even the concept of "character" is under debate, see, e.g., Wagner (2001), Wagner and Stadler (2003) for a formal discussion. In many cases, it seems difficult to construct a theory that links distance or similarity measures directly to an underlying evolutionary process. This is the case for instance in phylogenetic applications of distances between RNA secondary structures (Siebert and Backofen 2005) or the use of distance measures based on data compression (Cilibrasi and Vitanyi 2005;RajaRajeswari and Viswanadha Raju 2017).
Phylogenetic methods have also been employed in the humanities. Relationships among languages, for instance, can be captured by using cognates (i.e., words with a common origin) as characters, see, e.g., Gray et al. (2011), Holman and Wichmann (2017). Recently, sophisticated statistical approaches that model, e.g., the importance of sound change have been used to reconstruct language trees, see, e.g., Bhattacharya et al. (2018) for a recent overview. In stemmatics, differences between editions or manuscripts serve as characters from which the relationships, e.g., between the many different versions (O'Hara and Robinson 1993; Barbrook et al. 1998; Marmerola et al. 2016) can be reconstructed. Occasionally, material artefacts are considered. Tëmkin and Eldredge (2007) used phylogenetic methods to study the history of certain musical instruments. A broader perspective of phylogenetic approaches in cultural evolution is discussed, e.g., by Mesoudi et al. (2006), Steele et al. (2010) or Howe and Windram (2011).
It is a well-known fact in sequence analysis that not all (reasonable) distance measures lead to faithful reconstructions of phylogenies. It is, in fact, well-established practice to correct for back-mutations, i.e., to transform raw counts of diverged sequence positions (the Hamming or Levenshtein distances) into distance measures that can be interpreted as numbers of evolutionary events or divergence times. Depending on the level of insight into the data, the simple Jukes-Cantor model (Jukes and Cantor 1969) or one of the many much more elaborate models (Tavaré 1986; Arenas 2015) is used for this purpose. In the field of alignment-free sequence analysis, on the other hand, the focus is on the efficient computation of dissimilarity measures, without overt concern for the measure's connection to a dynamical model of evolution (Vinga and Almeida 2003). It has been observed, however, that the distance measures that do well in a phylogenetic context also correlate very well with model-based distances (Edgar 2004; Haubold et al. 2009; Leimeister and Morgenstern 2014). We suspect that this reflects the fact that a particular subclass of metrics, the so-called additive metrics, conveys complete phylogenetic information, see "Distance-based phylogenetics" section. We therefore make a strong assumption throughout this contribution:
Assumption A Given a complete and correct model of the evolutionary dynamics on a suitably constructed space X, there is an additive metric distance measure t on X that measures the cumulative change along each lineage.
An immediate consequence is that phylogenetic relationships can be reconstructed unambiguously if t is known. There is, of course, no reason to think that Assumption A holds in real life. In particular, it is certainly violated by all processes that lead to reticulate patterns in evolution, such as incomplete lineage sorting, horizontal gene transfer, and hybridization (Gontier 2015). The purpose of this contribution, therefore, is to ask how much (or how little) we need to know about the "true" metric t to be able to infer the correct phylogenetic tree T. More precisely, we investigate here the consequence of distorted distance measurements: Suppose that instead of t we can infer from the data only a "deformed" dissimilarity measure d = ζ(t), where ζ is an unknown function about which only some qualitative features can be known. We then ask: How much information about t, and thus the underlying phylogenetic tree, does d still convey?
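For the Jukes-Cantor model, the distortion and its inverse are known in closed form: on a four-letter alphabet, the expected fraction p of differing sites and the number t of substitutions per site are related by p = (3/4)(1 − exp(−4t/3)). The following Python sketch inverts this relation. The function names and the two toy sequences are illustrative assumptions and are not taken from the data discussed in this contribution.

```python
import math

def hamming_fraction(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jukes_cantor_distance(p: float, states: int = 4) -> float:
    """Expected substitutions per site for an observed mismatch fraction p."""
    limit = (states - 1) / states          # saturation level, 3/4 for DNA
    if not 0.0 <= p < limit:
        raise ValueError("mismatch fraction is at or beyond saturation")
    return -limit * math.log(1.0 - p / limit)

# Toy example (hypothetical sequences): 2 mismatches in 10 sites.
p = hamming_fraction("ACGTACGTAC", "ACGTTCGTAA")
t_est = jukes_cantor_distance(p)
print(f"p = {p:.3f}, corrected distance = {t_est:.4f}")
```

The corrected distance exceeds the raw mismatch fraction because it also accounts for substitutions that are hidden by back-mutations.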

Distance-based phylogenetics
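A function d: X × X → ℝ₀⁺ is a metric if it satisfies, for all x, y, z ∈ X:
$$d(x,y) = 0 \iff x = y \qquad \text{(M1)}$$
$$d(x,y) = d(y,x) \qquad \text{(M2)}$$
$$d(x,y) + d(y,z) \ge d(x,z) \qquad \text{(M3)}$$
Distance measures can be used for clustering and thus serve as a means of extracting hierarchical, i.e., tree-like, structures on a set of data.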
The basis of distance-based phylogenetic methods is additive metrics, i.e., metrics that are representations of edge-weighted trees. Consider a tree T with leaf-set X and a length function defined on the edges of T. Recall that every pair of leaves x and y is connected by a unique path in T. The length of this path, i.e., the sum of its edge lengths, defines the distance d_T(x, y). Additive metrics are those that derive from a tree in this manner. A famous theorem (Buneman 1974; Cunningham 1978; Dobson 1974; Simões-Pereira 1969) shows that additive metrics are characterized by the four-point condition: A metric d is additive if and only if, for any four points u, v, x, y ∈ X,
$$d(u,v) + d(x,y) \le \max\{\, d(u,x) + d(v,y),\; d(u,y) + d(v,x) \,\}.$$
The appearance of additive metrics in evolutionary processes can be justified rigorously for specific models. For example, Markovian processes on strings of fixed length lead to distances that can be estimated directly from the data: Denote by c_ab(x, y) the fraction of characters in which x has state a and y has state b; for each pair (x, y), these fractions can be arranged in a matrix C(x, y) = (c_ab(x, y))_{a,b}. Steel (1994) showed that (the expected values of) d(x, y) := −ln |det C(x, y)| form an additive metric. Well-known results from phylogenetic combinatorics show that given an additive metric, the tree T and its edge lengths can be reconstructed readily, see, e.g., the work of Apresjan (1966), Imrich and Stockiĭ (1972), Buneman (1974), Dress (1984), Bandelt and Dress (1992), Dress et al. (2010a). The well-known neighbor-joining algorithm (Saitou and Nei 1987), a special case of a large class of agglomerative clustering algorithms, furthermore solves this problem efficiently and was shown to always compute the correct tree when presented with an additive metric, see the survey by Gascuel and Steel (2006) and the references therein.
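The four-point condition can be tested directly on a distance matrix: it holds for all labelings of a quadruple exactly when the two largest of the three pairwise distance sums coincide. The following Python sketch implements this test. The function name, the tolerance, and the small example tree (pendant edge lengths 0.3, 0.1, 0.4, 0.2 and an internal edge of length 0.15) are assumptions chosen for illustration.

```python
from itertools import combinations

def satisfies_four_point(D, tol=1e-9):
    """True if the two largest of the three distance sums agree for every quadruple."""
    for u, v, x, y in combinations(range(len(D)), 4):
        sums = sorted([D[u][v] + D[x][y],
                       D[u][x] + D[v][y],
                       D[u][y] + D[v][x]])
        if sums[2] - sums[1] > tol:
            return False
    return True

# Path lengths in the hypothetical tree ((0,1),(2,3)).
tree_D = [[0.00, 0.40, 0.85, 0.65],
          [0.40, 0.00, 0.65, 0.45],
          [0.85, 0.65, 0.00, 0.60],
          [0.65, 0.45, 0.60, 0.00]]

# The same matrix with one distance perturbed, which breaks additivity.
noisy_D = [row[:] for row in tree_D]
noisy_D[0][2] = noisy_D[2][0] = 0.95

print(satisfies_four_point(tree_D), satisfies_four_point(noisy_D))
```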
Additivity of the underlying metric is also assumed in a recent generalization of phylogenetic trees that allows data points to appear not only as leaves but also as interior vertices of the reconstructed tree (Telles et al. 2013). A stronger condition than additivity is ultrametricity, which is characterized by the strong triangle inequality
$$d(x,z) \le \max\{\, d(x,y),\; d(y,z) \,\} \qquad \text{(MU)}$$
for all x, y, z ∈ X. Condition (MU) means that all triangles are "isosceles with a short base", i.e., two sides of each triangle have equal length and the third side is not longer than these two. Ultrametrics appear in phylogenetics under the assumption of the strong clock hypothesis, i.e., constant evolutionary rates (Dress et al. 2007). Dating of the internal nodes (Britton et al. 2007) transforms an (additive) phylogeny into an ultrametric tree. Ultrametrics are a special case of additive metrics.
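The "isosceles with a short base" characterization translates into a simple test: in every triangle, the two longest sides must be equal. The Python sketch below applies this test to two hypothetical four-point matrices, one derived from a clock-like tree and one from a tree with unequal rates. All names and numbers are illustrative assumptions.

```python
from itertools import combinations

def is_ultrametric(D, tol=1e-9):
    """True if in every triangle the two longest sides have equal length."""
    for x, y, z in combinations(range(len(D)), 3):
        sides = sorted([D[x][y], D[x][z], D[y][z]])
        if sides[2] - sides[1] > tol:
            return False
    return True

# Clock-like tree: leaves 0,1 coalesce at height 1, leaves 2,3 at height 2, root at height 3.
clock_D = [[0, 2, 6, 6],
           [2, 0, 6, 6],
           [6, 6, 0, 4],
           [6, 6, 4, 0]]

# Additive but not clock-like: the tree ((0,1),(2,3)) with unequal pendant edges.
additive_D = [[0.00, 0.40, 0.85, 0.65],
              [0.40, 0.00, 0.65, 0.45],
              [0.85, 0.65, 0.00, 0.60],
              [0.65, 0.45, 0.60, 0.00]]

print(is_ultrametric(clock_D), is_ultrametric(additive_D))
```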
Real-life data sets, unfortunately, almost never satisfy the four-point condition. As a remedy, Sattah and Tversky (1977) and Fitch (1981) suggested considering a "split relation" on pairs of objects, often referred to as quadruples, defined by
$$uv \,\|\, xy \iff d(u,v) + d(x,y) < \min\{\, d(u,x) + d(v,y),\; d(u,y) + d(v,x) \,\}.$$
The relation ‖ has been studied extensively and, under certain additional conditions, can provide sufficient information for reconstructing phylogenetic trees (Bandelt and Dress 1986) or at least phylogenetic networks (Bandelt and Dress 1992; Grünewald et al. 2009). The approximation of a given metric by additive metrics or ultrametrics given some measure of the goodness of fit has also received quite a bit of attention (Farach et al. 1996; Agarwala et al. 1998; Apostolico et al. 2013).
Here, we ask under which conditions distance data that may deviate from additivity in a systematic manner still yield a phylogenetically (more or less) correct relation ‖. This is different from the inference problems mentioned above: Our task is not to minimize a uniform error functional but to deal with systematic distortions of the distance measurements. In order to formalize the problem setting, we assume that the evolutionary process under consideration (operating on a space X) generates an additive metric t: X × X → ℝ₀⁺. The catch is that we have no knowledge of X and we cannot directly access t. We can, however, obtain partial knowledge from representations. That is, there is a function φ: X → Y. The construction of the representation in Y depends on our theory of what is important about the evolving system. In molecular phylogenetics, Y may be chosen to be a space of sequences. In classical, morphology-based phylogenetics, the elements of Y are character-based descriptions of animals; attempts to use molecular structures for phylogenetic purposes might use RNA secondary structures or labeled graph representations of protein 3D structures; a historical linguist might choose word lists or grammatical features.
Once we have decided on representations, we can turn to measuring (dis)similarities between them. The concrete choice of a distance measure d: Y × Y → ℝ₀⁺ of course again depends on the theoretical conception of the underlying evolutionary process. We can easily reinterpret d as a distance measure on X by setting
$$d(x,y) := d(\varphi(x), \varphi(y)).$$
It is easy to see that d: X × X → ℝ₀⁺ is a metric whenever d: Y × Y → ℝ₀⁺ is a metric and φ: X → Y is injective, i.e., whenever our representation is good enough to distinguish objects in X. There is no a priori reason to make this assumption, however. Consider, for example, RNA secondary structures as a function of the primary sequences. This map is highly redundant (Schuster et al. 1994); for example, most tRNAs share the standard clover-leaf structure despite very different sequences and divergence times that pre-date the common ancestor of all extant life forms (Eigen et al. 1989); distances between secondary structures therefore do not reflect all evolutionary processes. Formally, d is not a metric but only a pseudometric in this case: It does not satisfy axiom (M1) any longer. We will ignore this complication here and assume for simplicity that d: X × X → ℝ₀⁺ is a metric.
The metric d is of interest for phylogenetic purposes if it quantifies evolutionary divergence in a meaningful way. That is, we are concerned with the information about the underlying additive metric t that can be extracted from d. Without additional assumptions on the relationships between t and d, however, nothing much can be said. At the very least, our representation (Y, d) should be good enough to recognize whether one of two objects y or z has diverged further from a given reference point x than the other. Hence, we assume that for all x, y, z ∈ X:
$$t(x,y) < t(x,z) \implies d(x,y) \le d(x,z). \qquad \text{(m0)}$$
In the absence of at least this very weak form of monotonicity, we cannot really hope to recover information about t from measuring d. To our knowledge, property (m0) has not received much attention in the past. The following, stronger condition, however, has been considered extensively:
$$t(x,y) < t(u,v) \implies d(x,y) \le d(u,v) \qquad \text{(m1)}$$
for all u, v, x, y ∈ X. This property is known as (strong) monotonicity (Kruskal 1964) and lies at the heart of nonmetric multi-dimensional scaling, a set of techniques that aim at approximating dissimilarity data by a Euclidean metric (Borg and Groenen 2005). A commonly used criterion is to minimize the violations of condition (m1). It is interesting to note in this context that, given any input metric d, there is always a Euclidean metric that is connected with d by strong monotonicity, provided the embedding space is of sufficiently high dimension (Agarwal et al. 2007). In our context, it will be interesting to investigate whether there is an analogous result for additive metrics.
If we insist, in addition, that ties are preserved, i.e., that t(x, y) = t(u, v) is equivalent to d(x, y) = d(u, v), then there exists an increasing function ζ: ℝ₀⁺ → ℝ₀⁺ such that d = ζ(t). In the following, we will consider this (more restrictive) setting in some detail.
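Whether two distance matrices on the same objects are related in this way can be checked by sorting the pairs by their t-value: equal t-values must map to equal d-values, and larger t-values to strictly larger d-values. The Python sketch below performs this check on the observed values only. The function name, the tolerance, and the example data are assumptions made for illustration.

```python
import math
from itertools import combinations

def is_increasing_image(T, D, tol=1e-12):
    """True if D = zeta(T) for some strictly increasing zeta on the observed distances."""
    pairs = sorted((T[i][j], D[i][j]) for i, j in combinations(range(len(T)), 2))
    for (t1, d1), (t2, d2) in zip(pairs, pairs[1:]):
        if abs(t2 - t1) <= tol:
            if abs(d2 - d1) > tol:      # a tie in t must be a tie in d
                return False
        elif d2 - d1 <= tol:            # a larger t must give a strictly larger d
            return False
    return True

# Hypothetical additive metric on four points.
T = [[0.00, 0.40, 0.85, 0.65],
     [0.40, 0.00, 0.65, 0.45],
     [0.85, 0.65, 0.00, 0.60],
     [0.65, 0.45, 0.60, 0.00]]

# A saturating distortion preserves order and ties.
D_good = [[1.0 - math.exp(-t) for t in row] for row in T]

# Altering a single entry breaks the monotone relationship.
D_bad = [row[:] for row in D_good]
D_bad[0][1] = D_bad[1][0] = 0.9

print(is_increasing_image(T, D_good), is_increasing_image(T, D_bad))
```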
We say that ζ is a.m.-preserving (ultrametric-preserving) if ζ∘t is an additive metric (ultrametric) whenever t is an additive metric (ultrametric). It was shown recently that a function preserves ultrametricity if and only if it is amenable (Z1) and non-decreasing (Z3) (Pongsriiam and Termwuttipong 2014). In Appendix, we prove:
Lemma 1 Every a.m.-preserving function is ultrametric-preserving.
This implies in particular that an a.m.-preserving function is non-decreasing. It will not come as a surprise that nonlinear distortions do not preserve additivity (Theorem 1).
A proof can be found in Appendix. The importance of this theorem lies in the fact that any nonlinear distortion of the metric t necessarily destroys additivity and thus, depending on the algorithm employed, may result in the reconstruction of an incorrect phylogeny.
Given the importance of the relation ‖, it is natural to ask whether, or under what conditions, at least this relation is preserved. The example in Fig. 1 shows, however, that the relation ‖ is not necessarily preserved under transformations satisfying (Z1), (Z2), and (Z3). The example of Fig. 1 is reminiscent of the effect of long branch attraction (LBA) in parsimony-based methods (Felsenstein 1978; Bergsten 2005), which can also be understood as the consequence of underestimating the impact of homoplasy, i.e., "back-mutations."
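The loss of the relation ‖ under a saturating distortion can be reproduced with a small numerical experiment. The Python sketch below uses a hypothetical four-leaf tree with two long and two short pendant edges (it is not the data of Fig. 1) and the increasing, concave distortion ζ(s) = 1 − exp(−s/10), which vanishes only at s = 0. All names and edge lengths are assumptions chosen for illustration.

```python
import math

def quartet_split(dist, u, v, x, y):
    """Pairing with the strictly smallest distance sum, or None if there is a tie."""
    sums = {
        f"{u}{v}|{x}{y}": dist(u, v) + dist(x, y),
        f"{u}{x}|{v}{y}": dist(u, x) + dist(v, y),
        f"{u}{y}|{v}{x}": dist(u, y) + dist(v, x),
    }
    best = min(sums, key=sums.get)
    others = [s for key, s in sums.items() if key != best]
    return best if sums[best] < min(others) else None

# Hypothetical tree ((u,v),(x,y)): long edges to u and x, short edges to v and y.
PENDANT = {"u": 30.0, "v": 1.0, "x": 30.0, "y": 1.0}
SIDE = {"u": 0, "v": 0, "x": 1, "y": 1}
INTERNAL = 1.0

def t(a, b):
    """Additive tree metric: path length between leaves a and b."""
    if a == b:
        return 0.0
    return PENDANT[a] + PENDANT[b] + (INTERNAL if SIDE[a] != SIDE[b] else 0.0)

def d(a, b):
    """Distorted measurement d = zeta(t) with zeta(s) = 1 - exp(-s/10)."""
    return 1.0 - math.exp(-t(a, b) / 10.0)

def t_recovered(a, b):
    """Back-transformed distance, using the inverse t = -10 ln(1 - d)."""
    return -10.0 * math.log(1.0 - d(a, b))

true_split = quartet_split(t, "u", "v", "x", "y")
distorted_split = quartet_split(d, "u", "v", "x", "y")
recovered_split = quartet_split(t_recovered, "u", "v", "x", "y")
print(true_split, distorted_split, recovered_split)
```

In this example, the distorted distances group the two long branches together, whereas the back-transformed distances again support the split of the underlying tree.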

Multiple features
A reasonable approach to devise a distance measure for a set of objects is to use a representation in terms of a collection of features, i.e., to consider a product space Y = Y_1 × ⋯ × Y_N with a distance measure d_i independently defined for each of the features. Each feature can be seen as an independent representation, φ_i: X → Y_i, and thus, we may reinterpret the d_i as different distance measures on X. In this setting, it seems most natural to assume that d_i is just a pseudometric.
It is well known that any nonnegative linear combination of pseudometrics, d := ∑_i a_i d_i with a_i ≥ 0, is again a pseudometric. To avoid trivial cases, assume a_i > 0. Then, d is a metric whenever x ≠ y implies that there is a feature i such that d_i(x, y) > 0. The most general ways to combine metrics are given by the generalized metric-preserving transforms, i.e., functions ζ: (ℝ₀⁺)^N → ℝ₀⁺ with the property that d = ζ(t_1, …, t_N) is a metric whenever each t_i, 1 ≤ i ≤ N, is a metric (Das 1989). These functions have a characterization that naturally generalizes (Z1) and (Z*) to multiple arguments.
Fig. 1 caption (fragment): SplitsTree (Huson and Bryant 2006). Here, d(u, y) + d(x, v) is the distance pair with the shortest distance sum, i.e., it corresponds to the quadruple uy‖xv. This split corresponds to the longer one of the two side lengths of the box.
Theorem 2 If ζ: (ℝ₀⁺)^N → ℝ₀⁺ transforms additive metrics d_i consistent with the same underlying tree T into a metric ζ(d_1, …, d_N) that is again compatible with T, then ζ is linear with nonnegative slope for d_i > 0 as an immediate consequence of Theorem 1, i.e., condition (i) is necessary. Theorem 1 furthermore implies that the contribution for each feature i is necessarily of the form a_i d_i + b_i for d_i > 0. To ensure that we have a metric, each constituent must be a metric, i.e., at least one of a_i and b_i must be nonzero. □
In essence, Theorem 1 characterizes the distance measures that are "good" for phylogenetic purposes: These are exactly the ones that are linear combinations of distance measures that themselves are additive. In particular, therefore, alignment-free phylogenetic methods are guaranteed to work only when their distance measure approximates an additive measure, or, equivalently, when they approximate a distance for which a transformation to an additive distance is known (and used for the phylogenetic reconstruction).
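The role of linearity can be illustrated numerically. In the Python sketch below, two hypothetical feature-wise additive metrics on the same four-leaf tree are combined once linearly and once by a geometric mean, which serves only as an arbitrary example of a nonlinear combination. The deviation from additivity is measured as the gap between the two largest distance sums of the quadruple. All names, weights, and edge lengths are assumptions.

```python
def quartet_metric(pendant, internal):
    """Path-length matrix of the tree ((0,1),(2,3)) with the given edge lengths."""
    side = [0, 0, 1, 1]
    return [[0.0 if i == j else
             pendant[i] + pendant[j] + (internal if side[i] != side[j] else 0.0)
             for j in range(4)] for i in range(4)]

def four_point_gap(D):
    """Difference between the two largest distance sums; 0 for an additive metric."""
    sums = sorted([D[0][1] + D[2][3], D[0][2] + D[1][3], D[0][3] + D[1][2]])
    return sums[2] - sums[1]

# Two features that evolve on the same tree but at different rates.
d1 = quartet_metric([0.3, 0.1, 0.4, 0.2], 0.15)
d2 = quartet_metric([1.0, 2.0, 0.5, 0.7], 0.40)

# Nonnegative linear combination versus a nonlinear combination.
linear = [[2.0 * d1[i][j] + 0.5 * d2[i][j] for j in range(4)] for i in range(4)]
geometric = [[(d1[i][j] * d2[i][j]) ** 0.5 for j in range(4)] for i in range(4)]

print(four_point_gap(linear), four_point_gap(geometric))
```

The linear combination remains additive up to rounding, while the nonlinear one shows a clear gap.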

Inferring transformations
The theoretical considerations above lead to the conclusion that the key problem for phylogenetic inference from data without a completely understood underlying model is to find monotonic transformations that make the original data as additive as possible before applying distance-based phylogenetic methods. It is important to realize that this is not the same problem as extracting the additive part of a given metric using, e.g., split decomposition. To see this, consider a metric distance matrix d for which the transformation t = −10 ln(1 − d) recovers the additive metric of Fig. 1 (up to small rounding errors) and thus recovers the tree in Fig. 1. Its split decomposition, on the other hand, yields the network on the r.h.s. of the figure with isolation indices 0.066 for xv|uy and 0.045 for xy|uv. Any reasonable method for fitting an additive tree will thus pick up the quadruple xv‖uy from these distances.
Consider now a function A that, given a metric distance matrix D = (d(x, y))_{x,y} as input, produces a "best-fitting" additive metric distance matrix of the same dimension as output. More formally, denote by M_n the set of all metrics on n points, and let M = ⋃_{n>1} M_n.
Definition 2 A function A: M → M is a.m.-consistent if the following conditions are satisfied: (i) If D ∈ M_n, then A(D) ∈ M_n is an additive metric.
The neighbor-joining algorithm (Saitou and Nei 1987) is a well-known example of an a.m.-consistent function (Gascuel and Steel 2006). Another example is the non-prime part of the split decomposition (Bandelt and Dress 1992). Given a distance matrix D and an a.m.-consistent function A, a natural measure for the deviation from additivity is |D − A(D)| with some matrix norm |.|. In particular, |D − A(D)| = 0 if and only if D is an additive metric.
Let us now return to Assumption A and characterize distances that derive from additive metrics in a simple manner:
Lemma 2 Let D be a metric distance matrix, let A be an a.m.-consistent function, suppose ζ is invertible, increasing, and subadditive, and let |.| be a matrix norm. Then, there is an additive distance matrix T with D = ζ(T) if and only if |D − ζ(A(ζ⁻¹(D)))| = 0.
Proof Invertibility of ζ implies that D = ζ(T) is equivalent to T = ζ⁻¹(D). Now T = A(T) = A(ζ⁻¹(D)) if and only if T is additive. Using invertibility of ζ again, this is in turn equivalent to D = ζ(T) = ζ(A(ζ⁻¹(D))). Since the matrix norm |.| vanishes only for the 0-matrix, the Lemma follows. □
Lemma 2 immediately suggests to search for ζ by minimizing the error functional
$$E(\zeta) := |D - \zeta(A(\zeta^{-1}(D)))|. \qquad (4)$$
By Lemma 2, D derives from an additive metric if and only if a ζ with E(ζ) = 0 exists. Otherwise, we obtain an approximately additive source metric ζ⁻¹(D) that then serves as the best available input for phylogenetic reconstruction. In this case, the values of E(ζ) as well as the estimate ζ⁻¹(D) that is found by minimizing E(ζ) will in general depend on both the a.m.-consistent function A and the matrix norm |.|.
As a proof of principle, we first produced an artificial distance matrix by transforming the distances of a randomly generated tree with 100 leaves using the Jukes-Cantor rule (Jukes and Cantor 1969) corresponding to a four-letter alphabet and scaling the mutation rate such that back-mutations play a role but distances are not completely saturated. We then make the assumption that the measured data might depend on the unknown additive scale via a stretched exponential transformation ζ with unknown parameters a, b, and c. Figure 2 (top) shows that the correct values of a = 3/4 and c = 1 can be inferred by minimizing the discrepancy of Eq. (4). In "Appendix 2," we show more formally that the parameter b is arbitrary and hence cannot be inferred. Intuitively, this follows from the fact that b only scales the time axis and hence constitutes a purely additive transformation of the distance, which is canceled in Eq. (4) by the application of ζ⁻¹.
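A scaled-down version of this experiment can be sketched in Python with NumPy. The sketch rests on several assumptions that are not taken from the text: the stretched exponential is written as ζ(t) = a(1 − exp(−(t/b)^c)), neighbor joining serves as the a.m.-consistent function, the Frobenius norm is used as matrix norm, the search is a coarse grid over a and c with b fixed to 1, and the tree has only four leaves with arbitrarily chosen edge lengths. All function names are hypothetical.

```python
import numpy as np

def nj_tree_metric(D):
    """Neighbor joining; returns the path-length matrix of the inferred tree (n >= 2)."""
    M = np.array(D, dtype=float)
    n = M.shape[0]
    out = np.zeros((n, n))
    # Each active cluster maps its leaves to their path length to the cluster node.
    clusters = [{i: 0.0} for i in range(n)]
    while len(clusters) > 2:
        m = len(clusters)
        r = M.sum(axis=1)
        Q = (m - 2) * M - r[:, None] - r[None, :]
        np.fill_diagonal(Q, np.inf)
        i, j = np.unravel_index(np.argmin(Q), Q.shape)
        li = 0.5 * M[i, j] + (r[i] - r[j]) / (2 * (m - 2))   # edge from i to new node
        lj = M[i, j] - li                                     # edge from j to new node
        for a, da in clusters[i].items():
            for b, db in clusters[j].items():
                out[a, b] = out[b, a] = da + M[i, j] + db
        merged = {a: da + li for a, da in clusters[i].items()}
        merged.update({b: db + lj for b, db in clusters[j].items()})
        keep = [k for k in range(m) if k != i and k != j]
        new = 0.5 * (M[i, keep] + M[j, keep] - M[i, j])
        M = np.block([[M[np.ix_(keep, keep)], new[:, None]],
                      [new[None, :], np.zeros((1, 1))]])
        clusters = [clusters[k] for k in keep] + [merged]
    first, second = clusters
    for a, da in first.items():
        for b, db in second.items():
            out[a, b] = out[b, a] = da + M[0, 1] + db
    return out

def zeta(T, a, b=1.0, c=1.0):
    """Assumed stretched exponential; negative fitted lengths are clipped to 0."""
    return a * (1.0 - np.exp(-(np.maximum(T, 0.0) / b) ** c))

def zeta_inv(D, a, b=1.0, c=1.0):
    """Inverse of zeta; requires all entries of D to be smaller than a."""
    return b * (-np.log(1.0 - D / a)) ** (1.0 / c)

def discrepancy(D, a, c):
    """Frobenius norm of D - zeta(A(zeta^{-1}(D))) with A = neighbor joining, b = 1."""
    return np.linalg.norm(D - zeta(nj_tree_metric(zeta_inv(D, a, c=c)), a, c=c))

# Hypothetical tree ((0,1),(2,3)) and its Jukes-Cantor-distorted distances.
pend, inner, side = [0.3, 0.1, 0.4, 0.2], 0.15, [0, 0, 1, 1]
T_true = np.array([[0.0 if i == j else pend[i] + pend[j] + inner * (side[i] != side[j])
                    for j in range(4)] for i in range(4)])
D = zeta(T_true, a=0.75, b=0.75, c=1.0)

grid = [(a, c) for a in (0.6, 0.75, 0.9) for c in (0.8, 1.0, 1.25)]
best = min(grid, key=lambda p: discrepancy(D, *p))
print(best, discrepancy(D, *best))
```

The data were generated with b = 0.75 but fitted with b = 1, which is consistent with the statement that b cannot be inferred from the discrepancy.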
Real-life distance data of course are not perfectly additive. We therefore simulated sequence data by introducing substitutions independently at each sequence position according to a first order Markov process along all edges of a given phylogenetic tree. In order to tune the level of noise, we considered different linear combinations of the theoretical and the simulated data, see "Appendix 2" for details. We found that the estimation of ζ via Eq. (4) works well for small levels of sampling noise. For large noise levels, however, there are systematic biases. These appear to depend strongly on the choice of the matrix norm |.|. Clearly, a better understanding of the numerical problems associated with this inference problem will be necessary before the conceptually simple workflow proposed here can be applied to real-life data.

Discussion and conclusions
It has been realized already in the early days of computational phylogenetics that suitable transformation of distance data, e.g., using the Jukes-Cantor transformation, can increase the additivity and thus conceivably improve the quality of phylogenetic reconstructions (Vach 1992). A main insight in this contribution is that it is, at least in principle, possible to infer the correct distance transformation from the measured data only. As a consequence, the correct inference of phylogenetic relationships is possible not only for additive distances but also for the large class of distances that arise from additive metrics with a monotonic metric-preserving function.
At the same time, our results suggest that there are limits to phylogenetic inference. Whenever the available data cannot be transformed into an additive metric (at least approximately, i.e., up to measurement noise), there seems little hope to justify the interpretation of the results of hierarchical clustering (which of course can be performed on any kind of distance or similarity data) as a phylogeny. It is important to note, however, that our discussion has focused on metric-preserving functions, i.e., "uniform" transformations of the distance data. It is entirely possible to employ more general schemes that further extend the realm of phylogenetically meaningful data. For instance, the results of "Multiple features" section show that for data comprising multiple types of descriptors, distances extracted from the different subclasses c can be transformed with different functions ζ_c. Such an approach might even be useful to distinguish phylogenetically informative from problematic classes of features.
On a more conceptual level, our results show that detailed mechanistic models of the underlying evolutionary process are not logically necessary for phylogenetic inference. It is, in fact, sufficient that the measured distance data can be transformed to an additive metric by means of a monotonic metric-preserving function. This is not to say that a mechanistic understanding of the process is not useful or desirable. After all, a mechanistic model will, at the very least, typically imply the functional form of the transformation function ζ. The inference of ζ from real-world data remains an important open problem. The issue to be explored is not only the limiting effect of measurement noise and inherent deviations from additivity due to horizontal gene transfer, incomplete lineage sorting, etc., but also numerical issues such as the fact that, in large trees, a substantial fraction of all pairwise distances takes values very close to the diameter of the tree. This seems to cause a particular susceptibility to measurement noise. Systematic simulation studies well beyond the scope of this contribution will be required to address these issues.
A potential alternative to Eq. (4) is the minimization of some measure of tree-likeness for the transformed matrix ζ⁻¹(D). Attractive candidates are the corresponding parameters of statistical geometry (Eigen et al. 1988; Nieselt-Struwe 1997) and the related "δ-plots" advocated by Holland et al. (2002). It is not obvious, however, how these measures react to the changes in scale invariably introduced by ζ. This issue does not arise in the context of Eq. (4) because the effects cancel due to the appearance of both ζ⁻¹ and ζ.
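One such tree-likeness score can be sketched as follows. For each quadruple, the three distance sums are sorted and the quantity δ = (largest − middle) / (largest − smallest) is computed, which vanishes for additive metrics. The Python sketch below averages δ over all quadruples of the back-transformed matrix for several candidate values of a single parameter. The one-parameter family ζ(t) = a(1 − exp(−t)), the candidate values, and the four-leaf example tree are assumptions made for illustration only.

```python
import math
from itertools import combinations

def mean_delta(D):
    """Mean quartet delta score; 0 for an additive metric (needs at least 4 points)."""
    scores = []
    for u, v, x, y in combinations(range(len(D)), 4):
        low, mid, high = sorted([D[u][v] + D[x][y],
                                 D[u][x] + D[v][y],
                                 D[u][y] + D[v][x]])
        scores.append(0.0 if high == low else (high - mid) / (high - low))
    return sum(scores) / len(scores)

# Hypothetical additive metric of the tree ((0,1),(2,3)).
T = [[0.00, 0.40, 0.85, 0.65],
     [0.40, 0.00, 0.65, 0.45],
     [0.85, 0.65, 0.00, 0.60],
     [0.65, 0.45, 0.60, 0.00]]

# Measured distances under a Jukes-Cantor-type distortion with a = 3/4.
D = [[0.75 * (1.0 - math.exp(-t / 0.75)) for t in row] for row in T]

def back_transform(D, a):
    """Candidate inverse transformation t = -ln(1 - d/a), defined up to a scale factor."""
    return [[-math.log(1.0 - d / a) for d in row] for row in D]

scores = {a: mean_delta(back_transform(D, a)) for a in (0.6, 0.75, 0.9)}
best_a = min(scores, key=scores.get)
print(scores, best_a)
```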
It is interesting to note that our results also provide an a posteriori explanation for the observation that alignment-free methods work best in phylogenetic applications when the distances correlate well with alignment-based distances (Haubold et al. 2009;Morgenstern et al. 2017;Thankachan et al. 2017). It will be interesting to see whether other types of distances, such as compression distances (Kocsor et al. 2006;Penner et al. 2011), admit a transformation that makes them approximately additive.
Finally, several mathematical questions arise naturally from the results presented here. First, we may ask whether it is possible to replace condition (m1) by weaker requirements, such as (m0)? Even more generally, to what extent can arbitrary rate variations be accommodated? We know of course that they are harmless in an underlying additive metric-but what is the most general distortion that can be accommodated? Complementarily, it will be of interest to characterize the functions that preserve circular (Kalmanson 1975) and weakly decomposable metrics (Bandelt and Dress 1992), respectively.

Proof of Lemma 1
Since every ultrametric is additive, an a.m.-preserving function $\zeta$ must transform every ultrametric into an additive metric. Being a function, $\zeta$ transforms isosceles triangles into isosceles triangles. In particular, it preserves equilateral triangles.
Consider the set of ultrametrics $q$ on four points satisfying $uv\|xy$. The four isosceles triangles are $u|xy$, $v|xy$, $x|uv$, and $y|uv$. Therefore, $q(u,x) = q(u,y) = q(v,x) = q(v,y) =: c$. If the $\zeta$-transformed additive metric satisfies $uv\|xy$, then these four triangles still have a short base. Recall that $q$ is an ultrametric if and only if every triangle is isosceles with short base or equilateral. Therefore, $\zeta\circ q$ is again an ultrametric. Otherwise, suppose $ux\|vy$ holds w.r.t. the transformed metric. Additivity then implies $\zeta(q(u,x)) + \zeta(q(v,y)) < \zeta(q(u,v)) + \zeta(q(x,y)) = \zeta(q(u,y)) + \zeta(q(x,v))$, i.e., $2\zeta(c) < 2\zeta(c)$, a contradiction. The same result is obtained assuming $uy\|vx$. In the degenerate case, no quadruple exists and thus $\zeta(q(u,v)) + \zeta(q(x,y)) = 2\zeta(c)$. Since $q(u,v)$ and $q(x,y)$ can vary independently of each other, $\zeta$ must be constant, and thus, $\zeta\circ d$ is the trivial discrete metric, which is also an ultrametric. Hence, $\zeta$ is ultrametric-preserving on any subset of four points and thus in particular also preserves ultrametricity of all triangles. □
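As a small numerical companion to the lemma, the following Python sketch checks the strong triangle inequality on a four-point ultrametric with $uv\|xy$ before and after applying a function entrywise. It is only an illustration under stated assumptions: the matrix `Q`, the helper names `is_ultrametric` and `transform`, and the two test functions (the square root as a non-decreasing amenable function, and the reciprocal as a decreasing one) are hypothetical choices, not taken from the original text.

```python
import math
from itertools import permutations


def is_ultrametric(D, tol=1e-12):
    """Check d(i, j) <= max(d(i, k), d(k, j)) for all triples of distinct points."""
    n = len(D)
    return all(
        D[i][j] <= max(D[i][k], D[k][j]) + tol
        for i, j, k in permutations(range(n), 3)
    )


def transform(D, f):
    """Apply f to all off-diagonal entries; the diagonal stays zero."""
    n = len(D)
    return [[0.0 if i == j else f(D[i][j]) for j in range(n)] for i in range(n)]


def reciprocal(t):
    """A decreasing function on positive distances (not ultrametric-preserving)."""
    return 1.0 / t


# Hypothetical ultrametric on (u, v, x, y) with uv||xy: the four cross
# distances share the common value c = 3.
Q = [
    [0, 1, 3, 3],
    [1, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]

print(is_ultrametric(Q))                         # original matrix
print(is_ultrametric(transform(Q, math.sqrt)))   # non-decreasing transformation
print(is_ultrametric(transform(Q, reciprocal)))  # decreasing transformation
```

The square root keeps every triangle isosceles with a short base, whereas the reciprocal reverses the order of the distances and thereby breaks the strong triangle inequality for the triangle on $u$, $v$, $x$.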

Proof of Theorem 1
The discrete metric is additive; hence, any function that is constant on $\mathbb{R}^{+}$ is a.m.-preserving. As a consequence of Lemma 1 and Pongsriiam and Termwuttipong (2014), we know that any a.m.-preserving function is amenable and non-decreasing. In the following, we therefore assume that $\zeta$ is amenable, not constant on $\mathbb{R}^{+}$, and non-decreasing.
Consider the set of additive metrics on four points satisfying $t_{uv} + t_{xy} < t_{ux} + t_{vy} = t_{uy} + t_{vx}$. Then, for $\epsilon$ sufficiently small, the metric $t'$ defined by $t' = t$ except $t'_{ux} = t_{ux} + \epsilon$ and $t'_{vx} = t_{vx} + \epsilon$ is again an additive metric. Thus, $\zeta(t_{ux} + \epsilon) - \zeta(t_{vx} + \epsilon) = \zeta(t_{uy}) - \zeta(t_{vy})$, a constant. It is easy to see that $t_{vx}$ and $t_{ux}$ can be chosen arbitrarily (first choose an isolation index for $uv\|xy$ that is smaller than $\min(t_{vx}, t_{ux})$ and then pick $t_{uv}$ and $t_{xy}$ sufficiently small). Thus, for every $a, b > 0$ and sufficiently small $|\epsilon|$, we have $\zeta(a + \epsilon) = \zeta(b + \epsilon) + h_{ab}$. Let us fix $a$ and consider the partial function $h_a \colon b \mapsto h_{ab}$. Suppose $h_a$ is not constant. Then there is a point $b'' := \inf\{b' > b \mid h_{ab} \ne h_{ab'}\}$. Since we know that $h_a$ is constant in a neighborhood of $b$, we have $b'' > b$. By construction, $h_{ab'} = h_{ab}$ for all $b' \in [b, b'')$. But $h_a$ is also constant in an open neighborhood of $b''$, which has a non-empty intersection with $[b, b'')$. Thus, $h_{ab} = h_{ab''}$, a contradiction. Renaming the arguments, there is a function $h$ such that

$$\zeta(x + a) - \zeta(x) = h(a) \tag{6}$$

for all $a > 0$ and $x > 0$.
Replacing $a$ by $pa$ for $p \in \mathbb{N}$ yields $\zeta(x + pa) - \zeta(x) = h(pa)$, while substituting $x$ with $x + (p-1)a$ yields $\zeta(x + pa) - \zeta(x + (p-1)a) = h(a)$. Substituting $p$ by $p - 1$ and adding the resulting equation leads to $\zeta(x + pa) - \zeta(x + (p-2)a) = 2h(a)$ and thus eventually $\zeta(x + pa) - \zeta(x) = p\,h(a)$. Taken together, we have $h(pa) = p\,h(a)$. Replacing $a$ by $a/p$ shows $h(a) = p\,h(a/p)$ and thus $p'h(a) = p\,h(p'a/p)$ for all $p, p' \in \mathbb{N}$. That is, $h(pa) = p\,h(a)$ for all $p \in \mathbb{Q}$. Since $\zeta$ is non-decreasing, we see that $a \mapsto h(a) = \zeta(x + a) - \zeta(x)$ is also non-decreasing. Therefore, $p'h(a) \le h(pa) \le p''h(a)$ holds for all $p \in \mathbb{R}$ and all $p', p'' \in \mathbb{Q}$ with $p' \le p \le p''$. Using the well-known fact that $\mathbb{Q}$ is dense in $\mathbb{R}$, we conclude that $h(pa) = p\,h(a)$ holds for all $p \in \mathbb{R}$. In particular, we have $h(a) = a\,h(1)$. Substituting this into Eq. (6) and setting $x = 1$ yields $\zeta(a + 1) - \zeta(1) = a\,h(1)$. Setting $x = a + 1$ and rearranging the terms finally yields

$$\zeta(x) = h(1)\,x + \zeta(1) - h(1)$$

for all $x > 0$. The theorem now follows by observing that both the slope $h(1)$ and the intercept $\zeta(1) - h(1)$ must be nonnegative since $\zeta$ is amenable and non-decreasing. □

Appendix 2: On the example of Fig. 2

In Fig. 2, we considered distance data generated from an additive tree using a transformation of the form Eq. (5), which has the inverse $\zeta^{-1}(u) = \left(-(1/b)\ln(1 - u/a)\right)^{1/c}$, with parameters $a$, $b$, $c$ fixed at some values $a_0$, $b_0$, $c_0$, which we pretend not to know. Transforming them with $\zeta^{-1}$ with the correct value $a_0$ but arbitrary choices of $b$ and $c$ yields transformed distances of the form $(b_0/b)^{1/c}\,t^{c_0/c}$, where $t$ is the corresponding distance in the additive tree. The coefficients $b_0$ and $b$ appear only in the multiplicative factor $(b_0/b)^{1/c}$, and this does not affect additivity of the metric because the function must satisfy ( ) = ( ) for input matrices close to that are almost additive. It follows that the scaling factor $b_0$ of the time axis cannot be inferred by minimizing the discrepancy in Eq. (4).
This does not matter for phylogenetic reconstruction, however, because the scaled distance matrix corresponds to the same phylogenetic tree as the unscaled one. In contrast, choosing an exponent $c \ne c_0$ causes a nonlinear distortion and thus a nonzero discrepancy in the data. It is also easy to see that any choice of $a \ne a_0$ causes a nonzero discrepancy, and hence, $a$ can be inferred.
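The identifiability argument above can be checked numerically. The following Python sketch assumes that Eq. (5) has the form $\zeta(t) = a\,(1 - \exp(-b\,t^{c}))$, which is the form implied by the stated inverse. The quartet metric `T`, the parameter values, and the helper `four_point_gap` (the largest difference between the two largest distance sums of any quartet) are illustrative assumptions, not the data of Fig. 2.

```python
import math
from itertools import combinations


def zeta(t, a, b, c):
    """Assumed form of Eq. (5), inferred from the inverse given in the text."""
    return a * (1.0 - math.exp(-b * t ** c))


def zeta_inv(u, a, b, c):
    """Inverse transformation as stated in the text."""
    return (-(1.0 / b) * math.log(1.0 - u / a)) ** (1.0 / c)


def apply(D, f):
    """Apply f to all off-diagonal entries; the diagonal stays zero."""
    n = len(D)
    return [[0.0 if i == j else f(D[i][j]) for j in range(n)] for i in range(n)]


def four_point_gap(D):
    """Largest violation of the four-point condition over all quartets."""
    gap = 0.0
    for u, v, x, y in combinations(range(len(D)), 4):
        s = sorted([D[u][v] + D[x][y], D[u][x] + D[v][y], D[u][y] + D[v][x]])
        gap = max(gap, s[2] - s[1])
    return gap


# Hypothetical additive metric of a quartet tree uv|xy.
T = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]

A0, B0, C0 = 0.75, 0.02, 1.5  # "unknown" true parameters (illustrative values)
D = apply(T, lambda t: zeta(t, A0, B0, C0))

# Correct a and c, wrong b: the result is a rescaled copy of T, still additive.
rec_wrong_b = apply(D, lambda u: zeta_inv(u, A0, 0.05, C0))
# Wrong exponent c: nonlinear distortion, additivity is lost.
rec_wrong_c = apply(D, lambda u: zeta_inv(u, A0, 0.05, 1.0))
# Wrong saturation level a: additivity is lost as well.
rec_wrong_a = apply(D, lambda u: zeta_inv(u, 1.0, B0, C0))

for name, M in [("b", rec_wrong_b), ("c", rec_wrong_c), ("a", rec_wrong_a)]:
    print("wrong", name, "gap:", four_point_gap(M))
```

In this toy setting, the gap stays at the level of rounding error when only $b$ is wrong and becomes clearly nonzero when $c$ or $a$ is wrong, in line with the statement that $b_0$ cannot be inferred whereas $a$ and $c$ can.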
In order to construct a data set with tunable levels of sampling error, we used the tree as a "scaffold" to simulate the evolution of four-letter sequences of length $N = 10000$ for 100 time units with a per-site substitution rate of $\mu = 0.007$. Denote by $\mathbf{H}$ the empirically determined scaled Hamming distances for a particular instance of the simulated sequences. By construction, the expected distance matrix $\mathbf{D}^{*}$ for this model is the $\zeta$-transform of the additive tree metric, with $a = 3/4$ and $c = 1$. Hence, the sampling variance can be tuned by using linear combinations of $\mathbf{D}^{*}$ and $\mathbf{H}$. We used convex combinations of the form $\mathbf{D} = (1 - \lambda)\,\mathbf{D}^{*} + \lambda\,\mathbf{H}$. Note that the limit $\lambda \to 0$ corresponds to sequences of infinite length, which allow an arbitrarily accurate estimation of the expected distances.
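The construction can be sketched in a few lines of Python. This is a minimal illustration under several assumptions that are not taken from the original: a Jukes-Cantor-type model in which each site changes to one of the three other letters at total rate $\mu$ (so that the expected fraction of differing sites at tree distance $t$ is $\tfrac{3}{4}(1 - e^{-4\mu t/3})$), a hypothetical four-leaf ultrametric tree of height 100, and the name `lam` for the mixing weight. The actual scaffold tree and the scaling of the Hamming distances used for Fig. 2 may differ.

```python
import math
import random

MU = 0.007  # per-site substitution rate (value from the text)
N = 10000   # sequence length (value from the text)


def evolve(seq, t, rng):
    """Evolve a four-letter sequence for time t under the assumed model."""
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * MU * t / 3.0))
    # A changed site moves to one of the three other letters uniformly.
    return [(s + rng.randint(1, 3)) % 4 if rng.random() < p_diff else s for s in seq]


def hamming(s1, s2):
    """Fraction of sites at which the two sequences differ."""
    return sum(x != y for x, y in zip(s1, s2)) / len(s1)


def simulate(rng):
    """Simulate leaves u, v, x, y on a hypothetical tree of height 100."""
    root = [rng.randrange(4) for _ in range(N)]
    left = evolve(root, 30.0, rng)   # ancestor of u and v
    right = evolve(root, 50.0, rng)  # ancestor of x and y
    leaves = [
        evolve(left, 70.0, rng), evolve(left, 70.0, rng),
        evolve(right, 50.0, rng), evolve(right, 50.0, rng),
    ]
    return [[hamming(p, q) for q in leaves] for p in leaves]


# Additive (here even ultrametric) distances of the hypothetical tree.
T = [
    [0, 140, 200, 200],
    [140, 0, 200, 200],
    [200, 200, 0, 100],
    [200, 200, 100, 0],
]


def expected(T):
    """Expected distances: zeta with a = 3/4, c = 1 and, by assumption, b = 4*MU/3."""
    return [[0.75 * (1.0 - math.exp(-4.0 * MU * t / 3.0)) for t in row] for row in T]


def mix(D_star, H, lam):
    """Convex combination (1 - lam) * D_star + lam * H."""
    return [[(1.0 - lam) * d + lam * h for d, h in zip(rd, rh)]
            for rd, rh in zip(D_star, H)]


rng = random.Random(0)
H = simulate(rng)
D_star = expected(T)
D_half = mix(D_star, H, 0.5)  # half of the sampling deviation of H
print(max(abs(h - d) for rh, rd in zip(H, D_star) for h, d in zip(rh, rd)))
```

Setting `lam = 0` returns the expected matrix and `lam = 1` returns the simulated Hamming distances; intermediate values shrink the sampling deviations proportionally.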