Introduction

Molecular techniques have been used to examine the parentage of polyploid species since allozyme electrophoresis came into use in the mid- to late twentieth century (Werth 1989). Where there is a basis (morphological or otherwise) to hypothesise that a polyploid species arose following interspecific hybridisation (allopolyploidy), a range of genetic methods can be used to test such hypotheses based on an expectation of additivity. These methods include the aforementioned allozyme electrophoresis, additive patterns in DNA sequences of one (e.g. the 28S–5.8S–18S nrDNA cistron: Rauscher et al. 2002; Álvarez and Wendel 2003; Obbard et al. 2006) or more nuclear loci (especially low copy number nuclear genes: Barrier et al. 1999; Smedmark et al. 2003; Cronn et al. 2004; Small et al. 2004; Straub et al. 2006; Shepherd et al. 2008; Joly et al. 2009; Smissen et al. 2011; Kamneva et al. 2017), and cytological methods such as chromosome counting (e.g. Molloy and Dawson 1998) and in situ hybridisation (see Chester et al. 2010), including chromosome painting (Mandáková et al. 2018).

However, relationships among taxa descended from allopolyploids (paleopolyploids), and between them and diploid species can be difficult to infer and test as a result of successive allopolyploidisations, which can increase both the complexity of relationships and genomes, extinction of diploid (or higher) ancestors, and post-duplication genomic modification (especially loss of duplicated loci and other “genome downsizing”: Leitch and Bennett 2004). Thus, while all flowering plants are now hypothesised to be descended from polyploid ancestors, the precise details of many of the genome duplications involved may never be known. Nonetheless, much can be learned about the likely impact of polyploidisation in plant evolution from the study of contemporary polyploid complexes at different stages of their evolution.

Apart from the general importance of polyploidy in plant evolution, understanding of the genomic constitution of allopolyploids and accounting for homoeologous loci is a prerequisite for the accurate genotyping of allopolyploid plants in conservation genetics. Recently we have been conducting conservation-relevant genetic analyses for a group of highly threatened species of Lepidium L. (Brassicaceae) from Aotearoa New Zealand (henceforth New Zealand). These species (namely L. kirkii Petrie, L. solandri Kirk, and L. sisymbrioides Hook.f.) are endemic to inland areas of the South Island of New Zealand. Lepidium solandri and L. sisymbrioides have received past scientific attention because they are the only dioecious species of Brassicaceae (Heenan et al. 2007). All three have undergone concerning decline over recent decades and are listed as Nationally Critical in the New Zealand Department of Conservation Threat Classification System (de Lange et al. 2018). The reasons for their decline have not been firmly established, although land-use change and pathogenic microorganisms have been suggested (Wiser et al. 2013; Walker et al. 2023). Past studies suggest that these species include higher polyploids (de Lange and Murray 2002; Dierschke et al. 2009; Soza et al. 2014).

Brassicaceae are known for ‘their rampant hybridisation’, including that between distant lineages (Hendriks et al. 2023). Initially we explored whether sequence-capture data for these threatened Lepidium species could be used as a source of genetic marker data using the bait set of Nikolov et al. (2019), which targets 1827 exons. However, finding that assemblies of these data implied a number of gene copies, and being unable to phase the short reads (125 bp paired-end Illumina), we embarked on an amplicon sequencing study of selected genes from the sequence-capture data, intended to produce data fit for interpretation as codominant markers. The results of that study will be presented elsewhere, but in order to better understand the genomic constitution of our target species, and their relationships to other Lepidium species, we also assayed our amplicon markers in a selection of other Lepidium species, including additional New Zealand endemic species and species naturalised in New Zealand that were available at the Allan Herbarium (CHR), and co-opted genomic data from the National Center for Biotechnology Information (NCBI). Included in our sample of New Zealand endemic species were L. oleraceum Sparrm., L. flexicaule Kirk, L. juvencum Heenan & de Lange, L. naufragorum Garn.-Jones & D.A.Norton, and L. tenuicaule Kirk.

A hybrid origin of a Lepidium clade from Australia and New Zealand was hypothesised by Mummenhoff et al. (2004) on the basis of discordant plastid and nrDNA sequences. They found that Australasian Lepidium species fall into two clades (A and C) in nrDNA trees. Clade A species are sister to a group of African Lepidium species and clade C species are sister to North American species. In their plastid DNA trees, the Australasian Lepidium are united in one clade, and this is sister to North American species. Mummenhoff et al. (2004) proposed that two trans-oceanic dispersals brought North American and African Lepidium species to Australasia, where they hybridised to produce an Australasian clade. They proposed that nrDNA sequences in the Australasian clade were then homogenised by concerted evolution to render species with either the African or North American nrDNA type.

Subsequently, Dierschke et al. (2009) used genomic in situ hybridisation to test this hypothesis (sampling Australian members of clade A and C Lepidium). They found strong evidence for the African contribution to the genomes of clade A Australasian species but not to clade C. This led to a modified hypothesis, which proposed that Australasian species of clade C were the result of radiation following dispersal from North America and that clade A was the result of subsequent hybridisation between clade C and Lepidium of African origin. However, Dierschke et al. (2009) also pointed out that the evidence for a single Australasian Lepidium cpDNA clade sister to North American species was at best weak (detected only in 60% of the maximally parsimonious trees). If this chloroplast DNA evidence is excluded, the direct evidence for an American contribution to clade C comes only from a weakly supported (60% bootstrap) branch in the nrDNA tree of Mummenhoff et al. (2004), and there remains no direct sequence evidence for an involvement of the American species in Australasian clade A species.

Soza et al. (2014) also examined the genomic constitution of the dioecious species L. sisymbrioides and its relatives (see Heenan et al. 2007) L. tenuicaule, L. naufragorum, and L. kirkii. Based on meiotic chromosome counts, genome size estimates, and a phylogeny based on the low-copy-number gene PISTILLATA, they concluded that L. sisymbrioides is an allo-octoploid (2n = 8x = 64) with four diploid progenitors. Although Soza et al. (2014) did include PISTILLATA sequences from Australasian clade A species (generated in Lee et al. 2002) in their phylogeny, they did not find sequences close to those of L. africanum (Burm.f.) DC. However, this can be explained by Lee et al. (2002) having sequenced relatively few clones in their study, or by biases in PCR or cloning. Noteworthy in the phylogeny of Soza et al. (2014) is that two copies of PISTILLATA from Australasian Lepidium species were closely associated with American species.

Based on genome size estimates, Soza et al. (2014) also concluded that L. naufragorum is a high polyploid of the order of 18x (consistent with the meiotic count reported by de Lange and Murray (2002) as c. 72 bivalents). Among the New Zealand species, L. naufragorum appeared with clade A species in plastid phylogenies but with clade C species in nrDNA phylogenies, incongruence that may indicate a hybrid origin (de Lange et al. 2013).

Our aims in this study were to:

  1. Test whether Australasian clade A and clade C species are allopolyploids, and estimate their ploidy.

  2. For the New Zealand species, estimate the contribution of African and American species and determine how many diploid progenitors contributed to them.

  3. Test whether L. naufragorum is a product of allopolyploidisation between Australasian clades A and C.

In addition, we explore whether a network approach treating splits additively can be used to summarise information from multiple gene-family trees where some samples are represented by non-monophyletic groups of sequences.

Materials and methods

Taxon sampling

We selected two representative plants of L. solandri, one of L. sisymbrioides, one of L. naufragorum, and one of L. kirkii from our larger set of plants sequenced for conservation genetics to represent the clade C Lepidium from New Zealand. To these, we added a herbarium sample of L. tenuicaule. Clade A New Zealand Lepidium were represented by herbarium samples of L. oleraceum, L. juvencum, and L. flexicaule; an Australian native species of clade A, L. pseudotasmanicum Thell., was also included from a population naturalised in New Zealand. Lepidium species from the Americas were represented by two herbarium samples of L. bonariense L. and one of L. didymum L., both species being naturalised in New Zealand. Lepidium africanum was also represented by a naturalised New Zealand herbarium specimen.

Vouchers for these are given in Table 1, but as these are highly threatened and declining species no voucher is available for one of the L. solandri specimens and the voucher cited for L. kirkii is from the same site but not from the individual sequenced. Sequence capture data is available for L. hirtum (L.) Sm. (SRR8528378), and we were able to map this against reference sequences matching our markers in Geneious 10.1.3 and include sequences for this diploid species. Genome sequences are publicly available for L. africanum (GCA_900406405.1), L. aucheri Boiss. (GCA_900406435.1), L. campestre (L.) W.T.Aiton (GCA_009757365.2), and L. sativum L. (GCA_023374145.1), so we retrieved sequences matching our markers by Blast search against these genomes through the NCBI genome database. For outgroups, we used the reference genome of Arabidopsis thaliana (L.) Heynh. (GCF_000001735.4), representing the sister clade of Lepidium in Nikolov et al. (2019) and Cardamine resedifolia L. (GCA_014578085.1) representing the sister to the clade including Arabidopsis Heynh. and Lepidium in Nikolov et al. (2019), again by using Blast searches against these genomes in the NCBI database.

Table 1 Vouchers for samples sequenced in this study. ¹ Voucher from the same population but not the individual sequenced

DNA was extracted using the 2 × CTAB method (Doyle and Dickson 1987), modified to include a phenol chloroform extraction (as in Smissen and Heenan 2007), except that Zymo IIC columns and buffers were used for DNA recovery.

Marker design and amplification

Initially we generated genomic DNA libraries for 21 Lepidium samples from clade A and two from clade C using an Illumina TruSeq Nano kit (with UDI indexes), and enriched them using the bait set of Nikolov et al. (2019) with a custom MyBaits kit. We used a strict hybridisation temperature for enrichment (65 °C) and KAPA HiFi DNA polymerase for the post-enrichment amplification. Sequencing utilised half of one lane of an Illumina HiSeq run (2 × 125 bp paired-end reads). Reads for selected samples were assembled by mapping to reference sequences assembled from the baits of Nikolov et al. (2019) in Geneious 10.2.6 and reviewed by eye to select variable regions between 300 and 500 bp with suitable flanking regions for primer design using the primer design tools in Geneious.

Primers were designed targeting 16 regions and tailed with Illumina Nextera indexing sequences. Details of the primers are shown in Online Resource 1. The primers were screened by gradient PCR (using Roche FastStart DNA polymerase on a single L. sisymbrioides sample). After gradient PCR and screening for consistent amplification against a small panel of samples, fifteen primer pairs were selected for amplicon sequencing, with a 55 °C annealing temperature chosen as appropriate for all pairs. Amplicons for Illumina sequencing were generated in a two-stage process. In the first stage, amplicons were generated using tailed primers in four sets of multiplexed reactions using Roche FastStart DNA polymerase and the following thermal cycling parameters: 98 °C for 3 min, followed by 10 cycles of 95 °C for 30 s, 55 °C for 30 s, 72 °C for 30 s, and a final extension step of 72 °C for 5 min. The second-stage reaction used combined and 10 × diluted pools of the first reaction products (sample by sample) as a template and used index PCR primers with KAPA HiFi DNA polymerase and the following thermal cycling parameters: 95 °C for 3 min, followed by 10 cycles of 95 °C for 15 s, 65 °C for 15 s, 72 °C for 15 s, and a final extension step of 72 °C for 1 min.

Library preparation, sequencing and phylogenetic analysis

All PCR products from stage 2 were then pooled and purified using SPRI beads at a 1.2:1 ratio with sample. Paired end 2 × 300 bp sequences were generated using a MiSeq machine by Massey Genome Services (Palmerston North).

Demultiplexed DNA sequences were processed with TSSV (Anvar et al. 2014) called from FDSTools 1.2.0 (Hoogenboom et al. 2017), and then well-represented sequences were selected from the TSSV output using a custom R (R Core Team 2021) script. For each sample and primer pair, the script retained sequences that had at least 10% of the read count of the most abundant sequence and at least 70% of the length of the longest sequence. These were aligned using Muscle (Edgar 2004). Indel characters in the Muscle alignments were scored using 2matrix (Salinas and Little 2014), and the resulting Phylip files were transformed back into fasta format using the perl script Phylip2Fasta.pl (Hughes unpublished, available at https://github.com/josephhughes/Sequence-manipulation). Files were converted to NEXUS format using a second custom R script, which also replaced binary indel codes with ‘a’ and ‘c’ so that they could be included in phylogenetic analysis along with nucleotide substitution characters. Gene-family trees were generated using RAxML-HPC (Stamatakis 2014) with rapid bootstrapping. Because we were including indel characters and sequence divergences were not great, we applied the simple JC69 substitution model. Representative amplicon sequences were used for Blast searches against the reference A. thaliana genome (GCF_000001735.4). Each returned a single match or one clearly best match (chromosome and position are given in Online Resource 1).
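The read-count and length thresholds described above can be sketched in a few lines. The authors used a custom R script that is not reproduced here; the Python below is only an illustrative stand-in, and the names `SeqCount`, `select_well_represented` and `recode_indels` are hypothetical. The mapping of indel code 0 to ‘a’ and 1 to ‘c’ is an assumption, since the text states only that the two letters were used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeqCount:
    """One candidate sequence for a given sample and primer pair (hypothetical type)."""
    sequence: str  # nucleotide sequence reported for the amplicon
    reads: int     # number of reads supporting that sequence


def select_well_represented(candidates, min_read_frac=0.10, min_len_frac=0.70):
    """Keep sequences with at least min_read_frac of the highest read count
    and at least min_len_frac of the length of the longest sequence."""
    if not candidates:
        return []
    top_reads = max(c.reads for c in candidates)
    longest = max(len(c.sequence) for c in candidates)
    return [
        c for c in candidates
        if c.reads >= min_read_frac * top_reads
        and len(c.sequence) >= min_len_frac * longest
    ]


def recode_indels(binary_codes):
    """Recode binary indel characters as nucleotide-like symbols.
    Assumption: '0' -> 'a' and '1' -> 'c'; any other symbol is left unchanged."""
    return binary_codes.translate(str.maketrans({"0": "a", "1": "c"}))
```

Applying both thresholds relative to the best sequence within each sample and primer pair, rather than as absolute cut-offs, keeps the filter independent of sequencing depth.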

Summarising gene-family trees

Methods using explicit models to recover hybrid relationships among groups, including allopolyploids, are available (Lautenschlager et al. 2020; Oxelman et al. 2017; Than et al. 2008), but they can be computationally demanding, and we were not able to apply them to our data. Other methods are limited to diploid parents and their tetraploid derivatives (e.g. AlloPPNet: Jones et al. 2013). Instead, we provide an analysis attempting to combine information from our multiple gene-family trees additively. So far as we are aware, this approach has not been published previously. This analysis (see below) is affected by character polarisation (i.e. which side of a split is coded as 0 and which side is coded as 1), and as a result by input taxon order in some software, including SplitsTree (i.e. an outgroup needs to be the first taxon in each matrix).
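Character polarisation, as described above, can be illustrated with a short sketch. It assumes that each split is held as a string of ‘0’ and ‘1’ codes in a fixed taxon order; the function name `polarise` and the example taxa are hypothetical and are not taken from the authors' scripts.

```python
def polarise(split, taxa, outgroup):
    """Return the split recoded so that the side containing the outgroup is '0'.

    split    -- string of '0'/'1' codes (other symbols, e.g. '?', are kept as-is)
    taxa     -- list of taxon names in the same order as the codes
    outgroup -- name of the taxon whose side should be coded '0'
    """
    if split[taxa.index(outgroup)] == "0":
        return split  # already polarised
    flip = {"0": "1", "1": "0"}
    return "".join(flip.get(code, code) for code in split)


taxa = ["E", "A", "B", "C", "D"]          # outgroup E listed first
print(polarise("10011", taxa, "E"))       # the same split with sides swapped
```

Consistent polarisation matters because, when several sequences from one sample are later combined, the combination depends on which side of each split carries the ‘1’.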

Splits for further analysis were generated in two ways. First, gene-family alignments were analysed using SplitsTree 4.10. A NeighborNet network was generated for each gene family, and bootstrap values were generated from 1000 replicates. NeighborNet was used to generate splits (as opposed to median networks or other methods) because there are some missing data in the gene-family alignments, mostly due to incomplete recovery of loci from some genome sequences. Splits were exported unfiltered, filtered for a minimum 70% bootstrap support value, and filtered for a minimum 90% bootstrap value. Second, a custom R script was used to extract split information from RAxML trees using the as.Splits function from the R package TreeTools (Smith 2019). The collapseUnsupportedEdges function from the R package ips (Heibl 2008 onwards) was used to exclude edges with less than specified bootstrap values.
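The tree-based route can be sketched as follows. This is not the TreeTools or ips code used by the authors; it is a minimal Python illustration on a hypothetical nested-tuple tree, in which each internal node is `(children, support)` and each leaf is a sequence label. Only non-trivial splits whose edge meets the bootstrap threshold are returned.

```python
def _collect_clades(node, min_support, clades):
    """Return the leaf set below node, recording clades on supported edges."""
    if isinstance(node, str):            # a leaf: just its own label
        return frozenset([node])
    children, support = node
    leaves = frozenset().union(
        *(_collect_clades(child, min_support, clades) for child in children)
    )
    # The root carries no support value (None) and defines no split.
    if support is not None and support >= min_support:
        clades.append(leaves)
    return leaves


def supported_splits(tree, min_support):
    """List (clade, complement) pairs for edges with support >= min_support,
    skipping trivial splits that separate a single leaf from the rest."""
    clades = []
    all_leaves = _collect_clades(tree, min_support, clades)
    return [
        (clade, all_leaves - clade)
        for clade in clades
        if 1 < len(clade) < len(all_leaves) - 1
    ]


# Hypothetical gene-family tree: two supported pairs plus two unresolved tips.
tree = ([(["A1", "B"], 95), (["A2", "C"], 60), "D", "E"], None)
strict = supported_splits(tree, 70)      # keeps only the A1+B edge
```

Dropping an edge from the split list has the same effect on downstream analyses as collapsing it in the tree, which is why the threshold is applied before the splits are combined.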

We used another custom R script to convert the splits into a binary matrix, combining splits for each sample such that the sample was scored as present in a split if any sequences derived from that sample were members of the split. We will refer to these as condensed splits. The script outputs binary matrices recording the concatenated splits for all gene families. The resulting file was then imported into SplitsTree and subjected to median network (Huber et al. 2001) and NeighborNet (Huson and Bryant 2006) analyses. The NeighborNet graphs shown were generated without normalising distances. Median network analyses handle missing data by excluding characters with any missing information. We did not recover a complete set of sequences from some of the Lepidium genome sequences we included, so we have removed these samples from some median network analyses.
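The condensing step can be sketched with contrived data in which a tetraploid sample A carries two sequences, A1 and A2, that group with diploids B and C respectively. The code is an illustration in Python rather than the authors' R script; `condense_splits` is a hypothetical name, and scoring a sample as ‘?’ when it has no sequence for a gene family is an assumption about how missing data might be handled.

```python
def condense_splits(splits, seq_to_sample):
    """Combine per-sequence split codes into per-sample codes.

    splits        -- list of dicts mapping sequence label -> 0 or 1,
                     with the outgroup side of every split coded 0
    seq_to_sample -- dict mapping sequence label -> sample name
    Returns a dict mapping sample name -> string with one character per split:
    '1' if any sequence of the sample is coded 1, '0' if none is,
    '?' if the sample has no sequence in that split (assumed convention).
    """
    samples = sorted(set(seq_to_sample.values()))
    condensed = {sample: "" for sample in samples}
    for split in splits:
        for sample in samples:
            codes = [
                split[seq]
                for seq, owner in seq_to_sample.items()
                if owner == sample and seq in split
            ]
            if not codes:
                condensed[sample] += "?"
            elif any(code == 1 for code in codes):
                condensed[sample] += "1"
            else:
                condensed[sample] += "0"
    return condensed


seq_to_sample = {"A1": "A", "A2": "A", "B": "B", "C": "C", "D": "D", "E": "E"}
splits = [
    {"A1": 1, "A2": 0, "B": 1, "C": 0, "D": 0, "E": 0},  # A1 groups with B
    {"A1": 0, "A2": 1, "B": 0, "C": 1, "D": 0, "E": 0},  # A2 groups with C
]
example = condense_splits(splits, seq_to_sample)
# Sample A is scored '1' in both splits, so it is linked to both B and C.
```

Because the combination is a logical OR over the sequences of a sample, an allopolyploid ends up on the ‘1’ side of splits inherited from each of its parents, which is what produces reticulation in the resulting networks.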

Our approach is illustrated in Fig. 1 and a fuller explanation provided in Online Resource 2. TSSV files, NEXUS files and scripts can be found at https://datastore.landcareresearch.co.nz. Sequences have been accessioned to GenBank (see Online Resource 3 for accession numbers).

Fig. 1

Representation of the process used to condense splits using simple contrived data. Taxa B to E represent diploid species. E represents an outgroup. Taxon A represents a tetraploid combining the genomes of taxa B and C and displaying both sequences. Splits are condensed for taxon A by coding a ‘1’ where either sequence A1 or A2 is coded as ‘1’

Results

Distribution of sequences among species

Two sample gene-family trees are shown in Figs. 2 and 3, representing, respectively, a relatively well-resolved tree in which most relationships discussed here are represented and a less well-resolved tree in which fewer putative gene copies were recovered. The full set of gene-family trees is provided in a supplementary file (Online Resource 4 A–M).

Fig. 2

Gene-family tree for marker 20. Numbers appended to tip labels are the number of copies of each sequence counted by TSSV. Letter codes in tip labels identify unique sequences in order of their sequence count in TSSV results. Blue sample labels, Eurasian species. Green sample labels, Australasian clade C. Yellow sample labels, Australasian clade A. Red sample labels, American species. Purple sample labels, L. africanum. Black sample labels, outgroups

Fig. 3

Gene-family tree for marker 151. Numbers appended to tip labels are the number of copies of each sequence counted by TSSV. Letter codes in tip labels identify unique sequences in order of their sequence count in TSSV results. Blue sample labels, Eurasian species. Green sample labels, Australasian clade C. Yellow sample labels, Australasian clade A. Red sample labels, American species. Purple sample labels, L. africanum. Black sample labels, outgroups

For the diploid species L. africanum, at most two (highly similar) sequence types were recovered, consistent with primer pairs amplifying a single copy gene with some minimal allelic variation. We recovered too few amplicon sequences from our L. africanum sample for primer pair 342 to pass our filter, but we were able to recover a sequence from the L. africanum genome assembly (which was in any case identical to the most represented amplicon sequence). For this species, sequences recovered from our amplicon data differed from those recovered from the genome assembly by no more than one substitution, except in one case. Our Blast search against the L. africanum genome sequence for primer pair 366 recovered two hits. The first hit (L._africanum_2_1 in Online Resource 4I) was an exact match for the sole sequence recovered from our amplicon sample of this species; a second match with lower similarity and slightly less overlap was also recovered. This second sequence did not group with the other L. africanum sequences in our analyses, instead emerging with a number of sequences recovered from the genome sequences of L. aucheri, L. campestre, and L. sativum. It is possible that these sequences recovered by Blast search for this marker are not homologous with those amplified with our primers, but we have retained them in our analyses rather than risk adding subjective bias.

For sequences assembled from pre-existing capture data for L. hirtum, either a single sequence or two very similar sequences were recovered. From the genome sequences of the diploid outgroup species Cardamine resedifolia and Arabidopsis thaliana, a single, clearly best match was recovered by Blast search and included. From the genome assemblies of L. campestre, L. aucheri, and L. sativum, we in some cases recovered no matching sequence, a match for only part of the length or, in a few cases, multiple matches of similar score (up to 10 for L. sativum for primer pair 366) that were non-monophyletic in our trees (see Online Resource 4).

From our samples of Australasian clade A species we recovered between two and five distinct sequence types, with four the most common number and five the second most common (Table 2). However, not all sequence types were recovered from every individual sample for every marker (with the thresholds used to score TSSV results). From clade C Australasian species (excluding L. naufragorum) we recovered between two and four copies, with four the most common number. Again, for some markers we did not recover all sequence types from every individual. From the tetraploid American species L. didymum we recovered two sequence types for most primer pairs but only a single type for two markers. For L. bonariense we recovered between two and four copies, with three the most common number, followed by four.

Table 2 Selected features of gene-family trees

An exclusive clade including clade A species from Australasia and L. africanum was recovered in 10 of 15 gene-family trees with > 50% rapid bootstrap support. In the remaining five gene-family trees no incongruent clades including L. africanum were recovered with > 50% bootstrap support (see Table 2). All gene-family trees showed at least one and up to three clades exclusively containing Australasian and American sequences with a range of bootstrap values, including high values (see Table 2).

Lepidium naufragorum combines sequences of clade A and clade C

Lepidium naufragorum generally displayed sequences characteristic of both Australasian clades A and C, and, where these were resolved separately, displayed sequences consistent with an additive combination of both (e.g. Fig. 2). A maximum of nine sequences for each locus passed our filters of TSSV data for L. naufragorum.

Analyses of condensed splits

Figure 4 shows two median network graphs and a NeighborNet derived from our condensed split approach using splits output from gene-family NeighborNets using SplitsTree (see Materials and methods). Figure 4a is the median network with a minimum support value of 5. In this graph, an edge linking L. africanum to Australasian clade A samples is apparent (highlighted in purple). This split is not apparent in the more complicated median networks built with lower minimum support values (not shown). Both the edge linking the Australasian species to the American species and the edge linking L. africanum to the Australasian clade A species are maintained with an increasing minimum support value up to 8 (Fig. 4b). Because 5 of 15 markers are affected by missing data for a single taxon, they are effectively excluded from median network analyses. Therefore, we also provide a distance-based NeighborNet graph for these data, which includes all the markers (Fig. 4c).
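The effect of a minimum support value can be sketched under one assumption: that the support of a split in the median network is the number of condensed-split characters sharing the same 0/1 pattern across gene families. This is an interpretation for illustration and may not match the SplitsTree implementation exactly; `split_support` is a hypothetical name. Characters with missing data are skipped, as stated for median network analyses in the Materials and methods.

```python
from collections import Counter


def split_support(columns, min_support=1):
    """Count identical condensed-split patterns and keep the well-supported ones.

    columns     -- iterable of strings, one character per sample in a fixed
                   order, already polarised so that the outgroup is '0'
    min_support -- minimum number of characters that must share a pattern
    Characters containing '?' are excluded, as are constant characters,
    which do not define a split.
    """
    counts = Counter(
        col for col in columns
        if "?" not in col and len(set(col)) == 2
    )
    return {pattern: n for pattern, n in counts.items() if n >= min_support}


# Hypothetical characters for five samples from several gene families.
columns = ["01100", "01100", "01?00", "00110", "01100", "00000"]
well_supported = split_support(columns, min_support=2)
```

Raising the threshold removes splits seen in only a few gene families, which simplifies the network at the cost of hiding weakly replicated signal.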

Fig. 4

a, median network showing splits with a minimum support of 5 built from condensed splits using all splits from NeighborNet analyses of all gene-family trees. b, median network showing splits with a minimum support of 8 built from condensed splits using all splits from NeighborNet analyses of all gene-family trees. c, NeighborNet built from condensed splits using all splits from NeighborNet analyses of gene-family trees (Fit = 97.347). Blue sample labels, Eurasian species. Green sample labels, Australasian clade C. Yellow sample labels, Australasian clade A. Red sample labels, American species. Purple sample labels, L. africanum

Robustness of condensed split analyses to filtering splits by bootstrap values

Figure 5 illustrates that the results from analysing splits generated from NeighborNet were largely robust to filtering splits using bootstrap values of 70% or 90%. However, the conflict associated with L. naufragorum is notably less, presumably as a consequence of low levels of bootstrap support associated with the short splits distinguishing Australasian clade A and clade C sequences in the initial NeighborNet, from which the splits were derived. Lepidium naufragorum tends to group with clade A Australasian taxa rather than clade C taxa in these analyses.

Fig. 5

a, median network constructed from condensed splits from NeighborNet analyses filtered for a minimum of 70% bootstrap support. b, median network constructed from condensed splits generated from NeighborNet analyses filtered for a minimum of 70% bootstrap support with a minimum support of 7 and removing taxa with missing markers. c, NeighborNet constructed from condensed splits from NeighborNet analyses filtered for a minimum of 70% bootstrap support (Fit = 97.874). d, median network constructed with splits from NeighborNet analyses filtered for a minimum of 90% bootstrap support. e, median network constructed with splits from NeighborNet analyses filtered for a minimum of 90% bootstrap support applying a minimum weight of 3 and removing taxa with missing markers. f, NeighborNet constructed from condensed splits from NeighborNet analyses filtered for a minimum of 90% bootstrap support (Fit = 97.729). Blue sample labels, Eurasian species. Green sample labels, Australasian clade C. Yellow sample labels, Australasian clade A. Red sample labels, American species. Purple sample labels, L. africanum

Condensed splits derived from trees are similar to those derived from NeighborNets

The results of our condensed split approach using splits generated from RAxML trees were similar, although somewhat weaker support for splits was observed, perhaps due to a reduced number of splits recovered from maximum likelihood trees compared to the distance-based NeighborNet approach. Figure 6 shows median network and NeighborNet graphs generated from condensed splits using selected weight thresholds.

Fig. 6

a, median network built with a minimum support of 4 from condensed splits generated from maximum likelihood trees with edges receiving less than 50% rapid bootstrap support collapsed. b, median network built with a minimum support of 10 from condensed splits generated from maximum likelihood trees with edges receiving less than 50% rapid bootstrap support collapsed and taxa with missing data removed. c, median network built with a minimum support of 7 from condensed splits generated from maximum likelihood trees with edges receiving less than 90% rapid bootstrap support collapsed and taxa with missing data removed. d, NeighborNet graph generated from condensed splits from maximum likelihood trees with edges receiving less than 90% bootstrap support collapsed (Fit = 95.599). Blue sample labels, Eurasian species. Green sample labels, Australasian clade C. Yellow sample labels, Australasian clade A. Red sample labels, American species. Purple sample labels, L. africanum

Discussion

In this study, we used amplicon sequencing in order to avoid the need to phase massively parallel sequence data where multiple divergent copies of genes coexist in polyploid genomes. The problem of phasing massively parallel shotgun sequence data is likely to apply not only to recently formed polyploid lineages but also to paleopolyploid lineages that are still in the process of diploidisation, and therefore to many angiosperm lineages. Phasing will be facilitated by use of long-read sequencing for shotgun libraries (including enriched shotgun libraries) but may continue to be problematic and bioinformatically challenging.

In this case, short amplicons (168–422 bp in length) have proven to contain sufficient phylogenetic signal to provide information about the number of distinct subgenomes and the relationships among them in a group of high polyploid Lepidium species. It is possible that in allopolyploids in other genera longer amplicons would be required to achieve similar results, requiring the use of long-read sequence technologies (e.g. PacBio: Dauphin et al. 2018) or library preparation techniques involving fragmentation of long amplicons for short-read sequencing (e.g. Callahan et al. 2021). Amplicon sequencing is a cost-effective option as it does not require baits or specialised equipment and is readily scalable to large numbers of samples. Genomic resources including whole genome assemblies, shotgun libraries and target enrichment libraries are increasingly available for diverse plant groups, making the design of PCR primers reliably targeting nuclear gene regions an option for studies examining allopolyploid relationships and those seeking to examine large numbers of individuals in population-level studies.

Our amplicons map to four of the five A. thaliana chromosomes but may be biased toward chromosome 3 (Online Resource 1). The reasons for this are not clear. Amplicon sequences mapping to the same A. thaliana chromosome were separated by more than 1 million bp, with the exception of amplicons 361 and 383, which were separated by only 77,900 bp. Therefore, while our markers may not be an entirely random sample of the Lepidium genome, they are likely to be broadly representative.

The approach to combining split information from multi-label gene trees that we present has a number of strengths. It requires no a priori knowledge of sample ploidy or whether sequences are related as alleles, paralogs, or homoeologs. It is not computationally demanding and can be quickly conducted in R even with large numbers of sequences. Edges in the final networks are directly relatable to branches in individual gene trees. It relies on no assumptions about evolutionary process other than that allopolyploids combine the genomes of their parent species.

In interpreting our data we lean heavily on previous work suggesting that Australasian Lepidium have high ploidy levels and that reticulate relationships link them to L. africanum (or a species closely related to it) and American Lepidium species. The hypothesis of Mummenhoff et al. (2004) that Australasian plants are descended from an allopolyploid ancestor, of which one parent was closely allied to L. africanum, was already supported by the cytological work of Dierschke et al. (2009) for clade A species but not for clade C. Our gene-family trees provide further support for the African contribution to clade A, with 10 of 15 trees providing support for clades exclusively containing L. africanum and at least some Australasian clade A species, and the remaining five trees not suggesting any well-supported contradictory clades (Table 2). None of our data provide any support for an African contribution to Australasian clade C species. Although Dierschke et al. (2009) did not find support for an American contribution to the genomes of Australasian Lepidium species, neither did their work provide strong evidence against this hypothesis. Soza et al. (2014) did provide support for such a link in the case of both clade A and clade C Australasian species from their analysis of cloned PISTILLATA PCR products. All our gene-family trees had at least one clade exclusively containing at least one of the two American species we sampled, and at least one of each Australasian clade.

The situation is unquestionably complex, and the difficulty of obtaining reliable chromosome counts appears to have hampered the generation of clear working hypotheses of the reticulate phylogenetic history of these plants. In presenting our own working hypothesis, we favour the chromosome counts of Soza et al. (2014), who used meiotic spreads of L. sisymbrioides and L. tenuicaule and in both cases found that 2n = 8x = 64. These counts are consistent with our sequencing results for Australasian clade C species (except for L. naufragorum, see below), in that we recovered a maximum of four distinct sequence clades from these species. Building from this, it would be parsimonious to hypothesise that clade A plants combining five subgenomes are 2n = 10x = 80, since, in general, they display a similar complement of sequences to clade C along with an additional clade closely related to L. africanum sequences.

However, Dierschke et al. (2009) present chromosome counts for members of clade A and C from Australia that are inconsistent with this relatively simple hypothesis. Specifically, from clade A they report 7x = 56 for L. aschersonii Thell. and 11x = 88 for L. muelleriferdinandi Thell., and from clade C they report 9x = 72 for L. hyssopifolium Desv. (two samples) and L. pseudohyssopifolium Hewson. In addition, they report 14x = 112 for L. ginninderrense N.H.Scarlett (for which no ITS sequence is available but which may be part of clade C based on its morphological affinity to L. pseudopapillosum Thell., as described by Scarlett 2001). The only published chromosome count available for a New Zealand species of clade A is the meiotic count of c. 36 bivalents provided by de Lange and Murray (2002). Although this might be considered consistent with 9x, its authors make clear that this must be considered only an approximate count.
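
The arithmetic behind these comparisons can be sketched as follows. This is a minimal illustration only: it assumes a base chromosome number of x = 8 (implied by 2n = 8x = 64) and that each subgenome contributes a diploid (2x) chromosome set. The function names are hypothetical, and the chromosome numbers are those quoted above.

```python
# Assumption: base chromosome number x = 8, implied by 2n = 8x = 64.
BASE_NUMBER = 8

# Somatic chromosome numbers as quoted in the text
# (Soza et al. 2014; Dierschke et al. 2009).
reported_counts = {
    "L. sisymbrioides": 64,
    "L. tenuicaule": 64,
    "L. aschersonii": 56,
    "L. muelleriferdinandi": 88,
    "L. hyssopifolium": 72,
    "L. pseudohyssopifolium": 72,
    "L. ginninderrense": 112,
}


def ploidy_level(chromosome_count, base=BASE_NUMBER):
    """Return the ploidy multiple (e.g. 8 for 8x) for a somatic count."""
    level, remainder = divmod(chromosome_count, base)
    if remainder:
        raise ValueError(f"{chromosome_count} is not a multiple of x = {base}")
    return level


def expected_count(n_subgenomes, base=BASE_NUMBER):
    """Expected 2n if each subgenome contributes a diploid (2x) set (assumption)."""
    return 2 * n_subgenomes * base


# Map each species to (ploidy multiple, whether that multiple is odd).
summary = {
    species: (ploidy_level(count), ploidy_level(count) % 2 == 1)
    for species, count in reported_counts.items()
}

for species, (level, is_odd) in summary.items():
    note = "odd ploidy" if is_odd else "even ploidy"
    print(f"{species}: {level}x ({note})")
```

Under these assumptions, four subgenomes give an expected 2n of 64 and five give 80, whereas the counts of 56, 72, and 88 correspond to odd multiples of x.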

Unless unusual meiotic processes that allow fertile gametes to be produced from odd-ploidy sporophytes (e.g. canina meiosis in Rosa sect. Caninae (DC) Ser., Täckholm 1921) are invoked, we might expect the odd-ploidy plants to be sterile. New Zealand Lepidium of clade A have never been suggested to be reliant on asexual reproduction, and hybrids between New Zealand species of clade A, including L. flexicaule × L. oligodontum de Lange & Heenan and L. flexicaule × L. oleraceum, are convincingly documented by de Lange et al. (2013), implying sexual reproduction.

One of the samples examined by Dierschke et al. (2009) was noted by them as having an ITS sequence consistent with a hybrid background involving parents with sequences typical of both clade A and clade C. One explanation of these odd-ploidy counts is that some samples may represent F1 hybrids that were not representative of the species they were identified as (or may indicate multiple ploidies within those species), or that those species are not sexually reproducing. However, we are not familiar with the species concerned and any explanation we provide is speculative. It does not seem likely that explanations of this sort could realistically explain all the results reported by Dierschke et al. (2009). A more complex history involving intermediate ancestors in which aneuploid reductions have occurred (e.g. in the Brassicaceae genera Cardamine: Mandáková et al. 2013, and Ricotia: Mandáková et al. 2018) might also have led to the chromosome counts observed by Dierschke et al. (2009). Moreover, the relationship between odd-ploidy and sterility is not always straightforward, especially in high polyploids (e.g. Mahelka et al. 2007; Caruso et al. 2008). The chromosome counts of Dierschke et al. (2009) certainly indicate that more complexity in genome composition is present among Australasian Lepidium species than would be accounted for by derivation of clade A species through polyploidisation after hybridisation between 8x and 2x ancestors. It is also possible that extant Australasian species incorporating the L. africanum genome have been produced by multiple pathways.

Lepidium naufragorum has previously been found to display nrDNA sequences consistent with clade C and plastid DNA sequences consistent with clade A (Heenan et al. 2007) and was estimated using flow cytometry to be 18x. Our data are consistent with L. naufragorum being an allopolyploid with one parent from clade A and one from clade C. This is one of the most strongly supported results of our analyses.

Our results are consistent with two subgenomes of both Australasian clades being related to the American species L. didymum and L. bonariense. However, neither of these American species displays sequences that suggest they are particularly closely related to the direct progenitors of the Australasian species. Lepidium didymum has been reported as tetraploid (n = 16, 2n = 32: Mehra and Sharma 1975) and L. bonariense as tetraploid or octoploid (2n = 32 or 64: Lee et al. 2002), and our results for two samples from the naturalised New Zealand population of L. bonariense are consistent with octoploidy.

If it is accepted that two of the subgenomes of Australasian Lepidium are derived from an American ancestor, that still leaves two subgenomes with no obvious relatives in our sampling. Further sampling of American taxa might very well identify closer relatives that can account for these remaining subgenomes, but it is also possible that Lepidium from elsewhere in the world contain the closest extant relatives of these subgenomes, or that an indigenous Australasian component contributed them.

Further sampling of our markers from Lepidium species globally (but especially more Australian and American species) would probably be highly informative, and as we recovered only some loci (sometimes partial) from the genome sequences of L. aucheri, L. campestre, and L. sativum, better data for these or related species would be desirable. If resources are available, sequence capture data would be preferable, at least for diploid species (e.g. L. hirtum included here from the data of Nikolov et al. (2019)), and with a longer read length than used by us for allopolyploids, but phasing may remain problematic.

Implicit in this proposal is that our amplicon sequencing has produced incomplete representation of the subgenomes, since for many of the markers we recovered lower numbers of sequence clades. This is also consistent with the observation that in some gene-family trees particular species were absent from sequence clades we might expect them to display. This may be a result of homoeolog-loss following whole genome duplication, but it is likely that at least some of these absences result from mutations in PCR primer sites or other mutations in sequences, reducing the representation of particular sequences in PCR pools.

It is also likely that our use of a threshold for sequence representation in libraries of 10% of the most numerous sequence is imperfect and may have caused us to exclude less-represented but nonetheless real sequences. Based on our review of TSSV results for our data, a lower threshold would probably have caused us to include sequences that were artefactual (mostly PCR error). More sophisticated approaches to distinguish PCR or sequencing artefacts from poorly represented but real sequences could probably be developed, but an obvious step to improve the quality of amplicon data would be to reduce the number of PCR cycles used, which in retrospect was too high in this study and probably contributed to both uneven representation and a very long tail of minority sequences in libraries.
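
The trade-off described here can be illustrated with a minimal sketch of a relative-abundance filter. The function name, the inclusive cut-off, and the read counts are hypothetical illustrations, not the pipeline or data used in this study.

```python
def filter_by_relative_abundance(read_counts, threshold=0.10):
    """Keep sequences with at least `threshold` times the reads of the top sequence.

    Assumption: the cut-off is inclusive and is applied per library.
    """
    if not read_counts:
        return {}
    cutoff = threshold * max(read_counts.values())
    return {seq: n for seq, n in read_counts.items() if n >= cutoff}


# Hypothetical read counts for one sample at one marker.
library = {"seqA": 1000, "seqB": 450, "seqC": 120, "seqD": 95, "seqE": 8}

kept_at_10_percent = filter_by_relative_abundance(library, threshold=0.10)
kept_at_5_percent = filter_by_relative_abundance(library, threshold=0.05)

print(sorted(kept_at_10_percent))
print(sorted(kept_at_5_percent))
```

In this invented library, seqD falls just below the 10% cut-off and is dropped even though it could be a real but poorly amplified sequence. Lowering the threshold to 5% retains it, at the cost of moving the cut-off closer to the long tail of probable artefacts such as seqE.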

Our approach of combining split information from multi-label gene trees additively for samples does need to be considered with some caution. Edge lengths in our median network and NeighborNet graphs reflect the number of times splits occur across all gene-family trees included and not genetic distances among sequences or samples. Our intent with this analysis is to summarise additivity across gene-family trees in the spirit of historical genetic analyses of putative allopolyploids and is not an attempt to directly infer evolutionary relationships. It is intended to represent pattern and not process, more analogous to phenetic clustering than cladistic analysis. Until methods using explicit models of allopolyploid evolution able to deal with complex situations involving higher polyploids are established, we think other workers may find our approach (or a similar one) useful.
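
The idea that edge lengths reflect how often a grouping recurs across gene-family trees can be sketched as below. This is one plausible reading written in Python for illustration, not the R code used in the study. It assumes that trees are given as nested tuples whose leaves are sample names (a sample may label several leaves), that all trees contain the same samples, and that every non-trivial clade occurrence is counted once. The sample labels and tree shapes are invented.

```python
from collections import Counter


def clade_sample_sets(tree):
    """Return (samples under this node, sample sets of all internal clades)."""
    if isinstance(tree, str):  # a leaf, labelled by its sample name
        return frozenset([tree]), []
    samples = frozenset()
    clades = []
    for child in tree:
        child_samples, child_clades = clade_sample_sets(child)
        samples |= child_samples
        clades.extend(child_clades)
    clades.append(samples)
    return samples, clades


def count_splits(gene_trees):
    """Count how often each sample grouping is induced by a clade across trees."""
    counts = Counter()
    for tree in gene_trees:
        all_samples, clades = clade_sample_sets(tree)
        for clade in clades:
            # Skip trivial groupings: a single sample, or every sample.
            if 1 < len(clade) < len(all_samples):
                counts[clade] += 1
    return counts


# Hypothetical multi-labelled trees: "cladeA" appears twice in each tree,
# so it joins more than one grouping (the additive pattern).
tree1 = (("africanum", "cladeA"), (("cladeA", "cladeC"), "american"))
tree2 = ((("africanum", "cladeA"), "cladeC"), ("cladeA", "american"))

counts = count_splits([tree1, tree2])
for grouping, n in counts.most_common():
    print(sorted(grouping), n)
```

A sample carrying sequences in two different clades contributes to both groupings, so the resulting counts summarise pattern across trees and carry no information about genetic distance.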

Conclusions

Phylogenetic analyses of nuclear genes are consistent with allopolyploid origins of New Zealand Lepidium species. Based on the number of phylogenetically distinct gene copies present in samples, we hypothesise that a group corresponding with most of the New Zealand members of nrDNA clade C (L. kirkii, L. sisymbrioides, L. solandri and L. tenuicaule, but excluding L. naufragorum) is octoploid, consistent with previous chromosome counts for L. sisymbrioides and L. tenuicaule (Soza et al. 2014). The Australian and New Zealand members of nrDNA clade A we sampled (L. flexicaule, L. juvencum, L. oleraceum and L. pseudotasmanicum) display phylogenetically distinct gene sequences consistent with being decaploid, but this does not accord with chromosome counts for Australian species of clade A (Dierschke et al. 2009) and a more complicated history should not be excluded. Plants conspecific with, or at least closely related to, L. africanum contributed one subgenome to this second (clade A) group. Our taxonomic sampling was insufficient to fully establish the affinities of the other subgenomes involved, but our analyses are consistent with at least two of these, present in all Australasian species sampled, being related to a clade of Lepidium species from the Americas. A New Zealand species, L. naufragorum, has gene sequences supporting a derivation via allopolyploidisation between members of nrDNA clade A and nrDNA clade C, consistent with the octadecaploid genome estimated from flow cytometry data by Soza et al. (2014). Amplicon sequencing can provide a way of assaying the nuclear genome of allopolyploid species without cloning or the challenges of phasing massively parallel sequence data, while combining split information additively across multi-labelled gene trees can provide a method to combine phylogenetic signal across multiple gene trees.