Introduction

According to Algaebase (Guiry and Guiry, 2021), organisms, colorless and apochlorotic, without chloroplasts and pyrenoids, currently classified as Prototheca W.Krüger, 1894 (Chlorophyta, Trebouxiophyceae) are widely distributed from temperate to tropical conditions in both fresh and marine waters. Prototheca, closely related to the Chlorella Beyerinck, 1890 species-complex, is polyphyletic with Helicosporidium D.Keilin, 1921 and Auxenochlorella (I.Shihira & R.W.Krauss) T.Kalina & M.Puncochárová, 1987 branching within clades of Prototheca species (e.g. Bakuła et al. 2020; Shave et al. 2021). Prototheca and Chlorella are the only known algal genera including disease-causing organisms in humans (Jagielski et al. 2019). Prototheca is associated with conditions termed protothecosis (Guiry and Guiry, 2021). Concerning diagnostics, taxonomic validity is important – in particular concerning the pathology associated taxa P. wickerhamii K.Tubaki & M.Soneda, 1959 and P. zopfii W.Krüger, 1894 (type species). Several species of Prototheca are rare opportunistic pathogens (Huerre et al. 1993) in humans (Lass-Flörl et al. 2007), other mammals (e.g. Möller et al. 2007; Marques et al. 2008) and fish (Jagielski et al. 2017). Protothecosis is an infectious disease, which often spreads through contact with contaminated water (Jagielski and Lagneau, 2007). The first case of protothecosis in humans was described in 1964 (Davies et al. 1964). The disease manifests in three clinical forms: cutaneous lesions, olecranon bursitis and disseminated or systemic infections (Leiman et al. 2004; Lass-Flörl et al. 2007). Although protothecosis is considered a rare disease in humans with 105 of 211 published cases reported between 2000 and 2017 (Todd et al. 2018), it is likely that a large number of cases are unreported, undiagnosed or misdiagnosed (Masuda et al. 2021) since diagnostic methods as well as protocols for the treatment of protothecosis are not well established yet (Todd et al. 2018). 
Among other vertebrates, cattle are most commonly affected by protothecosis (e.g. Bueno et al. 2006; Marques et al. 2008; Jagielski et al. 2019), where the infection typically manifests as mastitis (Shave et al. 2021).

Fifteen Prototheca species are currently accepted (Jagielski et al. 2019; Kunthiphun et al. 2019), split into two lineages, with a dominance of human- and cattle-associated species, respectively (Jagielski et al. 2019). Across both lineages, six species are known to cause infections in humans and other vertebrates: P. blaschkeae U.Roesler, A.Möller, A.Hensel, D.Baumann & U.Truyen, 2006, P. bovis Jagielski, 2019, P. cutis K.Satoh & Makimura, 2010, P. miyajii Masuda, Hirose, Ishikawa, Ikawa & Nishimura, 2016, P. paracutis Kunthiphun, Endoh, Takashima, Ohkuma, Tanasupawat & Savarajara, 2019 and P. wickerhamii (Masuda et al. 2021). Phylogenetic relationships among Prototheca and their affiliated species (Tables S1, S2) have been investigated using ribosomal DNA (rDNA) (Suzuki et al. 2018; Jagielski et al. 2018; Masuda et al. 2016; Kunthiphun et al. 2019) and/or partial mitochondrial cytochrome b (cytb) (Jagielski et al. 2019) sequence data. In this study, for the first time, using distance, parsimony and maximum likelihood algorithms, we reconstruct the phylogeny of Prototheca and allies using rDNA (18S rDNA and ITS2) sequence and secondary structure data simultaneously, an approach reviewed by Keller et al. (2010), Wolf et al. (2014) and Wolf (2015) that increases the robustness and accuracy of phylogenetic tree estimation. The evolution of protothecean ITS2 secondary structures is discussed.

Material and methods

For the material and methods workflow, see Fig. 1. ITS2 and 18S rDNA sequences of Prototheca and its affiliated species (Tables S1, S2) were obtained from the NCBI Nucleotide database (retrieved on 2021-04-26) (Benson et al. 2009). ITS2 sequences were annotated using the “annotate” option implemented in the ITS2 database, which uses Hidden Markov Models to annotate eukaryote ITS2 (Eddy, 1998; Keller et al. 2009; Schultz et al. 2006; Ankenbrand et al. 2015).

Fig. 1

Flowchart of all methods applied in this work. Sequences which could not be properly annotated or aligned were discarded, as well as Prototheca strains classified as “sp.”. For alignment editing, Align (Hepperle et al. 2004) was used (not shown). After reconstruction of Neighbor-Joining (NJ) overview trees using ProfDistS (Wolf et al. 2008) subsets were manually chosen for Maximum-Likelihood (ML), Maximum-Parsimony (MP) and Neighbor-Joining (NJ) analysis. Further figures available with this manuscript are indicated

In ClustalX (Larkin et al. 2007), ITS2 as well as 18S rDNA sequences were aligned. Introns were removed from the 18S rDNA alignment with the help of the sequence editor Align (Hepperle et al. 2004).

Based on minimum free energy and constrained folding by using lower case letters, secondary structures of selected ITS2 (Tables S1, S2) sequences were predicted with RNAstructure (Reuter et al. 2010) which were then used as templates for homology modeling (Wolf et al. 2005; Selig et al. 2008) of the remaining secondary structures. Homology modeling was performed with the “model” option as implemented in the ITS2 database. Secondary structures of 18S rDNA sequences were also predicted via homology modeling using the ITS2 database. The template structure (Jaagichlorella luteoviridis (Chodat) Darienko & Pröschold, 2019) was obtained from the Comparative RNA Web Site (Cannone et al. 2002; Figure S1).

ITS2 and 18S rDNA sequence-structure datasets were each aligned using 4SALE (Seibel et al. 2006 and 2008). 4SALE uses a 12-letter-alphabet consisting of the four nucleotides and their structural states (unpaired, paired left, paired right) to encode sequence and structure information simultaneously. 4SALE was also used to visualize a consensus structure for Prototheca ITS2 sequences.
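The idea of a 12-letter alphabet can be illustrated with a minimal Python sketch. It assumes structures are given in dot-bracket notation and uses an arbitrary set of one-letter symbols (`ALPHABET`) chosen only for illustration; the symbols actually used by 4SALE may differ, and alignment gaps are not handled here.

```python
# Minimal sketch of a 12-letter sequence-structure encoding.
# ALPHABET is a hypothetical choice of symbols, not the 4SALE encoding.
NUCLEOTIDES = "ACGU"
STATES = ".()"  # unpaired, paired left, paired right
ALPHABET = "ABCDEFGHIJKL"  # 4 nucleotides x 3 structural states = 12 symbols

# Map every (nucleotide, structural state) combination to one symbol.
ENCODE = {
    (nuc, state): ALPHABET[i * len(STATES) + j]
    for i, nuc in enumerate(NUCLEOTIDES)
    for j, state in enumerate(STATES)
}


def encode_sequence_structure(sequence, structure):
    """Encode an RNA sequence and its dot-bracket structure as one string."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    sequence = sequence.upper().replace("T", "U")  # accept DNA-style input
    return "".join(ENCODE[(n, s)] for n, s in zip(sequence, structure))


if __name__ == "__main__":
    # A short hairpin: two base pairs enclosing a two-nucleotide loop.
    print(encode_sequence_structure("GCAUGC", "((..))"))
```

Because each symbol carries both the nucleotide and its pairing state, two positions with the same nucleotide but different structural states count as different characters in any downstream analysis.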

Sequence-structure alignments were exported from 4SALE for further analysis. Specifically, a sequence-structure Neighbor-Joining (NJ) (Saitou and Nei, 1987) tree was calculated based on both ITS2 and 18S rDNA sequence-structure alignments using ProfDistS (Friedrich et al. 2005; Wolf et al. 2008). For ITS2 sequence-structure data, Q_ITS2, a sequence-structure specific General Time Reversible correction model (cf. Lanave et al. 1984) as implemented in ProfDistS, was used, while for 18S rDNA sequence-structure data a sequence-structure specific JC model (Jukes and Cantor, 1969) was used as distance estimation method.
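As an illustration of a JC-type correction on a 12-state alphabet, the sketch below applies the textbook generalisation of the Jukes-Cantor formula to k equally frequent states, d = -((k-1)/k) ln(1 - k p/(k-1)), where p is the observed proportion of differing positions. Whether ProfDistS computes its sequence-structure specific JC distance in exactly this way is an assumption; the function names are hypothetical.

```python
import math


def p_distance(a, b, gap="-"):
    """Proportion of differing positions between two aligned, encoded strings.

    Columns containing a gap in either string are ignored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor_distance(p, n_states=12):
    """Generalised Jukes-Cantor correction for an alphabet of n_states symbols.

    Returns infinity when p reaches the saturation limit (n_states-1)/n_states.
    """
    limit = (n_states - 1) / n_states
    if p >= limit:
        return math.inf
    return abs(-limit * math.log(1.0 - p / limit))


if __name__ == "__main__":
    p = p_distance("HEAJIF", "HEAJIL")  # one of six encoded positions differs
    print(p, jukes_cantor_distance(p))
```

With `n_states=4` the function reduces to the classical nucleotide formula; the corrected distance is always at least as large as the uncorrected p-distance.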

From each dataset, a subset with fewer taxa was manually chosen (Tables S1, S2). For each subset, a sequence-structure NJ tree was calculated using ProfDistS. Maximum-Parsimony (MP) and Maximum-Likelihood (ML) trees based on sequence-structure data were calculated with PAUP (Swofford, 2002) (using one-letter encoded sequence-structure data) and R (R Core Team, 2018), respectively. The R script is available at http://4sale.bioapps.biozentrum.uni-wuerzburg.de/mlseqstr.html. Additionally, ITS2 and 18S rDNA sequences were aligned in ClustalX and sequence-only NJ trees were calculated in ProfDistS. For all methods, due to the complexity of the sequence-structure approach, bootstrap support (Felsenstein, 1985) was estimated based on 100 pseudo-replicates.
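The bootstrap procedure can be sketched in a few lines of Python. The sketch assumes the alignment is held as a dictionary mapping a taxon label to its one-letter encoded, aligned string; the function name and data layout are hypothetical and are not taken from PAUP, ProfDistS or the R script. Each pseudo-replicate is built by sampling alignment columns with replacement, and a tree would then be estimated from every replicate.

```python
import random


def bootstrap_alignments(alignment, n_replicates=100, seed=0):
    """Yield pseudo-replicate alignments by resampling columns with replacement.

    alignment: dict mapping taxon label -> aligned string (all of equal length).
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    for _ in range(n_replicates):
        # Draw as many column indices as the alignment has columns.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        yield {
            name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()
        }


if __name__ == "__main__":
    toy = {"taxon_1": "HEAJIF", "taxon_2": "HEAJIL", "taxon_3": "HDAJGL"}
    for replicate in bootstrap_alignments(toy, n_replicates=3):
        print(replicate)
```

The bootstrap value of a branch is then the percentage of replicate trees in which that branch is recovered.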

A third dataset was created consisting of combined ITS2 and 18S rDNA sequence-structure alignments. For this dataset, sequence-structure NJ, MP and ML trees were calculated with the programs and methods described above (for comparison, each marker was additionally analysed separately for this dataset, cf. Figures S4 and S5). All trees were rooted with Chlorella vulgaris Beyerinck, 1890 and Parachlorella kessleri L.Krienitz, E.H.Hegewald, D.Hepperle, V.A.R.Huss, T.Rohr & M.Wolf, 2004. All alignments are available on request.
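Combining two marker alignments can be sketched as a simple concatenation per strain. The sketch assumes both alignments are dictionaries keyed by a shared strain identifier (a hypothetical layout) and keeps only strains present in both markers; it is not the procedure of any specific tool.

```python
def concatenate_alignments(first, second):
    """Concatenate two alignments for the strains that occur in both.

    first, second: dicts mapping strain identifier -> aligned, encoded string.
    """
    shared = sorted(set(first) & set(second))
    return {name: first[name] + second[name] for name in shared}


if __name__ == "__main__":
    # Toy data with hypothetical strain labels and encoded strings.
    its2 = {"strain_A": "HEAJIF", "strain_B": "HEAJIL", "strain_C": "HDAJGL"}
    ssu = {"strain_A": "ABBA", "strain_B": "ABCA"}
    print(concatenate_alignments(its2, ssu))
```

Strains available for only one marker (here `strain_C`) drop out of the combined alignment, which is why a combined dataset typically contains fewer taxa than either single-marker subset.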

Results and discussion

Taxon sampling

From NCBI, 192 ITS2 sequences of Prototheca and affiliated species could be obtained, as well as 165 18S rDNA sequences (Tables S1, S2). For ITS2 sequences, re-annotation was performed using the “annotate” tool in the ITS2 database with “Viridiplantae” as the model, inclusion of the proximal stem (last 25 nucleotides of 5.8S and first 25 nucleotides of 28S rDNA) and an E-Value of < 0.01 or < 0.1.

For ITS2 and 18S rDNA, sequence alignments were created using ClustalX. Sequences which could not be annotated or aligned were discarded. The ITS2 sequence of Prototheca wickerhamii was significantly longer than all other ITS2 sequences and could therefore not be properly aligned. The final alignments consist of 118 ITS2 sequences and 73 18S rDNA sequences (cf. Figure 1).

For ITS2 sequences, six secondary structure templates were created using RNAstructure (P. blaschkeae, P. cutis, P. stagnorum W.B.Cooke, 1968, P. ulmea R.S.Pore, 1986, P. xanthoriae Jagielski, 2019, P. zopfii). With these templates, structures of all other sequences could be predicted using the “model” tool of the ITS2 database with at least 70 percent transfer of the structure for most and 60 percent of the structure for three sequences (P. tumulicola Nagatsuka, Kiyuna, Kigawa & J.Sugiyama, 2016). Structures of P. wickerhamii could not be predicted with the templates described and showed a significantly longer and bifurcated fourth helix when modeled with RNAstructure. This taxon is therefore missing in further analyses. Phylogenetic trees were calculated on sequence-structure alignments generated in 4SALE, consisting of 112 taxa and a subset with 30 taxa.

For 18S rDNA sequences, a structure template was obtained from CRW (Jaagichlorella luteoviridis, X73998, Figure S1). Structures of all 73 18S rDNA sequences could be predicted, with at least 70 percent structure transfer for all sequences except Helicosporidium sp. (67.82%). Prototheca sequences classified as “sp.” were discarded. For 18S rDNA data, phylogenetic trees were calculated with 71 taxa and a subset of 26 taxa.

From ITS2 and 18S rDNA subsets, a combined sequence-structure alignment of 15 strains / sequences was created.

Phylogeny of Prototheca based on ITS2 sequence-structure data

A Neighbor-Joining tree was calculated based on 112 ITS2 sequence-structure pairs (Fig. 2). From the clades shown in this overview tree, 30 taxa were manually selected for NJ, MP and ML analysis (Fig. 3). Towards the root of this tree, a highly supported supergroup consisting of P. miyajii, P. cutis and P. paracutis groups with Jaagichlorella luteoviridis and Auxenochlorella protothecoides, showing the polyphyly of the genus Prototheca. Strains of P. xanthoriae form a sister clade to all other Prototheca strains in the ML tree, although their position differs in the trees based on NJ and MP algorithms. P. moriformis W.Krüger, 1894 is very highly supported as sister group to the remaining taxa, which are then further divided into a P. tumulicola /P. stagnorum clade and a second clade, a supergroup consisting of P. zopfii /P. bovis, P. ciferrii Negroni & Blaisten, 1941, P. pringsheimii Jagielski, 2019, P. cerasi Jagielski, 2019, P. cookei Jagielski, 2019, and P. blaschkeae. In this supergroup, P. ciferrii appears to be polyphyletic with P. pringsheimii sequences branching within the P. ciferrii clade. P. ciferrii /P. pringsheimii and their sister group P. cerasi appear to be a sister group to the P. zopfii /P. bovis clade. All of these strains form a sister group to P. cookei. P. blaschkeae appears as sister to only the P. zopfii /P. bovis clade in the overview NJ tree, but in trees calculated on the subset data the sister group also includes P. ciferrii, P. pringsheimii, P. cerasi, P. cookei and P. moriformis (only in the MP tree).

Fig. 2

ITS2 sequence-structure Neighbor-Joining tree obtained from ProfDistS (Wolf et al. 2008). An alignment of 112 sequence-structure pairs (.xfasta format) of Prototheca and affiliated species was created using 4SALE (Seibel et al. 2006 and 2008) and encoded by a 12-letter alphabet (Wolf et al. 2014) for reconstruction of this tree. GenBank accession numbers accompany each taxon name. Clades are alternately marked green and blue and are additionally named alongside the tree in accordance with the clade names proposed in the phylogram by Jagielski et al. (2019). Taxa which were manually chosen for the subset are marked bold. The tree is rooted with Chlorella vulgaris FM205854

Fig. 3

ITS2 sequence-structure Maximum-Likelihood tree calculated with R (R Core Team, 2018) including a representative subset of 30 sequence-structure pairs from Prototheca and its affiliated species which were manually selected from Fig. 2. Bootstrap values from 100 pseudo-replicates mapped at the internodes are from Maximum-Likelihood (ML), Maximum-Parsimony (MP, obtained from PAUP (Swofford, 2002)) and Neighbor-Joining (NJ, obtained from ProfDistS (Wolf et al. 2008)) analyses. For NJ tree reconstruction the global multiple sequence-structure alignment (.xfasta format) as derived by 4SALE (Seibel et al. 2006 and 2008) was automatically encoded by a 12-letter alphabet (Wolf et al. 2014). For ML and MP tree reconstruction the “one letter encoded” fasta format (12-letter alphabet) as derived by 4SALE (Seibel et al. 2006 and 2008) was used. GenBank accession numbers accompany each taxon name. Clades are alternately marked green and blue and are additionally named alongside the tree in accordance with the clade names proposed in the phylogram by Jagielski et al. (2019). The tree is rooted with Chlorella vulgaris FM205854 and Parachlorella kessleri FM205885

In general, the trees calculated on ITS2 sequence-structure data show a topology similar to that of the trees calculated by Masuda et al. (2016) and Hirose et al. (2018), with the additional taxa proposed by Jagielski et al. (2019) and Kunthiphun et al. (2019). Auxenochlorella protothecoides and Jaagichlorella luteoviridis are sister groups in our tree based on ITS2 sequence-structure data with a bootstrap support of 100, and both cluster with P. cutis / P. paracutis / P. miyajii with a bootstrap support of 81. Auxenochlorella protothecoides branches with Prototheca wickerhamii (with a bootstrap support of 58) in the work of Hirose et al. (2018), and is sister group to all Prototheca sequences except P. wickerhamii in the tree proposed by Masuda et al. (2016) with a bootstrap support of 87. P. ulmea is poorly supported as a sister group to P. zopfii, P. moriformis and P. blaschkeae sequences in the same work, whereas we show P. tumulicola / P. stagnorum as sister group to these species with a bootstrap support of 76.

The phylogenetic position of P. xanthoriae remains unresolved here, as its position differs among all constructed trees, always with low bootstrap support. Several other relationships (e.g. the close relationship of P. miyajii + P. cutis /P. paracutis, P. tumulicola + P. stagnorum or P. moriformis + the supergroup consisting of P. zopfii /P. bovis, P. ciferrii, P. pringsheimii, P. cerasi, P. cookei, P. blaschkeae, P. tumulicola, and P. stagnorum) are very highly supported by bootstrap values > 90 in all (NJ, MP, ML) calculated trees. A single P. moriformis sequence (MK445153) was positioned within the P. zopfii /P. bovis clade in the ITS2 sequence-structure overview tree (Fig. 2). This strain (SAG 263-2) appears in the P. moriformis clade (cluster IX) in the phylogram based on the partial cytb sequences by Jagielski et al. (2019).

Comparing the sequence-structure tree to a tree based on sequence data only (Figure S2), it is apparent that the sequence-only tree has similar or sometimes lower support than the tree based on the sequence-structure alignment and shows several differences in topology, e.g. the positions of the P. tumulicola / P. stagnorum clade or the P. moriformis clade. Auxenochlorella protothecoides and Jaagichlorella luteoviridis branch inside the P. cutis / P. paracutis / P. miyajii clade in the sequence-only tree. This latter clade without A. protothecoides and J. luteoviridis is highly supported in the sequence-structure tree with a bootstrap value of 93.

Phylogeny of Prototheca based on 18S rDNA sequence-structure data

Using the Neighbor-Joining algorithm, an overview tree based on 71 18S rDNA sequence-structure pairs was created (Fig. 4). Here, as in several other trees based on rDNA data (e.g. Masuda et al. 2016; Hirose et al. 2018; Shave et al. 2021), P. wickerhamii appears to be polyphyletic with two strains (X56099, X74003) branching outside of the P. wickerhamii clade. Jagielski et al. (2019) reclassified these taxa as a new species, Prototheca xanthoriae. From the NJ overview tree, a subset was created by manual selection of 26 sequence-structure pairs. Trees calculated on this subset data (Fig. 5) show P. xanthoriae as the sister group to all strains except the outgroup. Auxenochlorella protothecoides and Jaagichlorella luteoviridis are sister group to all remaining taxa, which are then further divided into a P. miyajii /P. cutis clade and a second clade, in which Helicosporidium sisters with P. wickerhamii and another supergroup of several Prototheca species. This supergroup forms two clades, the first being P. ulmea /P. moriformis and their sister group P. tumulicola /P. stagnorum, the second divided into a P. blaschkeae clade and a second clade consisting of P. ciferrii, P. moriformis and P. zopfii /P. bovis. Bootstrap support of this tree is generally high, as all but one of the external nodes are supported by a bootstrap value > 65. The trees calculated on 18S rDNA sequence-structure data show similar topology to the trees based on LSU rDNA data proposed in literature (e.g. Masuda et al. 2016; Hirose et al. 2018), but differ from the phylogram based on partial cytb sequences by Jagielski et al. (2019) in several aspects. P. stagnorum, P. tumulicola and P. moriformis appear towards the root of the tree in the cytb sequence-based phylogram, while our tree shows all three species distant to the root and forming the sister group to P. zopfii / P. bovis, P. moriformis, P. ciferrii and P. blaschkeae. In the cytb sequence-based phylogram, a multifurcation occurs including P. wickerhamii, the sister groups P. miyajii and P. cutis as well as a clade consisting of P. xanthoriae, Helicosporidium sp. and Auxenochlorella protothecoides. P. wickerhamii is shown to be the sister group of P. zopfii / P. bovis, P. ciferrii, P. blaschkeae, P. tumulicola, P. stagnorum and P. moriformis in our tree, although with low bootstrap support. Helicosporidium sp. is sister group to all of these species (with moderate bootstrap support) and P. miyajii / P. cutis sister with these species including Helicosporidium sp. with bootstrap values > 70 for all methods applied (NJ, MP, ML).

Fig. 4

18S rDNA sequence-structure Neighbor-Joining tree obtained from ProfDistS (Wolf et al. 2008). An alignment of 71 sequence-structure pairs (.xfasta format) of Prototheca and its affiliated species was created using 4SALE (Seibel et al. 2006 and 2008) and encoded by a 12-letter alphabet (Wolf et al. 2014) for reconstruction of this tree. GenBank accession numbers accompany each taxon name. Clades are alternately marked green and blue and are additionally named alongside the tree in accordance with the clade names proposed in the phylogram by Jagielski et al. (2019). Taxa which were manually chosen for the subset are marked bold. The tree is rooted with Chlorella vulgaris FM205854 and Parachlorella kessleri FM205885

Fig. 5

18S rDNA sequence-structure Maximum-Likelihood tree calculated with R (R Core Team, 2018) including a representative subset of 26 sequence-structure pairs from Prototheca and its affiliated species which were manually selected from Fig. 4. Bootstrap values from 100 pseudo-replicates mapped at the internodes are from Maximum-Likelihood (ML), Maximum-Parsimony (MP, obtained from PAUP (Swofford, 2002)) and Neighbor-Joining (NJ, obtained from ProfDistS (Wolf et al. 2008)) analyses. For NJ tree reconstruction the global multiple sequence-structure alignment (.xfasta format) as derived by 4SALE (Seibel et al. 2006 and 2008) was automatically encoded by a 12-letter alphabet (Wolf et al. 2014). For ML and MP tree reconstruction the “one letter encoded” fasta format (12-letter alphabet) as derived by 4SALE (Seibel et al. 2006 and 2008) was used. GenBank accession numbers accompany each taxon name. Clades are alternately marked green and blue and are additionally named alongside the tree in accordance with the clade names proposed in the phylogram by Jagielski et al. (2019). The tree is rooted with Chlorella vulgaris FM205854 and Parachlorella kessleri FM205885

Comparing the sequence-structure tree to a tree based on sequence data only (Figure S3), it is apparent that the bootstrap support is mostly higher although sometimes similar in the sequence-structure tree. The topology differs slightly, e.g. in the P. zopfii /P. bovis + P. ciferrii + P. moriformis clade, P. tumulicola + P. stagnorum clade or within the P. wickerhamii clade.

Phylogeny of Prototheca based on combined ITS2 and 18S rDNA sequence-structure data

A combined ITS2 and 18S rDNA sequence-structure alignment was created from strains which appeared in both the ITS2 and 18S rDNA subsets. NJ, MP and ML trees were calculated on this 15-taxon sequence-structure alignment (Fig. 6). This tree is highly supported by bootstrap values ≥ 70 at all nodes, most of them being > 95. P. cutis and P. miyajii form a clade outside all other Prototheca clades, which are then further divided into a P. moriformis clade and the sister group consisting of P. stagnorum, P. tumulicola, P. blaschkeae, P. ciferrii and P. zopfii /P. bovis. In this supergroup, P. stagnorum and P. tumulicola group together as sister to the remaining taxa, among which P. blaschkeae is sister group to the P. ciferrii and P. zopfii /P. bovis clades.

Fig. 6

Combined 18S and ITS2 sequence-structure Maximum-Likelihood tree calculated with R (R Core Team, 2018) including 15 sequence-structure pairs from Prototheca and its affiliated species. Bootstrap values from 100 pseudo-replicates mapped at the internodes are from Maximum-Likelihood (ML), Maximum-Parsimony (MP, obtained from PAUP (Swofford, 2002)) and Neighbor-Joining (NJ, obtained from ProfDistS (Wolf et al. 2008)) analyses. For NJ tree reconstruction the global multiple sequence-structure alignment (.xfasta format) as derived by 4SALE (Seibel et al. 2006 and 2008) was automatically encoded by a 12-letter alphabet (Wolf et al. 2014). For ML and MP tree reconstruction the “one letter encoded” fasta format (12-letter alphabet) as derived by 4SALE (Seibel et al. 2006 and 2008) was used. Strain numbers accompany each taxon name. The tree is rooted with Chlorella vulgaris CCAP 211/81 and Parachlorella kessleri CCAP 211/11G

Comparing the ITS2 and 18S rDNA subset trees to the tree based on the combined alignment, the trees show similar topology despite several species missing in the combined alignment. In all three trees, P. zopfii / P. bovis (and one P. moriformis strain in the 18S rDNA tree) is the sister group to P. ciferrii, forming a supergroup which then is sister group to P. blaschkeae.

While P. moriformis and P. tumulicola / P. stagnorum are sister groups in the 18S rDNA tree, P. moriformis is sister group to several more species in the ITS2 and the combined tree. P. cutis / P. miyajii form a sister group to all other Prototheca strains in the combined and the 18S rDNA tree (except P. xanthoriae, which sisters with all species except the outgroup in the 18S rDNA tree) but are more closely related to Auxenochlorella protothecoides and Jaagichlorella luteoviridis in the tree based on ITS2 sequence-structure data. Accordingly, these nodes are the nodes showing a relatively low bootstrap support in the very highly supported tree based on the combined 18S rDNA and ITS2 alignment.

Calculating trees on ITS2 and 18S rDNA sequence-structure data of the 15 chosen taxa for the combined alignment separately, it is apparent that the combined ML tree shows more similarity to the 18S rDNA tree (Figure S4) than to the ITS2 tree (Figure S5). Both separate trees show the supergroup consisting of P. moriformis, P. tumulicola, P. stagnorum, P. blaschkeae, P. ciferrii and P. zopfii / P. bovis with a bootstrap value of 100. The relationships between the Prototheca strains in this supergroup vary; however, the combined tree and the 18S rDNA tree show P. blaschkeae as a sister group to P. ciferrii and P. zopfii / P. bovis, while in the ITS2 tree P. blaschkeae is related to P. zopfii / P. bovis, but with low bootstrap support. The ITS2 tree shows P. miyajii / P. cutis as sister group to Auxenochlorella protothecoides and Jaagichlorella luteoviridis, whereas this relationship does not appear in the 18S rDNA or combined ML tree. The bootstrap support of the 18S rDNA tree is overall high, with all but one bootstrap value > 60. In the ITS2 tree, two bootstrap values are lower than 50, with the accompanying nodes being the ones where the ITS2 tree does not show the same topology as either the 18S rDNA or combined tree.

Finally, if we compare ITS2 and 18S rDNA trees, we must not forget that we cannot include P. wickerhamii in the comparison. In order to deduce the phylogeny of the entire genus Prototheca, i.e., including P. wickerhamii, one always needs at least one additional marker gene beside ITS2.

Evolution of protothecean ITS2 secondary structures

ITS2 secondary structures of six Prototheca sequences were constructed using RNAstructure (Fig. 7). In general, these structures folded into the common core structure known for eukaryotes with four helices (Schultz et al. 2005). Protothecean ITS sequences are known to vary in length (Marques et al. 2015). Prototheca sequences in this work were between 269 (all three P. tumulicola strains) and 543 / 544 bp (P. moriformis MF163495 /P. ulmea MF163497) long. ITS2 sequences of P. wickerhamii were significantly longer (1171–1358 bp). ITS2 structures from P. blaschkeae and P. cutis showed an exceptionally large fourth helix, while helix IV of P. stagnorum and P. zopfii was rather short. The third helix of P. ulmea appears to be bifurcated. Figure 8 visualizes the sequence-structure alignment of all Prototheca strains in the subset by a 51% consensus structure. A few bindings in helix II, between helix II and III and at the end of helix III are shown to be 80% conserved where known ITS2 structure motifs (the U-U mismatch in helix II, the triple A between helix II and III and the UGGU motif in helix III) are generally located.
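To make the notion of a 51% consensus structure concrete, the following Python sketch counts, for a set of aligned dot-bracket structures, how often each pair of alignment columns forms a base pair, and keeps the pairs found in at least a given fraction of the sequences. This is an assumed, simplified rule for illustration; the exact procedure implemented in 4SALE may differ, and the function names are hypothetical.

```python
from collections import Counter


def base_pairs(structure):
    """Return the set of paired column indices of a dot-bracket string.

    Characters other than '(' and ')' (dots, alignment gaps) are unpaired.
    """
    stack, pairs = [], set()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def consensus_pairs(aligned_structures, threshold=0.51):
    """Base pairs present in at least `threshold` of the aligned structures."""
    counts = Counter()
    for structure in aligned_structures:
        counts.update(base_pairs(structure))
    n = len(aligned_structures)
    return {pair for pair, count in counts.items() if count / n >= threshold}


if __name__ == "__main__":
    structures = ["((..))", "((..))", "(-..-)", "......"]
    print(sorted(consensus_pairs(structures, threshold=0.51)))
    print(sorted(consensus_pairs(structures, threshold=0.80)))
```

Raising the threshold from 51% to 80% leaves only the most conserved pairs, which corresponds to the distinction drawn above between the consensus structure and the subset of highly conserved pairings.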

Fig. 7

ITS2 secondary structure templates used for homology modeling of Prototheca sequences in the ITS2 database (Schultz et al. 2006; Ankenbrand et al. 2015). Templates were created in RNAstructure (Reuter et al. 2010) based on minimum free energy and constrained folding. The stem, consisting of the last 25 nucleotides of the 5.8S and the first 25 nucleotides of the 28S rDNA is highlighted in purple (5.8S) and blue (28S) using Varna (Darty et al. 2009)

Fig. 8

Visualization of the subset Prototheca sequence-structure alignment without gaps (outgroups, Jaagichlorella and Auxenochlorella species were excluded) by a 51% consensus structure created in 4SALE (Seibel et al. 2006 and 2008). Nucleotide bonds that are at least 80% conserved are marked in yellow. Conservation of the sequence is indicated by red (low conservation) to green (high conservation) color. Nucleotides which are 100% conserved in all sequences are written as A, U, G or C

Despite the differences in length in the Prototheca ITS2 sequences, homology modeling of the secondary structures was possible with just three templates (P. zopfii, P. cutis, P. stagnorum) at 50% consensus level for all Prototheca sequences except P. blaschkeae and P. wickerhamii. The P. zopfii template could be used to model other P. zopfii structures and those of P. bovis, P. cerasi, P. ciferrii, P. cookei, one P. moriformis strain and P. pringsheimii. These species also form a supergroup in the ML sequence-structure tree (Fig. 3). With the P. cutis template, all strains of the P. cutis /P. paracutis + P. miyajii clade could be predicted. The P. stagnorum template could be used for prediction of the secondary structures of other P. stagnorum and the P. tumulicola sequences as well as P. ulmea, P. xanthoriae and P. moriformis sequences with a lower consensus. Therefore, three additional templates were created (P. blaschkeae, P. ulmea and P. xanthoriae).

Given their distant relationship in the phylogenetic trees based on ITS2 sequence-structure data, elongation of helix IV of the ITS2 in P. blaschkeae and P. cutis seems to have occurred independently in the course of evolution.

ITS2 is one of the most effective phylogenetic markers. Its high variability allows the study of closely related organisms, while its conserved structure reveals larger relationships. In most cases, the secondary structure helps to better align variable sequences. Sometimes, however, the length variations and differences even within a genus are already so large that alignments (whether based only on sequence or on sequence-structure information) should be viewed with caution. Prototheca is such an example. Homology is difficult to discern and some individual sequences cannot be aligned at all. On the other hand, if only a few sequences are removed (e.g. those with an extremely elongated fourth helix), the alignment quickly becomes much more compact. With this study, we reconstructed phylogenetic trees on extremely diverse Prototheca sequences, whose sequence-structure information was encoded into a new alphabet, and indeed the results show robust trees similar to those based on other markers (e.g. 18S, LSU or cytb). We encourage the community to draw on additional markers and, by comparison and/or concatenation, to further improve the understanding of the phylogeny of Prototheca and related taxa.

To understand ITS2 length differences further research is needed. Compared to other genera, in terms of extreme sequence differences, it seems possible to discover additional species in the Prototheca species complex. Such species will then put the sequence differences into perspective and/or significantly advance our understanding of length variation (e.g. by expansion, duplication, and/or alternative splicing), or more generally, our understanding of RNA sequence-structure evolution.

Conclusion

In this work, using sequence and structure information simultaneously for two phylogenetic markers (ITS2 and 18S rDNA), we reconstructed generally well-supported phylogenetic trees that are in overall agreement with the trees based on rDNA sequences (mainly LSU data) proposed in literature but show several topological differences to trees calculated on cytb sequences. Prototheca wickerhamii, the main causative agent of human protothecosis, could not be included in the analysis based on ITS2 data since its ITS2 sequences were exceptionally long and could therefore not be aligned with other Prototheca sequences. The phylogenetic trees calculated on sequence-structure alignments of our subset data show Maximum-Likelihood support (> 50) for all but three branches in both the ITS2 and the 18S rDNA tree. Bootstrap support values are generally higher than those from sequence-only analyses (in this study or in the available literature using RNA and/or protein data).

The ITS2 of Prototheca is known to vary in length. Our study shows that out of the Prototheca ITS2 structures we reconstructed, P. blaschkeae and P. cutis displayed an elongated fourth helix. Helix III of P. moriformis (formerly P. ulmea) appears to be bifurcated. Despite the differences in length, a 51% consensus structure showing all but the fourth helix could be visualized with some nucleotide bonds being 80% conserved throughout all examined Prototheca structures.