Systematic Analysis of Diverse Polynucleotide Kinase Clp1 Family Proteins in Eukaryotes: Three Unique Clp1 Proteins of Trypanosoma brucei

The Clp1 family proteins, consisting of the Clp1 and Nol9/Grc3 groups, have polynucleotide kinase (PNK) activity at the 5′ end of RNA strands and are important enzymes in the processing of some precursor RNAs. However, it remains unclear how this enzyme family diversified in eukaryotes. We performed a large-scale molecular evolutionary analysis of the complete genomes of 358 eukaryotic species to classify the diverse Clp1 family proteins. The average number of Clp1 family proteins in eukaryotes was 2.3 ± 1.0, and most representative species had both Clp1 and Nol9/Grc3 proteins, suggesting that the Clp1 and Nol9/Grc3 groups had already formed by gene duplication in the eukaryotic ancestor. We also detected an average of 4.1 ± 0.4 Clp1 family proteins in members of the protist phylum Euglenozoa. For example, Trypanosoma brucei has three genes of the Clp1 group and one gene of the Nol9/Grc3 group. In the Clp1 group proteins encoded by these three genes, the C-terminal domains have been replaced by unique characteristic domains, so we designated these proteins Tb-Clp1-t1, Tb-Clp1-t2, and Tb-Clp1-t3. Experimental validation showed that only Tb-Clp1-t2 has PNK activity against RNA strands. As in this example, N-terminal and C-terminal domain replacement also contributed to the diversification of the Clp1 family proteins in other eukaryotic species. Our analysis also revealed that the Clp1 family proteins in humans and plants diversified through isoforms created by alternative splicing.

Supplementary Information: The online version contains supplementary material available at 10.1007/s00239-023-10128-x.
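As an illustration of how a summary such as "2.3 ± 1.0 Clp1 family proteins per species" can be obtained, the following minimal sketch computes a mean and standard deviation from per-species protein counts. The species names and counts are hypothetical and are not data from this study, and the use of the sample (rather than population) standard deviation is an assumption, since the text does not state which was reported.

```python
from statistics import mean, stdev


def summarize_counts(counts_per_species):
    """Return (mean, sample standard deviation) of per-species protein counts."""
    values = list(counts_per_species.values())
    return mean(values), stdev(values)


# Hypothetical counts for illustration only (not taken from the study).
example_counts = {
    "species_A": 2,
    "species_B": 2,
    "species_C": 4,
    "species_D": 1,
    "species_E": 3,
}

avg, sd = summarize_counts(example_counts)
print(f"{avg:.1f} ± {sd:.1f}")
```

Replacing `stdev` with `statistics.pstdev` would give the population standard deviation instead.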

(*) To create a sequence set of 254 representative Clp1 proteins, similar sequences were removed from the 1,264 detected Clp1 family proteins using CD-HIT. We also excluded the protein sequences of splicing variants. We then added the sequences of representative organisms (Table 1) and the sequences of prokaryotes for comparative analysis.
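The redundancy-removal step described in this note can be pictured with a simplified greedy filter. The sketch below is not the CD-HIT algorithm: it uses Python's `difflib.SequenceMatcher` ratio as a stand-in for sequence identity, and the function name `remove_redundant`, the 0.9 threshold, and the example sequences are all hypothetical. The threshold actually used in the study is not stated in this section.

```python
from difflib import SequenceMatcher


def remove_redundant(seqs, threshold=0.9):
    """Greedy redundancy filter (simplified stand-in, not CD-HIT itself).

    Sequences are visited from longest to shortest; a sequence is kept as a
    representative only if its similarity to every representative kept so
    far is below the threshold.
    """
    representatives = {}
    ordered = sorted(seqs.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in ordered:
        is_novel = all(
            SequenceMatcher(None, seq, rep, autojunk=False).ratio() < threshold
            for rep in representatives.values()
        )
        if is_novel:
            representatives[name] = seq
    return representatives


# Hypothetical sequences for illustration only.
example_seqs = {
    "seq_a": "MKTAYIAKQRQISFVKSHFSRQ",
    "seq_a_variant": "MKTAYIAKQRQISFVKSHFSR",  # seq_a lacking its last residue
    "seq_b": "GGSPLWWNDCEEHTTPLVVGGA",
}

kept = remove_redundant(example_seqs)
print(sorted(kept))
```

In this toy input the near-identical variant is absorbed by the longer `seq_a`, while the unrelated `seq_b` is retained as a second representative.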
Supplementary Table S3 List of Clp1 family proteins (both Clp1 and Nol9/Grc3 groups) used for amino acid sequence alignments

Tb-Clp1-t1-Hisx6 (Amino acid sequence)

Supplementary Table S7 Oligonucleotide
The sequence of oligonucleotide R20-comp is complementary to that of the R20-FAM oligonucleotide. This is also true for the D20-comp and D20-FAM oligonucleotides. FAM, carboxyfluorescein.
Supplementary Fig. S1 Classification of Clp1 family proteins (both Clp1 group and Nol9/Grc3 group) analyzed in this study. A phylogenetic tree based on 254 Clp1 family proteins (both Clp1 and Nol9/Grc3 groups) consisting of 251 protein sequences from 154 eukaryotes, two protein sequences from two archaea, and one protein sequence from a bacterium is shown (see also Supplementary Table S2b). The phylogenetic tree was constructed from their full-length amino acid sequences. The LG+F+R8 model was used for this phylogenetic tree. Midpoint rooting was applied during tree visualization. The scale bar under the tree indicates the number of amino acid substitutions per site. The protein names of representative species and taxonomic groups (Kingdom to Phylum) are listed next to the phylogenetic tree. Prokaryota, Archaea, and Bacteria are described as domains. Types 1-3 are protein types of Euglenozoa Clp1 group proteins classified according to their sequence similarities. Horizontal dotted lines were used to divide the Clp1 and Nol9/Grc3 groups.

Supplementary Fig. S2 Amino acid sequence alignment of Euglenozoa Clp1 group proteins classified as types 1-3. A total of 25 (from eight species; Supplementary Table S3) of 33 Euglenozoa Clp1 family proteins (Supplementary Table S2a), excluding the Nol9/Grc3 group proteins, were used for the sequence alignment. Rectangular boxes above the sequences indicate their protein domain structures (see Supplementary Table S4): Clp1_eN1 (light blue), Clp1_P (green), Clp1_euC1 or Clp1_euC2 or Clp1_euC3 (yellow). The motif name (top), the active site consensus sequence (middle), and its region (arrow at the bottom) are shown for each sequence. The lower part of the figure shows the conservation scores of the alignment. Red boxes indicate areas in which the sequence of the active site is completely conserved. See Supplementary Table S3 for species and protein information used for the analysis.

Supplementary Fig. S3 Amino acid sequence alignment of the Euglenozoa Clp1 group proteins (type 1). Rectangular boxes above the sequences indicate protein domain structures (Supplementary Table S4): Clp1_eN1 (light blue), Clp1_P (green), and Clp1_euC1 (yellow). The lower part of the figure shows the conservation scores of the alignment. See Supplementary Table S3 for species and protein information used for the analysis.

Supplementary Fig. S4 Amino acid sequence alignment of the Euglenozoa Clp1 group proteins (type 2). Rectangular boxes above the sequences indicate protein domain structures (Supplementary Table S4): Clp1_eN1 (light blue), Clp1_P (green), and Clp1_euC2 (red). The lower part of the figure shows the conservation scores of the alignment. See Supplementary Table S3 for species and protein information used for the analysis.

Supplementary Fig. S5 Amino acid sequence alignment of the Euglenozoa Clp1 group proteins (type 3). Rectangular boxes above the sequences indicate protein domain structures (Supplementary Table S4): Clp1_eN1 (light blue), Clp1_P (green), and Clp1_euC3 (orange). The lower part of the figure shows the conservation scores of the alignment. See Supplementary Table S3 for species and protein information used for the analysis.

of Clp1 family proteins in complete eukaryotic genomes (a) Summary of the number of all Clp1 family proteins detected in this study (1,264 proteins). (b) Summary of representative Clp1 family proteins used in the phylogenetic analysis (*).

Domain symbol; Protein domain details; Protein domain name; Pfam ID
Protein domains were identified using two methods: (i) searches against the Pfam database; and (ii) manual amino acid sequence alignments (all symbols are rectangles). *1, Saito et al. 2019; *2, this study.

Supplementary Table S5 Summary of Clp1 and RNA ligase family enzymes (Trl1/Rnl and RtcB) in the species examined in this study
Arabidopsis thaliana Rnl (Q0WL81). For the eukaryotic RefSeq database, see the Dataset section in the Materials and Methods; for the prokaryotic RefSeq database, we used the August 2018 dataset (…nih.gov/genomes/refseq/; last accessed September 17, 2019). The number of intron-containing tRNAs in each species is according to the Genomic tRNA Database (GtRNAdb) (http://gtrnadb.ucsc.edu) (Chan and Lowe 2009).