Introduction

Sequencing technology has progressed rapidly, and associated costs have declined, which has resulted in increased throughput in genome sequencing. Consequently, databases of sequenced genomes are now growing exponentially (RefSeq growth statistics [1]; GenBank and WGS Statistics [2]), and raw sequencing data are accumulating at an even greater rate (e.g., NCBI’s Sequence Read Archive [SRA] database [3]) [4, 5]. It has therefore become challenging to assemble and curate the genomes at the same pace as sequencing takes place [6]. For plants, even though published genomes are mostly complete with respect to the sequencing level, many of the assemblies are fragmented, with a scaffold N50 (the statistical value that defines assembly quality in terms of contiguity) that is below the standard 1 Mb [7]. A fragmented genome assembly will impact how well genes can be annotated, both in regards to entirely missing protein-coding genes and structural errors in the coding sequence (CDS) [8, 9]. Thus, it is challenging to identify all members of a gene family of interest in the predicted proteome of a published genome. This is also true in data sets where both sequence similarity and ab initio gene prediction have been used for predictions [10]. A limitation of using transcriptomic data to support the annotation is that gene family members often have a differential expression with respect to development, cell type, and changes in the environment, and the transcriptome may not be representative of all stages. For example, several members of the P3A plasma membrane (PM) H+-ATPase family from Arabidopsis thaliana are almost exclusively expressed in the gametophyte and not in the sporophyte [11].

Sequence alignment of related genes has identified sequence motifs that characterize members of specific gene families (e.g., as can be found in the Pfam database of protein families and domains [12]), and such motifs can be used as baits when searching new genomes for regions in genes that are similar to the identified sequence motifs. To derive such sequence profiles, or Hidden Markov Models (HMMs), from sequence alignments, methods such as HMMER [13] have been developed [14]. Some families or domains are defined not just by one motif but by the co-occurrence of two or more motifs. Whereas the presence of just one of these motifs may be insufficient to assign a protein to a particular family or domain, the simultaneous occurrence of linked motifs improves the confidence that the sequence belongs to the gene family being considered [15]. Alignment-free analytical methods have also been developed [16].

A problem with fragmented genomes is that they hinder the automatic annotation of genes. In this work, we developed a tool to identify members of gene families in fragmented genomes of plants that have not been fully annotated yet. The method is based on the translation of an entire genome in all six reading frames and the co-occurrence of two sequence motifs. We tested our method on P3A ATPases, a major family of PM H+-ATPases in plants. An advantage of our method is that it avoids using the predicted proteome/transcriptome for a given organism, which can potentially be misleading.

Materials and methods

Annotation of P-type ATPases in Hordeum vulgare

The inventory of P-type ATPase genes in the H. vulgare genome was based on identified homology to the A. thaliana and Oryza sativa inventories. Sequences for A. thaliana and O. sativa were retrieved from refs. [17, 18], respectively. The H. vulgare genome was retrieved from the IPK website together with high-confidence CDS and protein predictions.

To identify P-type ATPases in the H. vulgare cultivar Morex, the functional annotation of the high-confidence protein predictions on the IPK database [19] was first searched for proteins belonging to the PFAM00122 family (E1-E2 ATPase) [20]. The identified sequences were inspected for the DKTGT motif, and all sequences not containing this motif were discarded. Subsequently, the sequences were used for preliminary phylogenetic analysis to determine the closest homologs in O. sativa and A. thaliana. An alignment for each of the H. vulgare CDSs, the corresponding genomic sequence, and the CDS for the closest homolog from O. sativa was created to correct and check the identified ATPases (Supplementary files 1, 2, 3, 4, 5 and 6). This alignment was used to correct the CDSs by hand, compared to the O. sativa CDS, and by inspection of intron–exon splice sites.

Phylogenetic analysis

Sequences were aligned in MEGA6 using multiple sequence comparison by log expectation (MUSCLE) [21]. The phylogenetic analysis was performed with the CIPRES Science Gateway [22]. For Maximum likelihood analysis, the RAxML-HPC2 workflow implemented in the XSEDE tool was used with the following parameters: bootstraps = 1,000; data type = protein; Protein Substitution Matrix = LG; and the rest of the settings were default. For Bayesian inference analysis, the MrBayes algorithm implemented in XSEDE 3.2.6 was used with the following parameters: number of generations = 100,000; data type = protein; Protein Substitution Matrix = LG; number of runs = 2; likelihood model = inverse gamma distribution; number of chains = 8; temperature parameter = 0.05; and the rest of the settings were default.

Building the query sequences

The query sequences were created for each family (P1B, P2A, P2B, P3A, P4, and P5) using the ATPases from A. thaliana, O. sativa, and H. vulgare (Supplementary file 7) and were built by aligning all protein sequences for a family in Genius Prime using MUSCLE with standard parameters. From the alignment, two query sequences were constructed. The first query sequence consisted only of the family-specific motif, the superfamily-specific motif, and the distance between the two. The second query sequence contained the consensus surrounding the family- and superfamily-specific motifs.

Genomes used for identifying P3A ATPases

All genomes used for determining the number of P3A PM H+-ATPases were reference genomes as of October 2021 and were retrieved from NCBI with the exception of H. vulgare and Aeluropus littoralis, for which the genomes were retrieved from IPK in October 2021. Links to all genomes and assemblies used are given in Supplementary Table 1.

Translation of genomes and building and searching databases

To build the databases for interrogation of the genomes, all genomes were imported into CLC Main Workbench 20.0.4. Following this, the genomes were translated in all six reading frames and basic local alignment search tool (BLAST) databases were created in CLC Main Workbench for each of the now translated genomes. All BLAST searches were also carried out using CLC Main Workbench.

cDNA cloning

Seeds of H. vulgare cv. RGT Planet were surface sterilized by removing the husk, submerging the seeds in a bleach solution (80% primo-bleach, 19% water, and 0.1% Triton X-100), and then placing the submerged seeds on a pivoting table for 20 min. After sterilization, the seeds were washed 10 times in sterilized Mili-Q water. Seeds were germinated on sterile wet filter paper for 3 days at 20 °C with 16 h of light. After germination, the seeds were moved to full MS plates and left to grow for 7 days at 20 °C with 16 h of light. Sterile seedlings were homogenized into a fine powder using a clean mortar submerged in liquid nitrogen, from which RNA was extracted with the Qiagen Plant RNeasy Kit following the manufacturer’s protocol. gDNA was removed with the TURBO DNA-free Kit from Ambion following the manufacturer’s instructions. cDNA was produced using the iScript kit from Bio-Rad according to the provided protocol.

Identification and cloning of HvHMA7 and HvHMA1-like

The HvHMA7 transcript was validated by partially cloning it from cDNA of 10-day-old whole seedlings of the cultivar RGT Planet using the following primer pairs: HvHMA7Fp2; 5’-TCCGCAACTGTCAATCT-3’ X HvHMA7Rp10; 5’-TCCTCAAAAAATGTCTTC-3’ and HvHMA7Fp8_BLAST; 5’-AGGCTGCCAACTGCATCAAT-3’ X HvHMA7Rp8_BLAST; 5’-CAAC GGGTAGCCCAACAATG-3’. HvHMA1B transcript expression was confirmed by cloning the first 1850 bp of the gene using the following primer pairs: HvHMA1BFp1_BLAST; 5’-ATGCGGCTTGACTCC-3’ X HvHMA1BRp1_BLAST; 5’-CATGCCCGCCTTTCAACAAA-3’ and HvHMA1BFp9; 5’-CAGGTGGTGCTAGGAACCTG-3’ X HvHMA1BRp9; 5’-CACTGGCACGTTGAGCTAGA-3’. All PCR products were sequenced by Eurofins Genomics.

Statistics

Statistical analysis was performed using GraphPad Prism 9.4.0 using Student’s t-tests.

Results

Generating query sequences for different P-type ATPase families

P-type ATPases (also called E1-E2 ATPases) are a large superfamily of primary active pumps comprising several families in plants (P1B, P2A, P2B, P3A, P4, and P5). All P-type ATPases have the Asp-Lys-Thr-Gly-Thr (DKTGT) motif in the phosphorylation (P) domain where the phosphorylatable aspartate residue resides [23, 24]. Upstream of the DKTGT motif, each family also has a specific motif surrounding the proline helix breaker in transmembrane helix four, namely, P1B: CPC/SPC; P2A/B: PEXL; P3A: PIA; P4: PIS; and P5: PPXXP [24,25,26,27]. We thus considered that the two motifs in close proximity to each other could be used in combination to search the database of translated genomes. A search string for each family only containing the family- and ATPase-specific motifs with a fixed distance between the two was created (see Additional file 14) . We called this approach the two-motif search.

A reference inventory of P-type ATPases in H. vulgare

Several P-type ATPases have been identified in H. vulgare [28,29,30,31], but an exhaustive inventory of this superfamily has yet to be constructed. To test our method, we started by using the high-quality assembled genome of H. vulgare (current N50 = 69.6 Mb). First, we identified and annotated all P-type ATPases using conventional methods. Subsequently, we searched through the H. vulgare genome using the above-described method.

P-type ATPase genes were annotated by retrieving all predictions with a PFAM00122 motif from the high-confidence predictions, which were curated guided by intron–exon splice sites and homology to their orthologues in O. sativa. The HvHMA7 assembled transcript was only identified in Morex V3 and was difficult to curate properly. Therefore, the transcript was validated by partially cloning it from 10-day-old H. vulgare seedlings and searching the transcriptomic shotgun assembly (TSA) database at NCBI. All sequences are presented in Supplementary files 1, 2, 3, 4, 5 and 6. Phylogenetic analysis was performed to compare the ATPases found in H. vulgare with those of O. sativa and A. thaliana. H. vulgare sequences were named by adopting the name from the closest homolog in O. sativa (Fig. 1). HvPAA1 was renamed to HvHMA3 for consistency. A few differences could be found when comparing the number of P-type ATPase genes in O. sativa and H. vulgare. Two orthologues, HvHMA2 and HvHMA2-like, were identified for OsHMA2 in the P1B family. There also appeared to be two orthologues of OsHMA1 in H. vulgare, HvHMA1 and HvHMA1-like. The HvHMA1-like gene has a premature stop codon after codon number 80 and is therefore characterized as a pseudogene. The premature stop codon was confirmed by partially cloning HvHMA1-like from cDNA of 10-day-old seedlings of the cultivar RGT Planet and was also identified in the cultivar Price by searching the TSA database with the CDS of HvHMA1-like. In the P2B family, a second homolog of OsACA10 was identified in H. vulgare, and a homolog of OsACA4/OsACA5 was missing. Furthermore, in the P3A family no ortholog of OsAHA2 could be identified in H. vulgare. For a complete list of the genes, see Table 1.

Fig. 1
figure 1

Phylogenetic analysis of P-type ATPases from A. thaliana, H. vulgare, and O. sativa. Full-length protein sequences were aligned in MEGA6 using MUSCLE with standard parameters. The alignment was then subjected to maximum likelihood analysis and Bayesian inference. For maximum likelihood analysis, the RAxML-HPC2 workflow implemented in the XSEDE tool was used on the CIPRES Science Gateway with the following parameters: bootstraps = 1000; data type = protein; Protein Substitution Matrix = LG; and the rest of the settings were default. For Bayesian inference analysis, the MrBayes v. 3.2.7 tool was used on the CIPRES Science Gateway with the following parameters: aamodelpr = fixed(lg); prset statefreqpr = fixed(empirical); lset rates = invgamma; mcmcp ngen = 100,000; mcmcp nruns = 2; mcmcp nchains = 8; mcmcp temp = 0.05; mcmcp mcmcdiagn = yes; and mcmc diagnfreq = 10,000. Numbers at nodes show likelihood from maximum likelihood analysis, and filled circles indicate full support in Bayesian inference analysis. In the phylogenetic analysis, the orthologue of OsALA6 was excluded as this gene is partial

Table 1 Inventory of P-type ATPases in H. vulgare

Evaluating the two-motif search in A. thalianaO. sativa, and H. vulgare

Three genomes in which all P-type ATPases have been identified are now available, which allowed us to test the two-motif method. We searched the translated genomes of A. thaliana, H. vulgare, and O. sativa with the two-motif string and, with the exception of the P4 family, identified all P-type ATPases of a specific family that did not have an intron between the two motifs (Table 2). To identify all members of the P4 family, the search string was made more flexible to accept a distance between the two motifs that varied by 1 amino acid (Supplementary Fig. 5). As P2A and P2B have the same family motif, the two groups are indistinguishable and need to be added together.

Table 2 Comparison of ATPases identified by searching translated genomes with the two-motif method and the expanded two-motif method. The inventory of the respective species and the number of exons in which the query sequence is present

Modifying the two-motif method to include the area surrounding the family- and P-Type ATPase-specific motifs

As the two-motif method alone falls short regarding the identification of all ATPases within a group, we aimed to modify it by including the consensus sequence surrounding the family- and ATPase-specific motifs. To identify genes with an intron between the two motifs, it is necessary to build a bait sequence with additional information between the motifs. We modified the two-motif search and built an enlarged bait sequence for each family (P1B, P2A, P2B, P3A, P4, and P5) using the entire inventory of P-type ATPases from A. thaliana, H. vulgare, and O. sativa. We aligned sequences from the different families and produced a consensus sequence surrounding the DKTGT motif and family-specific motif for each family. The consensus sequences were some 55 to 70 amino acid residues in length (Supplementary Figs. 1-6). Each consensus sequence was used as a short query sequence to identify the number of genes in a specific family.

Testing the enlarged query sequence on the annotated genomes of A. thaliana, O. sativa, and H. vulgare

To test the enlarged query sequence, we used the full inventories of P-type ATPases from A. thaliana, O. sativa, and H. vulgare. To avoid including the data used to build the query sequence, we rotated organisms out of the query sequence. Thus, for testing against H. vulgare, the query sequence would be created with A. thaliana and O. sativa. After construction, we used the sequence for a BLAST search against the translated genome of H. vulgare. The assumption was that if the DKTGT and the family-specific motifs were located in close proximity to each other and in the correct orientation relative to each other, the hit would be assessed as constituting a P-type ATPase of that family. This way, a gene can be identified even when the motifs are located in different exons.

The testing showed that all genes belonging to the P3A and P5 ATPase families could be scored (Table 2). These families have high overall conservation, which facilitated their identification (Supplementary Table 2 and Supplementary Figs. 4 and 6). Also, these families have the query sequence in two exons. The method identified fewer P1B, P2B, and P4 ATPases than what is present in the inventory. This could have been due to low overall conservation in the families or because several of the genes have the query sequence distributed onto three exons. Lastly, for P2A, the number is overestimated, most likely due to the close similarity between P2A and P2B (see Supplementary Figs. 2 and 3). If P2A and P2B are considered to be one family, all genes where the query sequence is found in one or two exons can be identified. A complete list of the number of exons the query sequences are located in is provided in Supplementary Table 3. Overall, it appears that the number of genes in a family can be relatively well predicted with the exception of genes where the query sequences are spread out onto three exons. To ensure that the result obtained for H. vulgare was consistent, we further tested O. sativa and A. thaliana, with results similar to those from H. vulgare (Table 2). Thus, we assumed that the method could be used to determine the number of P3A and P5 genes in a plant genome and estimate the number of P1B, P2A/B, and P4 genes.

Identification of P3A ATPases in complete assemblies

To further evaluate the ability of our approach to correctly determine the number of P3A ATPases, several different well-sequenced and annotated genomes were retrieved and translated in all six reading frames, and a database was created for each. Using the constructed consensus sequence, the databases were interrogated for the number of P3A ATPases (Supplementary Table 4). The number obtained was compared to the number annotated in the genome found by performing BLAST searches using the protein sequence of AtAHA2 against the protein database for each species in the Kyoto Encyclopedia of Genes and Genomes (KEGG) [32]. From this analysis, it appears that the method can predict the number of P3A ATPases in a genome relatively well, and this applies to both very large genomes such as that of Triticum aestivum (genome size = 14.6 Gb) and smaller genomes such as that of Beta vulgaris (genome size = 540.5 Mb).

Testing the method for the P2 family

We further tested the method in the Glycine max assembly, where P2 family ATPases have already been annotated [33]. The G. max genome assembly used here has a scaffold N50 of 48 Mb and a contig N50 of 419 kb. Twelve P2 Ca2+ ATPases were identified [33]. This is a surprisingly low number considering that G. max is a tetraploid [34] and that the diploid A. thaliana contains 14 P2 ATPases [18]. When searching the translated G. max assembly with the bait made from O. sativa, A. thaliana, and H. vulgare, 23 potential Ca2+ ATPases were identified (Supplementary Table 5), demonstrating that care must be taken when using protein prediction models.

Comparing the number of P3A ATPases in glycophytes and halophytes

Under salt stress, plants extrude Na+ back into the soil to lower the K+/Na+ ratio. This extrusion is done in roots by Salt Overly Sensitive 1 (SOS1), which is a Na+/H+ antiporter, driven by the electrochemical H+ gradient generated by the PM H+-ATPase [35,36,37,38]. The uptake of K+ from the soil into roots, as well as K+ retention in the cytosol [39], through channel proteins and symporters is also dependent on the activity of the PM H+-ATPase [40]. Thus, we speculated that the number of PM H+-ATPases could contribute to the difference between halophytes (salt-tolerant plants) and glycophytes (salt-sensitive plants) with respect to salt tolerance as an expansion of genes and their subsequent diversification could possibly enable a more fine-tuned regulation at the tissue expression level.

To determine if there is a difference in the number of P3A ATPases between halophytes and glycophytes, we only used diploid genomes for the analysis, as the ploidy could influence the results. Halophytic species were selected according to the The Food and Agriculture Organization of the United Nations (FAO) [41]. We first investigated if there was a general expansion in the number of genes between the two groups. No significant difference between the number of protein-coding genes in the two groups could be observed (Fig. 2A). Searching through the translated genomes with the consensus sequence for P3A ATPases earlier identified, the number of PM H+-ATPases was determined for the two groups (Table 3). On average, there were 10–12 P3A ATPase genes per plant genome (Fig. 2B), which could be normalized to about 0.00035% of protein-coding genes (Fig. 2C). The mean numbers were slightly lower in the halophyte group (10.3 ± 0.9 vs. 13.8 ± 1.6), but due to the sample size being limited by the number of species available for analysis it could not be ascertained whether the difference was significant.

Table 3 The number of genes and PM H+-ATPases and the ratio between PM H+-ATPases and the number of genes in glycophytes and halophytes. Additional references for numbers of protein-coding genes are given in Supplementary Table 6
Fig. 2
figure 2

Comparison of the number of plasma membrane (PM) H+-ATPases (P3A ATPases) in halophytes and glycophytes. The total numbers of protein-coding genes in the genomes of halophytes and glycophytes are listed in Table 3. Comparison of the number of PM H+-ATPases in the listed halophytes and glycophytes. Comparison of PM H+-ATPases in halophytes and glycophytes after normalization to the total number of protein-coding genes. Whiskers are minimum and maximum, the box is upper and lower quartile and median. ns, no significance

Discussion

An expanded two-motif method for determining the number of genes in a family

We have described a simple and unbiased method for determining the number of genes within a gene family that circumvents the use of a transcriptome and/or predicted proteome (Fig. 3). The genome of interest is translated in all six reading frames to produce six protein sequences for each scaffold or chromosome, be it complete or fragmented. The protein sequences are used to build a BLAST database that can be searched with a short query sequence for the gene family in question. To apply the method, a gene family in question must contain two motifs in close proximity to each other. The query sequence can then be produced by aligning the protein sequences, and the consensus sequence adjacent to the two motifs can then be used to search the translated genome database. Once a sequence containing the motifs has been identified, it must be verified that (1) the motifs are located in the correct orientation relative to each other and (2) in close proximity to each other. If the hit fulfills these criteria, it can be considered to constitute a gene in the family. The method described can be used to identify genes of specific families in the P-type ATPase superfamily and most likely also in other gene families.

Fig. 3
figure 3

Flow chart of the procedure used to predict the number of genes for a family in a species of an incompletely assembled genome using the expanded two-motif method. An overview of the workflow for identifying all members of a specific family. Step 1: A query sequence is built by aligning sequences from the desired family. The query sequence should contain two motifs, with at least one being unique to the family. Step 2: The genome for the organism of interest is translated in all six reading frames. Step 3: A BLAST database is built using the translated genome. Step 4: The BLAST database is searched with the query sequence. Step 5: The BLAST result is inspected; the orientation and distance (if there are few or no gaps between the motifs in the family) of the two motifs is used to determine if a hit can constitute a member of the gene family. In our hypothetical example, inspection revealed hits on three hypothetical contigs: 1050, 3000, and 5045. For Contig 1050, the two motifs are close and correctly oriented. For 3000, the two motifs are located on each side of an intron; thus, the position in the genome is used to verify the orientation and establish if the hits are located relatively close to each other. 5045 is a negative hit. In this contig, there are two hits, but inspection revealed that the motifs are incorrectly oriented and therefore is deemed not to constitute a member of the family

Comparison with other methods

Tools are available for such analysis of novel genomes, as the Pfam database is based on sequence alignments and identification of common sequence motifs in gene family members. P-type ATPases form a large superfamily of primary active pumps divided into five major families. A unique motif for all members of the superfamily is the sequence DKTGT, but Pfam entries for the families remain to be identified.

An HMM has been used to identify soybean P2-type Ca2+ ATPase genes in the G. max genome [33]. Using the automatic annotation according to these criteria, 12 P2-type ATPases were identified, which is a number far below the 23 P2A/B ATPases we identified in this genome using our method based on the expanded two-motif method.

Inventory of P-type ATPases in H. vulgare and identification of new family members

Using conventional methods, we identified all P-type ATPases presently annotated in the H. vulgare cultivar Morex V3 HC genome. HvHMA7, HvACA3, HvACA10, HvAHA8, and HvALA6 were found to be incorrectly annotated when compared to the homologs from A. thaliana and O. sativa but could be corrected based on homology, intron–exon splice sites, and cloning.

We identified two novel P-type ATPases, namely, HvHMA1-like and HvHMA2-like. HvHMA2 and OsHMA2 are both involved in the root-to-shoot transport of zinc and belong to the same phylogenetic sub-clade of P1B ATPases. Among monocots, only T. aestivum has recently been reported to have two genes belonging to the same clade as OsHMA2 [61]. HvHMA2-like was found to be a paralogue of HvHMA2 and is located approximately 50 kb upstream of HvHMA2. Therefore, this gene could be the result of a gene duplication event by unequal crossover.

HvHMA1-like was found to have high homology to HvHMA1 and OsHMA1, but HvHMA1-like has a premature stop codon after codon number 80, and neither HvHMA1 nor OsHMA1 has a premature stop codon. HvHMA1-like is expressed and can be cloned from cDNA from 10-day-old whole-seedling cDNA. This suggests that the N-terminal region of the protein is expressed, but its role, if any, for example, in regulating heavy metal transport, remains unknown.

Evaluation of methods

We first surmised that the presence of two distinct motifs near each other with a fixed distance between them could be used to identify the genes in a specific family. As we tested this hypothesis, we found that even though the two motifs were relatively close together, a large part of the sought-after genes had an intron between the two motifs and could not be identified using this method. This shortcoming limits the effectiveness of the approach; however, though it can identify genes without an intron extremely well.

To address this shortcoming, we devised a method that expanded on the two-motif method to include more information between and surrounding the two motifs. The expanded two-motif method can successfully identify genes in which the two motifs are separated by an intron but fails to identify genes when the bait is distributed across three exons. This could potentially be negated by including more sequence information on either side of the motifs. However, if this is done, it might not be a feasible method for fragmented genomes where continuity is low.

P3A ATPases in glycophytes and halophytes

We used the expanded two-motif approach to approximate the number of P3A ATPases in different species divided into two groups, glycophytes and halophytes. We aimed to determine if there is a genetic difference between the two groups in regards to P3A ATPases. We found no significant difference using the species selected here when the data were normalized to the number of protein-coding genes in each species. However, it may be too early to establish whether P3A ATPases have been reduced or expanded in halophytes as part of their salt tolerance mechanism. The number of species used for analysis was rather limited (as we opted to only use diploids), and the variability within each group made the difference not significant. Future studies involving a larger number of species in each group will provide an explicit answer about the possible causal link between salinity tolerance and the number of PM H+-ATPase copies.

Advantages and limitations of the methods

The advantage of the expanded two-motif method described here is that it can be used on poorly annotated and poorly assembled genomes and on any gene family in which two unique sequence motifs occur in close proximity to each other.

When using the two-motif method when the distance between the two motifs is fixed, the rate of false positives will only be impacted by the probability of the two motifs randomly occurring with the specific distance between them. This was only observed once during the testing for HvP5. The larger the distance between the two motifs, the higher the possibility of them being split onto two scaffolds/contigs and vice versa or being separated by an intron. The distance should therefore be kept relatively short.

For the expanded two-motif method, if there is an intron between the motifs, it does not necessarily prevent identification but requires more work. The BLAST output will show two hits from the same scaffold/contig, and it will be necessary to investigate the position of the two hits, both regarding the orientation of the two compared to each other and the distance between them. Furthermore, as is evident from the testing, when the query sequence is distributed onto three exons, it is impossible to identify the family member using this approach. The method described will identify the location of the two motifs and thus will not identify the whole gene. Therefore, the method is prone to error because it cannot differentiate between true genes and pseudogenes. The expanded two-motif method proved useful for identifying gene members of the superfamily of P-type ATPases. Using the expanded two-motif method to screen well annotated genomes, we successfully identified all members of P3A and P5 ATPases even though the two motifs in some members of these families are separated by an intron.