# A new efficient method for analyzing fungi species using correlations between nucleotides


## Abstract

### Background

In recent years, DNA barcoding has become an important tool for biologists to identify species and understand their natural biodiversity. The complexity of barcode data makes it difficult to analyze quickly and effectively. Manual classification of these data cannot keep up with the rate at which available data are increasing.

### Results

In this study, we propose a new method for DNA barcode classification based on the distribution of nucleotides within the sequence. By adding the covariance of nucleotides to the original natural vector, the augmented 18-dimensional natural vector makes good use of the available information in the DNA sequence. The accurate classification results we obtained demonstrate that this new 18-dimensional natural vector method, together with the random forest classifier algorithm, can serve as a computationally efficient identification tool for DNA barcodes. We performed phylogenetic analysis on the genus *Megacollybia* to validate our method. We also studied how effective our method was in determining the genetic distance within and between species in our barcoding dataset.

### Conclusions

The classification performs well on the fungi barcode dataset with high and robust accuracy. The reasonable phylogenetic trees we obtained further validate our methods. This method is alignment-free and does not depend on any model assumption, and it will become a powerful tool for classification and evolutionary analysis.

## Keywords

DNA barcoding; Fungi species; Classification; Correlations between nucleotides; 18-dimensional natural vector; Phylogenetic analysis

## Abbreviations

- BOLD: Barcode of Life Data System
- CBOL: Consortium for the Barcode of Life
- COI: Cytochrome c oxidase
- ITS: Internal transcribed spacer
- LDA: Linear discriminant analysis
- ROC: Receiver operating characteristic curve

## Background

The identification and phylogenetic analysis of living species are crucial tasks for understanding natural biodiversity. Conventional taxonomy based on morphological and ecological data is often challenging: it requires a highly experienced taxonomist and is usually quite time consuming. Moreover, previous studies have revealed that traditional phenotype-based taxonomy may lead to misidentification [1]. Advancements in sequencing and computational technologies enable the production of a large number of DNA sequences in a very short time. Genomes contain all the genetic material of organisms and are becoming the major source of new information for our understanding of evolutionary relationships [2]. However, the complexity and large size of genomes make them unsuitable for distinguishing species rapidly. In the last two decades, researchers have proposed using the information from one or a few gene regions, termed DNA barcodes, to easily discriminate species [1].

A DNA barcode is a standardized short fragment (500–800 bp) of a DNA sequence that characterizes and identifies the species of a specimen [3]. The gene region must satisfy several properties. For example, the fragment must contain significant species-level genetic diversity. Specimens of the same species should have identical fragments, but the fragments from specimens of different species should differ. In addition, the fragment must be adaptable and have conserved primer binding sites that allow it to be readily amplified by PCR [4]. Several studies have established and described the significance of this approach for taxonomic work. DNA barcoding has been developed with great success for identifying groups of living species, including animals [5], plants [6], fungi [3], bacteria [7] and algae [8]. Consequently, combining DNA barcoding with traditional taxonomic tools could accurately and effectively reveal biodiversity. Although researchers are trying to find a suitable biomarker for the discrimination of all taxa, a universal DNA barcode has not been found. The region of the mitochondrial gene cytochrome c oxidase (COI) [9] can serve as a DNA barcode for most animal groups. The internal transcribed spacer (ITS) region has been recommended as the most appropriate universal DNA barcode marker for fungi [3, 10].

Global online workbenches have established databases for DNA barcode records, such as the Consortium for the Barcode of Life (CBOL, http://barcoding.si.edu) and the Barcode of Life Data System (BOLD, http://www.barcodinglife.org). These databases are freely available to researchers interested in DNA barcoding. By comparing a DNA barcode sequence obtained from an unidentified specimen with sequences from known species in the reference database, we can determine the species or group of the new specimen. In March 2018, BOLD [11] contained more than 6 million specimens with barcode records, belonging to over 277,013 species. The scale of these data motivates methods that assign specimens to species accurately and quickly.

Traditional barcoding analyses usually rely on classical phylogenetic approaches, such as neighbor-joining [12] and maximum parsimony [13]. Some statistical models for data analysis have also been proposed, such as Bayesian methods [14], decision theory [15] and some other approaches [16, 17]. These methods have greatly contributed to DNA barcoding research. However, a number of challenges remain, including the proper choice of threshold, computational efficiency, and the accuracy of classification. In this study, we investigate a new alignment-free method for DNA barcoding. Our focus is on performing this assignment task accurately and efficiently.

In this work, we develop a new representation for characterizing DNA sequences, based on the distribution of nucleotides within the sequence. To accomplish this, we add the covariance of nucleotides to the original natural vector. As a result, the new 18-dimensional natural vector makes good use of the available information in the sequence. We used the fungi barcode dataset as the test dataset because we want to improve the relatively low accuracy of fungi barcode classification in previous studies [3, 18]. We also analyzed and evaluated the genetic distance within and between species of the barcoding dataset. On the one hand, the one-to-one correspondence between DNA barcodes and their 18-dimensional vectors ensures that the barcoding sequence information is not lost. On the other hand, because the difference between intraspecific variation and interspecific variation is reflected in vector distances, this method shows promise for distinguishing species and assigning specimens to the correct species with higher accuracy and in less time. Furthermore, we investigated the phylogenetic relationships between species using fungi DNA barcode sequences.

## Results

### Convex hull analysis of DNA barcodes

### Classification performance

Although the distribution of barcode specimens across species is uneven, we want to explore the performance of assigning barcode sequences to species using natural vectors. After identifying and removing the species with only one barcode, we were left with 56,392 sequences from the original dataset. The 56,392 barcodes belong to 8826 species and 1465 genera, as well as 349 families, 107 orders and 30 classes. We evaluated the assignment in a similar way to that described above. The out-of-bag error calculated on this dataset is shown in Fig. 2b. We selected five and twenty sample trees to compute the identification. The prediction accuracy for species assignment was 96.87% with five sample trees and 99.91% with twenty sample trees. In addition, we studied the efficacy at higher taxonomic levels. The calculated error rates for higher taxonomic groups are about 0.2–0.5% using the random forest classifier with twenty sample trees. We also drew the ROC curves for the classification results of class, order, family, genus and species, shown in Fig. 3b. In this figure, the areas under the ROC curves are greater than 0.99. Consequently, these results indicate that the ITS region is a suitable barcode for species and also for higher taxonomic levels. Moreover, our classification success demonstrates that the new 18-dimensional natural vector representation method, together with the random forest classifier algorithm, can serve as an effective identification tool for DNA barcodes.

### Genetic distance and statistical analysis

We use the genus *Megacollybia* and the family *Massarinaceae* as examples. After computing the 18-dimensional natural vectors for each species in the genus *Megacollybia*, the Euclidean distance was used to calculate intraspecific and interspecific differences. A histogram of the resulting distance distribution is shown in Fig. 4a. On the one hand, there is a clear gap between the mean intraspecific distance (5.02) and the mean interspecific distance (55.84). On the other hand, the intraspecific distances of most species are much smaller than the mean interspecific distance, and most interspecific distances are much larger than the mean intraspecific distance. We also studied the distance variability at the genus level. The significant barcode gap found between intragenus and intergenus distances is shown in Fig. 4b: most of the intergenus distances are much larger than the mean intragenus distance (307.89). The genus *Megacollybia* and the family *Massarinaceae* are the two examples that show the barcode gap most clearly. Additional file 1: Figures S1-S6 show more barcode gap examples. These analyses confirm the potential of the ITS region for barcoding fungi at different taxonomic levels.
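The barcode gap comparison described in this subsection can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `barcode_gap` is hypothetical, and the toy two-dimensional points stand in for 18-dimensional natural vectors.

```python
from itertools import combinations
from math import dist
from statistics import mean

def barcode_gap(vectors, labels):
    """Return (mean within-group distance, mean between-group distance)."""
    intra, inter = [], []
    for i, j in combinations(range(len(vectors)), 2):
        d = dist(vectors[i], vectors[j])  # Euclidean distance
        if labels[i] == labels[j]:
            intra.append(d)
        else:
            inter.append(d)
    return mean(intra), mean(inter)

# Toy data (assumed for illustration): two groups of two vectors each.
vectors = [(0.0, 0.0), (3.0, 4.0), (30.0, 40.0), (33.0, 44.0)]
labels = ["sp1", "sp1", "sp2", "sp2"]
intra_mean, inter_mean = barcode_gap(vectors, labels)
```

A large ratio between the second and the first returned value corresponds to the kind of gap reported in Fig. 4.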

### Phylogenetic analysis on DNA barcodes for species

For the genus *Megacollybia*, we computed the Euclidean distance between pairs of sequences after calculating the 18-dimensional natural vector for each barcode. We then applied the single linkage algorithm [19] to reconstruct the phylogeny shown in Fig. 5. Sequences within the same species are shown in the same color. From the phylogenetic tree, we can see that nearly all the sequences from the same species cluster together in the same clades, except for two sequences from *Megacollybia platyphylla*. The two sequences *Megacollybia platyphylla 10* and *Megacollybia platyphylla 18* do not group with the other barcode specimens in *Megacollybia platyphylla*. To better explore and understand this situation, we reconstructed the evolutionary relationship of the 26 barcodes from *Megacollybia platyphylla*, as shown in Fig. 6. We found that the two sequences are at the base of the other 24 barcodes, and we marked them with stars. We calculated the length and GC content of the 26 barcode sequences. The lengths of the other 24 barcodes are around 657. The two outlying sequences are significantly longer and also have higher GC content than the others. This may be the reason that the two barcodes are not in the same clade as the other sequences in the phylogenetic tree. Overall, the reasonableness of the phylogenetic trees supports the validity of our method on this dataset.
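As an illustration of single linkage clustering on Euclidean distances, the sketch below repeatedly merges the two clusters whose closest members are nearest to each other. It is a naive implementation written for clarity, not the procedure of [19]; the function name `single_linkage` and the toy points are assumptions.

```python
from math import dist

def single_linkage(points, n_clusters):
    """Merge clusters by smallest minimum pairwise distance until
    n_clusters remain. Returns (clusters, merge history)."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters], merges

# Toy points (assumed): two tight pairs and one distant outlier.
points = [(0, 0), (1, 0), (10, 0), (11, 0), (30, 0)]
clusters, merges = single_linkage(points, 2)
```

The merge history, read from first to last, gives the branching order of the resulting tree; a sequence that joins late, like the outlier here, sits at the base of the other clades.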

## Discussion

### Comparison with other methods

We applied the 18-dimensional and 12-dimensional natural vector methods on the genus *Pachyphloeus* consisting of 12 species in the barcode dataset. In this genus, there are 8 species containing only one sequence and the other 4 species are comprised of 2, 3, 10, 13 sequences, respectively. The phylogenetic trees constructed by the 12-dimensional natural vector and 18-dimensional natural vector methods are shown in Additional file 1: Figure S7 and Figure S8.

In Additional file 1: Figure S7, the three sequences in species *Pachyphloeus marroninus* colored orange are not clustered together. Sequence *Pachyphloeus marroninus 3* is wrongly clustered closer to *Pachyphloeus melanoxanthus 2* than the other two sequences in the species *Pachyphloeus marroninus*. By contrast, using the 18-dimensional natural vector method, we find that *Pachyphloeus marroninus 3* is classified near the other sequences belonging to the same species shown in Additional file 1: Figure S8. This shows that the 18-dimensional natural vector method performs better than the 12-dimensional natural vector method in phylogenetic analysis.

We also tested the commonly used multiple sequence alignment method and the k-mer method on the same genus. The phylogenetic trees are shown in Additional file 1: Figures S9 and S10. For the largest species *Pachyphloeus sp.*, neither multiple sequence alignment nor the k-mer method performed well, because the barcodes from this species do not cluster together and instead cluster with other species’ barcodes. We marked the wrongly clustered sequences with stars in Additional file 1: Figures S9 and S10. Comparing these with Additional file 1: Figure S8, we can see that our 18-dimensional natural vector method outperforms those two methods. In addition, the 18-dimensional natural vector method is alignment-free, and its computation time is much shorter than that of the multiple alignment method. In conclusion, the new natural vector method can perform phylogenetic analysis more precisely and quickly.

### Significance for the four features in 18-dimensional natural vector

The 18-dimensional natural vector of a sequence contains four kinds of features: the numbers, the mean positions and the normalized variations of the four nucleotides, as well as the covariances between different nucleotides. All four features are integral parts of the 18-dimensional natural vector. To check which features are more important, we carried out the following process. First, we deleted one of the four features of the 18-dimensional natural vector. We then used the other three features to represent the sequence and performed phylogenetic analysis on the same dataset. The four resulting trees are shown in Additional file 1: Figures S11-S14. By comparing these four trees with that of the 18-dimensional natural vector method (Additional file 1: Figure S8), we analyzed the importance of each feature. Among these four trees, the first and second differ more from the tree constructed using the 18-dimensional natural vector than the third and fourth do. We marked the mixed-up clades with stars in the phylogenetic trees shown in Additional file 1: Figures S11 and S12. Although the clades in Additional file 1: Figures S13 and S14 are not as good as those of Additional file 1: Figure S8, these two trees are better than the trees in Additional file 1: Figures S11 and S12. This implies that the number and mean position features are more important than the normalized variation and covariance features. In addition, the number and mean position features are of order of magnitude 10^2, while the normalized variation and covariance features are of order 10^1. Because the Euclidean distance sums squared differences, the number and mean position features dominate the distance between vectors in this study.

## Conclusions

In this study, we present a new method for DNA barcode classification based on the distribution of nucleotides. We consider statistical information comprising the numbers, mean positions, moments and correlations between nucleotides. These features are used to characterize a DNA sequence, forming an 18-dimensional natural vector. Our method has the following main advantages: (1) It contains nearly all important information within a sequence. (2) The mapping between a DNA barcode and its 18-dimensional vector is one-to-one. (3) The covariance measures the correlations between the four nucleotides. We treat the Euclidean distance between the vectors as the similarity metric. We then examine the classification performance at the class, order, family, genus and species levels by applying the 18-dimensional natural vector method to barcode sequences. Furthermore, barcode gap analysis and phylogenetic analysis are used to validate the method. The highly accurate results and computationally efficient algorithm provide a new quantitative way of identifying and analyzing evolutionary relationships among species based on DNA barcodes in molecular biological study.

## Methods

### Datasets

The dataset used in this study was downloaded from the Barcode of Life Data System (BOLD, http://www.barcodinglife.org). The original dataset contains a total of 88,650 barcodes from 28,058 fungi species. We removed sequences without full taxonomic information and retained the data pertaining to the ITS region of fungi. The remaining 72,844 barcodes, from 25,278 different species with complete taxonomic information, are analyzed in this study. The current classification scheme for fungi is used in this dataset, in which the taxonomic classification is split into 38 classes, 135 orders, 448 families and 2337 genera. We first verified the convex hull principle by analyzing these 72,844 barcodes. The distribution of sequences across species is uneven. Among these 2337 genera, 597 genera have only one barcode member. Excluding these 597 sequences, the other 72,247 sequences belong to 24,681 species, 1740 genera, 382 families, 117 orders and 33 classes, and are used for further classification analysis in this study.
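The removal of groups with a single barcode member can be sketched as below. This is only an illustration of the filtering step; the record layout and the field names `id` and `genus` are hypothetical and do not reflect the BOLD export format.

```python
from collections import Counter

def drop_singletons(records, key):
    """Keep only records whose group (e.g. genus) has more than one member."""
    counts = Counter(record[key] for record in records)
    return [record for record in records if counts[record[key]] > 1]

# Toy records; the field names "id" and "genus" are hypothetical.
records = [
    {"id": "b1", "genus": "Megacollybia"},
    {"id": "b2", "genus": "Megacollybia"},
    {"id": "b3", "genus": "Pachyphloeus"},
]
kept = drop_singletons(records, "genus")
```

Applied to the counts reported in this section, removing the 597 single-member genera from 72,844 barcodes leaves 72,844 − 597 = 72,247 sequences.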

### Features of natural vector

Let *S* = (*s*_{1}, *s*_{2}, *s*_{3}, ⋯, *s*_{N}) be a nucleotide sequence of length *N*, where *s*_{i} ∈ {*A*, *C*, *G*, *T*}, *i* = 1, 2, 3, ⋯, *N*.

For each nucleotide *k*, define *n*_{k}, *μ*_{k} and \( {D}_2^k \) to describe the number of nucleotide *k*, the mean position of nucleotide *k*, and the normalized variation of the position of nucleotide *k* in sequence *S*, respectively. The features are defined as follows:

where *k* represents the four nucleotides.
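For reference, the three features named in this subsection are commonly written in the natural vector literature as shown below. This is a sketch under stated assumptions: the indicator \( w_k \) is notation introduced here for illustration, and the normalization of \( {D}_2^k \) by \( n_k N \) is assumed because it agrees with the covariance example for *S* = *ACGTAC* given later in the Methods.

\[
w_k(s_i)=\begin{cases}1, & s_i=k\\ 0, & \text{otherwise}\end{cases},\qquad
n_k=\sum_{i=1}^{N} w_k(s_i),\qquad
\mu_k=\frac{1}{n_k}\sum_{i=1}^{N} i\, w_k(s_i),\qquad
D_2^k=\sum_{i=1}^{N}\frac{\left(i-\mu_k\right)^2 w_k(s_i)}{n_k N}
\]

For example, in *S* = *ACGTAC* the nucleotide *A* occupies positions 1 and 5, so \( n_A = 2 \) and \( \mu_A = (1+5)/2 = 3 \).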

### Covariance between nucleotides

For two point sets *A* = {*a*_{1}, *a*_{2}, …, *a*_{n}} and *B* = {*b*_{1}, *b*_{2}, …, *b*_{m}} in *R* that satisfy *a*_{1} < *a*_{2} < … < *a*_{n} and *b*_{1} < *b*_{2} < … < *b*_{m}, the covariance of the two sets *A* and *B* can be calculated in two cases. If *m* = *n*, we define

If *m* ≠ *n*, assume that *m* > *n*. Then the covariance between *A* and any *n* values in *B* is computed, and the average of these \( {C}_m^n \) results is taken as the final covariance *Cov*(*A*, *B*) between the two point sets.

Consider two nucleotides *k*_{1} and *k*_{2} in a sequence *S* of length *N*. Assume that the positions at which *k*_{1} appears in the sequence *S* are *A* = {*a*_{1}, *a*_{2}, ⋯, *a*_{n}}, and the positions of *k*_{2} are *B* = {*b*_{1}, *b*_{2}, ⋯, *b*_{m}}. Then the covariance formula of *k*_{1} and *k*_{2} is defined as

If *k*_{1} = *k*_{2}, the corresponding formula reflects the variance of the position of the nucleotide.

We consider the sequence *S* = *ACGTAC* as an example. Based on *μ*_{A} = 3, *μ*_{C} = 4 and the positions of nucleotides *A* = {1, 5} and *C* = {2, 6} in the sequence, we get *Cov*(*A*, *C*) = [(1 − 3)(2 − 4)/2 + (5 − 3)(6 − 4)/2]/6 = 2/3. The other covariances can be calculated in the same way.
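The worked example can be checked with a short script. The sketch below assumes the equal-count form \( Cov(k_1,k_2)=\frac{1}{nN}\sum_{i=1}^{n}(a_i-\mu_{k_1})(b_i-\mu_{k_2}) \), which reproduces the value 2/3 above; it deliberately does not cover the case *m* ≠ *n*. All function names are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def positions(seq, k):
    """1-based positions at which nucleotide k occurs in seq."""
    return [i for i, s in enumerate(seq, start=1) if s == k]

def mean_position(seq, k):
    # Assumes k occurs at least once in seq.
    pos = positions(seq, k)
    return Fraction(sum(pos), len(pos))

def normalized_variation(seq, k):
    """Assumed form: sum of squared deviations divided by n_k * N."""
    pos, mu = positions(seq, k), mean_position(seq, k)
    return sum((p - mu) ** 2 for p in pos) / (len(pos) * len(seq))

def covariance_equal_counts(seq, k1, k2):
    """Covariance of positions when k1 and k2 occur equally often."""
    a, b = positions(seq, k1), positions(seq, k2)
    if len(a) != len(b):
        raise ValueError("this sketch only covers the case m = n")
    mu_a, mu_b = mean_position(seq, k1), mean_position(seq, k2)
    total = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b))
    return total / (len(a) * len(seq))

cov_ac = covariance_equal_counts("ACGTAC", "A", "C")
```

Under the same assumption, calling `covariance_equal_counts("ACGTAC", "A", "A")` gives the same value as `normalized_variation("ACGTAC", "A")`, matching the remark that the covariance of a nucleotide with itself reflects the variance of its positions.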

### A novel 18-dimensional natural vector with covariance

The covariance is computed for each pair of different nucleotides in the sequence *S*. The number of pairs of nucleotides is \( {C}_4^2=6 \). Thus, the natural vector with covariance of a nucleotide sequence *S* is given as follows:

In this study, we utilized the novel 18-dimensional natural vector with covariance to analyze DNA barcodes. This method is alignment-free and does not depend on any model assumptions.
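One way to lay out the 18 components is sketched below. The grouping follows the four kinds of features described in this article (numbers, mean positions, normalized variations, covariances), but the ordering of the entries is an assumption made here for illustration.

\[
\left(n_A, n_C, n_G, n_T,\; \mu_A, \mu_C, \mu_G, \mu_T,\; D_2^A, D_2^C, D_2^G, D_2^T,\; Cov(A,C), Cov(A,G), Cov(A,T), Cov(C,G), Cov(C,T), Cov(G,T)\right)
\]

The count is 4 + 4 + 4 + 6 = 18, and dropping the six covariance entries leaves the 12-dimensional natural vector used for comparison in the Discussion.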

### Convex hull

In computational geometry, the convex hull of a set of points is the smallest convex set containing the points. The points can lie in a high-dimensional space. In this study, we classify DNA barcodes into different species based on the corresponding disjoint convex hulls of the 18-dimensional natural vectors of the sequences. Because computing convex hulls directly in high-dimensional spaces requires extensive computational time, we instead used the following method to test whether two high-dimensional convex hulls constructed from two point sets are disjoint.

### Linear programming approach

Assume two point sets *A* = {*a*_{1}, *a*_{2}, ⋯, *a*_{n}} and *B* = {*b*_{1}, *b*_{2}, ⋯, *b*_{m}} in *R*^{k}. The method checks whether there are two groups of coefficients *λ*_{i} and *μ*_{j}, 1 ≤ *i* ≤ *n*, 1 ≤ *j* ≤ *m*, such that the following equation holds:

\( {\sum}_{i=1}^n{\lambda}_i{a}_i={\sum}_{j=1}^m{\mu}_j{b}_j \), where \( 0\le {\lambda}_i,{\mu}_j\le 1,\kern0.5em {\sum}_{i=1}^n{\lambda}_i=1,\kern0.5em {\sum}_{j=1}^m{\mu}_j=1 \).

We use the linear programming approach to check whether two such groups of coefficients exist. If they do, the two convex hulls of the two point sets intersect; otherwise, the two convex hulls are disjoint [20].
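The feasibility check can be sketched with an off-the-shelf solver. The example below assumes NumPy and SciPy are available and uses `scipy.optimize.linprog` with a zero objective, so the solver only has to decide whether the constraints can be satisfied; the function name `hulls_intersect` and the toy point sets are hypothetical, and this is not the implementation of [20].

```python
import numpy as np
from scipy.optimize import linprog

def hulls_intersect(A, B):
    """Return True if the convex hulls of point sets A and B share a point."""
    A = np.asarray(A, dtype=float)  # shape (n, k)
    B = np.asarray(B, dtype=float)  # shape (m, k)
    n, m = len(A), len(B)
    # Unknowns: lambda_1..lambda_n followed by mu_1..mu_m.
    # k rows: sum_i lambda_i * a_i - sum_j mu_j * b_j = 0
    eq_points = np.hstack([A.T, -B.T])
    # 2 rows: the lambdas sum to 1 and the mus sum to 1.
    eq_sums = np.vstack([
        np.concatenate([np.ones(n), np.zeros(m)]),
        np.concatenate([np.zeros(n), np.ones(m)]),
    ])
    A_eq = np.vstack([eq_points, eq_sums])
    b_eq = np.concatenate([np.zeros(A.shape[1]), [1.0, 1.0]])
    # Zero objective: any feasible point is optimal.
    res = linprog(c=np.zeros(n + m), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * (n + m))
    return res.status == 0  # 0 means a feasible solution was found

square = [(0, 0), (1, 0), (0, 1), (1, 1)]
overlapping = [(0.5, 0.5), (2, 2), (2, 0.5)]
separate = [(3, 3), (4, 3), (3, 4)]
```

Here `hulls_intersect(square, overlapping)` is feasible because the point (0.5, 0.5) lies in both hulls, while `hulls_intersect(square, separate)` is infeasible. The number of unknowns is only *n* + *m*, which is why this test scales better than constructing the hulls explicitly in 18 dimensions.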

### Linear discriminant analysis

Linear discriminant analysis (LDA) is a generalization of Fisher’s linear discriminant. If two groups are linearly separable, we can use linear discriminant analysis. Linear separability means that the groups can be separated by a linear combination of features [21]. This means the two convex hulls of the two point sets do not intersect.

Suppose a data set *A* of *n* points is partitioned into *k* classes as *A* = {*Π*_{1}, ⋯, *Π*_{k}} with a set of features \( \overrightarrow{x} \), where the *i*th class *Π*_{i} contains *n*_{i} points and \( {\sum}_{i=1}^k{n}_i=n \). We try to find a projection matrix *W* that separates two point sets in a lower-dimensional space, where \( y={W}^T\overrightarrow{x} \). In discriminant analysis, two scatter matrices, called the within-class (*S*_{ω}) and between-class (*S*_{b}) matrices, are defined to quantify the quality of the clusters as follows:

where \( {m}_i=\frac{1}{n_i}{\sum}_{x\in {\varPi}_i}x \) is the mean of the *i*th class, and \( m=\frac{1}{n}{\sum}_{i=1}^k{\sum}_{x\in {\varPi}_i}x \) is the global mean.

Under the linear transformation *W*, the within-class and between-class matrices become \( {S}_{\omega}^L={W}^T{S}_{\omega }W \) and \( {S}_b^L={W}^T{S}_bW \). An optimal transformation *W* would maximize \( trace\left({S}_b^L\right) \) and minimize \( trace\left({S}_{\omega}^L\right) \). Common optimizations in classical discriminant analysis include:

The solution can be obtained by applying an eigen-decomposition to the matrix \( {S}_{\omega}^{-1}{S}_b \) if *S*_{ω} is nonsingular, or to \( {S}_b^{-1}{S}_{\omega } \) if *S*_{b} is nonsingular. The dimension reduced by LDA ranges from one to *k − 1*.
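For reference, the scatter matrices and optimization criteria of classical LDA are usually written as below. This is the standard textbook form, stated here as an assumption about what the definitions in this subsection refer to; some presentations additionally divide both matrices by *n*.

\[
S_{\omega}=\sum_{i=1}^{k}\sum_{x\in {\varPi}_i}\left(x-m_i\right)\left(x-m_i\right)^T,\qquad
S_b=\sum_{i=1}^{k} n_i\left(m_i-m\right)\left(m_i-m\right)^T
\]

\[
\max_{W}\; trace\left(\left(S_{\omega}^L\right)^{-1}S_b^L\right)\qquad \text{and}\qquad \min_{W}\; trace\left(\left(S_b^L\right)^{-1}S_{\omega}^L\right)
\]

Because \( S_b \) is a sum of *k* rank-one terms whose weighted deviations \( n_i(m_i-m) \) sum to zero, its rank is at most *k* − 1, which is consistent with the upper bound on the reduced dimension stated above.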

### Random forest

A random forest [22] is a classifier consisting of a collection of tree-structured classifiers {*h*(*x*, *θ*_{k}), *k* = 1, 2, …}, where the {*θ*_{k}} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input *x*. Specifically, for the *k*th tree, a random vector *θ*_{k} is generated independently of the past random vectors *θ*_{1}, …, *θ*_{k − 1} but with the same distribution. The tree is grown using the training set and *θ*_{k}, resulting in a classifier *h*(*x*, *θ*_{k}) where *x* is an input vector. After a large number of trees are generated, they vote for the most popular class. These procedures are called random forests. The bagging algorithm used in random forests helps improve stability and accuracy in statistical classification. The out-of-bag error measures the prediction error rate of the bagging algorithm on the training dataset in the random forest method. It can be shown empirically to be an unbiased estimate, comparable to using a test set of the same size as the training set.
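The idea behind the out-of-bag error can be illustrated with a small bagging sketch. To stay short and dependency-free, it swaps the decision trees of a random forest for a nearest-centroid base learner, so it is a bagged ensemble rather than a random forest; the function names and the toy data are assumptions.

```python
import random
from collections import Counter
from math import dist

def fit_centroids(X, y):
    """Hypothetical base learner: one centroid per class."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: tuple(sum(c) / len(pts) for c in zip(*pts))
            for label, pts in groups.items()}

def predict(centroids, x):
    return min(centroids, key=lambda label: dist(centroids[label], x))

def oob_error(X, y, n_learners=25, seed=0):
    """Each sample is scored only by learners that did not train on it."""
    rng = random.Random(seed)
    n = len(X)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_learners):
        bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        model = fit_centroids([X[i] for i in bag], [y[i] for i in bag])
        for i in set(range(n)) - set(bag):  # out-of-bag samples
            votes[i][predict(model, X[i])] += 1
    scored = [i for i in range(n) if votes[i]]  # skip never-out-of-bag samples
    wrong = sum(votes[i].most_common(1)[0][0] != y[i] for i in scored)
    return wrong / len(scored)

# Toy data (assumed): two well-separated classes of ten points each.
X = [(0.1 * i, 0.1 * i) for i in range(10)] + \
    [(10 + 0.1 * i, 10 + 0.1 * i) for i in range(10)]
y = ["a"] * 10 + ["b"] * 10
err = oob_error(X, y)
```

Because every sample is evaluated only by learners whose bootstrap sample excluded it, no separate test set is needed to estimate the error rate.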

### ROC curve

In this study, the ROC curves are drawn by the following steps. Assume that all the samples belong to *M* classes. Given the number-of-trees parameter *k* in the random forest algorithm, each class in turn is considered as positive and the others are regarded as negative. We compute four values *M* times as follows: (1) the true positive (TP): number of positive samples predicted correctly; (2) the true negative (TN): number of negative samples predicted correctly; (3) the false negative (FN): number of positive samples predicted incorrectly; (4) the false positive (FP): number of negative samples predicted incorrectly. Then the averages of the true positive rate TPR = TP/(TP + FN) and the false positive rate FPR = FP/(FP + TN) over all *M* classes are calculated. The ROC curve plots TPR as a function of FPR as *k* varies. The AUC value, representing the area under the ROC curve (above the *x*-axis), is calculated by the “trapz” function in Matlab.
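The one-versus-rest averaging and the trapezoidal area can be sketched in a few lines. This is a Python illustration rather than the Matlab code used in the study; the function names are hypothetical, and closing the curve with the end points (0, 0) and (1, 1) in the usage example is an assumption.

```python
def macro_rates(y_true, y_pred):
    """Average TPR and FPR over classes, each class taken as positive once."""
    classes = sorted(set(y_true))
    pairs = list(zip(y_true, y_pred))
    tprs, fprs = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        tn = sum(t != c and p != c for t, p in pairs)
        tprs.append(tp / (tp + fn))
        fprs.append(fp / (fp + tn))  # needs at least two classes
    return sum(tprs) / len(classes), sum(fprs) / len(classes)

def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (FPR, TPR) points."""
    pts = sorted(zip(fpr, tpr))
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Toy labels (assumed): one of four samples is misclassified.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
tpr, fpr = macro_rates(y_true, y_pred)
auc = trapezoid_auc([0.0, fpr, 1.0], [0.0, tpr, 1.0])
```

In the procedure described above, each value of *k* would contribute one (FPR, TPR) point, and the area would be computed over all of these points.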

## Notes

### Acknowledgements

The authors wish to thank Dr. Benson from the Department of Computer Science, Seattle Pacific University, for help with revising the manuscript, and the Department of Mathematical Science at Tsinghua University for providing the work space and library facilities. Prof. Stephen Yau is grateful to the National Center for Theoretical Sciences (NCTS) for providing an excellent research environment while part of this research was done.

### Funding

This study is supported by the National Natural Sciences Foundation of China (91746119) and the Tsinghua University start up fund. The authors wish to thank Tsinghua Qingfeng Scholarship (THQF2018–13). The funders did not take part in the study design; in the collection and analysis of data; in the writing of the manuscript; or in the decision to publish this manuscript.

### Availability of data and materials

All the datasets used in this study can be downloaded from the Barcode of Life Data System (BOLD, http://www.barcodinglife.org), which is open for public access.

### Authors’ contributions

XZ, KT and SSTY conceived the ideas and designed the methodology; XZ and KT collected and analyzed the data; XZ and SSTY led the writing of the manuscript. All authors contributed critically to the draft and gave final approval for publication.

### Ethics approval and consent to participate

Not applicable.

### Consent for publication

Not applicable.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Supplementary material

## References

- 1. Fišer PŽ, Buzan EV. 20 years since the introduction of DNA barcoding: from theory to application. J Appl Gen. 2013;55:43–52.
- 2. Hajibabaei M, Singer GA, Hebert PD, Hickey DA. DNA barcoding: how it complements taxonomy, molecular phylogenetics and population genetics. Trends Genet. 2007;23:167–72.
- 3. Schoch CL, Seifert KA, Huhndorf S, Robert V, Spouge JL, Levesque CA, Chen W, Consortium FB. Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi. Proc Natl Acad Sci U S A. 2012;109:6241–6.
- 4. Kress WJ, Erickson DL. DNA barcodes: genes, genomics, and bioinformatics. Proc Natl Acad Sci U S A. 2008;105:2761–2.
- 5. Luo A, Zhang A, Ho SY, Xu W, Zhang Y, Shi W, Cameron SL, Zhu C. Potential efficacy of mitochondrial genes for animal DNA barcoding: a case study using eutherian mammals. BMC Genomics. 2011;12:1–13.
- 6. Kress WJ, Wurdack KJ, Zimmer EA, Weigt LA, Janzen DH. Use of DNA barcodes to identify flowering plants. Proc Natl Acad Sci U S A. 2005;102:8369–74.
- 7. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, Arrieta JM, Herndl GJ. Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci U S A. 2006;103:12115–20.
- 8. Saunders GW. Applying DNA barcoding to red macroalgae: a preliminary appraisal holds promise for future applications. Phil Trans R Soc B. 2005;360:1879–88.
- 9. Hebert PD, Cywinska A, Ball SL, deWaard JR. Biological identifications through DNA barcodes. Proc Biol Sci. 2003;270:313–21.
- 10. Begerow D, Nilsson H, Unterseher M, Maier W. Current state and perspectives of fungal DNA barcoding and rapid identification procedures. Appl Microbiol Biotechnol. 2010;87:99–108.
- 11. Ratnasingham S, Hebert PD. BOLD: The barcode of life data system (www.barcodinglife.org). Mol Ecol Notes. 2007;7:355–64.
- 12. Zhao X, Wan X, He RL, Yau SST. A new method for studying the evolutionary origin of the SAR11 clade marine bacteria. Mol Phylogenet Evol. 2016;98:271–9.
- 13. Ekrem T, Willassen E, Stur E. A comprehensive DNA sequence library is essential for identification with DNA barcodes. Mol Phylogenet Evol. 2007;43:530–42.
- 14. Nielsen R, Matz M. Statistical approaches for DNA barcoding. Syst Biol. 2006;55:162–9.
- 15. Abdo Z, Golding GB. A step toward barcoding life: a model-based, decision-theoretic method to assign genes to preexisting species groups. Syst Biol. 2007;56:44–56.
- 16. Kuksa P, Pavlovic V. Efficient alignment-free DNA barcode analytics. BMC Bioinformatics. 2009;10:1–18.
- 17. Zhang AB, Muster C, Liang HB, Zhu CD, Crozier R, Wan P, Feng J, Ward RD. A fuzzy-set-theory-based approach to analyse species membership in DNA barcoding. Mol Ecol. 2012;21:1848–63.
- 18. Weitschek E, Fiscon G, Felici G. Supervised DNA barcodes species classification: analysis, comparisons and results. Biodata Min. 2014;7:1–18.
- 19. Zhao X, Tian K, He RL, Yau SST. Establishing the phylogeny of Prochlorococcus with a new alignment-free method. Ecol Evol. 2017;7:11057–65.
- 20. Tian K, Zhao X, Yau SS. Convex hull analysis of evolutionary and phylogenetic relationships between biological groups. J Theor Biol. 2018;456:34–40.
- 21. Ye J, Janardan R, Li Q. Two-dimensional linear discriminant analysis. Adv Neural Inf Proces Syst. 2005;5:1431–41.
- 22. Breiman L. Random Forests. Mach Learn. 2001;45:5–32.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.