Hitac: a hierarchical taxonomic classifier for fungal ITS sequences compatible with QIIME2

Miranda, Fábio M.; Azevedo, Vasco C.; Ramos, Rommel J.; Renard, Bernhard Y.; Piro, Vitor C.

doi:10.1186/s12859-024-05839-x

Hitac: a hierarchical taxonomic classifier for fungal ITS sequences compatible with QIIME2

Software
Open access
Published: 02 July 2024

Volume 25, article number 228, (2024)
Cite this article

Download PDF

You have full access to this open access article

BMC Bioinformatics Aims and scope Submit manuscript

Hitac: a hierarchical taxonomic classifier for fungal ITS sequences compatible with QIIME2

Download PDF

Fábio M. Miranda^1,2,
Vasco C. Azevedo³,
Rommel J. Ramos^3,4,5,
Bernhard Y. Renard¹^na1 &
…
Vitor C. Piro^1,2^na1

317 Accesses
1 Altmetric
Explore all metrics

Abstract

Background

Fungi play a key role in several important ecological functions, ranging from organic matter decomposition to symbiotic associations with plants. Moreover, fungi naturally inhabit the human body and can be beneficial when administered as probiotics. In mycology, the internal transcribed spacer (ITS) region was adopted as the universal marker for classifying fungi. Hence, an accurate and robust method for ITS classification is not only desired for the purpose of better diversity estimation, but it can also help us gain a deeper insight into the dynamics of environmental communities and ultimately comprehend whether the abundance of certain species correlate with health and disease. Although many methods have been proposed for taxonomic classification, to the best of our knowledge, none of them fully explore the taxonomic tree hierarchy when building their models. This in turn, leads to lower generalization power and higher risk of committing classification errors.

Results

Here we introduce HiTaC, a robust hierarchical machine learning model for accurate ITS classification, which requires a small amount of data for training and can handle imbalanced datasets. HiTaC was thoroughly evaluated with the established TAXXI benchmark and could correctly classify fungal ITS sequences of varying lengths and a range of identity differences between the training and test data. HiTaC outperforms state-of-the-art methods when trained over noisy data, consistently achieving higher F1-score and sensitivity across different taxonomic ranks, improving sensitivity by 6.9 percentage points over top methods in the most noisy dataset available on TAXXI.

Conclusions

HiTaC is publicly available at the Python package index, BIOCONDA and Docker Hub. It is released under the new BSD license, allowing free use in academia and industry. Source code and documentation, which includes installation and usage instructions, are available at https://gitlab.com/dacs-hpi/hitac.

View this article's peer review reports

IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences

Article Open access 09 August 2018

Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin

Article Open access 17 May 2018

CONSTAX: a tool for improved taxonomic resolution of environmental fungal ITS sequences

Article Open access 06 December 2017

Background

In the last decades, there has been a surge of interest in characterizing the mycoflora of communities. This is due to the fact that fungi are key organisms in several important ecological functions, ranging from nutrient recycling to mycorrhizal associations and decomposition of wood and litter [1]. Moreover, fungi naturally inhabit the human body and may benefit the host when administered as probiotics [2], yet little is known about the fungal microbiota of healthy individuals when compared to the bacterial microbiome [3]. Furthermore, identifying the presence of fungi can be challenging, as they rarely form structures visible to the naked eye, and similar structures are frequently composed of several distinct species [4]. Hence, it became common practice to use DNA sequencing in addition to morphological studies in contemporary mycology [5].

One of the approaches that can be employed in environmental studies is whole-genome shotgun sequencing (WGS), where the purpose is to obtain the genetic material of all organisms present in a given microbiome simultaneously [6]. Nonetheless, in studies where the end goal is only to estimate the diversity or discover the evolutionary distance of organisms in a microbiome, it is cheaper and more convenient to sequence only genetic markers such as 16 S rRNA, which is well conserved in all bacteria [7] and also widely used to classify and identify archaea [8]. In mycology, the internal transcribed spacer (ITS) region was adopted as the universal marker for classifying fungi [9].

The main characteristic of 16 S rRNA that makes it a good genetic marker is the presence of nine hypervariable regions V1–V9, which are flanked by conserved regions that can be used to amplify target sequences using universal primers [10]. However, 16 S rRNA has fewer hypervariable domains in fungi, and consequently, it is more appropriate to use the ITS region (Fig. 1), which is flanked by the 18 S and 28 S rRNA genes and interrupted by the conserved 5.8S rRNA gene [11]. Additionally, the ease with which ITS is amplified makes it an appealing target for sequencing samples from environments as challenging as soil and wood, where the initial quantity and quality of DNA is low [4].

From a computational perspective, the taxonomic classification of ITS fungal sequences can be performed by using either similarity-based or machine learning techniques. Similarity-based approaches, such as BLAST [12], require the alignment of a query sequence against all sequences available in a reference database, e.g., Silva [13] and UNITE [14]. However, a major limitation of similarity-based methods is the dependence on homologs in databases, which are not always available, while machine learning methods can learn only the relevant features and criteria for classification and can, therefore, be applied to any new sequence [15].

One of the first machine learning software proposed to perform taxonomic classification was the RDP-Classifier, which uses 8-mer frequencies to train a Naive Bayes classification algorithm [16]. Improving upon the RDP-Classifier, a novel Naive Bayes classifier with multinomial models was developed to increase the accuracy [17]. This multinomial classifier was later reimplemented and optimized on Microclass [18].

Other machine learning approaches have also been proposed in the literature, such as the k-Nearest Neighbor (KNN) algorithm [19]. Given a query sequence, the KNN algorithm identifies the k-most similar sequences in a database and uses the taxonomic information from each of those sequences to determine the consensus taxonomy. Q2_SK [20] implements several supervised learning classifiers from scikit-learn [21], e.g., Random Forest, Support Vector Machines (SVM) and Gradient Boosting. Q2_SK is flexible and allows the selection of features and classifiers.

Despite the advances made in taxonomic classification, there are still opportunities for enhancement due to the complexity of the data. Furthermore, to the best of our knowledge, the machine learning software available may lack in predictive performance since they perform flat classification, which means that they are only trained on leaf node labels and do not fully explore the hierarchical structure of the taxonomic tree when building their models (see Fig. S1 for a depiction of the hierarchical structure of the taxonomic tree). The main disadvantage of this approach is that the information about parent–child class relationships in the hierarchical structure is not entirely considered, which may cause flat classifiers to commit more errors than hierarchical approaches [22]. Therefore, we developed a new hierarchical taxonomic classifier for fungal ITS sequences, denominated HiTaC, which explores the hierarchical structure of the taxonomic tree during the training stage and improves the F1-score and sensitivity of fungal ITS predictions. Furthermore, HiTaC can be easily installed and is compatible with QIIME2, facilitating its adoption in existing ITS sequence analysis pipelines.

Implementation

The local hierarchical taxonomic classifier

HiTaC performs feature extraction, building of the hierarchical model, taxonomic classification, and reporting of the predictions. Figure 2 illustrates the algorithmic steps of HiTaC, where it starts by decomposing the DNA sequences from both reference and query files into their constituents k-mers. Next, two k-mer frequency matrices are built for the reference and query sequences, respectively, where each row contains the k-mer frequency for a sequence. Logistic regression classifiers implemented in the library scikit-learn [21] are trained for each parent node in the taxonomic hierarchy, using the local classifier per parent node implementation from a library called HiClass that we developed and published previously [23]. The k-mer frequency matrix and taxonomic labels of the reference sequences are used as training data to predict children labels (see Fig. 3 for a visual representation). The mycological nomenclature used as labels for the internal nodes originates from the taxonomy provided by the user, which allows the user to decide whether to use Index Fungorum [24], Faces of Fungi [25], Mycobank [26], or The NCBI taxonomy database [27], for example. Predictions are performed using the k-mer frequency matrix computed for the query sequences and are reported in a top-down approach, based on the taxonomic hierarchy. For example, if the classifier at the root node predicts that a query sequence belongs to the phylum Ascomycota, then only that branch of the taxonomic tree will be considered for the remaining taxonomic levels.

Classification uncertainty

To reduce classification errors when faced with uncertainties, we also construct a filter by training a logistic regression classifier for each level of the hierarchy. This filter computes confidence scores for all taxonomic ranks assigned by the local classifier per parent node and excludes labels that are below a user-defined threshold in a bottom-up approach (Fig. 4). The confidence scores are computed using the predict_proba method implemented in scikit-learn’s logistic regression classifier, which in turn uses a one-vs-rest approach; that is, it calculates the probability of each class using the logistic function and assuming it to be positive, then it normalizes these values across all the classes [28].

Code quality assurance

The code base adheres to the PEP 8 code style [29], which is enforced by flake8 and the uncompromising code formatter black to ensure high code quality. Versioning complies with SemVer to increase reproducibility and facilitate dependency management by end users. The code is accompanied by unit tests that cover 98% of the code base and are automatically executed by our continuous integration workflow upon commits.

Installation and usage

HiTaC is hosted on GitLab^{Footnote 1}, where there is also documentation with installation and usage instructions. HiTaC is compatible with Python 3.8+ and can be installed on GNU/Linux, Windows and macOS. It can be easily obtained with pip install hitac, conda install -c bioconda hitac or docker pull mirand863/hitac. Listing 1 shows a basic example of how to fit a hierarchical model and predict the taxonomy using HiTaC. More elaborate examples can be found in the documentation.

Evaluation

We evaluated HiTaC in five datasets with varying sequence lengths (Tables S2–S3) and compared its performance against seventeen taxonomic classifiers using the TAXXI benchmark [30]. All methods evaluated were trained and tested with the same data in order to avoid bias. The commands and parameters executed are summarized in Tables S4–S22.

To measure the performance of our model, we adopted all metrics available in the TAXXI benchmark and extended them by adding standard machine learning metrics. Furthermore, we also computed hierarchical metrics, which can give better insights into which algorithm is better at classifying hierarchical data [22]. The definitions of all these metrics are listed in the supporting information text.

The datasets in the TAXXI benchmark were created using real environmental data, publicly available on NCBI [31], RDP [32], the Warcup fungal ITS training set v2 (WITS) [33], UNITE [34] and other in vivo samples [30]. These datasets were created with a strategy known as cross-validation by identity, where varying distances between query sequences and the reference database are accounted for. In practice, this means that in the dataset with 90% identity all species are novel and there is a mix of novel and known taxa at the genus level, making it the most difficult dataset in the benchmark. As the identity increases, so does the amount of known labels at all taxonomic ranks; hence, the dataset with 100% identity can be considered the easiest in the benchmark.

Results and discussion

First, we evaluated HiTaC on the dataset with 90% identity to assess its behavior under uncertainty. In order to do this, we computed the hierarchical precision to measure the proportion of correct predictions, the hierarchical recall to measure the proportion of detected positive samples and the hierarchical F1-score, which is the harmonic mean of the precision and recall. As shown in Fig. 5, HiTaC_Filter was one of the top-performing methods in terms of precision, scoring 96.05 while maintaining a recall of 78.86, equivalent to other high-ranking methods. Although other software obtained similar or slightly better precision, it cost them a steeper decline in recall, which is also evidenced by the lower F1 scores. This result shows that HiTaC performed well in the presence of uncertainty.

To appraise HiTaC’s behavior when classifying known organisms, we evaluated it on the dataset with 100% identity. As shown in Fig. 6, HiTaC_Filter achieved a perfect precision score of 100 when the taxonomy was fully annotated in the reference database. Furthermore, the recall persisted at 99.68; that is, there was no prominent decrease in recall when compared with the filter-less version. This result shows the potential of HiTaC in accurately classifying known organisms, given that query sequences have perfect matches in the reference database.

In Fig. 7, we summarize the results achieved for the hierarchical metrics by computing the F1-score for all five datasets and sorting by the average of F1 scores. HiTaC_Filter obtained an F1-score equivalent or higher to the top-performing approaches for all datasets, corroborating the results presented in the previous paragraphs. This result demonstrates that HiTaC_Filter can positively identify organisms annotated in databases while being able to leave most new species unannotated.

Similar conclusions can be drawn from Fig. 8, in which we evaluated the percentage of known sequences correctly predicted, i.e., the true positive rate (TPR) for all methods and datasets available in the benchmark. The trend is that HiTaC achieved a TPR higher or equal to top methods. For instance, the best true positive rate at the genus level for the dataset with 90% identity was achieved by HiTaC, which sharply increased the TPR by 6.9 percentage points when compared with BTOP. Regarding the dataset with 95% identity at the genus level, HiTaC obtained a 1.7 percentage points improvement over Microclass, while for the dataset with 97% identity at the genus level, HiTaC surpassed Microclass with a 0.5 increase in percentage points. At the species level, HiTaC achieved a 0.7 percentage points gain over BTOP for the dataset with 99% identity, while for the dataset with 100% identity, HiTaC tied with BTOP, obtaining a perfect score of 100% TPR. These results suggest that the local hierarchical classification approach implemented in HiTaC commits fewer errors than flat classifiers as long as the sequences are previously known.

By default, HiTaC uses 6-mer frequency as a feature extraction method, but the user can change this parameter. Nevertheless, increasing the k-mer size provides small gains in predictive performance and comes with higher computational costs. The k-mer frequency computation method discards substrings with ambiguous nucleotides since they do not often occur in short reads, and accounting for them would require a slower algorithm. For example, a V means that there is either an A, C, or G in that position, and computing all these possibilities would make the method slower. Hence, we opted to disregard ambiguous nucleotides to keep the k-mer counting process as fast as possible. In future releases, we intend to implement a flexible user interface to allow users to use third-party feature extraction methods. Furthermore, we believe allowing for third-party feature extraction methods could also enable accurate taxonomic classification for 16 S rRNA.

Training the filter is slower than most methods evaluated in the benchmark (Tables S23–S27). However, it must only be performed once for the reference database, and after training, the filter can be used to estimate the uncertainty quickly. Moreover, some public databases, such as UNITE and SILVA, are not updated frequently. Nevertheless, we provide models pre-trained on the public database UNITE^{Footnote 2}, speeding up the process for users. Some of these pre-trained models contain all eukaryotic ITS sequences available on UNITE, which enable detection and removal of nonfungal sequences mistakenly amplified by polymerase chain reaction (PCR) [35]. At the time of writing, the current version of UNITE uses the mycological nomenclature published in Fungal Diversity [36]; hence, this is the nomenclature HiTaC learns when trained with the UNITE database and is reported to the user with the pre-trained models. Furthermore, HiTaC uses the unique species hypotheses identifiers provided in the database UNITE as the last taxonomic level in the hierarchy during training and reports them to the user, which increases taxonomic reproducibility.

On average, the entire ITS region is approximately 600 base pairs (bp) in length [37]. This trend can also be observed in the UNITE database, with averages ranging from 578 bp to 651 bp. However, in the UNITE database, the sequence lengths can vary from 250 bp to 16,761 bp, while the ITS sequences in the TAXXI benchmark have a maximum length of 1373 bp.

In summary, HiTaC is a Python package optimized to produce accurate taxonomic classification of fungal ITS sequences. Due to the extensive use of the taxonomic hierarchy during training, HiTaC produces fewer errors than other methods compared in the benchmark. Furthermore, its filter is a reliable classifier under uncertainty. HiTaC provides standalone Python scripts and a QIIME 2 plugin that can be quickly adopted into existing analysis pipelines. HiTaC and its dependencies can be easily installed via pip, conda or docker, which is not the case for most of the taxonomic classifiers available in the literature. HiTaC is released under the new BSD license, allowing free use in academia and industry and free copy, modification, and redistribution of the code as long as a duplicate of the original license is kept and proper acknowledgments are given.

Conclusion

HiTaC is a Python package for the taxonomic classification of fungal ITS sequences, which applies local hierarchical classifiers for accurate predictions. It provides scripts to train classifiers on specialized datasets and to classify new sequences with the trained models. The standard version has high sensitivity and is ideal for exploratory analysis. HiTaC also implements a filter that can express uncertainties in classifications, indicating if the input sequences are complex to recognize. Thanks to its compatibility with QIIME 2, users can quickly adopt it in existing mycobiome analysis pipelines. HiTaC is an open-source software available at https://gitlab.com/dacs-hpi/hitac.

Availability of data materials

HiTaC is freely available on the Python package index, BIOCONDA and docker hub. It is easier to install it by executing pip install hitac in the terminal. All datasets and evaluation scripts used in this paper are publicly available in the TAXXI benchmark [30], and a reproducible evaluation pipeline is available on GitLab https://gitlab.com/dacs-hpi/hitac/-/tree/main/benchmark.

Notes

References

Hawksworth DL. The magnitude of fungal diversity: the 1.5 million species estimate revisited. Mycol Res. 2001;105(12):1422–32.
Article Google Scholar
Banik A, Halder SK, Ghosh C, Mondal KC. Fungal probiotics: opportunity, challenge, and prospects. In: Recent advancement in white biotechnology through fungi: volume 2: perspective for value-added products and environments; 2019. pp. 101–117.
Huffnagle GB, Noverr MC. The emerging world of the fungal microbiome. Trends Microbiol. 2013;21(7):334–41.
Article CAS PubMed PubMed Central Google Scholar
Nilsson RH, Ryberg M, Abarenkov K, Sjökvist E, Kristiansson E. The its region as a target for characterization of fungal communities using emerging sequencing technologies. FEMS Microbiol Lett. 2009;296(1):97–101.
Article CAS PubMed Google Scholar
Hibbett DS. After the gold rush, or before the flood? Evolutionary morphology of mushroom-forming fungi (agaricomycetes) in the early 21st century. Mycol Res. 2007;111(9):1001–18.
Article PubMed Google Scholar
Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W. Environmental genome shotgun sequencing of the Sargasso sea. Science. 2004;304(5667):66–74.
Article CAS PubMed Google Scholar
Fuhrman JA. Metagenomics and its connection to microbial community organization. F1000 Biol Rep. 2012;4:12.
Article Google Scholar
Kim M, Chun J. 16s rrna gene-based identification of bacteria and archaea using the eztaxon server. In: Methods in microbiology, vol 41; Elsevier. pp. 61–74.
Schoch, C.L., Seifert, K.A., Huhndorf, S., Robert, V., Spouge, J.L., Levesque, C.A., Chen, W., Consortium, F.B.,. Nuclear ribosomal internal transcribed spacer (its) region as a universal dna barcode marker for fungi. Proc Natl Acad Sci. 2012;109(16):6241–6.
Article Google Scholar
Chakravorty S, Helb D, Burday M, Connell N, Alland D. A detailed analysis of 16s ribosomal rna gene segments for the diagnosis of pathogenic bacteria. J Microbiol Methods. 2007;69(2):330–9.
Article CAS PubMed PubMed Central Google Scholar
Dos Reis JBA, Lorenzi AS, Vale HMM. Methods used for the study of endophytic fungi: a review on methodologies and challenges, and associated tips. Arch Microbiol. 2022;204(11):675.
Article PubMed PubMed Central Google Scholar
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
Article CAS PubMed PubMed Central Google Scholar
Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glöckner FO. Silva: a comprehensive online resource for quality checked and aligned ribosomal rna sequence data compatible with arb. Nucleic Acids Res. 2007;35(21):7188–96.
Article CAS PubMed PubMed Central Google Scholar
Nilsson RH, Larsson K-H, Taylor AFS, Bengtsson-Palme J, Jeppesen TS, Schigel D, Kennedy P, Picard K, Glöckner FO, Tedersoo L. The unite database for molecular identification of fungi: handling dark taxa and parallel taxonomic classifications. Nucleic Acids Res. 2018;47(D1):259–64.
Article Google Scholar
Bzhalava Z, Tampuu A, Bała P, Vicente R, Dillner J. Machine learning for detection of viral sequences in human metagenomic datasets. BMC Bioinform. 2018;19:1–11.
Article Google Scholar
Wang Q, Garrity GM, Tiedje JM, Cole JR. Naive bayesian classifier for rapid assignment of rrna sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007;73(16):5261–7.
Article CAS PubMed PubMed Central Google Scholar
Liu K-L, Wong T-T. Naïve bayesian classifiers with multinomial models for rrna taxonomic assignment. IEEE/ACM Trans Comput Biol Bioinf. 2013;10(5):1–1.
Article Google Scholar
Liland KH, Vinje H, Snipen L. microclass: an r-package for 16s taxonomy classification. BMC Bioinform. 2017;18(1):172.
Article Google Scholar
Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75(23):7537–41.
Article CAS PubMed PubMed Central Google Scholar
Bokulich NA, Dillon MR, Bolyen E, Kaehler BD, Huttley GA, Caporaso JG. q2-sample-classifier: machine-learning tools for microbiome classification and regression. J Open Res Softw. 2018;3:30.
Google Scholar
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.
Google Scholar
Silla CN, Freitas AA. A survey of hierarchical classification across different application domains. Data Min Knowl Disc. 2011;22(1–2):31–72.
Article Google Scholar
Miranda FM, Köhnecke N, Renard BY. Hiclass: a python library for local hierarchical classification compatible with scikit-learn. J Mach Learn Res. 2023;24(29):1–17.
Google Scholar
Index Fungorum. https://www.indexfungorum.org/. Accessed 13 May 2024.
Jayasiri SC, Hyde KD, Ariyawansa HA, Bhat J, Buyck B, Cai L, Dai Y-C, Abd-Elsalam KA, Ertz D, Hidayat I. The faces of fungi database: fungal names linked with morphology, phylogeny and human impacts. Fungal Divers. 2015;74:3–18.
Article Google Scholar
Robert V, Vu D, Amor ABH, Wiele N, Brouwer C, Jabas B, Szoke S, Dridi A, Triki M, Daoud SB. Mycobank gearing up for new horizons. IMA Fungus. 2013;4:371–9.
Article PubMed PubMed Central Google Scholar
Federhen S. The ncbi taxonomy database. Nucleic Acids Res. 2012;40(D1):136–43.
Article Google Scholar
Scikit-learn: logistic regression probability estimates. 2023. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. Accessed 26 Oct 2023.
Rossum G, Warsaw B, Coghlan N. PEP 8-style guide for Python code. python.org 2001.
Edgar RC. Accuracy of taxonomy prediction for 16s rrna and fungal its sequences. PeerJ. 2018;6:4652.
Article Google Scholar
Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2012;40(D1):13–25.
Article Google Scholar
Cole JR, Wang Q, Fish JA, Chai B, McGarrell DM, Sun Y, Brown CT, Porras-Alfaro A, Kuske CR, Tiedje JM. Ribosomal database project: data and tools for high throughput rrna analysis. Nucleic Acids Res. 2014;42(D1):633–42.
Article Google Scholar
Deshpande V, Wang Q, Greenfield P, Charleston M, Porras-Alfaro A, Kuske CR, Cole JR, Midgley DJ, Tran-Dinh N. Fungal identification using a bayesian classifier and the warcup training set of internal transcribed spacer sequences. Mycologia. 2016;108(1):1–5.
Article PubMed Google Scholar
Kõljalg U, Larsson K-H, Abarenkov K, Nilsson RH, Alexander IJ, Eberhardt U, Erland S, Høiland K, Kjøller R, Larsson E. Unite: a database providing web-based methods for the molecular identification of ectomycorrhizal fungi. New Phytol. 2005;166(3):1063–8.
Article PubMed Google Scholar
Rawson C, Zahn G. Inclusion of database outgroups reduces false positives in fungal metabarcoding taxonomic assignments. Mycologia. 2023;8:1–7.
Google Scholar
Tedersoo L, Sánchez-Ramírez S, Koljalg U, Bahram M, Döring M, Schigel D, May T, Ryberg M, Abarenkov K. High-level classification of the fungi and a tool for evolutionary ecological analyses. Fungal Divers. 2018;90:135–59.
Article Google Scholar
Porras-Alfaro A, Liu K-L, Kuske CR, Xie G. From genus to phylum: large-subunit and internal transcribed spacer rrna operon regions show similar classification accuracies influenced by database composition. Appl Environ Microbiol. 2014;80(3):829–40. https://doi.org/10.1128/AEM.02894-13.
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

FMM gratefully acknowledges Marcos Augusto dos Santos (UFMG) for initial inspiration, Robert C. Edgar (independent researcher) for providing the scripts and datasets in the TAXXI benchmark, Sven Giese (HPI) for code improvement suggestions and Elizabeth Yuu (HPI) for proofreading.

Funding

Open Access funding enabled and organized by Projekt DEAL. VCP was financed by the Deutsche Forschungsgemeinschaft (DFG) under grant number 458163427.

Author information

Bernhard Y. Renard and Vitor C. Piro shared senior co-authorship.

Authors and Affiliations

Data Analytics and Computational Statistics, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
Fábio M. Miranda, Bernhard Y. Renard & Vitor C. Piro
Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
Fábio M. Miranda & Vitor C. Piro
Institute of Biological Sciences, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Vasco C. Azevedo & Rommel J. Ramos
Institute of Biological Sciences, Federal University of Pará, Belém, Brazil
Rommel J. Ramos
Centro de Computação de Alto Desempenho, Universidade Federal do Pará, Belém, Brazil
Rommel J. Ramos

Authors

Fábio M. Miranda
View author publications
You can also search for this author in PubMed Google Scholar
Vasco C. Azevedo
View author publications
You can also search for this author in PubMed Google Scholar
Rommel J. Ramos
View author publications
You can also search for this author in PubMed Google Scholar
Bernhard Y. Renard
View author publications
You can also search for this author in PubMed Google Scholar
Vitor C. Piro
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

FMM designed and implemented HiTaC’s algorithm, performed analysis, benchmarked the experiments and wrote the manuscript. VAA and RJR assisted with analysis and revised the manuscript. BYR assisted with analysis, visualizations and reviewed the manuscript. VCP oversaw experimental and analysis aspects of the project and reviewed the manuscript. All authors edited and approved the final manuscript.

Corresponding author

Correspondence to Vitor C. Piro.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interest

The authors declare that they have no competing interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supporting information text, Figures S1–S4 and tables S1–S92

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Miranda, F.M., Azevedo, V.C., Ramos, R.J. et al. Hitac: a hierarchical taxonomic classifier for fungal ITS sequences compatible with QIIME2. BMC Bioinformatics 25, 228 (2024). https://doi.org/10.1186/s12859-024-05839-x

Download citation

Received: 15 March 2024
Accepted: 11 June 2024
Published: 02 July 2024
DOI: https://doi.org/10.1186/s12859-024-05839-x

Hitac: a hierarchical taxonomic classifier for fungal ITS sequences compatible with QIIME2