Protein family annotation for the Unified Human Gastrointestinal Proteome by DPCfam clustering

Barone, Federico; Russo, Elena Tea; Villegas Garcia, Edith Natalia; Punta, Marco; Cozzini, Stefano; Ansuini, Alessio; Cazzaniga, Alberto

doi:10.1038/s41597-024-03131-4

Protein family annotation for the Unified Human Gastrointestinal Proteome by DPCfam clustering

Data Descriptor
Open access
Published: 01 June 2024

Volume 11, article number 568, (2024)
Cite this article

Download PDF

You have full access to this open access article

Scientific Data

Protein family annotation for the Unified Human Gastrointestinal Proteome by DPCfam clustering

Download PDF

Federico Barone^1,2,
Elena Tea Russo¹,
Edith Natalia Villegas Garcia ORCID: orcid.org/0000-0002-7338-2068^1,2,
Marco Punta^3,4,
Stefano Cozzini¹,
Alessio Ansuini¹ &
…
Alberto Cazzaniga ORCID: orcid.org/0000-0001-6271-3303¹

306 Accesses
2 Altmetric
Explore all metrics

Abstract

Technological advances in massively parallel sequencing have led to an exponential growth in the number of known protein sequences. Much of this growth originates from metagenomic projects producing new sequences from environmental and clinical samples. The Unified Human Gastrointestinal Proteome (UHGP) catalogue is one of the most relevant metagenomic datasets with applications ranging from medicine to biology. However, the low levels of sequence annotation may impair its usability. This work aims to produce a family classification of UHGP sequences to facilitate downstream structural and functional annotation. This is achieved through the release of the DPCfam-UHGP50 dataset containing 10,778 putative protein families generated using DPCfam clustering, an unsupervised pipeline grouping sequences into single or multi-domain architectures. DPCfam-UHGP50 considerably improves family coverage at protein and residue levels compared to the manually curated repository Pfam. In the hope that DPCfam-UHGP50 will foster future discoveries in the field of metagenomics of the human gut, we release a FAIR-compliant database of our results that is easily accessible via a searchable web server and Zenodo repository.

MetaGeneHunt for protein domain annotation in short-read metagenomes

Article Open access 07 May 2020

SeqWiz: a modularized toolkit for next-generation protein sequence database management and analysis

Article Open access 17 May 2023

HIPPI: highly accurate protein family classification with ensembles of HMMs

Article Open access 11 November 2016

Background & Summary

In recent years, technological advances led to a considerable drop in cost and time of sequencing experiments resulting in an exponential growth of the number of known protein sequences. For example, the latest release of the UniProtKB database, containing 227 M sequences, expands the 2019 version by 90%^1,2. Metagenomic databases grow even faster: the EMBL-EBI MGnify, including hundreds of millions of sequences produced from high-quality assembled genomes, reported a 48-fold growth since its first release in 2017^3,4.

Functional annotation of protein sequences is crucial for a broad spectrum of biological applications, ranging from understanding cellular mechanisms to drug discovery^5,6. Since large-scale experimental characterization is unfeasible, researchers rely on alternative tools to impute sequence-function relations. A prominent strategy consists in grouping homologous protein regions into families to formulate functional hypotheses based on common ancestry. Current efforts to provide protein family annotations rely at least to some degree on manual intervention, thus limiting their ability to match the ever-increasing influx of sequenced data while inherently introducing human bias. For example, the number of families in the Pfam database, focused on classifying all protein domains in evolutionary-related groups, increased by 6% over the last five years. Despite the introduction of novel families, Pfam maintained a steady level of sequence and residue coverage of the UniProtKB database over the same time period, respectively 77% and 53%. Increasing the level of coverage with the inclusion of novel families is becoming more challenging as new Pfam entries tend to cover a small taxonomic range^7,8. Furthermore, the proliferation of metagenomic projects will likely induce a further drop in Pfam coverage. In fact, family annotation levels of metagenomic protein sequence databases are rather low, typically around 46%³.

To address this problem, we developed the DPCfam pipeline, a two-step Density Peak Clustering (DPC) algorithm⁹ which classifies sequence-similar protein regions into putative protein families called metaclusters (MCs)¹⁰. Crucially, the pipeline generates the final classification through a fully unsupervised algorithm that relies only on local pairwise sequence alignments. Supported by a high-performance modular implementation that ensures scaling over large databases, DPCfam consists of a fully automated alternative tool that can boost the annotation of protein datasets. In previous work, we successfully used DPCfam to classify the UniRef50 dataset (version 2017_07), containing about 23 million sequences¹¹. As part of this work, 45,000 metaclusters were generated, of which 30% consisted of protein regions that lacked Pfam annotation. These regions may represent novel protein families. The results from this procedure have already been used to identify 63 new protein families in Pfam 35.0.

In the present study, we apply the DPCfam pipeline to metagenomics data, focusing on the Unified Human Gastrointestinal Protein catalogue clustered at 50% amino acid identity (UHGP-50). This dataset is part of the MGnify catalogue and contains 4 million entries representing 600 million proteins encoded in the Unified Human Gastrointestinal Genome collection¹². The overall scope of the Unified Human Gastrointestinal initiative was to generate a nonredundant dataset of the human gut genomes and proteins by merging together the most relevant studies^{13,14,15,16,17} into a unified curated catalogue. The dataset has been widely used by the scientific community and has already fostered important applications in medicine and biology^18,19; despite this, large portions of the catalogue lack functional annotation.

The DPCfam pipeline on UHGP-50 took 15 days to complete on a 5-node cluster, leveraging 120 cores of Intel Xeon Gold 6126 processors, each with 500 GB of RAM. The pipeline generated 10,778 metaclusters, each consisting of a set of sequence-similar protein regions called seeds. We filter seed regions based on sequence similarity for each metacluster to build profile-HMMs²⁰. These profiles were then used to perform a more sensitive search on UHGP-50. We took all significant profile-HMM hits (domain E-value = 0.03, protein E-value = 0.01) as our new metacluster member regions. Using this extended metacluster representation, DPCfam covers 50.02% of all residues and 53.38% of all sequences in UHGP-50. This represents a substantial gain in coverage if compared with the Pfam annotation reported in the original UHGP release¹², which has a residue and sequence coverage for UHGP-50 of 26.2% and 37.55% respectively. Of all metaclusters, 1,261 do not overlap either with families in Pfam or with metaclusters in DPCfam-UniRef50 and thus represent potential novel families, which might be gut metagenome-specific.

The aim of this study is to provide researchers with a database that could accelerate the identification of novel protein families in the Human Gastrointestinal Proteome, thus potentially enabling the formulation of new functional hypotheses. Future plans include further optimization of the clustering procedure allowing for faster processing of newly deposited metagenomic data, and combining automatic classification with functional annotation leveraging deep learning^21,22 for even more comprehensive annotation of UHGP.

Methods

We present a schematic overview of DPCfam in Fig. 1. The pipeline consists of four main stages: 1) all-versus-all sequence alignment; 2) domain identification, also referred to as first clustering; 3) family creation, also referred to as secondary clustering or metaclustering; 4) metacluster merging and filtering procedure. Pre- and post-processing steps depends on each specific case. We briefly review each stage of the pipeline, and we refer to the work of Russo et al.¹¹ for further details.

Pre-processing

The DPCfam algorithm takes as input a dataset of non-redundant sequences with at most 50% amino acid identity between its entries. The filtering procedure by sequence identity avoids the over-representation of the number of pairwise alignments of a given protein region which could introduce artefacts in the final output. For this study, we used the clustered dataset UHGP-50 version 1.0 provided by MGnify (https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v1.0/uhgp_catalogue/).

All-versus-all alignment

The first step of the pipeline is an all-versus-all local pairwise sequence alignment computed with blastp, a software part of the BLAST + v2.2.30 suite²³. The result is a collection of search sequence regions that align with each query protein. This step is by far the most computationally demanding, despite being embarrassingly parallel, as each query alignment search is independent.

Primary clustering

For each query sequence Q, the output of blastp consists of a collection of protein regions that align to different portions of Q. The primary clustering procedure groups them based on their overlap along the query sequence. To this end, we first define the following distance measure between two generic search sequence regions i and j of a query sequence Q,

$${d}_{ij}^{Q}=1-\frac{\left|{Q}_{i}\cap {Q}_{j}\right|}{\left|{Q}_{i}\cup {Q}_{j}\right|},$$

(1)

where Q_i and Q_j are the portions, expressed in number of amino acids, of the query Q covered by the respective search sequence regions. Note that if they cover exactly the same region of the query, the distance is zero, and if they do not overlap, the distance is 1. For each query Q, we use the above distance to cluster the aligned protein regions using the DPC algorithm. Each cluster, or primary cluster, represents a high-density region along the query sequence and comprises a set of search sequence regions. From a query perspective, primary clustering segments the sequence into regions akin to domains or domain combinations (architectures). This is also reflected on the associated local alignments, which likely represent domains or domain combinations of the search sequences they originate from. Primary clusters thus represent a first family-like group of sequences anchored to a specific query.

Metaclustering (or Secondary Clustering)

Once we identify the segmentation of the query sequences according to clusters of search sequence regions, we proceed to classify them into metaclusters. We define the distance between two primary clusters as the number of members they have in common with an overlap bigger than 0.8, normalized by the size of the smaller primary cluster. The secondary clustering procedure is done once again using the DPC algorithm, and the resulting metaclusters are collections of primary clusters, which in turn contain a collection of protein regions.

Merging and filtering

At this stage of the pipeline, we have a set of protein regions classified into metaclusters, with each metacluster associated to a density peak in the landscape of primary clusters. Given the nature of the algorithm, two peaks separated by a high-density saddle point will be considered as independent metaclusters but could still share a large number of sequences. To avoid this phenomenon we perform a merging procedure in which we join metaclusters that match this condition. We define the distance between two metaclusters as the average primary cluster distance taken over all their member pairs (primary cluster distance as defined in the previous section). If the average distance is smaller than 0.9, we merge the metaclusters. After the merging step we perform a filtering step to remove duplicates within MCs and, separately, to remove sequences that belong to only one of the merging clusters. The list of protein regions within each metaclusters constitute the seeds of DPCfam-generated putative families.

Post-processing

The output of the previous stage is a list of metaclusters, each containing a set of protein regions selected among the original local alignments. To further extend the sequence and residue coverage of our MCs, for each of them, we build a profile-HMM according to the following procedure: 1) pruning of the MC seed sequences with CD-HIT v4.7²⁴ at 60 percent identity; 2) generation from the pruned seed of a multiple sequence alignment (MSA) with the software MUSCLE v3.8.31²⁵; 3) construction from the MSA of a profile-HMM with HMMER - hmmbuild v3.1b2²⁶; 4) running of the profile-HMM against UHGP-50 using HMMER-hmmsearch v3.1b2 (domain E-value = 0.03, protein E-value = 0.01). Note that the profile-HMM models can further be used to automatically annotate previously unseen proteins without the need to rerun the DPCfam clustering algorithm.

Data Records

The full dataset is available for download from Zenodo²⁷ (https://doi.org/10.5281/zenodo.10611777), and it can be explored interactively from our dedicated website at https://dpcfam.areasciencepark.it/uhgp. Only metaclusters with more than 50 protein seeds and an average sequence length of more than 50 amino acids are included in the dataset, resulting in a total of 10,778 metaclusters. To make the data more interoperable and reusable, metadata about all the seeds have been combined into a single XML file. This format allows for different types of information to be included in a hierarchical but flexible structure that can easily be built up sequentially. It is a machine-readable and well-established format used in many projects in the life sciences. In particular, it is the format used by the UniProt consortium to publish protein annotations. With this in mind, we also built an XML file containing all UHGP-50 proteins and subsequently annotated it with DPCfam-UHGP50 metaclusters and Pfam family membership.

In the remaining part of this section, we provide a detailed description of the content of the Zenodo repository, including the complete list of files associated with each metacluster and the XML files containing metacluster information organized hierarchically along with UHGP-50 protein annotation.

Metacluster Files

For each of the 10,778 metaclusters, we provide a fasta file with the metacluster seed sequences, a fasta file with the representative seed sequences after clustering at 60 percent identity, a fasta file with the MSA of the representative seed sequences, and a file with the corresponding profile-HMM. Files are grouped by category and compressed into the following archives: seeds.zip, filtered_seeds.zip, metaclusters_msas.tar.gz, and metaclusters_hmms.tar.gz.

XML files

The metaclusters_xml.tar.gz archive contains the metaclusters_uhgp50.xml file and auxiliary parsing scripts, the use of which is described in README.md. In addition to the seed sequences, the XML file contains the following information about each metacluster: the number of seed sequences, the mean and standard deviation of the amino acid length of the seed sequences, the fraction of amino acids in a low-complexity region, the fraction of amino acids in a coiled-coil region, the fraction of amino acids in a disordered region and the mean number of transmembrane regions.

We also include data comparing the UHGP-50 metaclusters with the Pfam annotation found in the UHGP-50 downloadable database. The metadata resulting from this comparison is described in Table 1. For an in-depth description of the analysis this data derives from, we refer the reader to Technical Validation and Russo et al.¹¹. Metaclusters are classified into four groups: equivalent, reduced, extended, and shifted. The specific category assigned to each metacluster depends on the fraction of the seed region that Pfam does not cover and vice versa. These values are averaged over all metacluster seed sequence - Pfam sequence pairs. If both are within 0.2, we consider the metacluster “equivalent” to the Pfam families it overlaps with; if both fractions are larger than 0.2, the metacluster is categorized as shifted. If only one of the fractions is larger than 0.2, the metacluster is either extended or reduced (see Fig. 2).

Table 1 Detailed description for each XML field containing the DPCfam vs Pfam comparison metadata.

Full size table

The archive uhgp_xml.tar.gz contains the UHGP-50 protein dataset annotated with Pfam and DPCfam families. For DPCfam, it is also indicated if the protein is a seed sequence or if it was found after searching the database with the profile-HMMs.

Mapping File

To make our results compatible with future versions of UHGP, we have included a file named uhgp_protein_mapping.txt in the Zenodo repository that provides a mapping between protein identifiers from versions 1.0 and 2.0.2 of the UHGP dataset. This mapping consists of three different identifiers for each protein: the first corresponds to the identifier of the protein from the original dataset used in this study (UHGP-50 version 1.0), the second corresponds to the identifier for version 2.0.2 of UHGP, and the third corresponds to the representative of the protein when clustered at 100% identity (UHGP-100 version 2.0.2). The latter is necessary because UHGP version 2.0.2 only provides the amino acid sequences for UHGP-100 representative proteins to avoid repetition.

Technical Validation

First, we look into the generic properties of the metaclusters, including size distribution, predicted average number of transmembrane regions, and the fraction of disordered, coiled-coil, and low-complexity regions. Second, using Pfam annotation as a reference, we test the existence of homologous relationships between metacluster members and analyse the quality of the boundaries. Finally, we study the overlap between UHGP-50 and UniRef50 DPCfam-generated metaclusters. From these analyses, we identified 1,261 unknown metaclusters, i.e. metaclusters that do not overlap with either Pfam or DPCfam-UniRef50 annotation. While overlapping metaclusters may be helpful for improving existing annotation, the subset of unknown entries might represent protein families that are novel and potentially unique to the gut microbiome.

Metaclusters general properties

Clustering UHGP-50 with DPCfam produced 40,738 metaclusters. While this is a rather large number, it should be noted that DPCfam metacluster collections feature a certain degree of redundancy, with different MCs mapping to overlapping regions of the same protein¹¹. The median metacluster size (i.e., number of sequences) was 24, while the average size was 90.5. Figure 3 shows the Complementary Cumulative Distribution Function calculated from the metaclusters size distribution for UHGP-50 (blue) and, as a reference, UniRef50 (grey). The relationship between how frequent protein domain families appear in nature and their size was first described by Gerstein et al.²⁸ and can be explained by a birth, death, and innovation evolutionary model (BDIM)²⁹. The resulting analytical expression is a Generalized Pareto Distribution which approximately fits the data with an exponent of 1.23 ± 0.01 (red).

In this regime of ‘the rich get richer’, elements larger in size are usually the more interesting ones. Moreover, as noted in our previous work¹¹, metaclusters of smaller size are, on average, of lower quality. Because of this, we decided to focus our downstream analysis on the set of MCs containing at least 50 members. We also excluded MCs with an average domain length of less than 50 amino acids since length also affects the quality of the alignments and most known structural domains are longer. After applying these filters, we obtained 10,778 metaclusters which, in the following, we refer to as the DPCfam-UHGP50 dataset. The median metacluster size after filtering was 282.7, while the average size was 110.

Next, we examined the fraction of amino acids which are predicted to be found in a coiled-coil, low-complexity, and disordered region, using the software DeepCoil v2.0.1³⁰, Segmasker part of the BLAST + v2.2.30 suite³¹ and IUPred2A³² respectively. We labeled metaclusters with more than 10% of amino acids from all of their members in a low-complexity or coiled-coil region as Low Complexity MCs or Coiled Coil MCs, respectively. To label metaclusters as Disordered we used instead a threshold of 50% of all its amino acids in a disordered state. Figure 4 shows a Venn Diagram with the percentage of metaclusters in one or more of these categories. Although not a direct measure of quality, high values of these quantities indicate a bias in the amino acid composition of their member sequences, which makes it harder to infer homology from sequence similarity. This is especially true for low-complexity regions and coiled-coil regions.

Results were in line with what we had observed when running DPCfam on UniRef50¹¹; indeed, the majority of the metaclusters (88.96%) did not fall into any of these categories. We noted that the fraction of disordered MCs was rather small (about 1%). This was expected, however, since intrinsic disorder is not very common in prokaryotes³³, which constitute the vast majority of gut microbiome organisms¹². As an additional measure, we used the software Phobius v1.01³⁴ to predict the number of transmembrane helical domains in each protein of the dataset. Among all metaclusters, 6.8% had an average of at least two transmembrane regions per member and were thus likely to represent transmembrane helical bundle domains.

Comparison with Pfam

To assess homologous relationships between metacluster seed sequences and to check how well our automatically-generated MCs boundaries mapped to those of manually annotated protein families, we compared our results with UHGP-50 Pfam annotation.

To perform the comparison between the two classifications, we first extracted the Pfam labels from the Interpro scan file, which comes as part of the UHGP-50 downloadable database. In order to include multiple family architectures in the analysis, since metaclusters may extend over one or more families, we extended the Pfam annotation by combining single families into all possible family architectures. In what follows, the term Pfam architecture will be used to refer to either single families or multiple family architectures.

To identify the Pfam architecture which best represents a particular DPCfam metacluster, we adopted the same criteria used in Russo et al.¹¹. In particular, we defined the MC’s dominant architecture as the Pfam architecture shared by the largest number of its seed sequences (note that annotating a single amino acid is enough to associate a seed sequence to a Pfam family). To compare DPCfam and Pfam, we selected only those MCs that have a dominant architecture shared by at least 50% of the seed sequences; we will refer to this as “DA ≥ 50%”.

Figure 5 shows the distribution of the average overlap between each metacluster and its corresponding dominant architecture. The histogram includes the 4,434 metaclusters for which a Pfam DA ≥ 50% was present. Colours correspond to the type of overlap between the two annotations (see Fig. 2). Among the MCs we consider here, 41.7% were classified as equivalent, 26.2% were reduced, 20.5% were extended and 11.6% were shifted.

Equivalent metaclusters have good boundary agreements with their Pfam dominant architecture. Reduced and extended metaclusters identify regions that are, on average, shorter or larger than the dominant architecture, respectively. Shifted metaclusters, conversely, correspond to cases where the metacluster covers regions that are consistently located N-terminal or C-terminal to the dominant architecture.

Among the remaining metaclusters without a well-defined DA, 2,277 do not overlap with any Pfam annotation and were therefore labeled as unknown to Pfam. This pool of metaclusters is particularly intriguing since it may include novel protein families.

Comparison with DPCfam-UniRef50

To further evaluate our results, we compared DPCfam-UHGP50 metaclusters (UHGP-metaclusters) against DPCfam-UniRef50 metaclusters (UR-metaclusters) which were obtained by clustering a dataset 5 times bigger¹¹. The procedure was analogous to the one described in the previous section in which we compared UHGP-metaclusters against Pfam. The objective was to find, for each UHGP-metacluster, the correspondent UR-metacluster dominant architecture in order to establish a link between the two datasets. To obtain UR-metaclusters labels for UHGP-50 sequences, we ran the UR-metacluster profile-HMMs against UHGP-50 using HMMER - hmmsearch software.

Figure 6 shows the comparison results, where colours indicate the type of overlap with the dominant architecture. Among the 6,764 UHGP-metaclusters with a well-defined UR-metacluster dominant architecture, 50.6% are labeled as equivalent. This meant that most UHGP-50-generated metaclusters that had a well-matched UR-metacluster had close boundary correspondence with the latter. Only 8.7% of metaclusters were classified as shifted, further indicating that DPCfam can quite consistently identify family boundaries when run on different datasets.

Among those without a well-defined dominant architecture, we identified 1,953 with no overlapping UR-metacluster annotation. We labeled this subset of metaclusters as unknown to UR-metacluster. The 1,261 metaclusters unknown to both Pfam and UR-metacluster annotation creates a collection of potential new families that could be specific to the gut metagenome.

Usage Notes

For ease of use, we provide parser scripts that allow to transform the XML files into space-separated tables.

For the XML file containing the metaclusters, the parser script generates a folder containing one fasta file per metacluster, as well as seven space-separated tables. Each fasta file contains all of the seed sequences of the corresponding metacluster. The space-separated txt files include the following: a list of all metacluster IDs, a table with statistical information on the seed sequences for each metacluster, a table with Pfam comparison measures for each metacluster, and four tables with additional metacluster properties (low-complexity, disordered, coiled-coil and transmembrane regions).

For the XML file listing all UHGP-50 proteins, the script generates a list of DPCfam matches (with start and end positions in the matching protein), a list of Pfam matches, a list of all UGHP-50 proteins, and lists of metacluster seed sequences before and after the CD-HIT filtering step.

Specific details on how to run the scripts are described in the README file. Parts of the code can be commented out to avoid generating all output files.

Dedicated Webserver

The dataset can also be explored online on the dedicated DPCfam browser available at https://dpcfam.areasciencepark.it/uhgp. This website includes all the putative protein families from the DPCfam-UHGP50 clustering and lets the user search either for individual proteins or specific metaclusters, as well as Pfam families. On the website we additionally provide all this information for DPCfam v1.1 on Uniref-50.

Each DPCfam metacluster has its entry with information on basic statistics (including the percentage of low-complexity, coiled-coil, disordered domains, and the average number of transmembrane helices), the Pfam dominant architecture if present, as well as the complete list of seed sequences. The seed sequence list, multiple sequence alignments, and HMM model can be downloaded from the metacluster page.

Proteins also have individual entries (see Fig. 7). For each protein, the corresponding entry page displays the list of DPCfam and Pfam domains and a graphical representation of their positions along the protein sequence. Results for Pfam queries include, if present, the matching DPCfam metaclusters and the degree and type of overlap.

Code availability

The DPCfam pipeline source code is available for download at https://gitlab.com/area7/DPCfam/dpcfam. The repository includes usage notes and example datasets to test the algorithm.

References

UniProt Consortium, T. UniProt: the universal protein knowledgebase. Nucleic Acids Research 46, 2699–2699, https://doi.org/10.1093/nar/gky092 (2018).
Article CAS PubMed Google Scholar
Consortium, T. U. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Research 51, D523–D531, https://doi.org/10.1093/nar/gkac1052 (2022).
Article CAS Google Scholar
Richardson, L. et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Research 51, D753–D759, https://doi.org/10.1093/nar/gkac1080 (2022).
Article CAS PubMed Central Google Scholar
Mitchell, A. L. et al. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Research 48, D570–D578, https://doi.org/10.1093/nar/gkz1035 (2019).
Article CAS PubMed Central Google Scholar
Luck, K. et al. A reference map of the human binary protein interactome. Nature 580, 402–408, https://doi.org/10.1038/s41586-020-2188-x (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Rifaioglu, A. S. et al. Recent applications of deep learning and machine intelligence on in silico drug discovery: methods, tools and databases. Briefings in bioinformatics 20, 1878–1912, https://doi.org/10.1093/bib/bby061 (2019).
Article CAS PubMed Google Scholar
El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Research 47, D427–D432, https://doi.org/10.1093/nar/gky995 (2018).
Article CAS PubMed Central Google Scholar
Mistry, J. et al. Pfam: The protein families database in 2021. Nucleic Acids Research 49, D412–D419, https://doi.org/10.1093/nar/gkaa913 (2020).
Article CAS PubMed Central Google Scholar
Rodriguez, A. & Laio, A. Clustering by fast search and find of density peaks. Science 344, 1492–1496, https://doi.org/10.1126/science.1242072 (2014).
Article ADS CAS PubMed Google Scholar
Russo, E. T., Laio, A. & Punta, M. Density peak clustering of protein sequences associated to a pfam clan reveals clear similarities and interesting differences with respect to manual family annotation. BMC bioinformatics 22, 1–28, https://doi.org/10.1186/s12859-021-04013-x (2021).
Article CAS Google Scholar
Russo, E. T. et al. Dpcfam: unsupervised protein family classification by density peak clustering of large sequence datasets. PLOS Computational Biology 18, 1–29, https://doi.org/10.1371/journal.pcbi.1010610 (2022).
Article CAS Google Scholar
Almeida, A. et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nature biotechnology 39, 105–114, https://doi.org/10.1038/s41587-020-0603-3 (2021).
Article CAS PubMed Google Scholar
Kitts, P. A. et al. Assembly: a resource for assembled genomes at ncbi. Nucleic acids research 44, D73–D80, https://doi.org/10.1093/nar/gkv1226 (2016).
Article CAS PubMed Google Scholar
Chen, I.-M. A. et al. Img/m v. 5.0: an integrated data management and comparative analysis system for microbial genomes and microbiomes. Nucleic acids research 47, D666–D677, https://doi.org/10.1093/nar/gky901 (2019).
Article CAS PubMed Google Scholar
Wattam, A. R. et al. Improvements to patric, the all-bacterial bioinformatics database and analysis resource center. Nucleic acids research 45, D535–D542, https://doi.org/10.1093/nar/gkw1017 (2017).
Article CAS PubMed Google Scholar
Forster, S. C. et al. A human gut bacterial genome and culture collection for improved metagenomic analyses. Nature biotechnology 37, 186–192, https://doi.org/10.1038/s41587-018-0009-7 (2019).
Article CAS PubMed PubMed Central Google Scholar
Zou, Y. et al. 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nature biotechnology 37, 179–185, https://doi.org/10.1038/s41587-018-0008-8 (2019).
Article CAS PubMed PubMed Central Google Scholar
Trebicka, J., Bork, P., Krag, A. & Arumugam, M. Utilizing the gut microbiome in decompensated cirrhosis and acute-on-chronic liver failure. Nature reviews Gastroenterology & hepatology 18, 167–180, https://doi.org/10.1038/s41575-020-00376-3 (2021).
Article Google Scholar
Qin, Y. et al. Combined effects of host genetics and diet on human gut microbiota and incident disease in a single population cohort. Nature Genetics 54, 134–142, https://doi.org/10.1038/s41588-021-00991-z (2022).
Article CAS PubMed PubMed Central Google Scholar
Mistry, J., Finn, R. D., Eddy, S. R., Bateman, A. & Punta, M. Challenges in homology search: Hmmer3 and convergent evolution of coiled-coil regions. Nucleic acids research 41, e121–e121, https://doi.org/10.1093/nar/gkt263 (2013).
Article CAS PubMed PubMed Central Google Scholar
Bileschi, M. L. et al. Using deep learning to annotate the protein universe. Nature Biotechnology 1–6, https://doi.org/10.1038/s41587-021-01179-w (2022).
Valeriani, L. et al. The geometry of hidden representations of large transformer models. Advances in Neural Information Processing Systems 34, https://doi.org/10.48550/arXiv.2302.00294 (2023).
Boratyn, G. M. et al. Blast: a more efficient report with usability improvements. Nucleic acids research 41, W29–W33, 10.1093
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659, https://doi.org/10.1093/bioinformatics/btl158 (2006).
Article CAS PubMed Google Scholar
Edgar, R. C. Muscle: multiple sequence alignment with high accuracy and high throughput. NAR 32, 1792–1797, https://doi.org/10.1093/nar/gkh340 (2004).
Article CAS PubMed PubMed Central Google Scholar
Eddy, S. R. Accelerated profile hmm searches. PLoS computational biology 7, e1002195, https://doi.org/10.1371/journal.pcbi.1002195 (2011).
Article ADS MathSciNet CAS PubMed PubMed Central Google Scholar
Barone, F. et al. Unified Human Gastrointestinal Proteome clustering results by DPCfam. Zenodo https://doi.org/10.5281/zenodo.10611777 (2024).
Qian, J., Luscombe, N. M. & Gerstein, M. Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model. J. Mol. Biol. 313, 673–681, https://doi.org/10.1006/jmbi.2001.5079 (2001).
Article CAS PubMed Google Scholar
Koonin, E., Wolf, Y. & Karev, G. The structure of the protein universe and genome evolution. Nature 420, 218–223, https://doi.org/10.1038/nature01256 (2002).
Article ADS CAS PubMed Google Scholar
Ludwiczak, J., Winski, A., Szczepaniak, K., Alva, V. & Dunin-Horkawicz, S. DeepCoil—a fast and accurate prediction of coiled-coil domains in protein sequences. Bioinformatics 35, 2790–2795, https://doi.org/10.1093/bioinformatics/bty1062 (2019).
Article CAS PubMed Google Scholar
Camacho, C. et al. Blast+: architecture and applications. BMC Bioinformatics 41, https://doi.org/10.1186/1471-2105-10-421 (2009).
Mészáros, B., Erdős, G. & Dosztányi, Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Research 46, W329–W337, https://doi.org/10.1093/nar/gky384 (2018).
Article CAS PubMed PubMed Central Google Scholar
Basile, W., Salvatore, M., Bassot, C. & Elofsson, A. Why do eukaryotic proteins contain more intrinsically disordered regions? PLoS computational biology 15, e1007186, https://doi.org/10.1371/journal.pcbi.1007186 (2019).
Article ADS CAS PubMed PubMed Central Google Scholar
Käll, L., Krogh, A. & Sonnhammer, E. L. A combined transmembrane topology and signal peptide prediction method. Journal of Molecular Biology 338, 1027–1036, https://doi.org/10.1016/j.jmb.2004.03.016 (2004).
Article CAS PubMed Google Scholar

Download references

Acknowledgements

The authors acknowledge the AREA Science Park supercomputing platform ORFEO made available for conducting the research reported in this paper, and the technical support of the staff of the Laboratory of Data Engineering. We thank Alessandro Laio for the conversations on the material of this paper. F.B., E.T.R., and E.N.V.G. were supported by the project PON “BIO Open Lab (BOL) - Rafforzamento del capitale umano”—CUP: J72F20000940007. S.C. A.A. and A. C. were supported by the European Union — NextGenerationEU within the project PNRR “PRP@CERIC” IR0000028 - Mission 4 Component 2 Investment 3.1 Action 3.1.1.

Author information

Authors and Affiliations

Area Science Park, Padriciano, 99, 34149, Trieste, Italy
Federico Barone, Elena Tea Russo, Edith Natalia Villegas Garcia, Stefano Cozzini, Alessio Ansuini & Alberto Cazzaniga
University of Trieste, Trieste, 34127, Italy
Federico Barone & Edith Natalia Villegas Garcia
IRCCS San Raffaele Institute, Center for Omics Sciences, Milan, 20132, Italy
Marco Punta
IRCCS San Raffaele Institute, Unit of Immunogenetics, Leukemia Genomics and Immunobiology, Division of Immunology, Transplantation and Infectious Disease, Milan, 20132, Italy
Marco Punta

Authors

Federico Barone
View author publications
You can also search for this author in PubMed Google Scholar
Elena Tea Russo
View author publications
You can also search for this author in PubMed Google Scholar
Edith Natalia Villegas Garcia
View author publications
You can also search for this author in PubMed Google Scholar
Marco Punta
View author publications
You can also search for this author in PubMed Google Scholar
Stefano Cozzini
View author publications
You can also search for this author in PubMed Google Scholar
Alessio Ansuini
View author publications
You can also search for this author in PubMed Google Scholar
Alberto Cazzaniga
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

F.B., E.T.R., M.P., A.A., and A.C. conceptualized the UHGP-DPCfam project. F.B. and E.T.R. analyzed the UHGP dataset using the DPCfam pipeline. E.N.V.G. curated the metadata schema and organization of the dataset, in collaboration with E.T.R. F.B. developed the web server for online browsing of the data. S.C. supervised the management of computational resources for the project. A.C. and A.A. supervised the project. F.B., E.T.R., E.N.V.G., M.P., S.C., A.A., and A.C., contributed to the writing of the manuscript.

Corresponding authors

Correspondence to Alessio Ansuini or Alberto Cazzaniga.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Barone, F., Russo, E.T., Villegas Garcia, E.N. et al. Protein family annotation for the Unified Human Gastrointestinal Proteome by DPCfam clustering. Sci Data 11, 568 (2024). https://doi.org/10.1038/s41597-024-03131-4

Download citation

Received: 13 June 2023
Accepted: 08 March 2024
Published: 01 June 2024
DOI: https://doi.org/10.1038/s41597-024-03131-4
Springer Nature Limited

Protein family annotation for the Unified Human Gastrointestinal Proteome by DPCfam clustering

Abstract

Similar content being viewed by others

MetaGeneHunt for protein domain annotation in short-read metagenomes

SeqWiz: a modularized toolkit for next-generation protein sequence database management and analysis

HIPPI: highly accurate protein family classification with ensembles of HMMs

Background & Summary