Background & Summary

In recent years, technological advances led to a considerable drop in cost and time of sequencing experiments resulting in an exponential growth of the number of known protein sequences. For example, the latest release of the UniProtKB database, containing 227 M sequences, expands the 2019 version by 90%1,2. Metagenomic databases grow even faster: the EMBL-EBI MGnify, including hundreds of millions of sequences produced from high-quality assembled genomes, reported a 48-fold growth since its first release in 20173,4.

Functional annotation of protein sequences is crucial for a broad spectrum of biological applications, ranging from understanding cellular mechanisms to drug discovery5,6. Since large-scale experimental characterization is unfeasible, researchers rely on alternative tools to impute sequence-function relations. A prominent strategy consists in grouping homologous protein regions into families to formulate functional hypotheses based on common ancestry. Current efforts to provide protein family annotations rely at least to some degree on manual intervention, thus limiting their ability to match the ever-increasing influx of sequenced data while inherently introducing human bias. For example, the number of families in the Pfam database, focused on classifying all protein domains in evolutionary-related groups, increased by 6% over the last five years. Despite the introduction of novel families, Pfam maintained a steady level of sequence and residue coverage of the UniProtKB database over the same time period, respectively 77% and 53%. Increasing the level of coverage with the inclusion of novel families is becoming more challenging as new Pfam entries tend to cover a small taxonomic range7,8. Furthermore, the proliferation of metagenomic projects will likely induce a further drop in Pfam coverage. In fact, family annotation levels of metagenomic protein sequence databases are rather low, typically around 46%3.

To address this problem, we developed the DPCfam pipeline, a two-step Density Peak Clustering (DPC) algorithm9 which classifies sequence-similar protein regions into putative protein families called metaclusters (MCs)10. Crucially, the pipeline generates the final classification through a fully unsupervised algorithm that relies only on local pairwise sequence alignments. Supported by a high-performance modular implementation that ensures scaling over large databases, DPCfam consists of a fully automated alternative tool that can boost the annotation of protein datasets. In previous work, we successfully used DPCfam to classify the UniRef50 dataset (version 2017_07), containing about 23 million sequences11. As part of this work, 45,000 metaclusters were generated, of which 30% consisted of protein regions that lacked Pfam annotation. These regions may represent novel protein families. The results from this procedure have already been used to identify 63 new protein families in Pfam 35.0.

In the present study, we apply the DPCfam pipeline to metagenomics data, focusing on the Unified Human Gastrointestinal Protein catalogue clustered at 50% amino acid identity (UHGP-50). This dataset is part of the MGnify catalogue and contains 4 million entries representing 600 million proteins encoded in the Unified Human Gastrointestinal Genome collection12. The overall scope of the Unified Human Gastrointestinal initiative was to generate a nonredundant dataset of the human gut genomes and proteins by merging together the most relevant studies13,14,15,16,17 into a unified curated catalogue. The dataset has been widely used by the scientific community and has already fostered important applications in medicine and biology18,19; despite this, large portions of the catalogue lack functional annotation.

The DPCfam pipeline on UHGP-50 took 15 days to complete on a 5-node cluster, leveraging 120 cores of Intel Xeon Gold 6126 processors, each with 500 GB of RAM. The pipeline generated 10,778 metaclusters, each consisting of a set of sequence-similar protein regions called seeds. We filter seed regions based on sequence similarity for each metacluster to build profile-HMMs20. These profiles were then used to perform a more sensitive search on UHGP-50. We took all significant profile-HMM hits (domain E-value = 0.03, protein E-value = 0.01) as our new metacluster member regions. Using this extended metacluster representation, DPCfam covers 50.02% of all residues and 53.38% of all sequences in UHGP-50. This represents a substantial gain in coverage if compared with the Pfam annotation reported in the original UHGP release12, which has a residue and sequence coverage for UHGP-50 of 26.2% and 37.55% respectively. Of all metaclusters, 1,261 do not overlap either with families in Pfam or with metaclusters in DPCfam-UniRef50 and thus represent potential novel families, which might be gut metagenome-specific.

The aim of this study is to provide researchers with a database that could accelerate the identification of novel protein families in the Human Gastrointestinal Proteome, thus potentially enabling the formulation of new functional hypotheses. Future plans include further optimization of the clustering procedure allowing for faster processing of newly deposited metagenomic data, and combining automatic classification with functional annotation leveraging deep learning21,22 for even more comprehensive annotation of UHGP.

Methods

We present a schematic overview of DPCfam in Fig. 1. The pipeline consists of four main stages: 1) all-versus-all sequence alignment; 2) domain identification, also referred to as first clustering; 3) family creation, also referred to as secondary clustering or metaclustering; 4) metacluster merging and filtering procedure. Pre- and post-processing steps depends on each specific case. We briefly review each stage of the pipeline, and we refer to the work of Russo et al.11 for further details.

Fig. 1
figure 1

Steps of the DPCfam pipeline. Pre-processing: we take all sequences in UHGP-50 or the redundancy-reduced version of UHGP. All-versus-all alignment: for each sequence in UHGP-50 (query sequence), local alignments are generated against all other UHGP-50 sequences (search sequences) using blastp. Primary clustering: the sequence search regions that align to a query sequence are grouped together using DPC according to the relative position of their alignments along the query sequence. Metaclustering: the primary clusters from different query sequences are grouped together to form metaclusters using DPC according to how many search sequence regions they share. Merging and filtering: metaclusters that after the previous step still share a significant number of search sequence regions are merged together to obtain the final list of metaclusters. Post-processing: profile Hidden Markov Models are generated for each metacluster and run against UHGP-50 in order to increase coverage of the automatic classification.

Pre-processing

The DPCfam algorithm takes as input a dataset of non-redundant sequences with at most 50% amino acid identity between its entries. The filtering procedure by sequence identity avoids the over-representation of the number of pairwise alignments of a given protein region which could introduce artefacts in the final output. For this study, we used the clustered dataset UHGP-50 version 1.0 provided by MGnify (https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v1.0/uhgp_catalogue/).

All-versus-all alignment

The first step of the pipeline is an all-versus-all local pairwise sequence alignment computed with blastp, a software part of the BLAST + v2.2.30 suite23. The result is a collection of search sequence regions that align with each query protein. This step is by far the most computationally demanding, despite being embarrassingly parallel, as each query alignment search is independent.

Primary clustering

For each query sequence Q, the output of blastp consists of a collection of protein regions that align to different portions of Q. The primary clustering procedure groups them based on their overlap along the query sequence. To this end, we first define the following distance measure between two generic search sequence regions i and j of a query sequence Q,

$${d}_{ij}^{Q}=1-\frac{\left|{Q}_{i}\cap {Q}_{j}\right|}{\left|{Q}_{i}\cup {Q}_{j}\right|},$$
(1)

where Qi and Qj are the portions, expressed in number of amino acids, of the query Q covered by the respective search sequence regions. Note that if they cover exactly the same region of the query, the distance is zero, and if they do not overlap, the distance is 1. For each query Q, we use the above distance to cluster the aligned protein regions using the DPC algorithm. Each cluster, or primary cluster, represents a high-density region along the query sequence and comprises a set of search sequence regions. From a query perspective, primary clustering segments the sequence into regions akin to domains or domain combinations (architectures). This is also reflected on the associated local alignments, which likely represent domains or domain combinations of the search sequences they originate from. Primary clusters thus represent a first family-like group of sequences anchored to a specific query.

Metaclustering (or Secondary Clustering)

Once we identify the segmentation of the query sequences according to clusters of search sequence regions, we proceed to classify them into metaclusters. We define the distance between two primary clusters as the number of members they have in common with an overlap bigger than 0.8, normalized by the size of the smaller primary cluster. The secondary clustering procedure is done once again using the DPC algorithm, and the resulting metaclusters are collections of primary clusters, which in turn contain a collection of protein regions.

Merging and filtering

At this stage of the pipeline, we have a set of protein regions classified into metaclusters, with each metacluster associated to a density peak in the landscape of primary clusters. Given the nature of the algorithm, two peaks separated by a high-density saddle point will be considered as independent metaclusters but could still share a large number of sequences. To avoid this phenomenon we perform a merging procedure in which we join metaclusters that match this condition. We define the distance between two metaclusters as the average primary cluster distance taken over all their member pairs (primary cluster distance as defined in the previous section). If the average distance is smaller than 0.9, we merge the metaclusters. After the merging step we perform a filtering step to remove duplicates within MCs and, separately, to remove sequences that belong to only one of the merging clusters. The list of protein regions within each metaclusters constitute the seeds of DPCfam-generated putative families.

Post-processing

The output of the previous stage is a list of metaclusters, each containing a set of protein regions selected among the original local alignments. To further extend the sequence and residue coverage of our MCs, for each of them, we build a profile-HMM according to the following procedure: 1) pruning of the MC seed sequences with CD-HIT v4.724 at 60 percent identity; 2) generation from the pruned seed of a multiple sequence alignment (MSA) with the software MUSCLE v3.8.3125; 3) construction from the MSA of a profile-HMM with HMMER - hmmbuild v3.1b226; 4) running of the profile-HMM against UHGP-50 using HMMER-hmmsearch v3.1b2 (domain E-value = 0.03, protein E-value = 0.01). Note that the profile-HMM models can further be used to automatically annotate previously unseen proteins without the need to rerun the DPCfam clustering algorithm.

Data Records

The full dataset is available for download from Zenodo27 (https://doi.org/10.5281/zenodo.10611777), and it can be explored interactively from our dedicated website at https://dpcfam.areasciencepark.it/uhgp. Only metaclusters with more than 50 protein seeds and an average sequence length of more than 50 amino acids are included in the dataset, resulting in a total of 10,778 metaclusters. To make the data more interoperable and reusable, metadata about all the seeds have been combined into a single XML file. This format allows for different types of information to be included in a hierarchical but flexible structure that can easily be built up sequentially. It is a machine-readable and well-established format used in many projects in the life sciences. In particular, it is the format used by the UniProt consortium to publish protein annotations. With this in mind, we also built an XML file containing all UHGP-50 proteins and subsequently annotated it with DPCfam-UHGP50 metaclusters and Pfam family membership.

In the remaining part of this section, we provide a detailed description of the content of the Zenodo repository, including the complete list of files associated with each metacluster and the XML files containing metacluster information organized hierarchically along with UHGP-50 protein annotation.

Metacluster Files

For each of the 10,778 metaclusters, we provide a fasta file with the metacluster seed sequences, a fasta file with the representative seed sequences after clustering at 60 percent identity, a fasta file with the MSA of the representative seed sequences, and a file with the corresponding profile-HMM. Files are grouped by category and compressed into the following archives: seeds.zip, filtered_seeds.zip, metaclusters_msas.tar.gz, and metaclusters_hmms.tar.gz.

XML files

The metaclusters_xml.tar.gz archive contains the metaclusters_uhgp50.xml file and auxiliary parsing scripts, the use of which is described in README.md. In addition to the seed sequences, the XML file contains the following information about each metacluster: the number of seed sequences, the mean and standard deviation of the amino acid length of the seed sequences, the fraction of amino acids in a low-complexity region, the fraction of amino acids in a coiled-coil region, the fraction of amino acids in a disordered region and the mean number of transmembrane regions.

We also include data comparing the UHGP-50 metaclusters with the Pfam annotation found in the UHGP-50 downloadable database. The metadata resulting from this comparison is described in Table 1. For an in-depth description of the analysis this data derives from, we refer the reader to Technical Validation and Russo et al.11. Metaclusters are classified into four groups: equivalent, reduced, extended, and shifted. The specific category assigned to each metacluster depends on the fraction of the seed region that Pfam does not cover and vice versa. These values are averaged over all metacluster seed sequence - Pfam sequence pairs. If both are within 0.2, we consider the metacluster “equivalent” to the Pfam families it overlaps with; if both fractions are larger than 0.2, the metacluster is categorized as shifted. If only one of the fractions is larger than 0.2, the metacluster is either extended or reduced (see Fig. 2).

Table 1 Detailed description for each XML field containing the DPCfam vs Pfam comparison metadata.
Fig. 2
figure 2

Visual representation of the different types of overlap between a Pfam family and a DPCfam metacluster. The orange region represents the ground truth (Pfam), while the blue region represents the DPCfam metacluster. For a more detailed description of how these categories are assigned, refer to the original DPCfam article11.

The archive uhgp_xml.tar.gz contains the UHGP-50 protein dataset annotated with Pfam and DPCfam families. For DPCfam, it is also indicated if the protein is a seed sequence or if it was found after searching the database with the profile-HMMs.

Mapping File

To make our results compatible with future versions of UHGP, we have included a file named uhgp_protein_mapping.txt in the Zenodo repository that provides a mapping between protein identifiers from versions 1.0 and 2.0.2 of the UHGP dataset. This mapping consists of three different identifiers for each protein: the first corresponds to the identifier of the protein from the original dataset used in this study (UHGP-50 version 1.0), the second corresponds to the identifier for version 2.0.2 of UHGP, and the third corresponds to the representative of the protein when clustered at 100% identity (UHGP-100 version 2.0.2). The latter is necessary because UHGP version 2.0.2 only provides the amino acid sequences for UHGP-100 representative proteins to avoid repetition.

Technical Validation

First, we look into the generic properties of the metaclusters, including size distribution, predicted average number of transmembrane regions, and the fraction of disordered, coiled-coil, and low-complexity regions. Second, using Pfam annotation as a reference, we test the existence of homologous relationships between metacluster members and analyse the quality of the boundaries. Finally, we study the overlap between UHGP-50 and UniRef50 DPCfam-generated metaclusters. From these analyses, we identified 1,261 unknown metaclusters, i.e. metaclusters that do not overlap with either Pfam or DPCfam-UniRef50 annotation. While overlapping metaclusters may be helpful for improving existing annotation, the subset of unknown entries might represent protein families that are novel and potentially unique to the gut microbiome.

Metaclusters general properties

Clustering UHGP-50 with DPCfam produced 40,738 metaclusters. While this is a rather large number, it should be noted that DPCfam metacluster collections feature a certain degree of redundancy, with different MCs mapping to overlapping regions of the same protein11. The median metacluster size (i.e., number of sequences) was 24, while the average size was 90.5. Figure 3 shows the Complementary Cumulative Distribution Function calculated from the metaclusters size distribution for UHGP-50 (blue) and, as a reference, UniRef50 (grey). The relationship between how frequent protein domain families appear in nature and their size was first described by Gerstein et al.28 and can be explained by a birth, death, and innovation evolutionary model (BDIM)29. The resulting analytical expression is a Generalized Pareto Distribution which approximately fits the data with an exponent of 1.23 ± 0.01 (red).

Fig. 3
figure 3

Log-Log plot of the Complementary Cumulative Distribution Function (CCDF) of UHGP-MCs (blue) and UR50-MCs (grey) sizes. The red line is the best fit of the data with a CCDF calculated from a Generalized Pareto Distribution (exponent of the Pareto 1.23 ± 0.01).

In this regime of ‘the rich get richer’, elements larger in size are usually the more interesting ones. Moreover, as noted in our previous work11, metaclusters of smaller size are, on average, of lower quality. Because of this, we decided to focus our downstream analysis on the set of MCs containing at least 50 members. We also excluded MCs with an average domain length of less than 50 amino acids since length also affects the quality of the alignments and most known structural domains are longer. After applying these filters, we obtained 10,778 metaclusters which, in the following, we refer to as the DPCfam-UHGP50 dataset. The median metacluster size after filtering was 282.7, while the average size was 110.

Next, we examined the fraction of amino acids which are predicted to be found in a coiled-coil, low-complexity, and disordered region, using the software DeepCoil v2.0.130, Segmasker part of the BLAST + v2.2.30 suite31 and IUPred2A32 respectively. We labeled metaclusters with more than 10% of amino acids from all of their members in a low-complexity or coiled-coil region as Low Complexity MCs or Coiled Coil MCs, respectively. To label metaclusters as Disordered we used instead a threshold of 50% of all its amino acids in a disordered state. Figure 4 shows a Venn Diagram with the percentage of metaclusters in one or more of these categories. Although not a direct measure of quality, high values of these quantities indicate a bias in the amino acid composition of their member sequences, which makes it harder to infer homology from sequence similarity. This is especially true for low-complexity regions and coiled-coil regions.

Fig. 4
figure 4

Percentage of metaclusters labeled as Low Complexity, Disordered, or Coiled Coil based on the fraction of amino acids of their seed sequences that are predicted to be found in a low-complexity, disordered, or coiled-coil region.

Results were in line with what we had observed when running DPCfam on UniRef5011; indeed, the majority of the metaclusters (88.96%) did not fall into any of these categories. We noted that the fraction of disordered MCs was rather small (about 1%). This was expected, however, since intrinsic disorder is not very common in prokaryotes33, which constitute the vast majority of gut microbiome organisms12. As an additional measure, we used the software Phobius v1.0134 to predict the number of transmembrane helical domains in each protein of the dataset. Among all metaclusters, 6.8% had an average of at least two transmembrane regions per member and were thus likely to represent transmembrane helical bundle domains.

Comparison with Pfam

To assess homologous relationships between metacluster seed sequences and to check how well our automatically-generated MCs boundaries mapped to those of manually annotated protein families, we compared our results with UHGP-50 Pfam annotation.

To perform the comparison between the two classifications, we first extracted the Pfam labels from the Interpro scan file, which comes as part of the UHGP-50 downloadable database. In order to include multiple family architectures in the analysis, since metaclusters may extend over one or more families, we extended the Pfam annotation by combining single families into all possible family architectures. In what follows, the term Pfam architecture will be used to refer to either single families or multiple family architectures.

To identify the Pfam architecture which best represents a particular DPCfam metacluster, we adopted the same criteria used in Russo et al.11. In particular, we defined the MC’s dominant architecture as the Pfam architecture shared by the largest number of its seed sequences (note that annotating a single amino acid is enough to associate a seed sequence to a Pfam family). To compare DPCfam and Pfam, we selected only those MCs that have a dominant architecture shared by at least 50% of the seed sequences; we will refer to this as “DA ≥ 50%”.

Figure 5 shows the distribution of the average overlap between each metacluster and its corresponding dominant architecture. The histogram includes the 4,434 metaclusters for which a Pfam DA ≥ 50% was present. Colours correspond to the type of overlap between the two annotations (see Fig. 2). Among the MCs we consider here, 41.7% were classified as equivalent, 26.2% were reduced, 20.5% were extended and 11.6% were shifted.

Fig. 5
figure 5

DPCfam-UHGP50 metaclusters average overlap with respect to its Pfam dominant architecture (only metaclusters with DA ≥ 50%). Colors show the overlap class decomposition for each bin. The pie chart shows the aggregated contribution of each class expressed as a percentage of the 4,434 metaclusters.

Equivalent metaclusters have good boundary agreements with their Pfam dominant architecture. Reduced and extended metaclusters identify regions that are, on average, shorter or larger than the dominant architecture, respectively. Shifted metaclusters, conversely, correspond to cases where the metacluster covers regions that are consistently located N-terminal or C-terminal to the dominant architecture.

Among the remaining metaclusters without a well-defined DA, 2,277 do not overlap with any Pfam annotation and were therefore labeled as unknown to Pfam. This pool of metaclusters is particularly intriguing since it may include novel protein families.

Comparison with DPCfam-UniRef50

To further evaluate our results, we compared DPCfam-UHGP50 metaclusters (UHGP-metaclusters) against DPCfam-UniRef50 metaclusters (UR-metaclusters) which were obtained by clustering a dataset 5 times bigger11. The procedure was analogous to the one described in the previous section in which we compared UHGP-metaclusters against Pfam. The objective was to find, for each UHGP-metacluster, the correspondent UR-metacluster dominant architecture in order to establish a link between the two datasets. To obtain UR-metaclusters labels for UHGP-50 sequences, we ran the UR-metacluster profile-HMMs against UHGP-50 using HMMER - hmmsearch software.

Figure 6 shows the comparison results, where colours indicate the type of overlap with the dominant architecture. Among the 6,764 UHGP-metaclusters with a well-defined UR-metacluster dominant architecture, 50.6% are labeled as equivalent. This meant that most UHGP-50-generated metaclusters that had a well-matched UR-metacluster had close boundary correspondence with the latter. Only 8.7% of metaclusters were classified as shifted, further indicating that DPCfam can quite consistently identify family boundaries when run on different datasets.

Fig. 6
figure 6

DPCfam-UHGP50 metaclusters average overlap with their DPCfam-UniRef50 dominant architecture (only metaclusters with DA ≥ 50%). Colours show the overlap class decomposition for each bin. The pie chart shows the aggregated contribution of each class expressed as a percentage of the 6,764 metaclusters.

Among those without a well-defined dominant architecture, we identified 1,953 with no overlapping UR-metacluster annotation. We labeled this subset of metaclusters as unknown to UR-metacluster. The 1,261 metaclusters unknown to both Pfam and UR-metacluster annotation creates a collection of potential new families that could be specific to the gut metagenome.

Usage Notes

For ease of use, we provide parser scripts that allow to transform the XML files into space-separated tables.

For the XML file containing the metaclusters, the parser script generates a folder containing one fasta file per metacluster, as well as seven space-separated tables. Each fasta file contains all of the seed sequences of the corresponding metacluster. The space-separated txt files include the following: a list of all metacluster IDs, a table with statistical information on the seed sequences for each metacluster, a table with Pfam comparison measures for each metacluster, and four tables with additional metacluster properties (low-complexity, disordered, coiled-coil and transmembrane regions).

For the XML file listing all UHGP-50 proteins, the script generates a list of DPCfam matches (with start and end positions in the matching protein), a list of Pfam matches, a list of all UGHP-50 proteins, and lists of metacluster seed sequences before and after the CD-HIT filtering step.

Specific details on how to run the scripts are described in the README file. Parts of the code can be commented out to avoid generating all output files.

Dedicated Webserver

The dataset can also be explored online on the dedicated DPCfam browser available at https://dpcfam.areasciencepark.it/uhgp. This website includes all the putative protein families from the DPCfam-UHGP50 clustering and lets the user search either for individual proteins or specific metaclusters, as well as Pfam families. On the website we additionally provide all this information for DPCfam v1.1 on Uniref-50.

Each DPCfam metacluster has its entry with information on basic statistics (including the percentage of low-complexity, coiled-coil, disordered domains, and the average number of transmembrane helices), the Pfam dominant architecture if present, as well as the complete list of seed sequences. The seed sequence list, multiple sequence alignments, and HMM model can be downloaded from the metacluster page.

Proteins also have individual entries (see Fig. 7). For each protein, the corresponding entry page displays the list of DPCfam and Pfam domains and a graphical representation of their positions along the protein sequence. Results for Pfam queries include, if present, the matching DPCfam metaclusters and the degree and type of overlap.

Fig. 7
figure 7

Screenshot of a protein entry page on the DPCfam web browser. A list of families is shown for the protein (both DPCfam and Pfam families) and their relative positions can be visualized in the schematic view.