The value of position-specific scoring matrices for assessment of protein allegenicity

Lim, Shen Jean; Tong, Joo Chuan; Chew, Fook Tim; Tammi, Martti T

doi:10.1186/1471-2105-9-S12-S21

Research
Open access
Published: 12 December 2008

Volume 9, article number S21, (2008)
BMC Bioinformatics

Shen Jean Lim¹,
Joo Chuan Tong²,
Fook Tim Chew³ &
…
Martti T Tammi¹,³,⁴

Abstract

Background

Bioinformatics tools are commonly used for assessing potential protein allergenicity. While these methods have achieved good accuracies for highly conserved sequences, they are less effective when the overall similarity is low. In this study, we assessed the feasibility of using position-specific scoring matrices as a basis for predicting potential allergenicity in proteins.

Results

Two simple methods for predicting potential allergenicity in proteins, based on general and group-specific allergen profiles, are presented. Testing results indicate that the performance of both methods is comparable to the best results of other methods. The group-specific profile approach, with a sensitivity of 84.04% and a specificity of 96.52%, gives results similar to those obtained using the general profile approach (sensitivity = 82.45%, specificity = 96.92%).

Conclusion

We show that position-specific scoring matrices are highly promising for constructing computational models suitable for allergenicity assessment. These data suggest it may be possible to apply a targeted approach for allergenicity assessment based on the profiles of allergens of interest.

Background

Atopic allergy and other forms of hypersensitivity reactions pose a major concern for public health, affecting up to 25% of the population in industrial nations [1, 2]. With the rapid growth in the number of genetically modified (GM) food, biopharmaceuticals and other biotechnology-derived products, identifying potential allergenicity in proteins has become crucial in product safety assessment [3, 4].

Laboratory-based allergenicity assessment methods such as the skin prick test and RAST (radioallergosorbent test) are often rigorous and time-consuming, so bioinformatics tools have found favor as a means of accelerating the discovery of novel allergens. Guidelines for evaluating the allergenic potential of proteins have been proposed by the joint Food and Agriculture Organization (FAO)/World Health Organization (WHO) Expert Consultation on Allergenicity of Foods Derived from Biotechnology [5]. According to the bioinformatics section of the guidelines, a protein is a potential allergen if it shares either an identical stretch of ≥ 6 contiguous amino acids or ≥ 35% sequence similarity over a window of 80 amino acids with a known allergen.
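The two criteria can be illustrated with a minimal Python sketch. It is not the FAO/WHO reference procedure: the function names are hypothetical, and the 80-residue window is compared position by position without gaps, whereas real implementations align each window. The fraction of identical positions is used here as a stand-in for the 35% threshold.

```python
def shares_exact_kmer(query: str, allergen: str, k: int = 6) -> bool:
    """True if both sequences contain the same run of k contiguous residues."""
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return any(query[i:i + k] in allergen_kmers
               for i in range(len(query) - k + 1))


def max_ungapped_window_identity(query: str, allergen: str,
                                 window: int = 80) -> float:
    """Best fraction of matching positions over all pairs of ungapped windows.

    Simplification (assumption): windows are compared residue by residue,
    and matches are always divided by the full window length.
    """
    best = 0.0
    for i in range(max(1, len(query) - window + 1)):
        q = query[i:i + window]
        for j in range(max(1, len(allergen) - window + 1)):
            a = allergen[j:j + window]
            matches = sum(x == y for x, y in zip(q, a))
            best = max(best, matches / window)
    return best


def flagged_by_fao_who(query, allergens, k=6, window=80, min_fraction=0.35):
    """Flag the query if either criterion is met against any known allergen."""
    return any(
        shares_exact_kmer(query, a, k)
        or max_ungapped_window_identity(query, a, window) >= min_fraction
        for a in allergens
    )
```

Because a single shared hexapeptide is enough to trigger a positive call, a rule of this form is permissive by construction.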

Although useful in some cases, the FAO/WHO joint recommendation has been shown to produce a large number of false positives, resulting in specificities that are too low to be of practical use [6, 7]. To address these drawbacks, more sophisticated bioinformatics tools have been developed. These include support vector machines (SVM) [8], Gaussian classification algorithms [9, 10], wavelet transform models [11], allergen motifs [12], IgE sequence comparisons [13, 14] and the use of allergen-representative peptides (ARP) [15]. While these systems are effective for allergen sequences of high similarity, they are less effective when the overall similarity is low [16].

Position-specific scoring matrices (PSSM) have been very successful for detecting distantly related protein sequences [17–19], but have yet to be applied to assessing allergenic potential in proteins. In this study, we examine the feasibility of using PSSM as a basis for developing an effective allergenicity prediction system. As will be seen below, the use of iterative PSI-BLAST in combination with various filters for accuracy optimization shows great promise for constructing general and group-specific profiles suitable for allergenicity assessment.

Results and discussion

The performance of both profile-based approaches was evaluated using eight different E-value thresholds (Table 1). We consider values of SP ≥ 80% and SE ≥ 80% useful in practice [20] and assessed suitability of both methods using the above cutoffs.

Table 1 Prediction quality of the profile-based methods

General profile model

The predictive performance of the general allergen profile approach is in accordance with expected allergenic patterns in proteins and provided an accuracy (ACC) of greater than 85% (SE > 82%, SP > 85%) for E-value cutoffs of ≤ 10^-1. This approach performs best at the E-value threshold of 10^-9 (ACC = 95.02%). At this threshold, the sensitivity and specificity of the model are 82.45% and 96.92%, respectively.

Group-specific profile model

Allergen sequences are currently classified into 9 major groups by the IUIS Allergen Nomenclature Sub-Committee (http://www.allergen.org): i) weeds, ii) fungi, iii) grasses, iv) trees, v) mites, vi) animals, vii) insects, viii) food, and ix) others [21]. We constructed group-specific profiles based on all 9 major allergen groups, and tested their capability in predicting allergen sequences. As illustrated in Table 1, the approach achieved performance similar to that of the general profile model, and can predict allergens with high accuracy (ACC > 84%, SE > 84%, SP > 84%) at E-value thresholds of ≤ 10^-1. The best performance is observed at the E-value threshold of 10^-9 (ACC = 94.88%). At this threshold, the sensitivity and specificity of the model are 84.04% and 96.52%, respectively, as also stated in the Abstract.
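As a consistency check, accuracy can be recomputed from sensitivity and specificity once the class sizes are known, since ACC = (SE × P + SP × N)/(P + N). The sketch below uses the sensitivity and specificity values given in the Abstract and the testing-set composition given in Methods (302 allergens, 2,000 non-allergens); the function name is hypothetical.

```python
def accuracy_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Overall accuracy implied by SE, SP and the two class sizes."""
    correct = sensitivity * n_pos + specificity * n_neg
    return correct / (n_pos + n_neg)


# SE and SP from the Abstract; class sizes from the Dataset section.
general = accuracy_from_rates(0.8245, 0.9692, 302, 2000)
group_specific = accuracy_from_rates(0.8404, 0.9652, 302, 2000)
print(f"general: {general:.2%}, group-specific: {group_specific:.2%}")
# prints "general: 95.02%, group-specific: 94.88%"
```

Both values agree with the accuracies reported at the 10^-9 threshold. Because non-allergens outnumber allergens by roughly 6.6 to 1, accuracy is dominated by specificity.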

Next, we tested the ability of the group-specific profiles to identify allergens that belong to their respective group (Table 2). Among the 9 group-specific profile models, 7 are capable of predicting allergens with accuracy greater than 80%. The mite profile model achieved the best performance with an accuracy of 95.29% (SE = 90.81%, SP = 95.80%), followed by the grass profile model (ACC = 87.81%, SE = 87.16%, SP = 87.91%) and the insect profile model (ACC = 87.20%, SE = 82.08%, SP = 87.82%). The poorest performance was observed for the food model (ACC = 69.63%, SE = 83.22%, SP = 63.89%). This may be attributable to the fact that food allergens comprise highly diverse protein sequences that share few common features and sequence patterns.

Table 2 Average prediction quality of the group-specific profiles. Performance of group-specific profile models at E-value threshold of 10^-9.

Comparison with existing methods

To benchmark the performance of the profile-based prediction methods, the five testing datasets, each consisting of 302 allergen sequences and 2,000 non-allergen sequences, were used to evaluate six available techniques: the FAO/WHO evaluation scheme [5], the SVM global description approach [8], the SVM amino acid composition approach [14], the SVM dipeptide composition approach [14], the MEME motif discovery tool [12] and the ARP technique [15]. The overall performance of each technique is given by its average performance over the five datasets.

As illustrated in Table 3, both the general and group-specific profile-based models outperform all other existing prediction systems investigated in this study. The SVM amino acid and dipeptide composition methods [14] and the ARP technique [15] achieved high sensitivity (~89%) but low specificity (~57%). The SVM global description approach [8] came closest to the profile-based models in terms of accuracy (~93%); however, it exhibits high specificity (~95%) but low sensitivity (~77%). The MEME motif discovery approach produced the lowest sensitivity (1.26%), which is lower than the reported sensitivity of 7% (at an E-value of 0.001) [12]. This may be due to two reasons: i) differences in the testing dataset; and ii) the failure of the derived MEME motifs to capture essential features in allergen sequences. In agreement with previous reports [6, 7], the FAO/WHO evaluation scheme predicts allergens with low specificity (23.31%) and low accuracy (31.58%). In contrast to PSSM, the FAO/WHO similarity-based evaluation scheme incorrectly predicts a large proportion of proteins derived from bacteria (37%), viruses (9%) and yeasts (9%) as positives. It is possible that some of these proteins contain Ig-binding epitopes, though they do not necessarily demonstrate IgE binding. Among the false negatives, the majority are distant homologues derived from fungi (39%), food (23%) and insects (9%).

Table 3 Comparison of the performance between the profile-based methods and existing allergenicity prediction systems

Conclusion

We have shown that profile-based methods are highly promising for assessing potential allergenicity and cross-reactivity in proteins, with sensitivities and specificities of over 80%. The strength of such models lies in their ability to detect distantly related protein homologues through the use of iterated profiles [17–19]. To date, the exact mechanisms of allergy remain unclear, as the structural, functional or biochemical properties of allergens that lead to allergic responses have yet to be elucidated. The allergen profiles constructed in this study may also be used as a basis for identifying common amino acid residues or physicochemical properties that support allergenicity [20].

Methods

Dataset

The training and testing datasets consist of 11,510 non-redundant sequences (1,510 experimentally verified allergens and 10,000 putative non-allergens). Known allergen protein sequences were extracted from Swiss-Prot [23], GenBank [24], the Allergen Nomenclature database of the International Union of Immunological Societies (IUIS) [21], Allergome [25], the Food Allergy Research and Resource Program (FARRP) Protein AllergenOnline Database [7] and the Structural Database of Allergen Proteins (SDAP) [13]. The distribution of the allergen data used in this study is illustrated in Figure 1. An initial list of protein sequences unlikely to be associated with allergy was generated by extracting all protein sequences from Swiss-Prot with the exception of entries containing the text strings 'allergen', 'allergy', 'atopy' or derivatives thereof in the annotation [9]. From this list, 10,000 putative non-allergens were randomly selected for model construction. Only one putative non-allergen sequence was extracted from each protein family to avoid bias.
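The non-allergen selection described above can be sketched in Python as follows. This is an illustration only, not the original PERL implementation: the record layout (dictionaries with `annotation` and `family` keys), the function names and the regular expression are assumptions.

```python
import random
import re

# Matches 'allergen', 'allergy', 'allergic', 'atopy', 'atopic', ...
# The word boundary keeps unrelated words such as 'hematopoietic' from matching.
ALLERGY_TERMS = re.compile(r"\b(allerg|atop)\w*", re.IGNORECASE)


def looks_allergy_related(annotation: str) -> bool:
    """True if the annotation mentions an allergy-related term."""
    return ALLERGY_TERMS.search(annotation) is not None


def sample_putative_non_allergens(records, n, seed=0):
    """Drop allergy-related entries, keep one entry per family, then sample n."""
    rng = random.Random(seed)
    by_family = {}
    for record in records:
        if not looks_allergy_related(record["annotation"]):
            by_family.setdefault(record["family"], []).append(record)
    # One randomly chosen representative per protein family.
    one_per_family = [rng.choice(members) for members in by_family.values()]
    return rng.sample(one_per_family, min(n, len(one_per_family)))
```

A keyword filter of this kind yields putative non-allergens only: an entry that lacks these terms is not thereby shown to be non-allergenic.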

The dataset was shuffled randomly and partitioned into five sets for five-fold cross validation, each time using one set for testing and the remaining four sets for training. Each training set contains 1,208 experimentally determined allergens and 8,000 non-allergens, while each testing set contains 302 experimentally determined allergens and 2,000 non-allergens.
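The set sizes quoted above are what a class-wise (stratified) split of 1,510 allergens and 10,000 non-allergens into five parts produces. The sketch below assumes such a stratified split; the text states only that the data were shuffled and partitioned, so the stratification and the function name are assumptions.

```python
import random


def stratified_folds(positives, negatives, k=5, seed=0):
    """Shuffle each class separately and deal it into k equal-sized folds."""
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Fold i receives every k-th item starting at offset i.
    return [(pos[i::k], neg[i::k]) for i in range(k)]


# Integer IDs stand in for the actual sequences.
folds = stratified_folds(range(1510), range(10000))
for test_pos, test_neg in folds:
    train_pos = 1510 - len(test_pos)
    train_neg = 10000 - len(test_neg)
    print(len(test_pos), len(test_neg), train_pos, train_neg)
```

Each of the five lines printed reads `302 2000 1208 8000`, matching the testing and training set sizes given in the text.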

Profile-based methods

The general strategy of our iterative profile-based methods is shown in Figure 2. Allergen profiles are generated and optimized using sequences in the training set while sequences in the testing set are used to evaluate the overall performance of the profile-based methods. The system is implemented using the NCBI BLAST package [17] and PERL scripts.

Method 1: general allergen profiles

This method predicts potential allergens by performing an RPS-BLAST search against a database of general allergen profiles optimized for accuracy and performance. The construction of allergen profiles involves an initial screening step and a subsequent optimization step, as outlined in Figure 3.

During the initial screening step, a PSI-BLAST search (10 iterations, e-value threshold 10^-3) was performed on each allergen sequence in the training set against all other allergen sequences in the dataset. This generates a profile or PSSM for each allergen protein sequence. In this study, a minimum of two sequences was used for constructing a profile.
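For readers who wish to reproduce the screening step with current software, the sketch below assembles an equivalent command for the `psiblast` program of NCBI BLAST+. This is a substitution: the original system used the NCBI BLAST package of the time, driven by PERL scripts. The file and database names are placeholders, and mapping the stated e-value threshold onto the profile-inclusion threshold (`-inclusion_ethresh`) is an assumption, since the text does not say whether the inclusion or the reporting threshold is meant.

```python
import shutil
import subprocess


def build_psiblast_command(query_fasta, database, pssm_out,
                           iterations=10, inclusion_ethresh=1e-3):
    """Argument list for one BLAST+ psiblast run that saves the final PSSM."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),
        "-inclusion_ethresh", str(inclusion_ethresh),
        "-out_pssm", pssm_out,
        "-save_pssm_after_last_round",
    ]


def run_psiblast(command):
    """Run the command, failing early if BLAST+ is not installed."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("psiblast not found on PATH")
    subprocess.run(command, check=True)


# Placeholder names; nothing is executed here.
command = build_psiblast_command("allergen_0001.fasta", "training_allergens",
                                 "allergen_0001.pssm")
print(" ".join(command))
```

Keeping command construction separate from execution allows the same builder to be reused for each e-value threshold in the optimization step.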

In the optimization step, another round of PSI-BLAST searches was performed on each of the selected allergen sequences using eight different e-value thresholds (10, 1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-6 and 10^-9). This generates eight profiles for each allergen sequence, one for each e-value threshold. Each of the eight profiles was tested by RPS-BLAST using allergen sequences in the training set as queries. For each allergen sequence in the training dataset, the best profile (the one with the highest accuracy) was selected and incorporated into the predictive model. This approach produces a collection of general allergen profiles optimized for accuracy and performance.
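The selection rule of the optimization step can be sketched as follows. The accuracy values in the example are invented for illustration, the function name is hypothetical, and the tie-breaking rule (prefer the more stringent threshold) is an assumption; the text does not state how ties were resolved.

```python
E_VALUE_THRESHOLDS = (10, 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-6, 1e-9)


def select_best_thresholds(accuracy):
    """Pick, per sequence, the threshold whose profile scored highest.

    accuracy maps sequence ID -> {e-value threshold: training accuracy}.
    Ties go to the smaller (more stringent) threshold, which is an assumption.
    """
    best = {}
    for seq_id, by_threshold in accuracy.items():
        best[seq_id] = max(by_threshold,
                           key=lambda t: (by_threshold[t], -t))
    return best


# Invented accuracies for two hypothetical training sequences.
toy_accuracy = {
    "seqA": {10: 0.70, 1e-3: 0.90, 1e-9: 0.90},
    "seqB": {1: 0.80, 1e-1: 0.75},
}
chosen = select_best_thresholds(toy_accuracy)
```

Here `chosen` maps `seqA` to 1e-9 (the tie at 0.90 is broken toward the stricter threshold) and `seqB` to 1.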

Method 2: group-specific allergen profiles

This method predicts protein allergenicity by performing an RPS-BLAST search against a database of group-specific allergen profiles optimized for accuracy and performance.
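In both methods the final call rests on the RPS-BLAST hits of a query against the profile database. A plausible decision rule, assumed here rather than stated in the text, is to label a query a potential allergen when its best hit has an E-value at or below the chosen cutoff. The sketch parses BLAST tabular output (`-outfmt 6` in BLAST+, where the E-value is the eleventh column); the function names and example hits are hypothetical.

```python
def best_evalue_per_query(tabular_lines):
    """Lowest E-value seen for each query in BLAST tabular output."""
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if query_id not in best or evalue < best[query_id]:
            best[query_id] = evalue
    return best


def predict_allergens(query_ids, tabular_lines, cutoff=1e-9):
    """True for queries whose best profile hit meets the E-value cutoff."""
    best = best_evalue_per_query(tabular_lines)
    # Queries with no hit at all are treated as non-allergens.
    return {q: best.get(q, float("inf")) <= cutoff for q in query_ids}


# Invented hits against hypothetical profiles profA and profB.
hits = [
    "q1\tprofA\t45.0\t100\t55\t0\t1\t100\t1\t100\t2e-15\t80.1",
    "q1\tprofB\t30.0\t90\t63\t0\t1\t90\t1\t90\t0.5\t25.0",
    "q2\tprofA\t28.0\t60\t43\t0\t1\t60\t1\t60\t0.003\t30.2",
]
calls = predict_allergens(["q1", "q2", "q3"], hits)
```

With the default cutoff of 10^-9, only `q1` is called a potential allergen; `q2` has a hit that is too weak and `q3` has none.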

Allergen sequences in the training set were partitioned into nine groups (i) weeds, ii) fungi, iii) grasses, iv) trees, v) mites, vi) animals, vii) insects, viii) food, and ix) others), according to the recommendation of the IUIS Allergen Nomenclature Sub-Committee [21]. For the screening phase, PSI-BLAST was performed using each group of allergens separately as the training dataset. This generates profiles specific to each group of allergens, which are subsequently optimized according to their predictive accuracy and used for constructing group-specific allergenicity prediction systems.

Performance measures

The predictive performance of the general and group-specific models was evaluated using sensitivity (SE), specificity (SP), accuracy (ACC), positive predictive value (PPV), negative predictive value (NPV), and Matthews correlation coefficient (MCC) [26]. For the group-specific models, the positive dataset consists of testing allergen sequences belonging to a specified group, whereas the negative dataset consists of all other allergen sequences in the testing set. SE = TP/(TP+FN), SP = TN/(TN+FP) and ACC = (TP+TN)/N indicate the percentages of correctly predicted allergens, non-allergens and all proteins, respectively, where N = TP+TN+FP+FN is the total number of test sequences. PPV = TP/(TP+FP) and NPV = TN/(TN+FN) denote the proportions of predicted allergens and predicted non-allergens that are correct, respectively. TP (true positives) denotes known allergens predicted as allergens, and TN (true negatives) denotes non-allergens predicted as non-allergens. FN (false negatives) denotes known allergens predicted as non-allergens, and FP (false positives) denotes non-allergens predicted as allergens. The MCC, which measures how far the prediction departs from random, is defined as follows:

\mathrm{MCC} = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TN + FN)(TP + FN)(TN + FP)(TP + FP)}}

The MCC returns a value between -1 and 1: MCC = 1 for 100% agreement of the prediction, MCC = 0 for completely random prediction and MCC = -1 for 100% disagreement of the prediction.
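The six measures follow directly from the four confusion-matrix counts. The sketch below implements the formulas given above; the function name and the example counts are illustrative and are not taken from the study. It assumes both classes are non-empty and that at least one sequence is predicted in each class, so that no ratio has a zero denominator; the MCC is returned as 0 when its denominator vanishes.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """SE, SP, ACC, PPV, NPV and MCC from confusion-matrix counts."""
    n = tp + tn + fp + fn
    denom = math.sqrt((tn + fn) * (tp + fn) * (tn + fp) * (tp + fp))
    return {
        "SE": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "ACC": (tp + tn) / n,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "MCC": ((tp * tn) - (fn * fp)) / denom if denom else 0.0,
    }


# Illustrative counts only: 10 positives and 90 negatives.
metrics = classification_metrics(tp=8, tn=85, fp=5, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example the accuracy is 0.930 while the MCC is roughly 0.66, which shows why the MCC is the more informative summary when the negative class is much larger than the positive class.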

Five-fold cross validation

Five-fold cross validation was performed to assess the quality of all predictive models described in this study [20]. In k-fold cross-validation, k random, (approximately) equal-sized, disjoint partitions of the sample data are constructed, and a given model is trained on (k-1) partitions and tested on the excluded partition. The results are averaged after k such experiments, and the observed error rate may be taken as an estimate of the error rate expected upon generalization to new data.
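The procedure can be written as a short generic loop. In this sketch `train_and_score` is a hypothetical callable standing in for profile construction followed by evaluation on the held-out fold; it is not part of the original system.

```python
from statistics import mean


def cross_validate(folds, train_and_score):
    """Hold out each fold in turn, train on the rest, and average the scores."""
    scores = []
    for i, test_fold in enumerate(folds):
        # Training data: every item from every fold except the held-out one.
        training = [item for j, fold in enumerate(folds) if j != i
                    for item in fold]
        scores.append(train_and_score(training, test_fold))
    return mean(scores), scores


# Toy example: the "score" is simply the fraction of data used for training.
toy_folds = [[1, 2], [3, 4], [5, 6]]
average, per_fold = cross_validate(
    toy_folds, lambda train, test: len(train) / (len(train) + len(test)))
```

Reporting the per-fold scores alongside the average makes it possible to see how much the estimate varies between partitions.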

References

1. Mekori YA: Introduction to allergic diseases. Crit Rev Food Sci Nutr 1996, 36: S1–18.
2. Nieuwenhuizen NE, Lopata AL: Fighting food allergy: current approaches. Ann NY Acad Sci 2005, 1056: 30–45. 10.1196/annals.1352.003
3. Goodman RE, Hefle SL, Taylor SL, Ree RV: Assessing genetically modified crops to minimize the risk of increased food allergy: a review. Int Arch Allergy Immunol 2005, 137: 153–166. 10.1159/000086314
4. Heppenheimer TA: The growth of genetically modified foods. Am Herit Invent Technol 2003, 19: 16–25.
5. FAO/WHO: Codex Principles and Guidelines on Foods Derived from Biotechnology. Joint FAO/WHO Food Standards Programme, Rome, Italy; 2003.
6. Fiers MW, Kleter GA, Nijland H, Peijnenberg AA, Nap JP, van Ham RC: Allermatch, a webtool for the prediction of potential allergenicity according to current FAO/WHO Codex alimentarius guidelines. BMC Bioinformatics 2004, 5: 133. 10.1186/1471-2105-5-133
7. Hileman RE, Silvanovich A, Goodman RE, Rice EA, Holleschak G, Astwood JD, et al.: Bioinformatic methods for allergenicity assessment using a comprehensive ALLERGEN database. Int Arch Allergy Immunol 2002, 128: 280–291. 10.1159/000063861
8. Cui J, Han LY, Li H, Ung CY, Tang ZQ, Zheng CJ, Cao ZW, Chen YZ: Computer prediction of allergen proteins from sequence-derived protein structural and physicochemical properties. Mol Immunol 2007, 4: 514–520. 10.1016/j.molimm.2006.02.010
9. Soeria-Atmadja D, Zorzet A, Gustafsson MG, Hammerling U: Statistical evaluation of local alignment features predicting allergenicity using supervised classification algorithms. Int Arch Allergy Immunol 2004, 133: 101–112. 10.1159/000076382
10. Zorzet A, Gustafsson M, Hammerling U: Prediction of food protein allergenicity: a bioinformatic learning systems approach. In Silico Biol 2002, 2: 525–534.
11. Li KB, Isaac P, Krishnan A: Predicting allergenic proteins using wavelet transform. Bioinformatics 2004, 20: 2572–2578. 10.1093/bioinformatics/bth286
12. Stadler MB, Stadler BM: Allergenicity prediction by protein sequence. FASEB J 2003, 17: 1141–1143.
13. Ivanciuc O, Schein CH, Braun W: SDAP: database and computational tools for allergenic proteins. Nucleic Acids Res 2003, 31: 359–362. 10.1093/nar/gkg010
14. Saha S, Raghava GPS: AlgPred: prediction of allergenic proteins and mapping of IgE epitopes. Nucleic Acids Res 2006, 34: W202-W209. 10.1093/nar/gkl343
15. Björklund AK, Soeria-Atmadja D, Zorzet A, Hammerling U, Gustafsson MG: Supervised identification of allergen-representative peptides for in silico detection of potentially allergenic proteins. Bioinformatics 2005, 21: 39–50. 10.1093/bioinformatics/bth477
16. Tong JC, Tammi MT: Methods and protocols for the assessment of protein allergenicity and cross-reactivity. Front Biosci 2008, 13: 4882–4888. 10.2741/3047
17. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
18. Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292: 195–202. 10.1006/jmbi.1999.3091
19. Xie D, Li A, Wang M, Fan Z, Feng H: LOCSVMPSI: a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST. Nucleic Acid Res 2005, 33: 105–110. 10.1093/nar/gki359
20. Tong JC, Zhang GL, Tan TW, August JT, Brusic V, Ranganathan S: Prediction of HLA-DQ3.2β ligands: Evidence of multiple registers in class II binding peptides. Bioinformatics 2006, 22: 1232–1238. 10.1093/bioinformatics/btl071
21. King TP, Hoffman D, Lowenstein H, Marsh DG, Platts-Mills TA, Thomas W: Allergen nomenclature. WHO/IUIS allergen nomenclature subcommittee. Int Arch Allergy Immunol 1994, 105: 224–233.
22. Breiteneder H, Mills EN: Molecular properties of food allergens. J Allergy Clin Immunol 2005, 115: 14–23. 10.1016/j.jaci.2004.10.022
23. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, et al.: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31: 365–370. 10.1093/nar/gkg095
24. Dennis AB, Ilene KM, David JL, James O, David LW: Genbank. Nucleic Acid Res 2005, 33: D34-D38. 10.1093/nar/gni032
25. Mari A, Mari V, Ronconi A: Allergome – a database of Allergenic molecules: structure and data implementations of a web-based resource. J Allergy Clin Immunol 2005, 115: S87. 10.1016/j.jaci.2004.12.359
26. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 2000, 16: 412–424. 10.1093/bioinformatics/16.5.412

Acknowledgements

This work was supported in part by grant R-154-000-265-112 from the National University of Singapore.

This article has been published as part of BMC Bioinformatics Volume 9 Supplement 12, 2008: Asia Pacific Bioinformatics Network (APBioNet) Seventh International Conference on Bioinformatics (InCoB2008). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S12.

Author information

Authors and Affiliations

Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, 8 Medical Drive, Singapore, 117597
Shen Jean Lim & Martti T Tammi
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore, 119613
Joo Chuan Tong
Department of Biological Sciences, National University of Singapore, 14 Science Drive 4, Singapore, 117543
Fook Tim Chew & Martti T Tammi
Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Stockholm, 17177, Sweden
Martti T Tammi

Corresponding author

Correspondence to Martti T Tammi.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

MTT designed the study, while all authors performed the experiments and the analyses. All authors have read and approved this manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Lim, S.J., Tong, J.C., Chew, F.T. et al. The value of position-specific scoring matrices for assessment of protein allegenicity. BMC Bioinformatics 9 (Suppl 12), S21 (2008). https://doi.org/10.1186/1471-2105-9-S12-S21
