Background

Atopic allergy and other forms of hypersensitivity reactions pose a major concern for public health, affecting up to 25% of the population in industrial nations [1, 2]. With the rapid growth in the number of genetically modified (GM) foods, biopharmaceuticals and other biotechnology-derived products, identifying potential allergenicity in proteins has become crucial in product safety assessment [3, 4].

Laboratory-based allergenicity assessment methods such as the skin prick test and the radioallergosorbent test (RAST) are often rigorous and time-consuming, so bioinformatics tools have been adopted to accelerate the discovery of novel allergens. Guidelines for evaluating the allergenic potential of proteins have been proposed by the joint Food and Agriculture Organization (FAO)/World Health Organization (WHO) Expert Consultation on Allergenicity of Foods Derived from Biotechnology [5]. According to the bioinformatics section of the guidelines, a protein is a potential allergen if it shares either an identical stretch of ≥ 6 contiguous amino acids or ≥ 35% sequence similarity over a window of 80 amino acids with a known allergen.
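The two criteria can be illustrated with a minimal Python sketch. It is not the procedure used in the guidelines or in this study: the window comparison here is ungapped and counts identical residues as a stand-in for similarity, whereas practical implementations rely on an alignment program. The function names and the handling of sequences shorter than the window are assumptions made for illustration.

```python
def shares_exact_kmer(query: str, allergen: str, k: int = 6) -> bool:
    """True if the two sequences share an identical run of k residues."""
    kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return any(query[i:i + k] in kmers for i in range(len(query) - k + 1))


def max_window_identity(query: str, allergen: str, window: int = 80) -> float:
    """Best ungapped identity between any two equally long segments.

    Assumption: if either sequence is shorter than `window`, the shorter
    length is used for the comparison but the score is still divided by
    `window`, which keeps short matches from being over-rated.
    """
    w = min(window, len(query), len(allergen))
    if w == 0:
        return 0.0
    best = 0
    for i in range(len(query) - w + 1):
        for j in range(len(allergen) - w + 1):
            matches = sum(a == b for a, b in zip(query[i:i + w], allergen[j:j + w]))
            best = max(best, matches)
    return best / window


def fao_who_flag(query, allergens, k=6, window=80, min_identity=0.35):
    """Flag the query if either criterion holds against any known allergen."""
    return any(
        shares_exact_kmer(query, a, k)
        or max_window_identity(query, a, window) >= min_identity
        for a in allergens
    )
```

Because a single shared hexapeptide is enough to trigger the first criterion, a rule of this kind flags many unrelated proteins, which is consistent with the low specificity discussed in the next paragraph.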

Although useful in some cases, the FAO/WHO joint recommendation has been shown to produce a large number of false positives, resulting in specificities that are too low to be of practical use [6, 7]. To address these drawbacks, more sophisticated bioinformatics tools have been developed. These include support vector machines (SVM) [8], Gaussian classification algorithms [9, 10], wavelet transform models [11], allergen motifs [12], IgE sequence comparisons [13, 14] and the use of allergen-representative peptides (ARP) [15]. While these systems are effective for allergen sequences of high similarity, they are less effective when the overall similarity is low [16].

Position-specific scoring matrices (PSSM) have been very successful for detecting distantly related protein sequences [17-19], but have yet to be applied to assessing the allergenic potential of proteins. In this study, we examine the feasibility of using PSSM as a basis for developing an effective allergenicity prediction system. As shown below, iterative PSI-BLAST searches combined with various filters for accuracy optimization show great promise for constructing general and group-specific profiles suitable for allergenicity assessment.

Results and discussion

The performance of both profile-based approaches was evaluated using eight different E-value thresholds (Table 1). We consider values of SP ≥ 80% and SE ≥ 80% useful in practice [20] and assessed suitability of both methods using the above cutoffs.

Table 1 Prediction quality of the profile-based methods

General profile model

The predictive performance of the general allergen profile approach is in accordance with expected allergenic patterns in proteins, providing an accuracy (ACC) of greater than 85% (SE > 82%, SP > 85%) for E-value cutoffs of ≤ 10^-1. The approach performs best at an E-value threshold of 10^-9 (ACC = 95.02%). At this threshold, the sensitivity and specificity of the model are 82.45% and 96.92%, respectively.
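As a consistency check, the accuracy can be recovered from the sensitivity and specificity once the class sizes are known. Assuming each testing set contains 302 allergens and 2,000 non-allergens, as described in the Methods, and writing N+ and N- for these two counts (symbols introduced here for illustration only), the values reported at the 10^-9 threshold give:

```latex
\mathrm{ACC}
  = \frac{\mathrm{SE}\cdot N_{+} + \mathrm{SP}\cdot N_{-}}{N_{+} + N_{-}}
  = \frac{0.8245 \times 302 + 0.9692 \times 2000}{2302}
  \approx 0.9502
```

This agrees with the reported ACC of 95.02%. It also shows that, with roughly 6.6 non-allergens per allergen in each testing set, accuracy is dominated by specificity.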

Group-specific profile model

Allergen sequences are currently classified into 9 major groups by the IUIS Allergen Nomenclature Sub-Committee (http://www.allergen.org): i) weeds, ii) fungi, iii) grasses, iv) trees, v) mites, vi) animals, vii) insects, viii) food, and ix) others [21]. We constructed group-specific profiles based on all 9 major allergen groups, and tested their capability in predicting allergen sequences. As illustrated in Table 1, the approach achieved performance similar to that of the general profile model, and can predict allergens with high accuracy (ACC > 84%, SE > 84%, SP > 84%) at E-value thresholds of ≤ 10^-1. The best performance is observed at the E-value threshold of 10^-9 (ACC = 94.88%). At this threshold, the sensitivity and specificity of the model are 84.04% and 96.52%, respectively; this assignment is the one consistent with the reported accuracy for testing sets of 302 allergens and 2,000 non-allergens.

Next, we tested the ability of the group-specific profiles to identify allergens that belong to their respective group category (Table 2). Among the 9 group-specific profile models, 7 are capable of predicting allergens with an accuracy greater than 80%. The mite profile model achieved the best performance with an accuracy of 95.29% (SE = 90.81%, SP = 95.80%), followed by the grass profile model (ACC = 87.81%, SE = 87.16%, SP = 87.91%) and the insect profile model (ACC = 87.20%, SE = 82.08%, SP = 87.82%). The poorest performance was observed for the food model (ACC = 69.63%, SE = 83.22%, SP = 63.89%). This may be attributable to the fact that food allergens comprise highly diverse protein sequences that share few common features and sequence patterns.

Table 2 Average prediction quality of the group-specific profiles. Performance of group-specific profile models at an E-value threshold of 10^-9.

Comparison with existing methods

To benchmark the performance of the profile-based prediction methods, the five testing datasets, each consisting of 302 allergen sequences and 2,000 non-allergen sequences, were used to evaluate six available techniques: the FAO/WHO evaluation scheme [5], the SVM global description approach [8], the SVM amino acid composition approach [14], the SVM dipeptide composition approach [14], the MEME motif discovery tool [12] and the ARP technique [15]. The overall performance of each technique is given by its average performance over the five datasets.

As illustrated in Table 3, both the general and the group-specific profile-based models outperform all other existing prediction systems investigated in this study. The SVM amino acid and dipeptide composition methods [14] and the ARP technique [15] achieved high sensitivity (~89%) but low specificity (~57%). The SVM global description approach [8] came closest to the profile-based models in terms of accuracy (~93%). However, it exhibits high specificity (~95%) but low sensitivity (~77%). The MEME motif discovery approach produced the lowest sensitivity (1.26%), which is lower than the reported sensitivity of 7% (at an E-value of 0.001) [12]. There are two possible reasons for this: i) differences in the testing dataset; and ii) the derived MEME motifs did not capture essential features of allergen sequences. In agreement with previous reports [6, 7], the FAO/WHO evaluation scheme predicts allergens with low specificity (23.31%) and low accuracy (31.58%). In contrast to the PSSM-based models, the FAO/WHO similarity-based evaluation scheme incorrectly predicts a large proportion of proteins derived from bacteria (37%), viruses (9%) and yeasts (9%) as positives. It is possible that some of these proteins contain Ig-binding epitopes, although they do not necessarily demonstrate IgE binding. Among the false negatives, the majority are distant homologues derived from fungi (39%), food (23%) and insects (9%).

Table 3 Comparison of the performance between the profile-based methods and existing allergenicity prediction systems

Conclusion

Profile-based methods are shown to be highly promising for assessing potential allergenicity and cross-reactivity in proteins, with sensitivities and specificities of over 80%. The strength of such models lies in their ability to detect distantly related protein homologues through the use of iterated profiles [17-19]. To date, the exact mechanisms of allergy remain unclear, as the structural, functional or biochemical properties of allergens that lead to allergic responses have yet to be elucidated. The allergen profiles constructed in this study may also be used as a basis for identifying common amino acid residues or physicochemical properties that support allergenicity [20].

Methods

Dataset

The combined training and testing dataset consists of 11,510 non-redundant sequences (1,510 experimentally verified allergens and 10,000 putative non-allergens). Known allergen protein sequences were extracted from Swiss-Prot [23], GenBank [24], the Allergen Nomenclature database of the International Union of Immunological Societies (IUIS) [21], Allergome [25], the Food Allergy Research and Resource Program (FARRP) Protein AllergenOnline Database [7] and the Structural Database of Allergen Proteins (SDAP) [13]. The distribution of the allergen data used in this study is illustrated in Figure 1. An initial list of protein sequences unlikely to be associated with allergy was generated by extracting all protein sequences from Swiss-Prot, with the exception of entries containing the text strings 'allergen', 'allergy', 'atopy' or derivatives thereof in the annotation [9]. From this list, 10,000 putative non-allergens were randomly selected for model construction. Only one putative non-allergen sequence was extracted from each protein family to avoid bias.

Figure 1 Distribution of the allergen data used in this study.

The dataset was shuffled randomly and partitioned into five sets for five-fold cross validation, each time using one set for testing and the remaining four sets for training. Each training set contains 1,208 experimentally determined allergens and 8,000 non-allergens, while each testing set contains 302 experimentally determined allergens and 2,000 non-allergens.
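The partitioning can be sketched as follows. Since every testing set holds exactly 302 allergens and 2,000 non-allergens, the sketch assumes that the two classes were shuffled and split separately; the function name, the fixed seed and the dictionary layout are illustrative and not taken from the original implementation.

```python
import random


def stratified_folds(allergens, non_allergens, k=5, seed=0):
    """Shuffle each class separately and split it into k train/test partitions."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    pos, neg = list(allergens), list(non_allergens)
    rng.shuffle(pos)
    rng.shuffle(neg)
    splits = []
    for i in range(k):
        # Every k-th item (offset i) forms the testing set for fold i.
        test_pos, test_neg = pos[i::k], neg[i::k]
        train_pos = [s for j, s in enumerate(pos) if j % k != i]
        train_neg = [s for j, s in enumerate(neg) if j % k != i]
        splits.append({"train": (train_pos, train_neg),
                       "test": (test_pos, test_neg)})
    return splits
```

With 1,510 allergens and 10,000 non-allergens as input, each of the five splits contains 1,208 + 8,000 training sequences and 302 + 2,000 testing sequences, matching the counts given above.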

Profile-based methods

The general strategy of our iterative profile-based methods is shown in Figure 2. Allergen profiles are generated and optimized using sequences in the training set while sequences in the testing set are used to evaluate the overall performance of the profile-based methods. The system is implemented using the NCBI BLAST package [17] and PERL scripts.

Figure 2 General strategy of the profile-based method. The general strategy involves performing an RPS-BLAST search with the query protein against a searchable database of allergen profiles generated by PSI-BLAST. Query sequences that generate hits satisfying the specified E-value threshold (that is, hits whose E-value does not exceed the cutoff) are predicted to be potential allergens.
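The decision step can be sketched in Python as follows. The sketch assumes that the search results are available in the standard 12-column tab-separated BLAST layout, in which the first column is the query identifier and the eleventh is the E-value, and that a hit counts when its E-value is at or below the cutoff. The function name and the default cutoff are illustrative; the original system was implemented with Perl scripts.

```python
def predict_allergens(tabular_lines, evalue_cutoff=1e-9):
    """Return the set of query IDs with at least one hit at or below the cutoff."""
    best = {}  # query ID -> smallest E-value seen so far
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        best[query_id] = min(evalue, best.get(query_id, float("inf")))
    return {q for q, e in best.items() if e <= evalue_cutoff}
```

Queries without any reported hit never enter the dictionary and are therefore treated as predicted non-allergens.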

Method 1: general allergen profiles

This method predicts potential allergens by performing an RPS-BLAST search against a database of general allergen profiles optimized for accuracy and performance. The construction of allergen profiles involves an initial screening step and a subsequent optimization step, as outlined in Figure 3.

Figure 3 Schematic representation of how allergen profiles are constructed in this study. The development of this approach consists of A) a preliminary screening step and B) an optimization step.

During the initial screening step, a PSI-BLAST search (10 iterations, E-value threshold 10^-3) was performed on each allergen sequence in the training set against all other allergen sequences in the dataset. This generates a profile or PSSM for each allergen protein sequence. In this study, a minimum of two sequences was used for constructing a profile.

In the optimization step, another round of PSI-BLAST searches was performed on each of the selected allergen sequences using eight different E-value thresholds (10, 1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-6 and 10^-9). This generates eight profiles for each allergen sequence, one per E-value threshold. Each of the eight profiles was tested by RPS-BLAST using the allergen sequences in the training set as queries. For each allergen sequence in the training dataset, the best profile (the one with the highest accuracy) was selected and incorporated into the predictive model. This approach produces a collection of general allergen profiles optimized for accuracy and performance.
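The selection step can be sketched in Python under the assumption that the accuracy of each candidate profile has already been measured on the training set. The input layout (a mapping from seed allergen to per-threshold accuracy), the function name and the tie-breaking rule are hypothetical and not described in the original method.

```python
# The eight E-value thresholds listed in the text.
E_VALUE_THRESHOLDS = [10, 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-6, 1e-9]


def select_best_profiles(accuracy_by_seed):
    """For each seed allergen, keep the threshold whose profile scored best.

    accuracy_by_seed maps seed ID -> {E-value threshold: training accuracy}.
    Ties go to the threshold listed first (an arbitrary choice for this sketch).
    """
    return {
        seed: max(by_threshold, key=by_threshold.get)
        for seed, by_threshold in accuracy_by_seed.items()
    }
```

For example, a seed whose profiles score 0.70 at threshold 10, 0.90 at 10^-3 and 0.85 at 10^-9 would contribute its 10^-3 profile to the final database.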

Method 2: group-specific allergen profiles

This method predicts protein allergenicity by performing an RPS-BLAST search against a database of group-specific allergen profiles optimized for accuracy and performance.

Allergen sequences in the training set were partitioned into nine groups (i) weeds, ii) fungi, iii) grasses, iv) trees, v) mites, vi) animals, vii) insects, viii) food, and ix) others), according to the recommendation of the IUIS Allergen Nomenclature Sub-Committee [21]. For the screening phase, PSI-BLAST was performed separately within each of the 9 major groups, using the allergens of that group as the training dataset. This generates profiles specific to each group of allergens, which are subsequently optimized according to their predictive accuracy and used for constructing group-specific allergenicity prediction systems.

Performance measures

The predictive performance of the general and group-specific models was evaluated using sensitivity (SE), specificity (SP), accuracy (ACC), positive predictive value (PPV), negative predictive value (NPV), and Matthews correlation coefficient (MCC) [26]. For the latter (the group-specific models), the positive dataset consists of the testing allergen sequences belonging to a specified group, whereas the negative dataset consists of all other allergen sequences in the testing set. SE = TP/(TP+FN), SP = TN/(TN+FP) and ACC = (TP+TN)/N, where N = TP+TN+FP+FN, indicate the percentages of correctly predicted allergens, non-allergens and all proteins, respectively. PPV = TP/(TP+FP) and NPV = TN/(TN+FN) denote the proportions of predicted allergens and predicted non-allergens that are correct, respectively. TP (true positives) denotes known allergens predicted as allergens, and TN (true negatives) denotes non-allergens predicted as non-allergens. FN (false negatives) denotes known allergens predicted as non-allergens, and FP (false positives) denotes non-allergens predicted as allergens. The MCC, which measures how far the prediction departs from a random assignment, is defined as follows:

\[
\mathrm{MCC} = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TN + FN)(TP + FN)(TN + FP)(TP + FP)}}
\]

The MCC returns a value between -1 and 1: MCC = 1 for 100% agreement of the prediction, MCC = 0 for completely random prediction and MCC = -1 for 100% disagreement of the prediction.
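The measures defined in this section can be computed from the four confusion-matrix counts, as in the following Python sketch. The function names are illustrative, and returning 0.0 when a denominator is zero is a convention assumed here, not one stated in the original method.

```python
import math


def _ratio(numerator, denominator):
    """Divide, returning 0.0 for an empty denominator (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def performance(tp, tn, fp, fn):
    """Compute SE, SP, ACC, PPV, NPV and MCC from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tn + fn) * (tp + fn) * (tn + fp) * (tp + fp))
    return {
        "SE": _ratio(tp, tp + fn),    # fraction of allergens recovered
        "SP": _ratio(tn, tn + fp),    # fraction of non-allergens rejected
        "ACC": _ratio(tp + tn, tp + tn + fp + fn),
        "PPV": _ratio(tp, tp + fp),   # predicted allergens that are correct
        "NPV": _ratio(tn, tn + fn),   # predicted non-allergens that are correct
        "MCC": _ratio((tp * tn) - (fn * fp), mcc_denominator),
    }
```

A perfect classifier (FP = FN = 0) yields MCC = 1, and a classifier that inverts every label (TP = TN = 0) yields MCC = -1, matching the range described above.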

Five-fold cross validation

Five-fold cross validation was performed to assess the quality of all predictive models described in this study [20]. In k-fold cross-validation, k random, (approximately) equal-sized, disjoint partitions of the sample data are constructed, and a given model is trained on (k-1) partitions and tested on the excluded partition. The results are averaged after k such experiments, and the observed error rate may be taken as an estimate of the error rate expected upon generalization to new data.
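The procedure can be sketched generically in Python. The `train` and `evaluate` callables are placeholders for whatever model-building and scoring steps are used; they are assumptions of this sketch and do not correspond to functions in the original implementation.

```python
from statistics import mean


def cross_validate(folds, train, evaluate):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    scores = []
    for i, test_fold in enumerate(folds):
        # Pool every fold except the i-th into the training data.
        training = [item for j, fold in enumerate(folds) if j != i for item in fold]
        model = train(training)
        scores.append(evaluate(model, test_fold))
    return mean(scores)
```

In the setting of this study, `folds` would hold the five partitions of the dataset, `train` would build and optimize the allergen profiles, and `evaluate` would return a measure such as accuracy on the held-out partition.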