Background

The safety of genetically engineered foods must be assessed before authorities in most nations will consider granting market approval. An important issue in current food safety assessment is the evaluation of the potential allergenicity of food derived from biotechnology. Since many food allergens are proteins, the introduction of a new ("foreign") protein into food by genetic engineering can in theory cause allergic reactions. Therefore, the allergenicity of novel proteins needs to be assessed. The potential allergenicity of a protein is a complex issue, and various tests can be used for prediction, including bioinformatics, in vitro digestibility and binding of antisera from allergic patients. A step-by-step procedure to assess allergenicity is described by the Codex Alimentarius and the FAO/WHO consultation group [1, 2]. An important step in this procedure is to use bioinformatics to determine whether the primary structure (amino acid sequence) of a given transgenic protein is sufficiently similar to sequences of known allergenic proteins. The recommended procedure [1] to establish the possibility of allergenicity is to:

  (1) Obtain the amino acid sequences of known allergens from protein databases in FASTA format (using the amino acids from the mature proteins only, disregarding the leader sequences, if any).

  (2) Prepare the complete set of 80-amino-acid sequences derived from the query protein (again disregarding the leader sequence, if any).

  (3) Compare each of the sequences of (2) with all sequences of (1), using the program FASTA [3] with default settings for gap penalty and extension.

According to the Codex Alimentarius [2], potential allergenicity should be considered when there is either:

  (a) More than 35% similarity over a window of 80 amino acids of the query protein with a known allergen.

  (b) A stretch of identity of 6 to 8 contiguous amino acids.

This procedure is described in more detail by the expert consultation and the Codex Alimentarius. Potential allergenicity requires further testing of the protein with panels of patient sera and possibly animal exposure tests [1, 2].

Construction and content

Three allergen databases were created: one derived from SwissProt [4], one derived from the WHO-IUIS allergen list [5], and a third that is a non-redundant combination of the other two. The databases were created by extracting all proteins on these lists from public databases: SwissProt (version 44.1, July 5 2004, [4]), PIR [6] and GenPept (http://www.ncbi.nlm.nih.gov). Leader sequences, if annotated, were trimmed from the sequences. The SwissProt allergen list contains 334 mature protein sequences, while the WHO-IUIS allergen list (version June 7, 2004) contains 632 sequences (after correcting for three internal duplications). The two databases share 236 duplicate entries, so the non-redundant combined database contains 334 + 632 - 236 = 730 sequences (Figure 1).
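The construction of the non-redundant combined database can be illustrated with a minimal Python sketch. It assumes that two entries count as duplicates when their trimmed sequences are identical; the text does not state the exact duplicate criterion, and the function name and the toy accessions below are hypothetical.

```python
def combine_nonredundant(*databases):
    """Merge {accession: sequence} dictionaries, keeping one entry per sequence.

    Assumption: duplicates are entries with identical mature sequences.
    The first accession seen for a given sequence is the one that is kept.
    """
    combined = {}
    seen_sequences = set()
    for database in databases:
        for accession, sequence in database.items():
            if sequence not in seen_sequences:
                seen_sequences.add(sequence)
                combined[accession] = sequence
    return combined


# Toy data, not real allergen sequences.
swissprot_like = {"SP1": "MKTLLVAGGH", "SP2": "GSHMLEDPAA"}
who_iuis_like = {"WI1": "GSHMLEDPAA", "WI2": "AAKTLLWWGG"}
combined = combine_nonredundant(swissprot_like, who_iuis_like)
print(len(combined))
```

With these toy inputs, one of the four entries is a duplicate, so three entries remain, mirroring the 334 + 632 - 236 = 730 count on a small scale.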

Figure 1

A Venn diagram showing the relationships of the three databases provided by Allermatch™. This figure shows the size and overlap between the SwissProt and WHO-IUIS allergen databases.

Allermatch™ is built around the FASTA package (version 3.4t21; ftp://ftp.virginia.edu/pub/fasta/, [3]) running with default parameters (ktup = 2, matrix = BLOSUM50, gap open = -10, gap extend = -2). The Allermatch™ analysis tool and the web interface are written in Python and run on a SuSE Linux Enterprise Server with an Apache web server (version 1.3.26). Allermatch™ provides two search methods (modes 1 and 2) corresponding to the FAO/WHO guidelines described above; a third method (mode 3) is provided as an extra tool. The application is outlined schematically in Figure 2.

Figure 2

Schematic representation of the Allermatch™ webtool. The user submits a protein sequence of interest to the Allermatch™ webtool and chooses one of the three alignment methods and one of the three databases available. Upon completion, the results are formatted and returned to the user.

Mode 1: Sliding window approach

The query protein sequence is divided into 80 amino acid (aa) windows using a sliding window with steps of a single residue. Each of these windows is compared with all sequences in the allergen database of choice. All database entries showing a similarity higher than a configurable threshold percentage (default 35%) to any of the 80 aa query windows are flagged. Upon completion of the analysis, a table is shown with all flagged database entries. For each entry, the highest similarity score is given, as well as the number of windows with a similarity above the cut-off percentage. For each allergen database entry identified, more detailed information on the similarity between the allergen and the query sequence can be retrieved, such as the areas of both proteins that fall within the 80 aa windows scoring above the cut-off percentage. Because the similarity score calculated by FASTA can apply to stretches shorter than 80 aa, Allermatch™ converts such a score to an 80 aa window. For example, 40% similarity on a stretch of 40 aa converts to 20% similarity on an 80 aa window.
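The windowing and the conversion to an 80 aa window can be sketched in Python as follows. This is a minimal illustration, not the Allermatch™ source code: the function names are hypothetical, the handling of queries shorter than 80 aa and of alignments longer than the window are assumptions, and the FASTA comparison itself is left out.

```python
def sliding_windows(sequence, size=80):
    """Yield (start, window) for every window of `size` residues, step 1.

    Assumption: a query shorter than `size` is yielded as one short window.
    """
    if len(sequence) <= size:
        yield 0, sequence
        return
    for start in range(len(sequence) - size + 1):
        yield start, sequence[start:start + size]


def to_window_percentage(percent, overlap, size=80):
    """Rescale a percentage reported over `overlap` residues to a full window.

    Assumption: alignments at least as long as the window are left unchanged.
    """
    if overlap >= size:
        return float(percent)
    return percent * overlap / size


def is_flagged(percent, overlap, threshold=35.0, size=80):
    """True if the rescaled percentage is strictly above the threshold."""
    return to_window_percentage(percent, overlap, size) > threshold


windows = list(sliding_windows("A" * 82))
print(len(windows))
print(to_window_percentage(40, 40))
```

The last call reproduces the example from the text: 40% over 40 aa rescales to 20% over 80 aa. Under the same rule, a perfect match over 28 aa rescales to exactly 35% and is therefore not above the default threshold.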

Mode 2: Wordmatch

This method looks for short sub-sequences (words) of the query that are identical to a stretch of a database entry. The word size is configurable (default 6 aa). The output is similar to that of Mode 1: all database entries with at least one hit are listed, and for each of these, more detailed information can be retrieved upon request.
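A minimal Python sketch of such an exact word search is given below. It assumes the database is held as a dictionary of accession to sequence and uses set intersection of all overlapping words; the function names are hypothetical and the actual Allermatch™ implementation may differ.

```python
def words(sequence, size=6):
    """Return the set of all overlapping words of `size` residues."""
    return {sequence[i:i + size] for i in range(len(sequence) - size + 1)}


def wordmatch(query, database, size=6):
    """Return {accession: sorted shared words} for entries with at least one hit."""
    query_words = words(query, size)
    hits = {}
    for accession, allergen in database.items():
        shared = query_words & words(allergen, size)
        if shared:
            hits[accession] = sorted(shared)
    return hits


# Toy data, not real allergen sequences.
database = {"a1": "MKTLLVAGGH", "a2": "QQQQQQQQ"}
hits_6 = wordmatch("XXTLLVAGXX", database, size=6)
hits_7 = wordmatch("XXTLLVAGXX", database, size=7)
print(hits_6)
print(hits_7)
```

In this toy case the query shares the 6 aa word TLLVAG with entry a1 but no 7 aa word, which illustrates why larger word sizes produce fewer hits.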

Mode 3: Full FASTA alignment with an Allermatch™ allergen database

The Allermatch™ webtool also offers a full alignment of the query sequence with any of the allergen databases using FASTA. Although this full alignment is currently not required by the FAO/WHO guidelines, it helps to position regions of potential allergenicity within the whole primary structure of the protein. The FASTA output is parsed, supplemented with information from the allergen database, and presented to the user.

Utility and discussion

To examine the predictive performance of the FAO/WHO criteria for potential allergenicity, we performed two tests. The first test determines the percentage of false negatives and the second assesses the number of false positives. Both tests are performed with standard settings: for the sliding window approach, an 80 amino acid window with a 35% similarity cut-off is used, and for the wordmatch approach, word sizes of 6, 7 and 8 aa are tested.

The false negative error rate is estimated by a leave-one-out method: every sequence in each Allermatch™ database is tested against that database with the tested sequence itself excluded. Each sequence that does not result in a hit is considered a false negative. The results for each method/database combination are summarized in Table 1, column 1. They show that the number of false negatives decreases when a larger database of allergen sequences is used. This may (partly) be explained by an increased proportion of similar, but not identical, sequences in the larger databases, such as isoallergens listed by WHO-IUIS. In examining the results, we observed various sequences that could not produce a hit (data not shown) because of their short length: a perfect hit on a sequence shorter than 28 amino acids cannot convert to a 35% hit on an 80 amino acid window. Column 2 of the same table shows the corrected false negative rate after exclusion of these sequences. Even after this correction, the wordmatch method with a word size of 6 amino acids yields fewer false negatives than the sliding window approach. It is clear, however, that for short protein sequences the sensitivity of the sliding window method is reduced.
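The leave-one-out estimate can be sketched in Python as below. The search method is passed in as a predicate so that either approach could be plugged in; here a simple shared-word test stands in for it. All names are hypothetical and the toy sequences are not real allergens.

```python
def shares_word(query, reference, size=6):
    """Stand-in search method: True if query and any reference share a word."""
    def words(seq):
        return {seq[i:i + size] for i in range(len(seq) - size + 1)}
    query_words = words(query)
    return any(query_words & words(seq) for seq in reference.values())


def leave_one_out_false_negatives(database, has_hit):
    """Test each entry against the database without itself.

    Returns the accessions that produced no hit and the false negative rate.
    """
    missed = []
    for accession, sequence in database.items():
        reference = {k: v for k, v in database.items() if k != accession}
        if not has_hit(sequence, reference):
            missed.append(accession)
    return missed, len(missed) / len(database)


database = {"a": "MKTLLVAGGH", "b": "AAKTLLVAGG", "c": "WWWWWWWW"}
missed, rate = leave_one_out_false_negatives(database, shares_word)
print(missed, rate)
```

Entries a and b find each other, while c has no relative in the toy database and is counted as a false negative, which is the same effect that makes smaller databases yield more false negatives.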

Table 1 Prediction quality of the FAO/WHO methods.

In the second test, we assess the odds of a false positive by testing 12 protein sequences known to be non-allergenic. Their non-allergenic status is based on the non-reactivity of these proteins towards IgE sera of allergy patients or on their inability to cause IgE responses in experimental animals (Table 2). It should be noted that such data are available for only a limited number of proteins, which accounts for the size of this dataset. Each of these 12 sequences was tested against all databases with all methods. Each non-allergenic sequence resulting in a hit is considered a false positive (Table 1, column 3). The number of false positives grows with the database size, as is to be expected: the chance of a random hit increases with a larger database. In contrast to the false negative results, here the sliding window method gives the lower error rate. This test might, however, overestimate the number of false positives. A number of these non-allergens are related to, and display similarities with, their allergenic counterparts, namely T1 (related to Bet v 1), human serum albumin (related to animal serum albumins), and human heat shock protein 70 (similar to heat shock proteins from fungi and other allergens). A selection of unrelated, non-allergenic proteins is therefore likely to give a lower false positive rate. Caution should be taken in interpreting these false hit rates, since the methods used might perform differently with other sets of proteins. For example, a member of a completely novel group of valid allergens is likely to generate a false negative result.

Table 2 Sequences used for the negative control

The imperfect results shown here agree with the literature, in which the FAO/WHO methods for sequence comparison are also shown to lack full predictive capability [7–9]. Interestingly, the results show a trade-off between false positives and false negatives when the word size for short exact matches is increased from 6 to 8 amino acids, with the number of false positives decreasing sharply at 8 amino acids (Table 1). The outcomes of these tests therefore need to be refined further by checking for the presence of potential IgE epitopes, as recommended by Kleter and Peijnenburg [7], and by combining them with the results of other assays, as recommended by the Codex. Other methods to decrease false hit rates may also be considered [8, 9]. We plan to implement such supplementary methods in the future to support the Codex-based predictions of potential allergenicity.

The prediction of potential allergenicity by primary sequence comparison depends on the quality of the data used for comparison. Addition of a non-allergenic or poorly annotated protein to any of the Allermatch™ allergen databases would obviously result in undesired false positives and should be prevented. A workable strategy could be to use multiple databases, i.e. a database based on SwissProt's list of allergens, which contains well-annotated sequences from SwissProt, together with a larger database based on the WHO-IUIS list, which contains possibly less well-annotated sequences from other protein databases, such as GenPept. For example, a number of protein accessions in the WHO-IUIS database do not mention the presence of signal and/or pro-peptides, whereas removal of such peptides is essential to prevent false positives. Users of Allermatch™ should, at all times, take into account the possibility of a false positive or negative, for example by checking the original data (accessions, clinical literature) and confirming results before arriving at conclusions. To prevent false positives as much as possible, one should choose the well-annotated SwissProt database. To prevent false negatives, the combination of the larger WHO-IUIS database with that of SwissProt is more appropriate. Updates to the SwissProt and WHO-IUIS allergen lists will be incorporated into the Allermatch™ databases on a regular basis.

Several other websites in the public domain offer sequence alignment facilities that support the prediction of potential allergenicity, such as SDAP [10, 11], AllerPredict [12] and Farrp [13]. These websites offer search algorithms that find contiguous similar amino acids between a query sequence and database sequences (SDAP, AllerPredict) and more than 35% identity in alignments (SDAP, AllerPredict), as well as a general FASTA of a query protein sequence against the database (SDAP, Farrp).

Conclusions

Allermatch™ is an efficient and comprehensive webtool that combines all bioinformatics approaches required to assess the allergenicity of protein sequences according to the current guidelines in the Codex. The application will be kept up to date with the FAO/WHO criteria and the SwissProt and WHO-IUIS allergen lists. It will be extended with other, supplementary methods to support and refine the prediction of allergenicity.

Availability and requirements

Allermatch™ is platform independent and accessible using any Netscape 4+ compatible web browser at http://allermatch.org.