Introduction

The problem of bacterial drug resistance did not exist in the 1930s, when antibiotics were introduced to treat bacterial infections. Since then, due to various factors--such as irresponsible dosage of antibiotics, naturally occurring mutations, transmission of drug-resistant strains, etc.--drug resistance has become a serious health problem. This has drawn the attention of WHO (World Health Organization), ECDC (European Centre for Disease Prevention and Control) and CDC (Centers for Disease Control and Prevention), which monitor and report the spread of drug-resistant pathogens in the world. As a consequence, for example, WHO launched in 2006 a new global program, the "Stop TB Strategy", to fight the spread of M. tuberculosis (MTB). A recent WHO report on MTB estimates that the bacterium was responsible for around 1.7 million deaths worldwide in 2009 [1]. According to the report, 3.3% of new MTB cases in 2009 were multi-drug resistant (MDR). Moreover, 58 countries reported cases of extensively-drug-resistant (XDR) isolates of the bacterium. Very recently, ECDC reported that as many as 58% of all Staphylococcus aureus isolates tested in Malta were methicillin resistant (MRSA) [2].

The emergence of drug resistance is alarming, because it is often not economically justifiable for pharmaceutical companies to develop new drugs against it [3]. One promising approach to address the problem is to use old drugs that were designed for treating other diseases and are also effective against pathogens [4]. An effort in this direction was recently undertaken in a research study on M. tuberculosis [5]. The authors used three-dimensional docking to identify putative drug-target interactions in silico. For example, they predicted Comtan, a drug used in treating Parkinson's disease, to be potentially effective against M. tuberculosis infections.

The need for more efficient strategies to develop new drugs stimulates research to better understand drug resistance mechanisms. Several drug resistance mechanisms have been discovered so far. They can be categorized as: (i) drug target modification; (ii) drug molecule modification by specialized enzymes; (iii) reduced accumulation of the drug inside a bacterial cell by decreased cell wall permeability or by pumping out the drug; and (iv) alternative metabolic pathways [6]. Moreover, genes and mutations responsible for most of the drug resistance mechanisms are known. While genomics can be used on samples before and after drug resistance emerges to identify the likely associated mutations, most of the known mutations and genes associated with drug resistance were discovered by analyzing a priori candidates such as drug target genes or genes located on plasmids. Some information on drug target genes is available in the drugbank.ca database [7], and some lists of genes known to be responsible for drug resistance (specific to bacterial species and drugs) are available in the ARDB (Antibiotic Resistance Genes Database) [8].

Despite the above-mentioned achievements, our understanding of drug resistance mechanisms is still incomplete. For example, there are reports of S. aureus isolates with atypical drug resistance profiles [9-11], which have not been explained yet. We hypothesize that these atypical drug resistance profiles might be due to genomic mutations in genes which are not a priori suspected of being involved in drug resistance mechanisms.

In this work, we use whole-genome sequences to identify and associate genetic mutations with drug resistance phenotype for bacterial strains (within S. aureus). Thus, conceptually our approach is similar to Genome-Wide Association Study (GWAS) approaches, which have been successfully applied to identify SNPs associated with human diseases [12, 13]. We hypothesize that similar approaches, when applied to bacteria, should bring interesting results. However, it may not make sense to directly transfer this methodology to bacteria, because, for example, horizontal gene transfer (HGT) plays an important role in the evolution of bacteria. Besides mutations which would explain the reported atypical drug resistance profiles, we also expect, by applying our approach, to identify mutations that can be interpreted as compensatory mutations. These compensatory mutations are not directly involved in drug resistance, but they are important for neutralizing the deleterious effect (caused by mutations directly responsible for the resistance mechanisms) on bacterial fitness [14-16].

There are published studies, based on comparative analysis of whole-genome sequences, associating genetic mutations with drug resistance [17-21]. However, the methodologies used in these studies are simple and were applied to a relatively small number of strains. In our opinion, this is caused by two main problems: first, the number of fully sequenced bacterial strains within the same species has not been sufficiently large until recently; second, phenotype data with respect to drug susceptibility tests are spread throughout the literature and are not easy to collect.

In this work, we collected genotype data for 100 fully sequenced S. aureus strains and addressed the second problem by a careful search of the literature for results of drug susceptibility tests of the strains considered. We also developed and tested a new approach to associate mutations and genes with drug resistance.

Materials and methods

Below we present details of our methodology including the problem setting, collection of data and subsequent steps of the identification of drug resistance associated genetic features. These subsequent steps comprise:

  • unification of protein-coding gene annotations of bacterial strains and determination of gene families;

  • computing multiple alignments for the gene families and reconstructing the consensus phylogenetic tree;

  • identification of genetic features, possibly associated with drug resistance, such as point mutations and gene gain/losses, based on the multiple alignments and the determined gene families; and

  • association of genetic features with drug resistance phenotypes.

Problem setting

We consider a set $S$ of bacterial strains and their response to the application of a given drug. The response, which we call the drug resistance profile, is represented by a vector $v : S \to \{\text{'S'}, \text{'R'}, \text{'?'}\}$, where 'S' and 'R' denote drug-susceptible and drug-resistant strain phenotypes, respectively, and '?' indicates that the phenotype is unknown. Additionally, we denote by $S_v^S$ and $S_v^R$ the sets of drug-susceptible and drug-resistant strains for a given drug resistance profile, respectively.

We also assume that a set of genetic mutations among the considered bacterial strains is given. Analogously to drug resistance profiles, we represent mutations as vectors $m : S \to \Sigma \cup \{\text{'?'}\}$, where $\Sigma$ denotes an alphabet of possible states, such as amino acids in the strain sequence corresponding to a given position in the multiple alignment. By '?' in the mutation profile we denote strains which are not present in the aligned gene family corresponding to the considered point mutation.

Then, the problem is mainly to identify a subset of genetic mutations associated with a given drug resistance profile and, secondarily, to use the identified mutations to predict unknown places in the given drug resistance profile (marked by '?').

Genotype data

We collected genotype data (genome sequences and annotations) for 100 fully sequenced strains of S. aureus from the GenBank [22] and PATRIC [23] databases. Additionally, genotype data for strain EMRSA-15 were downloaded from the Wellcome Trust Sanger Institute website. At the time of writing, 31 out of the 100 S. aureus strains had "completed" sequencing status. For the remaining strains, whose genomes are still being assembled, contig sequences (covering around 90% of the genomes) and annotations are provided.

We unify the original annotations employing our previously published method, called CAMBer [24]. Briefly, CAMBer iteratively extends the protein-coding annotations by homology transfer, until the transitive closure of a given homology relation is computed. This homology relation defines the consolidation graph. In this graph, there is an edge between a pair of genes if there was an acceptable BLAST hit between them.

Then, we determine gene families as connected components in the consolidation graph. However, we additionally extend the consolidation graph by edges coming from BLAST amino-acid queries. More formally, we add an edge between a pair of genes to the consolidation graph if the percent of identity (calculated as the number of identities over the length of the longer gene) of the BLAST hit between them exceeds a threshold P(L) given by the HSSP curve formula [25]:

$$P(L) = \begin{cases} 100 & L \le 11 \\ c + 480 \cdot L^{-0.32\,(1 + e^{-L/1000})} & 11 < L \le 450 \\ c + 19.5 & L > 450 \end{cases}$$
(1)

Here, c is set to 40.5 and L is the number of aligned amino acid residues.
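As an illustration, the threshold of Equation 1 and the grouping of genes into families can be sketched in Python. This is a minimal sketch, not the CAMBer implementation: the function names (hssp_threshold, passes_threshold, gene_families) and the edge-list input are hypothetical, and a simple union-find structure is assumed for finding connected components.

```python
import math


def hssp_threshold(aligned_length, c=40.5):
    """Percent-identity threshold P(L) from Equation 1."""
    if aligned_length <= 11:
        return 100.0
    if aligned_length <= 450:
        exponent = -0.32 * (1.0 + math.exp(-aligned_length / 1000.0))
        return c + 480.0 * aligned_length ** exponent
    return c + 19.5


def passes_threshold(identities, longer_gene_length, aligned_length, c=40.5):
    """Percent identity is taken over the length of the longer gene."""
    percent_identity = 100.0 * identities / longer_gene_length
    return percent_identity > hssp_threshold(aligned_length, c)


def gene_families(genes, edges):
    """Connected components of a graph given as a list of gene pairs.

    Assumption: `edges` already contains only accepted hits.
    """
    parent = {gene: gene for gene in genes}

    def find(gene):
        # Follow parent links to the component root, compressing the path.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in edges:
        parent[find(a)] = find(b)

    families = {}
    for gene in genes:
        families.setdefault(find(gene), set()).add(gene)
    return list(families.values())
```

For example, gene_families(["a", "b", "c", "d"], [("a", "b"), ("b", "c")]) groups a, b and c into one family and leaves d on its own.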

Each connected component in the consolidation graph corresponds to a gene family [24]. We compute multiple alignments using MUSCLE [26] for all these gene families. Then, we consider two kinds of genetic variations:

  • gene gain/loss,

  • amino acid point mutations.

Intuitively, we represent the considered genetic variations as 0-1 vectors, indexed by strains, where 0 denotes the reference state and 1 denotes some change. We call these vectors gain/loss profiles and point mutation profiles.

Gene gain/loss profiles are derived from gene families which do not span the set $S$ of all considered strains. We transform each such gene family into a vector representation $g : S \to \{\text{'G'}, \text{'L'}\}$ as follows: for a given strain $i$, we define $g(i)$ = 'G' if the gene family contains at least one gene from strain $i$; otherwise we set $g(i)$ = 'L'.

Similarly, point mutation profiles are derived from columns in multiple alignments computed for gene families with elements present in at least $|S|-1$ strains. We take into account only columns which contain at least two different characters (ignoring '?'). We transform each such column (in the multiple alignment) into a vector representation $m : S \to AA \cup \{\text{'-'}, \text{'?'}\}$, where $AA$ is the set of the 20 amino acids, as follows: for a given strain $i$, we set $m(i)$ = 'x' if the character 'x' (one of the 20 amino acids or '-') is present in the row corresponding to strain $i$; and we set $m(i)$ = '?' if strain $i$ is not present in the aligned gene family.
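The two transformations above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, a gene family is given as the set of strains in which it occurs, and an alignment is given as a dictionary with one equal-length aligned row per strain (families with several genes per strain are not handled).

```python
def gain_loss_profile(family_strains, all_strains):
    """'G' if the family has a gene in the strain, 'L' otherwise."""
    return {s: ("G" if s in family_strains else "L") for s in all_strains}


def point_mutation_profiles(alignment, all_strains):
    """Map each variable alignment column to its point mutation profile.

    alignment: dict strain -> aligned amino acid row (assumed input format).
    Strains missing from the alignment receive '?'.
    """
    # Only families present in at least |S| - 1 strains are considered.
    if len(alignment) < len(all_strains) - 1:
        return {}
    length = len(next(iter(alignment.values())))
    profiles = {}
    for column in range(length):
        profile = {
            s: (alignment[s][column] if s in alignment else "?")
            for s in all_strains
        }
        observed = {state for state in profile.values() if state != "?"}
        # Keep only columns with at least two different characters.
        if len(observed) >= 2:
            profiles[column] = profile
    return profiles
```

Column indices in the returned dictionary are 0-based positions in the alignment, which is a choice made for this sketch only.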

Phylogenetic tree of the strains

We compute the phylogenetic tree of the input strains using a consensus method with majority rule implemented in the PHYLIP package [27]. We apply the consensus method to trees constructed for all gene families with exactly one element in each strain. The trees are constructed using the maximum likelihood approach implemented in the PHYLIP package [27].

Phenotype data (drug susceptibility)

Drug susceptibility data were collected from the following sources: (i) publications issued together with the fully sequenced genomes: USA300_TCH1516 and USA300_TCH959 [28], MRSA252 and MSSA476 [29], 04-02981 [30], T0131 [31], ST398 [32], COL [33], JKD6008 [34], 16 K [35], TW20 [36], Newman [37], RN4220 [38], O46 and O11 [39], RF122 [40], Mu3 [41], MRSA252 [29], CF-Marseille [42], N315 and Mu50 [43], MW2 [44], MSHR1132 [45], ECT-R_2 [46], JH1 and JH9 [17], ED98 [47], JKD6159 [48], LGA251 [49], ED133 [50]; (ii) NARSA project http://www.narsa.net; (iii) email exchange with the authors of publications related to strains ST398 and TW20; and (iv) other publications found by searching the related literature [21, 23, 33, 37, 51-98]. The complete collected phenotype data are available in the supplementary table (additional file 1).

We represent the collected information as a set of drug resistance profiles, defined for each drug separately.

Essential mutations

For a given drug resistance vector $v$ we introduce a function $r_v$ which describes the reference state of a given point mutation or gene gain/loss profile $p$. We define it as the most frequently occurring state in drug-susceptible strains (ignoring '?'), i.e.:

$$r_v(p) = \arg\max_{x} \sum_{i \in S_v^S} [\, p(i) = x \,]$$
(2)

Here, and in all the following equations, square brackets are used for Iverson's notation.

In the current implementation, when several states are present in the same maximal number of strains, the function $r_v$ returns the first state in lexicographical order. Note that this is just a technical assumption, since such mutations will not be considered as associated with drug resistance.

We say that a point mutation $m$ is present in a strain $i$ if $m(i) \notin \{r_v(m), \text{'?'}\}$; otherwise we say that the point mutation $m$ is absent in strain $i$.

Then, we distinguish two categories of gene gain/loss and point mutation profiles depending on how they correspond to a given drug resistance profile. We categorize a given mutation profile m as:

  • essential mutation, when m is absent in all drug-susceptible strains,

  • conflict mutation, when m is present in at least one drug-susceptible strain.

Further, we distinguish neutral mutations as a subclass of essential mutations: these are essential mutations that are not present in any of the drug-resistant strains.

Analogously, we transfer the concepts introduced above to gene gain/loss profiles, defining essential, neutral and conflict gain/loss profiles.
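The definitions in this subsection can be illustrated with a short Python sketch. The function names and the dictionary-based representation of profiles (strain name mapped to a state or phenotype) are assumptions made for illustration only. The classifier reports "neutral" for the neutral subclass of essential profiles, so "essential" here means essential and present in at least one drug-resistant strain.

```python
from collections import Counter


def reference_state(profile, resistance):
    """Most frequent state among drug-susceptible strains, ignoring '?'.

    Ties are broken by taking the first state in lexicographical order.
    """
    counts = Counter(
        profile[s]
        for s, phenotype in resistance.items()
        if phenotype == "S" and profile[s] != "?"
    )
    if not counts:
        return None
    best = max(counts.values())
    return min(state for state, n in counts.items() if n == best)


def is_present(profile, strain, ref):
    """A mutation is present if the state is neither the reference nor '?'."""
    return profile[strain] not in (ref, "?")


def classify(profile, resistance):
    """Return 'conflict', 'neutral' or 'essential' for a profile."""
    ref = reference_state(profile, resistance)
    if any(is_present(profile, s, ref)
           for s, ph in resistance.items() if ph == "S"):
        return "conflict"
    if any(is_present(profile, s, ref)
           for s, ph in resistance.items() if ph == "R"):
        return "essential"
    return "neutral"
```

The same functions apply to gene gain/loss profiles, since they only compare states with the reference state.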

Support

We aim to identify genetic variations which are likely to be associated with drug resistance. Intuitively, such mutations or gained genes should often be present in drug-resistant strains and rarely in drug-susceptible strains. To reflect this intuition we assign a score, which we call support, to all point mutation and gene gain/loss profiles. For a given point mutation or gene gain/loss profile $p$ and drug resistance profile $v$, the support ($s_v$) is defined as the number of drug-resistant strains with the mutation present (or gene gained) minus the weighted number of drug-susceptible strains with the mutation present (or gene gained):

$$s_v(p) = \sum_{i \in S_v^R} [\, p(i) \ne r_v(p) \,] - \alpha_v \sum_{i \in S_v^S} [\, p(i) \ne r_v(p) \,]$$
(3)

Here, $\alpha_v$ is a weight which we use to penalize mutations for their presence in drug-susceptible strains. It is defined as the ratio of the number of drug-resistant strains to the number of drug-susceptible strains, so that occurrences of a mutation are given equal emphasis in drug-resistant and drug-susceptible strains. More formally:

$$\alpha_v = \frac{|S_v^R|}{|S_v^S|}$$
(4)
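A minimal Python sketch of the support score follows. The function name and the dictionary-based inputs are hypothetical. As an assumption of this sketch, a strain with state '?' is treated as not carrying the mutation, following the definition of presence given earlier, and at least one drug-susceptible strain is assumed to exist.

```python
def support(profile, resistance, ref):
    """Support of a profile for a drug resistance profile.

    profile: dict strain -> state; resistance: dict strain -> 'S'/'R'/'?';
    ref: the reference state of the profile.
    """
    resistant = [s for s, ph in resistance.items() if ph == "R"]
    susceptible = [s for s, ph in resistance.items() if ph == "S"]
    # Weight penalizing presence in drug-susceptible strains.
    alpha = len(resistant) / len(susceptible)

    def present(strain):
        return profile[strain] not in (ref, "?")

    in_resistant = sum(present(s) for s in resistant)
    in_susceptible = sum(present(s) for s in susceptible)
    return in_resistant - alpha * in_susceptible
```

With three drug-resistant and two drug-susceptible strains, the weight is 1.5, so a mutation present in two resistant strains and one susceptible strain has support 2 - 1.5 = 0.5.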

Weighted support

Although the support is a simple and intuitive score, it does not incorporate any phylogenetic information. For example, let us assume there are two point mutations with the same support of 3, where the first mutation covers only drug-resistant strains within one subtree of the phylogenetic tree, whereas the second mutation covers the same number of strains but spread throughout the whole tree. The first mutation is likely to be associated with the phylogeny, driven by some environmental changes. This suggests that the second mutation should have a greater score, as it must have been acquired several times independently during the evolution process.

We propose weighted support as a score to account for the above situation. For a given phylogenetic tree $T$ and gene gain/loss or point mutation profile $p$, weighted support ($ws_v^T$) is defined as follows:

$$ws_v^T(p) = \sum_{i \in S} w_i^T \, [\, p(i) \ne r_v(p) \,]$$
(5)

where $w_i^T$ are weights assigned to each cell in a given drug resistance profile.

In all our experiments we assign weights in the following way: all drug-susceptible strains are assigned weight $-\alpha_v$ (defined as above); each drug-resistant strain $i$ is assigned a weight $1/n$, where $n$ is the number of drug-resistant strains in the subtree (containing strain $i$) determined by its highest parental node such that the subtree does not contain any drug-susceptible strain among its leaves. All strains without drug resistance information are assigned weight 0.

Note that the support score can also be expressed as weighted support, where the weights $w_i$ are set to $-\alpha_v$, 1 and 0 for drug-susceptible strains, drug-resistant strains and strains without drug resistance information, respectively.
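The weighting scheme can be sketched in Python as follows. This is an illustrative sketch only: the tree is assumed to be given as nested tuples whose leaves are strain names, the function names are hypothetical, and strains with state '?' are treated as not carrying the mutation.

```python
def leaves(node):
    """All leaf names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def strain_weights(tree, resistance):
    """Assign the tree-based weights described above to every strain."""
    n_resistant = sum(ph == "R" for ph in resistance.values())
    n_susceptible = sum(ph == "S" for ph in resistance.values())
    alpha = n_resistant / n_susceptible
    weights = {}

    def visit(node):
        names = leaves(node)
        if all(resistance.get(s, "?") != "S" for s in names):
            # Highest subtree without drug-susceptible leaves: its
            # resistant strains share a total weight of 1.
            resistant = [s for s in names if resistance.get(s) == "R"]
            for s in names:
                weights[s] = 1.0 / len(resistant) if s in resistant else 0.0
        elif isinstance(node, str):
            weights[node] = -alpha  # a drug-susceptible leaf
        else:
            for child in node:
                visit(child)

    visit(tree)
    return weights


def weighted_support(profile, weights, ref):
    """Sum of the weights of strains in which the mutation is present."""
    return sum(w for s, w in weights.items() if profile[s] not in (ref, "?"))
```

For the tree ((("a", "b"), "c"), ("d", "e")) with a, b and d resistant, c susceptible and e unknown, strains a and b each receive weight 0.5, d receives 1, e receives 0 and c receives -3. A mutation present in a, b and d then has weighted support 2, whereas its support is 3.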

Figure 1 illustrates the concepts of support and weighted support.

Figure 1

Support and weighted support. A schematic example of classification of genetic variation profiles and computation of their supports. Point mutations 1 and 4 are essential, mutation 2 is conflict and mutation 3 is neutral. Light blue circles mark nodes which appear in the definition of weighted support. These are the highest parental nodes (of the leaf nodes corresponding to drug-resistant strains) whose subtrees do not contain any drug-susceptible strains among their leaves. The scores (a) support and (b) weighted support are assigned to these mutations. For this drug resistance profile, the ratio $\alpha_v$ equals 5/3.

In order to make the support scores more comparable between drugs, we introduce normalized versions of the scores, normalized support and normalized weighted support, which denote the respective score divided by the maximal possible support or weighted support, respectively.
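One way to read this normalization is sketched below in Python. It rests on an assumption not spelled out in the text: the maximal possible score is taken to be the sum of all positive strain weights, which is the score of a profile present in every drug-resistant strain and in no drug-susceptible strain. The function names are hypothetical.

```python
def max_score(weights):
    """Largest attainable score: the sum of all positive weights."""
    return sum(w for w in weights.values() if w > 0)


def normalized_score(score, weights):
    """Score divided by the maximal possible score (0.0 if undefined)."""
    maximum = max_score(weights)
    return score / maximum if maximum > 0 else 0.0
```

For plain support the positive weights are all 1, so under this reading the maximum equals the number of drug-resistant strains.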

Odds ratio

For a given drug resistance profile v and mutation p, we calculate odds ratio using the formula:

$$\mathrm{odds\_ratio}_v(p) = \frac{n_R^1 \cdot n_S^0}{\max(1, n_R^0) \cdot \max(1, n_S^1)}$$
(6)

Here, $n_R^1$, $n_S^0$, $n_R^0$ and $n_S^1$ denote the number of drug-resistant strains with mutation $p$, drug-susceptible strains without mutation $p$, drug-resistant strains without mutation $p$ and drug-susceptible strains with mutation $p$, respectively.

The same formula is used to calculate odds ratio for gene gain/loss profiles.
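Equation 6 can be sketched in Python as follows. The function name and the dictionary-based inputs are hypothetical, and strains with state '?' are counted as not carrying the mutation, which is an assumption of this sketch.

```python
def odds_ratio(profile, resistance, ref):
    """Odds ratio of Equation 6, with zero denominators replaced by 1."""
    n_r1 = n_r0 = n_s1 = n_s0 = 0
    for strain, phenotype in resistance.items():
        present = profile[strain] not in (ref, "?")
        if phenotype == "R":
            if present:
                n_r1 += 1
            else:
                n_r0 += 1
        elif phenotype == "S":
            if present:
                n_s1 += 1
            else:
                n_s0 += 1
    return (n_r1 * n_s0) / (max(1, n_r0) * max(1, n_s1))
```

Strains without drug resistance information do not contribute to any of the four counts.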

Statistical significance

In order to assess statistical significance of the associations we calculate their p-value.

More precisely, for a given drug resistance profile $v$, let $X$ be the random variable giving the support of a random mutation. Then, for a given observed mutation with support $c$, its p-value is defined by the following formula:

$$\mathbb{P}(X \ge c) = \sum_{n=1}^{|S|} \mathbb{P}(X \ge c \mid N = n)\, \mathbb{P}(N = n)$$
(7)

Here, $N$ is a random variable which denotes the number of mutated strains in a random mutation. For each $n$, the probability $\mathbb{P}(N = n)$ of observing a mutation present in $n$ strains is estimated from the data (as the ratio of the number of mutations present in $n$ strains to the total number of considered mutations), separately for point mutation and gene gain/loss profiles. The details follow. Assume that the weights, for a given drug resistance profile $v$, take $k$ different values: $l_1, l_2, \dots, l_k$. For $1 \le j \le k$, let $m_j$ be the number of strains which take value $l_j$. Clearly we have $m_1 + m_2 + \dots + m_k = |S|$. Then, the probability $\mathbb{P}(X \ge c \mid N = n)$ (from Equation 7) is given by the formula:

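$$\sum_{\substack{0 \le n_1 \le m_1,\ 0 \le n_2 \le m_2,\ \dots,\ 0 \le n_k \le m_k \\ n_1 + n_2 + \dots + n_k = n}} \frac{\prod_{j=1}^{k} \binom{m_j}{n_j}}{\binom{|S|}{n}} \left[\, \sum_{j=1}^{k} n_j l_j \ge c \,\right]$$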
(8)

Here we describe our algorithm for calculating the p-value. It should be clear that the problem reduces to computing $\mathbb{P}(X \ge c \mid N = n) = t_c(n) / \binom{|S|}{n}$ for each $0 \le n \le |S|$, where $t_c(n)$ denotes the number of ways of distributing $n$ ones over $|S|$ strains such that the corresponding sum of weights is greater than or equal to $c$. The term $\binom{|S|}{n}$ is the total number of possible ways of distributing $n$ ones over $|S|$ strains. Thus, the problem reduces to calculating $t_c(n)$ for each $0 \le n \le |S|$. Additionally, without any loss of generality, we may assume that the weight levels are strictly decreasing: $l_1 > l_2 > \dots > l_k$, where $l_k < 0$ and $l_{k-1} \ge 0$.

The algorithm iteratively generates partial combinations (without $n_k$), starting from the partial combination $(n_1 = m_1, \dots, n_{k-1} = m_{k-1})$, in the following manner: if $j$ is the highest index of a non-zero $n_i$ in the current partial combination, the next partial combination is $(n_1, \dots, n_j - 1, n_{j+1} = m_{j+1}, \dots, n_{k-1} = m_{k-1})$. The algorithm stops generating partial combinations when two consecutive partial combinations have their corresponding sums of weights below $c$. At each step of the algorithm, all possible full combinations $(n_1, \dots, n_{k-1}, n_k)$ are generated from the current partial combination $(n_1, \dots, n_{k-1})$. If the sum of weights of a full combination is greater than or equal to $c$, i.e. $\sum_{i=1}^{k} n_i l_i \ge c$, then we increment the value $t_c(n)$ by $\prod_{i=1}^{k} \binom{m_i}{n_i}$, where $n = n_1 + \dots + n_k$. As the outcome, we obtain $t_c(n)$ and, thus, also $\mathbb{P}(X \ge c \mid N = n)$ for each $n$.

The last step is to calculate formula 7 using these calculated probabilities.

Note that, since support is a special case of weighted support, the same formula and algorithm can be used to compute its corresponding p-values.
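The p-value of Equation 7 can be illustrated with a short Python sketch. Note that this sketch does not reproduce the enumeration order described above: it swaps in a simple dynamic programme over weight levels, which computes the same counts $t_c(n)$ as the sum in Equation 8. The function names are hypothetical, weight sums are rounded to nine decimals to merge floating-point duplicates, and math.comb requires Python 3.8 or later.

```python
from collections import Counter, defaultdict
from math import comb


def estimate_prob_n(mutated_counts):
    """Empirical P(N = n) from the number of mutated strains per profile."""
    counts = Counter(mutated_counts)
    total = len(mutated_counts)
    return {n: k / total for n, k in counts.items()}


def tail_counts(weights, c):
    """t_c(n): ways to place n ones over the strains with weight sum >= c."""
    levels = Counter(weights)  # weight level l_j -> number of strains m_j
    ways = {(0, 0.0): 1}       # (ones placed, weight sum) -> number of ways
    for level, m in levels.items():
        nxt = defaultdict(int)
        for (n, total), count in ways.items():
            for n_j in range(m + 1):
                key = (n + n_j, round(total + n_j * level, 9))
                nxt[key] += count * comb(m, n_j)
        ways = nxt
    t = defaultdict(int)
    for (n, total), count in ways.items():
        if total >= c - 1e-9:
            t[n] += count
    return t


def p_value(weights, c, prob_n):
    """P(X >= c) as in Equation 7; weights is a list with one entry per strain."""
    size = len(weights)
    t = tail_counts(weights, c)
    return sum(
        prob_n.get(n, 0.0) * t[n] / comb(size, n) for n in range(1, size + 1)
    )
```

For weights [1.0, 1.0, -2.0] and c = 2, only the placement covering both positive-weight strains qualifies, so the conditional probability for n = 2 is 1/3.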

Results and discussion

We verify the usability of our approach by trying to re-identify the known drug resistance determinants. In this experiment, we compare our proposed scoring methods --support and weighted support --to odds ratio, which is a popular measure used in genome-wide association studies. Table 1 shows rankings of the gene gain/loss profiles corresponding to genes which are known drug resistance determinants. The experiment suggests that weighted support identifies putative associations better than support and odds ratio, neither of which incorporates additional information about phylogeny.

Table 1 Rankings of known drug resistance determining genes

This experiment also reveals that the amount of the collected drug resistance information is not sufficient to correctly identify drug resistance associated genes. However, the high consistency between the drug resistance profiles corresponding to the collected information and the presence of drug resistance determinants (summing over drugs, there are 117 drug-resistant strains, of which only 4 do not have any known drug resistance determinant; and there are 112 drug-susceptible strains, of which only 8 have at least one drug resistance determinant) suggests that we can use the determinants to predict drug resistance in the strains without drug resistance information available. It is perhaps questionable to predict drug resistance in those strains for which the whole-genome sequence is not determined yet. Therefore, we make predictions only for those strains with completed sequencing or at least information on their plasmids (which often carry the drug resistance determinants). Nevertheless, we predict drug resistance also for those strains that are not yet fully sequenced, provided the presence of drug resistance determining genes has been confirmed for them. Moreover, we predict drug resistance to rifampicin and ciprofloxacin for all 100 strains, as the drug resistance for rifampicin and ciprofloxacin is determined by point mutations in genes rpoB, gyrA and grlA (a synonym of parC), which are sequenced in all strains. More precisely, we predicted as rifampicin-resistant all strains with any mutation present in the rifampicin resistance determining region (RRDR). We defined the RRDR as the amino-acid range from 463 to 530 in the rpoB gene sequence (according to [94]). Analogously, we predicted as ciprofloxacin-resistant all strains with any point mutation in the quinolone resistance determining region (QRDR). We defined the QRDR as the amino-acid ranges from position 68 to 107 and from position 64 to 103 in the gyrA and grlA (parC) gene sequences, respectively (according to [65]). Figure 2 shows the complete information about drug susceptibility after prediction.
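The region-based prediction rule can be sketched in Python as follows. This is a minimal sketch under assumptions that are not part of the text: sequences are given as rows of a pairwise or multiple alignment, positions are 1-based and counted along a reference row (here a hypothetical drug-susceptible reference), and the function name is invented for illustration.

```python
def mutated_in_region(reference, sequence, start, end):
    """True if the aligned sequence differs from the aligned reference at
    any reference position in [start, end] (1-based, inclusive).

    Gap characters ('-') in the reference do not advance the position.
    """
    position = 0
    for ref_char, char in zip(reference, sequence):
        if ref_char != "-":
            position += 1
        if start <= position <= end and ref_char != char:
            return True
    return False


# Region boundaries as stated in the text for rpoB.
RRDR = (463, 530)
```

A strain would then be predicted as rifampicin-resistant when mutated_in_region(reference_rpoB, strain_rpoB, *RRDR) is true, where both sequence arguments are placeholders for aligned rpoB protein sequences.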

Figure 2

The collected dataset of phenotypes with predictions. The collected dataset of phenotypes put together with results of our drug resistance predictions based on the presence of known drug resistance determinants. Due to the high number of strains the table is split into two panels. Columns represent drugs, rows represent S. aureus strains included in the study in the order corresponding to the reconstructed phylogenetic tree of strains. Green, yellow and red cell colors represent susceptible, intermediate resistant and resistant phenotypes, respectively. Analogously, light green and light red cell colors represent predicted susceptible and resistant phenotypes, respectively. White cell color represents unknown (not determined by experiments or prediction) drug resistance phenotypes.

Then, we applied our approach to the dataset supplemented by the predicted information about drug susceptibility for the following drugs: tetracycline, β-lactams (penicillin, oxacillin, methicillin), erythromycin, gentamicin, vancomycin, ciprofloxacin and rifampicin.

In the subsections below we discuss the results of our approach applied separately to the following drugs: tetracycline, β-lactams (penicillin, methicillin), erythromycin, gentamicin, vancomycin and ciprofloxacin. We do not discuss results for oxacillin and clindamycin here, since their drug resistance profiles are very similar to those of methicillin and erythromycin, respectively. All other drugs were excluded from the analysis due to the low number of strains with available drug resistance information on these drugs.

Tables 2 and 3 present the top-scored gene gain/loss and point mutation profiles for the discussed drugs, respectively. The genes presented in the tables were selected according to the following procedure: for each drug we construct a function which gives, for each gene (listed in descending order with respect to normalized weighted support), the negative logarithm of the p-value (-log(p-value)) of this score. Then, we report the genes which correspond to the portion of the graph of this function before it flattens. Complete results for all the drugs are provided in supplementary Excel tables (additional files 3 and 4).

Table 2 The top scored gene gain/loss profiles
Table 3 The top scored point mutation profiles, only for essential mutations

Tetracycline

Tetracycline acts by binding to the 30S ribosomal subunit (rpsS, 16S rRNA are its direct targets), preventing binding of tRNA to the mRNA-ribosome complex, and thus inhibiting protein synthesis [7].

The most common drug resistance mechanism to tetracycline in S. aureus is mediated by ribosome protection proteins (RPPs) such as tet and tetM, which bind to the ribosome complex, thus preventing the binding of tetracycline [99, 100].

Proteins tet and tetM mediating the mechanism cover all drug-resistant strains except MW2. This may be caused by errors in the drug susceptibility tests, errors in sequencing, or by some other not yet known drug resistance mechanism. The inconsistent information about strain MW2's tetracycline susceptibility (see supporting Table 1) and the lack of identified drug resistance determinants suggest that the strain is possibly drug susceptible. In our experiment we initially assumed that the tetracycline resistance information is not available for strain MW2.

Our method shows that, besides tet and tetM, there are a few more genes that have highly scored gene gain/loss profiles. Especially interesting are the following genes, which are not gained by any of the drug-susceptible strains: repC, pre, thiI, int, clfB (see Table 2). There are studies reporting the significance of the clfB and repC genes in drug resistance [101, 102]. Interestingly, the gene repC seems to co-evolve with tet (correlated gene gain/loss profiles).

Applying our method to point mutations, we have identified two highly scored (and essential) point mutations in ribosomal complex proteins: K101R in rpsL and K57M in rpsJ. To our knowledge, this is the first report on the significance of these point mutations for drug resistance in S. aureus. However, we found a study associating mutations in rpsJ with tetracycline resistance in another bacterium, Neisseria gonorrhoeae [103].

Beta-lactams

Beta-lactams are a broad class of antibiotics, which possess (by definition) the β-lactam ring in their structure. The ring is capable of binding transpeptidase proteins (also known as Penicillin Binding Proteins -- PBPs) [7], which are important for the synthesis of the peptidoglycan layer of the bacterial cell wall. PBPs with attached drug molecules are no longer able to synthesize peptidoglycan, leading to bacterial death [104]. In our case study we consider three β-lactam antibiotics: penicillin, oxacillin and methicillin. However, since the drug resistance profiles and drug resistance mechanisms for oxacillin and methicillin are very similar, we discuss results only for methicillin.

There are two common β-lactam resistance mechanisms in S. aureus [104, 105]. The first one is mediated by β-lactamase enzymes, which bind drug molecules and break the β-lactam ring, thus deactivating the drug molecules. This mechanism is effective against penicillin (which is β-lactamase sensitive) and not effective against methicillin and oxacillin (which are β-lactamase resistant) [106]. The second β-lactam resistance mechanism is mediated by proteins which are capable of functionally substituting for PBPs, but have a much lower affinity for β-lactam molecules. This mechanism is effective against penicillin, methicillin and oxacillin.

Penicillin

In our dataset all strains resistant to penicillin possess proteins responsible for one of the two mechanisms. More precisely, there are 69 drug-resistant strains (with available drug resistance information) which possess BlaZ -- the standard β-lactamase protein (note that its regulators BlaR1 and BlaI do not always co-occur). All the remaining penicillin-resistant strains have mecA, which is an altered PBP. Table 2 provides information about the top-scored gene gain/loss profiles.

Applying our method, we have also identified the uncategorized putative protein SAR0056 as putatively associated with penicillin resistance (see Table 2). We suggest further examination of the role of this gene in β-lactam resistance.

Methicillin

Applying our approach to gene gain/loss profiles, we identified (besides mecA) the genes ugpQ and maoC. The correlation of their gene profiles with the profile of mecA and their close proximity on the genomes suggest that these genes co-evolve (see Figure 3 for more details). This co-evolution may reflect some important role played by these two genes in methicillin resistance, which calls for further study.

Figure 3

Genes related to methicillin resistance. Presence and relative genome coordinates of genes related to methicillin resistance (mecA, mecR1, mecI, ccrA, ccrB, ccrC), put together with the identified genes: ugpQ and maoC. The gene presence profiles are clustered with respect to the gene order. In this figure we include only those methicillin-resistant strains for which all the genes were located on the main genome and within the same sequence contig (in order to determine the relative positions).

We have also identified a few point mutations that are putatively associated with methicillin resistance. Interestingly, two of the mutations in the top 10 essential mutations according to weighted support (I72F in SAR0420 and E208QKD in SAR0436) are present in cell membrane proteins. This suggests some compensatory mechanism to the presence of mecA.

Ciprofloxacin

Ciprofloxacin belongs to a broad class of antibiotics called fluoroquinolones, which act against bacteria by binding DNA gyrase subunit A (encoded by gyrA) and DNA topoisomerase 4 subunit A (encoded by parC), enzymes necessary to separate bacterial DNA, thereby inhibiting cell division [7]. The most common ciprofloxacin resistance mechanism is mediated by point mutations in the drug targets, parC and gyrA.

Applying our approach, we identified (by highest support) two point mutations in ciprofloxacin target genes -- S80FY in parC and S90AL in gyrA -- which are located in the QRDR and are known to be responsible for this mechanism of ciprofloxacin resistance [65]. The presence of these mutations is correlated with the ciprofloxacin resistance profile for strains with available drug resistance information. However, they differ for two strains, ED98 and 16 K (only the mutation in parC is present). This may suggest an intermediate drug resistance level for these strains. Unfortunately, ciprofloxacin resistance information is not available for these strains.

Erythromycin

Erythromycin acts by binding the 23S rRNA molecule (in the 50S subunit) of the bacterial ribosome complex, leading to inhibition of protein synthesis [7].

There are three known erythromycin resistance mechanisms [107]. The first, and most common, is methylation of the 23S rRNA molecule (addition of two residues to the domain V of 23S rRNA), which prevents the ribosome from binding erythromycin. This methylation is mediated by enzymes from the erm gene family, of which the most common are ermA and ermC. The second mechanism is mediated by the presence of macrolide efflux pumps (encoded by msrA and msrB). The third mechanism is the inactivation of drug molecules by specialized enzymes such as EreA or EreB [107].

We found that none of the strains in our case study possesses the genes encoding EreA or EreB. Genes encoding efflux pumps (msrA and msrB) are also present in drug-susceptible strains (for example, NCTC 8325 and Newman). This may suggest that the mechanism is inactive in the considered strains of S. aureus, or that the production rates are too low, which our method is not able to account for. Using our approach we identified (by the highest support) the gene ermA, which is responsible for the most common drug resistance mechanism.

There is, however, one erythromycin-susceptible strain, USA300_TCH959, which harbours the ermA gene. This may suggest disruption of the drug resistance mechanism in that strain, errors in drug susceptibility testing, or errors in sequencing.

Interestingly, we identified the gene SAR1736 (spc2), which is a known spectinomycin resistance determinant, as potentially associated with erythromycin resistance. This suggests that drug resistance to spectinomycin and erythromycin co-evolved, despite these two drugs belonging to different classes according to the ATC drug classification system [106].

Gentamicin

Gentamicin inhibits protein synthesis by binding the 30S subunit of the ribosome complex [108].

Interestingly, strain USA300-FPR3757 exhibits intermediate drug resistance, which is correlated with the absence of the aacA-aphD gene in its genome sequence. Since our method requires binary information on drug susceptibility, we marked this strain as drug susceptible in our experiments.
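The binarization step mentioned above can be sketched as follows. The S/I/R labels, the function name and the `intermediate_as_resistant` switch are assumptions made for illustration. The default mirrors the choice described in the text, where an intermediate strain is treated as susceptible.

```python
def binarize_phenotype(label, intermediate_as_resistant=False):
    """Map a susceptibility label to a binary phenotype.

    Assumed labels: "S" (susceptible), "I" (intermediate), "R" (resistant).
    Returns True for resistant and False for susceptible.
    """
    label = label.strip().upper()
    if label == "R":
        return True
    if label == "S":
        return False
    if label == "I":
        # Design choice: by default an intermediate strain counts as susceptible.
        return intermediate_as_resistant
    raise ValueError(f"unknown susceptibility label: {label!r}")

# Hypothetical labels for three strains.
labels = ["R", "I", "S"]
binary = [binarize_phenotype(x) for x in labels]
print(binary)
```

Making the treatment of intermediate strains an explicit parameter keeps the choice visible and easy to vary in a sensitivity check.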

The most common resistance mechanism responsible for high levels of gentamicin resistance is mediated by the drug-modifying enzyme SaurJH1_2806 (aacA-aphD). Applying our methodology, we identified the gene encoding it as likely to be associated with drug resistance (maximal support). Moreover, we also identified the gene SaurJH1_2805 as putatively associated with gentamicin resistance. The close proximity of these two genes in the genomes and their highly correlated gene gain/loss profiles suggest co-evolution. We hypothesize that the gene SaurJH1_2805 plays some role in gentamicin resistance.
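One way to quantify how strongly two gene gain/loss profiles are correlated is shown in the minimal sketch below. It assumes each profile is a binary presence/absence vector over the same ordered list of strains and uses the phi coefficient (Pearson correlation for binary data) as an illustrative measure. This is not necessarily the measure used in this work, and the example profiles are made up.

```python
from math import sqrt

def phi_coefficient(profile_a, profile_b):
    """Phi coefficient between two binary presence/absence profiles
    defined over the same ordered list of strains."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same strains")
    pairs = list(zip(profile_a, profile_b))
    n11 = sum(1 for a, b in pairs if a and b)          # both genes present
    n10 = sum(1 for a, b in pairs if a and not b)      # only first present
    n01 = sum(1 for a, b in pairs if not a and b)      # only second present
    n00 = sum(1 for a, b in pairs if not a and not b)  # both absent
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # a constant profile has no defined correlation
    return (n11 * n00 - n10 * n01) / denom

# Hypothetical gain/loss profiles over six strains.
gene_x = [1, 1, 0, 0, 1, 0]
gene_y = [1, 1, 0, 0, 1, 0]  # identical to gene_x
gene_z = [1, 0, 0, 1, 1, 0]  # differs from gene_x in two strains

print(phi_coefficient(gene_x, gene_y))
print(phi_coefficient(gene_x, gene_z))
```

Identical profiles give a coefficient of 1, and the value decreases as the profiles disagree in more strains.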

Conclusion

In this work we present a novel approach to associate genes and mutations with drug resistance phenotypes by comparative analysis of fully sequenced bacterial strains (within the same species).

In order to apply our approach we collected genotype and phenotype data. Genome sequences and annotations were downloaded for 100 fully sequenced S. aureus strains. A challenge was to collect drug resistance information, which is spread throughout the literature. We retrieved the data from 71 publications. The collected dataset is available in the supplementary material (additional file 1).

In our method we consider two types of genetic differences as potentially associated with drug resistance: nonsynonymous point mutations and gene acquisition, represented by their mutation and gene gain/loss profiles, respectively. The approach is based on the newly introduced concept of support, which is a score assigned to all mutation and gene gain/loss profiles. Intuitively, the higher the support of a mutation profile, the more likely the mutation is to be associated with the drug-resistant phenotype. We also generalize the concept of support into weighted support, which incorporates phylogenetic information.

Applying our approach, we were able to successfully re-identify most of the known drug resistance determinants. Here, on average, weighted support outperforms support and odds ratio.
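For reference, the odds ratio used as a baseline in this comparison can be computed from a 2x2 contingency table of profile presence against phenotype. The sketch below is a generic illustration rather than the exact procedure used in this work. The function name, the skipping of strains with unknown phenotype (`None`) and the 0.5 continuity correction for empty cells are assumptions, and the data are made up.

```python
def odds_ratio(profile, resistant, correction=0.5):
    """Odds ratio between a binary profile and a binary phenotype.

    profile:   list of 0/1 (mutation or gene absent/present)
    resistant: list of True/False/None (None = phenotype unknown, skipped)
    """
    a = b = c = d = 0
    for present, phenotype in zip(profile, resistant):
        if phenotype is None:
            continue  # no susceptibility data for this strain
        if present and phenotype:
            a += 1  # present, resistant
        elif present:
            b += 1  # present, susceptible
        elif phenotype:
            c += 1  # absent, resistant
        else:
            d += 1  # absent, susceptible
    if 0 in (a, b, c, d):
        # Assumed continuity correction to avoid division by zero.
        a, b, c, d = (x + correction for x in (a, b, c, d))
    return (a * d) / (b * c)

# Hypothetical profile and phenotypes over seven strains.
profile = [1, 1, 1, 0, 0, 0, 1]
phenotype = [True, True, False, False, False, True, None]
print(odds_ratio(profile, phenotype))
```

Unlike weighted support, this baseline treats all strains as independent observations and does not use phylogenetic information.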

Moreover, applying our methodology, we identified some putative novel resistance-associated genes and mutations. We expect that these associations and drug resistance predictions will encourage the experimental research community to verify their role in drug resistance mechanisms.

Finally, although the presented approach shows promise, it has some obvious limitations. Firstly, it is not clear what threshold for the weighted support is appropriate. Secondly, in this approach we only consider genome variations, whereas some drug resistance mechanisms may be related to changes in protein production rates (such as efflux pumps). Thirdly, the current approach ignores the role of non-coding RNA in drug resistance mechanisms, so such mechanisms would not be detected. We plan to address these problems in future work.