Background

Animal models have facilitated the investigation of the mechanisms of several diseases. For drug addiction in particular, inbred mouse strains have proved to be invaluable [1, 2], and have facilitated the mapping of aspects of addiction-related behaviour to specific genetic loci. Inbred Long Sleep (ILS) and Inbred Short Sleep (ISS) mice, for instance, differ markedly in a number of alcoholism-related phenotypes [3–6]. They have been widely used to model ethanol sensitivity [7, 8]. Ethanol sensitivity has a genetic basis [9], the comprehensive workings of which remain elusive. Consequently, a comparison of relevant brain region transcriptomes of ILS and ISS mice has the potential to reveal unique patterns of gene expression [10] that could be relevant to the mechanisms of alcoholism.

The cerebellum has long been almost exclusively associated with balance and motor co-ordination. It has relatively recently been found to be more involved with cognition than previously thought [11]. During neurodevelopment, the cerebellum is especially susceptible to ethanol toxicity [12]. Studies indicate a role for activation of the cerebellum in alcoholism. A Functional Magnetic Resonance Imaging study has indicated that ethanol odour-induced craving in untreated recently abstinent male alcoholics involves activation of the cerebellum along with the subcortical-limbic region of the right amygdala/hippocampal area [13]. Positron Emission Tomography studies in drug addiction similarly indicate a role for cerebellar activation [14, 15]. The identification of specific pathways contributing to alcoholism-related events in the cerebellum would, therefore, be important.

The phosphoinositide (PI) and cyclic adenosine 3',5'-monophosphate (cAMP) signalling pathways have long been thought to be important in the development of ethanol dependence and tolerance [16]. Several lines of evidence suggest a role for the cAMP/protein kinase A (PKA)/cAMP-response-element-binding protein (CREB) signalling pathway in addiction, even though they do not necessarily involve the cerebellum. Alcohol-preferring (P) rats have lower levels of CREB and the transcriptionally-active phospho-CREB in the medial amygdala and central amygdala (CeA) than non-preferring (NP) rats [17]. Ethanol administration (or PKA activator [Sp-cAMP] administration into the CeA) increases CREB function in the CeA of P (but not NP) rats. Also, 24 hours following a single intra-peritoneal 2 mg/kg ethanol dose to C57BL/6J mice, there is long-term potentiation of GABA synaptic transmission at Ventral Tegmental Area dopaminergic neurons, via a cAMP-PKA-dependent mechanism [18]. One mechanism by which ethanol increases CREB levels involves inhibition of adenosine reuptake, which results in increases in extracellular adenosine and activation of the adenosine A2 receptor, leading to increases in cAMP levels [19]. The ethanol-induced increase in CRE-mediated gene transcription requires PKA and involves an adenosine receptor-dependent phase and a later adenosine receptor-independent phase [20].

The emergence of high-throughput data has facilitated the study of patterns of transcription. Machine Learning (ML) is one avenue for mining such data [21]. It concentrates on methods by which computer programs improve their performance (i.e. modify their behaviour) by learning from previous data examples. ML is useful for the purpose of class prediction. During the learning process, structural patterns in the "training set" are established; these then constitute the basis upon which predictions are made when the scheme is presented with data of unknown classification (the "test set").

In the current studies, genes found to be differentially expressed in the cerebella of ILS and ISS mice [22] were examined to identify the extent to which CREB transcription regulates addiction mechanisms in the cerebellum. Nucleotide sequences of the promoter regions of various genes were analyzed to generate the data used for ML. The Composition, Transitions, and Distributions [23] of individual nucleotide bases as well as groups of nucleotide bases (Table 1), along with the presence and relative positions of specific cis elements were the basis on which genes were classified as being either transcriptionally CREB regulated or otherwise. The results reveal a strong pattern, in the cerebellum, of CREB regulation among genes differentially expressed between ILS and ISS mice.

Table 1 Nucleotide base groupings used

Results

Four ML schemes were evaluated: a Decision Tree (J48, an implementation of the C4.5 algorithm), a Support Vector Machine (SVM), a Naïve Bayes classifier (NN) and a Multi-layer Perceptron (MLP). Two alternate models for ML were tested in this study, using a dataset of 46 instances and two classes. These were:

• a two-class model with classifications: "CREB-regulated" and "NOT CREB-regulated", and

• a three-class model with a third classification "Nrf2-regulated" [24]

Nrf2 (NF-E2-related factor 2), the primary transcription factor that binds the Antioxidant Response Element (ARE), was selected because, like CREB, Nrf2 is a ubiquitous transcription factor. Secondly, it has a requirement for CREB Binding Protein for enhanced transcription activity [25]. Using the leave-one-out cross-validation technique, the two-class model had lower Mean Absolute Error rates for all learning schemes explored than the three-class model (Figure 1A). Also, of the four schemes and two models evaluated, the area under the Receiver Operating Characteristic (ROC) curve, a measure of test accuracy, was highest for the C4.5 scheme under the two-class model (Figure 1A).

Figure 1

Learning Scheme Accuracy and Error Rates. Accuracy and error rates for learning on a two-class and a three-class model (defined in the Methods section), using the Leave-one-out Cross Validation technique. A) A comparison of accuracy and error rates for learning on the two-class model versus the three-class model. The two-class model yields the higher area under the ROC curve and the lower Mean Absolute Errors for the C4.5 and the Naïve Bayes classifiers. B) A depiction of the error rates determined using the two-class model. The C4.5 scheme has the lowest Mean Absolute Error and Relative Absolute Error.

Of the four ML schemes, using the leave-one-out cross-validation technique and the two-class model, the C4.5 Decision Tree algorithm had the lowest overall predicted error rate (Figure 1B; Table 2). Its ROC curve was closest to the left-hand border and the top border of the ROC space (Figure 2 and Additional File 1), indicating that it had the best trade-off between sensitivity and specificity among the four schemes evaluated. It also had the highest area under the ROC curve (Table 3). The C4.5 Decision Tree algorithm [26] works top-down, seeking at each stage an attribute that best separates the classes. The attribute with the greatest information gain is chosen. It then recursively processes the sub-problems resulting from the split until the information either reaches a maximum or is zero. The information measure (entropy) is calculated thus:

Table 2 Performance of learning schemes following 460 runs
Table 3 Area under ROC curves, two-class model.
Figure 2

Learning Scheme ROC Curves. Receiver Operating Characteristic (ROC) curve for learning schemes using the two-class model and the Leave-one-out Cross Validation technique. The C4.5 test is closest to the left-hand border and the top-border of the ROC space, and therefore the most accurate of the schemes.

$$\mathrm{Entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_n \log_2 p_n$$

where $p_1, p_2, \ldots, p_n$ are fractions representing the data distribution at a node (attribute) and sum to 1.
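As an illustration of the entropy measure defined above, the following Python sketch computes the entropy of a class distribution and the information gain of a candidate split. The function names and the example counts are hypothetical and are not taken from the study; the full C4.5 algorithm includes further refinements (such as gain ratio and pruning) that are not shown here.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    # Empty classes are skipped, so 0 * log2(0) is treated as 0.
    fractions = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in fractions)

def information_gain(parent_counts, child_counts_list):
    """Entropy of the parent node minus the weighted entropy of its children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in child_counts_list)
    return entropy(parent_counts) - weighted

# Hypothetical node holding 23 instances of each class, split into two branches.
parent = [23, 23]
children = [[20, 3], [3, 20]]
gain = information_gain(parent, children)
```

A perfectly balanced two-class node has an entropy of 1 bit, and a node containing a single class has an entropy of 0, which is the condition under which the recursion has nothing left to separate.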

The two-class model was also used to test an independent dataset generated from 21 genes of known CREB regulation status. C4.5 correctly classified 81% of instances (Table 4), with F-measures of 0.87 and 0.67 for the classes "CREB-regulated" and "NOT CREB-regulated", respectively. The F-measure is the harmonic mean of Precision and Sensitivity and can be used as a single measure of a test's performance:

Table 4 Evaluation of two-class model: C4.5 predictions on an independent set of genes of known CREB regulation status

F-measure = (2 * Precision * Sensitivity)/(Precision + Sensitivity)

where Precision = True Positives/(True Positives + False Positives)

Sensitivity (or Recall) is a measure of the probability that the test would reject a false null hypothesis:

Sensitivity = True Positives/(True Positives + False Negatives)
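The three definitions above can be sketched in a few lines of Python. The counts used here are hypothetical and are not the confusion matrix behind Table 4; the function names are likewise chosen only for illustration.

```python
def precision(tp, fp):
    """True Positives / (True Positives + False Positives)."""
    return tp / (tp + fp)

def sensitivity(tp, fn):
    """True Positives / (True Positives + False Negatives); also called Recall."""
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of Precision and Sensitivity."""
    p = precision(tp, fp)
    s = sensitivity(tp, fn)
    return 2 * p * s / (p + s)

# Hypothetical counts for one class: 9 true positives, 1 false positive,
# 3 false negatives.
score = f_measure(tp=9, fp=1, fn=3)
```

Because the F-measure is computed per class, a two-class problem yields two values, one for each class treated in turn as the "positive" class.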

Additionally, using the two-class model, three out of four genes determined by two independent microarray platforms to be up-regulated in the ILS cerebellum [22] were determined by C4.5 to be transcriptionally CREB-regulated (Table 5). The platforms were the Affymetrix (Santa Clara, CA) platform Mouse Expression Set 430 (MOE430) and the cDNA arrays NIA15K manufactured at the University of Colorado's School of Medicine. Similarly, three out of four genes up-regulated by both platforms in the ISS cerebellum were deemed CREB-regulated (Table 6). Furthermore, 64% and 52% of a cross-section of other up-regulated cerebellar genes in ILS and ISS mice, respectively (as per the MOE430 platform), were deemed CREB-regulated.

Table 5 C4.5 two-class model predictions for up-regulated genes (cross-validated between MOE430 and NIA15k platforms) in ILS mouse cerebellum

Discussion

Lowered sensitivity to the effects of ethanol increases the risk of developing alcoholism. Differing sensitivities to ethanol are, at least in part, attributable to heredity [9], and inbred mouse strains have facilitated the investigation of this complex behavioral phenomenon. In studying CREB's gene regulating activity in ethanol sensitivity, a set of differentially expressed genes in the ILS/ISS mouse model of ethanol sensitivity was examined. The two-class model had lower error rates than the three-class model (Figure 1A). This is probably due to the inherent difficulty of distinguishing between the classifications "CREB-regulated" and "Nrf2-regulated". Indeed, the case can be made that Nrf2 genes are dependent on CREB for enhanced transcription activity [25]. The complexity of the machinery for transcription makes the two-class model the preferred model for this study.

Properties of stretches of nucleotides can impact their affinity for specific transcription factors; this principle can be exploited for its therapeutic promise [27]. A central premise of this observation is the fact that the characteristics of individual nucleotide bases in any such oligonucleotide contribute to its structure and function [28]. As an example, hydrogen-bonded base pairs help determine the structure and function of nucleic acids. Strength of hydrogen bonding and other nucleotide base classifications used in generating the characteristics of each gene's promoter sequence for this ML study have been outlined in Table 1.

Of the four learning schemes evaluated using the two-class model, C4.5 was the most consistent performer, having the lowest overall error rates (Figure 1B) and the highest accuracy (Figure 2; Table 3), the area under the ROC curve being a measure of test accuracy. Because of variability between independent evaluation sets, performance evaluations based on evaluation sets are only instructive when such evaluation sets are large in size. Since the evaluation set used consisted of only 21 instances, the cross-validation techniques are better indicators of each learning scheme's performance. The Ten-fold Cross Validation technique is a standard way of predicting the error rate of a learning scheme [29, 30]. When applying this technique, an average value is obtained for ten different sets of the re-organized data such that in each case, 90% of the data is used for training and 10% is used for testing. The leave-one-out technique is, in essence, an n-fold cross-validation technique (n being the number of instances in the dataset) and, for a small dataset, a good predictor of a scheme's performance on an independent dataset. In this study, 81% of genes of known classification used as an evaluation set were correctly classified by C4.5 (Table 4), with F-measures of 0.87 and 0.67 for the classes "CREB-regulated" and "NOT CREB-regulated", respectively.
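The relationship between ten-fold and leave-one-out cross-validation described above can be made concrete with a short Python sketch. It only generates the train/test index splits for a dataset of 46 instances; the function name is hypothetical, and the sketch performs no shuffling or stratification, which an ML workbench may apply in practice.

```python
def k_fold_indices(n_instances, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_instances))
    for fold in range(k):
        test = indices[fold::k]          # every k-th instance, offset by fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

n = 46  # size of the training dataset described in the Methods
ten_fold = list(k_fold_indices(n, 10))
leave_one_out = list(k_fold_indices(n, n))   # n folds, one instance held out each
```

With k equal to the number of instances, every test set contains exactly one instance, which is the leave-one-out case; with k = 10 and 46 instances, each test set holds four or five instances, roughly 10% of the data.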

The stretch of nucleotides between the cAMP Response Element (CRE) and the Transcription Start Site (TSS) and the stretch between the CRE and the Transcription Factor II D (TFIID) bind site were identified as important determinants of a gene's CREB regulation status (Figure 3). Two types of CRE with different affinities for the transcription factor CREB have been reported. One class containing the symmetrical TGACGTCA site shows a high binding affinity for CREB; the other type has asymmetric and weak binding sites ("CGTCA") [31]. The TATA-binding protein (TBP) and TBP-associated factors (TAFs) constitute the TFIID complex. The TFIID complex is a major component of the general RNA polymerase II (RNAP II) transcription machinery with intrinsic sequence-specific DNA-binding activity [32]. The binding of TFIID to a gene's core promoter region is an important rate-limiting step in the assembly of the transcription initiation complex. With the notable exception of the stretch between the CRE and the TFIID bind site, CREB target promoter regions have relatively high levels of nucleotide bases with strong Hydrogen Bonding (data not shown).

Figure 3

Promoter Region Distance Metrics. Boxplots depicting the relative positions of the cAMP Response Element (CRE), the Transcription Start Site (TSS), and the Transcription Factor IID binding site. Promoter regions of genes regulated by CREB were more likely to have CREs. In the absence of the CRE ("TGACGTCA", "CGTCA" or "TGCGTCA") in a gene's promoter, the entire promoter sequence was characterized, resulting in longer "distances" for those transcriptionally CREB-independent genes.

The transcription factor, CREB, is ubiquitously expressed in brain cells and is involved, among others, in learning and memory, anxiety, depression, and addiction [33]. A number of different signalling pathways culminate in the activation of CREB. These include pathways involving PKA, MAPK-activated ribosomal S6 kinases (RSKs), and calcium/calmodulin-dependent kinase IV (CaMKIV) [34]. Others such as CaMKII reduce CREB transcriptional activity [35]. Four genes have previously been found, by two independent microarray platforms, to be up-regulated in the ILS mouse cerebellum relative to the ISS cerebellum [22]. Of these, three were predicted by C4.5 as being CREB-dependent (Table 5). Similarly, three out of four genes up-regulated in the ISS cerebellum relative to the ILS cerebellum were predicted by the C4.5 scheme to be transcriptionally CREB-dependent (Table 6). Of a cross-section of genes up-regulated in the ILS cerebellum relative to ISS per the Affymetrix MOE430 platform [22], 64% were predicted by the C4.5 scheme to be transcriptionally CREB-dependent. Out of a similar cross-section up-regulated in the ISS cerebellum relative to the ILS cerebellum, 52% were predicted to be CREB-dependent. These indicate that CREB may be playing a central transcription-regulating role in the cerebellum in this ethanol sensitivity model.

Table 6 C4.5 two-class model predictions for up-regulated genes (cross-validated between MOE430 and NIA15k platforms) in ISS mouse cerebellum

Conclusion

Taken together, the observations made suggest that, in the cerebellum, CREB plays a key role in ethanol sensitivity, and they present the field with a central hypothesis that needs to be further tested. CREB's role in mediating a number of complex behaviours has been documented [33]. Events in the extended amygdala have long been associated with the reinforcing effects of addicting drugs [36]. It is evident that the cerebellum, though less well studied in this regard, is involved in addiction [13–15]. Since CREB's transcription regulating activity differs from cell type to cell type [37], pursuit of the implications of a key role for CREB in this addiction model's cerebellar molecular milieu would be both promising and instructive.

Methods

A training dataset for ML was created out of twenty-three known targets of CREB transcriptional regulation [38, 39], and twenty-three genes out of a set of twenty-eight (Table 7) whose transcription regulations are potentially CREB-independent. An independent set of twenty-one genes served as an evaluation set.

Table 7 List of "potentially CREB-independent" genes

"Potentially CREB-independent genes"

Nrf2 binds to CREB Binding Protein for enhanced transcription activating activity [25]. Cigarette Smoke (CS)-induced oxidative stress has been associated with the expression of Nrf2 transcription-dependent antioxidant and cytoprotective genes [40]. In experiments conducted by authors V.M and S.B., Nrf2 knockout and Wild-type mice were exposed to CS and Air. The genes listed in Table 7 were up-regulated in both groups, suggesting that their transcriptional regulation is Nrf2-independent (see "Oligonucleotide Microarray" below for further details on what constitutes "Nrf2-independent" genes). Furthermore, none of these genes is known specifically to be a target of CREB transcription regulation. Additionally, as depicted in Figure 3, these genes are distinguishable from those that are known targets for CREB transcription regulation.

CS Exposure

Mice of both genotypes were subjected to cigarette smoke exposure using a machine similar to the one used by [41]. The control groups were kept in a filtered air environment, and the experimental groups were subjected to CS for 5 hours by burning 2R4F reference cigarettes (2.45 mg nicotine per cigarette; Tobacco Research Institute, University of Kentucky), using a smoking machine (Model TE-10, Teague Enterprises). Details of the smoking protocol have been described previously [40]. Mice were fed AIN-76A diet (Harlan Teklad) and had access to water ad libitum; they were housed under controlled conditions (23 ± 2°C; 12-hour light/dark cycles). All experimental protocols conducted on the mice were performed in accordance with the standards established by the US Animal Welfare Acts, as set forth in NIH guidelines and in the Policy and Procedures Manual of the Johns Hopkins University Animal Care and Use Committee.

Oligonucleotide Microarray

Lungs were isolated after 5 hours of CS exposure. Total RNA from the lungs was extracted, using TRIZOL reagent (Invitrogen Corp.). The isolated RNA was hybridized to Murine Genome MOE 430 2.0 GeneChip arrays (Affymetrix, Santa Clara, CA) according to procedures described previously [40]. This array contains probes for detecting approximately 14,500 well-characterized genes and 4371 expressed sequence tags. Scanned output files were analyzed using Affymetrix GeneChip Operating Software version 1.3, and were independently normalized to an average intensity of 500. The data was further analyzed as described previously [42], by performing 9 pairwise comparisons for each group (Nrf2+/+, CS, n = 3, versus Nrf2+/+, air, n = 3, and Nrf2-/-, CS, n = 3, versus Nrf2-/-, air, n = 3). To limit the number of false positives, only those altered genes that showed more than a 1.5-fold change (FC) in magnitude and appeared in, at least, 6 of the 9 comparisons were selected. In addition, the Mann-Whitney pairwise comparison test was performed to rank the results by the significance (P ≤ 0.05) of each identified change in gene expression. In identifying transcriptionally Nrf2-independent genes, only those genes which passed all of these criteria were selected. Further, only those genes that were differentially induced (or repressed) by CS to a similar extent in both genotypes, and having a FC ≥ 2.0 magnitude were considered to be independent of Nrf2's transcription regulating activity. This last dataset was combined with data from previously published work [40] (Genechip used was Murine U74A version 2) to arrive at a comprehensive "Nrf2-independent" gene set.
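The fold-change criterion described above (more than a 1.5-fold change in at least 6 of the 9 pairwise comparisons) can be illustrated with the following Python sketch. It is a simplified illustration only: the function name and intensity values are hypothetical, the requirement that the qualifying comparisons all change in the same direction is an assumption, and neither the GeneChip software's own comparison analysis nor the Mann-Whitney ranking step is reproduced.

```python
from itertools import product

def passes_fold_change_filter(treated, control, min_fc=1.5, min_hits=6):
    """True if enough treated-vs-control pairs change by more than min_fc.

    Every treated sample is compared with every control sample
    (3 x 3 = 9 comparisons when n = 3 per group).
    """
    ratios = [t / c for t, c in product(treated, control)]
    up = sum(r > min_fc for r in ratios)          # induced comparisons
    down = sum(r < 1 / min_fc for r in ratios)    # repressed comparisons
    return max(up, down) >= min_hits

# Hypothetical normalized intensities for one gene (n = 3 per group).
cs = [900.0, 1100.0, 1000.0]
air = [500.0, 520.0, 700.0]
selected = passes_fold_change_filter(cs, air)
```

In this example seven of the nine ratios exceed 1.5, so the gene would pass this particular criterion.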

Promoter Sequence Characteristics

Promoter sequences (1000 nucleotides upstream to 100 nucleotides downstream) corresponding to each gene were obtained from the cited database source [43]. For each promoter sequence, a vector of size 132, with elements characterizing features of the sequence (Figure 4), was generated using a Common Lisp [44] algorithm. The elements of the vector included a Boolean indicating whether or not the cAMP Response Element (CRE) was present, the number of nucleotide base pairs ("distance") between the CRE ("TGACGTCA", "CGTCA" or "TGCGTCA") and the Transcription Start Site (TSS), and the "distance" between the CRE and the TFIID bind site ("TATAGAA", "TATAAAA", "TATAG", or "TATA").
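The CRE presence flag and the CRE-to-TSS "distance" described above might be derived along the following lines. This is a Python sketch rather than the Common Lisp implementation used in the study, and it rests on several assumptions that the text does not specify: the TSS is taken to sit at 0-based index 1000 of a 1100-nucleotide sequence, only CREs lying entirely upstream of the TSS are considered, and when several CREs are present the one nearest the TSS is used. All names are hypothetical.

```python
CRE_MOTIFS = ("TGACGTCA", "TGCGTCA", "CGTCA")
TSS_INDEX = 1000  # assumed 0-based position of the TSS in a 1100-nt promoter

def find_motif_sites(sequence, motifs):
    """Return sorted (start, end) spans of every motif occurrence; end is exclusive."""
    sites = set()
    for motif in motifs:
        start = sequence.find(motif)
        while start != -1:
            sites.add((start, start + len(motif)))
            start = sequence.find(motif, start + 1)
    return sorted(sites)

def cre_features(sequence, tss_index=TSS_INDEX):
    """Return (has_cre, distance) for one promoter sequence.

    distance is the number of bases between the nearest upstream CRE and the
    TSS; with no CRE, the whole promoter length is reported instead.
    """
    upstream = [site for site in find_motif_sites(sequence, CRE_MOTIFS)
                if site[1] <= tss_index]
    if not upstream:
        return 0, len(sequence)
    nearest_end = max(end for _, end in upstream)
    return 1, tss_index - nearest_end

# Hypothetical promoter: a full palindromic CRE ending 92 bases before the TSS.
example = "A" * 900 + "TGACGTCA" + "A" * 192
flag, distance = cre_features(example)
```

Reporting the full promoter length when no CRE is found mirrors the statement that, in the absence of a CRE, the entire promoter sequence was characterized and the "distance" was therefore longer.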

Figure 4

Promoter Region Vectors. Breakdown of the elements of the vector generated for each gene. For each gene, 132 numbers were generated from the promoter sequence. Blocks A through O are itemizations of the elements of the vector obtained for each gene.

In addition to these, the three kinds of features of nucleotide sequences used were Composition, Transition and Distribution [23]. Composition is a reference to the proportions of nucleotide base types contributing to the promoter sequence make-up. Transitions represent the frequency with which specific nucleotide base types are followed or preceded, within the sequence, by other nucleotide base types. Distribution is a statement concerning the dissemination of specific nucleotide base types within portions of the sequence (or the entire sequence).
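A minimal Python sketch of Composition, Transition and Distribution for a two-way grouping of bases is given below. The precise definitions follow [23] in the study and are not spelled out in the text, so the choices made here are assumptions: Transition is the percentage of adjacent positions at which the group changes, and Distribution records, for each group, the sequence position (as a percentage of sequence length) at which the first, 25%, 50%, 75% and 100% of that group's bases have been encountered. The grouping sets use the standard purine, strong hydrogen bonding and keto base assignments, assumed to match Table 1.

```python
PURINES = {"A", "G"}        # versus pyrimidines C, T
STRONG = {"G", "C"}         # versus weak hydrogen bonding A, T
KETO = {"G", "T"}           # versus amino A, C

def ctd_features(sequence, group_a):
    """Return 13 values: 2 Composition, 1 Transition, 10 Distribution.

    group_a is the set of bases in the first group; all others form the second.
    """
    labels = [0 if base in group_a else 1 for base in sequence]
    n = len(labels)
    composition = [100.0 * labels.count(g) / n for g in (0, 1)]
    changes = sum(1 for x, y in zip(labels, labels[1:]) if x != y)
    transition = 100.0 * changes / (n - 1)
    distribution = []
    for g in (0, 1):
        positions = [i + 1 for i, lab in enumerate(labels) if lab == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if positions:
                k = max(1, int(frac * len(positions)))   # k-th base of the group
                distribution.append(100.0 * positions[k - 1] / n)
            else:
                distribution.append(0.0)                 # group absent
    return composition + [transition] + distribution

features = ctd_features("AGCT", PURINES)
```

The output length of 13 per grouping matches the layout reported for the vector (two positions for Compositions, one for Transitions, and ten for Distributions).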

Nucleotide Base Types

For the purpose of the sequence characterizations just described, nucleotide bases were grouped based on whether they were purine or pyrimidine, the strength with which they form hydrogen bonds, and whether they were "keto" or "amino" (Table 1).

The breakdown of the elements of each vector (Figure 4) is as follows: percent Compositions for the individual nucleotide bases (positions 1 to 4); percent Compositions, Transitions, and Distributions for the Purine versus Pyrimidine base types (positions 5 – 17, consisting of two positions for Compositions, one for Transitions, and ten for Distributions); percent Compositions, Transitions, and Distributions for Strong versus Weak Hydrogen Bonding base types (positions 18 – 30, consisting of two positions for Compositions, one for Transitions, and ten for Distributions); and percent Compositions, Transitions, and Distributions for "Keto" versus "Amino" base types (positions 31 – 43, consisting of two positions for Compositions, one for Transitions, and ten for Distributions). The presence or absence of a CRE was indicated by a "1" or a "0" respectively at position 44. The sub-sequence made up of the stretch of bases between the CRE and the TSS was characterized at positions 45 through 88. At position 45, the "distance" was stated. In the absence of a CRE, the entire promoter sequence was characterized in lieu of the sought sub-sequence. In other words, in the absence of a CRE as defined above, the "distance" was longer. Details for positions 46 through 88 were as follows: individual nucleotide base percent Compositions were indicated at positions 46 – 49; Purine versus Pyrimidine base type data were at positions 50 – 62; Strong versus Weak Hydrogen Bonding base type data were at positions 63 – 75; "Keto" versus "Amino" base type data were at positions 76 – 88. Correspondingly, the sub-sequence made up of the stretch of bases between the CRE and the TFIID bind site was similarly characterized at positions 89 through 132.
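The position arithmetic in this breakdown can be checked with a small Python sketch that rebuilds the layout from the block sizes. The dictionary keys and function name are hypothetical labels. The split of positions 89 through 132 into one "distance" position followed by a 43-element block is inferred from the statement that this stretch was "similarly characterized" and is therefore an assumption.

```python
def block_layout(start):
    """Positions (1-based, inclusive) of one 43-element characterization block."""
    layout = {"base composition": (start, start + 3)}   # four individual bases
    pos = start + 4
    for grouping in ("purine/pyrimidine", "strong/weak H-bond", "keto/amino"):
        layout[grouping] = (pos, pos + 12)              # 2 + 1 + 10 = 13 positions
        pos += 13
    return layout

vector_layout = {
    "whole promoter": block_layout(1),         # positions 1 to 43
    "CRE present": (44, 44),
    "CRE-TSS distance": (45, 45),
    "CRE-TSS stretch": block_layout(46),       # positions 46 to 88
    "CRE-TFIID distance": (89, 89),            # assumed, by analogy with 45
    "CRE-TFIID stretch": block_layout(90),     # assumed positions 90 to 132
}
```

Each block contributes 4 + 3 × 13 = 43 elements, so the three blocks, the CRE flag and the two distances together account for all 132 positions.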

Four ML schemes were evaluated for their learning performance on the models created: a Decision Tree (J48, an implementation of the C4.5 algorithm), a Support Vector Machine (SVM), a Naïve Bayes classifier (NN) and a Multi-layered Perceptron (MLP), all available through the Weka ML workbench [45]. The C4.5 algorithm emerged as having the lowest predicted error rate (Figure 1). The decision tree (Additional File 2) used in evaluating the independent dataset is based on all the training data. After applying the Corrected Resampled t-test [46] to data generated following use of the leave-one-out technique with ten iterations for each fold, error rates for C4.5 were significantly (p = 0.05) lower than those of SVM and MLP (Table 2). The rates were lower relative to NN though not statistically significant (Table 2). The ROC curves (Figure 2) used as indicators of performance were also generated using the "CREB-regulated" class and the default Weka ML workbench. The threshold modifications that constituted the basis of the ROC curves have been detailed in Additional File 1.
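For readers unfamiliar with the Corrected Resampled t-test, one commonly cited form of the statistic is sketched below in Python. It is offered as an illustration under stated assumptions, not as the exact computation performed by the workbench: the function name is hypothetical, the inputs are the paired per-run differences in error between two schemes, and the variance of those differences is inflated by the ratio of test-set size to training-set size to account for overlap between training sets.

```python
import math
from statistics import mean, variance

def corrected_resampled_t(differences, n_train, n_test):
    """t statistic for paired per-run differences, with the variance correction.

    differences: one error difference (scheme A minus scheme B) per run.
    n_train, n_test: number of instances used for training and testing per run.
    """
    k = len(differences)
    var = variance(differences)        # sample variance, k - 1 in the denominator
    return mean(differences) / math.sqrt((1.0 / k + n_test / n_train) * var)

# Hypothetical differences from three runs, with 9 training and 1 test instance.
t_stat = corrected_resampled_t([1.0, 2.0, 3.0], n_train=9, n_test=1)
```

The statistic is then compared against a t distribution with k - 1 degrees of freedom; under leave-one-out on 46 instances, the ratio of test to training size would be 1/45.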

Subsequently a set of genes whose expressions have been previously determined [22] to be increased in ILS or ISS cerebella was identified and the CREB regulation status of each member predicted using the ML scheme C4.5.

Exploratory Data Analysis (EDA) techniques [47] were also used to characterize the vector set. Specifically, boxplots [48] were used to capture the distribution's central tendency (median), spread (fourth-spread), skewness (based on the relative positions of the median, lower fourth and upper fourth), tail length as well as outliers (Figure 3). The statistical environment used to implement the EDA aspects of the study was R [49].
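The quantities summarized by a boxplot can be computed directly, as in the following Python sketch (the study itself used R). The fourths are taken here as the medians of the lower and upper halves of the sorted data, with the halves sharing the median when the number of values is odd; the 1.5 × fourth-spread rule for flagging outliers is a common convention and is assumed rather than taken from the text. The function name and example values are hypothetical.

```python
from statistics import median

def boxplot_summary(values):
    """Median, fourths, fourth-spread and outliers for a list of numbers."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2                      # halves share the median when n is odd
    lower_fourth = median(data[:half])
    upper_fourth = median(data[n - half:])
    spread = upper_fourth - lower_fourth     # the fourth-spread
    low_fence = lower_fourth - 1.5 * spread
    high_fence = upper_fourth + 1.5 * spread
    return {
        "median": median(data),
        "lower_fourth": lower_fourth,
        "upper_fourth": upper_fourth,
        "fourth_spread": spread,
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical "distance" values with one extreme observation.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

Skewness can be read from where the median falls between the two fourths: in this example the median (5) sits midway between the lower fourth (3) and the upper fourth (7), while the value 100 is flagged as an outlier.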