Introduction

On the basis of previous studies of XX female / XY male systems, chromosome-wide gene dosage compensation of the heterogametic sex chromosome was thought to be critical (Meyer et al. 2004). However, recent studies in birds and in moths have definitively shown that most genes on a large chromosome (Z) can remain uncompensated (Itoh et al. 2007; Ellegren et al. 2007; Zha et al. 2009). Further study of ZZ/ZW sex chromosome systems can help to uncover important adaptations that reduce the disadvantages of constitutively higher expression of Z genes in one sex.

Bird sex chromosomes, Z and W, bear many similarities to the mammalian X and Y chromosomes. Like the mammalian X, the Z chromosome is large and gene rich, whereas the W chromosome, like the mammalian Y chromosome, is small and gene poor. Unlike the sex chromosome constitution in mammals, where the homogametic female is XX and the heterogametic male is XY, in birds homogamety is found in ZZ males whereas females are heterogametic with ZW sex chromosomes. Monotreme X and Y chromosomes are thought to have evolved from the common ancestor of birds and mammals via addition of genes to the ancestral Z and W (Grutzner et al. 2004; Rens et al. 2007). However, unlike birds and eutherian mammals, the monotreme platypus has 5X and 5Y sex chromosomes and the X and Y appear to be intermediate between bird Z/W and eutherian mammalian X/Y in gene content and in level of dosage compensation (Deakin et al. 2008).

Mammalian X chromosome inactivation results in dosage compensation between males and females via silencing of one X chromosome in each female cell (Lyon 1961; Gupta et al. 2006; Nguyen and Disteche 2006). X inactivation occurs in the epiblast early in embryonic development (Nguyen and Disteche 2006), but the inactive X chromosome is re-activated in the female germline during meiosis to allow for recombination between the two X chromosomes (Ross et al. 2005). X inactivation is initiated with the transcription of the XIST non-coding RNA from the X inactivation center. XIST RNA coats the X chromosome from which it was transcribed and recruits repressive Polycomb complexes PRC1 and PRC2 to the inactive X that aid in heterochromatin formation and maintenance of gene silencing (Chadwick and Willard 2003; Plath et al. 2003; Silva et al. 2003). LINE elements are thought to help in the spread of X inactivation along the entire chromosome by acting as ‘way stations’ or ‘booster elements’ (Lyon 1998). Long-term maintenance of gene silencing also requires histone modifications and CpG island methylation on the inactive X (Mohandas et al. 1981; Pfeifer et al. 1990).

In contrast to mammals, dosage compensation of Z genes is ineffective in birds, and average Z gene mRNA is at least 30–40% more highly expressed in males than in females (Itoh et al. 2007; Ellegren et al. 2007; Arnold et al. 2008). Although there is no Xist in birds, dosage compensation may be mediated by an Xist-like non-coding RNA transcribed from the MHM (male hypermethylated) locus on the Z chromosome. The MHM RNA is expressed only in females, associates with the Z chromosome near the site of transcription, and is correlated with female-specific hyperacetylation of lysine 16 on histone H4 near the MHM locus (Teranishi et al. 2001; Bisoni et al. 2005). Importantly, the MHM region contains a greater concentration of dosage-compensated genes than the rest of the Z chromosome (Melamed and Arnold 2007; Melamed et al. 2009). The greater compensation of dosage could be explained by female-specific upregulation of genes and/or by downregulation of genes in males (Teranishi et al. 2001; Bisoni et al. 2005; Ellegren et al. 2007; Melamed and Arnold 2007).

Dosage compensation in monotremes has not been assessed for all X chromosome genes. However, among 19 genes examined on the platypus X chromosomes, some appear not to be compensated, whereas others are dosage compensated by stochastic transcriptional inhibition, with expression patterns that vary between genes (Deakin et al. 2008). Paternal inactivation and imprinted partial expression do not appear to be likely mechanisms of dosage compensation in platypus (Deakin et al. 2008). Xist has not been found in marsupial or monotreme genomes, and the region homologous to the XIC in eutherian mammals is disrupted (Hore et al. 2007; Shevchenko et al. 2007).

Although most genes on the X chromosomes are well compensated, about 15% of human X genes escape inactivation (Carrel and Willard 2005). Previous studies in humans explored how to predict X gene inactivation status. Wang et al. (2006) showed that primary DNA features together with support vector machines (SVM) can predict gene compensation status with high accuracy. Here we sought to evaluate whether primary DNA sequences on the chicken Z chromosome could also be used to predict gene compensation status. To answer this question, we evaluated all primary sequences on the Z chromosome using linear SVM and statistical analysis. We found that CpG islands and CR1-LINE elements are significant primary DNA features on the chicken Z chromosome. CR1-LINE elements are enriched on the Z chromosome compared to autosomes in chicken, but do not predict Z gene compensation status, are not differentially associated with compensated or uncompensated genes, and are not enriched in specific regions of the Z chromosome. In contrast, CpG islands are significantly enriched around uncompensated genes and have the highest predictive power of genes’ compensation status compared with other repetitive features. We further show that LINEs are not enriched on the platypus X chromosome, in contrast to the LINE enrichment on the human X and chicken Z chromosomes, suggesting that LINE accumulation is not a general feature of all sex chromosomes.

Materials and methods

Sample collection and microarray data analysis

Sample collection and microarray data analysis have been described previously (Itoh et al. 2007; Melamed and Arnold 2007).

Statistical analysis

Statistical analysis of the feature sequence difference between compensated and uncompensated genes as well as the SVM analyses were performed using R (R Development Core Team 2006). The libSVM package was used for data classification and gene prediction using linear SVM (Chang and Lin 2001). R’s base statistics package was used to perform Fisher’s exact tests on the CpG island and LINE-CR1 features.

For the SVM analysis in chicken, Z gene compensation status was determined by calculating male-to-female expression ratios (M:F ratios) for all Z genes in brain. A total of 310 brain-expressed Z genes were considered: compensated genes (lowest 1/3 quantile, with M:F ratios <1.21) and uncompensated genes (upper 1/3 quantile, with M:F ratios >1.54). Thus, ‘dosage compensation’ is used here to refer to any process that reduces the high M:F ratios assumed to result from the double genomic dose of Z genes in males relative to females. The SVM was trained to associate the set of compensated and uncompensated genes with feature scores. The feature list consisted of 329 unique repetitive elements on the Z chromosome obtained from the UCSC genome browser RepeatMasker track (see Supplementary Table 1). The training set consisted of 280 genes. The leave-out (held-out) cross-validation sample consisted of 30 genes, either chosen randomly on the Z chromosome or taken as a segment of 30 contiguous Z genes, in separate analyses. The SVM was trained on the 280 genes and then tested to determine whether the learned association of features with compensation status could accurately predict the compensation status of the 30 genes that were left out of the training sample. The procedure was repeated 100 times with different sets of 30 genes in the leave-out sample, each time scoring the number of correct and incorrect predictions of the genes’ compensation status.
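As an illustration only, the repeated leave-out procedure described above can be sketched in Python. This is not the code used in the study, which relied on libSVM from R. To stay self-contained, the sketch substitutes a simple nearest-centroid classifier for the linear SVM, and the function names, the toy M:F ratios, the toy feature vectors, and the random seed are all hypothetical.

```python
import random

def label_by_mf_ratio(ratio, low=1.21, high=1.54):
    """Thresholds from the text: <1.21 compensated, >1.54 uncompensated."""
    if ratio < low:
        return "compensated"
    if ratio > high:
        return "uncompensated"
    return None  # middle tertile, not used in the analysis

def fit_centroids(X, y):
    """Stand-in classifier (NOT an SVM): mean feature vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def repeated_holdout(X, y, n_test=30, n_repeats=100, contiguous=False, seed=0):
    """Fraction of left-out genes predicted correctly over all repetitions."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    n = len(X)
    correct = total = 0
    for _ in range(n_repeats):
        if contiguous:
            # A block of neighbouring genes (genes assumed ordered by position).
            start = rng.randrange(n - n_test + 1)
            test_idx = set(range(start, start + n_test))
        else:
            test_idx = set(rng.sample(range(n), n_test))
        train_X = [X[i] for i in range(n) if i not in test_idx]
        train_y = [y[i] for i in range(n) if i not in test_idx]
        model = fit_centroids(train_X, train_y)
        for i in test_idx:
            correct += predict(model, X[i]) == y[i]
            total += 1
    return correct / total

# Toy data: 310 genes with made-up M:F ratios and two made-up feature scores.
ratios = [1.0 if i % 2 == 0 else 1.8 for i in range(310)]
y = [label_by_mf_ratio(r) for r in ratios]
X = [[0.0, 1.0] if lab == "compensated" else [3.0, 1.0] for lab in y]

acc_random = repeated_holdout(X, y)
acc_contiguous = repeated_holdout(X, y, contiguous=True)
print(acc_random, acc_contiguous)
```

With 310 genes and 30 left out, each repetition trains on 280 genes, as in the text. The toy data are perfectly separable, so they exercise only the splitting logic; real feature scores would give much lower accuracy.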

For both Fisher’s exact test and SVM analyses, we considered a total of 23 types of CR1 elements present on the Z chromosome based on information obtained from UCSC genome browser (Supplementary Table 1) and CpG islands. Gene transcription is known to be influenced by both the promoter region sequence and the larger genomic sequences upstream of genes (Bailey et al. 2000; Ke and Collins 2003; Wang et al. 2006). We therefore chose distances upstream of genes to interrogate genomic windows within gene-promoter regions near gene start points as well as genomic windows with other regulatory units in between genes. Average intergene distance on the Z chromosome was calculated to be 134.9 kb. Therefore, a repeat was considered to be associated with a gene if any part of it was within the search width of 2 kb, 10 kb, 50 kb, or 100 kb upstream of the gene’s start.
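The association rule above (a repeat counts if any part of it falls within the search width upstream of the gene start) can be illustrated with a short Python sketch. It assumes 0-based, half-open coordinates and treats "upstream" as strand-dependent; strand handling is not specified in the text, and all names and coordinates below are hypothetical.

```python
SEARCH_WIDTHS = (2_000, 10_000, 50_000, 100_000)  # bp, as listed in the text

def upstream_window(gene_start, gene_end, strand, width):
    """Half-open interval [lo, hi) covering `width` bp upstream of the gene.

    Assumption: for '-' strand genes the transcription start is gene_end,
    so upstream lies at higher coordinates.
    """
    if strand == "+":
        return max(0, gene_start - width), gene_start
    return gene_end, gene_end + width

def count_associated_repeats(repeats, window):
    """Count repeats with at least one base inside the window."""
    lo, hi = window
    return sum(1 for start, end in repeats if start < hi and end > lo)

# Hypothetical gene and repeat annotations.
gene = (100_000, 120_000, "+")
repeats = [
    (50_000, 50_300),    # 50 kb upstream
    (97_000, 97_500),    # 3 kb upstream
    (99_900, 100_100),   # straddles the gene start
    (120_500, 121_000),  # downstream of the gene
]

counts = {
    width: count_associated_repeats(repeats, upstream_window(*gene, width))
    for width in SEARCH_WIDTHS
}
print(counts)
```

The overlap test `start < hi and end > lo` is what implements "any part of it": a repeat that only partly enters the window is still counted.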

For Fisher’s exact test, Z genes were classified into compensated and uncompensated as described above. We counted the number of CpG or LINE elements within 2 kb, 10 kb, 50 kb, or 100 kb genomic windows upstream of genes and established mean numbers of CpG or LINE elements for each genomic window. Genes with fewer than the mean number of CpG or LINE elements were called low-CpG or low-LINE, and genes with more than the mean number of elements were called high-CpG or high-LINE.
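A minimal Python sketch of this grouping and test follows. The study used R's base statistics package; here the two-sided Fisher's exact test is written out from the hypergeometric distribution so that the example has no dependencies. The function names and toy counts are hypothetical, and genes whose count equals the mean exactly are skipped, since the text does not say how such ties were handled.

```python
from math import comb

def split_by_mean(counts, statuses):
    """Build the 2x2 table (low/high element count vs. compensation status).

    Returns (low_comp, low_uncomp, high_comp, high_uncomp).
    """
    mean = sum(counts) / len(counts)
    table = {(g, s): 0 for g in ("low", "high")
             for s in ("compensated", "uncompensated")}
    for n, status in zip(counts, statuses):
        if n == mean:
            continue  # assumption: ties with the mean are left unclassified
        table[("low" if n < mean else "high", status)] += 1
    return (table[("low", "compensated")], table[("low", "uncompensated")],
            table[("high", "compensated")], table[("high", "uncompensated")])

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value: total probability of all tables with the same
    margins that are no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Toy example: CpG island counts in one window for eight genes.
cpg_counts = [0, 0, 1, 0, 3, 4, 2, 5]
statuses = ["compensated"] * 4 + ["uncompensated"] * 3 + ["compensated"]

table = split_by_mean(cpg_counts, statuses)
p_value = fisher_exact_two_sided(*table)
print(table, p_value)
```

In the actual analysis this would be repeated for each of the four window sizes and for each feature.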

Genes and repetitive elements sampling

LINE and CpG island information for the chicken (May 2007 freeze, build May 2006), human (May 2007 freeze, build March 2006), and platypus genomes (February 2009 freeze, build March 2007) were extracted from the UCSC genome annotation database using the Repeats and Variation track and the CpG island track (http://genome.ucsc.edu/) (Supplementary Table 1). Repeat information is based on the RepeatMasker annotation (repeat name, repeat Class, and repeat Family, and start and end positions on the chromosome), which uses a modified Smith–Waterman algorithm to align sequences against the Repbase library (http://www.girinst.org). The CpG island track is based on the definition of DNA sequence >200 bp in length, GC content >0.5, CpGobs/CpGexp ratio (observed to expected ratio based on GC content) ≥0.6 (Gardiner-Garden and Frommer 1987).
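The three CpG island criteria can be written as a small check on a candidate sequence. The Python sketch below is illustrative only: it does not reproduce the scanning procedure behind the UCSC track, and the function names are hypothetical. The observed/expected ratio uses the Gardiner-Garden and Frommer form, (number of CpG × N) / (number of C × number of G), where N is the sequence length.

```python
def cpg_island_stats(seq):
    """Return (length, GC content, observed/expected CpG ratio)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0, 0.0, 0.0
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides (cannot overlap themselves)
    gc_content = (c + g) / n
    expected = c * g / n  # expected CpG count given base composition
    obs_exp = cpg / expected if expected else 0.0
    return n, gc_content, obs_exp

def meets_cpg_island_criteria(seq):
    """Length >200 bp, GC content >0.5, CpG obs/exp ratio >=0.6."""
    n, gc_content, obs_exp = cpg_island_stats(seq)
    return n > 200 and gc_content > 0.5 and obs_exp >= 0.6

# Toy sequences (not real genomic DNA).
cpg_rich = "CG" * 150                # 300 bp, every dinucleotide step is CpG or GpC
gc_rich_no_cpg = "C" * 150 + "G" * 150  # GC-rich, but only one CpG
at_rich = "AT" * 150
too_short = "CG" * 50

for name, s in [("cpg_rich", cpg_rich), ("gc_rich_no_cpg", gc_rich_no_cpg),
                ("at_rich", at_rich), ("too_short", too_short)]:
    print(name, meets_cpg_island_criteria(s))
```

The second toy sequence shows why the ratio matters: high GC content alone does not satisfy the definition when CpG dinucleotides are depleted.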

Information about LINE element positions and length of chromosomes was obtained from the UCSC genome browser RepeatMasker track. We calculated what percentage of each chromosome consisted of LINE-CR1 elements relative to the length of the chromosome. For these analyses, chromosomes of length less than 100 kb (i.e., in chicken, chromosomes 16, 32, and W) were excluded.
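The percentage calculation can be sketched as follows in Python. Two choices here are assumptions rather than details from the text: overlapping annotations are merged so that no base is counted twice, and the autosomal average is taken as the mean of the per-chromosome percentages. Chromosome names, lengths, and intervals are invented for the example.

```python
MIN_CHROM_LENGTH = 100_000  # chromosomes shorter than 100 kb are excluded

def percent_covered(intervals, chrom_length):
    """Percentage of a chromosome covered by the union of half-open intervals."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # overlapping or adjacent: extend
    if cur_end is not None:
        covered += cur_end - cur_start
    return 100.0 * covered / chrom_length

def line_percentages(chroms):
    """Map chromosome name -> percent LINE, skipping very short chromosomes."""
    return {name: percent_covered(intervals, length)
            for name, (length, intervals) in chroms.items()
            if length >= MIN_CHROM_LENGTH}

def fold_enrichment(percentages, sex_chrom):
    """Sex-chromosome percentage divided by the mean autosomal percentage."""
    autosomes = [p for name, p in percentages.items() if name != sex_chrom]
    return percentages[sex_chrom] / (sum(autosomes) / len(autosomes))

# Hypothetical chromosomes: name -> (length in bp, LINE intervals).
chroms = {
    "chr1": (1_000_000, [(0, 30_000), (20_000, 40_000)]),  # overlapping pair
    "chr2": (500_000, [(100_000, 120_000)]),
    "chrZ": (800_000, [(0, 80_000)]),
    "chr32": (50_000, [(0, 1_000)]),  # below the length cutoff
}

percentages = line_percentages(chroms)
enrichment = fold_enrichment(percentages, "chrZ")
print(percentages, enrichment)
```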

Results

Machine learning classifiers such as support vector machines (SVMs) have previously been successfully applied to predict X inactivation status of mammalian genes using different repetitive element features around X chromosome genes (Wang et al. 2006). We asked whether the 329 repetitive features on the Z chromosome are predictive of the compensation status of Z genes using SVM. As described in Methods, the SVM was trained to associate repetitive features and compensation status of 310 compensated and uncompensated Z genes. When all 329 repetitive features were used together, SVM correctly predicted 62.4% of genes. The CpG island feature was the most predictive feature of compensation status and alone correctly predicted 63% of genes. LINEs and CpG islands together gave predictive power (61.6%) similar to that of CpG islands alone (Fig. 1). Using LINE elements alone decreased SVM’s predictive power to 52.6%.

Fig. 1

SVM results. SVM was trained to associate genes’ compensation status with features in a training set of genes and then tested to predict the compensation status of a leave-out sample of genes. The tables show the percentages of correctly and incorrectly predicted compensation status of genes based on the actual observed gene compensation status for different features. An overall percentage of correctly predicted compensated and uncompensated genes is included, calculated by averaging the correctly predicted compensated and uncompensated genes for the feature. The search width for these results was 2 kb
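The overall percentage described in this legend, an average of the two per-class percentages, can be illustrated with a brief Python sketch. The function name and the toy predictions are hypothetical.

```python
def per_class_and_overall(actual, predicted,
                          classes=("compensated", "uncompensated")):
    """Percent correct within each actual class, and their unweighted mean."""
    rates = {}
    for cls in classes:
        idx = [i for i, a in enumerate(actual) if a == cls]
        rates[cls] = 100.0 * sum(predicted[i] == cls for i in idx) / len(idx)
    overall = sum(rates.values()) / len(rates)
    return rates, overall

# Toy leave-out sample: 4 compensated and 6 uncompensated genes.
actual = ["compensated"] * 4 + ["uncompensated"] * 6
predicted = (["compensated"] * 3 + ["uncompensated"]      # 3 of 4 correct
             + ["uncompensated"] * 3 + ["compensated"] * 3)  # 3 of 6 correct

rates, overall = per_class_and_overall(actual, predicted)
print(rates, overall)
```

Averaging the per-class percentages weights both classes equally, so the result can differ from the plain fraction of correct predictions when the classes are of unequal size, as in this toy sample.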

It has previously been shown that the mammalian X chromosome is enriched 2-fold in LINE elements compared with chromosomes 6, 7, 20, 21, and 22 (Bailey et al. 2000). Similarly, our analysis of LINE element composition on the human X compared with all 22 autosomes confirmed an X-specific enrichment of 1.7-fold (32.3% for X vs. 18.2% for 22 autosomes) (Fig. 2). Chromosomal size and LINE element content were significantly correlated in the human genome (r = 0.645, p < 0.0007). The chicken Z chromosome contained 2.76-fold more LINE elements (9.77%) than the average autosome (3.55%) (Table 1). In addition, chromosome size was highly correlated with percent LINE element composition (r = 0.815, p < 0.0000001) (Fig. 2). The higher percentage of LINE elements on the Z chromosome, however, is not just a function of size: the Z vs. autosome difference holds even when the Z chromosome is compared with autosomes of comparable size (Fig. 2). In both the chicken and the human genomes, the Z and X chromosomes contained the highest LINE content of any chromosome (Fig. 2, Table 1).

Fig. 2

Percentage LINE element composition on chicken and human chromosomes. In both chicken (a) and human (b), the percentage LINE elements increases monotonically with chromosome length, but in each case the larger sex chromosome (X in human, Z in chicken) has a disproportionately high percentage of LINE elements

Table 1 Percentage of LINE elements per length of chromosome in the chicken, human, and platypus genomes. Unlike the chicken Z and human X chromosomes, both of which have a higher percentage LINE composition than autosomes, the platypus X chromosomes have a percentage of LINEs similar to that of autosomes

We found that in platypus, the four X chromosomes contain a percentage of LINE elements (18.73%) similar to that of autosomes (18.55%) (Table 1). Interestingly, the percentage of LINE elements on the platypus X chromosomes was between the LINE percentage compositions of chicken and human.

Given that CpG islands and LINE elements were the most predictive features for dosage compensation, we next evaluated how the density of CpG islands or LINE elements around Z genes related to their compensation status. Using Fisher’s exact test, we observed that in all four genomic windows the low-CpG-score group contained more compensated genes, and the high-CpG-score group contained more uncompensated genes. These differences were significant for the 2 kb (p < 0.00003), 10 kb (p < 0.0006), and 50 kb (p < 0.012) windows, but not in the 100 kb window (p = 0.12) (Fig. 3, Table 2). LINE elements considered as a combined single feature or as separate classes of elements (23 different classes) did not yield a significant association with gene compensation status in any of the considered windows after Bonferroni correction.

Fig. 3

CpG islands are differentially associated with compensated and uncompensated genes. a 2 kb (p = 3.19 × 10^−5), b 10 kb (p = 6 × 10^−4), c 50 kb (p = 0.012), and d 100 kb (p > 0.05) genomic windows upstream of gene start (Fisher’s exact test)

Table 2 CpG islands are differentially associated with compensated and uncompensated genes. Fisher’s exact test results for 2 kb, 10 kb, 50 kb, and 100 kb genomic windows upstream of gene start. In all four genomic windows, the low-CpG-score group contains a disproportionate number of compensated genes, and the high-CpG-score group contains a disproportionate number of uncompensated genes, reaching significance in the 2 kb, 10 kb, and 50 kb windows

We have previously shown that Zp is enriched in compensated genes in the MHM region (MHM valley, 2.5–3.5 × 10^7 bp) whereas the distal part of Zq (Zq peak, 5.5–7.5 × 10^7 bp) is enriched in non-compensated genes (Melamed and Arnold 2007, 2009). We next asked whether there were any regional differences in CpG islands or LINE concentration in the MHM valley or in the Zq peak. We computed the number of CpG islands relative to the size of the region in basepairs (CpG proportion) inside the region (Zq peak or MHM valley) and outside of the region. A CpG island was considered to be inside or outside of the region if its midpoint was inside or outside of the region. We found that the CpG proportion inside the Zq peak was 1.13 times higher than on the rest of the Z chromosome. The CpG proportion on the entire Z chromosome was 1.22 times higher than inside of the MHM valley (Table 3). Using a similar method for LINE elements, we did not find a difference in proportion of bases inside or outside of the MHM valley and the Zq peak compared to the rest of the Z chromosome (Table 3).
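The midpoint rule and the inside-versus-outside comparison can be sketched in Python as below. This is a toy illustration with invented coordinates and a hypothetical function name, and it follows the description of CpG proportion as the number of islands per basepair of region. Treating the region as half-open is an assumption.

```python
def inside_outside_ratio(islands, region, chrom_length):
    """Ratio of island density inside a region to density on the rest
    of the chromosome, assigning each island by its midpoint."""
    r_start, r_end = region
    inside = sum(1 for start, end in islands
                 if r_start <= (start + end) / 2 < r_end)
    outside = len(islands) - inside
    region_size = r_end - r_start
    inside_density = inside / region_size
    outside_density = outside / (chrom_length - region_size)
    return inside_density / outside_density

# Toy chromosome of 1,000 bp with a region of interest at 200-400 bp.
islands = [
    (10, 30), (100, 120),               # before the region
    (210, 230), (250, 270),             # midpoints inside the region
    (390, 420),                         # starts inside, midpoint (405) outside
    (500, 520), (700, 710), (900, 950), # after the region
]

ratio = inside_outside_ratio(islands, region=(200, 400), chrom_length=1_000)
print(ratio)
```

Assigning by midpoint means that an island straddling a boundary is counted exactly once, on whichever side holds its centre.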

Table 3 CpG island and LINE-CR1 element proportions on different regions of the Z chromosome. Number indicates the proportion of DNA sequence occupied by CpG islands and LINE-CR1 elements calculated for the entire Z chromosome or inside the Zq peak or MHM valley

Discussion

Here, we evaluated the contribution of primary DNA sequences to chicken Z gene dosage compensation. We find that CpG islands and LINE elements are the most predictive features of 329 repetitive Z chromosome features. CpG islands are significantly enriched upstream of uncompensated Z genes and within the Zq uncompensated region. LINE elements are enriched on both the chicken Z and human X compared with autosomes but not on the platypus X chromosomes. The position of LINE elements does not appear to be significantly associated with dosage compensation status of Z genes. Thus, these analyses implicate LINE elements as important features of the Z chromosome and suggest that CpG islands are not randomly distributed on the Z chromosome with regard to dosage compensation. We also find that LINE elements are not enriched on the platypus X chromosomes, suggesting that LINE accumulation is not a common feature of all sex chromosomes.

In mammals, LINE elements have been estimated to occupy as much as 40% of the genome (Waters et al. 2007). Unlike the pattern in mammals, repetitive elements make up just 10% of the bird genome, with most elements belonging to the CR1 family of LINEs (Hughes and Piontkivska 2005). In addition, unlike mammalian LINEs, which belong to a single lineage (Smit et al. 1995; Furano 2000), chicken CR1 elements come from several distinct divergent lineages (International Chicken Genome Sequencing Consortium 2004). Mammalian LINEs are mostly found in AT-rich genomic regions (Soriano et al. 1983; Lander et al. 2001; Pavlicek et al. 2001; Hackenberg et al. 2005) while chicken CR1s are located in both AT- and GC-rich regions of the genome (Abrusan et al. 2008). It is unclear at present whether any of the chicken CR1 elements are active, since most are truncated at the 5′ end (Silva and Burch 1989; Stumph et al. 1981; International Chicken Genome Sequencing Consortium 2004; Wicker et al. 2005) although at least one CR1 appears to have an open reading frame on chromosome 6 (International Chicken Genome Sequencing Consortium 2004).

There are several explanations for the overall lower percentage of repetitive elements in the bird genome compared with mammals. One reason may be that chicken CR1 is more specific than the mammalian CR1, resulting in fewer pseudogenes and a smaller genome in chickens (International Chicken Genome Sequencing Consortium 2004; Shedlock et al. 2007). Another explanation may be that LINEs are preferentially removed from the chicken genome via ectopic exchange between repeats with subsequent accumulation in AT- and GC-rich regions (Wichman et al. 1992; Abrusan and Krambeck 2006; Abrusan et al. 2008). Still another speculation is that the lower number of repetitive elements in birds’ genomes is related to the need for smaller cell size for the requirements of flight and efficient gas exchange (Szarski 1983; Wachtel and Tiersch 1994).

Other factors, however, are probably relevant to the observation that the Z chromosome is enriched in LINE elements compared with autosomes in a manner similar to the enrichment on X chromosomes of mammals. One question is whether LINE elements are important in dosage compensation on sex chromosomes. In mammals, LINEs are considered to be ‘way stations’ for the propagation of the dosage compensation signal (Lyon 1998, 2000). However, the association of LINEs with inactivated vs. escapee genes is not strong. Only some studies have reported an association of LINE-L1 elements with inactivated X genes in mammals, whereas other studies did not find an association (Bailey et al. 2000; Wang et al. 2006). In our study, we did not observe a strong relationship between CR1 elements and gene compensation status, suggesting that in birds LINE elements on the Z chromosome may not be active players in a Z-chromosome-specific process of dosage compensation.

The chicken Z chromosome accumulation of LINEs may have more to do with the low level of recombination of the sex chromosomes. The low meiotic recombination rate results in higher accumulation of repetitive elements on the mammalian X and Y chromosomes (Wichman et al. 1992; Jensen-Seaman et al. 2004). When the full chicken W chromosome sequence becomes available, it will be interesting to investigate in future studies whether the W also has a high LINE content given its low recombination. However, the platypus X chromosomes do not have an accumulation of LINEs compared to autosomes in our study. The opossum sex chromosomes are also not enriched in LINEs (Mikkelsen et al. 2007). Therefore, LINE accumulation may not be strictly limited to sex chromosomes. In mammals, chromosomal size is thought to have a positive correlation with repetitive element enrichment (Wichman et al. 1992; Jensen-Seaman et al. 2004), likely explaining higher LINE content on chicken macrochromosomes in our study.

Interestingly, our results show that the distribution of CpG islands on the chicken Z chromosome is not random, and CpG islands tend to be associated with uncompensated genes. This result is similar to the enrichment of CpG islands around highly expressed genes in mammals. For example, in mammals CpG islands are commonly associated with housekeeping genes, which have generally higher expression than other genes in various tissues (Cross et al. 2000; Hurst et al. 2004). In addition to being highly expressed in males, uncompensated Z genes are also enriched in catalytic activity and other housekeeping functions in birds (Melamed and Arnold 2007). In mammals, methylation of CpG islands contributes to the maintenance of gene silencing on the inactive X chromosome (Cross et al. 2000). X escapees have been found to have fewer CpG islands than inactivated X genes in some studies in mammals, suggesting that lack of DNA methylation is related to the escape from inactivation (Bailey et al. 2000). In future studies in birds, it will be important to assess Z-chromosome-wide methylation and association with dosage compensation.

Our analysis using support vector machine algorithms showed that the CpG island feature is the most predictive of dosage compensation compared to all other types of repetitive elements on the Z chromosome. CpG islands alone could be used to correctly predict compensation status of 63% of genes compared to the random chance of 50%. The percentage prediction rate is lower than in mammalian studies, where 12 features can be used to correctly predict compensation status of about 80% of genes (Wang et al. 2006). It is possible that other types of sequences that were not considered in this study may help improve the predictive power over our current result. Some candidates may be present within the MHM region, which contains a concentration of compensated genes (Melamed and Arnold 2007). Alternatively, the lack of an effective chromosome-wide mechanism of dosage compensation suggests that the compensated Z genes have each evolved gene-specific adaptations leading to lower M:F ratios of expression. Such gene-specific adaptations may be heterogeneous, and may not involve common DNA elements near compensated genes.

In conclusion, primary DNA sequence information can be used to assess dosage compensation status of genes on the Z chromosome. CpG islands are not randomly distributed on the Z chromosome and may influence Z genes’ lack of dosage compensation. CpG islands are also the most predictive feature of Z gene compensation status of all repetitive elements on the Z chromosome, but this feature leads to only a modest level of prediction. LINE elements are important features of the chicken Z and mammalian X but not of the platypus X chromosome, suggesting that LINE element accumulation is not a feature of all sex chromosomes. LINEs do not appear to play a role in dosage compensation in chicken, in contrast to their proposed role in dosage compensation signal propagation in mammals.