A k-mer grammar analysis to uncover maize regulatory architecture
Only a small percentage of the genome sequence is involved in regulation of gene expression, but to biochemically identify this portion is expensive and laborious. In species like maize, with diverse intergenic regions and lots of repetitive elements, this is an especially challenging problem that limits the use of the data from one line to the other. While regulatory regions are rare, they do have characteristic chromatin contexts and sequence organization (the grammar) with which they can be identified.
We developed a computational framework to exploit this sequence arrangement. The models learn to classify regulatory regions based on sequence features - k-mers. To do this, we borrowed two approaches from the field of natural language processing: (1) “bag-of-words” which is commonly used for differentially weighting key words in tasks like sentiment analyses, and (2) a vector-space model using word2vec (vector-k-mers), that captures semantic and linguistic relationships between words. We built “bag-of-k-mers” and “vector-k-mers” models that distinguish between regulatory and non-regulatory regions with an average accuracy above 90%. Our “bag-of-k-mers” achieved higher overall accuracy, while the “vector-k-mers” models were more useful in highlighting key groups of sequences within the regulatory regions.
These models now provide powerful tools to annotate regulatory regions in other maize lines beyond the reference, at low cost and with high accuracy.
Keywords: Gene regulatory regions; Machine learning models; Crops genomics
auPRC: Area under the precision recall curve
auROC: Area under the receiver operating characteristic curve
CNN: Convolutional Neural Networks
NLP: Natural Language Processing
PWM: Position Weight Matrix
RNN: Recurrent Neural Network
TFBS: Transcription factor binding site
TF*IDF: Term frequency * inverse document frequency
The majority of sequence polymorphisms that are statistically associated with phenotypic variation (GWAS) lie in the non-genic portion of the genome, where they might play regulatory roles [1, 2]. Recent biochemical characterization of the open chromatin space in B73 (the maize reference line) revealed that as much as 40% of the significant sequence polymorphisms, as identified through variance components analyses, overlap with regions in which regulatory elements are expected. These biochemical assays are prohibitively expensive and time consuming at the scale of breeding programs for any crop species. This is even more true for species such as maize, with high genomic diversity and a high rate of polymorphism. In maize, as in other crops, less than half of the genome sequence is expected to be shared between inbred lines. Building accurate models from expensive data derived from reference line(s) will enable breeders to project that information to other genotypes for use in genomic selection models and to prioritize regions of the genome to edit using strategies such as CRISPR technology [5, 6].
The most common approach to annotating a non-coding sequence with a regulatory role is to use collections of transcription factor binding sites (TFBSs), or “motifs”, usually in the form of Position Weight Matrices (PWMs). Collections of PWMs are usually derived from large scale experiments (in vivo or in vitro) capable of biochemically characterizing the interactions between proteins and DNA. In plants, large collections of PWMs describing TF:DNA interactions are available only for Arabidopsis (Franco-Zorrilla JM et al. and O’Malley RC et al. [7, 8]). For plant regulatory regions, a number of convenient tools that identify “motifs” from sets of sequences, or that identify candidate regulatory regions based on the presence of PWMs, are routinely used in molecular biology, relying on Arabidopsis annotations across species [9, 10]. A shortcoming is that “motifs” are elusive: it is common to have experimental data on TF:DNA interactions from which a PWM cannot be obtained. When available, PWMs are limited in their application to identify candidate regulatory regions, frequently achieving poor recognition performance [12, 13].
Most of the experimental and computational approaches used to annotate functional non-coding regions focus on the regulatory role of TFBSs [14, 15]. However, it has been observed that patterns of sequence organization (the grammar) and the chromatin context in which TFBSs are located contribute to the regulatory message [16, 17, 18]. For instance, the spatial arrangement of poly(dA:dT) tracts within yeast promoter regions has been identified as a causal driver of transcriptional patterns at levels comparable to TFBSs. More recently, it was shown that developmental enhancers in Ciona rely on the positioning, arrangement, and spacing of TFBSs to counterbalance low TFBS affinity. From this emerging view, it appears that regulatory regions have distinctive features, namely enriched key sequences and sequence organization, that can be exploited for prediction.
The frequencies of oligomers of length k (i.e., short k-mers in the size range of TFBSs) have been exploited to build supervised models capable of discriminating regulatory regions from random genomic regions, as well as to score sequence variation with few or no assumptions regarding the role that a given k-mer might play [21, 22, 23]. The early k-mer count-based classifiers have been improved to count gapped k-mers, allowing exploration of short and long k values without losing power as the total number of k-mers increases. Some limitations of k-mer frequency-based methods include: (1) they make poor or no use of the positional relationships among k-mers in their models, and (2) they perform poorly in the presence of repetitive regions, where the frequencies of short k-mers are misleading, which might hamper the performance of these methods for genomes with high repeat content.
Recently, however, a growing set of computational tools using Neural Networks (NNs) has shown success in learning to recognize simple sequence patterns, similar to PWMs. These approaches have been able to further integrate those patterns into more complex features to discriminate regulatory regions [25, 26, 27]. Generally, the NNs implemented for genomic data are Convolutional Neural Networks (CNNs), a type of architecture that shows state-of-the-art performance for key phrase recognition tasks in Natural Language Processing (NLP), rather than Recurrent Neural Networks (RNNs), which are preferred for comprehension of whole sentence semantics given their power in modeling long-span relations [28, 29]. Despite their power, CNNs are often implemented in a black-box context and interpretation of their output is challenging; thus it remains unclear how much of their performance is derived from recognizing key motifs, motif relationships, and the general sequence context. For these reasons we chose to implement k-mer approaches rather than CNNs or RNNs.
To define sequence arrangements with putative regulatory roles, we analyzed the architecture of regulatory regions at the k-mer level, focusing on weighted individual frequencies and co-occurrences, while considering a genome environment with high repeat content. The core of the analysis builds on machine learning approaches commonly applied in the natural language processing (NLP) community. These methods are easily interpretable and rely on word statistics to recover semantic and syntactic cues [30, 31, 32, 33]. We evaluated the accuracy and precision of these approaches with a diverse set of functional genomics experiments to provide a comprehensive description of the regulatory landscape of the maize genome. The software implementation, which allows users to select control regions and to train and test models, is open source and available in a public Bitbucket repository.
Weighted frequencies and co-occurrences of short sequences can accurately discriminate regulatory from random genomic regions
To build accurate classifiers we collected a comprehensive set of regions enriched in regulatory function (hereafter, ’regulatory regions’), as identified in B73 (maize reference genome) through different biochemical assays. We included open chromatin regions identified by MNA-seq in two tissues, binding loci from ChIP-seq peaks of two TFs (i.e., the Homeobox KNOTTED 1, KN1, and the bZIP FASCIATED EAR4, FEA4) [34, 35], and core promoter regions around TSSs [36, 37, 38] (Additional file 1: Table S1). Because the specific background signals from each individual experiment are not available, regulatory regions were paired with randomly chosen regions, controlling for G+C content and genomic distribution. Each group of sequences (regulatory regions and their controls) was separated into training and holdout sets for model evaluation. In total we analyzed 52,292,705 base pairs of regulatory regions, corresponding to ∼2.5% of the effective genome size of the B73 genome.
We chose to compare our models against a “motif” collection approach. For this we used the MEME-ChIP pipeline. In brief, MEME-ChIP combines several of the most popular algorithms of the MEME suite to generate PWMs (de novo) in a discriminative mode using the sequences in the training set. MEME-ChIP also scans sequences against a motif database from Arabidopsis. The goal of this analysis was to obtain PWMs capable of differentiating between regulatory regions and controls, as a contrast to our models. We obtained five collections of PWMs, one for each type of regulatory region, and used them to scan the corresponding holdout sets.
To increase the stringency of our evaluation criteria, we measured each model’s performance with unbalanced holdout sets in which regulatory regions are outnumbered by random regions by 1 to 10 (Fig. 2c-d and Additional file 3: Figures S1C-D). Scaling up the number of random regions did not appreciably change accuracy and auROC values, but the auPRC showed a drop in model performance as the rate of false positives increased. At k=8, both models have a desirable precision, ∼70-80%, recovering ∼60% of the relevant regions (i.e., recall rate) for open chromatin and core promoter datasets. The “bag-of-k-mers” model works better for prediction of TF binding loci than the “vector-k-mers” model, with the latter displaying an excess of false positives at our target recall rate (Additional file 3: Figures S2). Under this more stringent test, the PWM collections under-performed against all the other models at any given k, as a consequence of an increase in the number of false positives. The performance measured on an unbalanced set suggests that applying extra stringency to the predicted probability, such that ∼60% of the relevant sequences are recovered, would result in an acceptable trade-off between sensitivity and specificity for most of the models when non-regulatory regions are in large numbers.
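The effect of class imbalance on precision can be illustrated with a minimal Python sketch. The predicted probabilities and the cut-off below are made-up toy values, not results from this study, and the function name `precision_recall` is hypothetical. Repeating the same control scores ten times mimics a 1 to 10 holdout set: recall is unchanged, while precision drops because false positives scale with the number of control regions.

```python
def precision_recall(pos_scores, neg_scores, cutoff):
    """Precision and recall when regions scoring >= cutoff are called regulatory."""
    tp = sum(s >= cutoff for s in pos_scores)   # regulatory regions recovered
    fp = sum(s >= cutoff for s in neg_scores)   # control regions wrongly called
    called = tp + fp
    precision = tp / called if called else 0.0
    recall = tp / len(pos_scores) if pos_scores else 0.0
    return precision, recall


# Toy predicted probabilities (illustrative only).
regulatory = [0.95, 0.90, 0.80, 0.60, 0.40]
control = [0.85, 0.30, 0.20, 0.10, 0.05]

balanced = precision_recall(regulatory, control, cutoff=0.75)
# Same classifier behaviour, but ten control regions per regulatory region.
unbalanced = precision_recall(regulatory, control * 10, cutoff=0.75)
print(balanced, unbalanced)
```

Because the false positive rate is identical in both sets, threshold-free measures based on it (such as the auROC) stay stable, whereas precision-based measures (such as the auPRC) are sensitive to the ratio of classes.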
Models to predict regulatory regions are scalable to the genome-wide space
Under the assumption that annotation of non-coding regions would be part of general pipelines, in which ∼85% of the genome should be recognized as repeats and ∼5% as coding sequences, our models for annotating regulatory regions should be limited to ∼10% of the space. Still, it is a challenge to accurately predict a regulatory region, using a model that was trained on artificially balanced data, in a context that might harbor similar sequence composition while surrounded by repetitive elements. To gain insights into the behavior of the models at a genome-wide scale, the sequence of chromosome 10 was partitioned into 1,943,698 regions (300 base pairs in length), and 115,149 regions that were neither repeats nor coding sequences were selected to be annotated. We used models derived from MNA-seq shoot data, applying different levels of stringency for the predicted probabilities (Additional file 4: Table S3). According to the results obtained with the unbalanced holdout set, and in order to balance sensitivity and specificity, we determined that the ideal predicted probability cut-off was the one that captures ∼60% of the regions that overlap with the annotated regulatory regions. Under this criterion the “bag-of-k-mers” (k=8, filtered, probability ≥0.85) and the “vector-k-mers” models (probability ≥0.95) predicted 38,945 and 41,932 regulatory regions, respectively. The high confidence regions classified as regulatory correspond to ∼2.2–2.3% of the total regions from chromosome 10, in line with the expected portion of the genome with a regulatory function.
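As an illustration of how a probability cut-off can be tied to a target recall, the following Python sketch returns the highest cut-off that still recovers a chosen fraction of known regulatory regions. The function name `cutoff_for_recall`, the default of 0.6, and the toy scores are assumptions made for this example; the procedure actually used to select the cut-offs reported above may differ.

```python
import math


def cutoff_for_recall(positive_scores, target_recall=0.6):
    """Highest probability cut-off that still recovers `target_recall` of the
    regions known to be regulatory (scores are predicted probabilities)."""
    if not positive_scores:
        raise ValueError("need at least one score")
    ranked = sorted(positive_scores, reverse=True)
    # Number of known regions that must score at or above the cut-off;
    # the small epsilon guards against floating point round-up.
    needed = max(1, math.ceil(target_recall * len(ranked) - 1e-9))
    return ranked[needed - 1]


# Toy predicted probabilities for ten annotated regulatory regions.
scores = [0.99, 0.97, 0.90, 0.86, 0.85, 0.70, 0.50, 0.40, 0.30, 0.10]
cutoff = cutoff_for_recall(scores, target_recall=0.6)
recovered = sum(s >= cutoff for s in scores) / len(scores)
print(cutoff, recovered)
```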
Next we aimed to annotate the genome of ZmW22, a maize inbred line whose genome was recently made public. To do so, we chose to annotate the ZmW22 genome using the MNA-seq shoot models, as open chromatin regions are usually a collection of all the regulatory regions in the genome, including promoters and TFBSs. To get a set of “ground truths” to evaluate our results, we aligned ZmB73 MNA-seq regions to the ZmW22 genome and scored windows around the alignment hits with our models. This test allows us to determine how frequently the models were able to recognize a “candidate regulatory region” in its local context, without masking the genome. This analysis evaluated regulatory vs non-regulatory regions at a ratio of 3:20, more than twice that of the analysis previously presented for the unbalanced holdout set.
Based on the observations made for chromosome 10 of ZmB73, we first used the “bag-of-k-mers” model (filtered, probability ≥0.85) to obtain the “candidate regulatory regions”. On top of that, we used the “vector-k-mers” model to obtain similarity distances between the candidate regulatory regions and the ZmB73 MNA-seq regions, summarizing each region by its vector centroid distance. The combined top prediction around each of the “ground truths” intersected the alignment hit in ∼70% of the cases. Allowing up to three top predictions around each hit increases this to ∼77% of the cases.
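A centroid-based comparison of this kind can be sketched in a few lines of Python. The sketch assumes that each region is summarized by averaging the vectors of its k-mers and that regions are compared by cosine distance; both choices, as well as the function names and the toy two-dimensional vectors, are assumptions for illustration and not a description of the exact procedure used here.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length k-mer vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("zero-length vector")
    return 1.0 - dot / norm


# Toy k-mer vectors for a candidate region and for a reference region.
candidate_kmers = [[1.0, 0.0], [0.0, 1.0]]
reference_kmers = [[2.0, 2.0]]
distance = cosine_distance(centroid(candidate_kmers), centroid(reference_kmers))
print(distance)
```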
Models trained in maize can be used to inspect the regulatory space in related species
For the evaluation of models trained on core promoters we used a balanced holdout set derived from a random sample of sorghum annotated gene models. The positional preferences in core promoters in maize are evident from average k-mer weights around the +30 region, in which a TATA-box is expected (Fig. 4c). The same is not observed in sorghum (Fig. 4d). This likely results from the biased sample of TSSs in maize, which has a high proportion of TATA+ promoters even though TATA-less promoters are the majority. A positional analysis using the “vector-k-mers” models did not reveal local enrichment along the sorghum promoter sequences. Yet, the probability scores are again different between control sequences and core promoter sequences. The difficulties of the model in identifying control regions might be a consequence of the strong differences in the repeat landscape of the non-coding regions between sorghum and maize, which are not captured in the maize training set, rather than a lack of similarities between the regulatory regions of the two species. Taken together, we have shown that classifiers trained in maize can be useful to predict regulatory regions in sorghum and rice, and that features enriched in maize regulatory regions and in the random genomic space (as captured by the models) are of two general types: (1) maize specific and (2) conserved across related species.
Scored vocabularies highlight signatures of regulatory function
The methods proposed here were chosen because of the interpretability of the learned features, aiming to better understand the patterns in sequence that characterize regulatory regions. Thus, we focused on scored k-mer vocabularies (k=8, filtered) as the easiest to interpret, and systematically analyzed the tails of the distribution, as they concentrate the most informative sequences. The largest positive coefficient values (top scored k-mers) are indicative of enrichment in regulatory regions, and the largest negative values (bottom scored k-mers) of depletion. The absolute values from the two sides of the score distribution are different, with a preference for positive over negative ones, meaning that the models’ predictions are the result of identifying k-mers that are enriched in regulatory regions rather than depleted ones (or enriched in random regions). We found that the properties of the scored k-mers obtained by applying an out-of-the-box NLP technique are similar to those previously described with sequence kernels developed to analyze vertebrate genomic data [22, 23, 24].
The enrichment of MNA-seq regions for k-mers with high A+T content (rich A+T k-mers) might be derived from signal co-localization between open chromatin regions and core promoters. If signal co-localization were sufficient to explain the similarities between open chromatin and core promoter regions, then controlling for distance to annotated genes should remove the signal from rich A+T k-mers in distal regions. Yet, after controlling for proximity to genes (2 kb), the positional constraints remain in both proximal and distal regions (Fig. 5e-f). These rich A+T k-mers might be part of poly(dA:dT) tracts, which can provide an increase in DNA rigidity and are known to be in proximity to regions that are enriched in TFBSs. In agreement with the positional restriction, rich A+T k-mers flank the midpoints, where G+C content is high, as expected for regions that are bound by TFs, and where the signal for open chromatin regions is concentrated.
In addition to key structural tracts, k-mers with the largest positive values for each regulatory category are expected to be enriched for TF motifs. Because the number of experimentally verified maize motifs is limited, we contrasted the top 1% of positive scored k-mers against two large collections of TF motifs as identified from large scale experiments in the reference plant Arabidopsis thaliana (TOMTOM, p-value <0.001) [7, 8] (Additional file 5: Table S4). For the evaluated experiments we found that the top 1% of positive k-mers are ∼threefold more enriched for significant hits against the motif database than expected by chance for all the k-mers in the population. The enrichment for the top k-mers was statistically significant (hyper-geometric test, p-value <0.001). Further analyses revealed that k-mer scoring is consistent within families of TF binding sites. In particular, motifs preferentially hit by the top 1% of positive k-mers from FEA4 binding loci (a bZIP transcription factor) correspond to the bZIP/TGA-class, and motifs preferentially hit by k-mers enriched in KN1 (a Homeobox transcription factor) correspond to the Homeobox family (Additional file 5: Table S4). Thus, the scored vocabularies produced a comprehensive catalog of k-mers with putative structural roles and a collection of k-mers similar to TFBSs that constitute signatures of the maize regulatory architecture.
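The logic of a hypergeometric enrichment test of this kind can be sketched with the Python standard library. The counts in the example are invented for illustration (they are not the counts behind the reported p-values), and the function name `hypergeom_sf` and its argument names are hypothetical.

```python
from math import comb


def hypergeom_sf(hits_in_top, population, hits_in_population, top_size):
    """P(X >= hits_in_top) when `top_size` k-mers are drawn without replacement
    from `population` k-mers, of which `hits_in_population` match a motif."""
    upper = min(hits_in_population, top_size)
    total = comb(population, top_size)
    tail = sum(
        comb(hits_in_population, i)
        * comb(population - hits_in_population, top_size - i)
        for i in range(max(hits_in_top, 0), upper + 1)
    )
    return tail / total


# Toy example: 1,000 scored k-mers, 50 of which match a known motif;
# the top 10 k-mers contain 3 motif matches.
p_value = hypergeom_sf(hits_in_top=3, population=1000,
                       hits_in_population=50, top_size=10)
print(p_value)
```

A small p-value indicates that the top-scored k-mers contain more motif matches than expected if they were drawn at random from the full vocabulary.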
Sequence similarity in the geometric space reveals a prevalent distinctive k-mer organization within regulatory regions
To illustrate, we compared the representative vector of the 7-mer CTATATA in Vregulatory (i.e., set of vk-mers learned from core promoter regions) and in Vrandom (i.e., set of vk-mers learned from random regions used as controls for core promoters). Using vCTATATA we obtained the set of top five closest vk-mers in Vregulatory and in Vrandom and found that k-mers from Vregulatory share more sequence similarity (average edit distance 1.8 vs 4.2 respectively) and have, on average, more positive scores from the respective “bag-of-k-mers” model (1.49 vs 0.01) (Fig. 6b). In addition, k-mers close to vCTATATA in Vregulatory share positional constraints that are not recovered from those related in Vrandom (Fig. 6c-d). This example shows how the output of the geometric spaces can be exploited to determine groups of similar k-mers according to their context.
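The average edit distance between a query k-mer and its nearest neighbours can be computed as in the following Python sketch. It uses the standard Levenshtein distance (insertions, deletions and substitutions each cost 1), which is an assumption about the edit distance meant above; the neighbour k-mers in the example are invented and are not the neighbours reported in Fig. 6b.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def mean_edit_distance(query, neighbours):
    """Average edit distance from a query k-mer to a list of neighbours."""
    return sum(edit_distance(query, n) for n in neighbours) / len(neighbours)


# Hypothetical neighbours of CTATATA in a vector space.
print(mean_edit_distance("CTATATA", ["CTATAAA", "TTATATA"]))
```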
To obtain a global view of how many k-mers are embedded in different local sequences between regulatory and random regions, we collected for any given k-mer (k=8) in the vocabulary, the list of the closest similar k-mers ranked by cosine similarity from Vregulatory and Vrandom. Next, we contrasted the two ranked lists and determined which k-mers show the greatest dissimilarity between regulatory and random regions . In general, we found that low complexity k-mers do not show distinctive organizational ’rules’ between regulatory regions and random, reinforcing our view that short repetitive sequences are not important to define the identity of a sequence. We found that, in terms of the number of k-mers with different relationships between Vregulatory and Vrandom, “vector-k-mers” models derived from TF binding loci (∼45%) and core promoter regions (∼30%) result in notably more differentially represented k-mers than models derived from open chromatin regions (∼5%). In all the cases, we observed a similar proportion of k-mers enriched and depleted in regulatory regions (as established from the “bag-of-k-mers” scores). The results from models trained in open chromatin regions, might represent the heterogeneity of the regions that prevents the model from learning many specific k-mer vectors. However, the fact that the classifiers work with great accuracy indicates that even when the differences are less pronounced than for TF binding loci and core promoter regions, they are large enough to distinguish between an open chromatin region and its control.
We integrated the information obtained from the “bag-of-k-mers” and the “vector-k-mers” models and found that for the top 1% of the k-mers that are enriched in frequency in regulatory regions there is little overlap between k-mers that resemble motifs and k-mers that show differential relationships between regulatory regions and random regions. For instance, from the FEA4 models, only 10 out of 103 k-mers, that are statistically similar to Arabidopsis motifs, show differential k-mer relationships between regulatory and random regions. Such difference might be derived from the proportion of TFBSs that are not similar between Maize and Arabidopsis cis-regulatory elements. In summary, we have compiled a regulatory vocabulary that includes a proportion of key k-mers that are enriched in regulatory regions and (1) resemble known motifs, and (2) are embedded in a specific regulatory context.
The decreased cost of large scale genotyping and genome assemblies for crops such as maize and related species, has already shown potential to accelerate the breeding process by linking sequence and structural variation to phenotype . A majority of functional genetic variation that is important to phenotype is located in the non-coding regions of the genome. This variation is largely untapped because recognizing functional alleles in the non-coding regions of the genome is both expensive and laborious. In humans and other metazoan models, non-coding annotation that allows identification of functional genetic variation has been accelerated over the last decade using two types of analyses: (1) functional analysis from large collections of biochemical assays; and (2) comparative sequence analysis between reference genomes of closely related species . Yet, in maize, these two types of analyses are particularly challenging. Large collections of biochemical assays remain prohibitive at the scale necessary to cover maize diversity, which is 20 times more than the diversity found in humans . In addition, comparative sequence analysis requires genome alignment between closely related species, which for maize and its relatives is complicated by the presence of a large number of repetitive sequences in the genome.
In this study, we introduce a computational framework consisting of two types of machine learning models that can accurately discriminate regulatory regions obtained from functional genomic experiments from random genomic regions. These approaches were borrowed from the fields of natural language processing and information retrieval, and were explicitly chosen to overcome the challenges of annotating intergenic regions in maize. To address highly repetitive sequences and the role of low-complexity regions in maize non-coding regions, the “bag-of-k-mers” model first filters out low-complexity k-mers and then uses a sub-linear function to transform raw k-mer frequencies, down-weighting k-mers that are too frequently observed in a group of sequences and that consequently have less power to discriminate between regulatory and non-regulatory regions. In parallel, the “vector-k-mers” model learns local k-mer organization from k-mer co-occurrence frequencies, which in practice results in a geometric space that allows alignment-free comparisons between sequences. The simultaneous use of two different approaches adds robustness to the predicted annotations, allowing researchers to contrast or to combine the results of the two types of models.
In most functional genomics experiments the expectation is to identify rare instances of a biochemical event (e.g., the locations in the genome in which the chromatin is accessible for enzymatic digestion) versus thousands of instances that represent noise. Learning from imbalanced data occurs frequently in many machine learning applications. However, in machine learning rare instances (in our case regulatory regions) tend to be treated as noise. So, training with the true genomic ratio of regulatory:non-regulatory regions will cause the models to learn non-regulatory features over regulatory ones. In the maize genome, non-regulatory features will be the ones that characterize the most abundant class of repeats. On the other hand, training on re-sampled data (balancing the ratio of regulatory:non-regulatory regions) generates models that expect a distribution of instances that strongly differs from the genomic distribution of events. We decided to pose the problem in a way that the models could learn features from regulatory regions. Next we used a series of evaluations with “real-world” constraints to adjust the probability cut-offs at which the models’ predictions are still reliable while taking care of the excess of false positives. We show that the adjustment of the probabilities a posteriori and the combined use of the two models allow us to “transfer” annotations from ZmB73 to ZmW22 with reasonable precision.
Because both models are amenable to interpretation, examination of the learned features offers novel insights about key sequence characteristics that can help to build mechanistic hypotheses to be tested at the molecular level, and allows comparison of regulatory programs under the same framework. For instance, both types of models suggest that low complexity k-mers are not important for regulatory regions in maize. The comparative use of the models shows that TFBSs (i.e., FEA4 and KN1) are better predicted with the bag-of-k-mers. Also, through modeling MNA-seq data we found that open chromatin regions in maize are characteristically organized within poly(dA:dT) tracts flanking G+C rich k-mers resembling motifs (Fig. 5a-b). Likewise, from modeling maize KN1 ChIP-seq data and further annotation of regions bound by OSH1, we determined conservation at the center of binding loci for the key individual k-mers (Fig. 4b) and poor conservation in the pattern of k-mer co-occurrences (Additional file 3: Figure S6A). These results suggest that, though the non-coding regions change rapidly across species, the use of sequence models allows alignment-free comparisons to determine regulatory features that are conserved across millions of years of evolution.
Taken together, our framework can be used beyond the transfer of regional annotations, as it can easily be extended to evaluate in silico the putative effect of sequence variation (i.e., SNPs, single nucleotide polymorphisms) on regulatory function from the differences in k-mer scores and regulatory probabilities for small groups of k-mers.
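One way such an in silico evaluation could look is sketched below in Python. This is an illustrative extension and not a procedure evaluated in this study: the effect of a SNP is approximated as the change in the summed scores of the k-mers that overlap the variant position. The function names, the toy sequence, the toy score table, and the choice of the lexicographically smaller strand as the token for a k-mer and its reverse complement are all assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Token shared by a k-mer and its reverse complement (assumed here to be
    the lexicographically smaller of the two)."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def overlapping_kmers(seq, pos, k):
    """All k-mers of `seq` that include the 0-based position `pos`."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[i:i + k] for i in range(first, last + 1)]


def snp_score_delta(ref_seq, pos, alt_base, scores, k):
    """Summed k-mer score of the alternative allele minus the reference allele."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]

    def total(seq):
        return sum(scores.get(canonical(km), 0.0)
                   for km in overlapping_kmers(seq, pos, k))

    return total(alt_seq) - total(ref_seq)


# Toy score table: only one 4-mer carries a (made-up) positive weight.
toy_scores = {"TATA": 1.5}
delta = snp_score_delta("GGTACAGG", pos=4, alt_base="T", scores=toy_scores, k=4)
print(delta)  # a positive value means the alternative allele gains scored k-mers
```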
This work opens many avenues for improving models by adding relevant layers of information. Possible layers to add include: predictions of the 3D structure of regulatory regions, joint modeling of functional genomic data spanning the range of maize diversity to identify general patterns for relevant phenotypes, or even extended across species to build more generalizable models that capture the most conserved features. Furthermore, we expect these annotations to be useful as priors to improve marker assisted technologies such as genomic selection to purge deleterious non-coding sequence variation and to identify targets for genome editing contributing to gene expression dysregulation.
Definition of maize regulatory regions
In the analyses presented throughout this study, we used data sets derived from different functional genomic experiments and obtained from the reference genome (ZmB73 AGPv3, chromosomes 1 to 10). We included in the analysis open chromatin regions in shoot and roots derived from MNA-seq data; binding loci for the KNOTTED 1 (KN1) and FASCIATED EAR 4 (FEA4) transcription factors from ChIP-seq data [34, 35]; and promoter regions [36, 37, 38] from the intersection of TSSs obtained with CAGE and FLcDNAs (Additional file 1: Table S1). For MNA-seq hotspots and ChIP-seq peaks, we collected sequences of 300 base pairs in length symmetrically surrounding the midpoints of the originally defined regions. Similarly, for core promoters, we selected the region between -250;+50 base pairs surrounding the TSSs. Each group of regulatory regions was randomly divided into training and holdout sets and reserved for further analyses. Training and testing were performed independently for each type of regulatory region.
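The two window definitions can be written down as a small Python sketch. The coordinate convention (0-based, half-open intervals), the strand handling for promoters, and the function names are assumptions made for this example and may differ from the released implementation.

```python
def midpoint_window(start, end, width=300):
    """Window of `width` bp centred on the midpoint of an assayed region
    (0-based, half-open coordinates)."""
    mid = (start + end) // 2
    left = max(0, mid - width // 2)
    return left, left + width


def core_promoter_window(tss, strand, upstream=250, downstream=50):
    """-250;+50 window around a TSS, oriented by strand (assumed convention)."""
    if strand == "+":
        return max(0, tss - upstream), tss + downstream
    if strand == "-":
        return max(0, tss - downstream), tss + upstream
    raise ValueError("strand must be '+' or '-'")


print(midpoint_window(1000, 1500))       # region spanning 1000-1500
print(core_promoter_window(1000, "+"))   # TSS on the forward strand
print(core_promoter_window(1000, "-"))   # TSS on the reverse strand
```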
To randomly select control regions, we searched in the vicinity (at most within a 100 kb window) of a given regulatory region for a control region that has a matching G+C content and does not overlap with any other regulatory region; if no match was found, we removed the vicinity criterion and searched for a G+C matching region in the same chromosome. For the holdout sets we built balanced and unbalanced sets by randomly selecting one and ten control regions, respectively, for each regulatory one.
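The control selection strategy described above can be sketched in Python as follows. The tolerance used to call two G+C contents "matching", the tiling of candidates into non-overlapping windows, the interpretation of the 100 kb window as centred on the regulatory region, and all function and parameter names are assumptions for illustration; the released software may implement these steps differently.

```python
import random


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def overlaps(start, end, intervals):
    """True if [start, end) overlaps any (start, end) interval in `intervals`."""
    return any(start < e and s < end for s, e in intervals)


def matching_windows(chrom_seq, lo, hi, target_gc, regulatory, width, tol):
    """Non-regulatory windows in [lo, hi) whose G+C content is within `tol`."""
    hits = []
    for s in range(lo, hi - width + 1, width):
        if overlaps(s, s + width, regulatory):
            continue
        if abs(gc_content(chrom_seq[s:s + width]) - target_gc) <= tol:
            hits.append((s, s + width))
    return hits


def pick_control(chrom_seq, region, regulatory, width=300,
                 window=100_000, tol=0.02, rng=random):
    """Pick one G+C matched control near `region`; fall back to the whole
    chromosome if the vicinity holds no match. Returns None if none exists."""
    start, end = region
    target_gc = gc_content(chrom_seq[start:end])
    mid = (start + end) // 2
    lo = max(0, mid - window // 2)
    hi = min(len(chrom_seq), mid + window // 2)
    pool = matching_windows(chrom_seq, lo, hi, target_gc, regulatory, width, tol)
    if not pool:  # drop the vicinity criterion
        pool = matching_windows(chrom_seq, 0, len(chrom_seq), target_gc,
                                regulatory, width, tol)
    return rng.choice(pool) if pool else None
```

Calling `pick_control` ten times per regulatory region (while adding each chosen control to the list of excluded intervals) would be one way to assemble an unbalanced set.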
Definition of grasses regulatory regions
Sorghum (Sorghum bicolor) core promoter regions were obtained from the reference genome (v2.1) for the coordinates between -250;+50 base pairs surrounding the start position of genes with annotated 5’UTR, and a subset of 1000 sequences was randomly selected for further analyses. Rice (Oryza sativa Nipponbare) KNOTTED 1-like (i.e., OSH1) binding regions were obtained by re-analyzing a ChIP-seq experiment, starting with the download of raw data from DDBJ (http://www.ddbj.nig.ac.jp/) (accession numbers DRA000206 and DR000313) corresponding to two biological replicates of immunoprecipitation with α-OSH1 and IgG antibodies. Raw reads were mapped against the rice reference genome (IRGSP-1.0) using bowtie v1.1.2 (options -n 2, -l 60, -X 500, --best, --strata, -m 1), and low quality and duplicated reads were removed using picard (http://broadinstitute.github.io/picard/) (MarkDuplicates) and samtools (options -F 780, -F 1024, -f 2). MACS v2.1.0 was used for peak calling (options -g 3.73e8, -q 0.01) for each of the replicates, and 42 peaks with a reproducible absolute summit were reserved and further extended to 300 base pairs for downstream analyses. Corresponding control regions were obtained as explained above for maize. Briefly, each reference genome was divided into windows and, after removal of sequences overlapping the putative regulatory regions, we randomly selected sequences matching G+C content and, when possible, in the vicinity (∼10 kb) of each of the regulatory sequences.
Preprocessing of sequences
Sequences were preprocessed before fitting models. The preprocessing for the “bag-of-k-mers” model involves dividing each sequence into overlapping windows of a given size k (k-mers), sliding by 1 base pair, so that a sequence of length L yields (L-k)+1 k-mers. Next, k-mers were converted into tokens (t) that correspond to collapsed pairs of a k-mer and its reverse complement. For the “vector-k-mers” models, each sequence is described as a collection of “sentences” resulting from walking k times and sliding by 1 base pair. Each sentence is broken into ordered non-overlapping new tokens. For testing, sentences are divided into neighborhoods to obtain regulatory and non-regulatory likelihoods for groups of k-mers.
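The two tokenization schemes can be sketched in Python as follows. The sketch assumes that a k-mer and its reverse complement are collapsed onto the lexicographically smaller of the two strings, and that the k "sentences" correspond to the k possible starting offsets; the function names are hypothetical and the released implementation may differ in these details.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def bag_tokens(seq, k):
    """(L - k) + 1 overlapping k-mers, each collapsed with its reverse
    complement (assumed representative: the lexicographically smaller one)."""
    seq = seq.upper()
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return [min(kmer, reverse_complement(kmer)) for kmer in kmers]


def vector_sentences(seq, k):
    """k 'sentences', one per starting offset 0..k-1, each an ordered list of
    non-overlapping k-mers."""
    seq = seq.upper()
    return [
        [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
        for offset in range(k)
    ]


print(bag_tokens("ACGTT", 3))
print(vector_sentences("ACGTACGT", 3))
```

Collapsing strands halves the vocabulary for the “bag-of-k-mers” model, while the offset-based sentences preserve k-mer order, which is the information a co-occurrence model needs.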
Calculation of TF*IDF and implementation of the “bag-of-k-mers” model
To generate a “bag-of-k-mers” model, each training data set is represented as a matrix x, whose rows are the lists of token weights (Ws), one per sequence, together with a list y of sequence labels (1 for regulatory regions and 0 for control regions). The “bag-of-k-mers” model results from fitting a regression curve, y = f(x) (i.e., a logistic regression). The C parameter for the logistic regression was chosen by fivefold cross-validation using a grid search function. The logistic regression and grid search functions used here are those implemented in the Python library scikit-learn v0.19.0.
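As a rough illustration of how a matrix of token weights could be built, the sketch below computes textbook TF*IDF weights (term frequency multiplied by the logarithm of the inverse document frequency) for a list of tokenized sequences. The exact weighting variant used for the published models is not stated in this section, so the formula, the function name and the toy data are assumptions; scikit-learn's own vectorizers apply a smoothed and normalized variant by default.

```python
import math
from collections import Counter


def tfidf_matrix(token_lists):
    """Return (vocabulary, rows); each row holds the TF*IDF weights of one sequence.

    Assumed textbook form: tf = count / sequence length, idf = ln(N / df).
    """
    n_docs = len(token_lists)
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    # Document frequency: number of sequences in which each token occurs.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        row = []
        for t in vocabulary:
            tf = counts[t] / total if total else 0.0
            row.append(tf * math.log(n_docs / df[t]))
        rows.append(row)
    return vocabulary, rows


# Toy example: two tokenized sequences with labels y (1 = regulatory, 0 = control).
vocab, x = tfidf_matrix([["ACG", "ACG", "GTA"], ["ACG", "CCC"]])
y = [1, 0]
```

Under this form, a token present in every sequence (here "ACG") receives a weight of zero, and tokens specific to one class of sequences receive the largest weights. The matrix x and the labels y are then the inputs of the logistic regression.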
Implementation of “vector-k-mers” model
To generate “vector-k-mers” models we used the implementation of the word2vec algorithms in the Python library gensim v1.0.0, which fits sequence representations (k-mer vectors, vk-mers) via Stochastic Gradient Descent (SGD) to optimize an objective function that implicitly corresponds to the likelihood of k-mer co-occurrences [32, 57]. Next, as shown for text classification, the sequence representations (vk-mers) can be inverted via Bayes rule to determine the likelihood that a new sequence is part of a regulatory region based on its k-mer composition. This classification scheme interprets the individual vk-mers as components in a composite likelihood approximation that allows classification of sequences without extra modeling or estimation steps.
In brief, we trained a shallow (one hidden layer), fully connected neural network to optimize the probability of predicting a given target k-mer from its context, that is, from the co-occurring k-mers that appear anywhere within a small window around the target. We ran word2vec for 30 iterations using hierarchical softmax and no negative sampling for each data set (options iter=30, hs=1, negative=0, size=300, min_count=0 and window=5; all other parameters were kept at their defaults) to obtain two independent geometric spaces (continuous spaces of sequence representations), one for the regulatory regions (Vregulatory) and the other for the control regions (Vrandom).
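To make the notion of “context” concrete, the sketch below lists the (target, context) pairs that a symmetric window would produce for one sentence of k-mers. It only illustrates the windowing idea with window=5 as in the options above; it is not the gensim training loop, and the function name is hypothetical.

```python
def context_pairs(sentence, window=5):
    """Return (target, context) pairs for k-mers within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a k-mer is not its own context
                pairs.append((target, sentence[j]))
    return pairs
```

With a sentence of three tokens and window=1, the middle token is paired with both neighbors and each end token with its single neighbor.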
For the classification step, we calculated the probability of every new sequence si under each sequence representation (Vregulatory and Vrandom) by first calculating the likelihood of every window within a sentence (using the score function from gensim) and then averaging these likelihoods to obtain sentence likelihoods. Next, from the matrix of sentence likelihoods for the two categories (i.e., C = regulatory and control) we derived the sequence probabilities, pVregulatory(si) and pVrandom(si). The category probabilities were calculated via Bayes rule, using as prior πc=1/C, such that the classification proceeds by assigning the category for which pVcategory(si) is greater.
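The inversion step can be sketched as below, assuming that the scores returned for each sentence are log-likelihoods and that the sentence scores of a sequence are averaged before applying Bayes rule with a uniform prior. The function name and the input layout (a dictionary from category to a list of sentence scores) are hypothetical.

```python
import math


def classify(sentence_loglik, priors=None):
    """Assign a sequence to the category with the highest posterior probability.

    sentence_loglik maps each category to the log-likelihoods of the sequence's
    sentences under that category's model (assumed input layout).
    """
    categories = list(sentence_loglik)
    if priors is None:
        # Uniform prior, pi_c = 1 / C.
        priors = {c: 1.0 / len(categories) for c in categories}
    # Average the sentence log-likelihoods to obtain one value per category.
    mean_ll = {c: sum(v) / len(v) for c, v in sentence_loglik.items()}
    # Bayes rule in log space; subtracting the maximum keeps exp() stable.
    log_joint = {c: mean_ll[c] + math.log(priors[c]) for c in categories}
    top = max(log_joint.values())
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    posterior = {c: math.exp(log_joint[c] - log_norm) for c in categories}
    best = max(posterior, key=posterior.get)
    return best, posterior


label, post = classify({"regulatory": [-10.0, -12.0], "control": [-13.0, -11.0]})
```

With a uniform prior the posterior ordering is the same as the ordering of the averaged likelihoods, so the prior only matters if the two categories are weighted differently.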
Generation of PWMs collections
For each type of regulatory region we generated a collection of PWMs using the MEME-ChIP pipeline in discriminative mode. The PWMs were generated from the same training sets described above. Each collection of PWMs was then used to predict on the respective holdout set. To do so, we ran FIMO and considered a prediction “positive” for any sequence with a p-value of less than 1e-4 for any of the motifs and a PWM score greater than log2(10 000)=13.28 bits. These parameters have previously been defined as the “gold standard” for determining “positive PWM hits”. The collections of PWMs obtained with MEME-ChIP are available to the community at the Cyverse data store (http://datacommons.cyverse.org/browse/iplant/home/shared/panzea/dataFromPubs/Mejia2018BMCBiology)
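The decision rule for a “positive” PWM prediction can be written compactly. The sketch below assumes that the motif matches of one sequence have already been parsed into (p-value, score in bits) pairs; the function name and the input layout are hypothetical and do not reflect FIMO's output format.

```python
import math

P_VALUE_CUTOFF = 1e-4
SCORE_CUTOFF_BITS = math.log2(10_000)  # roughly 13.3 bits


def is_positive(hits):
    """Return True if any motif match passes both the p-value and the score cutoffs.

    hits: iterable of (p_value, score_bits) pairs for one sequence (assumed layout).
    """
    return any(p < P_VALUE_CUTOFF and s > SCORE_CUTOFF_BITS for p, s in hits)
```

A sequence without motif matches, or whose matches each fail one of the two cutoffs, is counted as a negative prediction.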
Confusion matrices and the Receiver Operating Characteristic (ROC) and precision-recall (PR) curves were generated using the Python library scikit-learn v0.19.0 and plotted with Python matplotlib v2.0.0.
In brief, for each trained model we obtained a confusion matrix by predicting on the holdout data and comparing the predictions against the true category to which each region belongs. As mentioned for training, each model’s performance was evaluated only on data from the same type of regulatory region on which the model was trained. For instance, only FEA4 data were used for training and evaluating FEA4 models.
From the confusion matrix we obtained:
True positives (TP): Regions predicted as regulatory that truly belong to the regulatory category.
True negatives (TN): Regions predicted as control that truly belong to the control category.
False positives (FP): Regions predicted as regulatory that truly belong to the control category (also known as a “Type I error”).
False negatives (FN): Regions predicted as control that truly belong to the regulatory category (also known as a “Type II error”).
To evaluate the models, we computed from the output of the confusion matrix the following metrics:
Accuracy: (TP+TN)/total regions
Precision: TP /(TP + FP)
Recall: TP /(TP + FN)
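The three metrics follow directly from the four counts of the confusion matrix. The sketch below is a plain-Python illustration with hypothetical function names; the published analysis relied on scikit-learn.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = regulatory, 0 = control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall as defined above (0.0 when a denominator is zero)."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


counts = confusion_counts([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
scores = metrics(*counts)
```

In this toy holdout set of six regions, one regulatory region is missed and one control region is wrongly called regulatory, so all three metrics equal 2/3.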
In addition to the metrics derived from the confusion matrix, we generated ROC and PR curves for each model. The ROC curve shows the true positive rate as a function of the false positive rate for different decision thresholds, so that each point on the curve corresponds to one pair of sensitivity and specificity values. The closer a ROC curve is to the upper left corner (auROC = 1), the better the performance of the classifier. The PR curve shows the trade-off between precision and recall for different decision thresholds. A high area under the PR curve represents both high recall (which relates to a low false negative rate) and high precision (which relates to a low false positive rate). The PR curve is preferred over the ROC curve for measuring the performance of a binary classifier on imbalanced data sets.
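Both summaries can be computed from the model scores alone. The sketch below is an illustration with hypothetical function names, not the scikit-learn implementation used for the figures: it obtains the auROC from its rank interpretation (the probability that a randomly chosen regulatory region scores higher than a randomly chosen control region, counting ties as one half) and lists the (recall, precision) points reached when the threshold is lowered one region at a time.

```python
def auroc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pr_points(y_true, scores):
    """(recall, precision) after each region, taken in decreasing order of score."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    n_pos = sum(y_true)
    tp = 0
    points = []
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        points.append((tp / n_pos, tp / rank))
    return points
```

The pairwise form is quadratic in the number of regions, which is fine for a toy example but not for large holdout sets.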
Prediction of open chromatin regions in the ZmW22 genome
To evaluate model performance in the annotation of a non-reference maize genome we used the recently published W22 genome. First, we collected “ground truths” by aligning MNase-seq regions from B73 to W22 using MUMmer4, a genome alignment system that can handle alignments of divergent DNA sequences. Hits in the W22 genome located on the corresponding chromosome were considered “truths”, or homologous regions. Next, we used the bag-of-k-mers models trained on MNase-seq data to score overlapping windows (length 300 bp, stride 150 bp) in a 4 kb region centered on the hit. We used the vector-k-mers models to score each window based on its similarity to B73 MNase-seq regions. For this, we calculated the mean of the k-mer vectors to obtain a “centroid” that summarizes each evaluated window, and then calculated the cosine similarity between this centroid and the centroid vector of the B73 MNase-seq regions. The best-scored window was compared against the hits from MUMmer4 and counted as intersecting if at least half of the length of the window was included in the MUMmer4 hit. A file with the coordinates and the predictions from each model, as well as the MUMmer4 results, is available to the community at the Cyverse data store (http://datacommons.cyverse.org/browse/iplant/home/shared/panzea/dataFromPubs/Mejia2018BMCBiology)
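The window scan, the centroid similarity and the overlap criterion can be sketched as follows. The code assumes half-open coordinates and k-mer vectors given as plain lists of floats; all function names are hypothetical.

```python
import math


def windows(start, end, length=300, stride=150):
    """Overlapping windows of `length` bp every `stride` bp inside [start, end)."""
    return [(s, s + length) for s in range(start, end - length + 1, stride)]


def centroid(vectors):
    """Element-wise mean of a list of equally sized k-mer vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def intersects(window, hit):
    """True if at least half of the window length lies inside the hit."""
    overlap = min(window[1], hit[1]) - max(window[0], hit[0])
    return overlap >= (window[1] - window[0]) / 2
```

Under these assumptions a 4 kb region yields 25 candidate windows; the best-scored one is then passed to intersects together with the coordinates of the MUMmer4 hit.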
Calculation of k-mer complexity on a TF motifs database
To empirically establish a threshold of complexity for k-mers within regulatory regions we calculated the k-mer complexity for any given k and for all the consensus sequences derived from transcription factor (TF) binding models represented as Position Weight Matrices (PWMs) in the HOmo sapiens COmprehensive MOdel COllection (HOCOMOCO) v11 .
Motif enrichment analyses
The statistical significance of the enrichment was calculated using the hypergeometric test, as implemented in the Python library scipy 0.18.1 (stats.hypergeom), after applying the Bonferroni correction for multiple hypothesis testing to the α (alpha) value required for statistical significance.
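For readers who want to see the test spelled out, the sketch below computes the upper-tail hypergeometric probability with the standard library and applies a Bonferroni-corrected threshold. It is a stand-in for scipy.stats.hypergeom rather than the code used in the study; the function names are hypothetical, the argument order (population size M, successes in the population n, number of draws N) mirrors scipy's convention, and the default α of 0.05 is an assumption, since the value is not given here.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric(population M, successes n, draws N)."""
    upper = min(n, N)
    favorable = sum(comb(n, i) * comb(M - n, N - i) for i in range(max(k, 0), upper + 1))
    return favorable / comb(M, N)


def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values below alpha divided by the number of tests (alpha is an assumed default)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```

For example, drawing 3 regions from a population of 10 in which 4 carry a motif gives a probability of 1/3 of observing the motif in at least 2 of the drawn regions.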
We thank the members of the Buckler lab for comments that greatly improved the manuscript, and especially Sara Miller for her assistance with language editing and proofreading.
This work was funded by the NSF Plant Genome Project (IOS #1238014) and the USDA-ARS. The funding sources had no role in the design of the study, data collection, data analysis, or manuscript writing.
Availability of data and materials
All the regulatory region sequences and their controls, as well as the code used to train the models and evaluate their performance, are available through a public Bitbucket repository (https://bitbucket.org/bucklerlab/k-mer_grammar/) and through the Cyverse data store (http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/Mejia2019BMCBiology/).
MKMG, Conceptualization, Data curation, Software, Formal analysis, Methodology, Writing—original draft, Writing—review and editing; ESB, Conceptualization, Supervision, Funding acquisition, Writing—review and editing. Both authors read and approved the final version of the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 1. Wallace JG, Bradbury PJ, Zhang N, Gibon Y, Stitt M, Buckler ES. Association mapping across numerous traits reveals patterns of functional variation in maize. PLoS Genet. 2014; 10(12):1004845.
- 3. Rodgers-Melnick E, Vera DL, Bass HW, Buckler ES. Open chromatin reveals the functional maize genome. Proc Natl Acad Sci U S A. 2016; 113(22):3177–84.
- 13. Kleftogiannis D, Kalnis P, Bajic VB. Progress and challenges in bioinformatics approaches for enhancer identification. Brief Bioinforma. 2015; 17(6):967–79.
- 24. Ghandi M, Lee D, Mohammad-Noori M, Beer MA. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput Biol. 2014; 10(7):1003711.
- 28. Zhang D, Wang D. Relation classification: CNN or RNN? In: Lin CY, Xue N, Zhao D, Huang X, Feng Y, editors. Natural Language Understanding and Intelligent Applications. ICCPOL 2016, NLPCC 2016. Lecture Notes in Computer Science, vol 10102. Cham: Springer: 2016. p. 665–75.
- 29. Yin W, Kann K, Yu M, Schütze H. Comparative study of CNN and RNN for natural language processing. ArXiv e-prints. 2017; abs/1702.01923. http://arxiv.org/abs/1702.01923.
- 30. Manning CD, Schütze H. Foundations of Statistical Natural Language Processing. MIT Press. 1999; 5:141–77.
- 31. Mikolov T, Chen K, Corrado GS, Dean J. Efficient estimation of word representations in vector space. ArXiv e-prints. 2013; abs/1301.3781. http://arxiv.org/abs/1301.3781.
- 32. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their compositionality. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors. Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS’13), vol 2. USA: Curran Associates, Inc.: 2013. p. 3111–9.
- 33. Taddy M. Document classification by inversion of distributed language representations. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Stroudsburg: Association for Computational Linguistics: 2015. p. 45–9.
- 37. Soderlund C, Descour A, Kudrna D, Bomhoff M, Boyd L, Currie J, Angelova A, Collura K, Wissotski M, Ashley E, Morrow D, Fernandes J, Walbot V, Yu Y. Sequencing, mapping, and analysis of 27,455 maize full-length cDNAs. PLoS Genet. 2009; 5(11):1000740.
- 41. Springer NM, Anderson SN, Andorf CM, Ahern KR, Bai F, Barad O, Barbazuk WB, Bass HW, Baruch K, Ben-Zvi G, Buckler ES, Bukowski R, Campbell MS, Cannon EKS, Chomet P, Dawe RK, Davenport R, Dooner HK, Du LH, Du C, Easterling KA, Gault C, Guan J-C, Hunter CT, Jander G, Jiao Y, Koch KE, Kol G, Köllner TG, Kudo T, Li Q, Lu F, Mayfield-Jones D, Mei W, McCarty DR, Noshay JM, Portwood JL, Ronen G, Settles AM, Shem-Tov D, Shi J, Soifer I, Stein JC, Stitzer MC, Suzuki M, Vera DL, Vollbrecht E, Vrebalov JT, Ware D, Wei S, Wimalanathan K, Woodhouse MR, Xiong W, Brutnell TP. The maize W22 genome provides a foundation for functional genomics and transposon biology. Nat Genet. 2018; 50(9):1282–8.
- 43. Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, Pierce BG, Dong X, Kundaje A, Cheng Y, Rando OJ, Birney E, Myers RM, Noble WS, Snyder M, Weng Z. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 2012; 22(9):1798–812.
- 45. Levy O, Goldberg Y. Linguistic regularities in sparse and explicit word representations. In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Stroudsburg: Association for Computational Linguistics: 2014. p. 171–80.
- 46. Webber W, Moffat A, Zobel J. A similarity measure for indefinite rankings. ACM Trans Inf Syst. 2010; 28(4):38. https://doi.org/10.1145/1852102.1852106.
- 47. Jiao Y, Peluso P, Shi J, Liang T, Stitzer MC, Wang B, Campbell MS, Stein JC, Wei X, Chin C-S, Guill K, Regulski M, Kumari S, Olson A, Gent J, Schneider KL, Wolfgruber TK, May MR, Springer NM, Antoniou E, McCombie WR, Presting GG, McMullen M, Ross-Ibarra J, Dawe RK, Hastie A, Rank DR, Ware D. Improved maize reference genome with single-molecule technologies. Nature. 2017; 546(7659):524–7.
- 50. Asgari E, Mofrad MRK. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE. 2015; 10(11):0141287.
- 51. Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, Minx P, Reily AD, Courtney L, Kruchowski SS, Tomlinson C, Strong C, Delehaunty K, Fronick C, Courtney B, Rock SM, Belter E, Du F, Kim K, Abbott RM, Cotton M, Levy A, Marchetto P, Ochoa K, Jackson SM, Gillam B, Chen W, Yan L, Higginbotham J, Cardenas M, Waligorski J, Applebaum E, Phelps L, Falcone J, Kanchi K, Thane T, Scimone A, Thane N, Henke J, Wang T, Ruppert J, Shah N, Rotter K, Hodges J, Ingenthron E, Cordes M, Kohlberg S, Sgro J, Delgado B, Mead K, Chinwalla A, Leonard S, Crouse K, Collura K, Kudrna D, Currie J, He R, Angelova A, Rajasekar S, Mueller T, Lomeli R, Scara G, Ko A, Delaney K, Wissotski M, Lopez G, Campos D, Braidotti M, Ashley E, Golser W, Kim H, Lee S, Lin J, Dujmic Z, Kim W, Talag J, Zuccolo A, Fan C, Sebastian A, Kramer M, Spiegel L, Nascimento L, Zutavern T, Miller B, Ambroise C, Muller S, Spooner W, Narechania A, Ren L, Wei S, Kumari S, Faga B, Levy MJ, McMahan L, Van Buren P, Vaughn MW, Ying K, Yeh C-T, Emrich SJ, Jia Y, Kalyanaraman A, Hsia A-P, Barbazuk WB, Baucom RS, Brutnell TP, Carpita NC, Chaparro C, Chia J-M, Deragon J-M, Estill JC, Fu Y, Jeddeloh JA, Han Y, Lee H, Li P, Lisch DR, Liu S, Liu Z, Nagel DH, McCann MC, SanMiguel P, Myers AM, Nettleton D, Nguyen J, Penning BW, Ponnala L, Schneider KL, Schwartz DC, Sharma A, Soderlund C, Springer NM, Sun Q, Wang H, Waterman M, Westerman R, Wolfgruber TK, Yang L, Yu Y, Zhang L, Zhou S, Zhu Q, Bennetzen JL, Dawe RK, Jiang J, Jiang N, Presting GG, Wessler SR, Aluru S, Martienssen RA, Clifton SW, McCombie WR, Wing RA, Wilson RK. The B73 maize genome: complexity, diversity, and dynamics. Science. 2009; 326(5956):1112–5.
- 52. Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, Haberer G, Hellsten U, Mitros T, Poliakov A, Schmutz J, Spannagl M, Tang H, Wang X, Wicker T, Bharti AK, Chapman J, Feltus FA, Gowik U, Grigoriev IV, Lyons E, Maher CA, Martis M, Narechania A, Otillar RP, Penning BW, Salamov AA, Wang Y, Zhang L, Carpita NC, Freeling M, Gingle AR, Hash CT, Keller B, Klein P, Kresovich S, McCann MC, Ming R, Peterson DG, Mehboob-ur-Rahman, Ware D, Westhoff P, Mayer KFX, Messing J, Rokhsar DS. The Sorghum bicolor genome and the diversification of grasses. Nature. 2009; 457(7229):551–6.
- 53. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009; 10(3):25.
- 55. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008; 9(9):137.
- 56. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay É. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011; 12(Oct):2825–30.
- 57. Rehurek R, Sojka P. Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta: University of Malta: 2010. p. 46–50. ISBN 2-9517408-6-7.
- 58. Hunter JD. Matplotlib: A 2D graphics environment. Comput Sci Eng. 2007; 9(3):90–5.
- 59. Marçais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, Zimin A. MUMmer4: A fast and versatile genome alignment system. PLoS Comput Biol. 2018; 14(1):1–14.
- 60. Kulakovskiy IV, Vorontsov IE, Yevshin IS, Soboleva AV, Kasianov AS, Ashoor H, Ba-Alawi W, Bajic VB, Medvedeva YA, Kolpakov FA, Makeev VJ. HOCOMOCO: expansion and enhancement of the collection of transcription factor binding sites models. Nucleic Acids Res. 2016; 44(D1):116–25.
- 61. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol. 2007; 8(2):24.
- 62. Jones E, Oliphant T, Peterson P, et al. SciPy: Open source scientific tools for Python. 2001. http://www.scipy.org/. Accessed 18 Jan 2017.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.