Introduction

Cells are the fundamental building blocks of living organisms. Cells constitute tissues, tissues form organs, and combinations of organs form organ systems [1, 2]. The way living organisms grow, survive, develop and reproduce is regulated by an instruction manual called deoxyribonucleic acid (DNA), or the genetic code [1, 2]. The genetic code is organised into chromatin as a series of nucleosomes; in each nucleosome, DNA is wrapped around a histone octamer made up of four pairs of histone proteins (H2A, H2B, H3 and H4). A graphical representation of nucleosome construction, showing the histone octamer and DNA binding, is given in Supplementary Figure 1. Nucleosomes play an important role in gene regulation because gene transcription is blocked in regions where DNA is tightly packed by nucleosomes. Nucleosome occupancy affects epigenetic silencing [3], cell replication [4], differentiation [5], and re-programming [5]. Determining whether the DNA around a histone octamer is tightly or loosely wrapped, a task known as histone occupancy determination, is of profound importance in genetic research [6, 7]. Accurate determination of histone occupancy can facilitate a deeper understanding of DNA accessibility to proteins [8, 9], chromatin functions [10, 11], and the correlation of occupancy with promoter strength [12].

Like histone occupancy, histone modifications (acetylation, methylation, phosphorylation, sumoylation, and ubiquitylation) regulate a wide range of biological processes, including chromosome wrapping [13, 14], transcriptional activation and de-activation [15,16,17], and the damaging and repairing of DNA [18, 19]. For instance, modifications of histone amino (N)-terminal tails influence internucleosomal exchanges and can modify chromatin structures, which ultimately affects gene expression [20] and gives rise to many complex diseases, such as cancer [13].

To gain a deeper comprehension of epigenetic regulation at the cellular level, and to pave the way for the development of drugs that target cancer and histone-altering enzymes [21], histone modification detection is essential [22]. Histone modifications [23] largely affect the availability of DNA to different transcription factors and ribonucleic acid polymerases. Histone octamers repeat across all nucleosomes in histone sequences; hence, the properties of nucleosomes primarily rely on the areas of histone sequences that carry specific acetylation and methylation levels [24, 25]. In addition, methylation of histone proteins H3 and H4 mainly regulates the core activity of DNA replication [26], and acetylation of different histone proteins impacts chromatin structure as well as gene transcription [27, 28]. A thorough analysis of acetylation and methylation areas in histone sequences can therefore decipher the association of histone modification with metabolism, which mediates diverse epigenetic abnormalities in many pathological conditions [29].

Developing a robust computational approach for accurate histone occupancy and modification prediction has been an active area of research since the public release of ten benchmark data sets by [30]. Of these ten benchmark histone marker data sets, two relate to histone occupancy, three to histone acetylation, and five to histone methylation. Across all ten data sets, histone sequences with an occupancy, acetylation, or methylation level greater than 1.2 belong to the positive class, and those with a level lower than 0.8 belong to the negative class. To perform binary classification on all ten data sets, [30] proposed the first computational approach for histone occupancy and modification prediction in the yeast genome. Their approach used the occurrence of higher order residues to generate a statistical representation of histone sequences, together with a Support Vector Machine classifier.

Using the 10 benchmark data sets provided by [30] for histone occupancy, acetylation, and methylation, a number of computational methodologies have been developed to date [31,32,33,34,35,36]. The prime focus of existing computational approaches [30, 34,35,36] has been to generate a rich statistical representation of histone sequences. In this regard, a few researchers have used bag-of-words-based approaches [30, 34], whereas others have used a one-hot encoding scheme [31,32,33, 35]. A bag-of-words representation captures only residue frequency and neglects rich semantic information, while one-hot encoding fails to capture comprehensive contextual information and correlations of residues. Furthermore, both schemes suffer from the curse of dimensionality once higher order sequence residues are introduced.

Recently, [36] proposed a deep learning approach for histone occupancy and modification prediction. For each sequence, they transformed the one-hot encoded vectors of higher order residues into an image-like tensor by assigning each higher order residue to a pixel using Hilbert curves. The image-based representation of histone sequences was passed to a CNN model to extract important residue correlations and dependencies. Although the image-based representation manages to find discriminative sequence residues, it fails to handle translational invariance of residues, mainly because its attention is concentrated on local residue context. Although histone sequences are composed of four basic residues [adenine (A), cytosine (C), guanine (G), and thymine (T)], treating them as a simple string of repetitive letters neglects their biologically relevant, inherent spatial configuration as well as the interactions between sequence residues. The complex spatial composition of histone sequences calls for a rich statistical representation which can effectively capture long-range dependencies of residues. However, owing to the lack of a comprehensive understanding of sequence residue patterns, no such representation scheme exists for histone sequences related to histone occupancy, methylation, and acetylation.

To address these deficiencies and establish an improved, more robust histone occupancy and modification landscape, this paper develops a lightweight multi-paradigm computational framework, namely, “Histone-Net”. Considering the efficacy of neural language modelling in diverse Natural Language Processing (NLP) [37] and Bioinformatics tasks [38] for capturing long-range dependencies and relatedness of sequence residues, as well as for improving the generalizability of the predictive pipeline, Histone-Net makes use of neural language modelling to generate a rich distributed representation of histone sequences. Inspired by the extensive use of the FastText model to generate word or higher order residue embeddings in an unsupervised manner for diverse NLP (e.g., text classification) [37, 39] and Bioinformatics tasks (e.g., protein family classification [40], enhancer prediction [41], N6-methyladenine site prediction [42]), Histone-Net generates unsupervised higher order residue embeddings (DNA2Vec) of histone sequences. Furthermore, Histone-Net presents a different application of the FastText model, in which histone occupancy and modification information is incorporated while learning higher order residue embeddings (SuperDNA2Vec) of histone sequences. To investigate which distributed representation learning scheme better captures coarse-grained and fine-grained relations of higher order residues, a rigorous intrinsic evaluation of both kinds of embeddings is performed by mapping the high-dimensional feature space into a low-dimensional feature space using Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE). An extrinsic evaluation of both types of embeddings is also performed using three different machine learning classifiers (Random Forest, AdaBoost, Support Vector Machine).

Existing approaches [31,32,33,34,35,36] have not been evaluated in a cross-domain binary classification paradigm, where, for each histone sequence analysis task, the model is trained on one type of histone marker and tested on another. We explore the performance potential of the proposed Histone-Net approach in this cross-domain paradigm. Furthermore, a critical analysis reveals that in all existing approaches, 10 different checkpoints are obtained by training a single model separately on the 10 benchmark genomic data sets to predict histone occupancy, methylation, and acetylation areas in histone sequences. With this strategy, one needs to know the target histone marker beforehand in order to select the appropriate checkpoint when making predictions on unseen histone sequences. More recently, [31] developed a deep learning approach, “DeepHistone”, to simultaneously predict the different histone markers associated with a particular sequence. However, DeepHistone can only detect the type of histone marker modification and cannot predict histone occupancy and modification levels. Inspired by the work of Yin et al. [31], who treated the identification of histone markers as a multi-label classification problem, we develop a multi-label classification paradigm to avoid the expensive overhead of generating separate model checkpoints for the ten benchmark data sets belonging to three histone sequence analysis tasks. More specifically, in the multi-label classification paradigm, the performance of Histone-Net is evaluated in terms of its ability to simultaneously predict the histone marker type and its occupancy, acetylation, and methylation levels.
A comprehensive evaluation of the proposed approach under intra-domain and cross-domain binary classification as well as the multi-label classification paradigm shows that it outperforms the state-of-the-art predictor, generalizes across multiple histone markers, and can simultaneously predict histone type and its occupancy, acetylation, and methylation areas using a single deep learning model.

Fig. 1 Number of discriminative features of the positive and negative classes, along with the features shared by both classes, for one data set each of histone occupancy, acetylation, and methylation

Materials and methods

This section illustrates different modules of computational framework Histone-Net, benchmark binary classification data sets, the process used to develop a multi-label classification data set, and evaluation metrics used to evaluate the integrity of Histone-Net in binary and multi-label classification paradigm.

Histone-Net framework

To better illustrate the dynamics of the proposed computational framework Histone-Net, this section briefly describes its four modules. The first module generates overlapping higher order residues of histone sequences; details of residue generation are provided in “Higher order residue generation of histone sequences”. The second module generates unsupervised sequence embeddings (DNA2Vec) using the FastText model, details of which are provided in “Unsupervised distributed representation of histone sequences (DNA2Vec)”. The third module presents a novel approach, namely, Histone-Net, which generates residue embeddings that are aware of histone occupancy and modification information in a supervised manner (SuperDNA2Vec) and can perform histone occupancy and modification prediction in the binary classification paradigm. Details of the third module are provided in “Histone-Net approach based on supervised distributed representation of histone sequences (SuperDNA2Vec)”.

Fourth module presents the work of Histone-Net in multi-label classification paradigm, where it simultaneously predicts histone marker type, occupancy, acetylation, and methylation levels, a brief description of which is given in “Histone-Net: multi-label classification paradigm”.

Higher order residue generation of histone sequences

Histone sequences are composed of four repetitive letters: adenine (A), guanine (G), cytosine (C), and thymine (T). One of the most common approaches to analyzing long histone sequences is to segment them into higher order residues [43,44,45,46,47,48]. Two kinds of higher order residues, overlapping and non-overlapping [49], can be derived from a sequence by sliding a fixed-size window with a certain stride (Supplementary Figure 2). Overlapping higher order residues are generated when the stride is smaller than the window size, whereas non-overlapping higher order residues are generated when the stride equals the window size. Non-overlapping higher order residues lose the deep correlation information of basic residues, whereas overlapping residues retain it. Following existing sequence analysis work [41, 50,51,52,53], we generate overlapping higher order residues of 11 different degrees, ranging from 2 to 12, to comprehensively assess the impact of higher order residues in disentangling rich contextual information for 3 distinct histone sequence analysis tasks.
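The window-and-stride procedure described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this work; the function name `kmers` and the toy sequence are assumptions made for the example.

```python
def kmers(sequence, k, stride=1):
    """Slide a window of size k over the sequence, moving `stride` letters each step."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


seq = "ACGTACGT"  # toy sequence, not taken from the benchmark data sets
overlapping = kmers(seq, k=3, stride=1)      # stride < k: neighbouring windows share letters
non_overlapping = kmers(seq, k=3, stride=3)  # stride == k: windows are disjoint
print(overlapping)      # ['ACG', 'CGT', 'GTA', 'TAC', 'ACG', 'CGT']
print(non_overlapping)  # ['ACG', 'TAC']
```

With a stride of 1, every letter except those at the two ends appears in k consecutive windows, which is why overlapping residues preserve the local context that disjoint windows cut apart.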

Furthermore, because the performance of machine learning classifiers depends strongly on the number of discriminative higher order residues, we analyze which window size of overlapping higher order residues generates the most discriminative residues for the histone occupancy, acetylation, and methylation prediction tasks. We find that no discriminative residues exist up to 7-mers; from 8-mers to 12-mers, however, a considerable number of residues occur in one class and are absent from the other.

To provide a bird’s eye view of the discriminative potential of different overlapping higher order residues across classes, we randomly select one data set from each histone sequence analysis task and show the overlapping and discriminative features of each class for 5 different higher order residues (8-mers to 12-mers) as Venn diagrams (Fig. 1). As shown in Fig. 1, for the histone occupancy H3 data set with 8-mers, the positive class has only 20 unique residues which are not present in the negative class, and the negative class has 10 unique residues which are not present in the positive class, whereas 65,506 residues are present in both classes. As the degree of k-mers increases, the numbers of discriminative as well as overlapping residues also increase. A similar trend can be seen for the histone acetylation (H3K9ac) and methylation (H3K4me1) data sets.
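The three regions of each Venn diagram correspond to simple set operations on the k-mer vocabularies of the two classes. The sketch below shows the idea on toy sequences; the helper names `kmer_set` and `split_kmers` are hypothetical and the sequences are not taken from the benchmark data sets.

```python
def kmer_set(sequences, k):
    """Collect every distinct overlapping k-mer found in a list of sequences."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}


def split_kmers(pos_sequences, neg_sequences, k):
    """Return (positive-only, negative-only, shared) k-mer sets."""
    pos, neg = kmer_set(pos_sequences, k), kmer_set(neg_sequences, k)
    return pos - neg, neg - pos, pos & neg


pos_only, neg_only, shared = split_kmers(
    ["ACGTAC", "GGTACC"], ["ACGTTT", "TTACGG"], k=3
)
print(sorted(pos_only))  # ['ACC', 'GGT', 'GTA']
print(sorted(neg_only))  # ['CGG', 'GTT', 'TTA', 'TTT']
print(sorted(shared))    # ['ACG', 'CGT', 'TAC']

# Consistency check on the reported H3 8-mer counts: the three regions
# add up to all 4**8 possible 8-mers over the alphabet {A, C, G, T}.
print(20 + 10 + 65_506 == 4 ** 8)  # True
```

The final check also explains why shorter k-mers cannot be discriminative on data sets of this size: when the vocabulary is small, every possible k-mer occurs in both classes.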

Unsupervised distributed representation of histone sequences (DNA2Vec)

In Natural Language Processing (NLP), two neural embedding generation models, namely, Continuous Bag of Words (CBOW) and Skip-gram, have been extensively used to learn dense word vectors based on semantic relationships. Building on the success of these models in diverse NLP, genomics, and proteomics sequence analysis tasks, we generate a rich distributed representation of histone sequences using the CBOW model in an unsupervised manner (DNA2Vec), the workflow of which is illustrated in Fig. 2. As shown in the figure, for each histone sequence analysis task, embedding vectors of higher order residues (k-mers) are generated in four steps. First, a corpus of overlapping higher order residues (k-mers) is generated by combining all the data sets of the corresponding task only. Second, each higher order residue is represented as a bag of sub-mers.

Fig. 2 Workflow for generating unsupervised higher order residue embeddings (DNA2Vec)

Third, an element-wise addition of the randomly initialized vectors of the different sub-mers present in a higher order residue is performed to generate vectors that are aware of the inner residue distribution. Finally, another vector addition is performed between these embedding vectors and the vector of the complete higher order residue. In this manner, the generated vectors of higher order residues contain information about the internal residue structure along with the properties of the residues.

The training objective of the unsupervised embedding generation model is to optimize the higher order residue embedding matrix in such a way that the model accurately estimates the probability of the target higher order residue (over the whole vocabulary of higher order residues), given its contextual information, through a softmax layer. For each sequence analysis task, the distributed representation of a histone sequence is generated by averaging the distributed representations of the higher order residues present in it.
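The composition steps above (sub-mer vectors summed with the whole k-mer vector, then k-mer vectors averaged per sequence) can be sketched as follows. This is an illustration under stated assumptions: random vectors stand in for trained embeddings, the sub-mer length range (3 to 4) and the class name `ToyEmbeddings` are hypothetical, and the CBOW training loop itself is omitted.

```python
import random

DIM = 4  # toy embedding size; the experiments in this paper use 25 to 150 dimensions


def sub_mers(kmer, n_min=3, n_max=4):
    """All contiguous sub-mers of a k-mer with lengths n_min..n_max (assumed range)."""
    return [kmer[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(kmer) - n + 1)]


class ToyEmbeddings:
    """Lazily created random vectors standing in for trained parameters."""

    def __init__(self, dim=DIM, seed=0):
        self.dim, self.rng, self.table = dim, random.Random(seed), {}

    def vector(self, token):
        if token not in self.table:
            self.table[token] = [self.rng.uniform(-1, 1) for _ in range(self.dim)]
        return self.table[token]

    def kmer_vector(self, kmer):
        """Whole k-mer vector plus the element-wise sum of its sub-mer vectors."""
        # "<...>" marks the complete k-mer so it never collides with a sub-mer key.
        parts = [self.vector("<" + kmer + ">")] + [self.vector(s) for s in sub_mers(kmer)]
        return [sum(column) for column in zip(*parts)]

    def sequence_vector(self, sequence, k):
        """Average the vectors of all overlapping k-mers in the sequence."""
        vectors = [self.kmer_vector(sequence[i:i + k])
                   for i in range(len(sequence) - k + 1)]
        return [sum(column) / len(vectors) for column in zip(*vectors)]


emb = ToyEmbeddings()
print(sub_mers("ACGTA"))  # ['ACG', 'CGT', 'GTA', 'ACGT', 'CGTA']
print(len(emb.sequence_vector("ACGTACGT", k=5)))  # 4
```

Because sub-mer vectors are shared between k-mers, two k-mers that differ in a single letter still receive similar vectors, which is the property that makes the representation aware of internal residue structure.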

Histone-Net approach based on supervised distributed representation of histone sequences (SuperDNA2Vec)

The infusion of meta information (e.g., task type, task nature) into the training process of an unsupervised distributed representation generation model can encapsulate a variety of inherent relationships. Such information correlates the local as well as the global distribution of sequence residues with the target task, which eventually supports accurate identification of the target task. For efficient histone sequence analysis, we develop a distributed representation generation model (SuperDNA2Vec) which encapsulates histone occupancy, acetylation, and methylation information along with the semantic information of higher order residues using the feedback mechanism of supervised learning. As shown in Fig. 3, for each histone sequence analysis task (histone occupancy, acetylation, or methylation prediction), only the data sets of the corresponding task are combined. Using this task-specific data set collection, the process of generating a task-aware distributed representation of histone sequences starts by dividing the sequences into higher order residues (k-mers, e.g., 10-mers). Then, distributed representations of sequences are learned by averaging the randomly initialized vectors of the higher order residues present in each sequence, regularized with a dropout rate of 0.5. The training objective of the embedding generation model is to optimize the higher order residue embedding matrix by minimizing the cross-entropy loss produced while estimating the probability of the target class (over all classes) through a softmax layer. Through iterative learning, the embedding generation model encapsulates histone occupancy, acetylation, and methylation information in the embedding matrix. The trained deep learning model (Fig. 3) without its softmax layer acts as an embedding generation model (SuperDNA2Vec) for test histone sequences, whereas the entire deep learning model (Histone-Net) predicts histone occupancy, acetylation, and methylation using the task-aware distributed representation of histone sequences generated by the SuperDNA2Vec model.
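To make the data flow concrete, the following sketch traces one forward pass through such a model: k-mer vectors are averaged, optionally passed through dropout, projected by a linear layer, and scored with softmax and cross-entropy. It is a simplified illustration, not the Histone-Net code; all names and the toy numbers are assumptions, and the backward pass that updates the embedding matrix is omitted.

```python
import math
import random


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def dropout(vector, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(kmer_vectors, weights, bias, rate=0.0, rng=None):
    """Averaged k-mer embedding -> optional dropout -> linear layer -> softmax."""
    hidden = mean_vector(kmer_vectors)  # sequence representation (the embedding)
    if rate > 0.0:
        hidden = dropout(hidden, rate, rng or random.Random(0))
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    return hidden, softmax(logits)


def cross_entropy(probs, target):
    return -math.log(probs[target])


# Toy numbers: three 2-dimensional k-mer vectors and a 2-class output layer.
kmer_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, bias = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
embedding, probs = forward(kmer_vectors, weights, bias)  # rate=0.0 mimics inference
loss = cross_entropy(probs, target=1)
print(probs)  # [0.5, 0.5]
```

The first return value corresponds to the role SuperDNA2Vec plays (the model with the softmax layer removed), while the second corresponds to the full classifier output.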

Fig. 3 Workflow for generating supervised higher order residue embeddings (SuperDNA2Vec) and the Histone-Net approach

Table 1 Statistics of 10 benchmark data sets including 2 data sets for histone occupancy detection, 3 data sets for acetylation and 5 data sets for methylation level prediction

Histone-Net: multi-label classification paradigm

The architecture of Histone-Net in the multi-label classification paradigm (Supplementary Figure 3) is identical to that of the binary classification paradigm discussed earlier in “Histone-Net approach based on supervised distributed representation of histone sequences (SuperDNA2Vec)”, except for 2 major differences. First, in the binary classification paradigm, the SuperDNA2Vec sequence representation embeds only one type of task-specific information (e.g., occupancy, acetylation, or methylation) for the data set at hand, whereas in the multi-label classification paradigm it integrates histone marker type, occupancy, acetylation, and methylation information at the same time. Second, the last layer uses a sigmoid activation function with binary cross-entropy as the loss function instead of softmax activation with categorical cross-entropy loss.
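The effect of replacing softmax with per-label sigmoid outputs can be seen in a small sketch. The values are toy numbers and the 0.5 decision threshold is an assumption made for the example, not a setting reported in this paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(logits, targets):
    """Mean of independent per-label losses; labels do not compete as under softmax."""
    losses = []
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses)


logits = [2.0, -1.0, 0.0]  # one raw score per label (toy values)
targets = [1, 0, 1]        # several labels can be active for the same sequence
probs = [sigmoid(z) for z in logits]
predicted = [int(p >= 0.5) for p in probs]  # 0.5 threshold is an assumption
print(predicted)  # [1, 0, 1]
print(round(binary_cross_entropy(logits, targets), 4))
```

Softmax forces the outputs to sum to one and therefore suits a single target class, whereas independent sigmoids allow any subset of labels to be switched on at once.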

Benchmark binary classification data sets for histone occupancy and modification prediction

This section describes the 10 public benchmark histone occupancy and modification (acetylation and methylation) prediction data sets [36] used to evaluate the performance of the proposed multi-paradigm computational framework Histone-Net. The process used to prepare the 10 benchmark data sets is described in previous studies [54]; here we only summarize their statistics. Table 1 describes the sample-to-label distribution of the 2 histone occupancy (H3, H4), 5 methylation (H3K4me1, H3K4me2, H3K4me3, H3K36me3, H3K79me3), and 3 acetylation data sets (H3K9ac, H3K14ac, H4ac). In the names of the acetylation and methylation data sets, K followed by a number denotes the lysine residue at that position of the histone protein, which carries an acetyl (“ac”) or a mono-, di-, or tri-methyl (“me1”, “me2”, “me3”) modification. For example, in the H3K4me1 data set, the 4th amino acid of the H3 protein is modified with a mono-methyl group.

Fig. 4 Development workflow of imbalanced and balanced multi-label classification data sets for histone occupancy and modification prediction

For each benchmark data set, histone sequences having relative occupancy, methylation, and acetylation values greater than 1.2 belong to positive class and lower than 0.8 belong to negative class. In binary classification paradigm, Histone-Net is evaluated in intra-domain and cross-domain settings using 10 benchmark binary classification data sets.
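The thresholding rule stated above can be written as a short function. This is a sketch only: the function name is hypothetical, and the treatment of values between 0.8 and 1.2 (returned as `None`, i.e. assigned to neither class) is an assumption inferred from the two thresholds.

```python
from typing import Optional

POSITIVE_THRESHOLD = 1.2
NEGATIVE_THRESHOLD = 0.8


def binary_label(level: float) -> Optional[int]:
    """Map a relative occupancy or modification level to 1, 0, or None."""
    if level > POSITIVE_THRESHOLD:
        return 1  # positive class
    if level < NEGATIVE_THRESHOLD:
        return 0  # negative class
    return None   # assumption: intermediate values belong to neither class


print([binary_label(v) for v in (1.5, 0.5, 1.0)])  # [1, 0, None]
```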

Multi-label classification data set for histone occupancy and modification prediction

This section describes the process used to develop multi-label classification data set for histone occupancy and modification prediction.

Figure 4 illustrates the complete workflow used to develop the imbalanced and balanced versions of the multi-label histone sequence analysis data set from the 10 benchmark data sets given by [54]. Together, the 10 benchmark data sets contain 2.74 million sequences, each annotated with either 0 or 1. A closer look at the sequence ids provided by [54] reveals that a significant number of sequence ids appear in multiple histone marker data sets. For instance, the sequence id “iTELL-Chr1_61” is annotated as 1 in the H3 histone marker data set, indicating that the sequence has histone occupancy greater than 1.2; the same sequence id is annotated as 0 in the H3K4me1 histone marker data set, indicating that its methylation level is less than 0.8, and as 1 in the H3K4me2 histone marker data set, indicating that its methylation level is greater than 1.2.

This analysis serves as the basis for formulating a multi-label data set, where each sequence id may have up to 20 labels instead of 1 label (0 or 1). Of the 20 labels, 10 represent the association of a sequence with the positive class of the 10 benchmark histone markers, whereas the other 10 represent its association with the negative class of the 10 benchmark histone markers. In this manner, the target label of each sequence is represented as a 20-dimensional binary vector, in which the entry for the positive or negative class of a particular histone marker is set to 1 when the sequence is associated with that class and to 0 otherwise. Because the performance of a predictor depends largely on whether the multi-label data set is imbalanced or balanced, two different versions of the multi-label histone sequence analysis data set are prepared.
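A 20-dimensional target vector of this kind could be assembled as sketched below. The ordering of the 20 slots and the helper name `label_vector` are assumptions made for illustration; the `_pos`/`_neg` suffixes follow the label names used later in this section.

```python
MARKERS = ["H3", "H4", "H3K9ac", "H3K14ac", "H4ac",
           "H3K4me1", "H3K4me2", "H3K4me3", "H3K36me3", "H3K79me3"]

# 20 label slots: one positive and one negative slot per marker (assumed ordering).
LABELS = [f"{marker}_{cls}" for marker in MARKERS for cls in ("pos", "neg")]


def label_vector(annotations):
    """annotations maps marker name -> 1 (positive class) or 0 (negative class)."""
    active = {f"{m}_{'pos' if y == 1 else 'neg'}" for m, y in annotations.items()}
    return [int(name in active) for name in LABELS]


# A sequence annotated as 1 in the H3 data set and 0 in the H3K4me1 data set.
vec = label_vector({"H3": 1, "H3K4me1": 0})
print(len(vec), sum(vec))  # 20 2
print([name for name, bit in zip(LABELS, vec) if bit])  # ['H3_pos', 'H3K4me1_neg']
```

Markers in which a sequence does not appear simply leave both of their slots at 0, so the number of ones in the vector equals the label cardinality of the sequence.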

The balanced version of the multi-label data set is prepared by eliminating all label cardinalities which have fewer than 25 sequences, whereas the imbalanced version is obtained by retaining them. In this manner, a balanced multi-label data set of 19,088 sequences and an imbalanced data set of 22,191 sequences are obtained, the sample-to-label distributions of which are illustrated in Fig. 5. In both the imbalanced and balanced versions, the hepta-label cardinality has the largest number of sequences, followed by the octa-label cardinality. In the imbalanced version, which retains all label cardinalities, the uni-label cardinality has the smallest number of sequences, followed by the bi-label cardinality; the histone markers and respective class distributions of these two cardinalities are depicted in Fig. 6. H3K79me3_neg has the largest number of uni-label sequences, whereas H3K4me2_pos-H3K4me2_pos has the largest number of bi-label sequences. Likewise, the histone markers and class distributions of tri-label and tetra-label cardinalities which have more than 25 sequences are shown in Fig. 7. The combination of H3K36me3_pos, H3K4me1_pos, and H3K4me2_pos has the largest number of tri-label sequences, whereas the combination of H3K4me2_pos, H3K79me3_pos, H3K36me3_pos, and H3K4me1_pos has the largest number of tetra-label sequences. Besides the conventional binary classification paradigm, Histone-Net is evaluated in the multi-label classification paradigm using the imbalanced and balanced versions of the multi-label histone sequence analysis data set.
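One possible reading of this filtering step is sketched below, under the explicit assumption that the count of 25 is applied to each distinct label combination; the text does not spell out the grouping, so both function names and this interpretation are hypothetical.

```python
from collections import Counter


def cardinality_histogram(label_vectors):
    """Number of sequences per label cardinality (count of active labels)."""
    return Counter(sum(vector) for vector in label_vectors)


def keep_frequent_combinations(label_vectors, min_count=25):
    """Drop sequences whose exact label combination occurs fewer than min_count times."""
    counts = Counter(tuple(vector) for vector in label_vectors)
    return [v for v in label_vectors if counts[tuple(v)] >= min_count]


# Toy data: one frequent combination (30 copies) and one rare combination (3 copies).
toy = [[1, 0, 1, 0]] * 30 + [[0, 1, 0, 0]] * 3
balanced = keep_frequent_combinations(toy, min_count=25)
print(len(toy), len(balanced))            # 33 30
print(dict(cardinality_histogram(toy)))   # {2: 30, 1: 3}
```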

Fig. 5 Sample-to-label distribution of two versions of multi-label classification data sets for histone occupancy and modification prediction

Fig. 6 Histone markers having fewer than 25 samples with uni- and bi-label cardinalities

Fig. 7 Histone markers having more than 25 samples with tri- and tetra-label cardinalities

Evaluation criteria

To perform a comprehensive evaluation of Histone-Net in the binary and multi-label classification paradigms under diverse settings, this paper uses 15 different evaluation metrics, short descriptions of which are provided in the following subsections.

Binary classification evaluation metrics

This paper uses the 4 most widely used binary classification evaluation metrics to assess the performance of the Histone-Net framework: accuracy, precision, recall, and F1-score. Accuracy is the simplest evaluation metric; it computes the ratio of correctly predicted histone sequences to the total number of histone sequences. Recall measures the true positive rate, and precision computes the percentage of correct positive predictions among all positive predictions. While recall accounts for type II errors, precision accounts for type I errors. F1-score balances the trade-off between recall and precision and is calculated as their harmonic mean. Mathematical expressions of the four evaluation metrics are given below:

$$\begin{aligned} f(x)= \left\{ \begin{array}{l} \hbox {Accuracy} = \frac{(O_{-}^{+} + O_{+}^{-})}{(O^{+}+O^{-})} \\ \hbox {(Rec)}\, \hbox {Recall} = \frac{(O_{-}^{+})}{(O_{-}^{+}+F_{+}^{-})}\\ \hbox {(Pre)}\, \hbox {Precision} = \frac{(O_{-}^{+})}{(O_{-}^{+}+F_{-}^{+})} \\ \hbox {F}1\hbox {-score} = 2*\frac{\hbox {Pre}*\hbox {Rec}}{\hbox {Pre}+\hbox {Rec}}. \end{array}\right. \end{aligned}$$
(1)

In Eq. 1, \(O^{+}\) represents the sum of false positives and true positives, and \(O^{-}\) the sum of false negatives and true negatives, so that \(O^{+}+O^{-}\) is the total number of histone sequences. Positive class histone sequences which are correctly identified as positive (true positives) are denoted by \(O^+_-\); similarly, negative class histone sequences which are correctly predicted as negative (true negatives) are denoted by \(O^-_+\). Histone sequences which are wrongly predicted as positive (false positives) are represented as \(F_{-}^{+}\), and histone sequences which are mistakenly classified as negative (false negatives) are represented as \(F_{+}^{-}\).
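As a worked example of Eq. 1, the following sketch computes the four metrics from lists of true and predicted labels. The function name and the toy labels are illustrative assumptions; returning 0.0 when a denominator is zero is a convention chosen for the example.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1-score from 0/1 label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives (type I errors)
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives (type II errors)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

Here 3 true positives, 3 true negatives, 1 false positive and 1 false negative give 6/8 for accuracy and 3/4 for both precision and recall.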

Multi-label classification evaluation metrics

To evaluate the performance of Histone-Net in multi-label classification paradigm, we have utilised 11 different evaluation metrics, namely, accuracy, precision, recall, F1-score, area under the receiver operating characteristic (AU-ROC), area under precision recall curve (AU-PRC), average precision, one error, hamming loss, ranking loss, and coverage.

For multi-label classification, accuracy computes the proportion of correctly predicted labels to the total number of labels (predicted or actual), averaged over all sequences in the test set. Precision computes the proportion of correctly predicted labels to the total number of predicted labels, recall calculates the true positive rate, and F1-score computes the harmonic mean of precision and recall, each averaged over all sequences of the test set. Hamming loss is the fraction of misclassified labels, and ranking loss computes the average fraction of reversely ordered label pairs for each histone sequence. Average precision computes the proportion of relevant labels among the highest ranked labels. One error calculates the fraction of histone sequences whose most confident predicted label is irrelevant. Coverage computes the average number of labels which must be included to cover the entire set of relevant labels. The area under the precision recall curve (AU-PRC) indicates the trade-off between precision and recall, and the area under the receiver operating characteristic (AU-ROC) reveals the performance of a classifier at various thresholds. For most evaluation metrics, higher values indicate better performance; the exceptions are hamming loss, ranking loss, one error, and coverage, where lower values indicate better performance. Mathematical expressions of the different multi-label evaluation metrics are provided below:

$$\begin{aligned} f(x)=\left\{ \begin{array}{l} \hbox {Accuracy}= \frac{1}{N}\sum _{i=1}^{N}\frac{|y^{(i)}\wedge \hat{y}^{(i)}|}{|y^{(i)}\vee \hat{y}^{(i)}|} \\ \hbox {(Rec)}\, \hbox {Recall}=\frac{1}{N}\sum _{i=1}^{N}\frac{|y^{(i)}\wedge \hat{y}^{(i)}|}{|{y}^{(i)}|} \\ \hbox {(Pre)}\, \hbox {Precision}=\frac{1}{N}\sum _{i=1}^{N}\frac{|y^{(i)} \wedge \hat{y}^{(i)}|}{|\hat{y}^{(i)}|} \\ \hbox {F}1\hbox {-score}= \frac{1}{N}\sum _{i=1}^{N}\frac{2\, \hbox {Pre}(x_{i})\, \hbox {Rec}(x_{i})}{\hbox {Pre}(x_{i})+\hbox {Rec}(x_{i})} \\ \hbox {Hamming Loss} = \frac{1}{NL}\sum _{i=1}^{N}\sum _{j=1}^{L}I(y_{j}^{(i)}\ne \hat{y}_{j}^{(i)}) \\ \hbox {Ranking Loss}(y,\hat{f}) = \frac{1}{N}\sum _{i=0}^{N-1}\frac{1}{\left\| y_{i} \right\| _{0}(n_{\mathrm{labels}}-{\left\| y_{i} \right\| _{0}})}\, |\{(k,l):\hat{f}_{ik}\le \hat{f}_{il},\, y_{ik}=1,\, y_{il}=0\}| \\ \hbox {Average Precision} (f,C)=\frac{1}{|N|}\sum _{i=1}^{|N|}\frac{1}{|B_{i}|}\sum _{b\in B_{i}}\frac{|\{b'|f_{\mathrm{rank}}(a_{i},b')\le f_{\mathrm{rank}}(a_{i},b),b' \in B_{i}\}|}{f_{\mathrm{rank}}(a_{i},b)} \\ \hbox {One Error} (F)=\frac{1}{N}\sum _{i=1}^{N} [[\hbox {arg max}\, F(x_i) \not \in Y_{i}^{+}]] \\ \hbox {Coverage}(y,\hat{f}) = \frac{1}{N}\sum _{i=0}^{N-1}\hbox {max}_{j:y_{ij}=1}\hbox {rank}_{ij} \\ \hbox {rank}_{ij}=|\{k:\hat{f}_{ik}\ge \hat{f}_{ij}\}| \end{array}\right. \end{aligned}$$
(2)

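In Eq. 2, N represents the total number of sequences, L (written \(n_{\mathrm{labels}}\) in the ranking loss) the number of labels, \(y^{(i)}\) the ground truth label vector of the ith sequence, \(\hat{y}^{(i)}\) the predicted label vector of the ith sequence, \(x_i\) the ith sequence, and j the label index. \(\hat{f}_{ij}\) denotes the score predicted for label j of the ith sequence, \(Y_{i}^{+}\) and \(B_i\) denote the set of relevant labels of the ith sequence (written \(a_i\) in the average precision), \(\wedge \) represents the logical AND operator, and \(\vee \) represents the logical OR operator.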
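Several of the expressions in Eq. 2 are easier to follow on a small example. The sketch below implements the example-based metrics, hamming loss, one error, ranking loss, and coverage directly from their definitions; AU-ROC, AU-PRC, and average precision are left out for brevity. Function names, the toy data, and the handling of empty label sets are assumptions made for the illustration.

```python
def example_based_scores(Y_true, Y_pred):
    """Accuracy, precision, recall, F1 and Hamming loss for 0/1 label matrices."""
    n, n_labels = len(Y_true), len(Y_true[0])
    acc = pre = rec = f1 = wrong = 0.0
    for y, y_hat in zip(Y_true, Y_pred):
        inter = sum(a and b for a, b in zip(y, y_hat))  # |y AND y_hat|
        union = sum(a or b for a, b in zip(y, y_hat))   # |y OR y_hat|
        p = inter / sum(y_hat) if sum(y_hat) else 0.0
        r = inter / sum(y) if sum(y) else 0.0
        acc += inter / union if union else 1.0
        pre, rec = pre + p, rec + r
        f1 += 2 * p * r / (p + r) if p + r else 0.0
        wrong += sum(a != b for a, b in zip(y, y_hat))
    return {"accuracy": acc / n, "precision": pre / n, "recall": rec / n,
            "f1": f1 / n, "hamming_loss": wrong / (n * n_labels)}


def ranking_scores(Y_true, scores):
    """One error, ranking loss and coverage from real-valued label scores."""
    n = len(Y_true)
    one_error = rank_loss = coverage = 0.0
    for y, f in zip(Y_true, scores):
        relevant = [j for j, t in enumerate(y) if t == 1]
        irrelevant = [j for j, t in enumerate(y) if t == 0]
        top = max(range(len(f)), key=f.__getitem__)  # most confident label
        one_error += top not in relevant
        if relevant and irrelevant:
            reversed_pairs = sum(f[k] <= f[l] for k in relevant for l in irrelevant)
            rank_loss += reversed_pairs / (len(relevant) * len(irrelevant))
        if relevant:
            # rank of label j = number of labels scored at least as high as j
            coverage += max(sum(fk >= f[j] for fk in f) for j in relevant)
    return {"one_error": one_error / n, "ranking_loss": rank_loss / n,
            "coverage": coverage / n}


Y_true = [[1, 0, 1], [0, 1, 0]]
scores = [[0.9, 0.2, 0.4], [0.1, 0.5, 0.6]]
Y_pred = [[int(s >= 0.5) for s in row] for row in scores]  # [[1, 0, 0], [0, 1, 1]]
print(example_based_scores(Y_true, Y_pred))
print(ranking_scores(Y_true, scores))
# {'one_error': 0.5, 'ranking_loss': 0.25, 'coverage': 2.0}
```

In the second toy sequence the irrelevant third label receives the highest score, which produces one error for that sequence, one reversed pair out of two, and a coverage of 2.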

Experimental setup

The proposed computational framework Histone-Net is implemented using scikit-learn [55] and PyTorch [56]. To perform a fair performance comparison of the Histone-Net predictive methodologies with the state-of-the-art histone occupancy and modification predictor [36], we follow Yin et al. [36]: for both the adapted DeepHistone [31] and the proposed Histone-Net approach, a randomly chosen 90% of sequences are used for training and the remaining 10% for testing. Of the 90% training sequences, 10% are used as a validation set. We perform large-scale experimentation to develop an optimal model for histone occupancy and modification prediction. We assess the performance of DNA2Vec and SuperDNA2Vec sequence embeddings of 8 different dimensions (25, 32, 50, 64, 75, 100, 128, 150) using three different machine learning classifiers (RFC, AdaBoost, SVM). These sequence embeddings are prepared by averaging the vectors of the higher order residues present in each sequence. We find that 100-dimensional DNA2Vec and SuperDNA2Vec sequence vectors achieve the best performance in the intra-domain and cross-domain binary classification paradigms, whereas 64-dimensional sequence vectors perform better in the multi-label classification paradigm. In all settings, the embedding generation model is trained for 10 epochs, and the dropout rate is varied from 0.1 to 0.5 only during SuperDNA2Vec embedding generation. Among different batch sizes (32, 64, 128, 256), learning rates (0.001 to 0.008), and decay rates (0.91 to 0.99), the proposed deep learning approach performs best when trained with a batch size of 64, an Adam [57, 58] optimizer decay rate of 0.95, and a learning rate of 0.008.
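The nested 90/10 split described above can be sketched as follows. The function name, the fixed seed, and the use of a plain (non-stratified) random shuffle are assumptions made for the illustration.

```python
import random


def train_val_test_split(items, test_frac=0.10, val_frac=0.10, seed=42):
    """Hold out test_frac for testing, then val_frac of the remainder for validation."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seed is an arbitrary choice for repeatability
    n_test = round(len(items) * test_frac)
    test, remainder = items[:n_test], items[n_test:]
    n_val = round(len(remainder) * val_frac)
    val, train = remainder[:n_val], remainder[n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(1000))
print(len(train), len(val), len(test))  # 810 90 100
```

Under this reading, the validation set amounts to 9% of all sequences and the effective training set to 81%.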

To find optimal hyperparameter values for the machine learning classifiers, we tune the quality of split, number of estimators, kernel type, degree, gamma, and penalty parameter using GridSearch [59]. We find that the tree-based classifiers perform better with the gini criterion and 50 estimators, whereas the discriminative SVM classifier performs better with a radial basis kernel, a degree of 2, a penalty parameter (C) of \(2^{-5}\), and a gamma value of 0.001. After finding the optimal DNA2Vec and SuperDNA2Vec sequence vectors and hyperparameter values, we experiment with 11 different higher order residues, ranging from 2 to 12, to determine which higher order residue-based sequence embeddings best help the classifier make accurate predictions. For 10 benchmark data sets of 3 different histone sequence analysis tasks, we generate 99 (tasks/unique data set groups * k-mers * machine learning classifiers = 3 * 11 * 3) predictive pipelines for DNA2Vec sequence embeddings and 330 (data sets * k-mers * machine learning classifiers = 10 * 11 * 3) predictive pipelines for SuperDNA2Vec embeddings. The proposed Histone-Net approach generates 110 (data sets * k-mers * deep learning classifiers = 10 * 11 * 1) predictive checkpoints. Among the different higher order residues, we find that 7-mer to 11-mer sequence embeddings perform best. More specifically, 7-mer DNA2Vec and 11-mer SuperDNA2Vec sequence embeddings achieve the best performance across all classifiers. To evaluate the adapted DeepHistone approach [31] across the 10 benchmark histone marker data sets, we use the source code and parameters provided by Yin et al. [31].
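
The pipeline counts quoted above follow from enumerating every combination of data set (or task group), k-mer size, and classifier. The sketch below reproduces that arithmetic. The variable names are illustrative, and the softmax entry simply stands for the single deep learning classifier.

```python
from itertools import product

kmer_sizes = range(2, 13)                    # 11 higher order residue sizes (2 to 12)
ml_classifiers = ["RFC", "AdaBoost", "SVM"]  # 3 machine learning classifiers
task_groups = range(3)                       # occupancy, acetylation, methylation
data_sets = range(10)                        # 10 benchmark data sets

# One predictive pipeline per combination.
dna2vec_pipelines = len(list(product(task_groups, kmer_sizes, ml_classifiers)))
superdna2vec_pipelines = len(list(product(data_sets, kmer_sizes, ml_classifiers)))
histone_net_checkpoints = len(list(product(data_sets, kmer_sizes, ["softmax"])))
```

The three counts evaluate to 99, 330, and 110, matching the figures in the text.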

Results and discussion

This section performs comprehensive extrinsic and intrinsic evaluation of DNA2Vec and SuperDNA2Vec sequence embeddings. It compares the performance of proposed approach with machine learning classifiers, adapted convolutional neural network-based approach DeepHistone [31], and state-of-the-art image representation-based predictor HCNN [36].

Extrinsic evaluation of DNA2Vec and SuperDNA2Vec distributed representation in intra-domain setting using binary classification paradigm

Performance values produced by 3 machine learning classifiers with unsupervised and supervised sequence embeddings using 5 different higher order residues (7-to-11) are provided in Supplementary Tables 1 and 2 in terms of 4 distinct evaluation measures. Furthermore, to analyze which higher order residue-based supervised and unsupervised sequence embeddings perform better with what classifier across 3 different histone sequence analysis tasks, performance values are compared in terms of accuracy and F1 score, graphical illustrations of which are provided in Fig. 8 and Supplementary Figure 4, respectively.

A critical performance analysis of Supplementary Table 1 indicates that, among the 3 machine learning classifiers, the SVM classifier performs better with unsupervised sequence embeddings across 3 distinct histone sequence analysis tasks. Over 2 histone occupancy, 5 histone methylation, and 3 histone acetylation prediction data sets, among all higher order residues (k-mers), 7-mer sequence embeddings produce better performance across most machine learning classifiers.

Furthermore, performance analysis of Supplementary Table 2 indicates that, with the inclusion of target histone occupancy and modification information, the performance of all 3 machine learning classifiers rises significantly over all histone occupancy, methylation, and acetylation data sets. Higher order residues such as 10-mers and 11-mers achieve superior performance across the 10 benchmark data sets belonging to histone occupancy, acetylation, and methylation prediction. Among all classifiers, the SuperDNA2Vec and softmax classifier-based approach (Histone-Net) achieves the best performance across the 10 benchmark data sets. The second-best performance for histone occupancy, acetylation, and methylation prediction tasks across different evaluation metrics is achieved by the SVM classifier.

Fig. 8 Difference in accuracy of classifiers over 10 different histone occupancy, acetylation and methylation prediction data sets produced using unsupervised and supervised sequence embeddings

In both bar graphs (Fig. 8, Supplementary Figure 4), it can be seen that SuperDNA2Vec sequence embeddings significantly raise the performance of all classifiers. The accuracy graph (Fig. 8) reveals that, using the SuperDNA2Vec sequence representation, over both histone occupancy prediction data sets (H4, H3), Histone-Net shows the largest gain among all 4 predictors, almost 37%, followed by 33.5% for RFC, 27% for AdaBoost, and 26% for SVM, when compared against the performance produced using unsupervised higher order residues-based sequence embeddings. For most methylation prediction data sets, RFC performance increases by 31%, SVM and AdaBoost performance increases by an average of 28%, and Histone-Net performance increases by 26%. Likewise, across all acetylation prediction data sets (H4ac, H3K14ac, H3K9ac), the SVM and AdaBoost performance increment is identical (31%), RFC shows the highest increase of 32%, and Histone-Net performance increases by 21% in terms of accuracy.

Assessing the performance rise in terms of F1-score (Supplementary Figure 4) when supervised sequence embeddings (SuperDNA2Vec) are used, Histone-Net shows the largest average gain among all predictors, 46%, on the histone occupancy prediction data sets (H4, H3). RFC shows an average increment of 47% over the acetylation prediction data sets, and AdaBoost achieves an average increment of 53% for most methylation prediction data sets.

Fig. 9 Unsupervisedly learned higher order residue-based embeddings of 10% histone sequences for histone occupancy, acetylation, and methylation tasks

Fig. 10 Supervisedly learned higher order residue-based embeddings of 10% histone sequences for histone occupancy, acetylation, and methylation tasks

To summarize, higher order residues help to disentangle rich contextual information and contribute to learning a better distributed representation. From the comparison of DNA2Vec and SuperDNA2Vec, it is evident that the SuperDNA2Vec sequence representation effectively exploits the correlation of higher order residues with target histone occupancy and modification information, which greatly influences the performance of all classifiers. Although the performance of all machine learning classifiers improves significantly when unsupervised sequence embeddings are replaced with supervised sequence embeddings (SuperDNA2Vec), the softmax classifier outperforms the top performance of all baseline classifiers by a decent margin across most histone markers. It is important to mention that machine learning classifiers (e.g., SVM) perform marginally better than the proposed Histone-Net approach on a few histone marker data sets, because these classifiers also make use of SuperDNA2Vec sequence embeddings. Here, the aim of analyzing the performance of different machine learning classifiers is to show that even simple classifiers can produce a performance similar to the proposed deep learning approach when given a better statistical sequence representation.

Fig. 11 Intrinsic evaluation of different higher order residues-based supervisedly learned sequence embeddings (SuperDNA2Vec) over histone occupancy (H3) data set

Intrinsic evaluation of DNA2Vec and SuperDNA2Vec distributed representation

To analyze whether the DNA2Vec or SuperDNA2Vec distributed representation learning scheme generates comprehensive occurrence and positional information-aware vectors for higher order residues, we perform intrinsic evaluation of both distributed representation generation schemes on three randomly selected data sets, one from each histone sequence analysis task. It can be concluded from Supplementary Tables 1 and 2 that, among the different higher order residues-based sequence embeddings, across the 10 histone marker data sets, most classifiers perform better with 7-mer unsupervised and 11-mer supervised sequence embeddings. Hence, we perform intrinsic evaluation using only 7-mer unsupervised and 11-mer supervised sequence embeddings. To effectively visualize the intrinsic patterns on each selected histone occupancy, acetylation, and methylation data set, we randomly select 10% of the sequences from the positive as well as the negative class and pass their 100-dimensional DNA2Vec representations to PCA, which reduces the dimensions from 100 to 25. These 25-dimensional sequence vectors are finally passed to a t-SNE visualizer that further reduces the dimensions and creates mappings in two-dimensional space, where the histone sequences of each class are represented with a unique color.
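
The first two steps of this visualization procedure, per-class sampling of 10% of the sequences and PCA reduction from 100 to 25 dimensions, can be sketched with NumPy as shown below. The random matrix is only a placeholder for real sequence embeddings, and the helper names are hypothetical. The final t-SNE projection to two dimensions is omitted here and would normally be delegated to a library implementation such as the one in scikit-learn.

```python
import numpy as np


def sample_per_class(labels, fraction, rng):
    """Return sorted indices of a random `fraction` of the sequences of each class."""
    chosen = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        size = max(1, int(round(fraction * len(idx))))
        chosen.extend(rng.choice(idx, size=size, replace=False))
    return np.array(sorted(chosen))


def pca_reduce(vectors, n_components):
    """Project mean-centred vectors onto their top principal components."""
    centred = vectors - vectors.mean(axis=0)
    _, _, components = np.linalg.svd(centred, full_matrices=False)
    return centred @ components[:n_components].T


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(400, 100))  # placeholder 100-dimensional sequence vectors
labels = np.array([0, 1] * 200)           # placeholder negative/positive class labels

subset = sample_per_class(labels, 0.10, rng)
reduced = pca_reduce(embeddings[subset], 25)  # 25-dimensional input for t-SNE
```

Sampling within each class keeps both classes represented in the plotted subset, which matters when one class is much smaller than the other.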

Likewise, to prepare the supervised higher order sequence embeddings (SuperDNA2Vec), each histone occupancy and modification data set is split into train, validation, and test sets containing 80%, 10%, and 10% of the histone sequences, respectively. For each histone sequence analysis task, we use the training and validation data to train the SuperDNA2Vec model, which is later used to generate embeddings for the histone sequences present in the test set. Visualization of the histone sequences present in the test set is performed in a similar manner using PCA and t-SNE.
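
A minimal sketch of such an 80/10/10 split is given below, assuming a simple random shuffle of sequence indices with a fixed seed. Whether the original split was stratified by class is not stated, so no stratification is applied here. The function name and the data set size of 1000 are illustrative.

```python
import random


def split_indices(n_sequences, train=0.8, validation=0.1, seed=0):
    """Shuffle indices and cut them into train, validation, and test parts."""
    indices = list(range(n_sequences))
    random.Random(seed).shuffle(indices)
    n_train = int(train * n_sequences)
    n_val = int(validation * n_sequences)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])  # the remainder forms the test set


train_idx, val_idx, test_idx = split_indices(1000)
```

Only the train and validation indices would be used to fit the embedding model, so that the test sequences remain unseen when their embeddings are generated.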

As depicted by the embedding charts in Figs. 9 and 10, the positive and negative class clusters produced by supervised sequence embeddings (SuperDNA2Vec) for all three selected data sets overlap far less than those produced by unsupervised sequence embeddings (DNA2Vec), where the clusters overlap heavily. In addition, it is quite evident that supervised sequence embeddings better capture the local and global semantic composition of higher order residues, which eventually helps to develop well-separated sequence clusters. Therefore, building on these clusters, for new instances, the unsupervised sequence representation is more likely to cause misclassification than the supervised sequence representation, where the chances of correctly classifying a given instance are far higher.

Furthermore, to intrinsically illustrate the discriminative impact of different higher order residues-based SuperDNA2Vec embeddings, Fig. 11 shows the positive and negative histone sequence vector clusters produced using 11 types of higher order residues for the histone occupancy data set H3. It is evident from Fig. 11 that the clusters overlap heavily at lower degree higher order residues (1-mer to 6-mer); however, they overlap less at upper degree higher order residues (7-mer to 11-mer).

Performance comparison of Histone-Net approach with adapted and state-of-the-art histone occupancy and modification predictors

We perform a fair performance comparison of the SuperDNA2Vec and softmax classifier-based Histone-Net approach with the image representation-based state-of-the-art histone occupancy and modification predictor, namely, “HCNN” [36], and the adapted convolutional neural network-based approach DeepHistone [31].

Fig. 12 Accuracy comparison of proposed Histone-Net approach with state-of-the-art HCNN [36] and adapted DeepHistone [31] approach. Accuracy values of HCNN are taken from Table 3 of Yin et al. [36] work and accuracy values of DeepHistone are obtained by processing raw histone sequences of various histone markers using convolutional neural network model presented by the authors [31]

Figure 12 reports the performance of the Histone-Net approach, the state-of-the-art HCNN [36] approach, and the adapted DeepHistone [31] approach over 10 different histone occupancy, methylation, and acetylation prediction data sets in terms of accuracy. Performance of the three approaches in terms of other evaluation metrics (e.g., precision, recall) is given in Supplementary Table 3. Considering that bag-of-words and one-hot encoding approaches suffer from data sparsity and fail to consider deep contextual information of higher order residues, HCNN transformed histone sequences into 2-dimensional image-like arrays using space-filling curves (Hilbert curve) to exploit the strengths of convolutional neural networks. HCNN [36] managed to outperform SVM [34], CNN, and LSTM [35, 36]-based approaches by a decent margin. As illustrated in Fig. 12, for both histone occupancy prediction data sets (H3, H4), HCNN achieves an average performance of around 87%. For most methylation prediction data sets, HCNN average performance falls around 75%, except for the H3K79me3 data set, where it crosses 80% when evaluated in terms of accuracy, precision, and recall. Similarly, for the acetylation prediction data sets, it achieves a performance of nearly 80%.

To perform a rich performance assessment of the proposed Histone-Net approach, we adapt the convolutional neural network-based approach DeepHistone proposed by Yin et al. [31]. The adapted DeepHistone processes raw histone sequences, whose statistical representation is generated using a one-hot encoding scheme. As shown in Fig. 12, DeepHistone achieves over 85% accuracy on the 2 histone occupancy prediction data sets, over 80% accuracy on only 1 of the 5 histone methylation prediction data sets (H3K79me3), and over 70% accuracy on the 3 histone acetylation prediction data sets. The reason behind the limited performance of DeepHistone [31] in comparison to the state-of-the-art HCNN [36] is the use of a sub-optimal statistical representation learning scheme which fails to capture translational invariance of residues.

Although the image-based sequence representation approach successfully extracts discriminative higher order residues-based features, it still fails to acquire the comprehensive semantic information of diverse higher order residues. In addition, it fails to extract the correlations and diverse interactions of higher order residues that are important for the histone sequence analysis task at hand. Due to these factors, the idea of incorporating target histone occupancy and modification information while generating distributed representation of higher order residues through language modelling (SuperDNA2Vec) proves extremely effective.

Table 2 Performance produced by Histone-Net using different degree higher order residue-based sequence representation in cross-domain setting

The SuperDNA2Vec sequence embedding scheme largely assists the softmax classifier in achieving superior performance to HCNN [36] across 10 different benchmark data sets. For the histone occupancy (H3, H4) and acetylation prediction data sets, Histone-Net average performance falls around 90% and 86%, respectively. For most methylation prediction data sets, its performance crosses 85%, and on the H3K79me3 data set it achieves the top performance of 90% when assessed in terms of accuracy, precision, and recall. Likewise, Histone-Net significantly outperforms the adapted DeepHistone [31] approach across all 10 benchmark histone marker data sets for the 3 different histone sequence analysis tasks. On average, Histone-Net achieves an increment of 3% for histone occupancy prediction, 15% for histone methylation prediction, and 14% for histone acetylation prediction. A similar performance trend in terms of other evaluation metrics can be seen in Supplementary Table 3.

In a nutshell, a different application of language modelling, in which sequence embeddings are learned by taking into account the association of higher order residues with target histone occupancy and histone modification, outperforms bag-of-words, one-hot encoding, image-based, and unsupervised higher order residues-based sequence representation schemes. SuperDNA2Vec effectively captures comprehensive long-range dependencies of higher order residues, their unique interactions, and their occurrence in the distribution of diverse histone proteins. Using SuperDNA2Vec, a simple softmax classifier outperforms the state-of-the-art image representation and CNN-based histone occupancy, acetylation, and methylation predictors by a significant margin.

Evaluation of Histone-Net approach in cross-domain binary classification paradigm

In biomedical sequence analysis, cross-domain evaluation is generally used to examine the effectiveness of computational predictors for practical applications. Considering that existing histone occupancy and modification predictors have not been evaluated in a cross-domain setting, we validate the generalization potential of the Histone-Net approach by performing cross-domain evaluation. In the cross-domain setting, for the histone occupancy sequence analysis task, Histone-Net is trained over the sequences of different histone markers belonging to histone occupancy and tested on the test set of one particular histone marker. This process is repeated to ensure that Histone-Net is evaluated on the test set of each histone marker belonging to histone occupancy. A similar process is repeated for the histone acetylation and methylation prediction tasks to ensure that Histone-Net is not biased towards specific histone marker data. In this manner, the cross-domain performance of Histone-Net over the test sets of 10 benchmark data sets belonging to 3 distinct histone sequence analysis tasks is computed.
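
One possible reading of this protocol is a leave-one-marker-out scheme within each task, sketched below. This is an assumption: the text does not state whether the training pool excludes the training portion of the held-out marker, and the sketch excludes it. The dictionary and function names are hypothetical, while the marker names are the ones used in this paper.

```python
# Markers grouped by histone sequence analysis task.
TASK_MARKERS = {
    "occupancy": ["H3", "H4"],
    "acetylation": ["H3K9ac", "H3K14ac", "H4ac"],
    "methylation": ["H3K4me1", "H3K4me2", "H3K4me3", "H3K36me3", "H3K79me3"],
}


def cross_domain_folds(task_markers):
    """Yield (task, training markers, held-out test marker) triples."""
    for task, markers in task_markers.items():
        for held_out in markers:
            # Assumed: train only on the other markers of the same task.
            train = [m for m in markers if m != held_out]
            yield task, train, held_out


folds = list(cross_domain_folds(TASK_MARKERS))
```

Each of the 10 markers is held out exactly once, which gives one cross-domain test result per benchmark data set.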

As in the intra-domain setting, in the cross-domain setting the performance of the Histone-Net approach is assessed using 5 different higher order residues (7 to 11). Table 2 summarizes the peak performance achieved by Histone-Net under different higher order residues over the test sets of the 10 benchmark data sets belonging to histone occupancy, acetylation, and methylation prediction. As indicated in Table 2, just as in the intra-domain setting, Histone-Net achieves a top performance of around 86% in terms of 4 different evaluation metrics using upper degree higher order residue (11-mer)-based sequence representation in the cross-domain setting for the task of histone acetylation prediction. For the 2 other histone sequence analysis tasks, histone occupancy and histone methylation prediction, unlike in the intra-domain setting, Histone-Net performs better with medium degree higher order residue (7-mer)-based sequence representation. For histone occupancy, Histone-Net achieves its best performance of 71% on the test set of the H3 histone marker, as compared to H4, across all 4 evaluation metrics. For histone acetylation prediction, Histone-Net achieves its best performance of 86% on the test set of H4ac, followed by H3K14ac and H3K9ac. For histone methylation prediction, Histone-Net attains its best performance of 72% on the test set of the H3K79me3 data set.

Empirical evaluation on the test sets of the 10 benchmark data sets belonging to 3 distinct histone sequence analysis tasks indicates that Histone-Net attains an average performance of more than 80% for histone acetylation (H3K14ac, H3K9ac, H4ac), 70% for histone occupancy, and 60% for histone methylation prediction. Across the 10 benchmark data sets, compared to the average performance of 87% and peak performance of 90% achieved in the intra-domain setting, Histone-Net attains an average performance of 70% with a peak performance of 86% in the cross-domain setting. Usually, the performance of computational approaches drops to a great extent when evaluated using the cross-domain paradigm; however, the SuperDNA2Vec and softmax classifier-based Histone-Net methodology shows decent generalization potential across a variety of data sets belonging to 3 distinct histone sequence analysis tasks.

Evaluation of Histone-Net approach in multi-label classification paradigm

This section briefly describes the performance of the Histone-Net multi-label classification paradigm on the imbalanced and balanced versions of the multi-label data set in terms of 11 different evaluation metrics. To better quantify the effectiveness of the Histone-Net multi-label classification paradigm, it compares the performance of Histone-Net with the adapted convolutional neural network-based methodology DeepHistone [31].

Supplementary Table 4 reports the performance produced by Histone-Net over the imbalanced and balanced versions of the multi-label histone occupancy and modification prediction data set using different higher order residues-based SuperDNA2Vec sequence embeddings. As is evident from Supplementary Table 4, across both versions of the multi-label data set, Histone-Net performance improves almost steadily as the degree of higher order residues increases, indicating that Histone-Net achieves its best performance with upper degree higher order residue (10-mer, 11-mer)-based supervised sequence embeddings across all evaluation metrics. Furthermore, across all higher order residue thresholds, Histone-Net achieves slightly better performance on the balanced version of the multi-label data set than on the imbalanced version in terms of most evaluation metrics. On the balanced version, Histone-Net achieves an F1-score of 72% and a hamming loss of 0.20, which improve on the figures achieved on the imbalanced version by 3% and 1%, respectively.
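
For readers unfamiliar with hamming loss, the sketch below uses its standard definition, the fraction of individual label decisions that are wrong, so lower values are better. It is assumed, not confirmed, that the paper uses this same definition. The label vectors are toy data.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label positions where prediction and ground truth disagree."""
    wrong = sum(t != p
                for truth, pred in zip(y_true, y_pred)
                for t, p in zip(truth, pred))
    total = sum(len(truth) for truth in y_true)
    return wrong / total


# Two toy sequences with five possible labels each (hypothetical data).
y_true = [[1, 0, 1, 1, 0], [0, 1, 0, 0, 1]]
y_pred = [[1, 0, 0, 1, 0], [0, 1, 0, 1, 1]]
loss = hamming_loss(y_true, y_pred)
```

In this toy case 2 of the 10 label decisions are wrong, giving a hamming loss of 0.2.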

Furthermore, Table 3 compares the performance of the Histone-Net approach with the adapted DeepHistone approach using the imbalanced and balanced versions of the multi-label histone occupancy and modification prediction data set. It is evident from Table 3 that Histone-Net significantly outperforms the adapted DeepHistone approach across all 11 evaluation metrics. On the imbalanced version of the data set, Histone-Net achieves an accuracy increment of 14%, precision increment of 8%, recall increment of 21%, F1-score increment of 15%, average precision increment of 13%, AU-PRC increment of 18%, AU-ROC increment of 12%, hamming loss improvement of 7%, and coverage improvement of 2%. On the balanced version of the data set, Histone-Net achieves increments of 10%, 4%, 15%, 10%, 15%, 17%, 10%, 5%, and 2% in terms of the same evaluation metrics. On average, Histone-Net exceeds the performance of the adapted DeepHistone by 10% and 8% on the imbalanced and balanced versions of the multi-label histone occupancy and modification prediction data set, respectively.

Table 3 Performance produced by proposed Histone-Net and adapted DeepHistone using optimal higher order residues-based SuperDNA2Vec sequence representation over imbalanced and balanced version of multi-label data set in terms of 11 distinct evaluation metrics

Furthermore, to analyze how effectively the proposed Histone-Net approach predicts the samples of each histone marker, we use a one-versus-all strategy to generate 20 binary confusion matrices for the 10 histone markers for the imbalanced (Supplementary Figure 5) and balanced (Supplementary Figure 6) versions of the multi-label data set. In the one-versus-all strategy, false positives, false negatives, true negatives, and true positives are computed by treating one particular histone marker class as positive and all other histone marker classes belonging to the same histone sequence analysis task as negative, irrespective of the multi-label problem. More specifically, we want to evaluate the behaviour of Histone-Net when there is a sizeable gap between the total number of positive and negative sequences.
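
The one-versus-all counting can be illustrated with the simplified sketch below. It treats a sequence as positive for a marker when that marker is among its labels and as rest otherwise, which is an assumption: the restriction of the negative class to markers of the same task, as described above, is not modelled. The function name, the three-marker subset, and the label vectors are illustrative.

```python
def one_vs_rest_confusion(y_true, y_pred, label_index):
    """Return (TP, FP, FN, TN) for one marker, counting all other outcomes as rest."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        t, p = truth[label_index], pred[label_index]
        if t and p:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


markers = ["H3", "H4", "H4ac"]  # illustrative subset of the 10 markers
y_true = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 1, 0], [1, 1, 0], [0, 0, 1]]
matrices = {m: one_vs_rest_confusion(y_true, y_pred, i)
            for i, m in enumerate(markers)}
```

The true positive and true negative rates discussed in the text correspond to TP / (TP + FN) and TN / (TN + FP) of each such matrix.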

A critical analysis of the 20 confusion matrices (Supplementary Figure 5) produced by Histone-Net over the imbalanced data set shows that overall 64% of positive histone marker appearances and 82% of negative histone marker appearances (represented as rest) are correctly predicted by Histone-Net. The top true positive figure of 89% is achieved on the H3 histone marker, whereas the top true negative figure of 94% is achieved on the H3ac histone marker. Among the histone markers related to occupancy, a higher number of positive and negative sequences are correctly classified on the H3 marker than on H4. Among the histone markers related to acetylation, the largest number of positive class sequences are accurately classified for the H3K14ac histone marker, with a performance of around 82%, whereas the largest number of negative class sequences are correctly predicted for the H4ac histone marker, with a performance of around 95%. Turning to the methylation-related histone markers, 80% of positive class sequences are correctly classified for 3 histone markers (H3K36me3, H3K4me1, H3K4me3), whereas 88% of negative class sequences are correctly predicted by Histone-Net on the H3K79me3 histone marker.

In a one-versus-all setting, the negative class receives more samples, which is why there usually exists a large gap between the performance on the positive and negative classes; here, however, the gap is not large, owing to the robustness of the Histone-Net approach towards imbalanced class distribution.

In the imbalanced version of the multi-label data set, there exist only 5 uni-label and 24 bi-label sequences, which are too few for effective model training. Considering that uni-label and bi-label sequences act as noise and harm the generalizability of the classifier, we perform experimentation on a balanced version of the multi-label data set prepared by eliminating all uni-label and bi-label sequences. Performance analysis of the 20 confusion matrices produced by Histone-Net on the balanced version of the multi-label data set (Supplementary Figure 6) reveals that overall 66% of positive and 80% of negative histone marker appearances are predicted accurately.

Across different histone markers, Histone-Net overall performs better on the balanced version of the multi-label data set than on the imbalanced version (Supplementary Figures 5 and 6). The highest true positive figure of 97% and true negative figure of 99% are achieved on the H3 and H4ac histone markers, respectively, an increment of 8% and 5% over the peak performance achieved by Histone-Net on the imbalanced version of the data set. Among the histone markers related to occupancy, the largest number of positive sequences are correctly predicted for the H3 histone marker, whereas a higher number of negative sequences are correctly classified for the H4 histone marker. Among the histone markers related to acetylation, a greater number of positive class sequences are accurately classified for the H3K14ac histone marker, with a performance of around 86%, whereas the largest number of negative class sequences are correctly predicted for the H4ac histone marker, with a performance of around 99%, exceeding the performance attained on the imbalanced version by 4%. Concerning the methylation-related histone markers, 93% of positive class sequences are correctly classified for the H3K36me3 histone marker, whereas 89% of negative class sequences are correctly predicted by Histone-Net on the H3K4me2 histone marker, achieving an increment of 13% and 1%, respectively, when compared with the top performance attained by Histone-Net on the methylation histone markers of the imbalanced data set.

In a nutshell, although the average performance figures attained by Histone-Net on the imbalanced and balanced versions of the multi-label data set are comparable across different evaluation metrics, a closer look indicates that, across most histone markers, Histone-Net achieves better performance on the balanced version than on the imbalanced version.

To identify to what degree Histone-Net manages to simultaneously predict histone occupancy, acetylation, and methylation areas in novel histone sequences, the performance of Histone-Net is analyzed over the imbalanced and balanced versions of the multi-label data set in terms of multi-label confusion matrices corresponding to unique sample-label distributions. For both versions of the multi-label data set, the number of correctly predicted histone markers out of all actual histone markers is highlighted in the confusion matrices (Fig. 13).

Fig. 13 Performance figures produced by Histone-Net over imbalanced and balanced version of multi-label data sets corresponding to unique sample-label distributions

For the imbalanced version of the multi-label data set, a closer look at the sample-label distribution and confusion matrix (Fig. 13) reveals that Histone-Net makes accurate predictions for 80% of uni-label sequences, as only 1 of 5 sequences is misclassified. For bi-label sequences, it correctly predicts 46% of sequences, as 11 of 24 bi-label sequences are correctly classified. For tri-label sequences, Histone-Net performance drops further, as it only identifies the target histone markers of 31% of sequences. For tetra-label sequences, Histone-Net achieves its best performance of around 83%, as it makes correct predictions for 260 of 314 sequences. Afterward, however, as the number of histone marker combinations increases, Histone-Net performance declines by a large margin from this peak of 83%, dropping to 51%, 25%, 26%, and 22% for penta-, hexa-, hepta-, and octa-label sequences, respectively, and reaching its lowest values of 15% and 9% on the highest label cardinalities, nona-label and deca-label sequences.
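
The per-cardinality analysis can be sketched as follows, assuming that a sequence counts as correctly predicted only when its full label set matches the ground truth; the paper may use a different criterion. The function name and the toy label vectors are illustrative, whereas the two count-based percentages at the end use figures quoted in the text.

```python
from collections import defaultdict


def exact_match_by_cardinality(y_true, y_pred):
    """Share of exactly predicted sequences, grouped by number of true labels."""
    hits, totals = defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        cardinality = sum(truth)
        totals[cardinality] += 1
        # Assumed criterion: every label of the sequence must be predicted correctly.
        hits[cardinality] += int(list(truth) == list(pred))
    return {c: hits[c] / totals[c] for c in sorted(totals)}


y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [1, 1, 0], [1, 1, 1]]
y_pred = [[1, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 0], [1, 1, 0]]
rates = exact_match_by_cardinality(y_true, y_pred)

# Counts quoted in the text, converted to rounded percentages.
bi_label_rate = round(100 * 11 / 24)       # 11 of 24 bi-label sequences
tetra_label_rate = round(100 * 260 / 314)  # 260 of 314 tetra-label sequences
```

The two rounded percentages evaluate to 46 and 83, in agreement with the bi-label and tetra-label figures reported above.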

On the other hand, for the balanced version of the multi-label data set, uni-label and bi-label samples are removed and only label cardinalities with 25 or more samples are kept. For tri-label and tetra-label sequences, Histone-Net correctly predicts 60% and 26% of sequences, respectively. Afterward, as label cardinality increases, unlike on the imbalanced data set, Histone-Net performance decreases by a smaller margin: it correctly predicts penta-, hexa-, hepta-, and octa-label sequences with 49%, 44%, 32%, and 23% accuracy, respectively. However, for the highest label cardinalities, nona-label and deca-label sequences, Histone-Net once again correctly predicts only 11% and 9% of sequences. Overall, Histone-Net achieves better performance on the balanced version of the multi-label data set. For the balanced version, Histone-Net achieves better performance with medium to higher levels of histone marker combinations, whereas for the imbalanced version it achieves better performance with low to medium levels of histone marker combinations. However, for the highest histone marker combinations (9 and 10), Histone-Net makes correct predictions for only about 10% of sequences.
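
The filtering rule used to build the balanced version can be sketched as below: sequences with one or two labels are dropped, and a label cardinality is kept only if at least 25 sequences share it. The function name is hypothetical, and the toy data reuse the 5 uni-label and 24 bi-label counts from the text together with made-up counts for the other cardinalities.

```python
from collections import Counter


def keep_frequent_cardinalities(label_vectors, min_samples=25, drop=(1, 2)):
    """Keep sequences whose label count is not dropped and is frequent enough."""
    counts = Counter(sum(v) for v in label_vectors)
    return [v for v in label_vectors
            if sum(v) not in drop and counts[sum(v)] >= min_samples]


# 5 uni-label and 24 bi-label sequences as in the text; other counts are made up.
toy = ([[1, 0, 0, 0]] * 5 + [[1, 1, 0, 0]] * 24
       + [[1, 1, 1, 0]] * 30 + [[1, 1, 1, 1]] * 10)
balanced = keep_frequent_cardinalities(toy)
```

In this toy case only the 30 tri-label sequences survive, because the tetra-label group has fewer than 25 members.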

To summarize, a comprehensive evaluation of Histone-Net in the multi-label sequence classification paradigm using imbalanced and balanced data sets demonstrates the capability of Histone-Net to simultaneously predict histone type, occupancy, acetylation, and methylation levels in histone sequences. Furthermore, Histone-Net achieves decent performance on both the imbalanced and balanced versions of the multi-label data set, showing its robustness in handling diverse data and sample-to-label distributions.

An interactive and user-friendly Histone-Net web server

The Histone-Net web server makes the lives of genomics researchers and practitioners easier by providing an interactive and user-friendly web interface capable of performing robust histone sequence analysis. Unlike other web servers developed for biomedical sequence analysis, which only support inference on new sequences, and even that for one particular task, the Histone-Net web server can be used to perform and visualize a multi-dimensional exploratory analysis of histone sequences. In addition, it can be used to train diverse predictive pipelines from scratch, tune the most crucial hyper-parameters, and perform inference on new histone sequences for a variety of histone sequence analysis tasks, including histone occupancy, acetylation, or methylation level prediction under the binary and multi-label classification paradigms. Different modules of the Histone-Net web server provide interactive session artifacts which can be downloaded and used for various purposes.

Conclusions

Researchers have experimented with a variety of statistical representation learning approaches and strategies (from distributed representation to attention mechanisms) to capture the relatedness of residues, their diverse interactions, and their distribution among different classes. This paper develops unsupervised higher order residue embeddings of histone sequences using the FastText model and explores a different application of the FastText model to develop SuperDNA2Vec, which encapsulates histone occupancy and modification information while learning higher order residue embeddings in a supervised manner. It presents a computational multi-paradigm framework, Histone-Net, to perform a comprehensive intrinsic and extrinsic evaluation of the 2 differently learned embeddings using 3 machine learning classifiers. In addition, it develops a precise deep neural network, Histone-Net, for robust histone occupancy, acetylation, and methylation prediction. A comprehensive empirical evaluation of Histone-Net in intra-domain and cross-domain settings under binary and multi-label classification paradigms demonstrates its effectiveness over the state of the art, its generalization potential across multiple histone markers, and its aptitude to simultaneously predict histone type, occupancy, acetylation, and methylation levels.