Discriminating membrane proteins using the joint distribution of length sums of success and failure runs
Discriminating integral membrane proteins from water-soluble ones, has been over the past decades an important goal for computational molecular biology. A major drawback of methods appeared in the literature, is that most of the authors tried to solve the problem using machine learning techniques. Specifically, most of the proposed methods require an appropriate dataset for training, and consequently the results depend heavily on the suitability of the dataset, itself. Motivated by these facts, in this paper we develop a formal discrimination procedure that is based on appropriate theoretical observations on the sequence of hydrophobic and polar residues along the protein sequence and on the exact distribution of a two dimensional runs-related statistic defined on the same sequence. Specifically, for setting up our discrimination procedure, we study thoroughly the exact distribution of a bivariate random variable, which accumulates the exact lengths of both success and failure runs of at least a specific length in a sequence of Bernoulli trials. To investigate the properties of this bivariate random variable, we use the Markov chain embedding technique. Finally, we apply the new procedure to a well-defined dataset of proteins.
KeywordsRuns Scans Patterns Proteins analysis Markov chain embeddable random variables Bivariate Markov embedded random variables of polynomial type
- Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P (2002) Molecular biology of the cell, 4th edn. Garland Science, New YorkGoogle Scholar