Statistical Methods & Applications, Volume 26, Issue 2, pp 251–272

Discriminating membrane proteins using the joint distribution of length sums of success and failure runs

  • Sotirios Bersimis
  • Athanasios Sachlas
  • Pantelis G. Bagos
Original Paper


Discriminating integral membrane proteins from water-soluble ones has been an important goal of computational molecular biology over the past decades. A major drawback of the methods that have appeared in the literature is that most authors tried to solve the problem using machine learning techniques. Specifically, most of the proposed methods require an appropriate dataset for training, and consequently the results depend heavily on the suitability of that dataset. Motivated by these facts, in this paper we develop a formal discrimination procedure based on theoretical observations about the sequence of hydrophobic and polar residues along the protein sequence and on the exact distribution of a two-dimensional runs-related statistic defined on the same sequence. To set up the discrimination procedure, we study in detail the exact distribution of a bivariate random variable that accumulates the lengths of the success runs and of the failure runs of at least a specified length in a sequence of Bernoulli trials. To investigate the properties of this bivariate random variable, we use the Markov chain embedding technique. Finally, we apply the new procedure to a well-defined dataset of proteins.
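The bivariate statistic described in the abstract can be illustrated with a short Python sketch. It is not the authors' implementation: the function names `run_length_sums` and `joint_distribution`, the separate thresholds `k1` and `k2`, and the assumption of independent trials with a common success probability `p` are all choices made here for illustration. The second function computes the exact joint distribution by a forward recursion over states (accumulated sums, last symbol, capped current run length), which is in the spirit of a Markov chain embedding but should not be read as the paper's construction.

```python
from collections import defaultdict
from itertools import groupby


def run_length_sums(seq, k1, k2):
    """Return (S, F): total length of success runs (1s) of length >= k1
    and total length of failure runs (0s) of length >= k2."""
    s_sum = f_sum = 0
    for value, group in groupby(seq):
        length = sum(1 for _ in group)
        if value == 1 and length >= k1:
            s_sum += length
        elif value == 0 and length >= k2:
            f_sum += length
    return s_sum, f_sum


def joint_distribution(n, p, k1, k2):
    """Exact joint pmf of (S, F) for n i.i.d. Bernoulli(p) trials.

    Assumes k1 >= 1 and k2 >= 1. A state is (S, F, last symbol, r), where
    r is the current run length capped at that symbol's threshold.
    """
    dist = {(0, 0, None, 0): 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for (s, f, last, r), prob in dist.items():
            for b, pb in ((1, p), (0, 1.0 - p)):
                k = k1 if b == 1 else k2
                prev = r if b == last else 0  # run length before this trial
                if prev == k:        # run already counted: extend it by 1
                    gain = 1
                elif prev == k - 1:  # run reaches the threshold: count all k
                    gain = k
                else:                # run still too short to count
                    gain = 0
                ns, nf = (s + gain, f) if b == 1 else (s, f + gain)
                nxt[(ns, nf, b, min(prev + 1, k))] += prob * pb
        dist = nxt
    joint = defaultdict(float)
    for (s, f, _, _), prob in dist.items():
        joint[(s, f)] += prob
    return dict(joint)
```

For a protein, the binary sequence would come from coding each residue as hydrophobic (1) or polar (0). That coding is not specified in the abstract, so any particular residue classification used with this sketch is an additional assumption.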


Keywords: Runs · Scans · Patterns · Proteins analysis · Markov chain embeddable random variables · Bivariate Markov embedded random variables of polynomial type



Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Sotirios Bersimis (1, corresponding author)
  • Athanasios Sachlas (1)
  • Pantelis G. Bagos (2)
  1. Department of Statistics and Insurance Science, University of Piraeus, Piraeus, Greece
  2. Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
