Position weight matrix and Perceptron

  • Xuhua Xia


This chapter covers two frequently used algorithms for motif characterization and prediction. The first part is on position weight matrix (PWM) algorithm which takes in a set of aligned motif sequences (e.g., 5’ splice sites) and generates a list of PWM scores which may be taken as the signal strength for each input sequence. In this context, PWM is somewhat a sequence equivalent of principle component analysis in multivariate statistics. Also generated are a PWM for highlighting the non-randomness of motif patterns and the significance tests of the site-specific motif patterns. PWM also serves as an essential component in algorithms for de novo motif discovery, such as Gibbs sampler. How to specify background frequencies and pseudo-counts in computing PWM? What are their effects on the PWM outcome? How to control for Type I error rate involving multiple comparisons by using false discovery rate? All these topics are detailed in this chapter. The second part of the chapter covers perceptron which is the simplest artificial neural network with only a single neuron. Its function is equivalent to two-group discriminant function analysis in multivariate statistics, i.e., to identify nucleotide or amino acid sites that provide the greatest discriminant power between two sets of sequences. The algorithm of perceptron as well as its application and limitations are illustrated in detail.


  1. Aerts S, van Helden J, Sand O, Hassan BA (2007) Fine-tuning enhancer models to predict transcriptional targets across multiple genomes. PLoS One 2(11):e1115CrossRefPubMedPubMedCentralGoogle Scholar
  2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410CrossRefGoogle Scholar
  3. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402PubMedPubMedCentralCrossRefGoogle Scholar
  4. Arnqvist G (2006) Sensory exploitation and sexual conflict. Philos Trans R Soc Lond Ser B Biol Sci 361(1466):375–386CrossRefGoogle Scholar
  5. Bailey TL, Williams N, Misleh C, Li WW (2006) MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res 34(Web Server issue):W369–W373CrossRefPubMedPubMedCentralGoogle Scholar
  6. Ben-Gal I, Shani A, Gohr A, Grau J, Arviv S, Shmilovici A, Posch S, Grosse I (2005) Identification of transcription factor binding sites with variable-order Bayesian networks. Bioinformatics 21(11):2657–2666CrossRefPubMedGoogle Scholar
  7. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57(1):289–300Google Scholar
  8. Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple hypothesis testing under dependency. Ann Stat 29:1165–1188CrossRefGoogle Scholar
  9. Bhagwat M, Aravind L (2007) PSI-BLAST tutorial. Methods Mol Biol 395:177–186CrossRefPubMedPubMedCentralGoogle Scholar
  10. Brown M, Hughey R, Krogh A, Mian IS, Sjolander K, Haussler D (1993) Using Dirichlet mixture priors to derive hidden Markov models for protein families. Proc Int Conf Intell Syst Mol Biol 1:47–55PubMedGoogle Scholar
  11. Brumme ZL, Dong WW, Yip B, Wynhoven B, Hoffman NG, Swanstrom R, Jensen MA, Mullins JI, Hogg RS, Montaner JS et al (2004) Clinical and immunological impact of HIV envelope V3 sequence variation after starting initial triple antiretroviral therapy. AIDS 18(4):F1–F9CrossRefPubMedGoogle Scholar
  12. Chakrabarti S, Lanczycki CJ (2007) Analysis and prediction of functionally important sites in proteins. Protein Sci 16(1):4–13CrossRefPubMedPubMedCentralGoogle Scholar
  13. Claverie JM (1994) Some useful statistical properties of position-weight matrices. Comput Chem 18(3):287–294CrossRefPubMedGoogle Scholar
  14. Claverie JM, Audic S (1996) The statistical significance of nucleotide position-weight matrix matches. Comput Appl Biosci 12(5):431–439PubMedGoogle Scholar
  15. Delorenzi M, Speed T (2002) An HMM model for coiled-coil domains and a comparison with PSSM-based predictions. Bioinformatics 18(4):617–625CrossRefPubMedGoogle Scholar
  16. Dewey CN, Rogozin IB, Koonin EV (2006) Compensatory relationship between splice sites and exonic splicing signals depending on the length of vertebrate introns. BMC Genomics 7:311CrossRefPubMedPubMedCentralGoogle Scholar
  17. Fickett JW (1996) Quantitative discrimination of MEF2 sites. Mol Cell Biol 16(1):437–441CrossRefPubMedPubMedCentralGoogle Scholar
  18. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics 7:179–188CrossRefGoogle Scholar
  19. Frank C, Makkonen H, Dunlop TW, Matilainen M, Vaisanen S, Carlberg C (2005) Identification of pregnane X receptor binding sites in the regulatory regions of genes involved in bile acid homeostasis. J Mol Biol 346(2):505–519CrossRefPubMedGoogle Scholar
  20. Ge Y, Sealfon SC, Speed TP (2008) Some step-down procedures controlling the false discovery rate under dependence. Stat Sin 18(3):881–904PubMedPubMedCentralGoogle Scholar
  21. Gorodkin J, Heyer LJ, Brunak S, Stormo GD (1997) Displaying the information contents of structural RNA alignments: the structure logos. Comput Appl Biosci 13(6):583–586PubMedGoogle Scholar
  22. Grech B, Maetschke S, Mathews S, Timms P (2007) Genome-wide analysis of chlamydiae for promoters that phylogenetically footprint. Res Microbiol 158(8–9):685–693CrossRefPubMedGoogle Scholar
  23. Gumbel EJ (1958) Statistics of extremes. Columbia University Press, New YorkGoogle Scholar
  24. Hertz GZ, Stormo GD (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15(7–8):563–577CrossRefPubMedGoogle Scholar
  25. Hertz GZ, Hartzell GW 3rd, Stormo GD (1990) Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput Appl Biosci 6(2):81–92PubMedGoogle Scholar
  26. Hertzberg L, Izraeli S, Domany E (2007) STOP: searching for transcription factor motifs using gene expression. Bioinformatics 23(14):1737–1743CrossRefPubMedGoogle Scholar
  27. Hiard S, Maree R, Colson S, Hoskisson PA, Titgemeyer F, van Wezel GP, Joris B, Wehenkel L, Rigali S (2007) PREDetector: a new tool to identify regulatory elements in bacterial genomes. Biochem Biophys Res Commun 357(4):861–864CrossRefPubMedGoogle Scholar
  28. Hiller K, Grote A, Scheer M, Munch R, Jahn D (2004) PrediSi: prediction of signal peptides and their cleavage positions. Nucleic Acids Res 32(Web Server issue):W375–W379CrossRefPubMedPubMedCentralGoogle Scholar
  29. Hirst JD, Sternberg MJ (1991) Prediction of ATP/GTP-binding motif: a comparison of a perceptron type neural network and a consensus sequence method [corrected]. Protein Eng 4(6):615–623CrossRefPubMedGoogle Scholar
  30. Hua S, Sun Z (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics 17(8):721–728CrossRefPubMedGoogle Scholar
  31. Hwang S, Gou Z, Kuznetsov IB (2007) DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics 23(5):634–636CrossRefPubMedGoogle Scholar
  32. Jin VX, Leu YW, Liyanarachchi S, Sun H, Fan M, Nephew KP, Huang TH, Davuluri RV (2004b) Identifying estrogen receptor alpha target genes using integrated computational genomics and chromatin immunoprecipitation microarray. Nucleic Acids Res 32(22):6627–6635CrossRefPubMedPubMedCentralGoogle Scholar
  33. Jin VX, O’Geen H, Iyengar S, Green R, Farnham PJ (2007) Identification of an OCT4 and SRY regulatory module using integrated computational and experimental genomics approaches. Genome Res 17(6):807–817CrossRefPubMedPubMedCentralGoogle Scholar
  34. Kamalakaran S, Radhakrishnan SK, Beck WT (2005) Identification of estrogen-responsive genes using a genome-wide analysis of promoter elements for transcription factor binding sites. J Biol Chem 280(22):21491–21497CrossRefPubMedGoogle Scholar
  35. Kim H, Park H (2004) Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 3D local descriptor. Proteins 54(3):557–562CrossRefPubMedGoogle Scholar
  36. Kobayashi H, Akitomi J, Fujii N, Kobayashi K, Altaf-Ul-Amin M, Kurokawa K, Ogasawara N, Kanaya S (2007) The entire organization of transcription units on the Bacillus subtilis genome. BMC Genomics 8:197CrossRefPubMedPubMedCentralGoogle Scholar
  37. Kumar KK, Shelokar PS (2008) An SVM method using evolutionary information for the identification of allergenic proteins. Bioinformation 2(6):253–256CrossRefPubMedPubMedCentralGoogle Scholar
  38. Kuznetsov IB, Gou Z, Li R, Hwang S (2006) Using evolutionary and structural information to predict DNA-binding sites on DNA-binding proteins. Proteins 64(1):19–27CrossRefPubMedGoogle Scholar
  39. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262(5131):208–214CrossRefPubMedGoogle Scholar
  40. Lemay DG, Hwang DH (2006) Genome-wide identification of peroxisome proliferator response elements using integrated computational genomics. J Lipid Res 47(7):1583–1587CrossRefPubMedGoogle Scholar
  41. Li GL, Leong TY (2005) Feature selection for the prediction of translation initiation sites. Genomics Proteomics Bioinformatics 3(2):73–83CrossRefPubMedPubMedCentralGoogle Scholar
  42. Liang KC, Wang X, Anastassiou D (2008) A profile-based deterministic sequential Monte Carlo algorithm for motif discovery. Bioinformatics 24(1):46–55CrossRefPubMedGoogle Scholar
  43. Lin HC, Tsai K, Chang BL, Liu J, Young M, Hsu W, Louie S, Nicholas HB Jr, Rosenquist GL (2003) Prediction of tyrosine sulfation sites in animal viruses. Biochem Biophys Res Commun 312(4):1154–1158CrossRefPubMedGoogle Scholar
  44. Liu J, Louie S, Hsu W, Yu KM, Nicholas HB Jr, Rosenquist GL (2008) Tyrosine sulfation is prevalent in human chemokine receptors important in lung disease. Am J Respir Cell Mol Biol 38(6):738–743CrossRefPubMedPubMedCentralGoogle Scholar
  45. Ma P, Xia X (2011) Factors affecting splicing strength of yeast genes. Comp Funct Genomics:Article ID 212146, 13 pagesGoogle Scholar
  46. Mannella CA, Neuwald AF, Lawrence CE (1996) Detection of likely transmembrane beta strand regions in sequences of mitochondrial pore proteins using the Gibbs sampler. J Bioenerg Biomembr 28(2):163–169CrossRefPubMedGoogle Scholar
  47. Monteiro PT, Mendes ND, Teixeira MC, d’Orey S, Tenreiro S, Mira NP, Pais H, Francisco AP, Carvalho AM, Lourenco AB et al (2008) YEASTRACT-DISCOVERER: new tools to improve the analysis of transcriptional regulatory associations in Saccharomyces cerevisiae. Nucleic Acids Res 36(Database issue):D132–D136PubMedGoogle Scholar
  48. Nicholas HB Jr, Chan SS, Rosenquist GL (1999) Reevaluation of the determinants of tyrosine sulfation. Endocrine 11(3):285–292CrossRefPubMedGoogle Scholar
  49. Nichols T, Hayasaka S (2003) Controlling the familywise error rate in functional neuroimaging: a comparative review. Stat Meth Med Res 12(5):419–446CrossRefGoogle Scholar
  50. Obenauer JC, Cantley LC, Yaffe MB (2003) Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res 31(13):3635–3641CrossRefPubMedPubMedCentralGoogle Scholar
  51. Ostrin EJ, Li Y, Hoffman K, Liu J, Wang K, Zhang L, Mardon G, Chen R (2006) Genome-wide identification of direct targets of the Drosophila retinal determination protein Eyeless. Genome Res 16(4):466–476CrossRefPubMedPubMedCentralGoogle Scholar
  52. Pearson WR (1998) Empirical statistical estimates for sequence similarity searches. J Mol Biol 276(1):71–84CrossRefGoogle Scholar
  53. Ptashne M (1986) A genetic switch: gene control and phage lambda. Cell Press and Blackwell Scientific, Cambridge, MAGoogle Scholar
  54. Rashid M, Saha S, Raghava GP (2007) Support Vector Machine-based method for predicting subcellular localization of mycobacterial proteins using evolutionary information and motifs. BMC Bioinformatics 8:337CrossRefPubMedPubMedCentralGoogle Scholar
  55. Regier JC, Shultz JW, Zwick A, Hussey A, Ball B, Wetzer R, Martin JW, Cunningham CW (2010) Arthropod relationships revealed by phylogenomic analysis of nuclear protein-coding sequences. Nature 463(7284):1079–1083CrossRefPubMedGoogle Scholar
  56. Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 65(6):386–408CrossRefPubMedGoogle Scholar
  57. Ryan MJ, Fox JH, Wilczynski W, Rand AS (1990) Sexual selection for sensory exploitation in the frog Physalaemus pustulosus. Nature 343:66–67CrossRefPubMedGoogle Scholar
  58. Sakaluk SK (2000) Sensory exploitation as an evolutionary origin to nuptial food gifts in insects. Proc Biol Sci 267(1441):339–343CrossRefPubMedPubMedCentralGoogle Scholar
  59. Schneider TD, Stephens RM (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 18(20):6097–6100CrossRefPubMedPubMedCentralGoogle Scholar
  60. Schwartz S, Silva J, Burstein D, Pupko T, Eyras E, Ast G (2008) Large-scale comparative analysis of splicing signals and their corresponding splicing factors in eukaryotes. Genome Res 18(1):88–103CrossRefPubMedPubMedCentralGoogle Scholar
  61. Sharp PM, Li WH (1987) The codon adaptation index – a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 15(3):1281–1295CrossRefPubMedPubMedCentralGoogle Scholar
  62. Sheth N, Roca X, Hastings ML, Roeder T, Krainer AR, Sachidanandam R (2006) Comprehensive splice-site analysis using comparative genomics. Nucl Acids Res 34(14):3955–3967CrossRefPubMedGoogle Scholar
  63. Sim J, Kim SY, Lee J (2005) PPRODO: prediction of protein domain boundaries using neural networks. Proteins 59(3):627–632CrossRefPubMedGoogle Scholar
  64. Staden R (1984) Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res 12(1 Pt 2):505–519CrossRefPubMedPubMedCentralGoogle Scholar
  65. Stormo GD, Schneider TD, Gold L, Ehrenfeucht A (1982a) Use of the ‘Perceptron’ algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res 10(9):2997–3011PubMedPubMedCentralCrossRefGoogle Scholar
  66. Stormo GD, Schneider TD, Gold LM (1982b) Characterization of translational initiation sites in E. coli. Nucleic Acids Res 10(9):2971–2996CrossRefPubMedPubMedCentralGoogle Scholar
  67. Stormo GD, Schneider TD, Gold L (1986) Quantitative analysis of the relationship between nucleotide sequence and functional activity. Nucleic Acids Res 14(16):6661–6679CrossRefPubMedPubMedCentralGoogle Scholar
  68. Sun XY, Yang Q, Xia X (2013) An improved implementation of effective Number of Codons (Nc). Mol Biol Evol 30:191–196CrossRefPubMedGoogle Scholar
  69. Vert JP (2002) Support vector machine prediction of signal peptide cleavage site using a new class of kernels for strings. Pac Symp Biocomput 7:649–660Google Scholar
  70. Vlasschaert C, Xia X, Gray DA (2016) Selection preserves Ubiquitin Specific Protease 4 alternative exon skipping in therian mammals. Sci Rep 6:20039CrossRefPubMedPubMedCentralGoogle Scholar
  71. Wright F (1990) The ‘effective number of codons’ used in a gene. Gene 87(1):23–29CrossRefPubMedGoogle Scholar
  72. Xia X (2001) Data analysis in molecular biology and evolution. Kluwer Academic Publishers, BostonGoogle Scholar
  73. Xia X (2007c) An improved implementation of codon adaptation index. Evol Bioinforma 3:53–58CrossRefGoogle Scholar
  74. Xia X (2012b). Position Weight Matrix, Gibbs Sampler, and the associated significance tests in motif characterization and prediction. Scientifica 2012: Article ID 917540, 15 ppGoogle Scholar
  75. Xia X (2013) DAMBE5: a comprehensive software package for data analysis in molecular biology and evolution. Mol Biol Evol 30:1720–1728PubMedPubMedCentralCrossRefGoogle Scholar
  76. Xia X (2017d) Self-organizing map for characterizing heterogeneous nucleotide and amino acid sequence motifs. Computation 5(4):43CrossRefGoogle Scholar
  77. Xia X, Xie Z (2001b) DAMBE: software package for data analysis in molecular biology and evolution. J Hered 92(4):371–373CrossRefPubMedGoogle Scholar
  78. Young JA, Johnson JR, Benner C, Yan SF, Chen K, Le Roch KG, Zhou Y, Winzeler EA (2008) In silico discovery of transcription regulatory elements in Plasmodium falciparum. BMC Genomics 9:70CrossRefPubMedPubMedCentralGoogle Scholar
  79. Yu KM, Liu J, Moy R, Lin HC, Nicholas HB Jr, Rosenquist GL (2002) Prediction of tyrosine sulfation in seven-transmembrane peptide receptors. Endocrine 19(3):333–338CrossRefPubMedGoogle Scholar
  80. Yuan ZC, Zaheer R, Morton R, Finan TM (2006) Genome prediction of PhoB regulated promoters in Sinorhizobium meliloti and twelve proteobacteria. Nucleic Acids Res 34(9):2686–2697CrossRefPubMedPubMedCentralGoogle Scholar
  81. Zheng CL, Fu XD, Gribskov M (2005) Characteristics and regulatory elements defining constitutive splicing and different modes of alternative splicing in human and mouse. RNA 11(12):1777–1787CrossRefPubMedPubMedCentralGoogle Scholar
  82. Zien A, Ratsch G, Mika S, Scholkopf B, Lengauer T, Muller KR (2000) Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 16(9):799–807CrossRefPubMedGoogle Scholar

Copyright information

© Springer Science+Business Media LLC 2018

Authors and Affiliations

  • Xuhua Xia
    • 1
  1. 1.University of Ottawa CAREG and Biology DepartmentOttawaCanada

Personalised recommendations