AC: A Compression Tool for Amino Acid Sequences

  • Morteza HosseiniEmail author
  • Diogo Pratas
  • Armando J. Pinho
Original Research article


Advancement of protein sequencing technologies has led to the production of a huge volume of data that needs to be stored and transmitted. This challenge can be tackled by compression. In this paper, we propose AC, a state-of-the-art method for lossless compression of amino acid sequences. The proposed method works based on the cooperation between finite-context models and substitutional tolerant Markov models. Compared to several general-purpose and specific-purpose protein compressors, AC provides the best bit-rates. This method can also compress the sequences nine times faster than its competitor, paq8l. In addition, employing AC, we analyze the compressibility of a large number of sequences from different domains. The results show that viruses are the most difficult sequences to be compressed. Archaea and bacteria are the second most difficult ones, and eukaryota are the easiest sequences to be compressed.


Protein Compression Substitutional tolerant Markov model Finite-context model Kolmogorov complexity 



This work was supported by Programa Operacional Factores de Competitividade—COMPETE (FEDER), and by national funds through the Foundation for Science and Technology (FCT), in the context of the projects [UID/CEC/00127/2013, PTCD/EEI-SII/6608/2014] and the grant [PD/BD/113969/2015].


  1. 1.
    Cao MD, Dix TI, Allison L, Mears C (2007) A simple statistical algorithm for biological sequence compression. In: Proceedings of DCC ’07: data compression conference, IEEE Computer Society Washington, DC, USA, March 27– 29, 2007, Snowbird, UtahGoogle Scholar
  2. 2.
    Rafizul Haque S, Mallick T, Kabir I (2013) A new approach of protein sequence compression using repeat reduction and ASCII replacement. IOSR J Comput Eng (IOSR-JCE) 10:46–51CrossRefGoogle Scholar
  3. 3.
    Ward M (2014) Virtual organisms: the startling world of artificial life. Macmillan, LondonGoogle Scholar
  4. 4.
    Baker MS, Ahn SB, Mohamedali A, Islam MT, Cantor D, Verhaert PD, Fanayan S, Sharma S, Nice EC, Connor M et al (2017) Accelerating the search for the missing proteins in the human proteome. Nat Commun 8:14271Google Scholar
  5. 5.
    Eckhard U, Marino G, Butler GS, Overall CM (2016) Positional proteomics in the era of the human proteome project on the doorstep of precision medicine. Biochimie 122:110–118CrossRefGoogle Scholar
  6. 6.
    Legrain P, Aebersold R, Archakov A, Bairoch A, Bala K, Beretta L, Bergeron J, Borchers CH, Corthals GL, Costello CE et al (2011) The human proteome project: current state and future direction. Mol Cell Proteom 10(7):M111–009993Google Scholar
  7. 7.
    Paik YK, Jeong SK, Omenn GS, Uhlen M, Hanash S, Cho SY, Lee HJ, Na K, Choi EY, Yan F (2012) The chromosome-centric human proteome project for cataloging proteins encoded in the genome. Nat Biotechnol 30(3):221CrossRefGoogle Scholar
  8. 8.
    Comm IUPAC-IUB (1968) A one-letter notation for amino acid sequences. Tentative rules. Biochemistry 7(8):2703–2705CrossRefGoogle Scholar
  9. 9.
    Consortium U (2016) UniProt: the universal protein knowledgebase. Nucleic Acids Res 45(D1):D158–D169Google Scholar
  10. 10.
    Pratas D, Hosseini M, Pinho AJ (2018) Compression of amino acid sequences. In: Fdez-Riverola F, Mohamad M, Rocha M, De Paz J, Pinto T (eds) 11th International Conference on Practical Applications of Computational Biology & Bioinformatics. PACBB 2017. Advances in Intelligent Systems and Computing, vol 616. Springer, ChamGoogle Scholar
  11. 11.
    Benedetto D, Caglioti E, Chica C (2007) Compressing proteomes: the relevance of medium range correlations. Eur J Bioinform Syst Biol 2007:60723Google Scholar
  12. 12.
    Nalbantoglu ÖU, Russell DJ, Sayood K (2009) Data compression concepts and algorithms and their applications to bioinformatics. Entropy 12(1):34–52CrossRefGoogle Scholar
  13. 13.
    Wootton J (1994) Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem 18:269–285CrossRefGoogle Scholar
  14. 14.
    Yu J, Cao Z, Yang Y, Wang C, Su Z, Zhao Y, Wang J, Zhou Y (2016) Natural protein sequences are more intrinsically disordered than random sequences. Cell Mol Life Sci 73:2949–2957CrossRefGoogle Scholar
  15. 15.
    Nevill-Manning CG, Witten IH (1999) Protein is incompressible. In: Proceedings of DCC ’99: Data Compression Conference. IEEE Computer Society Washington, DC, USA, March 29–31, Snowbird, Utah, USAGoogle Scholar
  16. 16.
    Adjeroh D, Nan F (2006) On compressibility of protein sequences. In: Proceedings of DCC ’06: data compression conference,. IEEE Computer Society Washington, DC, March 28–30, Snowbird, Utah, USAGoogle Scholar
  17. 17.
    Deorowicz S, Walczyszyn J, Debudaj-Grabysz A, Hancock J (2018) Comsa: compression of protein multiple sequence alignment files. Bioinformatics 35:227–234CrossRefGoogle Scholar
  18. 18.
    Hategan A, Tabus I (2004) Protein is compressible. In: Signal Processing Symposium. NORSIG 2004. In: Proceedings of the 6th Nordic, 11 June 2004, IEEE, Espoo, Finland, Finland, pp 192–195Google Scholar
  19. 19.
    Matsumoto T, Sadakane K, Imai H (2000) Biological sequence compression algorithms. Genome Inf 11:43–52Google Scholar
  20. 20.
    Willems F, Shtarkov Y, Tjalkens T (1995) The context tree weighting method: basic properties. IEEE Trans Inf Theory 41:653–664CrossRefGoogle Scholar
  21. 21.
    Burrows M, Wheeler DJ (1994) A block-sorting lossless data compression algorithm. Technical Report, Digital Equipment Corporation, Palo Alto, CAGoogle Scholar
  22. 22.
    Adjeroh D, Feng J (2003) The SCP and compressed domain analysis of biological sequences. In: Computational Systems Bioinformatics Conference, International IEEE Computer Society (2003) Stanford, California, Aug 11–14 2003Google Scholar
  23. 23.
    Hosseini M, Pratas D, Pinho AJ (2016) A survey on data compression methods for biological sequences. Information 7(4):56CrossRefGoogle Scholar
  24. 24.
    Korber B, Farber RM, Wolpert DH, Lapedes AS (1993) Covariation of mutations in the v3 loop of human immunodeficiency virus type 1 envelope protein: an information theoretic analysis. Proc Natl Acad Sci 90(15):7176–7180CrossRefGoogle Scholar
  25. 25.
    Pereira F, Duarte-Pereira S, Silva RM, Da Costa LT, Pereira-Castro I (2016) Evolution of the NET (NocA, Nlz, Elbow, TLP-1) protein family in metazoans: insights from expression data and phylogenetic analysis. Sci Rep 6:38,383CrossRefGoogle Scholar
  26. 26.
    Hayashida M, Ruan P, Akutsu T (2014) Proteome compression via protein domain compositions. Methods 67(3):380–385CrossRefGoogle Scholar
  27. 27.
    Pelta DA, Gonzalez JR, Krasnogor N (2005) Protein structure comparison through fuzzy contact maps and the universal similarity metric. In: EUSFLAT Conf., pp 1124–1129Google Scholar
  28. 28.
    Rocha J, Rosselló F, Segura J (2006) Compression ratios based on the universal similarity metric still yield protein distances far from CATH distances. arXiv:q-bio/0603007
  29. 29.
    Kolmogorov AN (1965) Three approaches to the quantitative definition of information. Probl Inf Transm 1(1):1–7Google Scholar
  30. 30.
    Soler-Toscano F, Zenil H (2017) A computable measure of algorithmic probability by finite approximations with an application to integer sequences. Complexity 2017:7208216CrossRefGoogle Scholar
  31. 31.
    Zenil H, Hernández-Orozco S, Kiani N, Soler-Toscano F, Rueda-Toicen A, Tegnér J (2018) A decomposition method for global evaluation of Shannon entropy and local estimations of algorithmic complexity. Entropy 20(8):605CrossRefGoogle Scholar
  32. 32.
    Zenil H, Kiani NA, Shang MM, Tegnér J (2018) Algorithmic complexity and reprogrammability of chemical structure networks. Parallel Process Lett 28(1):1850,005CrossRefGoogle Scholar
  33. 33.
    Pinho AJ, Ferreira PJ, Neves AJ, Bastos CA (2011) On the representability of complete genomes by multiple competing finite-context (Markov) models. PLoS One 6(6):e21,588CrossRefGoogle Scholar
  34. 34.
    Pinho AJ, Pratas D (2013) MFCompress: a compression tool for FASTA and multi-FASTA data. Bioinformatics 30(1):117–118CrossRefGoogle Scholar
  35. 35.
    Pratas D, Hosseini M, Pinho AJ (2017) Substitutional tolerant Markov models for relative compression of DNA sequences. In: International conference on practical applications of computational biology & bioinformatics (PACBB). Springer, pp 265–272Google Scholar
  36. 36.
    Pratas D, Pinho AJ, Ferreira PJ (2016) Efficient compression of genomic sequences. In: Proceedings of DCC ’16: data compression conference. IEEE Computer Society Washington, DC, USA, March 30, April 1, Snowbird, Utah,Google Scholar
  37. 37.
    Sayood K (2017) Introduction to data compression. Morgan Kaufmann, BurlingtonGoogle Scholar
  38. 38.
    Pratas D, Silva RM, Pinho AJ, Ferreira PJ (2015) An alignment-free method to find and visualise rearrangements between pairs of DNA sequences. Sci Rep 5:10,203CrossRefGoogle Scholar
  39. 39.
    Bywater RP (2015) Prediction of protein structural features from sequence data based on Shannon entropy and Kolmogorov complexity. PLoS One 10(4):e0119306CrossRefGoogle Scholar
  40. 40.
    Hosseini M, Pratas D, Pinho AJ (2018) Cryfa: a secure encryption tool for genomic data. Bioinformatics 35:146–148CrossRefGoogle Scholar
  41. 41.
    Pratas D, Pinho AJ (2017) On the approximation of the Kolmogorov complexity for DNA sequences. In: Iberian conference on pattern recognition and image analysis (IbPRIA), pp 259–266. SpringerGoogle Scholar

Copyright information

© International Association of Scientists in the Interdisciplinary Areas 2019

Authors and Affiliations

  1. 1.IEETA/DETIUniversity of AveiroAveiroPortugal

Personalised recommendations