AC: A Compression Tool for Amino Acid Sequences

  • Original Research article
Interdisciplinary Sciences: Computational Life Sciences

Abstract

Advances in protein sequencing technologies have produced a huge volume of data that must be stored and transmitted. Compression can address this challenge. In this paper, we propose AC, a state-of-the-art method for lossless compression of amino acid sequences. The method relies on cooperation between finite-context models and substitutional tolerant Markov models. Compared with several general-purpose and special-purpose protein compressors, AC provides the best bit-rates. It also compresses the sequences nine times faster than its competitor, paq8l. In addition, using AC, we analyze the compressibility of a large number of sequences from different domains. The results show that viruses are the most difficult sequences to compress, archaea and bacteria are the second most difficult, and eukaryota are the easiest.
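To make the idea of a finite-context model concrete, the following is a minimal sketch, not the AC implementation. It estimates the ideal code length, in bits per symbol, that an adaptive order-k model would assign to an amino acid sequence. The function name `fcm_bits_per_symbol`, the 20-letter alphabet, and the additive smoothing parameter `alpha` are assumptions chosen for illustration. Substitutional tolerant Markov models and the cooperation between several models are not shown.

```python
import math
from collections import defaultdict

# Assumed alphabet: the 20 standard amino acids in one-letter notation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fcm_bits_per_symbol(seq, order=2, alpha=1.0):
    """Average ideal code length (bits per symbol) of `seq` under an
    adaptive finite-context model of the given order.

    Hypothetical helper for illustration; assumes `seq` contains only
    letters from AMINO_ACIDS.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)                       # context -> total count
    alphabet_size = len(AMINO_ACIDS)
    total_bits = 0.0

    for i, symbol in enumerate(seq):
        # The context is the (up to) `order` symbols preceding position i.
        context = seq[max(0, i - order):i]
        # Additive smoothing keeps unseen symbols at a nonzero probability.
        p = (counts[context][symbol] + alpha) / (
            totals[context] + alpha * alphabet_size
        )
        total_bits += -math.log2(p)
        # Update the model only after "coding" the symbol, as a decoder would.
        counts[context][symbol] += 1
        totals[context] += 1

    return total_bits / len(seq) if seq else 0.0
```

For a sequence with no exploitable structure, the estimate stays close to log2(20), about 4.32 bits per symbol; a highly repetitive sequence yields a much lower value. Comparing such estimates across groups of sequences is the kind of compressibility analysis the abstract describes, although AC itself uses a richer mixture of models.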



Acknowledgements

This work was supported by Programa Operacional Factores de Competitividade—COMPETE (FEDER), and by national funds through the Foundation for Science and Technology (FCT), in the context of the projects [UID/CEC/00127/2013, PTCD/EEI-SII/6608/2014] and the grant [PD/BD/113969/2015].

Author information

Corresponding author

Correspondence to Morteza Hosseini.

About this article

Cite this article

Hosseini, M., Pratas, D. & Pinho, A.J. AC: A Compression Tool for Amino Acid Sequences. Interdiscip Sci Comput Life Sci 11, 68–76 (2019). https://doi.org/10.1007/s12539-019-00322-1

  • DOI: https://doi.org/10.1007/s12539-019-00322-1
