Abstract
Advancement of protein sequencing technologies has led to the production of a huge volume of data that needs to be stored and transmitted. This challenge can be tackled by compression. In this paper, we propose AC, a state-of-the-art method for lossless compression of amino acid sequences. The proposed method works based on the cooperation between finite-context models and substitutional tolerant Markov models. Compared to several general-purpose and specific-purpose protein compressors, AC provides the best bit-rates. This method can also compress the sequences nine times faster than its competitor, paq8l. In addition, employing AC, we analyze the compressibility of a large number of sequences from different domains. The results show that viruses are the most difficult sequences to be compressed. Archaea and bacteria are the second most difficult ones, and eukaryota are the easiest sequences to be compressed.
Similar content being viewed by others
References
Cao MD, Dix TI, Allison L, Mears C (2007) A simple statistical algorithm for biological sequence compression. In: Proceedings of DCC ’07: data compression conference, IEEE Computer Society Washington, DC, USA, March 27– 29, 2007, Snowbird, Utah
Rafizul Haque S, Mallick T, Kabir I (2013) A new approach of protein sequence compression using repeat reduction and ASCII replacement. IOSR J Comput Eng (IOSR-JCE) 10:46–51
Ward M (2014) Virtual organisms: the startling world of artificial life. Macmillan, London
Baker MS, Ahn SB, Mohamedali A, Islam MT, Cantor D, Verhaert PD, Fanayan S, Sharma S, Nice EC, Connor M et al (2017) Accelerating the search for the missing proteins in the human proteome. Nat Commun 8:14271
Eckhard U, Marino G, Butler GS, Overall CM (2016) Positional proteomics in the era of the human proteome project on the doorstep of precision medicine. Biochimie 122:110–118
Legrain P, Aebersold R, Archakov A, Bairoch A, Bala K, Beretta L, Bergeron J, Borchers CH, Corthals GL, Costello CE et al (2011) The human proteome project: current state and future direction. Mol Cell Proteom 10(7):M111–009993
Paik YK, Jeong SK, Omenn GS, Uhlen M, Hanash S, Cho SY, Lee HJ, Na K, Choi EY, Yan F (2012) The chromosome-centric human proteome project for cataloging proteins encoded in the genome. Nat Biotechnol 30(3):221
Comm IUPAC-IUB (1968) A one-letter notation for amino acid sequences. Tentative rules. Biochemistry 7(8):2703–2705
Consortium U (2016) UniProt: the universal protein knowledgebase. Nucleic Acids Res 45(D1):D158–D169
Pratas D, Hosseini M, Pinho AJ (2018) Compression of amino acid sequences. In: Fdez-Riverola F, Mohamad M, Rocha M, De Paz J, Pinto T (eds) 11th International Conference on Practical Applications of Computational Biology & Bioinformatics. PACBB 2017. Advances in Intelligent Systems and Computing, vol 616. Springer, Cham
Benedetto D, Caglioti E, Chica C (2007) Compressing proteomes: the relevance of medium range correlations. Eur J Bioinform Syst Biol 2007:60723
Nalbantoglu ÖU, Russell DJ, Sayood K (2009) Data compression concepts and algorithms and their applications to bioinformatics. Entropy 12(1):34–52
Wootton J (1994) Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem 18:269–285
Yu J, Cao Z, Yang Y, Wang C, Su Z, Zhao Y, Wang J, Zhou Y (2016) Natural protein sequences are more intrinsically disordered than random sequences. Cell Mol Life Sci 73:2949–2957
Nevill-Manning CG, Witten IH (1999) Protein is incompressible. In: Proceedings of DCC ’99: Data Compression Conference. IEEE Computer Society Washington, DC, USA, March 29–31, Snowbird, Utah, USA
Adjeroh D, Nan F (2006) On compressibility of protein sequences. In: Proceedings of DCC ’06: data compression conference,. IEEE Computer Society Washington, DC, March 28–30, Snowbird, Utah, USA
Deorowicz S, Walczyszyn J, Debudaj-Grabysz A, Hancock J (2018) Comsa: compression of protein multiple sequence alignment files. Bioinformatics 35:227–234
Hategan A, Tabus I (2004) Protein is compressible. In: Signal Processing Symposium. NORSIG 2004. In: Proceedings of the 6th Nordic, 11 June 2004, IEEE, Espoo, Finland, Finland, pp 192–195
Matsumoto T, Sadakane K, Imai H (2000) Biological sequence compression algorithms. Genome Inf 11:43–52
Willems F, Shtarkov Y, Tjalkens T (1995) The context tree weighting method: basic properties. IEEE Trans Inf Theory 41:653–664
Burrows M, Wheeler DJ (1994) A block-sorting lossless data compression algorithm. Technical Report, Digital Equipment Corporation, Palo Alto, CA
Adjeroh D, Feng J (2003) The SCP and compressed domain analysis of biological sequences. In: Computational Systems Bioinformatics Conference, International IEEE Computer Society (2003) Stanford, California, Aug 11–14 2003
Hosseini M, Pratas D, Pinho AJ (2016) A survey on data compression methods for biological sequences. Information 7(4):56
Korber B, Farber RM, Wolpert DH, Lapedes AS (1993) Covariation of mutations in the v3 loop of human immunodeficiency virus type 1 envelope protein: an information theoretic analysis. Proc Natl Acad Sci 90(15):7176–7180
Pereira F, Duarte-Pereira S, Silva RM, Da Costa LT, Pereira-Castro I (2016) Evolution of the NET (NocA, Nlz, Elbow, TLP-1) protein family in metazoans: insights from expression data and phylogenetic analysis. Sci Rep 6:38,383
Hayashida M, Ruan P, Akutsu T (2014) Proteome compression via protein domain compositions. Methods 67(3):380–385
Pelta DA, Gonzalez JR, Krasnogor N (2005) Protein structure comparison through fuzzy contact maps and the universal similarity metric. In: EUSFLAT Conf., pp 1124–1129
Rocha J, Rosselló F, Segura J (2006) Compression ratios based on the universal similarity metric still yield protein distances far from CATH distances. arXiv:q-bio/0603007
Kolmogorov AN (1965) Three approaches to the quantitative definition of information. Probl Inf Transm 1(1):1–7
Soler-Toscano F, Zenil H (2017) A computable measure of algorithmic probability by finite approximations with an application to integer sequences. Complexity 2017:7208216
Zenil H, Hernández-Orozco S, Kiani N, Soler-Toscano F, Rueda-Toicen A, Tegnér J (2018) A decomposition method for global evaluation of Shannon entropy and local estimations of algorithmic complexity. Entropy 20(8):605
Zenil H, Kiani NA, Shang MM, Tegnér J (2018) Algorithmic complexity and reprogrammability of chemical structure networks. Parallel Process Lett 28(1):1850,005
Pinho AJ, Ferreira PJ, Neves AJ, Bastos CA (2011) On the representability of complete genomes by multiple competing finite-context (Markov) models. PLoS One 6(6):e21,588
Pinho AJ, Pratas D (2013) MFCompress: a compression tool for FASTA and multi-FASTA data. Bioinformatics 30(1):117–118
Pratas D, Hosseini M, Pinho AJ (2017) Substitutional tolerant Markov models for relative compression of DNA sequences. In: International conference on practical applications of computational biology & bioinformatics (PACBB). Springer, pp 265–272
Pratas D, Pinho AJ, Ferreira PJ (2016) Efficient compression of genomic sequences. In: Proceedings of DCC ’16: data compression conference. IEEE Computer Society Washington, DC, USA, March 30, April 1, Snowbird, Utah,
Sayood K (2017) Introduction to data compression. Morgan Kaufmann, Burlington
Pratas D, Silva RM, Pinho AJ, Ferreira PJ (2015) An alignment-free method to find and visualise rearrangements between pairs of DNA sequences. Sci Rep 5:10,203
Bywater RP (2015) Prediction of protein structural features from sequence data based on Shannon entropy and Kolmogorov complexity. PLoS One 10(4):e0119306
Hosseini M, Pratas D, Pinho AJ (2018) Cryfa: a secure encryption tool for genomic data. Bioinformatics 35:146–148
Pratas D, Pinho AJ (2017) On the approximation of the Kolmogorov complexity for DNA sequences. In: Iberian conference on pattern recognition and image analysis (IbPRIA), pp 259–266. Springer
Acknowledgements
This work was supported by Programa Operacional Factores de Competitividade—COMPETE (FEDER), and by national funds through the Foundation for Science and Technology (FCT), in the context of the projects [UID/CEC/00127/2013, PTCD/EEI-SII/6608/2014] and the grant [PD/BD/113969/2015].
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Hosseini, M., Pratas, D. & Pinho, A.J. AC: A Compression Tool for Amino Acid Sequences. Interdiscip Sci Comput Life Sci 11, 68–76 (2019). https://doi.org/10.1007/s12539-019-00322-1
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s12539-019-00322-1