Low-dimensional representation of genomic sequences

Tillquist, Richard C.; Lladser, Manuel E.

doi:10.1007/s00285-019-01348-1

Published: 30 March 2019

Volume 79, pages 1–29 (2019)

Journal of Mathematical Biology

Abstract

Numerous data analysis and data mining techniques require that data be embedded in a Euclidean space. When faced with symbolic datasets, particularly biological sequence data produced by high-throughput sequencing assays, conventional embedding approaches like binary and k-mer count vectors may be too high-dimensional or too coarse-grained to learn from the data effectively. Other representation techniques such as Multidimensional Scaling (MDS) and Node2Vec may be inadequate for large datasets because they require recomputing the full embedding from scratch when faced with new, unclassified data. To overcome these issues, we amend the graph-theoretic notion of “metric dimension” to that of “multilateration.” Much as trilateration can be used to represent points in the Euclidean plane by their distances to three non-collinear points, multilateration allows us to represent any node in a graph by its distances to a subset of nodes. Unfortunately, the problem of determining a minimal subset, and hence the lowest-dimensional embedding, is NP-complete for general graphs. However, by specializing to Hamming graphs, which are particularly well suited to representing biological sequences, we can readily generate low-dimensional embeddings to map sequences of arbitrary length to a real space. As proof-of-concept, we use MDS, Node2Vec, and multilateration-based embeddings to classify DNA 20-mers centered at intron–exon boundaries. Although these different techniques perform comparably, MDS and Node2Vec potentially suffer from scalability issues with increasing sequence length, whereas multilateration provides an efficient means of mapping long genomic sequences.
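
The idea of representing a sequence by its distances to a fixed set of reference sequences can be illustrated with a minimal Python sketch. It is not the authors' construction: the function names (`hamming`, `embed`, `is_resolving`) and the landmark sequences are hypothetical, chosen only for illustration, and the brute-force check enumerates every k-mer, so it is practical only for very small k. A landmark set whose distance vectors are distinct for all nodes is usually called a resolving set in the metric dimension literature.

```python
from itertools import product

ALPHABET = "ACGT"  # DNA alphabet; assumed here for illustration


def hamming(u, v):
    """Number of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(u, v))


def embed(seq, landmarks):
    """Map a sequence to its vector of Hamming distances to each landmark."""
    return tuple(hamming(seq, r) for r in landmarks)


def is_resolving(landmarks, k, alphabet=ALPHABET):
    """Brute-force check that every k-mer gets a distinct distance vector.

    Enumerates all len(alphabet)**k sequences, so use only for small k.
    """
    seen = set()
    for chars in product(alphabet, repeat=k):
        vec = embed("".join(chars), landmarks)
        if vec in seen:
            return False  # two different k-mers share a distance vector
        seen.add(vec)
    return True


if __name__ == "__main__":
    # Hypothetical landmarks for 4-mers; not taken from the article.
    print(embed("ACGT", ["AAAA", "ACGT", "TTTT"]))  # (3, 0, 3)
    # For single nucleotides, three landmarks separate all four letters,
    # while two landmarks cannot tell G and T apart.
    print(is_resolving(["A", "C", "G"], k=1))  # True
    print(is_resolving(["A", "C"], k=1))       # False
```

A new sequence is embedded by computing its distances to the landmarks alone, so previously embedded sequences do not need to be recomputed, which is the property the abstract contrasts with MDS and Node2Vec.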

Acknowledgements

The authors thank the reviewers for their very insightful comments on the original version of this manuscript. This research was partially funded by NSF IGERT Grant 1144807 and NSF IIS Grant 1836914. The authors acknowledge the BioFrontiers Computing Core at the University of Colorado–Boulder for providing High-Performance Computing resources (funded by National Institutes of Health 1S10OD012300), supported by the BioFrontiers IT group.

Author information

Authors and Affiliations

Department of Computer Science, The University of Colorado, Boulder, CO, 80309-0526, USA
Richard C. Tillquist
Department of Applied Mathematics, The University of Colorado, Boulder, CO, 80309-0526, USA
Manuel E. Lladser

Corresponding author

Correspondence to Manuel E. Lladser.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Tillquist, R.C., Lladser, M.E. Low-dimensional representation of genomic sequences. J. Math. Biol. 79, 1–29 (2019). https://doi.org/10.1007/s00285-019-01348-1

Received: 30 April 2018
Revised: 12 November 2018
Published: 30 March 2019
Issue Date: 01 July 2019
DOI: https://doi.org/10.1007/s00285-019-01348-1

Electronic supplementary material

Supplementary material 1 (pdf 101 KB)

Supplementary material 2 (tsv 6 KB)

Supplementary material 3 (tsv 3 KB)

Supplementary material 4 (tsv 2 KB)

Supplementary material 5 (tsv 321306 KB)

Supplementary material 6 (tsv 8001 KB)

Supplementary material 7 (tsv 6864 KB)
