Abstract
A suitable feature representation that both preserves the intrinsic information of the data and reduces its complexity and dimensionality is key to the performance of machine learning models. Deeply rooted in algebraic topology, persistent homology (PH) provides a delicate balance between data simplification and intrinsic structure characterization, and has been applied successfully in various areas. However, the combination of PH and machine learning has been greatly hindered by three challenges, namely the topological representation of data, PH-based distance measurements or metrics, and PH-based feature representation. With the development of topological data analysis, progress has been made on all three problems, but the results are widely scattered across the literature. In this paper, we provide a systematic review of PH and PH-based supervised and unsupervised models from a computational perspective. Our emphasis is on the recent development of mathematical models and tools, including PH software and PH-based functions, feature representations, kernels, and similarity models. Essentially, this paper can serve as a roadmap for the practical application of PH-based machine learning tools. Further, we compare two types of simplicial complexes (alpha and Vietoris-Rips complexes), two types of feature extraction (barcode statistics and binned features), and three types of machine learning models (support vector machines, tree-based models, and neural networks), and investigate their impact on protein secondary structure classification.
Data availability
The data and codes can be downloaded from https://entuedu-my.sharepoint.com/:f:/g/personal/xiakelin_staff_main_ntu_edu_sg/EvZ-CivdgCdCu90JpIAR3BYBmJwl--DxteRirSvLnAhFHA?e=kBPAP3.
References
Adams H, Emerson T, Kirby M, Neville R, Peterson C, Shipman P, Chepushtanova S, Hanson E, Motta F, Ziegelmeier L (2017) Persistence images: a stable vector representation of persistent homology. J Mach Learn Res 18:218–252
Adcock A, Carlsson E, Carlsson G (2016) The ring of algebraic functions on persistence bar codes. Homol, Homotopy Appli 18:381–402
Ahmed M, Fasy BT, Wenk C (2014) Local persistent homology based distance between maps. In Proceedings of the 22nd ACM SIGSPATIAL international conference on advances in geographic information systems, ACM, pp. 43–52
Alfaro E, Gámez M, García N (2013) adabag: An r package for classification with boosting and bagging. J Statis Softw 54:1–35. https://doi.org/10.18637/jss.v054.i02
Anirudh R, Thiagarajan JJ, Kim I, Polonik W (2016) Autism spectrum disorder classification using graph kernels on multidimensional time series, arXiv preprint arXiv:1611.09897
Anirudh R, Venkataraman V, Ramamurthy KN, Turaga P (2016) A Riemannian framework for statistical analysis of topological persistence diagrams. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 68–76
Bae W, Yoo JJ, Ye JC (2017) Beyond deep residual learning for image restoration: persistent homology-guided manifold simplification. In CVPR workshops, pp. 1141–1149
Bauer U (2017) Ripser: a lean C++ code for the computation of Vietoris-Rips persistence barcodes, Software available at https://github.com/Ripser/ripser
Bauer U, Kerber M, Reininghaus J (2014) Distributed computation of persistent homology. In 2014 proceedings of the 16th workshop on algorithm engineering and experiments (ALENEX), SIAM, pp. 31–38
Bauer U, Kerber M, Reininghaus J, Wagner H (2014) PHAT–persistent homology algorithms toolbox. In International congress on mathematical software, Springer, pp. 137–143
Bendich P, Cohen-Steiner D, Edelsbrunner H, Harer J, Morozov D (2007) Inferring local homology from sampled stratified spaces. In foundations of computer science, 2007. FOCS’07. 48th Annual IEEE symposium on, IEEE, pp. 536–546
Bendich P, Edelsbrunner H, Kerber M (2010) Computing robustness and persistence for images. IEEE Trans Visual Comput Graphics 16:1251–1260
Bendich P, Gasparovic E, Harer J, Izmailov R, Ness L (2015) Multi-scale local shape analysis and feature selection in machine learning applications. In Neural Networks (IJCNN), 2015 international joint conference on, IEEE, pp. 1–8
Bendich P, Wang B, Mukherjee S (2012) Local homology transfer and stratification learning. In Proceedings of the 23rd annual ACM-SIAM symposium on discrete algorithms, SIAM, pp. 1355–1370
Binchi J, Merelli E, Rucco M, Petri G, Vaccarino F (2014) jholes: A tool for understanding biological complex networks via clique weight rank persistent homology. Electron Notes Theoretical Comput Sci 306:5–18
Bonis T, Ovsjanikov M, Oudot S, Chazal F (2016) Persistence-based pooling for shape pose recognition. In International workshop on computational topology in image context, Springer, pp. 19–29
Breiman L (2001) Random forests. Mach Learn 45:5–32. https://doi.org/10.1023/A:1010933404324
Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Chapman and Hall/CRC, Wadsworth Statistics/Probability
Bubenik P (2015) Statistical topological data analysis using persistence landscapes. J Mach Learn Res 16:77–102
Bubenik P (2018) The persistence landscape and some of its properties, arXiv preprint arXiv:1810.04963
Bubenik P, Dłotko P (2017) A persistence landscapes toolbox for topological statistics. J Symb Comput 78:91–114
Bubenik P, Kim PT (2007) A statistical approach to persistent homology. Homol, Homotopy Appli 9:337–362
Cai T, Liu W (2011) A direct estimation approach to sparse linear discriminant analysis. J Am Stat Assoc 106:1566–1577. https://doi.org/10.1198/jasa.2011.tm11199
Cang ZX, Mu L, Wei GW (2018) Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLoS Comput Biol 14:e1005929
Cang ZX, Mu L, Wu KD, Opron K, Xia KL, Wei G (2015) A topological approach to protein classification. Molecul Math Biol 3:140–162
Cang ZX, Wei GW (2017) Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology. Bioinformatics 33:3549–3557
Cang ZX, Wei GW (2017) Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction. Int J Numerical Methods Biomed Eng 34(2):e2914
Cang ZX, Wei GW (2017) TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions. PLoS Comput Biol 13:e1005690
Carlsson G (2009) Topology and data. Bull Am Math Soc 46:255–308
Carlsson G, Ishkhanov T, Silva V, Zomorodian A (2008) On the local behavior of spaces of natural images. Int J Comput Vision 76:1–12
Carlsson G, Singh G, Zomorodian A (2009) Computing multidimensional persistence, in Algorithms and computation, Springer, pp. 730–739
Carlsson G, Zomorodian A (2009) The theory of multidimensional persistence. Dis Comput Geo 42:71–93
Carriere M, Bauer U (2018) On the metric distortion of embedding persistence diagrams into reproducing kernel hilbert spaces, arXiv preprint arXiv:1806.06924
Carriere M, Cuturi M, Oudot S (2017) Sliced wasserstein kernel for persistence diagrams, arXiv preprint arXiv:1706.03358
Cerri A, Landi C (2013) The persistence space in multidimensional persistent homology. In Discrete Geometry for Computer Imagery, Springer, 180–191
Chang C-C, Lin C-J (2011) LIBSVM: A library for support vector machines. ACM Trans Intell Sys Technol 2(3):27:1–27:27
Chazal F, Cohen-Steiner D, Mérigot Q (2011) Geometric inference for probability measures. Found Comput Math 11:733–751
Chazal F, Fasy B, Lecci F, Michel B, Rinaldo A, Wasserman L (2017) Robust topological inference: distance to a measure and kernel distance. J Mach Learn Res 18:5845–5884
Chen Y, Garcia EK, Gupta MR, Rahimi A, Cazzanti L (2009) Similarity-based classification: concepts and algorithms. J Mach Learn Res 10:747–776
Chevyrev I, Nanda V, Oberhauser H (2018) Persistence paths and signature features in topological data analysis, arXiv preprint arXiv:1806.00381
Chintakunta H, Gentimis T, Gonzalez-Diaz R, Jimenez MJ, Krim H (2015) An entropy-based persistence barcode. Pattern Recogn 48:391–401
Chiu MC, Pun CS, Wong HY (2017) Big data challenges of high-dimensional continuous-time mean-variance portfolio selection and a remedy. Risk Anal 37:1532–1549. https://doi.org/10.1111/risa.12801
Cohen-Steiner D, Edelsbrunner H, Morozov D (2006) Vines and vineyards by updating persistence in linear time. In Proceedings of the 22nd annual symposium on Computational geometry, ACM, 119–126
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297. https://doi.org/10.1023/A:1022627411411
Cramer JS (2004) The early origins of the logit model. Studies History Philosophy Sci Part C: Studies History Philosophy Biol Biomed Sci 35:613–626. https://doi.org/10.1016/j.shpsc.2004.09.003
Dey TK, Li KY, Sun J, David CS (2008) Computing geometry aware handle and tunnel loops in 3d models. ACM Trans Graph 27
Dey TK, Mandal S (2018) Protein classification with improved topological data analysis. In LIPIcs-Leibniz international proceedings in informatics, vol. 113, Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik
Dey TK, Wang YS (2013) Reeb graphs: approximation and persistence. Discret Comput Geom 49:46–73. https://doi.org/10.1007/s00454-012-9463-z
Dionysus: the persistent homology software. Software available at http://www.mrzv.org/software/dionysus
Di Fabio B, Landi C (2011) A Mayer-Vietoris formula for persistent homology with an application to shape recognition in the presence of occlusions. Found Comput Math 11:499–527
Edelsbrunner H (1992) Weighted alpha shapes, tech. report, Champaign, IL, USA
Edelsbrunner H, Harer J (2010) Computational topology: an introduction, American Mathematical Society
Edelsbrunner H, Letscher D, Zomorodian A (2002) Topological persistence and simplification. Discrete Comput. Geom. 28:511–533
Edelsbrunner H, Mücke EP (1994) Three-dimensional alpha shapes. ACM Trans Graph 13:43–72
Fan R-E, Chang K-W, Hsieh C-J, Wang X-R, Lin C-J (2008) Liblinear: a library for large linear classification. J Mach Learn Res 9:1871–1874
Fasy BT, Kim J, Lecci F, Maria C (2014) Introduction to the r package tda, arXiv preprint arXiv:1411.1830
Fasy BT, Wang B (2016) Exploring persistent local homology in topological data analysis, in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, IEEE, pp. 6430–6434
Fox NK, Brenner SE, Chandonia J-M (2014) Scope: structural classification of proteins-extended, integrating scop and astral data and classification of new structures. Nucleic Acids Res 42:D304–D309
Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55:119–139. https://doi.org/10.1006/jcss.1997.1504
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232
Frohmader A (2008) Face vectors of flag complexes. Israel J Math 164:153–164
Frosini P, Landi C (2013) Persistent Betti numbers for a noise tolerant shape-based approach to image retrieval. Pattern Recogn Lett 34:863–872
Gameiro M, Hiraoka Y, Izumi S, Kramar M, Mischaikow K, Nanda V (2013) Topological measurement of protein compressibility via persistence diagrams, preprint
Ghrist R (2008) Barcodes: the persistent topology of data. Bull Am Math Soc 45:61–75
Giansiracusa N, Giansiracusa R, Moon C (2017) Persistent homology machine learning for fingerprint classification, arXiv preprint arXiv:1711.09158
Giusti C, Pastalkova E, Curto C, Itskov V (2015) Clique topology reveals intrinsic geometric structure in neural correlations. Proc Natl Acad Sci 112:13455–13460
Guo W, Manohar K, Brunton SL, Banerjee AG (2018) Sparse-tda: Sparse realization of topological data analysis for multi-way classification. IEEE Trans Knowl Data Eng 30:1403–1408
Hadimaja MZ, Pun CS (2021) A self-calibrated regularized direct estimation for graphical selection and discriminant analysis in high dimensions. Comput Stat Data Anal 155:107105. https://doi.org/10.1016/j.csda.2020.107105
Han YS, Yoo J, Ye JC (2016) Deep residual learning for compressed sensing ct reconstruction via persistent homology analysis, arXiv preprint arXiv:1611.06391
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer
Hiraoka Y, Nakamura T, Hirata A, Escolar EG, Matsue K, Nishiura Y (2016) Hierarchical structures of amorphous solids characterized by persistent homology. Proc Natl Acad Sci 113:7035–7040
Hofer C, Kwitt R, Niethammer M, Uhl A (2017) Deep learning with topological signatures. Adv Neural Inf Process Sys 30:1634–1644
Horak D, Maletic S, Rajkovic M (2009) Persistent homology of complex networks. J Statis Mech: Theory Exp 2009:P03034
Horváth L, Kokoszka P (2012) Inference for functional data with applications, Springer. New York. https://doi.org/10.1007/978-1-4614-3655-3
Hylton A, Henselman-Petrusek G, Sang J, Short R (2019) Tuning the performance of a computational persistent homology package. Softw: Prac Exp 49:885–905. https://doi.org/10.1002/spe.2678
Kaczynski T, Mischaikow K, Mrozek M (2004) Computational homology, Springer-Verlag
Kaji S, Sudo T, Ahara K (2020) Cubical Ripser: Software for computing persistent homology of image and volume data, arXiv:2005.12692
Kališnik S (2018) Tropical coordinates on the space of persistence barcodes. Found Comput Math 19(1):101–29
Kasson PM, Zomorodian A, Park S, Singhal N, Guibas LJ, Pande VS (2007) Persistent voids: a new structural metric for membrane fusion. Bioinformatics 23:1753–1759
Kusano G, Hiraoka Y, Fukumizu K (2016) Persistence weighted gaussian kernel for topological data analysis. In International conference on machine learning, pp. 2004–2013
Kwitt R, Huber S, Niethammer M, Lin W, Bauer U (2015) Statistical topological data analysis-a kernel perspective. Adv Neural Inf Process Syst 28:3070–3078
Le T, Yamada M (2018) Riemannian manifold kernel for persistence diagrams, arXiv preprint arXiv:1802.03569
Lee H, Kang H, Chung MK, Kim B, Lee DS (2012) Persistent brain network homology from the perspective of dendrogram. Med Imag IEEE Trans 31:2267–2277. https://doi.org/10.1109/TMI.2012.2219590
Li C, Ovsjanikov M, Chazal F (2014) Persistence-based structural recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1995–2002
Lin HW, Tegmark M, Rolnick D (2017) Why does deep and cheap learning work so well? J Stat Phys 168:1223–1247. https://doi.org/10.1007/s10955-017-1836-5
Liu X, Xie Z, Yi DY (2012) A fast algorithm for constructing topological structure in large data. Homol, Homotopy Appli 14:221–238
Makarenko N, Kalimoldayev M, Pak I, Yessenaliyeva A (2016) Texture recognition by the methods of topological data analysis, Open Engineering, 6
Marchese A, Maroulas V (2017) Signal classification with a point process distance on the space of persistence diagrams. Adv Data Anal Classifi 12(3):657–682
Maria C (2015) Filtered complexes, in GUDHI User and Reference Manual, GUDHI Editorial Board, http://gudhi.gforge.inria.fr/doc/latest/group__simplex__tree.html
Merelli E, Rucco M, Sloot P, Tesei L (2015) Topological characterization of complex systems: using persistent entropy. Entropy 17:6872–6892
Mileyko Y, Mukherjee S, Harer J (2011) Probability measures on the space of persistence diagrams. Inverse Prob 27:124007
Mischaikow K, Mrozek M, Reiss J, Szymczak A (1999) Construction of symbolic dynamics from experimental time series. Phys Rev Lett 82:1144–1147
Mischaikow K, Nanda V (2013) Morse theory for filtrations and efficient computation of persistent homology. Discret Comput Geom 50:330–353. https://doi.org/10.1007/s00454-013-9529-6
Munkres JR (2018) Elements of algebraic topology, CRC Press
Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247:536–540
Nanda V. Perseus: the persistent homology software. Software available at http://www.sas.upenn.edu/~vnanda/perseus
Nguyen DD, Cang ZX, Wu KD, Wang ML, Cao Y, Wei GW (2018) Mathematical deep learning for pose and binding affinity prediction and ranking in D3R Grand Challenges, arXiv preprint arXiv:1804.10647
Nguyen DD, Xiao T, Wang ML, Wei GW (2017) Rigidity strengthening: a mechanism for protein-ligand binding. J Chem Inf Model 57:1715–1721
Niyogi P, Smale S, Weinberger S (2011) A topological view of unsupervised learning from noisy data. SIAM J Comput 40:646–663
Obayashi I, Hiraoka Y, Kimura M (2018) Persistence diagrams with linear machine learning models. J Appli Comput Topol 1:421–449
Obayashi I. HomCloud: Software collection for data analysis using persistent homology, Hiraoka Laboratory https://homcloud.dev/
Pachauri D, Hinrichs C, Chung M, Johnson S, Singh V (2011) Topology-based kernels with application to inference problems in alzheimer’s disease. Med Imag, IEEE Trans 30:1760–1770. https://doi.org/10.1109/TMI.2011.2147327
Padellini T, Brutti P (2017) Supervised learning with indefinite topological kernels, arXiv preprint arXiv:1709.07100
Pun CS (2021) A sparse learning approach to relative-volatility-managed portfolio selection. SIAM J Financial Math 12:410-445. https://doi.org/10.1137/19M1291674
Pun CS, Wong HY (2016) Resolution of degeneracy in merton’s portfolio problem. SIAM J Financial Math 7:786–811. https://doi.org/10.1137/16m1065021
Pun CS, Wong HY (2018) A linear programming model for selection of sparse high-dimensional multiperiod portfolios. Eur J Oper Res. 273(2):754–71. https://doi.org/10.1016/j.ejor.2018.08.025
Qaiser T, Tsang YW, Taniyama D, Sakamoto N, Nakane K, Epstein D, Rajpoot N (2018) Fast and accurate tumor segmentation of histology images using persistent homology and deep convolutional features, arXiv preprint arXiv:1805.03699
Ramsay JO, Silverman BW (1997) Functional data analysis, Springer. New York. https://doi.org/10.1007/978-1-4757-7107-7
Reininghaus J, Huber S, Bauer U, Kwitt R (2015) A stable multi-scale kernel for topological machine learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4741–4748
Ren S, Wu C, Wu J (2017) Weighted persistent homology, arXiv preprint arXiv:1708.06722
Rieck B, Mara H, Leitte H (2012) Multivariate data analysis using persistence-based filtering and topological signatures. IEEE Trans Visual Comput Graphics 18:2382–2391
Robins V, Turner K (2016) Principal component analysis of persistent homology rank functions with case studies of spatial point patterns, sphere packing and colloids. Physica D 334:99–117
Rucco M, Castiglione F, Merelli E, Pettini M (2016) Characterisation of the idiotypic immune network through persistent entropy. In Proceedings of ECCS 2014, Springer, pp. 117–128
Saadatfar M, Takeuchi H, Robins V, Francois N, Hiraoka Y (2017) Pore configuration landscape of granular crystallization. Nat Commun 8:15082
Seversky LM, Davis S, Berger M (2016) On time-series topological data analysis: New data and opportunities. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 59–67
Silva VD, Ghrist R (2005) Blind swarms for coverage in 2-d. In Proceedings of Robotics: Science and Systems, p. 01
Singh G, Memoli F, Ishkhanov T, Sapiro G, Carlsson G, Ringach DL (2008) Topological analysis of population activity in visual cortex. J Vision 8(8):11–11. https://doi.org/10.1167/8.8.11
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: A simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958
Tausz A, Vejdemo-Johansson M, Adams H (2011) Javaplex: A research software package for persistent (co)homology. Software available at http://code.google.com/p/javaplex
Turner K, Mileyko Y, Mukherjee S, Harer J (2014) Fréchet means for distributions of persistence diagrams. Dis Comput Geom 52:44–70
Umeda Y (2017) Time series classification via topological data analysis. Inf Media Technol 12:228–239
Wang B, Summa B, Pascucci V, Vejdemo-Johansson M (2011) Branching and circular features in high dimensional data. IEEE Trans Visual Comput Graphics 17:1902–1911
Wang B, Wei GW (2016) Object-oriented persistent homology. J Comput Phys 305:276–299
Wang Y, Ombao H, Chung MK et al. (2014) Persistence landscape of functional signal and its application to epileptic electroencephalogram data, ENAR Distinguished Student Paper Award
Wu C, Ren S, Wu J, Xia K (2018) Weighted (co) homology and weighted laplacian, arXiv preprint arXiv:1804.06990
Wu KD, Wei GW (2018) Quantitative toxicity prediction using topology based multi-task deep neural networks. J Chem Inf Model 58(2):520–31. https://doi.org/10.1021/acs.jcim.7b00558
Xia KL (2017) A quantitative structure comparison with persistent similarity, arXiv preprint arXiv:1707.03572
Xia KL (2018) Persistent homology analysis of ion aggregations and hydrogen-bonding networks. Phys Chem Chem Phys 20:13448–13460
Xia KL, Feng X, Tong YY, Wei GW (2015) Persistent homology for the quantitative prediction of fullerene stability. J Comput Chem 36:408–422
Xia KL, Li ZM, Mu L (2018) Multiscale persistent functions for biomolecular structure characterization. Bull Math Biol 80:1–31
Xia KL, Wei GW (2014) Persistent homology analysis of protein structure, flexibility and folding. Int J Num Methods Biomed Eng 30:814–844
Xia KL, Wei GW (2015) Multidimensional persistence in biomolecular data. J Comput Chem 36:1502–1520
Xia KL, Wei GW (2015) Persistent topology for cryo-EM data analysis. Int J Num Methods Biomed Eng 31:e02719
Xia KL, Zhao ZX, Wei GW (2015) Multiresolution topological simplification. J Comput Biol 22:1–5
Yao Y, Sun J, Huang XH, Bowman GR, Singh G, Lesnick M, Guibas LJ, Pande VS, Carlsson G (2009) Topological methods for exploring low-density states in biomolecular folding pathways. J Chem Phys 130:144115
Zeppelzauer M, Zieliński B, Juda M, Seidl M (2018) A study on topological descriptors for the analysis of 3d surface texture. Comput Vis Image Underst 167:74–88
Zhang ZF, Song Y, Cui HC, Wu J, Schwartz F, Qi HR (2015) Early mastitis diagnosis through topological analysis of biosignals from low-voltage alternate current electrokinetics. In Engineering in Medicine and Biology Society (EMBC), 2015 37th annual international conference of the IEEE, IEEE, pp. 542–545
Zhou Z, Huang YZ, Wang L, Tan TN (2017) Exploring generalized shape analysis by topological representations. Pattern Recogn Lett 87:177–185
Zhu XJ (2013) Persistent homology: an introduction and a new text representation for natural language processing, in IJCAI, 1953–1959
Zhu XJ, Vartanian A, Bansal M, Nguyen D, Brandl L (2016) Stochastic multiresolution persistent homology kernel, in IJCAI, 2449–2457
Zielinski B, Juda M, Zeppelzauer M (2018) Persistence codebooks for topological data analysis, arXiv preprint arXiv:1802.04852
Zomorodian A (2010) The tidy set: a minimal simplicial set for computing homology of clique complexes. In Proceedings of the 26th annual symposium on computational geometry, ACM, pp. 257–266
Zomorodian A, Carlsson G (2005) Computing persistent homology. Discrete Comput Geom 33:249–274
Zomorodian A, Carlsson G (2008) Localized homology. Comput Geom - Theory Appli 41:126–148
Zomorodian AJ (2005) Topology for computing, vol. 16, Cambridge university press
Funding
This research is partially supported by Nanyang Technological University Startup Grants M4081840 and M4081842, Data Science and Artificial Intelligence Research Centre@NTU M4082115, and Singapore Ministry of Education Academic Research Fund Tier 1 RG109/19, Tier 2 MOE2018-T2-1-033 and MOE-T2EP20120-0013.
Appendices
Appendix 1: Properties of the Protein Secondary Structure
This section gives a review of some of the key properties of the secondary structure of proteins. The two main types of protein secondary structure are the alpha (\(\alpha\))-helix and the beta (\(\beta\))-pleated sheets.
The \(\alpha\)-helix has the following properties:
(1) The distance between consecutive \(C_{\alpha }\) atoms is 3.8 Å.
    - This corresponds to the length of typical Betti-0 (Dim-0) bars.
(2) Each turn is made up of 3.6 amino acid residues.
    - The formation of Betti-1 (Dim-1) bars can be explained using the slicing technique described in Xia and Wei (2014).
    - The \(\alpha\)-helix structure is stabilised by hydrogen bonds between the amine hydrogen (N-H) and the carbonyl oxygen (C\(=\)O).
    - Each set of 4 \(C_{\alpha }\) atoms forms a one-dimensional loop, which contributes to a Betti-1 (Dim-1) bar.
(3) Betti-2 (Dim-2) bars are absent. In a single \(\alpha\)-helix there is insufficient "time" for a cavity to form, because the loops are filled in as faces before any cavity can appear.
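Property (1) can be checked numerically. The following is a minimal sketch, not the procedure used in the paper: it builds an idealised helix from textbook parameters that do not appear in this appendix (a \(C_{\alpha }\) radius of 2.3 Å and a rise of 1.5 Å per residue are assumptions, and the function names are hypothetical), and it takes the pairwise distance as the Vietoris-Rips filtration value. Under that convention, the death values of the finite Dim-0 bars coincide with the edge lengths of a minimum spanning tree.

```python
import math

def ideal_helix_ca(n_residues, radius=2.3, rise=1.5, residues_per_turn=3.6):
    """C-alpha coordinates (Angstrom) of an idealised alpha-helix.

    radius and rise are assumed textbook values, not taken from the paper.
    """
    step = 2.0 * math.pi / residues_per_turn  # 100 degrees per residue
    return [
        (radius * math.cos(i * step), radius * math.sin(i * step), rise * i)
        for i in range(n_residues)
    ]

def dim0_death_values(points):
    """Death values of the finite Dim-0 bars, computed with Prim's algorithm.

    Each minimum-spanning-tree edge merges two connected components,
    so its length is the filtration value at which one Dim-0 bar ends.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # shortest known distance from the tree to each point
    best[0] = 0.0
    deaths = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if best[u] > 0.0:  # skip the starting point, which merges with nothing
            deaths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return sorted(deaths)

helix = ideal_helix_ca(12)
deaths = dim0_death_values(helix)
print(len(deaths), [round(d, 2) for d in deaths])
```

For 12 residues there are 11 finite Dim-0 bars, and every one of them ends close to 3.8 Å, because the nearest neighbour of each \(C_{\alpha }\) atom in this idealised geometry is its sequence neighbour.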
The \(\beta\)-pleated sheets have the following properties:
(1) The distance between consecutive \(C_{\alpha }\) atoms in the same strand is also 3.8 Å.
    - This corresponds to the length of typical Betti-0 (Dim-0) bars. A bar terminates once the atoms are connected.
(2) Each strand of a \(\beta\)-pleated sheet is a stretched-out polypeptide chain made up of 3 to 10 amino acid residues.
(3) The \(\beta\)-pleated sheets are extended structures that are stabilised by hydrogen bonds between residues in adjacent chains.
(4) Each strand must be connected to adjacent strands, where the shortest distance between a \(C_{\alpha }\) atom and its nearest neighbour in an adjacent strand is 4.1 Å.
(5) Adjacent chains run parallel or antiparallel to one another.
Principal component analysis on binned features
In this appendix, we investigate the effects of principal component analysis (PCA) on PHML for our application in Section 5. We do not combine the tree-based methods with PCA because trees perform their own dimension reduction by construction.
Adjacent BFs are quite highly correlated, as seen in Fig. 6. The use of bins unavoidably suffers from the curse of dimensionality, especially when the number of samples n is limited and \(n\ll p\), where p is the number of features. In such a situation, PCA can be applied to transform the features into a few uncorrelated principal components (PCs), which can be viewed as new features in a lower-dimensional feature space. The downside of this approach is that the resulting PCs have no clear interpretation in terms of the original bins.
In the subsequent results, experiments that use principal components transformed from BFs are denoted by the extension "PC". For consistency, only the first 30 PCs are used for all transformed features, whether derived from RC or AC barcodes. The same set of PCs is used as input features for SVMs and (deep) NNs (with dropout). The settings are the same as specified in Sects. 5.2.1 and 5.2.3. Tables 4 and 5 report the corresponding results.
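As a reminder of what the transformation does, the standard PCA projection onto 30 components can be written as follows. The notation is introduced here for illustration and is not taken from the paper: \(X\in \mathbb {R}^{n\times p}\) is the matrix of binned features, \(\bar{x}\) is the vector of column means, and \(V_{30}\) collects the first 30 columns of \(V\), ordered by decreasing eigenvalue.

\[
X_c = X - \mathbf{1}_n \bar{x}^{\top}, \qquad
S = \frac{1}{n-1} X_c^{\top} X_c = V \Lambda V^{\top}, \qquad
Z = X_c V_{30}.
\]

The columns of \(Z\) are the 30 new, mutually uncorrelated features. Since the rank of \(X_c\) is at most \(\min (n-1, p)\), at most \(n-1\) components carry non-zero variance when \(n\ll p\).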
Comparing Tables 1 and 4, and Tables 3 and 5, we see that applying PCA to BFs does not improve the performance. This suggests that the PCA transformation discards information contained in the BFs. We therefore do not recommend combining PCA with ML algorithms that use BFs. However, this does not prevent PCA from being a powerful visualization tool for unstructured topological data.
Effects of increasing bin number for binned features
In Tables 6 and 9, the first three or four columns specify the settings of PHML and the remaining columns report the evaluation measurements. The highest overall accuracy number across different bin numbers for a given method is highlighted in red.
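To make the role of the bin number concrete, the sketch below shows how the feature dimension grows with the number of bins. It is only an illustration under stated assumptions: the binning rule (count the bars alive anywhere inside each bin), the function name `binned_betti_counts`, the toy barcode, and the filtration range of 8 Å are all hypothetical and may differ from the binned features defined in the main text.

```python
def binned_betti_counts(bars, n_bins, max_filtration):
    """Count, for each equal-width bin, the bars alive somewhere inside it.

    bars is a list of (birth, death) pairs; the binning rule is an assumed
    variant chosen for illustration.
    """
    width = max_filtration / n_bins
    features = [0] * n_bins
    for birth, death in bars:
        for k in range(n_bins):
            lo, hi = k * width, (k + 1) * width
            # The bar [birth, death) overlaps the bin [lo, hi).
            if birth < hi and death > lo:
                features[k] += 1
    return features

# Toy barcode (hypothetical values, in Angstrom).
bars = [(0.0, 3.8), (0.0, 3.8), (4.1, 6.0), (5.2, 5.6)]
for n_bins in (4, 8):
    print(n_bins, binned_betti_counts(bars, n_bins, max_filtration=8.0))
```

Doubling the number of bins doubles the feature dimension p while the number of samples stays fixed. A long bar also contributes to several neighbouring bins at once, which is one way in which adjacent bins end up correlated.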
Cite this article
Pun, C.S., Lee, S.X. & Xia, K. Persistent-homology-based machine learning: a survey and a comparative study. Artif Intell Rev 55, 5169–5213 (2022). https://doi.org/10.1007/s10462-022-10146-z