Abstract
Detecting potentially habitable planets outside our solar system is a growing field of study. Among other topics, this field aims to classify stars using the transit method, i.e., using their light intensity measured over time to spot the moment when an orbiting planet passes in front of the star and covers part of it as seen by a satellite. We propose a novel approach to this classification that extracts a set of features from individual time-series, covering three domains: temporal, statistical, and spectral. These features are filtered based on relevant measures and used to train and evaluate models on Kepler data. The results were compared to state-of-the-art methods evaluated on the same data set and surpass existing approaches for some data transformations. All of these transformations aim to make the time-series naïvely stationary before feature extraction. Using principal components extracted from the feature set during model training did not have a considerable impact on results. To better evaluate the results, a cross-validation process was performed to eliminate data set bias. During this step, the best model achieved \(100\%\) recall and \(98.82\%\) F1-score for the minority class. In the future, testing additional feature selection methods and assessing feature importance using more explainable metrics will be crucial to further understand the distinctions that separate stars with exoplanets from those without.
Code Availability
All code produced during the course of this work is available on GitHub.
References
Priyadarshini, I., Puri, V.: A convolutional neural network (cnn) based ensemble model for exoplanet detection. Earth Sci. Inf. 14(6), 735–747 (2021). https://doi.org/10.1007/s12145-021-00579-5. (ISSN 1865-0473.)
Jara-Maldonado, M., Alarcon-Aquino, V., Rosas-Romero, R., Starostenko, O., Ramirez-Cortes, J.M.: Transiting exoplanet discovery using machine learning techniques: a survey. Earth Sci. Inf. 13(9), 573–600 (2020). https://doi.org/10.1007/s12145-020-00464-7. (ISSN 1865-0473)
Tyagi, N., Arora, P., Chaudhary, R., Bhardwaj, J.: Exoplanet hunting using machine learning. Emerg. Technol. Data Mining Inf. Secur. Proc. IEMIS 2022 1, 687–701 (2023). https://doi.org/10.1007/978-981-19-4193-1_67
Bahel, V., Gaikwad, M.: A study of light intensity of stars for exoplanet detection using machine learning. In 2022 IEEE Region 10 Symposium (TENSYMP), 7, 1–5 (2022). https://doi.org/10.1109/TENSYMP54529.2022.9864366. (ISBN 978-1-6654-6658 ,IEEE)
Michele, J., Brian, D.: Mission overview, (2018). URL https://www.nasa.gov/mission_pages/kepler/overview/index.html. Accessed on 02 Feb 2023
Khan, M.S., Jenkins, J., Yoma, N.B.: Discovering new worlds: A review of signal processing methods for detecting exoplanets from astronomical radial velocity data [applications corner]. IEEE Signal Process. Mag. 34(1), 104–115 (2017). https://doi.org/10.1109/MSP.2016.2617293. (ISSN 1053-5888)
Auvergne, M., Bodin, P., Boisnard, L., Buey, J.-T., Chaintreuil, S., Epstein, G., Jouret, M., Lam-Trong, T., Levacher, P., Magnan, A., Perez, R., Plasson, P., Plesseria, J., Peter, G., Steller, M., Tiphène, D., Baglin, A., Agogué, P., Appourchaux, T., Barbet, D., Beaufort, T., Bellenger, R., Berlin, R., Bernardi, P., Blouin, D., Boumier, P., Bonneau, F., Briet, R., Butler, B., Cautain, R., Chiavassa, F., Costes, V., Cuvilho, J., Cunha-Parro, V., De Oliveira Fialho, F., Decaudin, M., Defise, J.-M., Djalal, S., Docclo, A., Drummond, R., Dupuis, O., Exil, G., Fauré, C., Gaboriaud, A., Gamet, P., Gavalda, P., Grolleau, E., Gueguen, L., Guivarc’h, V., Guterman, P., Hasiba, J., Huntzinger, G., Hustaix, H., Imbert, C., Jeanville, G., Johlander, B., Jorda, L., Journoud, P., Karioty, F., Kerjean, L., Lafond, L., Lapeyrere, V., Landiech, P., Larqué, T., Laudet, P., Le Merrer, J., Leporati, L., Leruyet, B., Levieuge, B., Llebaria, A., Martin, L., Mazy, E., Mesnager, J.-M., Michel, J.-P., Moalic, J.-P., Monjoin, W., Naudet, D., Neukirchner, S., Nguyen-Kim, K., Ollivier, M., Orcesi, J.-L., Ottacher, H., Oulali, A., Parisot, J., Perruchot, S., Piacentino, A., Pinheiro da Silva, L., Platzer, J., Pontet, B., Pradines, A., Quentin, C., Rohbeck, U., Rolland, G., Rollenhagen, F., Romagnan, R., Russ, N., Samadi, R., Schmidt, R., Schwartz, N., Sebbag, I., Smit, H., Sunter, W., Tello, M., Toulouse, P., Ulmer, B., Vandermarcq, O., Vergnault, E., Wallner, R., Waultier, G., Zanatta, P.: The corot satellite in flight: description and performance. Astron. Astrophys. 506, 411–424 (2009). https://doi.org/10.1051/0004-6361/200810860
Ricker, G.R., Winn, J.N., Vanderspek, R., Latham, D.W., Bakos, G.Á., Bean, J.L., Berta-Thompson, Z.K., Brown, T.M., Buchhave, L., Butler, N.R., Paul Butler, R., Chaplin, W.J., Charbonneau, D., Christensen-Dalsgaard, J., Clampin, M., Deming, D., Doty, J., De Lee, N., Dressing, C., Dunham, E.W., Endl, M., Fressin, F., Ge, J., Henning, T., Holman, M.J., Howard, A.W., Ida, S., Jenkins, J.M., Jernigan, G., Johnson, J.A., Kaltenegger, L., Kawai, N., Kjeldsen, H., Laughlin, G., Levine, A.M., Lin, D., Lissauer, J.J., MacQueen, P., Marcy, G., McCullough, P.R., Morton, T.D., Narita, N., Paegert, M., Palle, E., Pepe, F., Pepper, J., Quirrenbach, A., Rinehart, S.A., Sasselov, D., Bun’Sato, S.S., Sozzetti, A., Stassun, K.G., Sullivan, P., Szentgyorgyi, A., Torres, G., Udry, S., Villasenor, J.: Transiting exoplanet survey satellite. J. Astron. Telesc. Instrum. Syst. 1(10), 014003 (2014)
Doyle, L.R., Carter, J.A., Fabrycky, D.C., Slawson, R.W., Howell, S.B., Winn, J.N., Orosz, J.A., Prsa, A., Welsh, W.F., Quinn, S.N., Latham, D., Torres, G., Buchhave, L.A., Marcy, G.W., Fortney, J.J., Shporer, A., Ford, E.B., Lissauer, J.J., Ragozzine, D., Rucker, M., Batalha, N., Jenkins, J.M., Borucki, W.J., Koch, D., Middour, C.K., Hall, J.R., McCauliff, S., Fanelli, M.N., Quintana, E.V., Holman, M.J., Caldwell, D.A., Still, M., Stefanik, R.P., Brown, W.R., Esquerdo, G.A., Tang, S., Furesz, G., Geary, J.C., Berlind, P., Calkins, M.L., Short, D.R., Steffen, J.H., Sasselov, D., Dunham, E.W., Cochran, W.D., Boss, A., Haas, M.R., Buzasi, D., Fischer, D.: Kepler-16: a transiting circumbinary planet. Science 333, 1602–1606 (2011). https://doi.org/10.1126/science.1210923
Borucki, W.J., Koch, D.G., Batalha, N., Bryson, S.T., Rowe, J., Fressin, F., Torres, G., Caldwell, D.A., Christensen-Dalsgaard, J., Cochran, W.D., DeVore, E., Gautier, T.N., Geary, J.C., Gilliland, R., Gould, A., Howell, S.B., Jenkins, J.M., Latham, D.W., Lissauer, J.J., Marcy, G.W., Sasselov, D., Boss, A., Charbonneau, D., Ciardi, D., Kaltenegger, L., Doyle, L., Dupree, A.K., Ford, E.B., Fortney, J., Holman, M.J., Steffen, J.H., Mullally, F., Still, M., Tarter, J., Ballard, S., Buchhave, L.A., Carter, J., Christiansen, J.L., Demory, B.-O., Désert, J.-M., Dressing, C., Endl, M., Fabrycky, D., Fischer, D., Haas, M.R., Henze, C., Horch, E., Howard, A.W., Isaacson, H., Kjeldsen, H., Johnson, J.A., Klaus, T., Kolodziejczak, J., Barclay, T., Li, J., Meibom, S., Prsa, A., Quinn, S.N., Quintana, E.V., Robertson, P., Sherry, W., Shporer, A., Tenenbaum, P., Thompson, S.E., Twicken, J.D., Van Cleve, J., Welsh, W.F., Basu, S., Chaplin, W., Miglio, A., Kawaler, S.D., Arentoft, T., Stello, D., Metcalfe, T.S., Verner, G.A., Karoff, C., Lundkvist, M., Lund, M.N., Handberg, R., Elsworth, Y., Hekker, S., Huber, D., Bedding, T.R., Rapin, W.: Kepler-22b: A 24 earth-radius planet in the habitable zone of a sun-like star. Astrophys. J. 745, 120 (2012). https://doi.org/10.1088/0004-637X/745/2/120
Neubauer, D., Vrtala, A., Leitner, J.J., Firneis, M.G., Hitzenberger, R.: The life supporting zone of kepler-22b and the kepler planetary candidates: Koi268.01, koi701.03, koi854.01 and koi1026.01. Planet. Space Sci. 73(12), 397–406 (2012). https://doi.org/10.1016/j.pss.2012.07.020. (ISSN 00320633 Solar System science before and after Gaia)
Quintana, E.V., Barclay, T., Raymond, S.N., Rowe, J.F., Bolmont, E., Caldwell, D.A., Howell, S.B., Kane, S.R., Huber, D., Crepp, J.R., Lissauer, J.J., Ciardi, D.R., Coughlin, J.L., Everett, M.E., Henze, C.E., Horch, E., Isaacson, H., Ford, E.B., Adams, F.C., Still, M., Hunter, R.C., Quarles, B., Selsis, F.: An earth-sized planet in the habitable zone of a cool star. Science 344, 277–280 (2014). https://doi.org/10.1126/science.1249403
Rory Barnes, S.N., Raymond, R.G., Brian, J., Kaib, N.A.: Corot-7b: Super-earth or super-io? Astrophys. J. 709(2), L95–L98 (2010). https://doi.org/10.1088/2041-8205/709/2/L95. (ISSN 2041-8205)
Pat, B., Kristen, W., Anya, B.: Kepler’s legacy: discoveries and more, 2020. URL https://exoplanets.nasa.gov/keplerscience/. Accessed on 30 Jan 2023
Michele, J., Brian, D.: Liftoff of the kepler spacecraft, 2017. URL https://www.nasa.gov/mission_pages/kepler/launch/index.html. Accessed on 02 Feb 2023
Rick, C., Brian, D.: Briefing materials: Nasa retires the kepler space telescope, 2018. URL https://www.nasa.gov/kepler/presskit. Accessed on 02 Feb 2023
Hönes, C.J., Miller, B.K., Heras, A.M., Foing, B.H.: Automatically detecting anomalous exoplanet transits. CoRR, arXiv:2111.08679 (2021). https://doi.org/10.48550/arXiv.2111.08679
Cornachione, M.A., Bolton, A.S., Eastman, J.D., Wilson, M.L., Wang, S.X., Johnson, S.A., Sliski, D.H., McCrady, N., Wright, J.T., Plavchan, P., Johnson, J.A., Horner, J., Wittenmyer, R.A.: A full implementation of spectro-perfectionism for precise radial velocity exoplanet detection: A test case with the minerva reduction pipeline. Publ. Astron. Soc. Pacific 131, 124503 (2019). https://doi.org/10.1088/1538-3873/ab4103
Zaleski, S.M., Valio, A., Marsden, S.C., Carter, B.D.: Differential rotation of kepler-71 via transit photometry mapping of faculae and starspots. Mon. Not. R. Astron. Soc. 484(3), 618–630 (2019). https://doi.org/10.1093/mnras/sty3474. (ISSN 0035-8711)
Treu, T., Marshall, P.J., Clowe, D.: Resource letter gl-1: Gravitational lensing. Am. J. Phys. 80(9), 753–763 (2012). https://doi.org/10.1119/1.4726204. (ISSN 0002-9505)
Kane, S.R., Dalba, P.A., Li, Z., Horch, E.P., Hirsch, L.A., Horner, J., Wittenmyer, R.A., Howell, S.B., Everett, M.E., Paul Butler, R., Tinney, C.G., Carter, B.D., Wright, D.J., Jones, H.R.A., Bailey, J., O’Toole, S.J.: Detection of planetary and stellar companions to neighboring stars via a combination of radial velocity and direct imaging techniques. Astron. J. 157, 252 (2019). https://doi.org/10.3847/1538-3881/ab1ddf
Deqing, R., Mohanakrishna, R., Christian, D.J.: A host-star calibration based polarimeter for earth-like exoplanet imaging. Publ. Astron. Soc. Pac. 131(11), 115004 (2019). https://doi.org/10.1088/1538-3873/ab33ca. (ISSN 0004-6280)
Lacour, S., Nowak, M., Wang, J., Pfuhl, O., Eisenhauer, F., Abuter, R., Amorim, A., Anugu, N., Benisty, M., Berger, J.P., Beust, H., Blind, N., Bonnefoy, M., Bonnet, H., Bourget, P., Brandner, W., Buron, A., Collin, C., Charnay, B., Chapron, F., Clénet, Y., Coudé du Foresto, V., de Zeeuw, P.T., Deen, C., Dembet, R., Dexter, J., Duvert, G., Eckart, A., Förster Schreiber, N.M., Fédou, P., Garcia, P., Garcia Lopez, R., Gao, F., Gendron, E., Genzel, R., Gillessen, S., Gordo, P., Greenbaum, A., Habibi, M., Haubois, X., Haußmann, F., Henning, T., Hippler, S., Horrobin, M., Hubert, Z., Jimenez Rosales, A., Jocou, L., Kendrew, S., Kervella, P., Kolb, J., Lagrange, A.-M., Lapeyrère, V., Le Bouquin, J.-B., Léna, P., Lippa, M., Lenzen, R., Maire, A.-L., Mollière, P., Ott, T., Paumard, T., Perraut, K., Perrin, G., Pueyo, L., Rabien, S., Ramírez, A., Rau, C., Rodríguez-Coira, G., Rousset, G., Sanchez-Bermudez, J., Scheithauer, S., Schuhler, N., Straub, O., Straubmeier, C., Sturm, E., Tacconi, L.J., Vincent, F., van Dishoeck, E.F., von Fellenberg, S., Wank, I., Waisberg, I., Widmann, F., Wieprecht, E., Wiest, M., Wiezorrek, E., Woillez, J., Yazici, S., Ziegler, D., Zins, G.: First direct detection of an exoplanet by optical interferometry. Astron. Astrophys. 623, L11 (2019). https://doi.org/10.1051/0004-6361/201935253
Asif Amin, R.M., Khan, A.T., Tasnim Raisa, Z., Chisty, N., SamihaKhan, S., Khaja, M.S., Rahman, R.M.: Detection of exoplanet systems in kepler light curves using adaptive neuro-fuzzy system. In 2018 International Conference on Intelligent Systems (IS),. IEEE 9, 66–72 (2018). https://doi.org/10.1109/IS.2018.8710502. (ISBN 978-1-5386-7097-2.)
Singh, S.P., Misra, D.K.: Exoplanet hunting in deep space with machine learning. Int. J. Res. Eng. Sci. Manag. 3, 187–192 (2020)
Jang, J.S.R., Sun, C.T., Mizutani, E.: A Computational Approach to Learning and Machine Intelligence. Neuro-fuzzy and Soft Computing, Prentice Hall, Hoboken (1997)
Wold, S., Esbensen, K., Geladi, P.: Principal component analysis. Chemom. Intell. Lab. Syst. 2(8), 37–52 (1987). https://doi.org/10.1016/0169-7439(87)80084-9
Berndt, D.J., Clifford, J.: Using dynamic time warping to find patterns in time series. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, pp. 359–370. AAAI Press, (1994)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(6), 321–357 (2002). https://doi.org/10.1613/jair.953. (ISSN 1076-9757)
Elkan, C.: The foundations of cost-sensitive learning. Int. Joint Conf. Artif. Intell. 17, 973–978 (2001)
Wolpert, D.H.: Stacked generalization. Neural Netw. 5(1), 241–259 (1992). https://doi.org/10.1016/S0893-6080(05)80023-1. (ISSN 08936080)
Wang, G., Hao, J., Ma, J., Jiang, H.: A comparative assessment of ensemble learning for credit scoring. Expert Syst. Appl. 38(1), 223–230 (2011). https://doi.org/10.1016/j.eswa.2010.06.048
Woodward, D., Stevens, E., Linstead, E.: Generating transit light curves with variational autoencoders. In 2019 IEEE International Conference on Space Mission Challenges for Information Technology (SMC-IT), IEEE, 7, 24–32 (2019). https://doi.org/10.1109/SMC-IT.2019.00008. (ISBN 978-1-7281-1545-0)
Rob, G., Brian, D.: About tess, 2020. URL https://www.nasa.gov/content/about-tess. Accessed on 30 Jan 2023
Massey, F.J.: The kolmogorov-smirnov test for goodness of fit. J. Am. Stat. Assoc. 46, 68–78 (1951)
Hsu, C.-W., Lin, C.-J.: A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw. 13(3), 415–425 (2002). https://doi.org/10.1109/72.991427. (ISSN 10459227)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
Barandas, M., Folgado, D., Fernandes, L., Santos, S., Abreu, M., Bota, P., Liu, H., Schultz, T., Gamboa, H.: Tsfel: Time series feature extraction library. SoftwareX 11(1), 100456 (2020). https://doi.org/10.1016/j.softx.2020.100456. (ISSN 23527110)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É.: Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011). (ISSN 15324435)
Wickham, H.: ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York (2016)
Geurts, P.: Pattern extraction for time series classification. In Principles of Data Mining and Knowledge Discovery, pp. 115–127. Springer, Berlin, Heidelberg (2001)
Ge, L., Ge, L.-J.: Feature extraction of time series classification based on multi-method integration. Optik 127(12), 11070–11074 (2016). https://doi.org/10.1016/j.ijleo.2016.08.089. (ISSN 0030402)
Zheng, Y., Si, Y.-W., Wong, R.: Feature extraction for chart pattern classification in financial time series. Knowl. Inf. Syst. 63(7), 1807–1848 (2021). https://doi.org/10.1007/s10115-021-01569-1
Osborn, D.R., Chui, A.P.L., Smith, J.P., Birchenhall, C.R.: Seasonality and the order of integration for consumption. Oxf. Bull. Econ. Stat. 50(5), 361–377 (1988). https://doi.org/10.1111/j.1468-0084.1988.mp50004002.x. (ISSN 03059049)
Kwiatkowski, D., Phillips, P.C.B., Schmidt, P., Shin, Y.: Testing the null hypothesis of stationarity against the alternative of a unit root. J. Econ. 54(10), 159–178 (1992). https://doi.org/10.1016/0304-4076(92)90104-Y. (ISSN 03044076)
White, H.: A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48(5), 817–838 (1980). https://doi.org/10.2307/1912934. (ISSN 00129682)
Crutchfield, J.P., Feldman, D.P.: Regularities unseen, randomness observed: levels of entropy convergence. Chaos Interdiscip. J. Nonlinear Sci. 13(3), 25–54 (2003). https://doi.org/10.1063/1.1530990
Hyunju, K., Gabriele, V., Jake, H., Sara, I.W.: Informational architecture across non-living and living collectives. Theory Biosci. 140, 325–341 (2021). https://doi.org/10.1007/s12064-020-00331-5
Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf
Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, pp. 144–152. ACM (1992). https://doi.org/10.1145/130385.130401. (ISBN 089791497X)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20 (1995). https://doi.org/10.1023/A:1022627411411. (ISSN 15730565)
Vapnik, V.N.: The nature of statistical learning theory. Springer, New York (1995)
Hastie, T., Tibshirani, R., Friedman, J.: The elements of statistical learning: data mining, inference, and prediction, vol. 2. Springer, Berlin (2009). https://doi.org/10.1007/978-0-387-84858-7
Ho, T.K.: Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE Comput. Soc. Press (1995). https://doi.org/10.1109/ICDAR.1995.598994. (ISBN 0-8186-7128-9)
Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001). https://doi.org/10.1023/A:1010933404324
Chen, T., Guestrin, C.: Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 785–794. ACM (2016). https://doi.org/10.1145/2939672.2939785. (ISBN 9781450342322)
Fix, E., Hodges, J.L.: Discriminatory analysis - nonparametric discrimination: Consistency properties. Technical report, USAF School of Aviation Medicine, Technical Report 4, Project 21-49-004, Randolph Field, Texas, (1951)
Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1), 21–27 (1967). https://doi.org/10.1109/TIT.1967.1053964. (ISSN 0018-9448)
Berkson, J.: Application of the logistic function to bio-assay. J. Am. Stat. Assoc. 39(9), 357–365 (1944)
Hosmer, D.W., Lemeshow, S., Sturdivant, R.X.: Applied logistic regression. Wiley, New York (2013). https://doi.org/10.1002/9781118548387
Acknowledgements
Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.
Author information
Authors and Affiliations
Contributions
JP conceived the idea and methodology of this work and developed code to run experiments on the available data. JA closely collaborated with JP by performing experiments on normalized data. JP and JA analysed the results and drafted the manuscript. FR provided guidance throughout the research process, from the initial conceptualization to the interpretation of the results, as well as critical thinking about the reasons behind certain results. All authors edited, read, reviewed and approved the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
All the authors declared that they have no conflict of interest.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Machine learning methods
A.1 Support vector machines
SVMs, initially proposed by Boser et al. [50] and extended by Cortes and Vapnik [51], are algorithms for classification tasks that learn, from a training set, functions that discriminate between classes. This discrimination is achieved by constructing one or more hyper-planes in a high-dimensional space. These hyper-planes separate the classes with the maximum margin, i.e., the largest possible distance between the hyper-plane and the closest data points from each class. By doing so, the algorithm finds the most robust decision boundary, which helps it generalize to new data [52].
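The margin can be made concrete with a few lines of Python. The following is a minimal sketch, not the training procedure of an SVM: it only measures the margin of two hand-picked hyper-planes on hypothetical toy data, to show why a boundary placed midway between the classes is preferred. The function name `geometric_margin` and the data points are assumptions introduced for illustration.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labelled point to the hyper-plane w.x + b = 0.

    Labels are +1 or -1; a negative result means a point lies on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical 2-D toy data: two classes separated along the first axis.
points = [(1.0, 1.0), (2.0, 0.0), (4.0, 1.0), (5.0, 0.0)]
labels = [-1, -1, 1, 1]

# Two candidate separating hyper-planes: x1 = 3 (midway) and x1 = 2.5 (off-centre).
centred = geometric_margin((1.0, 0.0), -3.0, points, labels)
off_centre = geometric_margin((1.0, 0.0), -2.5, points, labels)

print(centred, off_centre)
```

Both hyper-planes classify every point correctly, but the midway one keeps the closest points twice as far from the boundary, which is the criterion an SVM optimizes.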
SVMs are particularly useful when dealing with data sets whose features have a nonlinear relationship with the response variable. This is accomplished by mapping the original feature space into a higher-dimensional space using a kernel function. This new space is then used to find a nonlinear decision boundary [53]. SVMs are also useful for handling data with high dimensionality, i.e., when the number of features is considerably larger than the number of observations [53]. Lastly, these algorithms are noticeably robust against over-fitting [51].
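As an illustration of the kernel mapping, the sketch below checks numerically that a degree-2 polynomial kernel equals an ordinary dot product in an explicitly constructed higher-dimensional space. The choice of kernel, the helper names (`poly2_kernel`, `explicit_map`, `dot`) and the input vectors are assumptions made for this example; the kernels used in the experiments may differ.

```python
import math

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel on 2-D inputs: k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

def explicit_map(x):
    """Feature map phi with phi(x) . phi(z) == (x . z)^2 for 2-D inputs."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, -1.0)

k_implicit = poly2_kernel(x, z)                      # computed in the original 2-D space
k_explicit = dot(explicit_map(x), explicit_map(z))   # computed in the mapped 3-D space

print(k_implicit, k_explicit)
```

The two values agree up to floating-point rounding, so a linear boundary in the mapped space can be learned without ever computing the mapping itself.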
A.2 Random forests
Based on the work of Ho [54] and introduced by Breiman [55], RFs are ensembles of several decision trees, each trained on a random subset of data and features. When predicting a new value, the predictions of all trees are aggregated, i.e., the class with the most votes is the predicted one [53].
The random aspect of RFs helps in preventing over-fitting, as the noise of individual trees is averaged out, improving the generalization capacity of the models and making their predictions more robust and accurate in comparison to individual decision trees [53, 55]. Furthermore, RFs also have the ability to handle high-dimensional data and provide estimates of feature importance [53]. Nonetheless, Chen and Guestrin [56] discussed that RFs can be prone to over-fitting in data sets with a high number of features, but suggested that this can be mitigated by using feature sub-sampling and bagging.
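The bagging, feature sub-sampling and voting steps described above can be sketched in plain Python. To stay short, this sketch swaps full decision trees for one-split trees (decision stumps), so it illustrates the ensemble mechanics rather than a complete RF; all function names, the toy data and the seed are assumptions made for this example.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Fit a one-split tree on a single feature by minimising training errors."""
    values = sorted({row[feature] for row in rows})
    majority = Counter(labels).most_common(1)[0][0]
    best_errors, best_stump = len(labels) + 1, None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        if not left or not right:
            continue
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if errors < best_errors:
            best_errors, best_stump = errors, (feature, threshold, left_label, right_label)
    # Fall back to a constant prediction when the sample offers no usable split.
    return best_stump or (feature, float("inf"), majority, majority)

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bagging: bootstrap sample of the training rows, drawn with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Feature sub-sampling: each stump only sees one randomly chosen feature.
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters in two features.
rows = [(0.0, 0.2), (0.5, 0.9), (1.0, 0.4), (0.3, 0.0),
        (5.0, 5.5), (5.6, 6.0), (6.0, 5.1), (5.2, 5.8)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(rows, labels)
print(predict_forest(forest, (0.4, 0.6)), predict_forest(forest, (5.5, 5.5)))
```

Because every stump is fitted on a different bootstrap sample and feature, an occasional poor stump is outvoted by the rest of the ensemble.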
A.3 K-Nearest neighbours
KNN, proposed by Fix and Hodges [57] and later extended by Cover and Hart [58], is a nonparametric algorithm, i.e., an algorithm that makes no assumption about the form of the mapping function, that can be used in a supervised or unsupervised context. This algorithm is based on the assumption that similar data points have similar response values [53]. The k nearest neighbours of the point being predicted, with k being a hyper-parameter of the algorithm, are found in feature space with regard to some distance measure, and the outcome is the most common class among them [53].
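A minimal sketch of the classification rule follows, assuming Euclidean distance as the distance measure and k = 3; the function name `knn_predict` and the toy data are hypothetical and only serve the illustration.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    by_distance = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance (assumed)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two clusters in a 2-D feature space.
train_rows = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_labels = [0, 0, 0, 1, 1, 1]

near_first_cluster = knn_predict(train_rows, train_labels, (0.5, 0.5))
near_second_cluster = knn_predict(train_rows, train_labels, (5.5, 5.5))
print(near_first_cluster, near_second_cluster)
```

No model is fitted in advance: all the work happens at prediction time, which is why the choice of k and of the distance measure fully determines the behaviour.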
A.4 Logistic regression
Introduced by Berkson [59], LR is a generalized linear model used in binary classification tasks that calculates the probability of the input belonging to each of the two classes. This probability is modelled with the logistic (sigmoid) function, i.e., \( P(Y = 1 | x) = \frac{1}{1 + e^{-(\beta _0 + \sum _{i = 1}^{N} \beta _i \times x_i)}} \), where x is a vector of length N that represents the feature values of the input data and \(\beta \) is a vector containing the weights for each feature as well as \(\beta _0\), which represents the intercept, i.e., the log-odds of the target variable when all feature values are zero. This function maps the linear combination of features into a probability [53, 60]. Additionally, the coefficients of this model are estimated by maximizing the likelihood function of the data, which minimizes the errors between the predicted probabilities and the actual values of the target variable [60].
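The formula and the estimation step can be sketched as follows. This is an illustrative implementation under stated assumptions: the coefficients are estimated by plain gradient ascent on the log-likelihood with a fixed step size, which is only one of several ways to maximize it, and the function names, step size, iteration count and toy data are all hypothetical.

```python
import math

def predict_proba(beta0, beta, x):
    """P(Y = 1 | x): the logistic function applied to beta0 + sum(beta_i * x_i)."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta0, beta, rows, labels):
    """Log-likelihood of binary labels (0 or 1) under the logistic model."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = predict_proba(beta0, beta, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit(rows, labels, step_size=0.05, steps=2000):
    """Estimate the coefficients by gradient ascent on the log-likelihood."""
    beta0, beta = 0.0, [0.0] * len(rows[0])
    for _ in range(steps):
        # The gradient is built from the residuals y - P(Y = 1 | x).
        residuals = [y - predict_proba(beta0, beta, x) for x, y in zip(rows, labels)]
        beta0 += step_size * sum(residuals)
        beta = [
            b + step_size * sum(r * x[i] for r, x in zip(residuals, rows))
            for i, b in enumerate(beta)
        ]
    return beta0, beta

# Hypothetical toy data: one feature, with overlapping classes.
rows = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
labels = [0, 0, 1, 0, 1, 1]

beta0, beta = fit(rows, labels)
before = log_likelihood(0.0, [0.0], rows, labels)
after = log_likelihood(beta0, beta, rows, labels)
print(beta0, beta, before, after)
```

With all coefficients at zero, every input receives probability 0.5; fitting raises the log-likelihood and yields a positive weight, so larger feature values map to higher probabilities of the positive class.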
Appendix B OCSB test results
The results of the OCSB test are displayed in Fig. 9. Several seasonal frequencies (from 10 to 1490 in increments of 10 units) were assessed for all time-series; however, only a few stars from the majority class were detected as having a seasonal component. These range from first to third differences.
Appendix C Performance analysis
Figure 10 contains several boxplots that represent the value distributions of performance metrics obtained from training and evaluating models on the original test set. The metrics were grouped by data transformation (X-axis), with each metric being represented as a facet of the plot.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Pimentel, J., Amorim, J. & Rudzicz, F. Feature extraction for exoplanet detection. Int J Data Sci Anal (2024). https://doi.org/10.1007/s41060-024-00552-7
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s41060-024-00552-7