Abstract

Detecting potentially habitable planets outside our solar system is a growing field of study. Among several other topics, this field aims to classify stars using the transit method, i.e., using their light intensity measured over time to spot the moment when an orbiting planet passes in front of the star and covers part of it as seen by a satellite. We propose a novel approach to such classification, using a set of features extracted from individual time-series that covers three different domains: temporal, statistical, and spectral. These features are filtered based on relevant measures and used to train and evaluate models on Kepler data. The results were compared to state-of-the-art methods evaluated on the same data set and surpassed existing approaches for some data transformations. All these transformations are related to making the time-series naïvely stationary before feature extraction. Using principal components extracted from the feature set during model training did not have a considerable impact on results. To better evaluate the results, a cross-validation process was performed to eliminate data set bias. During this step, the best model achieved \(100\%\) recall and \(98.82\%\) F1-score for the minority class. In the future, testing additional feature selection methods and assessing feature importance using more explainable metrics will be crucial to further understand the distinctions that separate stars with exoplanets from those without.

Data availability

The original data set used in this work is available on Kaggle. All transformed data can be accessed on Github.

Code Availability

All code produced during the course of this work is available on Github.

Notes

  1. https://archive.stsci.edu/missions-and-data/k2.

  2. https://www.kaggle.com/datasets/keplersmachines/kepler-labelled-time-series-data.

References

  1. Priyadarshini, I., Puri, V.: A convolutional neural network (cnn) based ensemble model for exoplanet detection. Earth Sci. Inf. 14(6), 735–747 (2021). https://doi.org/10.1007/s12145-021-00579-5. (ISSN 1865-0473.)

  2. Jara-Maldonado, M., Alarcon-Aquino, V., Rosas-Romero, R., Starostenko, O., Ramirez-Cortes, J.M.: Transiting exoplanet discovery using machine learning techniques: a survey. Earth Sci. Inf. 13(9), 573–600 (2020). https://doi.org/10.1007/s12145-020-00464-7. (ISSN 1865-0473)

  3. Tyagi, N., Arora, P., Chaudhary, R., Bhardwaj, J.: Exoplanet hunting using machine learning. Emerg. Technol. Data Mining Inf. Secur. Proc. IEMIS 2022 1, 687–701 (2023). https://doi.org/10.1007/978-981-19-4193-1_67

  4. Bahel, V., Gaikwad, M.: A study of light intensity of stars for exoplanet detection using machine learning. In 2022 IEEE Region 10 Symposium (TENSYMP), 7, 1–5 (2022). https://doi.org/10.1109/TENSYMP54529.2022.9864366. (ISBN 978-1-6654-6658 ,IEEE)

  5. Michele, J., Brian, D.: Mission overview, (2018). URL https://www.nasa.gov/mission_pages/kepler/overview/index.html. Accessed on 02 Feb 2023

  6. Khan, M.S., Jenkins, J., Yoma, N.B.: Discovering new worlds: A review of signal processing methods for detecting exoplanets from astronomical radial velocity data [applications corner]. IEEE Signal Process. Mag. 34(1), 104–115 (2017). https://doi.org/10.1109/MSP.2016.2617293. (ISSN 1053-5888)

  7. Auvergne, M., Bodin, P., Boisnard, L., Buey, J.-T., Chaintreuil, S., Epstein, G., Jouret, M., Lam-Trong, T., Levacher, P., Magnan, A., Perez, R., Plasson, P., Plesseria, J., Peter, G., Steller, M., Tiphène, D., Baglin, A., Agogué, P., Appourchaux, T., Barbet, D., Beaufort, T., Bellenger, R., Berlin, R., Bernardi, P., Blouin, D., Boumier, P., Bonneau, F., Briet, R., Butler, B., Cautain, R., Chiavassa, F., Costes, V., Cuvilho, J., Cunha-Parro, V., De Oliveira Fialho, F., Decaudin, M., Defise, J.-M., Djalal, S., Docclo, A., Drummond, R., Dupuis, O., Exil, G., Fauré, C., Gaboriaud, A., Gamet, P., Gavalda, P., Grolleau, E., Gueguen, L., Guivarc’h, V., Guterman, P., Hasiba, J., Huntzinger, G., Hustaix, H., Imbert, C., Jeanville, G., Johlander, B., Jorda, L., Journoud, P., Karioty, F., Kerjean, L., Lafond, L., Lapeyrere, V., Landiech, P., Larqué, T., Laudet, P., Le Merrer, J., Leporati, L., Leruyet, B., Levieuge, B., Llebaria, A., Martin, L., Mazy, E., Mesnager, J.-M., Michel, J.-P., Moalic, J.-P., Monjoin, W., Naudet, D., Neukirchner, S., Nguyen-Kim, K., Ollivier, M., Orcesi, J.-L., Ottacher, H., Oulali, A., Parisot, J., Perruchot, S., Piacentino, A., Pinheiro da Silva, L., Platzer, J., Pontet, B., Pradines, A., Quentin, C., Rohbeck, U., Rolland, G., Rollenhagen, F., Romagnan, R., Russ, N., Samadi, R., Schmidt, R., Schwartz, N., Sebbag, I., Smit, H., Sunter, W., Tello, M., Toulouse, P., Ulmer, B., Vandermarcq, O., Vergnault, E., Wallner, R., Waultier, G., Zanatta, P.: The corot satellite in flight: description and performance. Astron. Astrophys. 506, 411–424 (2009). https://doi.org/10.1051/0004-6361/200810860

  8. Ricker, G.R., Winn, J.N., Vanderspek, R., Latham, D.W., Bakos, G.Á., Bean, J.L., Berta-Thompson, Z.K., Brown, T.M., Buchhave, L., Butler, N.R., Paul Butler, R., Chaplin, W.J., Charbonneau, D., Christensen-Dalsgaard, J., Clampin, M., Deming, D., Doty, J., De Lee, N., Dressing, C., Dunham, E.W., Endl, M., Fressin, F., Ge, J., Henning, T., Holman, M.J., Howard, A.W., Ida, S., Jenkins, J.M., Jernigan, G., Johnson, J.A., Kaltenegger, L., Kawai, N., Kjeldsen, H., Laughlin, G., Levine, A.M., Lin, D., Lissauer, J.J., MacQueen, P., Marcy, G., McCullough, P.R., Morton, T.D., Narita, N., Paegert, M., Palle, E., Pepe, F., Pepper, J., Quirrenbach, A., Rinehart, S.A., Sasselov, D., Bun’Sato, S.S., Sozzetti, A., Stassun, K.G., Sullivan, P., Szentgyorgyi, A., Torres, G., Udry, S., Villasenor, J.: Transiting exoplanet survey satellite. J. Astron. Telesc. Instrum. Syst. 1(10), 014003 (2014)

  9. Doyle, L.R., Carter, J.A., Fabrycky, D.C., Slawson, R.W., Howell, S.B., Winn, J.N., Orosz, J.A., Prsa, A., Welsh, W.F., Quinn, S.N., Latham, D., Torres, G., Buchhave, L.A., Marcy, G.W., Fortney, J.J., Shporer, A., Ford, E.B., Lissauer, J.J., Ragozzine, D., Rucker, M., Batalha, N., Jenkins, J.M., Borucki, W.J., Koch, D., Middour, C.K., Hall, J.R., McCauliff, S., Fanelli, M.N., Quintana, E.V., Holman, M.J., Caldwell, D.A., Still, M., Stefanik, R.P., Brown, W.R., Esquerdo, G.A., Tang, S., Furesz, G., Geary, J.C., Berlind, P., Calkins, M.L., Short, D.R., Steffen, J.H., Sasselov, D., Dunham, E.W., Cochran, W.D., Boss, A., Haas, M.R., Buzasi, D., Fischer, D.: Kepler-16: a transiting circumbinary planet. Science 333, 1602–1606 (2011). https://doi.org/10.1126/science.1210923

  10. Borucki, W.J., Koch, D.G., Batalha, N., Bryson, S.T., Rowe, J., Fressin, F., Torres, G., Caldwell, D.A., Christensen-Dalsgaard, J., Cochran, W.D., DeVore, E., Gautier, T.N., Geary, J.C., Gilliland, R., Gould, A., Howell, S.B., Jenkins, J.M., Latham, D.W., Lissauer, J.J., Marcy, G.W., Sasselov, D., Boss, A., Charbonneau, D., Ciardi, D., Kaltenegger, L., Doyle, L., Dupree, A.K., Ford, E.B., Fortney, J., Holman, M.J., Steffen, J.H., Mullally, F., Still, M., Tarter, J., Ballard, S., Buchhave, L.A., Carter, J., Christiansen, J.L., Demory, B.-O., Désert, J.-M., Dressing, C., Endl, M., Fabrycky, D., Fischer, D., Haas, M.R., Henze, C., Horch, E., Howard, A.W., Isaacson, H., Kjeldsen, H., Johnson, J.A., Klaus, T., Kolodziejczak, J., Barclay, T., Li, J., Meibom, S., Prsa, A., Quinn, S.N., Quintana, E.V., Robertson, P., Sherry, W., Shporer, A., Tenenbaum, P., Thompson, S.E., Twicken, J.D., Van Cleve, J., Welsh, W.F., Basu, S., Chaplin, W., Miglio, A., Kawaler, S.D., Arentoft, T., Stello, D., Metcalfe, T.S., Verner, G.A., Karoff, C., Lundkvist, M., Lund, M.N., Handberg, R., Elsworth, Y., Hekker, S., Huber, D., Bedding, T.R., Rapin, W.: Kepler-22b: A 2.4 earth-radius planet in the habitable zone of a sun-like star. Astrophys. J. 745, 120 (2012). https://doi.org/10.1088/0004-637X/745/2/120

  11. Neubauer, D., Vrtala, A., Leitner, J.J., Firneis, M.G., Hitzenberger, R.: The life supporting zone of kepler-22b and the kepler planetary candidates: Koi268.01, koi701.03, koi854.01 and koi1026.01. Planet. Space Sci. 73(12), 397–406 (2012). https://doi.org/10.1016/j.pss.2012.07.020. (ISSN 00320633 Solar System science before and after Gaia)

  12. Quintana, E.V., Barclay, T., Raymond, S.N., Rowe, J.F., Bolmont, E., Caldwell, D.A., Howell, S.B., Kane, S.R., Huber, D., Crepp, J.R., Lissauer, J.J., Ciardi, D.R., Coughlin, J.L., Everett, M.E., Henze, C.E., Horch, E., Isaacson, H., Ford, E.B., Adams, F.C., Still, M., Hunter, R.C., Quarles, B., Selsis, F.: An earth-sized planet in the habitable zone of a cool star. Science 344, 277–280 (2014). https://doi.org/10.1126/science.1249403

  13. Barnes, R., Raymond, S.N., Greenberg, R., Jackson, B., Kaib, N.A.: Corot-7b: Super-earth or super-io? Astrophys. J. 709(2), L95–L98 (2010). https://doi.org/10.1088/2041-8205/709/2/L95. (ISSN 2041-8205)

  14. Pat, B., Kristen, W., Anya, B.: Kepler’s legacy: discoveries and more, 2020. URL https://exoplanets.nasa.gov/keplerscience/. Accessed on 30 Jan 2023

  15. Michele, J., Brian, D.: Liftoff of the kepler spacecraft, 2017. URL https://www.nasa.gov/mission_pages/kepler/launch/index.html. Accessed on 02 Feb 2023

  16. Rick, C., Brian, D.: Briefing materials: Nasa retires the kepler space telescope, 2018. URL https://www.nasa.gov/kepler/presskit. Accessed on 02 Feb 2023

  17. Hönes, C.J., Miller, B.K., Heras, A.M., Foing, B.H.: Automatically detecting anomalous exoplanet transits. CoRR, arXiv:2111.08679 (2021). https://doi.org/10.48550/arXiv.2111.08679

  18. Cornachione, M.A., Bolton, A.S., Eastman, J.D., Wilson, M.L., Wang, S.X., Johnson, S.A., Sliski, D.H., McCrady, N., Wright, J.T., Plavchan, P., Johnson, J.A., Horner, J., Wittenmyer, R.A.: A full implementation of spectro-perfectionism for precise radial velocity exoplanet detection: A test case with the minerva reduction pipeline. Publ. Astron. Soc. Pacific 131, 124503 (2019). https://doi.org/10.1088/1538-3873/ab4103

  19. Zaleski, S.M., Valio, A., Marsden, S.C., Carter, B.D.: Differential rotation of kepler-71 via transit photometry mapping of faculae and starspots. Mon. Not. R. Astron. Soc. 484(3), 618–630 (2019). https://doi.org/10.1093/mnras/sty3474. (ISSN 0035-8711)

  20. Treu, T., Marshall, P.J., Clowe, D.: Resource letter gl-1: Gravitational lensing. Am. J. Phys. 80(9), 753–763 (2012). https://doi.org/10.1119/1.4726204. (ISSN 0002-9505)

  21. Kane, S.R., Dalba, P.A., Li, Z., Horch, E.P., Hirsch, L.A., Horner, J., Wittenmyer, R.A., Howell, S.B., Everett, M.E., Paul Butler, R., Tinney, C.G., Carter, B.D., Wright, D.J., Jones, H.R.A., Bailey, J., O’Toole, S.J.: Detection of planetary and stellar companions to neighboring stars via a combination of radial velocity and direct imaging techniques. Astron. J. 157, 252 (2019). https://doi.org/10.3847/1538-3881/ab1ddf

  22. Deqing, R., Mohanakrishna, R., Christian, D.J.: A host-star calibration based polarimeter for earth-like exoplanet imaging. Publ. Astron. Soc. Pac. 131(11), 115004 (2019). https://doi.org/10.1088/1538-3873/ab33ca. (ISSN 0004-6280)

  23. Lacour, S., Nowak, M., Wang, J., Pfuhl, O., Eisenhauer, F., Abuter, R., Amorim, A., Anugu, N., Benisty, M., Berger, J.P., Beust, H., Blind, N., Bonnefoy, M., Bonnet, H., Bourget, P., Brandner, W., Buron, A., Collin, C., Charnay, B., Chapron, F., Clénet, Y., Coudé du Foresto, V., de Zeeuw, P.T., Deen, C., Dembet, R., Dexter, J., Duvert, G., Eckart, A., Förster Schreiber, N.M., Fédou, P., Garcia, P., Garcia Lopez, R., Gao, F., Gendron, E., Genzel, R., Gillessen, S., Gordo, P., Greenbaum, A., Habibi, M., Haubois, X., Haußmann, F., Henning, T., Hippler, S., Horrobin, M., Hubert, Z., Jimenez Rosales, A., Jocou, L., Kendrew, S., Kervella, P., Kolb, J., Lagrange, A.-M., Lapeyrère, V., Le Bouquin, J.-B., Léna, P., Lippa, M., Lenzen, R., Maire, A.-L., Mollière, P., Ott, T., Paumard, T., Perraut, K., Perrin, G., Pueyo, L., Rabien, S., Ramírez, A., Rau, C., Rodríguez-Coira, G., Rousset, G., Sanchez-Bermudez, J., Scheithauer, S., Schuhler, N., Straub, O., Straubmeier, C., Sturm, E., Tacconi, L.J., Vincent, F., van Dishoeck, E.F., von Fellenberg, S., Wank, I., Waisberg, I., Widmann, F., Wieprecht, E., Wiest, M., Wiezorrek, E., Woillez, J., Yazici, S., Ziegler, D., Zins, G.: First direct detection of an exoplanet by optical interferometry. Astron. Astrophys. 623, L11 (2019). https://doi.org/10.1051/0004-6361/201935253

  24. Asif Amin, R.M., Khan, A.T., Tasnim Raisa, Z., Chisty, N., SamihaKhan, S., Khaja, M.S., Rahman, R.M.: Detection of exoplanet systems in kepler light curves using adaptive neuro-fuzzy system. In 2018 International Conference on Intelligent Systems (IS), pp. 66–72. IEEE (2018). https://doi.org/10.1109/IS.2018.8710502. (ISBN 978-1-5386-7097-2)

  25. Singh, S.P., Misra, D.K.: Exoplanet hunting in deep space with machine learning. Int. J. Res. Eng. Sci. Manag. 3, 187–192 (2020)

  26. Jang, J.S.R., Sun, C.T., Mizutani, E.: Neuro-fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Prentice Hall, Hoboken (1997)

  27. Wold, S., Esbensen, K., Geladi, P.: Principal component analysis. Chemom. Intell. Lab. Syst. 2(8), 37–52 (1987). https://doi.org/10.1016/0169-7439(87)80084-9

  28. Berndt, D.J., Clifford, J.: Using dynamic time warping to find patterns in time series. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, pp. 359–370. AAAI Press, (1994)

  29. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(6), 321–357 (2002). https://doi.org/10.1613/jair.953. (ISSN 1076-9757)

  30. Elkan, C.: The foundations of cost-sensitive learning. Int. Joint Conf. Artif. Intell. 17, 973–978 (2001)

  31. Wolpert, D.H.: Stacked generalization. Neural Netw. 5(1), 241–259 (1992). https://doi.org/10.1016/S0893-6080(05)80023-1. (ISSN 08936080)

  32. Wang, G., Hao, J., Ma, J., Jiang, H.: A comparative assessment of ensemble learning for credit scoring. Expert Syst. Appl. 38(1), 223–230 (2011). https://doi.org/10.1016/j.eswa.2010.06.048

  33. Woodward, D., Stevens, E., Linstead, E.: Generating transit light curves with variational autoencoders. In 2019 IEEE International Conference on Space Mission Challenges for Information Technology (SMC-IT), IEEE, 7, 24–32 (2019). https://doi.org/10.1109/SMC-IT.2019.00008. (ISBN 978-1-7281-1545-0)

  34. Rob, G., Brian, D.: About tess, 2020. URL https://www.nasa.gov/content/about-tess. Accessed on 30 Jan 2023

  35. Massey, F.J.: The kolmogorov-smirnov test for goodness of fit. J. Am. Stat. Assoc. 46, 68–78 (1951)

  36. Hsu, C.-W., Lin, C.-J.: A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw. 13(3), 415–425 (2002). https://doi.org/10.1109/72.991427. (ISSN 10459227)

  37. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)

  38. Barandas, M., Folgado, D., Fernandes, L., Santos, S., Abreu, M., Bota, P., Liu, H., Schultz, T., Gamboa, H.: Tsfel: Time series feature extraction library. SoftwareX 11(1), 100456 (2020). https://doi.org/10.1016/j.softx.2020.100456. (ISSN 23527110)

  39. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É.: Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011). (ISSN 15324435)

  40. Wickham, H.: ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York (2016)

  41. Geurts, P.: Pattern extraction for time series classification. In Principles of Data Mining and Knowledge Discovery, pp. 115–127. Springer, Berlin, Heidelberg (2001)

  42. Ge, L., Ge, L.-J.: Feature extraction of time series classification based on multi-method integration. Optik 127(12), 11070–11074 (2016). https://doi.org/10.1016/j.ijleo.2016.08.089. (ISSN 0030402)

  43. Zheng, Y., Si, Y.-W., Wong, R.: Feature extraction for chart pattern classification in financial time series. Knowl. Inf. Syst. 63(7), 1807–1848 (2021). https://doi.org/10.1007/s10115-021-01569-1

  44. Osborn, D.R., Chui, A.P.L., Smith, J.P., Birchenhall, C.R.: Seasonality and the order of integration for consumption. Oxf. Bull. Econ. Stat. 50(5), 361–377 (1988). https://doi.org/10.1111/j.1468-0084.1988.mp50004002.x. (ISSN 03059049)

  45. Kwiatkowski, D., Phillips, P.C.B., Schmidt, P., Shin, Y.: Testing the null hypothesis of stationarity against the alternative of a unit root. J. Econ. 54(10), 159–178 (1992). https://doi.org/10.1016/0304-4076(92)90104-Y. (ISSN 03044076)

  46. White, H.: A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48(5), 817–838 (1980). https://doi.org/10.2307/1912934. (ISSN 00129682)

  47. Crutchfield, J.P., Feldman, D.P.: Regularities unseen, randomness observed: levels of entropy convergence. Chaos Interdiscip. J. Nonlinear Sci. 13(3), 25–54 (2003). https://doi.org/10.1063/1.1530990

  48. Kim, H., Valentini, G., Hanson, J., Walker, S.I.: Informational architecture across non-living and living collectives. Theory Biosci. 140, 325–341 (2021). https://doi.org/10.1007/s12064-020-00331-5

  49. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf

  50. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, pp. 144–152. ACM (1992). https://doi.org/10.1145/130385.130401. (ISBN 089791497X)

  51. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20 (1995). https://doi.org/10.1023/A:1022627411411. (ISSN 15730565)

  52. Vapnik, V.N.: The nature of statistical learning theory. Springer, New York (1995)

  53. Hastie, T., Tibshirani, R., Friedman, J.: The elements of statistical learning: data mining, inference, and prediction, vol. 2. Springer, Berlin (2009). https://doi.org/10.1007/978-0-387-84858-7

  54. Ho, T.K.: Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE Comput. Soc. Press (1995). https://doi.org/10.1109/ICDAR.1995.598994. (ISBN 0-8186-7128-9)

  55. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001). https://doi.org/10.1023/A:1010933404324

  56. Chen, T., Guestrin, C.: Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 785–794. ACM (2016). https://doi.org/10.1145/2939672.2939785. (ISBN 9781450342322)

  57. Fix, E., Hodges, J.L.: Discriminatory analysis - nonparametric discrimination: Consistency properties. Technical report, USAF School of Aviation Medicine, Technical Report 4, Project 21-49-004, Randolph Field, Texas, (1951)

  58. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1), 21–27 (1967). https://doi.org/10.1109/TIT.1967.1053964. (ISSN 0018-9448)

  59. Berkson, J.: Application of the logistic function to bio-assay. J. Am. Stat. Assoc. 39(9), 357–365 (1944)

  60. Hosmer, D.W., Lemeshow, S., Sturdivant, R.X.: Applied logistic regression. Wiley, New York (2013). https://doi.org/10.1002/9781118548387

Acknowledgements

Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

Author information

Contributions

JP conceived the idea and methodology of this work and developed code to run experiments on the available data. JA closely collaborated with JP by performing experiments on normalized data. JP and JA analysed the results and drafted the manuscript. FR provided guidance throughout the research process, from the initial conceptualization to the interpretation of the results, as well as critical thinking about the reasons behind certain results. All authors edited, read, reviewed and approved the manuscript.

Corresponding author

Correspondence to João Pimentel.

Ethics declarations

Conflict of interest

All the authors declared that they have no conflict of interest.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Machine learning methods

A.1 Support vector machines

SVMs, initially proposed by Boser et al. [50] and extended by Cortes and Vapnik [51], are algorithms for classification tasks that learn, from a training set, functions that discriminate between classes. This discrimination is achieved by constructing one or more hyper-planes in a high-dimensional space. These hyper-planes separate the data into different classes with a maximum margin, i.e., they maximize the distance between the hyper-plane and the closest data points from each class. By doing so, the algorithm finds the most robust decision boundary, one that generalizes well to new data [52].

SVMs are particularly useful when dealing with data sets whose features have a nonlinear relationship with the response variable. This is accomplished by mapping the original feature space into a higher-dimensional space using a kernel function. This new space is then used to find a nonlinear decision boundary [53]. SVMs are also useful to handle data with high dimensionality, i.e., when the number of features is considerably larger than the number of observations [53]. Lastly, these algorithms are noticeably robust against over-fitting [51].
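The kernel idea described above can be illustrated with a minimal sketch. It is not an SVM solver and not the code used in this work: the toy data, the explicit quadratic map `feature_map`, and the helper `separable_by_threshold` are all hypothetical, and the boundary chosen in the mapped space is simply a separating one, not necessarily the maximum-margin hyper-plane that an SVM would return.

```python
# Toy 1-D data (hypothetical): class 1 lies far from the origin, class 0 near it.
xs = [-3.0, -2.5, -0.5, 0.0, 0.4, 2.2, 3.1]
ys = [1, 1, 0, 0, 0, 1, 1]


def separable_by_threshold(values, labels):
    """True if a single cut on `values` puts each class on its own side."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    # Separable means the sorted labels change value at most once.
    changes = sum(a != b for a, b in zip(ordered, ordered[1:]))
    return changes <= 1


def feature_map(x):
    """Explicit quadratic map phi(x) = (x, x^2), standing in for a kernel."""
    return (x, x * x)


mapped = [feature_map(x) for x in xs]
squares = [point[1] for point in mapped]

# Place the boundary x^2 = threshold halfway between the two classes.
inner = max(s for s, y in zip(squares, ys) if y == 0)
outer = min(s for s, y in zip(squares, ys) if y == 1)
threshold = (inner + outer) / 2


def predict(x):
    """Linear rule in the mapped space, nonlinear (|x| > sqrt(threshold)) in x."""
    return 1 if feature_map(x)[1] > threshold else 0


print(separable_by_threshold(xs, ys))       # raw feature
print(separable_by_threshold(squares, ys))  # mapped feature
```

No single cut on the raw feature separates the classes, whereas a straight line in the mapped space does; a kernel function achieves the same effect without computing the map explicitly.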

A.2 Random forests

Based on the work of Ho [54] and introduced by Breiman [55], RFs are an ensemble algorithm of several decision trees, each trained on a random subset of data and features. When predicting a new value, the predictions of all trees are aggregated, i.e., the class with the most votes is the predicted one [53].

The random aspect of RFs helps in preventing over-fitting, as the noise of individual trees is averaged out, improving the generalization capacity of the models and making their predictions more robust and accurate in comparison to individual decision trees [53, 55]. Furthermore, RFs also have the ability to handle high-dimensional data and provide estimates of feature importance [53]. Nonetheless, Chen and Guestrin [56] discussed that RFs can be prone to over-fitting in data sets with a high number of features, but suggested that this can be mitigated by using feature sub-sampling and bagging.
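The aggregation step can be sketched as follows. This is a simplified stand-in rather than a full RF and not the implementation used in this work: each "tree" is a one-split decision stump, and the data, function names, and parameter values are hypothetical. It only illustrates training on bootstrap samples with a randomly chosen feature and predicting by majority vote.

```python
import random
from collections import Counter


def predict_stump(stump, row):
    """A stump is (feature index, threshold, label if <= threshold, label otherwise)."""
    feature, threshold, left, right = stump
    return left if row[feature] <= threshold else right


def fit_stump(X, y, feature):
    """Pick the single split on `feature` with the fewest training errors."""
    majority = Counter(y).most_common(1)[0][0]
    # Fallback: a constant stump that always predicts the majority class.
    best_stump = (feature, float("inf"), majority, majority)
    best_err = sum(label != majority for label in y)
    values = sorted({row[feature] for row in X})
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for left, right in ((0, 1), (1, 0)):
            candidate = (feature, threshold, left, right)
            err = sum(predict_stump(candidate, r) != l for r, l in zip(X, y))
            if err < best_err:
                best_err, best_stump = err, candidate
    return best_stump


def fit_forest(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        feature = rng.randrange(len(X[0]))
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Aggregate by majority vote, as described in the text."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data with two binary classes.
X = [(0.1, 0.3), (0.5, 0.2), (0.9, 1.0), (1.2, 0.4),
     (4.8, 5.1), (5.3, 4.7), (5.9, 5.5), (4.5, 6.0)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
print(predict_forest(forest, (0.0, 0.0)), predict_forest(forest, (6.0, 6.0)))
```

An odd number of stumps is used so that a two-class vote cannot tie.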

A.3 K-Nearest neighbours

KNN, proposed by Fix and Hodges [57] and later extended by Cover and Hart [58], is a nonparametric algorithm, i.e., an algorithm that does not assume a particular form for the mapping function, and it can be used in a supervised or unsupervised context. This algorithm is based on the assumption that similar data points have similar response values [53]. The k nearest neighbours, with k being a hyper-parameter of the algorithm, are found in feature space with respect to some distance measure, and the prediction is the most common class among the k nearest neighbours of the point being classified [53].
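The classification rule can be sketched in a few lines. The Euclidean distance, the value of k, and the toy data are assumptions made for illustration; this is not the implementation used in this work.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the most common label among the k training points closest to `query`."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups in a 2-D feature space.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (0.5, 0.5)))
print(knn_predict(train_X, train_y, (5.5, 5.5)))
```

Because distances drive the prediction, features on very different scales are usually standardized first so that no single feature dominates the distance.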

A.4 Logistic regression

Introduced by Berkson [59], LR is a generalized linear model used in binary classification tasks that calculates the probability of the input belonging to each of the two classes. This probability is modelled with the logistic/sigmoid function, i.e., \( P(Y = 1 | x) = \frac{1}{1 + e^{-(\beta _0 + \sum _{i = 1}^{N} \beta _i \times x_i)}} \), where x is a vector of length N that represents the feature values of the input data and \(\beta \) is a vector containing the weights for each feature as well as \(\beta _0\), which represents the intercept, i.e., the log-odds of the target variable when all feature values are zero. This function maps the linear combination of features into a probability [53, 60]. Additionally, the coefficients of this model are estimated by maximizing the likelihood function of the data, which minimizes the errors between the predicted probabilities and the actual values of the target variable [60].
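The formula and the likelihood-based fitting can be sketched as follows. The function names, the learning rate, and the toy data are hypothetical, and plain gradient ascent is only one of several ways to maximize the likelihood; library implementations typically use more elaborate solvers.

```python
import math


def predict_proba(beta0, beta, x):
    """P(Y = 1 | x) under the logistic model from the text."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(beta0, beta, X, y):
    """Bernoulli log-likelihood that maximum-likelihood fitting maximizes."""
    total = 0.0
    for row, label in zip(X, y):
        p = predict_proba(beta0, beta, row)
        total += math.log(p) if label == 1 else math.log(1.0 - p)
    return total


def gradient_step(beta0, beta, X, y, lr=0.1):
    """One gradient-ascent step on the log-likelihood."""
    residuals = [label - predict_proba(beta0, beta, row) for row, label in zip(X, y)]
    new_beta0 = beta0 + lr * sum(residuals)
    new_beta = [
        b + lr * sum(r * row[i] for r, row in zip(residuals, X))
        for i, b in enumerate(beta)
    ]
    return new_beta0, new_beta


# Hypothetical toy data: one feature, larger values belong to class 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
b0, b = 0.0, [0.0]
before = log_likelihood(b0, b, X, y)
b0, b = gradient_step(b0, b, X, y)
after = log_likelihood(b0, b, X, y)
print(before, after)
```

With all features at zero the model reduces to \( P(Y = 1) = 1 / (1 + e^{-\beta _0}) \), so the log-odds equal \(\beta _0\), which matches the interpretation of the intercept given above.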

Appendix B OCSB test results

The results of the OCSB test are displayed in Fig. 9. Several seasonal frequencies (from 10 to 1490 in increments of 10 units) were assessed for all time-series; however, only a few stars from the majority class were detected as having a seasonal component. These range from first to third differences.

Fig. 9
Results of the OCSB test. The Y-axis shows the count of time-series that need seasonal differencing to become stationary at each seasonal frequency (X-axis). Red indicates that one seasonal difference is needed, green two, and blue three
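For readers unfamiliar with seasonal differencing, the operation whose order the OCSB test estimates can be sketched as below. This does not implement the OCSB test itself; the function name and the toy series with a period of 4 are hypothetical and far shorter than the seasonal frequencies assessed above.

```python
def seasonal_difference(series, period, order=1):
    """Apply y_t - y_{t - period} `order` times; each pass drops `period` points."""
    out = list(series)
    for _ in range(order):
        out = [out[t] - out[t - period] for t in range(period, len(out))]
    return out


# A purely periodic toy series: one seasonal difference removes the pattern.
pattern = [1.0, 3.0, 2.0, 0.0]
series = pattern * 5
once = seasonal_difference(series, period=4)

# Adding a linear trend leaves a constant after one seasonal difference,
# and zeros after a second one.
trended = [value + 0.5 * t for t, value in enumerate(series)]
trend_once = seasonal_difference(trended, period=4)
trend_twice = seasonal_difference(trended, period=4, order=2)
print(once[:4], trend_once[:4], trend_twice[:4])
```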

Appendix C Performance analysis

Figure 10 contains several boxplots that represent the value distributions of performance metrics obtained from training and evaluating models on the original test set. The metrics were grouped by data transformation (X-axis), with each metric being represented as a facet of the plot.

Fig. 10
Minority class metric values of models trained on features extracted from time-series data. The X-axis consists of data transformations before feature extraction and the Y-axis of the values for each metric (facet)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Pimentel, J., Amorim, J. & Rudzicz, F. Feature extraction for exoplanet detection. Int J Data Sci Anal (2024). https://doi.org/10.1007/s41060-024-00552-7
