Feature extraction for exoplanet detection

Pimentel, João; Amorim, Joana; Rudzicz, Frank

doi:10.1007/s41060-024-00552-7

Abstract

Detecting possible habitable planets outside of our solar system has been a growing field of study. Among several other topics, this field aims to classify stars using the transit method, i.e., using their light intensity measured over time to spot the moment when a planet follows its orbit and covers part of the star as seen by a satellite. We propose a novel approach to such classification, using an extracted set of features from individual time-series that cover three different domains: temporal, statistical, and spectral. These features are filtered based on relevant measures, and used to train and evaluate models on Kepler data. The results were compared to state-of-the-art methods evaluated on the same data set and surpass existing approaches for some data transformations. All these transformations are related to turning the time-series naïvely stationary before feature extraction. Using principal components extracted from the feature set during model training did not have a considerable impact on results. In order to better evaluate the results, a cross-validation process was performed to eliminate data set bias. During this step, the best model achieved \(100\%\) recall and \(98.82\%\) F1-score for the minority class. In the future, testing additional feature selection methods, as well as assessing feature importance using more explainable metrics is crucial to further understand the distinctions that separate stars with exoplanets from those without.
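
As an illustration of the three feature domains named in the abstract, the following minimal Python sketch computes one feature per domain from a single light curve. The function name `extract_features`, the specific features chosen, and the synthetic flux values are assumptions made for this example; they are not the authors' feature set, which this preview does not list.

```python
import cmath
import math
import statistics


def extract_features(flux):
    """Return one illustrative feature per domain for a single light curve."""
    n = len(flux)
    mean = statistics.fmean(flux)

    # Statistical domain: spread of the flux values.
    std = statistics.pstdev(flux)

    # Temporal domain: mean absolute change between consecutive samples.
    mean_abs_diff = sum(abs(b - a) for a, b in zip(flux, flux[1:])) / (n - 1)

    # Spectral domain: strongest non-zero frequency bin of a naive DFT.
    centered = [value - mean for value in flux]
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            value * cmath.exp(-2j * math.pi * k * t / n)
            for t, value in enumerate(centered)
        )
        power.append(abs(coeff) ** 2)
    dominant_bin = 1 + power.index(max(power))

    return {
        "mean": mean,
        "std": std,
        "mean_abs_diff": mean_abs_diff,
        "dominant_bin": dominant_bin,
    }


# Synthetic light curve (assumed data): constant flux plus a sinusoid
# that completes 5 cycles within the 100-sample window.
flux = [1.0 + 0.1 * math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
features = extract_features(flux)
```

The naive DFT is quadratic in the series length, which is acceptable for a short example; a real pipeline would more likely rely on an FFT routine or a dedicated feature extraction library.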

Data availability

The original data set used in this work is available on Kaggle. All transformed data can be accessed on GitHub.

Code availability

All code produced during the course of this work is available on GitHub.

References

Priyadarshini, I., Puri, V.: A convolutional neural network (cnn) based ensemble model for exoplanet detection. Earth Sci. Inf. 14(6), 735–747 (2021). https://doi.org/10.1007/s12145-021-00579-5. (ISSN 1865-0473.)
Jara-Maldonado, M., Alarcon-Aquino, V., Rosas-Romero, R., Starostenko, O., Ramirez-Cortes, J.M.: Transiting exoplanet discovery using machine learning techniques: a survey. Earth Sci. Inf. 13(9), 573–600 (2020). https://doi.org/10.1007/s12145-020-00464-7. (ISSN 1865-0473)
Tyagi, N., Arora, P., Chaudhary, R., Bhardwaj, J.: Exoplanet hunting using machine learning. Emerg. Technol. Data Mining Inf. Secur. Proc. IEMIS 2022 1, 687–701 (2023). https://doi.org/10.1007/978-981-19-4193-1_67
Bahel, V., Gaikwad, M.: A study of light intensity of stars for exoplanet detection using machine learning. In 2022 IEEE Region 10 Symposium (TENSYMP), 7, 1–5 (2022). https://doi.org/10.1109/TENSYMP54529.2022.9864366. (ISBN 978-1-6654-6658 ,IEEE)
Michele, J., Brian, D.: Mission overview, (2018). URL https://www.nasa.gov/mission_pages/kepler/overview/index.html. Accessed on 02 Feb 2023
Khan, M.S., Jenkins, J., Yoma, N.B.: Discovering new worlds: A review of signal processing methods for detecting exoplanets from astronomical radial velocity data [applications corner]. IEEE Signal Process. Mag. 34(1), 104–115 (2017). https://doi.org/10.1109/MSP.2016.2617293. (ISSN 1053-5888)
Auvergne, M., Bodin, P., Boisnard, L., Buey, J.-T., Chaintreuil, S., Epstein, G., Jouret, M., Lam-Trong, T., Levacher, P., Magnan, A., Perez, R., Plasson, P., Plesseria, J., Peter, G., Steller, M., Tiphène, D., Baglin, A., Agogué, P., Appourchaux, T., Barbet, D., Beaufort, T., Bellenger, R., Berlin, R., Bernardi, P., Blouin, D., Boumier, P., Bonneau, F., Briet, R., Butler, B., Cautain, R., Chiavassa, F., Costes, V., Cuvilho, J., Cunha-Parro, V., De Oliveira Fialho, F., Decaudin, M., Defise, J.-M., Djalal, S., Docclo, A., Drummond, R., Dupuis, O., Exil, G., Fauré, C., Gaboriaud, A., Gamet, P., Gavalda, P., Grolleau, E., Gueguen, L., Guivarc’h, V., Guterman, P., Hasiba, J., Huntzinger, G., Hustaix, H., Imbert, C., Jeanville, G., Johlander, B., Jorda, L., Journoud, P., Karioty, F., Kerjean, L., Lafond, L., Lapeyrere, V., Landiech, P., Larqué, T., Laudet, P., Le Merrer, J., Leporati, L., Leruyet, B., Levieuge, B., Llebaria, A., Martin, L., Mazy, E., Mesnager, J.-M., Michel, J.-P., Moalic, J.-P., Monjoin, W., Naudet, D., Neukirchner, S., Nguyen-Kim, K., Ollivier, M., Orcesi, J.-L., Ottacher, H., Oulali, A., Parisot, J., Perruchot, S., Piacentino, A., Pinheiro da Silva, L., Platzer, J., Pontet, B., Pradines, A., Quentin, C., Rohbeck, U., Rolland, G., Rollenhagen, F., Romagnan, R., Russ, N., Samadi, R., Schmidt, R., Schwartz, N., Sebbag, I., Smit, H., Sunter, W., Tello, M., Toulouse, P., Ulmer, B., Vandermarcq, O., Vergnault, E., Wallner, R., Waultier, G., Zanatta, P.: The corot satellite in flight: description and performance. Astron. Astrophys. 506, 411–424 (2009). https://doi.org/10.1051/0004-6361/200810860
Ricker, G.R., Winn, J.N., Vanderspek, R., Latham, D.W., Bakos, G.Á., Bean, J.L., Berta-Thompson, Z.K., Brown, T.M., Buchhave, L., Butler, N.R., Paul Butler, R., Chaplin, W.J., Charbonneau, D., Christensen-Dalsgaard, J., Clampin, M., Deming, D., Doty, J., De Lee, N., Dressing, C., Dunham, E.W., Endl, M., Fressin, F., Ge, J., Henning, T., Holman, M.J., Howard, A.W., Ida, S., Jenkins, J.M., Jernigan, G., Johnson, J.A., Kaltenegger, L., Kawai, N., Kjeldsen, H., Laughlin, G., Levine, A.M., Lin, D., Lissauer, J.J., MacQueen, P., Marcy, G., McCullough, P.R., Morton, T.D., Narita, N., Paegert, M., Palle, E., Pepe, F., Pepper, J., Quirrenbach, A., Rinehart, S.A., Sasselov, D., Bun’Sato, S.S., Sozzetti, A., Stassun, K.G., Sullivan, P., Szentgyorgyi, A., Torres, G., Udry, S., Villasenor, J.: Transiting exoplanet survey satellite. J. Astron. Telesc. Instrum. Syst. 1(10), 014003 (2014)
Doyle, L.R., Carter, J.A., Fabrycky, D.C., Slawson, R.W., Howell, S.B., Winn, J.N., Orosz, J.A., Prsa, A., Welsh, W.F., Quinn, S.N., Latham, D., Torres, G., Buchhave, L.A., Marcy, G.W., Fortney, J.J., Shporer, A., Ford, E.B., Lissauer, J.J., Ragozzine, D., Rucker, M., Batalha, N., Jenkins, J.M., Borucki, W.J., Koch, D., Middour, C.K., Hall, J.R., McCauliff, S., Fanelli, M.N., Quintana, E.V., Holman, M.J., Caldwell, D.A., Still, M., Stefanik, R.P., Brown, W.R., Esquerdo, G.A., Tang, S., Furesz, G., Geary, J.C., Berlind, P., Calkins, M.L., Short, D.R., Steffen, J.H., Sasselov, D., Dunham, E.W., Cochran, W.D., Boss, A., Haas, M.R., Buzasi, D., Fischer, D.: Kepler-16: a transiting circumbinary planet. Science 333, 1602–1606 (2011). https://doi.org/10.1126/science.1210923
Borucki, W.J., Koch, D.G., Batalha, N., Bryson, S.T., Rowe, J., Fressin, F., Torres, G., Caldwell, D.A., Christensen-Dalsgaard, J., Cochran, W.D., DeVore, E., Gautier, T.N., Geary, J.C., Gilliland, R., Gould, A., Howell, S.B., Jenkins, J.M., Latham, D.W., Lissauer, J.J., Marcy, G.W., Sasselov, D., Boss, A., Charbonneau, D., Ciardi, D., Kaltenegger, L., Doyle, L., Dupree, A.K., Ford, E.B., Fortney, J., Holman, M.J., Steffen, J.H., Mullally, F., Still, M., Tarter, J., Ballard, S., Buchhave, L.A., Carter, J., Christiansen, J.L., Demory, B.-O., Désert, J.-M., Dressing, C., Endl, M., Fabrycky, D., Fischer, D., Haas, M.R., Henze, C., Horch, E., Howard, A.W., Isaacson, H., Kjeldsen, H., Johnson, J.A., Klaus, T., Kolodziejczak, J., Barclay, T., Li, J., Meibom, S., Prsa, A., Quinn, S.N., Quintana, E.V., Robertson, P., Sherry, W., Shporer, A., Tenenbaum, P., Thompson, S.E., Twicken, J.D., Van Cleve, J., Welsh, W.F., Basu, S., Chaplin, W., Miglio, A., Kawaler, S.D., Arentoft, T., Stello, D., Metcalfe, T.S., Verner, G.A., Karoff, C., Lundkvist, M., Lund, M.N., Handberg, R., Elsworth, Y., Hekker, S., Huber, D., Bedding, T.R., Rapin, W.: Kepler-22b: A 24 earth-radius planet in the habitable zone of a sun-like star. Astrophys. J. 745, 120 (2012). https://doi.org/10.1088/0004-637X/745/2/120
Neubauer, D., Vrtala, A., Leitner, J.J., Firneis, M.G., Hitzenberger, R.: The life supporting zone of kepler-22b and the kepler planetary candidates: Koi268.01, koi701.03, koi854.01 and koi1026.01. Planet. Space Sci. 73(12), 397–406 (2012). https://doi.org/10.1016/j.pss.2012.07.020. (ISSN 00320633 Solar System science before and after Gaia)
Quintana, E.V., Barclay, T., Raymond, S.N., Rowe, J.F., Bolmont, E., Caldwell, D.A., Howell, S.B., Kane, S.R., Huber, D., Crepp, J.R., Lissauer, J.J., Ciardi, D.R., Coughlin, J.L., Everett, M.E., Henze, C.E., Horch, E., Isaacson, H., Ford, E.B., Adams, F.C., Still, M., Hunter, R.C., Quarles, B., Selsis, F.: An earth-sized planet in the habitable zone of a cool star. Science 344, 277–280 (2014). https://doi.org/10.1126/science.1249403
Barnes, R., Raymond, S.N., Greenberg, R., Jackson, B., Kaib, N.A.: Corot-7b: Super-earth or super-io? Astrophys. J. 709(2), L95–L98 (2010). https://doi.org/10.1088/2041-8205/709/2/L95. (ISSN 2041-8205)
Pat, B., Kristen, W., Anya, B.: Kepler’s legacy: discoveries and more, 2020. URL https://exoplanets.nasa.gov/keplerscience/. Accessed on 30 Jan 2023
Michele, J., Brian, D.: Liftoff of the kepler spacecraft, 2017. URL https://www.nasa.gov/mission_pages/kepler/launch/index.html. Accessed on 02 Feb 2023
Rick, C., Brian, D.: Briefing materials: Nasa retires the kepler space telescope, 2018. URL https://www.nasa.gov/kepler/presskit. Accessed on 02 Feb 2023
Hönes, C.J., Miller, B.K., Heras, A.M., Foing, B.H.: Automatically detecting anomalous exoplanet transits. CoRR, arXiv:2111.08679 (2021). https://doi.org/10.48550/arXiv.2111.08679
Cornachione, M.A., Bolton, A.S., Eastman, J.D., Wilson, M.L., Wang, S.X., Johnson, S.A., Sliski, D.H., McCrady, N., Wright, J.T., Plavchan, P., Johnson, J.A., Horner, J., Wittenmyer, R.A.: A full implementation of spectro-perfectionism for precise radial velocity exoplanet detection: A test case with the minerva reduction pipeline. Publ. Astron. Soc. Pacific 131, 124503 (2019). https://doi.org/10.1088/1538-3873/ab4103
Zaleski, S.M., Valio, A., Marsden, S.C., Carter, B.D.: Differential rotation of kepler-71 via transit photometry mapping of faculae and starspots. Mon. Not. R. Astron. Soc. 484(3), 618–630 (2019). https://doi.org/10.1093/mnras/sty3474. (ISSN 0035-8711)
Treu, T., Marshall, P.J., Clowe, D.: Resource letter gl-1: Gravitational lensing. Am. J. Phys. 80(9), 753–763 (2012). https://doi.org/10.1119/1.4726204. (ISSN 0002-9505)
Kane, S.R., Dalba, P.A., Li, Z., Horch, E.P., Hirsch, L.A., Horner, J., Wittenmyer, R.A., Howell, S.B., Everett, M.E., Paul Butler, R., Tinney, C.G., Carter, B.D., Wright, D.J., Jones, H.R.A., Bailey, J., O’Toole, S.J.: Detection of planetary and stellar companions to neighboring stars via a combination of radial velocity and direct imaging techniques. Astron. J. 157, 252 (2019). https://doi.org/10.3847/1538-3881/ab1ddf
Deqing, R., Mohanakrishna, R., Christian, D.J.: A host-star calibration based polarimeter for earth-like exoplanet imaging. Publ. Astron. Soc. Pac. 131(11), 115004 (2019). https://doi.org/10.1088/1538-3873/ab33ca. (ISSN 0004-6280)
Lacour, S., Nowak, M., Wang, J., Pfuhl, O., Eisenhauer, F., Abuter, R., Amorim, A., Anugu, N., Benisty, M., Berger, J.P., Beust, H., Blind, N., Bonnefoy, M., Bonnet, H., Bourget, P., Brandner, W., Buron, A., Collin, C., Charnay, B., Chapron, F., Clénet, Y., Coudé du Foresto, V., de Zeeuw, P.T., Deen, C., Dembet, R., Dexter, J., Duvert, G., Eckart, A., Förster Schreiber, N.M., Fédou, P., Garcia, P., Garcia Lopez, R., Gao, F., Gendron, E., Genzel, R., Gillessen, S., Gordo, P., Greenbaum, A., Habibi, M., Haubois, X., Haußmann, F., Henning, T., Hippler, S., Horrobin, M., Hubert, Z., Jimenez Rosales, A., Jocou, L., Kendrew, S., Kervella, P., Kolb, J., Lagrange, A.-M., Lapeyrère, V., Le Bouquin, J.-B., Léna, P., Lippa, M., Lenzen, R., Maire, A.-L., Mollière, P., Ott, T., Paumard, T., Perraut, K., Perrin, G., Pueyo, L., Rabien, S., Ramírez, A., Rau, C., Rodríguez-Coira, G., Rousset, G., Sanchez-Bermudez, J., Scheithauer, S., Schuhler, N., Straub, O., Straubmeier, C., Sturm, E., Tacconi, L.J., Vincent, F., van Dishoeck, E.F., von Fellenberg, S., Wank, I., Waisberg, I., Widmann, F., Wieprecht, E., Wiest, M., Wiezorrek, E., Woillez, J., Yazici, S., Ziegler, D., Zins, G.: First direct detection of an exoplanet by optical interferometry. Astron. Astrophys. 623, L11 (2019). https://doi.org/10.1051/0004-6361/201935253
Asif Amin, R.M., Khan, A.T., Tasnim Raisa, Z., Chisty, N., SamihaKhan, S., Khaja, M.S., Rahman, R.M.: Detection of exoplanet systems in kepler light curves using adaptive neuro-fuzzy system. In 2018 International Conference on Intelligent Systems (IS), IEEE, 9, 66–72 (2018). https://doi.org/10.1109/IS.2018.8710502. (ISBN 978-1-5386-7097-2)
Singh, S.P., Misra, D.K.: Exoplanet hunting in deep space with machine learning. Int. J. Res. Eng. Sci. Manag. 3, 187–192 (2020)
Jang, J.S.R., Sun, C.T., Mizutani, E.: A Computational Approach to Learning and Machine Intelligence. Neuro-fuzzy and Soft Computing, Prentice Hall, Hoboken (1997)
Wold, S., Esbensen, K., Geladi, P.: Principal component analysis. Chemom. Intell. Lab. Syst. 2(8), 37–52 (1987). https://doi.org/10.1016/0169-7439(87)80084-9
Berndt, D.J., Clifford, J.: Using dynamic time warping to find patterns in time series. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, pp. 359–370. AAAI Press, (1994)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(6), 321–357 (2002). https://doi.org/10.1613/jair.953. (ISSN 1076-9757)
Elkan, C.: The foundations of cost-sensitive learning. Int. Joint Conf. Artif. Intell. 17, 973–978 (2001)
Wolpert, D.H.: Stacked generalization. Neural Netw. 5(1), 241–259 (1992). https://doi.org/10.1016/S0893-6080(05)80023-1. (ISSN 08936080)
Wang, G., Hao, J., Ma, J., Jiang, H.: A comparative assessment of ensemble learning for credit scoring. Expert Syst. Appl. 38(1), 223–230 (2011). https://doi.org/10.1016/j.eswa.2010.06.048
Woodward, D., Stevens, E., Linstead, E.: Generating transit light curves with variational autoencoders. In 2019 IEEE International Conference on Space Mission Challenges for Information Technology (SMC-IT), IEEE, 7, 24–32 (2019). https://doi.org/10.1109/SMC-IT.2019.00008. (ISBN 978-1-7281-1545-0)
Rob, G., Brian, D.: About tess, 2020. URL https://www.nasa.gov/content/about-tess. Accessed on 30 Jan 2023
Massey, F.J.: The kolmogorov-smirnov test for goodness of fit. J. Am. Stat. Assoc. 46, 68–78 (1951)
Hsu, C.-W., Lin, C.-J.: A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw. 13(3), 415–425 (2002). https://doi.org/10.1109/72.991427. (ISSN 10459227)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
Barandas, M., Folgado, D., Fernandes, L., Santos, S., Abreu, M., Bota, P., Liu, H., Schultz, T., Gamboa, H.: Tsfel: Time series feature extraction library. SoftwareX 11(1), 100456 (2020). https://doi.org/10.1016/j.softx.2020.100456. (ISSN 23527110)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É.: Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011). (ISSN 15324435)
Wickham, H.: ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York (2016)
Geurts, P.: Pattern extraction for time series classification. In Principles of Data Mining and Knowledge Discovery, pp. 115–127. Springer, Berlin, Heidelberg (2001)
Ge, L., Ge, L.-J.: Feature extraction of time series classification based on multi-method integration. Optik 127(12), 11070–11074 (2016). https://doi.org/10.1016/j.ijleo.2016.08.089. (ISSN 0030402)
Zheng, Y., Si, Y.-W., Wong, R.: Feature extraction for chart pattern classification in financial time series. Knowl. Inf. Syst. 63(7), 1807–1848 (2021). https://doi.org/10.1007/s10115-021-01569-1
Osborn, D.R., Chui, A.P.L., Smith, J.P., Birchenhall, C.R.: Seasonality and the order of integration for consumption. Oxf. Bull. Econ. Stat. 50(5), 361–377 (1988). https://doi.org/10.1111/j.1468-0084.1988.mp50004002.x. (ISSN 03059049)
Kwiatkowski, D., Phillips, P.C.B., Schmidt, P., Shin, Y.: Testing the null hypothesis of stationarity against the alternative of a unit root. J. Econ. 54(10), 159–178 (1992). https://doi.org/10.1016/0304-4076(92)90104-Y. (ISSN 03044076)
White, H.: A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48(5), 817–838 (1980). https://doi.org/10.2307/1912934. (ISSN 00129682)
Crutchfield, J.P., Feldman, D.P.: Regularities unseen, randomness observed: levels of entropy convergence. Chaos Interdiscip. J. Nonlinear Sci. 13(3), 25–54 (2003). https://doi.org/10.1063/1.1530990
Kim, H., Valentini, G., Hanson, J., Walker, S.I.: Informational architecture across non-living and living collectives. Theory Biosci. 140, 325–341 (2021). https://doi.org/10.1007/s12064-020-00331-5
Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf
Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, pp. 144–152. ACM (1992). https://doi.org/10.1145/130385.130401. (ISBN 089791497X)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20 (1995). https://doi.org/10.1023/A:1022627411411. (ISSN 15730565)
Vapnik, V.N.: The nature of statistical learning theory. Springer, New York (1995)
Hastie, T., Tibshirani, R., Friedman, J.: The elements of statistical learning: data mining, inference, and prediction, vol. 2. Springer, Berlin (2009). https://doi.org/10.1007/978-0-387-84858-7
Ho, T.K.: Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE Comput. Soc. Press (1995). https://doi.org/10.1109/ICDAR.1995.598994. (ISBN 0-8186-7128-9)
Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001). https://doi.org/10.1023/A:1010933404324
Chen, T., Guestrin, C.: Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 785–794. ACM (2016). https://doi.org/10.1145/2939672.2939785. (ISBN 9781450342322)
Fix, E., Hodges, J.L.: Discriminatory analysis - nonparametric discrimination: Consistency properties. Technical report, USAF School of Aviation Medicine, Technical Report 4, Project 21-49-004, Randolph Field, Texas, (1951)
Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1), 21–27 (1967). https://doi.org/10.1109/TIT.1967.1053964. (ISSN 0018-9448)
Berkson, J.: Application of the logistic function to bio-assay. J. Am. Stat. Assoc. 39(9), 357–365 (1944)
Hosmer, D.W., Lemeshow, S., Sturdivant, R.X.: Applied logistic regression. Wiley, New York (2013). https://doi.org/10.1002/9781118548387

Acknowledgements

Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

Author information

Authors and Affiliations

Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada
João Pimentel, Joana Amorim & Frank Rudzicz
Vector Institute of Artificial Intelligence, Toronto, Ontario, Canada
João Pimentel, Joana Amorim & Frank Rudzicz
Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
Frank Rudzicz

Contributions

JP conceived the idea and methodology of this work and developed code to run experiments on the available data. JA closely collaborated with JP by performing experiments on normalized data. JP and JA analysed the results and drafted the manuscript. FR provided guidance throughout the research process, from the initial conceptualization to the interpretation of the results, as well as critical thinking about the reasons behind certain results. All authors edited, read, reviewed and approved the manuscript.

Corresponding author

Correspondence to João Pimentel.

Ethics declarations

Conflict of interest

All the authors declared that they have no conflict of interest.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Machine learning methods

A.1 Support vector machines

SVMs, initially proposed by Boser et al. [50] and extended by Cortes and Vapnik [51], are algorithms for classification tasks that learn, from a training set, functions that discriminate between classes. This discrimination is achieved by constructing one or more hyper-planes in a high-dimensional space. These hyper-planes separate the classes with the maximum margin, i.e., the largest possible distance between the hyper-plane and the closest data points from each class. By doing so, the algorithm finds the decision boundary that is most robust when generalizing to new data [52].

SVMs are particularly useful when dealing with data sets whose features have a nonlinear relationship with the response variable. This is accomplished by mapping the original feature space into a higher-dimensional space using a kernel function. This new space is then used to find a nonlinear decision boundary [53]. SVMs are also useful to handle data with high dimensionality, i.e., when the number of features is considerably larger than the number of observations [53]. Lastly, these algorithms are noticeably robust against over-fitting [51].
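
The two ideas above, the kernel mapping and the margin, can be illustrated with a minimal Python sketch. The degree-2 polynomial kernel, the helper names, and the toy points are assumptions chosen for this example; the appendix does not state which kernel the authors used.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def poly2_feature_map(x):
    """Explicit 3-D map for 2-D inputs whose dot product equals poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def geometric_margin(w, b, points):
    """Smallest distance from any point to the hyper-plane w . x + b = 0."""
    norm = math.sqrt(dot(w, w))
    return min(abs(dot(w, p) + b) / norm for p in points)


# Toy inputs (assumed): the kernel value matches the dot product computed
# in the higher-dimensional space, without building that space explicitly.
x, z = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly2_kernel(x, z)
mapped_value = dot(poly2_feature_map(x), poly2_feature_map(z))

# Margin of the hyper-plane x1 = 0 with respect to two toy points.
margin = geometric_margin((1.0, 0.0), 0.0, [(2.0, 5.0), (-3.0, 1.0)])
```

An SVM solver would search for the `w` and `b` that maximize this margin; the sketch only evaluates the quantity for a fixed hyper-plane.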

A.2 Random forests

Based on the work of Ho [54] and introduced by Breiman [55], RFs are an ensemble algorithm of several decision trees, each trained on a random subset of data and features. When predicting a new value, the predictions of all trees are aggregated, i.e., the class with the most votes is the predicted one [53].

The random aspect of RFs helps in preventing over-fitting, as the noise of individual trees is averaged out, improving the generalization capacity of the models and making their predictions more robust and accurate in comparison to individual decision trees [53, 55]. Furthermore, RFs also have the ability to handle high-dimensional data and provide estimates of feature importance [53]. Nonetheless, Chen and Guestrin [56] discussed that RFs can be prone to over-fitting in data sets with a high number of features, but suggested that this can be mitigated by using feature sub-sampling and bagging.
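
The bagging and voting steps described above can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the authors' configuration: each "tree" is a one-split decision stump, the feature is drawn once per tree rather than per split, and all names and data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """Threshold one feature at the midpoint between the two class means."""
    class0 = [row[feature] for row, label in zip(X, y) if label == 0]
    class1 = [row[feature] for row, label in zip(X, y) if label == 1]
    if not class0 or not class1:
        # The bootstrap sample contained a single class: predict it always.
        only = 0 if class0 else 1
        return lambda row: only
    mean0 = sum(class0) / len(class0)
    mean1 = sum(class1) / len(class1)
    threshold = (mean0 + mean1) / 2
    high = 1 if mean1 > mean0 else 0
    return lambda row: high if row[feature] > threshold else 1 - high


def fit_forest(X, y, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each using one random feature."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(d)                  # random feature subset of size 1
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return trees


def predict(trees, row):
    """Aggregate the trees: the class with the most votes wins."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]


# Toy, linearly separable data (assumed for illustration only).
X = [[-2.0, -1.0], [-1.0, -2.0], [-1.5, -1.5], [1.0, 2.0], [2.0, 1.0], [1.5, 1.5]]
y = [0, 0, 0, 1, 1, 1]
trees = fit_forest(X, y)
low_prediction = predict(trees, [-3.0, -3.0])
high_prediction = predict(trees, [3.0, 3.0])
```

Using an odd number of trees avoids ties in a two-class vote; a full implementation would grow deeper trees and resample candidate features at every split.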

A.3 K-Nearest neighbours

KNN, proposed by Fix and Hodges [57] and later extended by Cover and Hart [58], is a nonparametric algorithm, i.e., one that does not assume a particular form for the mapping function, and it can be used in a supervised or unsupervised context. The algorithm rests on the assumption that similar data points have similar response values [53]. The k nearest neighbours of a point, with k being a hyper-parameter of the algorithm, are found in feature space with respect to some distance measure, and the predicted outcome is the most common class among those k neighbours [53].
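
A minimal Python sketch of this voting rule follows. Euclidean distance, `k=3`, and the toy points are assumptions made for the example; the appendix leaves the distance measure and the value of k open.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    # Majority vote over the k closest points.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy training set (assumed): two well-separated clusters.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
near_origin = knn_predict(train_X, train_y, (0.5, 0.5))
near_far_cluster = knn_predict(train_X, train_y, (5.5, 5.5))
```

Because the method stores the training set and defers all work to prediction time, its cost grows with the number of training points, which is one reason the choice of k and of the distance measure is usually tuned.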

A.4 Logistic regression

Introduced by Berkson [59], LR is a generalized linear model used in binary classification tasks that calculates the probability of the input belonging to each of the two classes. This probability is modelled with the logistic (sigmoid) function, i.e., \( P(Y = 1 | x) = \frac{1}{1 + e^{-(\beta _0 + \sum _{i = 1}^{N} \beta _i \times x_i)}} \), where x is a vector of length N that represents the feature values of the input data and \(\beta \) is a vector containing the weights for each feature as well as \(\beta _0\), which represents the intercept, i.e., the log-odds of the target variable when all feature values are zero. This function maps the linear combination of features into a probability [53, 60]. Additionally, the coefficients of this model are estimated by maximizing the likelihood function of the data, which minimizes the errors between the predicted probabilities and the actual values of the target variable [60].
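
The formula above translates directly into code. The following Python sketch evaluates \( P(Y = 1 | x) \) and the log-likelihood that the fitting procedure maximizes; the function names, coefficients, and data are hypothetical, and no optimizer is included.

```python
import math


def predict_proba(beta0, beta, x):
    """P(Y = 1 | x) under the logistic model with intercept beta0 and weights beta."""
    log_odds = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-log_odds))


def log_likelihood(beta0, beta, X, y):
    """Sum of log-probabilities the model assigns to the observed labels."""
    total = 0.0
    for row, label in zip(X, y):
        p = predict_proba(beta0, beta, row)
        total += math.log(p) if label == 1 else math.log(1.0 - p)
    return total


# With all features at zero, the probability depends on the intercept alone.
p_zero_features = predict_proba(0.0, [1.0, -2.0], [0.0, 0.0])

# The log-odds of the predicted probability recover the linear combination
# (here -1.0 + 0.5 * 2.0 + 2.0 * 0.5 = 1.0).
p = predict_proba(-1.0, [0.5, 2.0], [2.0, 0.5])
recovered_log_odds = math.log(p / (1.0 - p))

# Toy data (assumed): a positive weight fits better than a negative one.
X, y = [[-1.0], [1.0]], [0, 1]
ll_good = log_likelihood(0.0, [2.0], X, y)
ll_bad = log_likelihood(0.0, [-2.0], X, y)
```

Maximum-likelihood estimation amounts to searching for the `beta0` and `beta` that make `log_likelihood` as large as possible, typically with an iterative numerical method.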

Appendix B OCSB test results

The results of the OCSB test are displayed in Fig. 9. Several seasonal frequencies (from 10 to 1490, in increments of 10 units) were assessed for all time-series; however, only a few stars from the majority class were detected as having a seasonal component. For these stars, the detected orders range from first to third differences.
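
For readers unfamiliar with the terminology, the sketch below shows what a seasonal difference of a given order does to a series, and enumerates the grid of candidate frequencies described above. It does not implement the OCSB test itself, and it assumes that the "differences" in question are seasonal differences; the function name and the toy series are hypothetical.

```python
def seasonal_difference(series, period, order=1):
    """Apply y[t] - y[t - period] to the series `order` times."""
    out = list(series)
    for _ in range(order):
        out = [out[t] - out[t - period] for t in range(period, len(out))]
    return out


# Candidate seasonal frequencies: 10, 20, ..., 1490.
candidate_periods = list(range(10, 1500, 10))

# Toy series (assumed): a linear trend plus a pattern repeating every 10 steps.
series = [t + (t % 10) for t in range(50)]
first = seasonal_difference(series, period=10, order=1)   # pattern removed, trend step left
second = seasonal_difference(series, period=10, order=2)  # constant step removed as well
```

Each pass shortens the series by `period` samples, so high orders at large periods leave little data behind.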

Appendix C Performance analysis

Figure 10 contains several boxplots that represent the value distributions of performance metrics obtained from training and evaluating models on the original test set. The metrics were grouped by data transformation (X-axis), with each metric being represented as a facet of the plot.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Pimentel, J., Amorim, J. & Rudzicz, F. Feature extraction for exoplanet detection. Int J Data Sci Anal (2024). https://doi.org/10.1007/s41060-024-00552-7

Received: 01 November 2023
Accepted: 10 April 2024
Published: 16 May 2024
DOI: https://doi.org/10.1007/s41060-024-00552-7
