Development of accurate classification of heavenly bodies using novel machine learning techniques


Abstract

Heavenly bodies are objects found in outer space, and classifying them is a challenging task for astronomers. This article presents a novel methodology that enables efficient and accurate classification of cosmic objects into three classes (stars, quasars and galaxies) based on evolutionary optimization of classifiers. The data were collected from the Sloan Digital Sky Survey database. First, the input data are normalized and then subjected to principal component analysis to reduce the dimensionality. Then, a genetic algorithm is applied to find the optimal parameters for the classifiers. We used 21 classifiers with a fivefold cross-validation strategy to develop an accurate and robust classification model. Genetic optimization improved the accuracy of nineteen out of twenty-one models. We obtained the highest classification accuracy of 99.16%, precision of 98.78%, recall of 98.08% and F1-score of 98.32% using an evolutionary system based on a voting classifier. The developed machine learning prototype can help astronomers to classify heavenly bodies accurately. The proposed evolutionary system can also be used in other areas where accurate classification of many classes is required.


Introduction

Accurate classification of stellar objects using photometric data has recently gained much popularity. It is an important field of research in which many improvements are made every year, and many tools have been developed to aid in this task. One of them is SExtractor (Bertin et al. 1996), a widely used tool for star-galaxy separation. Due to the increasing number of available sky surveys in the optical and near-infrared spectra, and due to varying observing conditions and sensitivities, it is difficult to fine-tune these tools for a specific database. The objective of this paper is to present a novel evolutionary system that achieves high performance in classifying three types of cosmic objects.

In this work, we chose to conduct our research using the Sloan Digital Sky Survey (SDSS) database (Blanton et al. 2017). Astronomers, physicists and mathematicians from many countries took part in this project. Currently, a large part of the classification of objects from the SDSS database is done with advanced algorithms written in the Matlab environment (SDSS 2015). Many researchers have used machine learning methods to classify heavenly bodies automatically. The details of their studies are provided in Sect. 2. The drawbacks of these studies are given below:

  1. The majority of previous studies used either very complex frameworks or basic methods to build the classification model.

  2. Handling huge amounts of data is complex and time-consuming.

  3. Handling a large number of features makes the model more computationally intensive.

To overcome the above-mentioned drawbacks, we propose a novel machine learning approach that uses a genetic algorithm to find the best model with an optimal set of hyperparameters. First, we reduced the number of features using principal component analysis to speed up the computations. Next, we trained a set of 21 classifiers with their default parameters as a baseline. Then, we optimized their hyperparameters using the genetic algorithm. The resulting individual models were then combined into a single voting classifier, again using the genetic algorithm to find the combination that yielded the best results. Fivefold cross-validation (Mosteller and Tukey 1968) was used during the learning process. The main novel contributions of this work are as follows:

  1. Introduced a new and efficient solution based on machine learning models to classify three classes of cosmic objects.

  2. Developed a genetic algorithm optimization technique to obtain high classification accuracy with a small number of features.

  3. Achieved high performance (over 98% accuracy) using 15 out of 21 tested classifiers.

  4. Employed several performance metrics (accuracy, precision, recall and F1-score) to evaluate the developed model.

The rest of this paper is structured as follows. Section 2 discusses the previous works. Section 3 describes the proposed model based on evolutionary optimization of classifier parameters; data collection and preprocessing, model design, training and validation, are also described in this section. Experimental results and discussion are provided in Sects. 4 and 5, respectively. Conclusions and future works are given in Sect. 6.

Previous work

There has been a significant increase in research related to stellar spectra detection and classification. Many researchers focused on binary classification: star-quasar (Zhang et al. 2011; Jin et al. 2019; Zhang et al. 2009, 2013; Viquar et al. 2018), galaxy-quasar (Bailer-Jones et al. 2019) or star-galaxy (Philip et al. 2002). Others focused on multi-class classification, either of different categories of stars (López et al. 2010; Becker et al. 2020) or of stars, galaxies and quasars (Cabanac et al. 2002; Acharya et al. 2018). In these works, various methods have been applied to classify heavenly bodies automatically and accurately.

Many authors used classical machine learning algorithms such as support vector machines (SVM) or k-nearest neighbors (kNN) (Zhang et al. 2009, 2011, 2013; Tu et al. 2015; [12, 15]; Jin et al. 2019; Viquar et al. 2018). Others adopted deep learning techniques (Becker et al. 2020; [11]) or developed their own novel solutions (Viquar et al. 2018).

Many sky survey databases are freely available. Among them, the most popular are SDSS (Zhang et al. 2009, 2011, 2013; Viquar et al. 2018; Acharya et al. 2018), Gaia (Bailer-Jones et al. 2019; Becker et al. 2020), WISE (Jin et al. 2019; Becker et al. 2020) and UKIDSS (Zhang et al. 2011, 2013). A summary of related works conducted using these databases is shown in Table 1.

Table 1 Summary of state-of-the-art techniques developed using the sky survey databases

Zhang et al. (2011) worked with data from the UKIDSS database. Their best model, an LS-SVM, proved to be highly efficient and powerful in classifying the photometric data. Jin et al. (2019) used data from the WISE database with two new color criteria (yW1W2 and iW1zW2), which were constructed to distinguish quasars from stars efficiently. Zhang et al. (2009) used a kNN algorithm to distinguish star and quasar sources. Zhang et al. (2013) used an SVM classifier for the same purpose. They achieved very high accuracy scores using the SDSS DR7 and UKIDSS DR7 catalogs of photometric data. Viquar et al. (2018) used the same database and asymmetric AdaBoost to classify quasars and stars. Zhang et al. (2013) also applied supervised and unsupervised methods to the quasar-star classification problem. Philip et al. (2002) reported high performance, comparable to that of SExtractor, using a difference-boosting neural network (DBNN) for star-galaxy classification.

Acharya et al. (2018) demonstrated how multi-class classification of stellar sources can be scaled to billions of records by exploiting the scalability of the cloud. Multi-class classification was also performed by Cabanac et al. (2002), who showed that the first 10 eigencomponents of the Karhunen–Loeve expansion, or principal component analysis (PCA), provide a robust classification scheme for the identification of stars, galaxies and quasi-stellar objects from multi-band photometry.

López et al. (2010) developed an automatic multistage classification system based on Bayesian networks for the OMC (Optical Monitoring Camera) data. They focused on multi-class classification of different categories of stars. Becker et al. (2020) worked on a similar problem. They proposed an end-to-end solution for automated source classification based on recurrent neural networks (RNNs) and compared its classification results with those of a random forest classifier.

Bailer-Jones et al. (2019) used Gaussian mixture models to probabilistically classify objects in Gaia data release 2 (GDR2) using photometric and astrometric data. Their trained model is able to classify stars, quasars and galaxies with high accuracy.

Genetic algorithms (GAs) are widely used in many fields because they are versatile and can provide very good results, especially when the search space is large, where they outperform standard optimization techniques such as random or grid searches (Liashchynskyi and Liashchynskyi 2019). They have been continually improved since their inception (De Jong et al. 1997). Deng et al. (2020) proposed MSIQDE, an improved quantum evolutionary algorithm (QEA) with multiple strategies that addresses the premature convergence, low search ability and tendency to fall into local optima of the QEA. The authors used this algorithm to optimize the hyperparameters of a deep belief network (DBN) model. Another example is provided in Deng et al. (2020), where the authors developed IPOQEA, an improved QEA based on the niche co-evolution strategy and enhanced particle swarm optimization (PSO). The proposed system was used to solve the airport gate resource allocation problem. A new optimal mutation strategy based on the complementary advantages of five mutation strategies was used in Deng et al. (2020) to develop an improved differential evolution algorithm with the wavelet basis function. This algorithm can improve the search quality while simultaneously accelerating convergence and helping to avoid local optima. Song et al. proposed a multi-population parallel co-evolutionary differential evolution, named MPPCEDE, to optimize the parameters of photovoltaic models. Sezer et al. (2017) developed a stock trading system that uses genetic algorithms to optimize technical analysis parameters for creating buy-sell points.

Fig. 1

Learning and optimization pipeline developed for the classification

This paper uses an approach similar to that of Pławiak and Acharya (2020), who used a GA for parameter optimization coupled with k-fold cross-validation (CV) for arrhythmia detection using ECG signals. Similar to their work, we built a hyperparameter optimization pipeline based on a GA, which takes the raw data and performs pre-processing and parameter optimization coupled with fivefold CV, yielding a list of classifiers with optimal parameters. A voting ensemble of those classifiers, itself optimized using the GA, further increased the classification accuracy.

Materials and methods

The Sloan Digital Sky Survey is a project that provides a public database of observations of celestial objects (Blanton et al. 2017).

The observations were made with a dedicated 2.5-m-diameter telescope built at the Apache Point Observatory in New Mexico, USA. The telescope used a camera consisting of 30 charge-coupled device (CCD) chips with a resolution of \(2048\times 2048\) each. The chips were arranged in 5 rows of 6 chips. Each row observes the sky through a different optical filter (u’, g’, r’, i’, z’), with wavelengths u’ = 354 nm, g’ = 475 nm, r’ = 622 nm, i’ = 763 nm and z’ = 905 nm (Gunn et al. 1998).

The SDSS database consists of two main tables, PhotoObj and SpecObj. The SkyServer and CasJobs portals provide web interfaces built for executing SQL queries over those tables. We collected data from these tables and randomly selected 10,000 records of celestial bodies from SDSS Data Release 16 (DR16). While querying the data, we closely followed the approach taken by Peng et al. (2012) and Jin et al. (2019). The query filters sciencePrimary = 1, mode = 1 and zWarning = 0 were applied. Each observation is described by 8 attributes (the u’, g’, r’, i’ and z’ bands, right ascension, declination and redshift) and the class to which it belongs: star, galaxy or quasar.
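
The selection described above could be expressed as a query along the following lines. This is only a sketch: the paper does not give its SQL, so the join key (`bestObjID` to `objID`), the column aliases and the helper name `build_sdss_query` are assumptions, and `TOP` returns the first matching rows rather than a random sample.

```python
def build_sdss_query(limit: int = 10_000) -> str:
    """Build a SkyServer-style SQL string with the three stated filters.

    Column names and the join key are assumptions about the schema,
    not taken from the paper.
    """
    return (
        f"SELECT TOP {limit}\n"
        "    p.u, p.g, p.r, p.i, p.z, p.ra, p.dec,\n"
        "    s.z AS redshift, s.class\n"
        "FROM PhotoObj AS p\n"
        "JOIN SpecObj AS s ON s.bestObjID = p.objID\n"
        "WHERE s.sciencePrimary = 1\n"
        "  AND p.mode = 1\n"
        "  AND s.zWarning = 0"
    )


query = build_sdss_query()
print(query)
```

The resulting string can be pasted into a SQL search form; random selection of the 10,000 records would still have to be done separately.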

Figure 1 presents the learning and optimization pipeline developed for the classification process.

The pipeline consists of the following steps:

(i) Exploratory data analysis is performed manually in order to understand the data better. This widely used practice employs rich data visualization techniques to identify issues with the data, such as missing feature values, a large number of outliers, different data scales and the presence of categorical features.

(ii) Initial data pre-processing is an important step in any machine learning project. The data need to be cleaned and scaled so that the classifiers can learn the class relationships well. This step also includes dimensionality reduction techniques such as principal component analysis (PCA) (Pearson 1900). Applying PCA reduces the number of features the model needs to process for accurate prediction, which in turn speeds up the learning process significantly.

(iii) In the train/test split step, the dataset is shuffled and split into training and test sets. We used 75% of our data for training and the rest for testing. A stratified split is used to preserve the class balance.

(iv) We used 21 classifiers from the scikit-learn package (Pedregosa et al. 2011) and initialized them with their default parameters.

(v) After initialization, all classifiers are trained using fivefold CV. This helps us to prevent over- or under-fitting of the classifiers.

(vi) The final classification performance is verified on the test set; this is our baseline.

(vii) Genetic parameter optimization consists of:

(a) Choosing a fitness function. We chose accuracy, but it can be replaced by other performance metrics such as precision, recall or \(F_{1}\) score.

(b) Population generation. The population size and other details of the genetic algorithm are presented in Table 2.

(c) Evolution. This step consists of cross-over, selection of the best individuals, gene mutation and other related operations. It is performed together with fivefold CV on the training dataset and verification on the test dataset; for every individual, the same steps as for the baseline are applied. Elitism and a multipoint gene mutation strategy are used. Elitism keeps the best individual (Bhandari et al. 1996): even if the current generation yields no better individuals, the best individual from the previous generation still participates in the next one.

(d) Saving the best individual. After all generations have passed, the best individual is saved and can then be used in further applications.

(viii) Finally, the optimized learning process is verified and compared with the one without optimization.

The computational complexity of the genetic algorithm in O notation is O(gnm), where g is the number of generations, n is the population size and m is the size of an individual.
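
The genetic optimization step can be illustrated with a deliberately small sketch. It is not the authors' implementation: the search space, the decision tree classifier, the truncation selection, the mutation rate and the synthetic data are all assumptions chosen to keep the example fast. What it does show is accuracy under fivefold CV as the fitness function, per-gene (multipoint) mutation and elitism.

```python
import random

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical search space; the paper's real one is in its Appendix A.
SEARCH_SPACE = {
    "max_depth": [2, 3, 4, 6, 8, None],
    "min_samples_split": [2, 4, 8, 16],
    "criterion": ["gini", "entropy"],
}


def fitness(individual, X, y):
    """Mean fivefold CV accuracy of a classifier built from the individual."""
    model = DecisionTreeClassifier(random_state=0, **individual)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()


def crossover(a, b, rng):
    """Uniform cross-over: each gene comes from one of the two parents."""
    return {gene: rng.choice([a[gene], b[gene]]) for gene in SEARCH_SPACE}


def mutate(individual, rng, rate=0.3):
    """Multipoint mutation: every gene may be resampled independently."""
    return {
        gene: rng.choice(SEARCH_SPACE[gene]) if rng.random() < rate else value
        for gene, value in individual.items()
    }


def rank(population, X, y):
    """Return (score, individual) pairs sorted from best to worst."""
    scored = [(fitness(ind, X, y), ind) for ind in population]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def evolve(X, y, pop_size=6, generations=3, seed=0):
    rng = random.Random(seed)
    population = [
        {gene: rng.choice(values) for gene, values in SEARCH_SPACE.items()}
        for _ in range(pop_size)
    ]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        ranked = rank(population, X, y)
        history.append(ranked[0][0])
        elite = ranked[0][1]
        parents = [ind for _, ind in ranked[: pop_size // 2]]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - 1)
        ]
        population = [elite] + children  # elitism: the best individual survives
    best_score, best = rank(population, X, y)[0]
    return best, best_score, history


# Synthetic three-class data standing in for the SDSS features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
best, best_score, history = evolve(X, y)
print(best, round(best_score, 4), history)
```

Because the elite is copied unchanged into the next generation and the fitness function is deterministic, the best score per generation can never decrease. The number of fitness evaluations grows with the product of generations and population size, in line with the O(gnm) estimate.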

Table 2 Summary of genetic algorithm parameters used for classification with optimization

During data analysis, a high correlation among the light band variables (u’, g’, r’, i’, z’) was observed, as can be seen in Fig. 2. This correlation is to be expected, since the magnitudes contain information about the total brightness of an object and its spectral shape. The 5 light bands were therefore replaced by a smaller number of variables produced by the PCA algorithm (Pearson 1900). The number of principal components was set to 3, which retained over 99% of the explained variance. As a result, the training and testing time is significantly reduced.
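
A minimal sketch of this reduction, assuming scikit-learn and using synthetic magnitudes in place of the SDSS bands (the brightness, color slope and noise levels below are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_objects = 1000

# Synthetic, strongly correlated "magnitudes": a shared brightness term,
# a color slope across the five bands, and small independent noise.
brightness = rng.normal(18.0, 1.5, size=(n_objects, 1))
color_slope = rng.normal(0.0, 0.3, size=(n_objects, 1))
band_offsets = np.array([[2.0, 1.0, 0.0, -0.5, -1.0]])
bands = brightness + color_slope * band_offsets
bands += rng.normal(0.0, 0.02, size=(n_objects, 5))

# Normalize, then project the five bands onto three principal components.
scaled = StandardScaler().fit_transform(bands)
pca = PCA(n_components=3)
components = pca.fit_transform(scaled)

explained = pca.explained_variance_ratio_.sum()
print(components.shape, explained)
```

With real data the retained variance should be read from `explained_variance_ratio_` rather than assumed; the figure of over 99% quoted above is the paper's result for its own sample.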

Fig. 2

Correlation matrices for each of the classes

The final dataset used in the learning process contains 10,000 observations, each described by 6 variables: the 3 principal components produced by the PCA algorithm, redshift, right ascension and declination. The entire dataset is then split into training and test datasets. The training set contains 75% of the observations from the original dataset, and the test set contains the remaining 25%.
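
The stratified 75/25 split can be sketched as follows. The feature matrix is random and the class proportions (50% galaxies, 40% stars, 10% quasars) are hypothetical; they only serve to show that stratification preserves class shares in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))  # 6 features, as in the final dataset
y = np.array(["GALAXY"] * 500 + ["STAR"] * 400 + ["QSO"] * 100)

# Shuffle and split 75/25 while keeping the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, shuffle=True, random_state=0
)


def class_share(labels, name):
    """Fraction of samples in `labels` that belong to class `name`."""
    return float(np.mean(labels == name))


print(len(X_train), len(X_test))
print(class_share(y_train, "QSO"), class_share(y_test, "QSO"))
```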

In stage I of the experiment (the baseline), stratified fivefold CV is used in the learning process, and the models are then tested using the test set. Stage II of the experiment applies genetic parameter optimization to the training of the models. The datasets used for both stages are the same; the only difference is the learning process. In stage II, we trained the models until we reached the maximum number of generations of our evolutionary algorithm. The best individual is then verified using the same test set as in stage I.
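
A baseline run of the kind described for stage I might look like the sketch below. The random forest with 50 trees and the synthetic data are assumptions for illustration; the paper evaluates 21 classifiers on the SDSS sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Synthetic three-class data standing in for the SDSS features.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Stratified fivefold CV on the training set only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring="accuracy")

# Final check on the held-out test set.
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

print(cv_scores.mean(), test_accuracy)
```

Keeping the test set out of the cross-validation loop is what allows the same test set to be reused for a fair comparison between stage I and stage II.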


Results

This section contains the results of the learning and genetic parameter optimization processes.

The tables and figures below show the results obtained in both stages of the learning process. The classifier parameters in stage I were chosen “by hand”, using default values in many cases. The parameters of the classifiers for stage II of the experiment were obtained using the genetic algorithm. Baseline and search space configurations are provided in Appendix A.

The voting classifier (Re and Valentini 2012) in stage I consists of 11 estimators: (i) quadratic discriminant analysis, (ii) support vector machine of type Nu, (iii) radial basis function kernel support vector machine, (iv) poly kernel support vector machine, (v) decision tree classifier, (vi) random forest classifier, (vii) XGBoost classifier, (viii) bagging classifier, (ix) multilayer perceptron, (x) extra trees classifier and (xi) naive Bayes classifier.

Genetic parameter optimization reduced the number of those estimators to only 3: (i) gradient boosted trees classifier, (ii) random forest classifier and (iii) support vector machine with polynomial function kernel.
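
An ensemble of the three estimators named above can be assembled with scikit-learn's `VotingClassifier`. The sketch below is illustrative only: hard (majority) voting, the hyperparameter values, the scaling step in front of the SVM and the synthetic data are assumptions, since the optimized settings are not listed in this section.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data standing in for the SDSS features.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

voter = VotingClassifier(
    estimators=[
        ("gbt", GradientBoostingClassifier(n_estimators=50, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm_poly", make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3))),
    ],
    voting="hard",  # majority vote; an assumption, not stated in the paper
)
voter.fit(X_train, y_train)
accuracy = voter.score(X_test, y_test)
print(accuracy)
```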

Table 3 Summary of accuracies obtained using various classifiers before and after genetic parameter optimization

Accuracy score

It can be noted from Table 3 that before the optimization, the best result (99.01%) was achieved by the XGBoost classifier (Chen and Guestrin 2016). The second and third best classifiers were gradient boosted trees (98.92%) (Bühlmann 2012) and the multilayer perceptron (98.88%) (White and Rosenblatt 1963).

After genetic parameter optimization, the best classifier in terms of classification accuracy is the voting classifier, with 99.16% accuracy. The random forest classifier (Ho 1998) is second best with 99.11%, and the SVM classifier with polynomial kernel and gradient boosted trees from the scikit-learn package tie for third place, both with 99.07% classification accuracy.

Figure 3 shows the plot of the accuracies before and after the genetic parameter optimization.

Fig. 3

Accuracies obtained using various classifiers before and after the genetic parameter optimization

The average accuracy is 96.5% before optimization and 97.8% after the genetic optimization, an average increase of 1.3 percentage points. Before optimization, only the XGBoost classifier obtained an accuracy above 99%; after optimization, 7 out of 21 classifiers exceeded 99%, and the best result increased to 99.16% (voting classifier). Nineteen out of twenty-one classifiers performed better after genetic optimization.

Precision score

Before optimization, the random forest classifier yielded the highest precision score of 98.61%, followed by the voting classifier and the XGBoost classifier with precision of 98.56% and 98.41%, respectively. The results of all classifiers are presented in Table 4.

Table 4 Summary of precision score obtained using various classifiers before and after the genetic parameter optimization

After the optimization, the extra trees classifier (Geurts et al. 2006), AdaBoost and the voting classifier yielded precision scores of 98.66%, 98.61% and 98.57%, respectively.

Figure 4 shows the plot of precision scores before and after the genetic parameter optimization.

Fig. 4

Precision scores obtained using various classifiers before and after the genetic parameter optimization

The average precision is 95.6% before optimization and 96.4% after genetic optimization, an average increase of 0.8 percentage points. Before optimization, the highest precision score is 98.61% (random forest). After genetic optimization, the highest score increased to 98.66%, achieved by the extra trees classifier.

Recall score

Before the genetic parameter optimization, quadratic discriminant analysis (Bose et al. 2015), MLP and the XGBoost classifier yielded recall scores of 98.39%, 98.02% and 97.91%, respectively. The results of all classifiers are summarized in Table 5.

Table 5 Summary of recall score obtained using various classifiers before and after the genetic parameter optimization

After genetic optimization, quadratic discriminant analysis, SVM with polynomial kernel function and logistic regression (Cabrera 1994) yielded recall scores of 98.52%, 98.43% and 98.12%, respectively.

Figure 5 shows the plot of the recall scores before and after the genetic parameter optimization.

Fig. 5

Recall scores obtained using various classifiers before and after the genetic parameter optimization

The average recall score is 94.1% before optimization and 95.7% after genetic optimization, an average increase of 1.6 percentage points. Parameter optimization improved the results of 17 out of 21 classifiers. Both before and after optimization, quadratic discriminant analysis achieved a higher recall than the rest of the classifiers.


F1-score

Before optimization, the XGBoost classifier yielded the highest F1-score of 98.16%. The MLP model and the bagging classifier provided F1-scores of 98.11% and 97.99%, respectively. The F1-scores before and after the genetic optimization are shown in Table 6.

Table 6 Summary of F1-score obtained using various classifiers before and after the genetic parameter optimization

It can be noted from Table 6 that after genetic parameter optimization, SVM classifier with polynomial kernel function, AdaBoost and voting classifier provided the F1-scores of 98.5%, 98.35% and 98.32%, respectively.

Figure 6 shows the plot of F1-scores before and after the genetic parameter optimization.

Fig. 6

F1-scores obtained using various classifiers before and after the genetic parameter optimization

The average F1-score is 94.7% before optimization and 96% after genetic optimization, an average increase of 1.3 percentage points. Before the optimization, XGBoost yielded the highest F1-score of 98.16%. After optimization, the SVM classifier with polynomial kernel yielded the highest F1-score of 98.5%. The F1-score improved for many classifiers.
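
For reference, the four metrics reported in this section can be computed as in the sketch below. The labels are a made-up toy example, and macro averaging over the three classes is an assumption, since the averaging scheme is not stated in the text.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels for illustration only (not SDSS data).
y_true = ["STAR", "STAR", "GALAXY", "GALAXY", "QSO", "QSO", "GALAXY", "STAR"]
y_pred = ["STAR", "GALAXY", "GALAXY", "GALAXY", "QSO", "STAR", "GALAXY", "STAR"]

accuracy = accuracy_score(y_true, y_pred)

# Macro average: compute each metric per class, then take the plain mean.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

print(accuracy, precision, recall, f1)
```

In this toy case the accuracy is 6/8 = 0.75, while the macro precision, recall and F1 are 29/36, 13/18 and 46/63, which shows that the four metrics can rank the same predictions differently.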


Discussion

Table 7 compares our results with other similar works (Viquar et al. 2018; Zhang et al. 2011, 2013; Acharya et al. 2018; Zhang et al. 2009) developed for the automated detection of heavenly bodies using the same SDSS database.

Table 7 Summary of comparison with similar other works for automated celestial object classification using the same database

Fig. 7

An illustration of the proposed genetic parameter optimization methodology usage in real-world scenario on Azure Cloud

It can be noted from Table 7 that most of the previous works (Zhang et al. 2009, 2013, 2011; Viquar et al. 2018) have performed binary classification and obtained high performance.

Recently, Acharya et al. (2018) classified three classes using a random forest classifier and reported a classification accuracy of 94%. To the best of our knowledge, we are the first group to achieve over 99% accuracy for three-class classification of heavenly bodies. In the future, we intend to use the whole dataset of 4 million objects to train the model, which may improve the classification performance. We could also use genetic algorithms to reduce the number of features, selecting only those that improve our accuracy score. Yet another option is to use genetic parameter and feature optimization with the asymmetric AdaBoost classifier proposed by Viquar et al. (2018).

Advantages of the proposed system are:

  1. Obtained highest classification accuracy.

  2. Proposed a novel model based on genetic algorithm.

  3. Model is simple to use and robust as it is developed using fivefold cross-validation.

Limitations of our work are:

  1. A small number of photometric records (10,000 instances) was analyzed. The challenge for astronomers is to classify accurately at various scales. Our approach would need to be scaled by several orders of magnitude to meet those needs.

  2. It is computationally expensive to find the optimal set of classifiers and their parameters. The approach may not be suitable for large-scale data, as its high computational complexity would require parallel task distribution among many nodes, thus increasing the cost.

The main disadvantage of the proposed methodology is its computational complexity. Hence, we intend to explore the possibility of using a cloud environment. An example of a cloud architecture (based on Microsoft Azure) that incorporates our system is shown in Fig. 7. This methodology is not limited to astronomy and can be extended to other applications as well. The proposed architecture can take data from different sources, store them and perform machine learning model optimization using our approach. Elastic scaling of the cloud resources is necessary when the data size is huge. To further leverage fully managed cloud services, we could run our evolutionary optimization pipeline on Azure Batch Service (Microsoft 2020) instead of using virtual machines. This would give us dynamic scaling capabilities, so that we pay for the infrastructure only when we use it. After the model has been trained and evaluated, the cost will be further reduced. Running our pipeline on Azure Batch also gives us the ability to run the whole process not only on demand but also automatically. The data used for testing can later be used to train the model as well, which will make our system more robust and accurate.


Conclusion

In this work, we have proposed a novel method of optimizing a multi-class classification task using machine learning techniques and a genetic algorithm. This approach helps to find the optimal parameters for the classifiers and achieved the highest accuracy of 99.16%. Seven out of twenty-one classifiers achieved an accuracy score of over 99% using our approach. In the future, the proposed model can be used to classify more classes of heavenly bodies and can also be applied to healthcare applications such as detection of cardiac ailments, brain abnormalities and other physiological malfunctions. Various state-of-the-art deep learning techniques can also be employed to increase the performance using more data.


  1. Acharya V, Bora P, Karri N, Nazareth A, Anusha S, Rao S (2018) Classification of sdss photometric data using machine learning on a cloud. Curr Sci 115:249 10.18520/cs/v115/i2/249-257

    Article  Google Scholar 

  2. Bagging Bühlmann P (2012) Boosting and ensemble methods. Handb Comput Stat.

    Article  Google Scholar 

  3. Bailer-Jones C, Fouesneau M, Andrae R (2019) Quasar and galaxy classification in gaia data release 2. Mon Notices R Astron Soc 490:5615–5633.

    Article  Google Scholar 

  4. Becker I, Pichara K, Catelan M, Protopapas P, Aguirre C, Nikzat F (2020) Scalable end-to-end recurrent neural network for variable star classification. Mon Notices R Astron Soc 493:2981–2995.

    Article  Google Scholar 

  5. Bertin E, Arnouts S (1996) Sextractor: software for source extraction. Astron Astrophys Suppl Ser.

    Article  Google Scholar 

  6. Bhandari D, Murthy C, Pal S (1996) Genetic algorithm with elitist model and its convergence. Int J Pattern Recognit Artif Intell.

    Article  Google Scholar 

  7. Blanton M, Bershady M, Abolfathi B, Albareti F, Prieto C, Almeida A, Alonso-Garcia J, Anders F, Anderson S, Andrews B, Aquino-Ortíz E, Aragon-Salamanca A, Argudo-Fernandez M, Armengaud E, Aubourg E, Avila-Reese V, Badenes C, Bailey S, Barger K, Zou H (2017) Sloan digital sky survey iv: mapping the milky way, nearby galaxies, and the distant universe. Astron J 154:28

    Article  Google Scholar 

  8. Bose S, Pal A, SahaRay R (2015) Generalized quadratic discriminant analysis. Pattern Recognit.

    Article  MATH  Google Scholar 

  9. Cabanac R, De Lapparent V, Hickson P (2002) Classification and redshift estimation by principal component analysis. Astron Astrophys.

    Article  Google Scholar 

  10. Cabrera A (1994) Logistic regression analysis in higher education: an applied. Perspective 10:225–256

    Google Scholar 

  11. Chen T, Guestrin C (2016) Xgboost: a scalable tree boosting system, pp 785–794.

  12. De Jong K, Fogel D, Schwefel H-P (1997) A history of evolutionary computation Handb Evolut Comput A2.3:1–12

  13. Deng W, Liu H, Xu J, Zhao H, Song Y (2020) An improved quantum-inspired differential evolution algorithm for deep belief network. IEEE Trans Instrum Meas.

    Article  Google Scholar 

  14. Deng W, Xu J, Zhao H, Song Y (2020) A novel gate resource allocation method using improved pso-based qea. IEEE Trans Intell Transp Syst.

    Article  Google Scholar 

  15. Deng W, Xu J, Song Y, Zhao H (2020) Differential evolution algorithm with wavelet basis function and optimal mutation strategy for complex optimization problem. Appl Soft Comput.

    Article  Google Scholar 

  16. Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63:3–42.

    Article  MATH  Google Scholar 

  17. Gunn J, Carr M, Rockosi C, Sekiguchi M, Berry K, Elms B, Haas E, Ivezic Z, Lupton R, Pauls G, Simcoe R, Hirsch R, Sanford D, Wang S, York D, Annis J, Bartozek L, Boroski W, Brinkman J (1998) The sloan digital sky survey photometric camera. Astron J.

    Article  Google Scholar 

  18. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, van Kerkwijk MH, Brett M, Haldane A, del R’ıo JF, Wiebe M, Peterson P, G’erard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant TE (2020) Array programming with NumPy. Nature 585(7825):357–362.

    Article  Google Scholar 

  19. Ho T (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20:832–844.

    Article  Google Scholar 

  20. Jin X, Zhang Y, Zhang J, Zhao Y, Wu X-B, Fan D (2019) Efficient selection of quasar candidates based on optical and infrared photometric data using machine learning. Mon Notices R Astron Soc 485:4539–4549.

    Article  Google Scholar 

  21. Liashchynskyi P, Liashchynskyi P (2019) Grid search, random search, genetic algorithm: a big comparison for nas. arXiv:1912.06059

  22. López M, Sarro L, Solano E, Gutierrez-Sanchez R, Debosscher J (2010) Supervised star classification system for the omc archive

  23. Microsoft (2020) Batch—cloud-scale job scheduling and compute management. Access 29 May 2020

  24. Mosteller F, Tukey J (1968) Data analysis, including statistics. In: Lindzey G, Aronson E (eds) Revised handbook of social psychology, vol 2. Addison Wesley, pp 80–203

  25. Pearson K (1900) On lines and planes of closest fit to points in space. Philos Mag 2:559–572.

    Article  MATH  Google Scholar 

  26. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E, Louppe G (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830

    MathSciNet  MATH  Google Scholar 

  27. Peng N, Zhang Y, Zhao Y, Wu X-B (2012) Selecting quasar candidates using a support vector machine classification system. Mon Notices R Astron Soc 425:2599–2609.

    Article  Google Scholar 

  28. Philip N, Wadadekar Y, Kembhavi A, Kouneiher J (2002) A difference boosting neural network for automated star-galaxy classification. Astron Astrophys.

  29. Pławiak P, Acharya UR (2020) Novel deep genetic ensemble of classifiers for arrhythmia detection using ECG signals. Neural Comput Appl 32:11137–11161.

  30. Re M, Valentini G (2012) Ensemble methods: a review. Adv Mach Learn Data Min Astron 563–594

  31. SDSS (2015) JPEG images on SkyServer. Accessed 22 Jan 2019

  32. Sezer OB, Ozbayoglu M, Dogdu E (2017) A deep neural-network based stock trading system based on evolutionary optimized technical analysis parameters. Procedia Comput Sci 114:473–480. Complex Adaptive Systems Conference with Theme: Engineering Cyber Physical Systems, CAS, October 30 to November 1, 2017, Chicago, Illinois, USA.

  33. Tu L, Wei H, Ai L (2015) Galaxy and quasar classification based on local mean-based k-nearest neighbor method 285–288.

  34. Viquar M, Basak S, Dasgupta A, Agrawal S, Saha S (2018) Machine learning in astronomy: a case study in quasar-star classification. Proc IEMIS 3(2019):827–836.

  35. White B, Rosenblatt F (1963) Principles of neurodynamics: perceptrons and the theory of brain mechanisms. Am J Psychol 76:705.

  36. Zhang Y, Zhao Y, Zheng H (2009) Automated classification of quasars and stars. Proc Int Astron Union 5:147–147.

  37. Zhang Y, Zhao Y, Zheng H, Wu X-B (2013) Classification of quasars and stars by supervised and unsupervised methods. Proc Int Astron Union 8:333–334.

  38. Zhang Y, Zhao Y, Peng N (2011) LS-SVM applied for photometric classification of quasars and stars. In: Evans IN, Accomazzi A, Mink DJ, Rots AH (eds) Astronomical data analysis software and systems XX. Astronomical Society of the Pacific Conference Series, vol 442


Funding for the Sloan Digital Sky Survey IV has been provided by the Alfred P. Sloan Foundation, the U.S. Department of Energy Office of Science, and the Participating Institutions. SDSS-IV acknowledges support and resources from the Center for High-Performance Computing at the University of Utah. SDSS-IV is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS Collaboration including the Brazilian Participation Group, the Carnegie Institution for Science, Carnegie Mellon University, the Chilean Participation Group, the French Participation Group, Harvard-Smithsonian Center for Astrophysics, Instituto de Astrofísica de Canarias, The Johns Hopkins University, Kavli Institute for the Physics and Mathematics of the Universe (IPMU)/University of Tokyo, the Korean Participation Group, Lawrence Berkeley National Laboratory, Leibniz Institut für Astrophysik Potsdam (AIP), Max-Planck-Institut für Astronomie (MPIA Heidelberg), Max-Planck-Institut für Astrophysik (MPA Garching), Max-Planck-Institut für Extraterrestrische Physik (MPE), National Astronomical Observatories of China, New Mexico State University, New York University, University of Notre Dame, Observatário Nacional/MCTI, The Ohio State University, Pennsylvania State University, Shanghai Astronomical Observatory, United Kingdom Participation Group, Universidad Nacional Autónoma de México, University of Arizona, University of Colorado Boulder, University of Oxford, University of Portsmouth, University of Utah, University of Virginia, University of Washington, University of Wisconsin, Vanderbilt University, and Yale University.

Author information




Michał Wierzbiński conceived and designed the analysis, collected the data, performed the analysis, and wrote the paper; Paweł Pławiak conceived and designed the analysis and wrote the paper; Mohamed Hammad wrote the paper; U. Rajendra Acharya wrote the paper.

Corresponding author

Correspondence to Paweł Pławiak.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Appendix A. Baseline and search space configurations

Table 8 presents the baseline configuration of each classifier together with its search space configuration. Where possible, the random_state parameter was set to a constant value of 42 for reproducibility. Functions from the numpy package (Harris et al. 2020) were used to generate the ranges of numerical parameter values.

Table 8 A sample long table
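
As a rough illustration of how such ranges can be generated, the sketch below builds a small search space with numpy and fixes the seed at 42. The parameter names and ranges are assumptions chosen for illustration only (they are not the values listed in Table 8), and sample_configuration is a hypothetical helper, not a function from the original study.

```python
import numpy as np

RANDOM_STATE = 42  # constant seed, as stated in the appendix

# Hypothetical search space: names and ranges are illustrative assumptions,
# not the configurations reported in Table 8.
search_space = {
    "n_estimators": np.arange(10, 210, 10),  # integers 10, 20, ..., 200
    "max_depth": np.arange(2, 21),           # integers 2, 3, ..., 20
    "C": np.logspace(-3, 3, num=7),          # 1e-3, 1e-2, ..., 1e3
    "tol": np.linspace(1e-5, 1e-3, num=5),   # 5 evenly spaced tolerances
}


def sample_configuration(space, rng):
    """Draw one value uniformly at random from each parameter range."""
    config = {name: rng.choice(values).item() for name, values in space.items()}
    config["random_state"] = RANDOM_STATE  # passed on to the classifier
    return config


# Seeding the generator makes the sampled candidate reproducible.
rng = np.random.default_rng(RANDOM_STATE)
candidate = sample_configuration(search_space, rng)
```

Seeding both the sampler and the classifier with the same constant means that a rerun draws the same candidate configurations, which is the reproducibility property the appendix refers to.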

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Wierzbiński, M., Pławiak, P., Hammad, M. et al. Development of accurate classification of heavenly bodies using novel machine learning techniques. Soft Comput 25, 7213–7228 (2021).


  • Celestial objects
  • Machine learning
  • Evolutionary systems
  • Genetic optimization
  • Heavenly bodies
  • Celestial bodies