1 Introduction

Recently, accurate stellar object classification using photometric data has gained much popularity. It is an important field of research in which improvements are made every year, and many tools have been developed to aid in this task, including SExtractor (Bertin et al. 1996), a widely used tool for star-galaxy separation. Due to the increase in the number of available sky surveys in the optical and near-infrared spectra, and due to varying observing conditions and sensitivities, it is difficult to fine-tune these tools for a specific database. The objective of this paper is to present a novel evolutionary system achieving the highest performance in classifying three classes of cosmic objects.

In this work, we chose to conduct our research using the Sloan Digital Sky Survey (SDSS) database (Blanton et al. 2017). Astronomers, physicists and mathematicians from many countries took part in this project. Currently, a large part of the classification of objects from the SDSS database is performed using advanced algorithms written in the Matlab environment (SDSS 2015). Many researchers have used machine learning methods to classify the heavenly bodies automatically. The details of their studies are provided in Sect. 2. The drawbacks of these studies are given below:

  1. The majority of previous studies have used either very complex frameworks or basic methods to build the classification model.

  2. Handling huge volumes of data is complex and time-consuming.

  3. Handling a large number of features makes the model computationally more intensive.

To overcome the above-mentioned drawbacks, in this work we propose a novel machine learning approach that utilizes a genetic algorithm to find the best model with an optimal set of hyperparameters. First, we reduced the number of features using principal component analysis to speed up the computations. Next, we trained a set of 21 classifiers with their default parameters as a baseline. Then, using a genetic algorithm, we optimized their hyperparameters. The resulting individual models were then combined into a single voting classifier, with the genetic algorithm used to find the combination that yielded the best results. During the learning process, fivefold cross-validation (Mosteller and Tukey 1968) is used. The main novel contributions of this work are as follows:

  1. Introduced a new and efficient solution based on machine learning models to classify three classes of cosmic objects.

  2. Developed a genetic algorithm optimization technique to obtain high classification accuracy with a small number of features.

  3. Achieved the highest performance (over 98% accuracy score) using 15 out of 21 tested classifiers.

  4. Employed various performance metrics to evaluate the developed model.

The rest of this paper is structured as follows. Section 2 discusses the previous works. Section 3 describes the proposed model based on evolutionary optimization of classifier parameters; data collection and preprocessing, model design, training and validation are also described in this section. Experimental results and discussion are provided in Sects. 4 and 5, respectively. Conclusions and future work are given in Sect. 6.

2 Previous work

There has been a significant increase in research related to stellar spectra detection and classification. Many researchers focused on star-quasar (Zhang et al. 2011; Jin et al. 2019; Zhang et al. 2009, 2013; Viquar et al. 2018), galaxy-quasar (Bailer-Jones et al. 2019) or star-galaxy (Philip et al. 2002) binary classification. Others focused on multi-class classification of stars, galaxies and quasars (López et al. 2010; Becker et al. 2020; Cabanac et al. 2002; Acharya et al. 2018). In these works, various methods have been applied to classify the heavenly bodies automatically and accurately.

Many authors used classical machine learning algorithms such as support vector machines (SVM) or k-nearest neighbors (kNN) (Zhang et al. 2009, 2011, 2013; Tu et al. 2015; Jin et al. 2019; Viquar et al. 2018). Others adopted deep learning techniques (Becker et al. 2020) or developed their own novel solutions (Viquar et al. 2018).

Many databases of sky survey data are freely available. Among them, the most popular are SDSS (Zhang et al. 2009, 2011, 2013; Viquar et al. 2018; Acharya et al. 2018), Gaia (Bailer-Jones et al. 2019; Becker et al. 2020), WISE (Jin et al. 2019; Becker et al. 2020) and UKIDSS (Zhang et al. 2011, 2013). A summary of related works conducted using these databases is shown in Table 1.

Table 1 Summary of state-of-the-art techniques developed using the sky survey databases

Zhang et al. (2011) operated on data from the UKIDSS database. Their best model, an LS-SVM, proved to be highly efficient and powerful in classifying the photometric data. Jin et al. (2019) used data from the WISE database with two new color criteria (yW1W2 and iW1zW2), constructed to distinguish quasars from stars efficiently. In Zhang et al. (2009), a kNN algorithm is used to separate star and quasar sources. The authors of Zhang et al. (2013) used an SVM classifier for the same purpose and achieved very high accuracy scores using the SDSS DR7 and UKIDSS DR7 catalogs of photometric data. Viquar et al. (2018) used the same database and asymmetric AdaBoost to classify quasars and stars. In Zhang et al. (2013), supervised and unsupervised methods were applied to the quasar-star classification problem. Philip et al. (2002) reported high performance using a difference-boosting neural network (DBNN) in star-galaxy classification, comparable to that of SExtractor.

The authors of Acharya et al. (2018) demonstrated how multi-class classification of stellar sources can be scaled to billions of records by leveraging the scalability of the cloud. Multi-class classification was also performed by Cabanac et al. (2002), who showed that the first 10 eigencomponents of the Karhunen–Loève expansion, or principal component analysis (PCA), provide a robust classification scheme for the identification of stars, galaxies and quasi-stellar objects from multi-band photometry.

López et al. (2010) developed an automatic multistage classification system based on Bayesian networks for the OMC (Optical Monitoring Camera) data, focusing on multi-class classification of different categories of stars. Becker et al. (2020) worked on a similar problem: they proposed an end-to-end solution for automated source classification based on recurrent neural networks (RNNs) and compared the classification results with a random forest classifier.

The authors of Bailer-Jones et al. (2019) used Gaussian mixture models to probabilistically classify objects in Gaia data release 2 (GDR2) using photometric and astrometric data. Their trained model is able to classify stars, quasars and galaxies with high accuracy.

Genetic algorithms (GAs) are widely used in many fields, as they are versatile and can provide very good results, especially when the search space is large, outperforming standard optimization techniques such as random or grid searches (Liashchynskyi and Liashchynskyi 2019). Improvements have been developed constantly since their inception (De Jong et al. 1977). Wu Deng et al. proposed MSIQDE, an improved quantum evolutionary algorithm (QEA) with multiple strategies that addresses the premature convergence, low search ability and tendency to fall into local optima of the standard QEA (Deng et al. 2020); the authors used this algorithm to optimize the hyperparameters of a DBN model. Another example is provided in Deng et al. (2020), where the authors developed IPOQEA, an improved QEA based on a niche co-evolution strategy and enhanced particle swarm optimization (PSO), and used it to solve the airport gate resource allocation problem. A new optimal mutation strategy based on the complementary advantages of five mutation strategies was used in Deng et al. (2020) to develop an improved differential evolution algorithm with a wavelet basis function; this algorithm improves search quality while accelerating convergence and helps avoid falling into local optima. Song et al. proposed a multi-population parallel co-evolutionary differential evolution, named MPPCEDE, to optimize the parameters of photovoltaic models. Sezer et al. (2017) developed a stock trading system that uses genetic algorithms to optimize technical analysis parameters for creating buy-sell points.

Fig. 1 Learning and optimization pipeline developed for the classification

This paper uses an approach similar to that presented by Pławiak and Acharya (2020), who used a GA for parameter optimization coupled with k-fold cross-validation (CV) for arrhythmia detection using ECG signals. Similar to their work, we have built a GA-based hyperparameter optimization pipeline, which takes the raw data and performs pre-processing and parameter optimization coupled with fivefold CV, yielding a list of classifiers with optimal parameters. Using those classifiers, a voting ensemble optimized with the GA further increased the classification accuracy.

3 Materials and methods

The Sloan Digital Sky Survey (SDSS) is a project that provides a public database of observations of celestial objects (Blanton et al. 2017).

A special 2.5-m-diameter telescope, built at the Apache Point Observatory in New Mexico, USA, was used to observe the celestial objects. The telescope used a camera consisting of 30 charge-coupled device (CCD) chips, each with a resolution of \(2048\times 2048\), arranged in 5 rows of 6 chips. Each row observes the sky through a different optical filter (u', g', r', i', z') with effective wavelengths of u' = 354 nm, g' = 475 nm, r' = 622 nm, i' = 763 nm and z' = 905 nm.

The SDSS database consists of two main tables, PhotoObj and SpecObj. The SkyServer and CasJobs portals provide web interfaces built for SQL query execution over those tables. We collected data from the aforementioned tables and randomly selected 10,000 records of celestial bodies from SDSS Data Release 16 (DR16). While querying the data, we closely followed the approach taken by Peng et al. (2012) and Jin et al. (2019). Query filters of sciencePrimary = 1, mode = 1 and zWarning = 0 are applied. Each observation is described by 8 attributes (u', g', r', i', z' bands, right ascension, declination and redshift) and the class to which it belongs: star, galaxy or quasar.
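To illustrate the data-collection step, the snippet below embeds a query of this kind as a Python string, as it might be submitted through SkyServer or CasJobs. The join on bestObjID, the exact column names and the random-sampling clause are our assumptions based on the public SDSS schema, not the authors' verbatim query.

```python
# Hypothetical reconstruction of the data-collection query; table and column
# names follow the public SDSS schema, and the random-sampling clause uses
# the SQL Server syntax supported by CasJobs.
QUERY = """
SELECT TOP 10000
       p.u, p.g, p.r, p.i, p.z,   -- five photometric bands
       p.ra, p.dec,               -- right ascension and declination
       s.z AS redshift,           -- spectroscopic redshift
       s.class                    -- STAR, GALAXY or QSO label
FROM PhotoObj AS p
JOIN SpecObj AS s ON s.bestObjID = p.objID
WHERE s.sciencePrimary = 1
  AND p.mode = 1
  AND s.zWarning = 0
ORDER BY NEWID()                  -- random sample of matching records
"""
print(QUERY)
```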

Figure 1 presents the learning and optimization pipeline developed for the classification process.

The pipeline consists of the following steps:

(i) Exploratory data analysis is performed manually in order to understand the data better. It is a widely used practice that employs rich data visualization techniques to identify issues with the data, such as missing feature values, a large number of outliers, different data scales and the presence of categorical features.

(ii) Initial data pre-processing is an important step in any machine learning project. Data need to be cleaned and scaled in order for the classifiers to learn the class relationships well. This step also includes dimensionality reduction techniques such as principal component analysis (PCA). Applying PCA reduces the number of features the model needs to process for accurate prediction (Pearson 1901), which in turn speeds up the learning process significantly.

(iii) In the train/test split step, the dataset is shuffled and split into training and test sets. We used 75% of our data for training and the rest for testing. A stratified split is used to preserve the class balance.

(iv) We used 21 classifiers from the scikit-learn package (Pedregosa et al. 2011) and initialized their parameters to the default settings.

(v) After initialization, all classifiers are trained using fivefold CV, which helps prevent over- or under-fitting.

(vi) The final classification performance is verified on the test set; this constitutes our baseline.

(vii) Genetic parameter optimization consists of: (a) choosing a proper fitness function; we chose accuracy, although it can be replaced by other performance metrics such as precision, recall or \(F_{1}\) score. (b) Population generation: the population size and other details of the genetic algorithm are presented in Table 2. (c) Evolution: this step consists of cross-over, selection of the best individuals, gene mutation and other related operations, performed together with fivefold CV on the training dataset and verification on the test dataset. For every individual, the same steps used for the baseline are applied. An elitism and multipoint gene mutation strategy is used; elitism keeps the best individual (Bhandari et al. 1996), so that even if the current generation yields no better individuals, the best individual from the previous generation still participates in the next one. (d) Saving the best individual: after all generations have passed, the best individual is saved and can be used for further applications. A minimal sketch of this step is given below.

(viii) Finally, the optimized learning process is verified and compared with the one without optimization. The computational complexity of the genetic algorithm in O notation is O(gnm), where g is the number of generations, and n and m denote the population size and the size of an individual, respectively.
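To make step (vii) concrete, the sketch below shows a minimal GA with elitism, uniform cross-over, multipoint mutation and fivefold CV accuracy as the fitness function. The search space (a single random forest) and the GA settings are illustrative assumptions; the actual experiment tunes 21 classifiers with the parameters of Table 2 and the search spaces of Appendix A.

```python
# Minimal GA hyperparameter search (illustrative, not the authors' exact code).
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

SPACE = {  # hypothetical search space for one classifier
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [4, 8, 16, None],
    "min_samples_split": [2, 5, 10],
}

def fitness(ind, X, y):
    # Accuracy under fivefold CV is the fitness function (step vii-a).
    clf = RandomForestClassifier(**ind, random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

def crossover(a, b):
    # Uniform cross-over: each gene is taken from either parent.
    return {k: random.choice([a[k], b[k]]) for k in SPACE}

def mutate(ind, rate=0.2):
    # Multipoint mutation: each gene is re-drawn independently with prob. rate.
    return {k: random.choice(SPACE[k]) if random.random() < rate else v
            for k, v in ind.items()}

def evolve(X, y, pop_size=20, generations=10, n_elite=2):
    pop = [{k: random.choice(v) for k, v in SPACE.items()} for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda i: fitness(i, X, y), reverse=True)
        pop = ranked[:n_elite]                       # elitism keeps the best
        while len(pop) < pop_size:
            a, b = random.sample(ranked[:pop_size // 2], 2)
            pop.append(mutate(crossover(a, b)))
    return max(pop, key=lambda i: fitness(i, X, y))  # best individual is saved

# Usage: best_params = evolve(X_train, y_train)
```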

Table 2 Summary of genetic algorithm parameters used for classification with optimization

During data analysis, high correlation among the light band variables (u', g', r', i', z') is observed; this correlation can be seen in Fig. 2. The correlation found between magnitudes is to be expected, since the magnitudes contain information about both the total brightness of an object and its spectral shape. Those 5 light bands are substituted by a smaller number of variables produced by the PCA algorithm (Pearson 1901). The number of principal components is set to 3, which keeps over 99% of the explained variance while significantly reducing the training and testing time.
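A minimal sketch of this substitution, assuming the bands arrive as pandas columns named u, g, r, i and z (our naming) and are standardized before PCA:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def reduce_bands(df: pd.DataFrame) -> pd.DataFrame:
    """Replace the 5 correlated band magnitudes with 3 principal components."""
    bands = ["u", "g", "r", "i", "z"]
    pca = PCA(n_components=3)
    comps = pca.fit_transform(StandardScaler().fit_transform(df[bands]))
    print(f"explained variance kept: {pca.explained_variance_ratio_.sum():.4f}")
    out = df.drop(columns=bands)  # redshift, ra, dec pass through unchanged
    for i in range(3):
        out[f"pc{i + 1}"] = comps[:, i]
    return out
```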

Fig. 2 Correlation matrices for each of the classes

The final dataset used in the learning process contains 10,000 observations, each described by the 3 variables produced by the PCA algorithm, redshift, right ascension and declination. The entire dataset is then split into training and test datasets: the training set contains 75% of the observations from the original dataset, and the test set contains the rest of the samples.

In the learning process, stratified fivefold CV is used; finally, the models are tested using the test set. Stage II of the experiment involves genetic parameter optimization of the trained models. The datasets used in both stages are the same; the only difference is the learning process. In stage II, we train until the maximum number of generations of the evolutionary algorithm is reached. The best individual is then verified using the same test set as in stage I.
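A compact sketch of the stage I evaluation mechanics (stratified 75/25 split plus stratified fivefold CV), using synthetic stand-in data so the snippet is self-contained; the real experiment uses the 6-feature SDSS dataset and all 21 classifiers:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the 10,000 x 6 dataset (3 PCA components + redshift, ra, dec).
X, y = make_classification(n_samples=10_000, n_features=6, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=42)

# Stratified 75/25 split preserves the star/galaxy/quasar class balance.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=42)

clf = RandomForestClassifier(random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("CV accuracy:  ", cross_val_score(clf, X_tr, y_tr, cv=cv).mean())
print("test accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))
```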

4 Results

This section contains the results of learning and genetic parameter optimization processes.

The tables and figures below show the results obtained in both stages of the learning process. The classifier parameters in stage I are chosen “by hand”, using default values in many cases. Parameters of the classifiers in stage II of the experiment are obtained using the genetic algorithm. Baseline and search space configurations are provided in Appendix A.
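The four metrics reported in the following subsections can be computed as sketched below; macro averaging over the three classes is our assumption, as the averaging scheme is not stated explicitly in the text.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def report(y_true, y_pred):
    # Macro averaging weights the star, galaxy and quasar classes equally.
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall":    recall_score(y_true, y_pred, average="macro"),
        "f1":        f1_score(y_true, y_pred, average="macro"),
    }
```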

The voting classifier (Re and Valentini 2012) in stage I consists of 11 estimators: (i) quadratic discriminant analysis, (ii) support vector machine of type Nu, (iii) radial basis function kernel support vector machine, (iv) polynomial kernel support vector machine, (v) decision tree classifier, (vi) random forest classifier, (vii) XGBoost classifier, (viii) bagging classifier, (ix) multilayer perceptron, (x) extra trees classifier and (xi) naive Bayes classifier.

Genetic parameter optimization reduced the number of estimators to only 3: (i) the gradient boosted trees classifier, (ii) the random forest classifier and (iii) the support vector machine with a polynomial kernel.
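Wired together as a scikit-learn VotingClassifier, the optimized ensemble has the following shape. The soft-voting mode and the default hyperparameters shown here are placeholders; the GA-selected values are those listed in Appendix A.

```python
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        ("gbt", GradientBoostingClassifier()),
        ("rf", RandomForestClassifier()),
        # probability=True lets the poly-kernel SVM contribute to soft voting.
        ("svc", SVC(kernel="poly", probability=True)),
    ],
    voting="soft",
)
# Usage: ensemble.fit(X_tr, y_tr); ensemble.score(X_te, y_te)
```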

Table 3 Summary of accuracies obtained using various classifiers before and after genetic parameter optimization

4.1 Accuracy score

It can be noted from Table 3 that before the optimization, the best result (99.01%) is achieved by the XGBoost classifier (Chen and Guestrin 2016). The second and third best classifiers are gradient boosted trees (98.92%) (Bühlmann 2012) and the multilayer perceptron (98.88%) (White and Rosenblatt 1963).

After genetic parameter optimization, the best classifier in terms of classification accuracy is the voting classifier with 99.16% accuracy. The random forest classifier (99.11%) (Ho 1998) is second best, and the SVM classifier with polynomial kernel ties with gradient boosted trees from the scikit-learn package; both achieved 99.07% classification accuracy.

Figure 3 shows the plot of the accuracies before and after the genetic parameter optimization.

Fig. 3 Accuracies obtained using various classifiers before and after the genetic parameter optimization

The average accuracy before optimization is 96.5%; after genetic optimization, it increased to 97.8%, an average gain of 1.3 percentage points. It can also be noted that 7 out of 21 classifiers achieved an accuracy of more than 99% after optimization, whereas before optimization only the XGBoost classifier exceeded 99%. The best result after optimization is 99.16% (voting classifier). Nineteen out of twenty-one classifiers performed better after genetic optimization.

4.2 Precision score

Before the optimization, the random forest classifier yielded the highest precision score of 98.61%, followed by the voting classifier and the XGBoost classifier with precisions of 98.56% and 98.41%, respectively. The results of all classifiers are presented in Table 4.

Table 4 Summary of precision score obtained using various classifiers before and after the genetic parameter optimization

After the optimization, the extra trees classifier (Geurts et al. 2006), AdaBoost and the voting classifier yielded precision scores of 98.66%, 98.61% and 98.57%, respectively.

Figure 4 shows the plot of precision scores before and after the genetic parameter optimization.

Fig. 4 Precision scores obtained using various classifiers before and after the genetic parameter optimization

The average precision before optimization is 95.6%; after genetic optimization, it increased to 96.4%, an average gain of 0.8 percentage points. Before optimization, the highest precision score is 98.61% (random forest); after genetic optimization, the best result is 98.66%, achieved by the extra trees classifier.

4.3 Recall score

Before the genetic parameter optimization, quadratic discriminant analysis (Bose et al.), MLP and the XGBoost classifier yielded recall scores of 98.39%, 98.02% and 97.91%, respectively. The summary of all classifiers is given in Table 5.

Table 5 Summary of recall score obtained using various classifiers before and after the genetic parameter optimization

After genetic optimization, quadratic discriminant analysis, the SVM with polynomial kernel and logistic regression (Cabrera 1994) yielded recall scores of 98.52%, 98.43% and 98.12%, respectively.

Figure 5 shows the plot of the recall scores before and after the genetic parameter optimization.

Fig. 5 Recall scores obtained using various classifiers before and after the genetic parameter optimization

The average pre-optimization recall score is 94.1%; after genetic optimization, it increased to 95.7%, an average gain of 1.6 percentage points. As a result of the parameter optimization, 17 out of 21 classifiers obtained better results. In both stages, quadratic discriminant analysis performed better than the rest of the classifiers.

4.4 F1-score

Before optimization, the XGBoost classifier yielded the highest F1-score of 98.16%. The MLP model and the bagging classifier provided F1-scores of 98.11% and 97.99%, respectively. The F1-scores before and after the genetic optimization are shown in Table 6.

Table 6 Summary of F1-score obtained using various classifiers before and after the genetic parameter optimization

It can be noted from Table 6 that after genetic parameter optimization, the SVM classifier with polynomial kernel, AdaBoost and the voting classifier provided F1-scores of 98.5%, 98.35% and 98.32%, respectively.

Figure 6 shows the plot of F1-scores before and after the genetic parameter optimization.

Fig. 6 F1-scores obtained using various classifiers before and after the genetic parameter optimization

The average F1-score before optimization is 94.7%; after genetic optimization, it increased to 96.0%, an average gain of 1.3 percentage points. Before the optimization, XGBoost yielded the highest F1-score of 98.16%; after optimization, the SVM classifier with polynomial kernel yielded the best F1-score of 98.5%. It can be noted that the F1-score improved for many classifiers.

5 Discussion

Table 7 provides a summary of the comparison with other similar works (Viquar et al. 2018; Zhang et al. 2009, 2011, 2013; Acharya et al. 2018) developed for the automated detection of heavenly bodies using the same SDSS database.

Table 7 Summary of comparison with other similar works for automated celestial object classification using the same database

Fig. 7 An illustration of the proposed genetic parameter optimization methodology used in a real-world scenario on Azure Cloud

It can be noted from Table 7 that most of the previous works (Zhang et al. 2009, 2013, 2011; Viquar et al. 2018) have performed binary classification and obtained high performance.

Recently, Acharya et al. (2018) classified three classes using a random forest classifier and reported a classification accuracy of 94%. To the best of our knowledge, we are the first group to achieve over 99% accuracy for three-class classification of heavenly bodies. In the future, we intend to use the whole dataset of 4 million objects to train the model, which may improve the classification performance. We can also use genetic algorithms to reduce the number of features and select only those that improve the accuracy score. Yet another option is to use genetic parameter and feature optimization with the asymmetric AdaBoost classifier proposed by Viquar et al. (2018).

Advantages of the proposed system are:

  1. Obtained the highest classification accuracy.

  2. Proposed a novel model based on a genetic algorithm.

  3. The model is simple to use and robust, as it is developed using fivefold cross-validation.

Limitations of our work are:

  1. A small number of photometric records (10,000 instances) are analyzed. The challenge for astronomers is to classify accurately at various scales; our approach would need to be scaled several orders of magnitude to meet those needs.

  2. It is computationally expensive to find the optimal set of classifiers and their parameters. This approach may not be suitable for large-scale data, as the high computational complexity would require parallel task distribution among many nodes, thus increasing the cost.

The disadvantage of the proposed methodology is its computational complexity. Hence, we intend to explore the possibility of using a cloud environment. An example of a cloud architecture (based on Microsoft Azure) that incorporates our system is shown in Fig. 7. The methodology is not limited to astronomy and can be extended to other applications as well. The proposed architecture can take data from different sources, store them and perform machine learning model optimization using our approach. Elastic scaling of the cloud resources is necessary when the data size is huge. To further leverage fully managed cloud services, we could run our evolutionary optimization pipeline on the Azure Batch service instead of on virtual machines. This would give us dynamic scaling capabilities, so we pay for the infrastructure only while we use it; once the model is trained and evaluated, the cost is further reduced. Running our pipeline on Azure Batch also allows the whole process to run not only on demand but also automatically. The data used for testing can later be used to train the model as well, making the system more robust and accurate.

6 Conclusion

In this work, we have proposed a novel method of optimizing a multi-class classification task using machine learning techniques and a genetic algorithm. This approach helps to find the optimal parameters for the classifiers and achieved the highest accuracy of 99.16% (seven out of twenty-one classifiers achieved an accuracy score of over 99% using our approach). In the future, the proposed model can be used to classify more classes of heavenly bodies and can also be applied to healthcare tasks such as the detection of cardiac ailments, brain abnormalities and other physiological disorders. Various state-of-the-art deep learning techniques can be employed to increase the performance using more data.