Development of accurate classification of heavenly bodies using novel machine learning techniques

Heavenly bodies are objects that exist in outer space. The classification of these objects is a challenging task for astronomers. This article presents a novel methodology that enables an efficient and accurate classification of cosmic objects (3 classes) based on evolutionary optimization of classifiers. The data were collected from the Sloan Digital Sky Survey database. In this work, we propose a novel machine learning model to classify stellar spectra of stars, quasars and galaxies. First, the input data are normalized and then subjected to principal component analysis to reduce the dimensionality. Then, a genetic algorithm is applied to the data to find the optimal parameters for the classifiers. We have used 21 classifiers to develop an accurate and robust classification with a fivefold cross-validation strategy. Our developed model has achieved an improvement in accuracy for nineteen out of twenty-one models. We have obtained the highest classification accuracy of 99.16%, precision of 98.78%, recall of 98.08% and F1-score of 98.32% using an evolutionary system based on a voting classifier. The developed machine learning prototype can help astronomers to classify heavenly bodies in the sky accurately. The proposed evolutionary system can be used in other areas where accurate classification of many classes is required.


Introduction
Recently, accurate stellar object classification using photometric data has gained much popularity. It is an important field of research in which many improvements are made every year. Hence, many tools have been developed to aid in this task. One of them is SExtractor (Bertin et al. 1996), a widely used tool for star-galaxy separation. Due to the increase in the number of available sky surveys in optical and near-infrared spectra, and due to varying observing conditions and sensitivities, it is difficult to fine-tune these tools for a specific database. The objective of this paper is to present a novel evolutionary system with the highest performance in classifying three cosmic objects.
In this work, we chose to conduct our research using the Sloan Digital Sky Survey (SDSS) database (Blanton et al. 2017). Astronomers, physicists and mathematicians from many countries took part in this project. Currently, a large part of the classification of objects from the SDSS database uses advanced algorithms written in the Matlab environment (SDSS 2015). Many researchers have used machine learning methods to classify the heavenly bodies automatically. The details of their studies are provided in Sect. 2. The drawbacks of these studies are given below:
1. The majority of previous studies have used either very complex frameworks or basic methods to build the classification model.
2. It is complex and time-consuming to handle huge data.
3. Handling a large number of features makes the model computationally more intensive.
In order to overcome the above-mentioned drawbacks, we propose a novel machine learning approach that uses a genetic algorithm to find the best model with an optimal set of hyperparameters. First, we reduced the number of features using principal component analysis to speed up the computations. Next, we trained a set of 21 classifiers with their default parameters as a baseline. Then, using a genetic algorithm, we optimized their hyperparameters. The resulting individual models were then combined into a single voting classifier, again using a genetic algorithm to find the combination that yielded the best results. During the learning process, fivefold cross-validation (Mosteller and Tukey 1968) is used. The main novel contributions of this work are as follows:
1. Introduced a new and efficient solution based on machine learning models to classify three classes of cosmic objects.
2. Developed a genetic algorithm optimization technique to obtain high classification accuracy with a small number of features.
3. Achieved highest performance (over 98% accuracy score) using 15 out of 21 tested classifiers.
4. Employed various performance measures (accuracy, precision, recall and F1-score) to evaluate the developed model.
The rest of this paper is structured as follows. Section 2 discusses the previous works. Section 3 describes the proposed model based on evolutionary optimization of classifier parameters; data collection and preprocessing, model design, training and validation, are also described in this section. Experimental results and discussion are provided in Sects. 4 and 5, respectively. Conclusions and future works are given in Sect. 6.

Previous work
There is a significant increase in research works related to stellar spectra detection and classification. Many researchers focused on star-quasar (Zhang et al. 2009, 2011, 2013; Jin et al. 2019; Viquar et al. 2018), galaxy-quasar (Bailer-Jones et al. 2019) or star-galaxy (Philip et al. 2002) binary classification. Others (López et al. 2010; Becker et al. 2020) focused on multi-class classification of stars, galaxies and quasars (Cabanac et al. 2002; Acharya et al. 2018). In these works, various methods have been applied to automatically classify the heavenly bodies accurately.
Many databases related to sky survey data are freely available. Among them, the most popular are SDSS (Zhang et al. 2009, 2011, 2013; Viquar et al. 2018; Acharya et al. 2018), Gaia (Bailer-Jones et al. 2019; Becker et al. 2020), WISE (Jin et al. 2019; Becker et al. 2020) and UKIDSS (Zhang et al. 2011, 2013). The summary of related works conducted using these databases is shown in Table 1. Zhang et al. (2011) operated on the data from the UKIDSS database. Their best model, an LS-SVM, proved to be highly efficient and powerful in classifying the photometric data. Jin et al. (2019) used data from the WISE database with two new color criteria (yW1W2 and iW1zW2), which were constructed to distinguish quasars from stars efficiently. In Zhang et al. (2009), a kNN algorithm is used to distinguish star and quasar sources. Zhang et al. (2013) used an SVM classifier for the same purpose. They achieved very high accuracy scores using the SDSS DR7 and UKIDSS DR7 catalogs of photometric data. Viquar et al. (2018) used the same database and asymmetric AdaBoost to classify quasars and stars. In Zhang et al. (2013), supervised and unsupervised methods were applied to the quasar-star classification problem. Philip et al. (2002) reported high performance using a difference-boosting neural network (DBNN) in star-galaxy classification, comparable to that of SExtractor.
Acharya et al. (2018) demonstrated how multi-class classification of stellar sources can be scaled to billions of records by exploiting the scalability of the cloud. Multi-class classification is also performed by Cabanac et al. (2002). They showed that the first 10 eigencomponents of the Karhunen-Loeve expansion, or principal component analysis (PCA), provided a robust classification scheme for the identification of stars, galaxies and quasi-stellar objects from multi-band photometry. López et al. (2010) developed an automatic multistage classification system based on Bayesian networks for the OMC (Optical Monitoring Camera) data. They focused on multi-class classification of different categories of stars. Becker et al. (2020) worked on a similar problem. They proposed an end-to-end solution for automated source classification based on RNNs and compared the classification results with those of a random forest classifier. Bailer-Jones et al. (2019) used Gaussian mixture models to probabilistically classify objects in Gaia data release 2 (GDR2) using photometric and astrometric data. Their trained model is able to classify stars, quasars and galaxies with high accuracy.
Genetic algorithms (GAs) are widely used in many fields as they are versatile and can provide very good results, especially when the search space is large, outperforming standard optimization techniques such as random or grid searches (Liashchynskyi and Liashchynskyi 2019). They have been continually improved since their inception (De Jong et al. 1977). Wu Deng et al. proposed an improved quantum evolutionary algorithm (QEA) with multiple strategies, named MSIQDE, that addresses the premature convergence, low search ability and tendency to fall into local optima of the QEA. The authors used this algorithm to optimize the hyperparameters of a DBN model. In another example, an improved QEA based on the niche co-evolution strategy and enhanced particle swarm optimization (PSO), named IPOQEA, was developed and used to solve the airport gate resource allocation problem. A new optimal mutation strategy based on the complementary advantages of five mutation strategies has also been used to develop an improved differential evolution algorithm with the wavelet basis function. This algorithm can improve the search quality while simultaneously accelerating convergence and helping to avoid falling into local optima. Song et al. proposed a multi-population parallel co-evolutionary differential evolution, named MPPCEDE, to optimize the parameters of photovoltaic models. Sezer et al. (2017) developed a stock trading system that used genetic algorithms to optimize technical analysis parameters for creating buy-sell points.
This paper uses an approach similar to that presented by Pławiak and Acharya (2020). They used a GA for parameter optimization coupled with k-fold cross-validation (CV) for arrhythmia detection using ECG signals. Similar to their work, we have built a hyperparameter optimization pipeline based on a GA, which takes the raw data and performs pre-processing and parameter optimization coupled with fivefold CV, yielding a list of classifiers with optimal parameters. Using those classifiers, a voting ensemble optimized using a GA further increased the classification accuracy.

Materials and methods
The Sloan Digital Sky Survey is a project that provides a public database of observations of celestial objects (Blanton et al. 2017).
A special 2.5-m-diameter telescope, built at the Apache Point Observatory in New Mexico, USA, was used to observe celestial objects. The telescope used a camera consisting of 30 charge-coupled device (CCD) chips with a resolution of 2048 × 2048 each. The chips were arranged in 5 rows with 6 in each row. Each row observes space through a different optical filter (u', g', r', i', z'), with wavelengths u' = 354 nm, g' = 475 nm, r' = 622 nm, i' = 763 nm and z' = 905 nm [25].
The SDSS database consists of two main tables, PhotoObj and SpecObj. The SkyServer and CasJobs portals provide web interfaces built for SQL query execution over those tables. We collected data from the aforementioned tables and randomly selected 10,000 records of celestial bodies from SDSS Data Release 16 (DR16). While querying the data, we closely followed the approach taken by Peng et al. (2012) and Jin et al. (2019). Query filters of sciencePrimary = 1, Mode = 1 and zWarning = 0 are applied. Each observation is described by 8 attributes (u', g', r', i', z' bands, right ascension, declination and redshift) and the class to which it belongs: star, galaxy or quasar. Figure 1 presents the learning and optimization pipeline developed for the classification process.
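The query described above can be sketched as a small Python helper that only assembles the SQL text. This is a minimal illustration and not the query used in this work: the function name `build_sdss_query`, the join on `bestObjID`/`objID` and the `TOP` clause are assumptions, and `TOP` by itself does not draw a random sample, so the random selection of 10,000 records would still have to be done separately.

```python
# Hypothetical helper that assembles a SkyServer/CasJobs-style SQL query.
# The column names follow the public SDSS schema; the join condition and the
# TOP clause are assumptions, since the exact query is not given in the text.

def build_sdss_query(limit: int = 10000) -> str:
    """Return SQL joining PhotoObj and SpecObj with the three stated filters."""
    columns = [
        "p.ra", "p.dec",                     # right ascension, declination
        "p.u", "p.g", "p.r", "p.i", "p.z",   # five photometric bands
        "s.z AS redshift",                   # spectroscopic redshift
        "s.class",                           # STAR, GALAXY or QSO
    ]
    return (
        f"SELECT TOP {limit} {', '.join(columns)}\n"
        "FROM PhotoObj AS p\n"
        "JOIN SpecObj AS s ON s.bestObjID = p.objID\n"
        "WHERE s.sciencePrimary = 1 AND p.mode = 1 AND s.zWarning = 0"
    )


query = build_sdss_query()
print(query)
```

The resulting string can be pasted into the SkyServer SQL search form or submitted through CasJobs.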
The pipeline consists of the following steps:
(i) Exploratory data analysis is performed manually in order to understand the data better. It is a widely used practice that employs rich data visualization techniques to identify issues with the data. Those issues might include missing feature values, a large number of outliers, different data scales and the presence of categorical features.
(ii) Initial data pre-processing is an important step for any machine learning project. Data need to be cleaned and scaled in order for our classifiers to learn the class relationships well. This step also includes dimensionality reduction techniques such as principal component analysis (PCA). Application of PCA allows us to reduce the number of features the model needs to process for accurate prediction (Pearson 1900). This in turn speeds up the learning process significantly.
(iii) In the train/test split step, the dataset is shuffled and split into training and test sets. We used 75% of our data for training and the rest for testing. Strati-  Table 2.
(c) Evolution: this step consists of cross-over, selection of the best individuals, gene mutation and other related operations. It is performed together with fivefold CV on the training dataset and verification on the test dataset. For every individual, the same steps used for the baseline are applied. Elitism and multipoint gene mutation strategies are used. The elitism strategy helps us to keep the best individual (Bhandari et al. 1996). This way we make sure that even if the current generation yields no better individuals, the best individual from the previous generation will still participate in the next one.
(d) Saving the best individual: after all generations have passed, the best individual is saved. This individual can then be used for further applications.
(viii) Finally, the optimized learning process is verified and compared with the one without optimization.
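The evolution step with elitism and multipoint mutation can be illustrated with a toy genetic algorithm. This is a sketch under stated assumptions, not the implementation used in this work: the `fitness` function is a stand-in for the mean fivefold CV score of a classifier built from an individual's genes, and the single-point crossover, truncation selection, mutation rate and population size are all hypothetical choices.

```python
import random


def fitness(individual):
    # Stand-in for "mean fivefold CV accuracy of the classifier configured
    # by these genes"; a simple function with a known optimum is used here.
    target = [0.2, 0.8, 0.5]
    return -sum((g - t) ** 2 for g, t in zip(individual, target))


def crossover(a, b, rng):
    # Single-point crossover between two parents (an assumed operator).
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(individual, rng, rate=0.3):
    # Multipoint mutation: every gene may be resampled independently.
    return [rng.random() if rng.random() < rate else g for g in individual]


def evolve(generations=30, pop_size=20, n_genes=3, seed=0):
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[0]                  # elitism: best survives unchanged
        parents = population[: pop_size // 2]  # truncation selection (assumed)
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - 1)
        ]
        population = [elite] + children
        history.append(fitness(max(population, key=fitness)))
    return max(population, key=fitness), history


best, history = evolve()
print(best, history[-1])
```

Because the elite individual is copied into every new population, the best fitness recorded in `history` can never decrease from one generation to the next, which is the guarantee that the elitism strategy is meant to provide.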
The computational complexity of the genetic algorithm in O notation is O(gnm), where g is the number of generations, n is the population size and m is the size of an individual. During data analysis, high correlation among the light-band variables (u', g', r', i', z') is observed (Fig. 2). This correlation between magnitudes is to be expected, since the magnitudes contain information about the total brightness of an object and its spectral shape. The five light bands are therefore replaced by a smaller number of variables produced by the PCA algorithm (Pearson 1900). The number of principal components is set to 3, which retains over 99% of the variance. Hence, the training and testing time is significantly reduced.
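The effect of replacing five correlated bands with three principal components can be checked by computing the explained variance ratio. The sketch below uses only the Python standard library and synthetic data: the "brightness plus color tilt plus noise" model and all of its numbers are assumptions made for illustration, and the eigenvalues are obtained by power iteration with deflation rather than by the routine the authors used.

```python
import random


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equally long feature rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_eigenvalues(matrix, k, iterations=500, seed=0):
    """Largest k eigenvalues of a symmetric matrix (power iteration + deflation)."""
    rng = random.Random(seed)
    m = [row[:] for row in matrix]
    d = len(m)
    eigenvalues = []
    for _ in range(k):
        v = [rng.random() for _ in range(d)]
        for _ in range(iterations):
            w = matvec(m, v)
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(m, v)))  # Rayleigh quotient
        eigenvalues.append(lam)
        # Deflation: subtract the component that was just found.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return eigenvalues


# Synthetic stand-in for the five magnitudes (NOT SDSS data): a shared
# brightness term, a color term that tilts the spectrum, and small noise.
data_rng = random.Random(42)
rows = []
for _ in range(500):
    brightness = data_rng.gauss(18.0, 2.0)
    color = data_rng.gauss(0.0, 0.5)
    rows.append([brightness + slope * color + data_rng.gauss(0.0, 0.05)
                 for slope in (2, 1, 0, -1, -2)])

cov = covariance_matrix(rows)
total_variance = sum(cov[i][i] for i in range(len(cov)))
ratios = [lam / total_variance for lam in top_eigenvalues(cov, k=3)]
retained = sum(ratios)
print(ratios, retained)
```

Here `retained` is the fraction of the total variance kept by the first three components; for strongly correlated inputs such as these it is close to 1.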
The final dataset used in the learning process contains 10,000 observations, each described by 3 variables produced by the PCA algorithm together with redshift, right ascension and declination. The entire dataset is then split into training and test datasets. The training set contains 75% of the observations from the original dataset, and the test set contains the rest of the samples.
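A 75/25 split of this kind can be sketched in a few lines of standard-library Python. The version below assumes that the split is stratified by class, which the text does not state explicitly for this step; the function name `stratified_split` and the class balance in the example are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Return (train_idx, test_idx) with class proportions kept in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                        # shuffle within each class
        cut = round(len(indices) * train_fraction)  # 75% of this class
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx


# Made-up class balance for 10,000 records, for illustration only.
labels = ["GALAXY"] * 5000 + ["STAR"] * 4000 + ["QSO"] * 1000
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting each class separately keeps rare classes represented in the test set in the same proportion as in the training set.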
In the learning process, stratified fivefold CV is used. Finally, the models are tested using the test set. This baseline learning process constitutes stage I of the experiment. Stage II involved applying genetic parameter optimization to the training of the models. The datasets used for both stages are the same. The only difference is the learning process. In stage II, we trained the models until the maximum number of generations of our evolutionary algorithm was reached. The best individual is then verified using the same test set that is used in stage I.
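Stratified fivefold CV can be sketched as follows. This is a simplified standard-library illustration rather than the routine used in this work: each class is shuffled and dealt round-robin into five folds, so every fold receives roughly the same class mix. The function name `stratified_kfold` and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs with each class spread over all k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal round-robin into folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


# Toy training labels, for illustration only.
labels = ["GALAXY"] * 50 + ["STAR"] * 40 + ["QSO"] * 10
splits = list(stratified_kfold(labels))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

In a genetic optimization run, the fitness of one individual would typically be the mean validation score over these five splits.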

Results
This section contains the results of learning and genetic parameter optimization processes.
The tables and figures below show the results obtained in both stages of the learning process. The classifier parameters in stage I are chosen "by hand", using default values in many cases. The parameters of the classifiers for stage II of the experiment are obtained using the genetic algorithm. Baseline and search space configurations are provided in Appendix A.
Genetic parameter optimization reduced the number of estimators in the voting classifier to only 3: (i) gradient boosted trees classifier; (ii) random forest classifier and (iii) support vector machine with polynomial function kernel.
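One way to picture the selection of ensemble members is a bit mask over the candidate classifiers, where a 1 keeps a model in the vote. The sketch below assumes hard (majority) voting, which the text does not confirm, and all predictions in it are made up; the fourth model name (`knn`) is a hypothetical candidate that the mask excludes.

```python
from collections import Counter


def majority_vote(predictions_per_model, mask):
    """Hard voting over the models whose mask bit is 1.

    Ties are resolved in favor of the label voted by the earliest selected
    model, because Counter keeps first-seen order for equal counts.
    """
    selected = [p for p, keep in zip(predictions_per_model, mask) if keep]
    voted = []
    for sample_votes in zip(*selected):
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted


# Made-up predictions of four candidate models on four objects.
predictions = {
    "gbt":      ["STAR", "GALAXY", "QSO",    "STAR"],
    "rf":       ["STAR", "GALAXY", "GALAXY", "STAR"],
    "svm_poly": ["GALAXY", "GALAXY", "QSO",  "QSO"],
    "knn":      ["QSO",  "STAR",   "STAR",   "GALAXY"],
}
mask = [1, 1, 1, 0]  # an individual that keeps the first three models
voted = majority_vote(list(predictions.values()), mask)
print(voted)
```

A genetic algorithm can then evolve the mask itself, scoring each candidate mask by the cross-validated accuracy of the resulting ensemble.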

Accuracy score
It can be noted from Table 3 that before the optimization, the best results (99.01%) are achieved by the XGBoost classifier (Chen and Guestrin 2016). The second and third best classifiers are gradient boosted trees (98.92%) (Bühlmann 2012) and multilayer perceptron (98.88%) (White and Rosenblatt 1963).
After genetic parameter optimization, the best classifier in terms of classification accuracy is the voting classifier with 99.16% accuracy. The random forest classifier (99.11%) (Ho 1998) is the second best, and the SVM classifier with polynomial kernel is tied with gradient boosted trees from the scikit-learn package; both achieved 99.07% classification accuracy. Figure 3 shows the plot of the accuracies before and after the genetic parameter optimization.
The average accuracy before optimization is 96.5%, and it increased to 97.8% after the genetic optimization, an average increase of 1.3 percentage points. It can also be noted that after optimization, 7 out of 21 classifiers achieved an accuracy of more than 99%. Before optimization, only the XGBoost classifier obtained above 99% accuracy. After optimization, the best result increased to 99.16% (voting classifier). Nineteen out of twenty-one classifiers performed better after genetic optimization.

Precision score
Before optimization, the random forest classifier yielded the highest precision score of 98.61%. Following that, the voting classifier and the XGBoost classifier yielded precision scores of 98.56% and 98.41%, respectively. The results of all classifiers are presented in Table 4.
After the optimization, extra trees classifier (Geurts et al. 2006), AdaBoost and the voting classifier yielded precision scores of 98.66%, 98.61% and 98.57%, respectively. Figure 4 shows the plot of precision scores before and after the genetic parameter optimization.
The average precision before optimization is 95.6%. After genetic optimization, this value is increased to 96.4%. The average increase in precision score is 0.8%. Before optimization, the highest precision score is 98.61% (random forest). After genetic optimization, this result is increased to 98.66% by the extra trees classifier.

Recall score
Quadratic discriminant analysis Bose et al. (ddd), MLP and XGBoost classifier yielded results of 98.39%, 98.02% and 97.91%, respectively, before the genetic algorithm parameter optimization. The summary of all of the classifiers is given in Table 5. After genetic optimization, quadratic discriminant analysis, SVM with polynomial kernel function and logistic regression (Cabrera 1994) yielded recall scores of 98.52%, 98.43% and 98.12%, respectively. Figure 5 shows the plot of the recall scores before and after the genetic parameter optimization. The average pre-optimization recall score is 94.1%. After genetic optimization, it increased to 95.7%. The average increase in the recall score is 1.6%. As a result of parameter optimization, 17 out of 21 classifiers achieved better results. In both conditions, quadratic discriminant analysis performed better than the rest of the classifiers.

F1-score
The XGBoost classifier before optimization yielded the highest F1-score of 98.16%. MLP model and bagging classifier provided the F1-score of 98.11% and 97.99%, respectively. The F1-scores before and after the genetic optimization are shown in Table 6.
It can be noted from Table 6 that after genetic parameter optimization, SVM classifier with polynomial kernel function, AdaBoost and voting classifier provided the F1-scores of 98.5%, 98.35% and 98.32%, respectively. Figure 6 shows the plot of F1-scores before and after the genetic parameter optimization.
The average value of F1-score before optimization is 94.7%. After genetic optimization, this value increased to 96%. The average increase in F1-score is 1.3%. Before the optimization, XGBoost yielded the highest F1-score of 98.4%. After optimization, SVM classifier with polynomial kernel yielded the F1-score of 98.5%. It can be noted that the F1-score improved for many classifiers. Table 7 provides the summary of comparison with other similar works (Zhang et al. 2009, 2011, 2013; Viquar et al. 2018; Acharya et al. 2018) developed for the automated detection of heavenly bodies using the same SDSS database.
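The four scores reported in this section can be computed from true and predicted labels as sketched below. The text does not state how precision, recall and F1-score are averaged over the three classes, so macro averaging (an unweighted mean of the per-class values) is an assumption here, and the example labels are made up.

```python
def classification_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1-score."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Made-up labels for six objects, for illustration only.
y_true = ["STAR", "STAR", "GALAXY", "GALAXY", "QSO", "QSO"]
y_pred = ["STAR", "GALAXY", "GALAXY", "GALAXY", "QSO", "STAR"]
scores = classification_scores(y_true, y_pred)
print(scores)
```

With macro averaging, each class contributes equally to the score regardless of how many objects it contains, which matters when one class is much rarer than the others.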

Discussion
It can be noted from Table 7 that most of the previous works (Zhang et al. 2009, 2011, 2013; Viquar et al. 2018) have performed binary classification and obtained high performance.
Recently, Acharya et al. (2018) classified three classes using a random forest classifier and reported a classification accuracy of 94%. To the best of our knowledge, we are the first group to achieve over 99% accuracy for three-class classification of heavenly bodies. In the future, we intend to use the whole dataset of 4 million objects to train the model, which may improve the classification performance. We can also use genetic algorithms to reduce the number of features and select only those that improve our accuracy score. Yet another option is to use genetic parameter and feature optimization with the asymmetric AdaBoost classifier proposed by Viquar et al. (2018).
Advantages of the proposed system are:
1. Obtained highest classification accuracy.
2. Proposed a novel model based on genetic algorithm.
3. Model is simple to use and robust as it is developed using fivefold cross-validation.
Limitations of our work are:
1. A small number of photometric records (10,000 instances) are analyzed. The challenge for astronomers is to classify accurately at various scales. Our approach would need to be scaled by several orders of magnitude to meet those needs.
2. It is computationally expensive to find the optimal set of classifiers and their parameters. This approach may not be suitable for large-scale data, as the high computational complexity would require parallel task distribution among many nodes, thus increasing the cost.
The disadvantage of the proposed methodology is its computational complexity. Hence, we intend to explore the possibility of using a cloud environment. An example of a cloud architecture (based on Microsoft Azure) that incorporates our system is shown in Fig. 7. This methodology is not limited to astronomy and can be extended to other applications as well. The proposed architecture can take data from different sources, store them and perform machine learning model optimization using our approach. Elastic scaling of the cloud resources is necessary when the data size is huge. To further leverage fully managed cloud services, we could run our evolutionary optimization pipeline on Azure Batch Service [38] instead of using virtual machines. This will give us dynamic scaling capabilities, and hence, we need to pay for the infrastructure only when we use it. After training and evaluating the model, the cost will be further reduced. Running our pipeline on Azure Batch gives us the ability to run the whole process not only on demand but also automatically. The data used for testing can be used later to train the model as well. This will make our system more robust and accurate.