1 Introduction

Education is generally said to be the bedrock of national growth and national integration, but studies over the years have shown that the relationship between national integration and education is not linear [1]. Higher education in Nigeria began in 1932 with the establishment of the Yaba Higher College [2], and in 1940, the University College at Ibadan was established in an effort to promote higher education within the colony. According to Okebukola [3] as quoted by Olujuwon [2], regional universities were created based on the recommendations of the Asbby commission; the University of Nigeria was established in the east in 1960, University of Ife in 1961, and in 1962, the University College at Ibadan was granted a full university status. In the North, the Ahmadu Bello University was established in 1962. This marked the beginning of region-based education in Nigeria, and as at today there are 165 Universities in Nigeria according to the National Universities Commission, 43 of which are federal universities, 47 state-owned universities and 75 private universities with only 29% of Joint Admissions and Matriculation Board (JAMB) applicants admitted into the university [4] due to the limited admission slots [5].

Nigeria, with a land area of 923,768 km2 and an estimated population of over 190 million [7] has thirty-six (36) states which are divided into six (6) geopolitical zones, namely; South South, South West, South East, North East, North West, and North Central as shown in Fig. 1. The states were aggregated based on their cultural similarities, ethnicity and common history [8, 9]. With more than two hundred and fifty ethnic groups and over 500 languages in Nigeria, the government had to develop a model for ensuring adequate allocation of political, economic, and educational resources across the regions, and this was achieved by grouping into states and geopolitical zones [10]. Overtime, it became evident that the reception and value for education varies across the geopolitical zones in Nigeria.

Fig. 1
figure 1

The six geopolitical zones in Nigeria [6]

According to Ukiwo [1], access to education in Nigeria has been politicised due to the plural nature of the society, and this has a tendency to engender economic inequalities. Nigeria is majorly divided along ethnic and religious lines, and these factors strongly influence how government policies and efforts generally, are perceived and interpreted. The inequality and politics of higher education started in the early 90’s with the intense pursuit of educational development and attainment in the south west region of the country while the Northern region was quite laid back. To put this in perspective, according to Abernethy [11] as quoted by Ukiwo [1], although the North had about 55% of the national population in 1912, 1926 and 1937, enrolment into primary schools stood at 950, 5200 and 20,250 in the North respectively while in the South, enrolment figures were 35,700, 138,250 and 218,600 and this indicates a significant disparity in regional educational status. The Eastern and Western region of the country achieved drastic growth in educational attainment as compared with the Northern region at the birth of the nation Nigeria. The old Northern region of Nigeria which was educationally laid back, now comprises three geopolitical zones and these are: North West (NW), North East (NE), and North Central (NC) while the former Eastern region and Western region in the South that were developed educationally make up the remaining three geopolitical zones.

Primary and secondary education are vital for tackling illiteracy issues and for ensuring national development. In order to develop a total man equipped with adequate knowledge and capacity to handle today’s societal problems and developmental needs, higher education is required [12]. With a focus on higher education, the National Universities Commission was setup in 1962 as an advisory agency, and it became a statutory body under Decree 1 of 1974 charged with the responsibility of ensuring adequate development and regulation of university education in Nigeria. Increasing demand for university education has created a daunting task for regulatory bodies in a bid to ensure increasing admission slots, and at the same time, enforce quality education with the reality of the ever-present national challenges such as inadequate finance, insufficient educational facilities, unavailable material resources and variances in regional educational status.

Regional differences in the quality and acceptance of education have been drastically reduced over the years due to various deliberate regulatory efforts, but evidences of it still exist in the Nigerian education sector. Admission into institutions in Nigeria is often influenced by the geopolitical zone of the applicants; some states in Nigeria are considered as educationally disadvantaged while some universities have catchment areas, and the cut-off mark (admissible score) for students from this regions are set lower than the general cut-off mark for students from other regions, even though they all sat for the same entry examination. Likewise, government related recruitment are also polarised with deliberate efforts put in place to ensure that successful applicants are selected from all the geopolitical zones. To achieve this, sometimes the pass mark for some regions is deliberately reduced as compared to others. These practices create conflict of opinions, while some Nigerians are pleased that it ensures fair sharing of opportunities among all the geopolitical zones, others think it is unfair and does not ensure success of the best candidates, and is therefore not the best option if Nigeria desires to attain its full human resource potential, and also maximize her entrepreneurial capability and governmental performance [13].

Does the geopolitical zone of origin of students affect their academic performance in higher institutions? In this study, the effect of the geopolitical zone from which a student originates on the anticipated graduation result of the student is examined using a number of selected admission criteria. The dataset analysed in this study is from a private Nigerian university with broad admission criteria that enable students from all the geopolitical zones of the country to fairly compete for the available admission slots. Predictive analyses using data mining on Orange software and regression models were carried out to identify hidden knowledge and vital statistical trends for students from the six geopolitical zones in Nigeria, in order to understand the impact, if any, of ethnicity [14] on the prediction accuracy of the graduation CGPA of University students in Nigeria using selected features.

2 Background

Data related research is on the increase due to the enormous potential benefit in terms of knowledge acquisition and application [15, 16]. Educational data mining is the application of data mining methodologies in educational-data related research studies toward solving education related issues [17]. It entails the stepwise extraction of hidden and useful information from a dataset [18] generated within the education sector in a bid to further understand students, and the effectiveness of the learning process. Educational data mining converts seemingly meaningless data into useful knowledge that can greatly impact the practices and regulatory methods within the education domain. Generally, educational data mining comprises data collection, data sorting and pre-processing, data mining, and post processing of data mining results. Some of the data mining techniques deployed includes clustering, text mining, association rule mining, classification and so forth. The mode of delivery of education has transcended from the traditional classroom-based method to various web-based teaching platforms and adaptive e-learning systems [19] due to the availability of modern learning technologies [20,21,22], and this therefore enables diverse education related data to be generated, logged, processed and studied for a better understanding of the learning process and learners [23], and educational technology integration practices [24].

In a higher institution, various types of data can be collected with different levels of relationship and hierarchy, and as such, the type of knowledge that can be mined from an educational dataset is a function of the nature and the origin of the data. Educational data mining can be used for different objectives [17]. Educational data mining helps to understand the relationship between the educators and the students, it reveals weaknesses and gaps in the learning process, it can be used to predict potential for negative student behaviours [23], and for predicting dropout potential and students performance [25, 26], it aids the development and review of learning models, it can be used to measure the effectiveness of any intervention deployed, and may also be used to guide the learning efforts of learners. The quality and effectiveness of decision processes can be greatly enhanced using educational data mining [27], and likewise, vital feedbacks from students can also be evaluated using data mining techniques in order to identify lapses, areas of need and improvement in teaching and learning processes. Through data mining techniques, students can be classified into unique groups based on well-defined criteria to enable the deployment of purpose-specific and targeted learning interventions, and for identifying common skill set, social attitudes, learning behaviours [28, 29] and interests [30, 31]. The effectiveness of academic modules and the developments of new contents can also be evaluated using data mining techniques [32].

2.1 Related literatures

In the study by Hussain et al. [33], 24-attribute based dataset was evaluated using Bayes Network, Random Forest, J48 and PART classifiers on WEKA data mining platform to predict the semester performance of students. A predictive Multi-Layer Perceptron model was developed by Nurhayati et al. [34] for evaluating the student record of 292 students by using 5 features to predict the graduation potential of the students. In Adekitan and Salau [35] the data of 1445 students covering academic sessions from 2005 to 2009 was evaluated using data mining techniques to determine the extent of the correlation between the admission selection scores and the scholarly performance of student in their first academic year. Similarly, the performance of students in their first year was predicted by Ahmad, Ismail [36] using rule-based, decision tree (J48) and naïve bays classifiers to data mine 399 student records. Using dataset generated at a Bulgarian university, the study by Kabakchieva [37] demonstrated the use of data mining techniques for enhancing university management decision making process by extracting knowledge from 10,330 students’ record with 20 features using the WEKA software.

The study by Alharbi et al. [38] carried out data mining analysis for early identification of at-risk students using the admission records and performance in their first year of study. The results of student-failure potential analysis create an opportunity for warning at-risk student of a potential failure early enough so that drastic intervention can be deployed [39]. In the study by Atta Ur et al. [40], the level of acceptance of course time table and teaching methods by student was measured using machine learning algorithms, by administering an investigative questionnaire consisting of 38 teaching and learning related questions. Learning analytics can be defined as the measurement, gathering, investigation and reporting of relevant data about student, and the learning process, and methods [41, 42]. The research by Bharara et al. [43] identified new metrics that are relevant to the learning process in order to develop a robust model for evaluating student performance. Using 4 categories of features; these are interactional, academic, the level of parent’s participation in the education of their children, and demographic features. In the study, student dataset containing 500 samples was analysed using clustering data mining methodologies.

3 Data descriptive statistics

Statistical attributes of 2413 student dataset across the 4 colleges in Covenant university is presented in this section. Table 1 shows the descriptive statistics of the JAMB Score, WAEC Aggregate and the CGPA of the 2413 students. Figures 2 and 3 show the probability density function plot of the JAMB score and the cumulative probability plot of the JAMB score. Figures 4 and 5 present the probability density function plot and the cumulative probability plot of the WAEC aggregate score respectively, while Figs. 6 and 7 show the probability density function plot and the cumulative probability plot of the CGPA of the students at graduation respectively. To show the variations in the graduation CGPA of students across the six geopolitical zones, the CGPA data is presented as a box plot in Fig. 8, while in Fig. 9 the box plot shows the CGPA variations across the four colleges. As shown in Fig. 8, the North east and North west students had the lowest average CGPA of 3.369 and 3.366 respectively. From Fig. 9, it was observed that students from the college of engineering had the highest average CGPA of 3.52 while the college of science and technology had the least average CGPA of 3.34. Figure 10 presents the distribution of the number of student per grade classification for the six geopolitical zones.

Table 1 Descriptive statistics of the numeric data features
Fig. 2
figure 2

Probability density function plot of JAMB score

Fig. 3
figure 3

Cumulative probability plot of the JAMB score per geopolitical zone

Fig. 4
figure 4

Probability density function plot of WAEC aggregate

Fig. 5
figure 5

Cumulative probability plot of the WAEC score per geopolitical zone

Fig. 6
figure 6

Probability density function plot of CGPA

Fig. 7
figure 7

Cumulative probability plot of the CGPA per geopolitical zone

Fig. 8
figure 8

CGPA variations across the six geopolitical zones

Fig. 9
figure 9

CGPA variations across the four colleges

Fig. 10
figure 10

Distribution of the class of grades across the six geopolitical zones

4 Methodology

The dataset sourced from Badejo et al. [14] contains 2413 student records as generated by the students’ records department of Covenant University with the support of Covenant University Centre for Systems and Information Services. The data contains records of students across the 6 geopolitical zones in Nigeria i.e. South East (SE), South West (SW), South South (SS), North West (NW), North East (NE), and North Central (NC). The data span over a 5-year period from 2008 to 2013 graduation set, across the four colleges of the University i.e. College of Leadership and Development Studies (CLDS), College of Business and Social Sciences (CBSS), College of Science and Technology (CST), and College of Engineering (COE). The dataset contains the year of graduation, the geopolitical zone, the college, pre-admission scores for the West African Examination Council (WAEC) examination and the Joint Admission Matriculation Board (JAMB), and also the Cumulative Grade Point Average (CGPA) at the point of graduation. The class of grade at the point of graduation was added to the features, and the colleges were coded numerically as 1, 2, 3, and 4 respectively, while the geopolitical zones were coded as 1, 2, 3, 4, 5 and 6 respectively. WAEC conducts the West African Senior School Certificate Examination for west African countries. The examination board was established in 1952 for conducting the final examination for graduating high school students.

In this study, the student record was analysed using Orange data mining software as discussed in detail in Sect. 5.1, and regression analysis as discussed in Sect. 5.3. For the two models, the predictive analysis was done in two folds: in the first mode, the geopolitical zone of each student was considered and in the second mode the geopolitical zone was skipped in order to identify if there is any significant variation in model performance as a result of the geopolitical zone feature. Also, statistical analysis and various plots were also developed to show data trends, and variations among the geopolitical zones. For the data mining analysis, the class of grade was used as the target while the CGPA was ignored, whereas for the regression-based analysis, the class of grade was skipped while the CGPA was used as the dependent variable. The class of grade is coded as follows; 1st, for first class, 2UP for second class upper, 2LD for second class lower division, and 3rd for third class. To acquire hidden knowledge from the dataset, the study was carried out in stages which comprise data collection, data cleaning and pre-processing, determination of the descriptive statistics of the dataset, predictive modelling using data mining and regression, result evaluation and interpretation. The impact of data balancing techniques; over-sampling and under-sampling on the predictive accuracy of the algorithms, on the imbalanced dataset was also investigated and presented in Sect. 5.2.

5 Results

5.1 Educational data mining using Orange application

The Orange software is open-source software that provides a visual approach to machine learning for an interactive data analysis which enables easy construction and configuration of workflows for various machine learning studies. In this study, an Orange workflow was developed as shown in Fig. 11. Four data mining algorithms; the Classification Tree, the Neural Network [44], the Naïve Bayes and the Random Forest algorithms were applied to evaluate the predictive capabilities of the student features considered. 70% of the data population was randomly selected using stratified sampling for training the model, while the remaining 30% was deployed for performance evaluation, and the training-test sequence was repeated ten times. The result of the analysis is presented in two folds: considering the geopolitical zone as a feature and excluding the geopolitical zone in the analysis. The lift curve shows comparatively the performance of the data mining algorithms using the average over class as the target class. The confusion matrix of the sub-samples is presented in Tables 2 and 3 for the Classification Tree algorithm, while Tables 4 and 5 show the confusion matrix for the Neural Network algorithm as a percentage of the actual. Tables 6 and 7 show the confusion matrix for the Naïve Bayes, while Tables 8 and 9 present the confusion matrix for the Random Forest data mining algorithm. A comparative evaluation of the performance of the four data mining algorithms is presented in Tables 10 and 11 respectively for the two cases i.e. when the geopolitical zone was considered as a feature, and when it was excluded in the analysis. The data mining classifiers are rated according to their Classification Accuracy (CA), the Precision rate, the Area under ROC Curve (AUC), the F1 score, and the Recall.

Fig. 11
figure 11

The Orange data mining workflow

5.1.1 The tree algorithm

Table 2 Confusion matrix for the tree algorithm considering geopolitical zone
Table 3 Confusion matrix for the tree algorithm excluding geopolitical zone

5.1.2 The neural network algorithm

Table 4 Confusion matrix for the neural network considering geopolitical zone
Table 5 Confusion matrix for the neural network excluding geopolitical zone

5.1.3 The Naïve Bayes algorithm

Table 6 Confusion matrix for the Naïve Bayes considering geopolitical zone
Table 7 Confusion matrix for the Naïve Bayes excluding geopolitical zone

5.1.4 The random forest algorithm

Table 8 Confusion Matrix for the Random Forest considering geopolitical zone
Table 9 Confusion matrix for the random forest excluding geopolitical zone

The quality of a data mining prediction can be examined visually using a lift curve. The Lift curve is obtained by plotting the true positive predictions (TP Rate) against the actual total number of positive instances (P rate) as a measure of the proficiency of the classifier algorithm. The lift curves for the data mining analyses are presented for the four target classes in Figs. 12 to 19 showing the variations in the predictive capabilities of the four data mining algorithms. Figures 12 and 13 show the lift curve for students that graduated with 1st class grade as the target class. Figures 14 and 15 present the lift curve for the second-class upper grade (2UP), both for the inclusion and exclusion of the geopolitical zone as a feature respectively. The lift curve for the prediction of students that graduated with second class lower grade (2LD) is displayed in Figs. 16 and 17 respectively, while Figs. 18 and 19 show the lift curve expressing the predictive capabilities of the four data mining algorithms for the third-class graduation grade (3rd). From the lift curves, it is observed that the algorithms were better able to predict the 1st class grade and the 3rd class grade than the second class upper and the second-class lower division graduation CGPA grades.

Fig. 12
figure 12

The lift curve for the 1st class grade considering geopolitical zone

Fig. 13
figure 13

The lift curve for the 1st class grade excluding geopolitical zone

Fig. 14
figure 14

The lift curve for the 2UP grade considering geopolitical zone

Fig. 15
figure 15

The lift curve for the 2UP grade excluding geopolitical zone

Fig. 16
figure 16

The lift curve for the 2LD grade considering geopolitical zone

Fig. 17
figure 17

The lift curve for the 2LD grade excluding geopolitical zone

Fig. 18
figure 18

The lift curve for the 3rd class grade considering geopolitical zone

Fig. 19
figure 19

The lift curve for the 3rd class grade excluding geopolitical zone

5.1.5 Performance comparison of all the algorithms

Table 10 Performance comparison for the data mining algorithms considering geopolitical zone
Table 11 Performance comparison for the data mining algorithms excluding geopolitical zone

5.2 The impact of under-sampling and over-sampling on model performance

As a result of the typical variation in student performance, the dataset analysed in this study is imbalanced in terms of the number of student in each class of grade (first class, second class upper, second class lower, and third class). To evaluate the effect of this imbalance, a balanced dataset was generated using data-based techniques. Two data mining experiments were performed using under-sampling and over-sampling while excluding the geopolitical zone feature. For the under-sampling analysis, the number of total samples was reduced to 532 i.e. 133 samples per class, and for the over-sampling analysis, the number of samples was increased to 5096 i.e. 1024 samples per class by sampling with replacement.

The result of the under-sampling is presented in Table 12, and it does not show a significant difference from the performance of the model using the actual dataset without under-sampling. By under-sampling, a lot of information is lost. For example, the 2LD samples is reduced from 1024 to 133 which does not depict reality. The result of the over-sampling which is often performed to improve classification accuracy is presented in Table 13, and it reveals an improvement in performance for the Tree, and Random Forest algorithms. Although, over-sampling improved the classification accuracy, but it has created a bias towards the minority class. For example, the total number of first class grade is 133, but this was scaled up to 1024 to create a balanced dataset, and as such, this does not depict reality. In most educational institutions, there will always be more average student, than exceptionally good or poor student. This is a typical challenge with data mining, and as such, the predictive accuracy is further investigated using traditional regression analysis.

Table 12 Performance comparison using under-sampling while excluding geopolitical zone
Table 13 Performance comparison using over-sampling while excluding geopolitical zone

5.3 Analysis using multiple linear regression

Multiple linear regression is a statistical model that represents the relationship between a dependent variable, and more than one independent variables. It attempts to explain the extent to which variations in the dependent variable is explained by variations in each of the independent variables. To further evaluate the accuracy and results of the data mining analysis; multiple linear regression study of the dataset was also carried out. Just like the case of the data mining analysis, the regression analysis is also in two folds. The first case considers the coded geopolitical zone of the students as an independent variable, while the second case excludes the insignificant predictors identified in the first case. The variables applied in the analysis are the year of graduation, the aggregate WAEC score, the JAMB score, the coded college of the student, the coded geopolitical zone as independent variables, and the CGPA as the dependent variable.

5.3.1 Regression analysis considering geopolitical zone

The multiple regression analysis has five independent variables, coded geopolitical zone (GeoZone), coded college, the year of graduation (YoG), jamb score, and the WAEC Aggregate score while the dependent variable is the graduation CGPA. The result of the analysis is presented in Table 14 as the regression model summary, while key regression statistics are available in Table 15. Table 16 shows the regression Anova, and the normal probability plot of the CGPA is displayed in Fig. 20. It was observed from the P value column in Table 14 that, at 5% significance level (P = 0.05), GeoZone (0.9503 > 0.05), College (0.1544 > 0.05), YoG (0.0693 > 0.05) and the intercept (0.0631 > 0.05) are not statistically significant. The Multiple R which explains the correlation between the actual values of the CGPA and the predicted CGPA is 0.4602 while the adjusted R square is 0.2101 which implies that the independent variables only explain 21.01% of the variability of the graduation CGPA of the students. To evaluate the overall model, we consider Table 16 which shows that although some of the independent variables are not strong predictors of the graduation CGPA of the students, but the P value of the regression model for the F test statistic (Significance F) is 1.3E−121 which is far lower than 0.05 and as such it implies that the regression model is actually significant for predicting the graduation CGPA of student.

Table 14 Regression model summary
Table 15 Regression statistics
Table 16 Regression anova
Fig. 20
figure 20

The normal probability plot of the CGPA

5.3.2 Regression analysis after excluding insignificant predictors

Based on the previous multiple regression analysis in Sect. 5.3.1 above, the insignificant independent variables i.e. the coded geopolitical zone, the coded college variable, and the year of graduation were removed from the model and the regression analysis was repeated. The result of the analysis is presented in Table 17 as the regression model summary, while key regression statistics are available in Table 18, and Table19 shows the regression Anova. From Table 17, it can be observed that all the predictors including the intercept are significant with P values less than 0.05 which confirms that the predictors on which the model is based are statistically significant. The F statistic has also increased from 129.3312 in Table 16 to 320.2196 in Table 19, and the overall P value of the model is 4.7194E−124 < 0.05 confirming that the regression model is significant for the prediction of the graduation CGPA of students from different geopolitical zones in Nigeria using the pre-admission features of the student. With an adjusted R square value of 0.2093, it implies the that model only explains 20.93% of the variations in the CGPA. Although the R2 is not better than the previous case in Sect. 5.3.1 but the model is more reliable because it only contains statistically significant predictors.

Table 17 Regression model summary
Table 18 Regression statistics
Table 19 Regression anova

Using the coefficients of the multiple linear regression model, the predictive regression equation can be written as follows:

$${\text{CGPA}} = 0.9051 + (0.0061 \times {\text{JAMB}}\,{\text{score}}) + (0.3805 \times{\text{WAEC Aggregate}})$$

6 Discussion

In this study, the dataset of 2413 graduated students of Covenant University in Nigeria has been evaluated, using the Orange data mining software and regression analysis. Considering the results in Tables 10 and 11 for the data mining analysis using four classifiers i.e. the Tree, Random Forest, Neural Network, and the Naive Bayes Classifiers for the two cases: using the Geopolitical zone and excluding Geopolitical zone as a feature, it can be seen that the AUC for the Tree increased slightly from 0.573 to 0.576, while that of the Random Forest reduced from 0.639 to 0.628. Likewise, for the Neural Network, the AUC increased from 0.674 to 0.681, while for the Naïve Bayes, the AUC increased slightly from 0.655 to 0.666. A similar trend can be observed for the other measure of model fitness parameters in Tables 10 and 11. This implies that the consideration of the geopolitical zone of origin of the students only increased the performance of the Random Forest algorithm, while for the other three algorithms their AUC and CA actually reduced slightly when the geopolitical zone was considered as a feature. The result of the data mining analysis was verified by multiple regression analysis, and the model revealed that the coded geopolitical zone is actually statistically insignificant in the prediction of the graduation CGPA of Covenant University using the admission scores of each student. The above-average accuracy observed, is an indication that the pre-admission academic performance of students is not a complete predictor of their performance once in the university. A number of factors; academic and non-academic such as social lifestyle, financial buoyancy, class attendance, game and internet addictions etc. will shape the performance of a student once admitted into the university. Over-sampling technique to create a balanced dataset towards improving the prediction of the minority class will create a bias towards the minority class. Except if the goal is to predict the minority class, oversampling technique for predicting student performance should be avoided as this distorts reality.

7 Conclusion

In recent times, the application of data mining is gaining ground even in new fields of study. Educational data mining is a great tool that reveals useful information by scientifically mining dataset generated within the education domain. The quality, acceptance and value for education in Nigeria is greatly influenced by ethnicity. There are cities in Nigeria where a significant number of their elites are professors and Ph.D. holders, while only few of such exists in others. In this study, the significance of the geopolitical zone of students on the prediction accuracy of their graduation CGPA in a University in Nigeria was examined. Various statistical analyses were carried out to show trends among the six geopolitical zones. The average CGPA across geopolitical zones varied from 3.366 to 3.427 which is quite close, and implies that the average performance of students across the geopolitical zones do not vary much, as it would have been in the early days of the country Nigeria. The data mining and regression analyses further revealed that the geopolitical zone of the students is not a statistically significant variable for predicting the graduation CGPA of students based on their admission criteria scores, using Covenant University; a private institution in Nigeria as a case study. Although, the multiple regression model developed is significant for predictive analysis but the R2 observed is 0.2099.

The accuracy of the model may be improved by considering additional features, especially the academic scores of the students in their first year which is a reflection of how well the students have transformed from their secondary school perception of education, and settled into the academic rigours and demands of a university. Since the case study university is in the south west region of the country, it will be interesting to find out what the results of a similar study would be using universities in other regions as case study, in order to see if the variations in the average academic performance of students from the six geopolitical zone will have the same trend. Likewise, the analysis in this study is based on regression and data mining using the Orange software. It will be interesting to see what the results and trends would look like using alternative tools and analytical techniques.