1 Introduction

Choosing a university major or a career domain is a challenging task, fraught with anxiety that distracts students (Zhang, 2021). A university major is the field of study at the university, such as computer science, architecture, applied languages, management, arts, or law. Furthermore, students do not have enough knowledge about the existing career domains, and given the enormous amount of data on the Web, they cannot easily find the information they need (Chamandy & Gaudreau, 2019). In most cases, students know the careers present in their environment, such as those of their parents and family, and discover other domains through career fairs or everyday life. This lack of knowledge leads them to choose an academic major that does not necessarily correspond to their expectations and that affects their later career domain. In many situations, the career decision is difficult to undo for financial, familial, or personal reasons, which may create a seed of regret and dissatisfaction with people’s lives (Budjanovcanin & Woodrow, 2022). Therefore, high school students require guidance to balance their interests with the available universities and majors.

Recommender systems (RSs) have proven their ability to assist users with personalized recommendations in many applications (Aggarwal, 2016; AlBanna et al., 2016). RSs are among the most widely used machine learning tools to predict users’ behaviors and recommend personalized items, e.g., courses, articles, books, movies, playlists, and products. RSs collect several types of information about the users, including their interests and preferences. Two categories of information are usually integrated into RSs in order to suggest adequate recommendations: (1) the characteristic information about the items, such as keywords and categories, and about the users, such as preferences, interests, and profiles; (2) the user-item interaction information, such as ratings, reviews, likes, and number of purchases. Every RS employs a filtering technique to retrieve suitable suggestions and items for the users. These techniques are mainly categorized as follows:

  • Collaborative filtering recommender systems (CF), which are based on model-based (clustering, regression, etc.) or memory-based (user-item interactions) techniques.

  • Content-based recommender systems (CB), which depend on the characteristics of information.

  • Knowledge-based recommender systems (KB), which adopt either ontology-based, case-based or constraint-based techniques.

  • Demographic-based recommender systems (DF), which depend on the users’ demographic data.

  • Hybrid recommender systems, which represent a combination of two or more filtering techniques.

In the educational domain, most of the existing RSs recommend courses or materials to learners, while the concern of recommending personalized higher education studies and career domain guidance to learners has been neglected. Accordingly, this study investigates the appropriate RS that can answer the following questions:

  • What is the appropriate RS approach to process high school students’ profiles, interests, education, demographic, and career knowledge to produce personalized recommendations?

  • What are the solutions to overcome the limitations of traditional RS approaches?

  • Is applying one RS approach sufficient to get accurate results? Or should we apply a hybrid one? In the latter case, what is the suitable hybridization technique for a uniform hybrid RS?

Thus, the main contributions in this study can be summarized as follows:

  • This study exclusively investigates different recommender systems for university study field and career domain guidance. Many recommender systems have been developed in the field of education. To the best of our knowledge, none of them recommended higher education study fields (majors) to high school students based on their profiles.

  • Most of the studies recommended educational resources or activities for learners. To the best of our knowledge, this research study is the first specialized in guiding high school students towards higher education paths by recommending universities, university majors, and career domains.

  • A detailed comparison is conducted by investigating five approaches of RSs to address the problem of university majors and career guidance. These RS approaches are: (1) the stand-alone user-based and item-based CF RSs, (2) the stand-alone DF RSs, (3) the stand-alone KB RSs supported by Case-Based Reasoning (KB using CBR), (4) the stand-alone KB RS supported by ontology and CBR (KB using CBR + Ontology), and (5) the Hybrid RS combined with CBR, ontology, and user-based CF (KB using CF + CBR + Ontology).

  • We uniquely propose the fifth approach, the Hybrid RS incorporated with user-based CF, ontology and CBR (KB using CF + CBR + Ontology), to answer the presented research questions.

  • This proposed hybridization is domain-independent, as it can be extended to solve similar challenges in other domains by defining the corresponding domain of knowledge.

  • We uniquely introduce the GraduateOnto ontology to describe the domain knowledge of university graduates. Thus, it could be reused in other problems in the educational domain.

The rest of the paper is structured as follows. In the next section, we present the works related to RSs in the educational domain. Section 3 introduces the data collection and preprocessing processes. Section 4 discusses the five investigated approaches to assist high school students. Section 5 provides a thorough discussion of the experimental results of these approaches, while Section 6 concludes the paper and highlights the expected future works.

2 Related recommender systems in education

Several approaches were proposed to develop recommender systems, which produce recommendations for their users according to criteria that meet their preferences (Isinkaye et al., 2015; Jumaa et al., 2017; Maher et al., 2020; Moussa et al., 2020). However, these approaches tailor the prediction process to a specific domain and dataset complexity. In this section, we present the main related works on different RS techniques in the education domain, except the content-based technique, because it is not compatible with the nature of our data. As for the CF technique, many problems have been detected, such as (Breese et al., 2013):

  • Cold start: The CF technique requires the users’ previous history, such as their ratings and activities, to produce accurate recommendations. This problem occurs when the data used do not comprise enough interests and ratings, so reliable recommendations become hard to provide. Usually, the cold start issue happens for three main reasons: a new active user, a new community, or a new item added to the system (Schafer et al., 2007). For instance, if a new active user asks for a recommendation, the system has difficulty matching him/her to similar users, since minimal history exists about his/her activities or ratings in the database. Hybrid systems are used to overcome this problem.

  • Sparsity: This concern arises when the user-item matrix is sparsely populated, i.e., most of the available items have no ratings from the previous users in the system, which decreases the accuracy of recommendations. Hybridization is commonly used to improve recommendation techniques and to solve this issue. For example, combining the CF and DF recommendation techniques is one method to minimize the sparsity problem of the CF algorithm.

  • Grey-sheep: This concern leads to odd recommendations, as the user might have unusual characteristics that do not match those of other users (de Campos et al., 2010). This may happen when a user neither agrees nor disagrees with any other user. The grey-sheep issue can increase the error rate in recommendations and consequently affects the precision of the RS. Furthermore, this issue may negatively affect the predictions for the rest of the community in the dataset (Bruke, 2002).

  • Scalability: Enormous groups of users and items exist in several environments in which CF systems make their recommendations. Hence, great computation power is necessary to compute recommendations. Dimensionality reduction and clustering techniques are ways to overcome this challenge.

  • Handling high-dimensional data: Basic recommender filtering approaches cannot handle high-dimensional data, which encompass many attributes. One solution to this concern is to reduce the number of attributes; another is to consider a hybrid RS that can handle huge volumes of data.

  • Handling heterogeneous datatypes: Basic recommender filtering approaches cannot handle heterogeneous datatypes, whereas hybrid RSs can manipulate heterogeneous data.

Thus, the scope of our related works is on the CF technique combined with other techniques in hybrid RSs to overcome these limitations. The following sub-sections present 3 main categories of RSs in the education field: Ontology-based, CBR-based, and hybrid RSs.

2.1 Ontology-based RSs

An ontology is a formal, explicit specification of a shared conceptualization (Guarino et al., 2009). It is formal, as it is written in a formal syntax with semantics allowing it to be understandable and interpretable by machines, while it is explicit because its concepts and the relations between them are explicitly defined. Besides, it is shared since the ontology represents domain knowledge agreed upon and shared by a group of persons (Guarino et al., 2009). The conceptual model of an ontology permits reasoning at all concept levels. Hence, the ontology-based RS is a knowledge-based RS technique that is very popular in the e-learning domain due to its capability to cluster the learners’ models based on their educational background, learning style, study trajectory, and knowledge level (Amane et al., 2022; Tarus et al., 2018). In addition, it resolves the cold-start problem (Jeevamol & Renumol, 2021). Numerous ontology-based RSs have been developed in association with many different recommendation techniques (Rahayu et al., 2022).

(Romero et al., 2019; Shishehchi et al., 2012) presented an ontology-based system to recommend suitable materials to learners. The used ontology integrated the learners’ and learning materials’ knowledge. Similarly, (Bouihi & Bahaj, 2019) recommended learning materials based on ontology and Semantic Web Rule Language (SWRL) rules, taking into account the learners’ learning context. In (Assami et al., 2019), an ontology-based RS was proposed to recommend personalized Massive Open Online Courses (MOOC) resources to learners according to their pace of learning, cognitive learning style, learners’ profile, and learning history. (Capuano et al., 2014), built an adaptive e-learning RS called “IntelligentWebTeacher”, combining CF and KB techniques and supported by ontology. Another ontology-based hybrid-filtering system called the ontology-based personalized course recommendation (OPCR) was proposed by (Ibrahim et al., 2019) to recommend a higher education course at the university based on the learner’s profile and the course content. It combined the CF, KB and CB techniques. In (Sarwar et al., 2019), CBR, neural networks and ontology were combined to recommend personalized content to learners according to their profiles and context awareness, in order to enhance the degree of learner’s productivity. In addition, (Qomariyah & Fajar, 2019) proposed an e-learning RS to recommend material content to learners according to their learning style based on the Active Pairwise Relation Learner (APARELL) logic approach and ontology. Authors in (Gulzar et al., 2018) presented the Personalized Course Recommender System (PCRS) based on a hybrid approach of N-Grams queries and ontology to recommend courses to researchers in order to help them choose the appropriate courses in seminal years for time gain and better research.

2.2 CBR-based RSs

CBR (Perner, 2019) is an artificial intelligence technique applicable to problem-solving and learning where earlier cases are available. It is the process of addressing a new problem based on the solutions of similar prior problems, retrieved from a library of prior cases called the case base. CBR-based RSs are also considered KB RSs. Unlike other RSs, a CBR-based RS does not need to store an enormous volume of data about item ratings or specific users. CBR is a specific information retrieval technique extensively used in nearest-neighbor RSs. Several CBR-based RSs have been proposed in the education domain, using various recommendation techniques.

(Sandvig & Burke, 2005) proposed the Academic Advisor Course Recommendation Engine (AACORN) that implements CBR based on the knowledge of past cases. It integrated knowledge such as past students’ experience and courses’ history to guide learners in choosing appropriate courses. (Gil et al., 2012) proposed the Architecture for Intelligent Recovery of Educational content in Heterogeneous Environments (AIREH) that can retrieve and incorporate varied personalized labeled educational content acquired from diverse environments by a CBR system. In (Bousbahi & Chorfi, 2015), a CBR-based RS was proposed to recommend the most suitable MOOCs from different resources in reply to a particular request of the learner based on his/her profile, requirements, and knowledge. Another assistant RS was introduced by (Duque Méndez et al., 2018) to guide learners in choosing educational material based on CBR. In (Salam & Fathurrahmad, 2021), a student final project RS was proposed to improve the quality of final assignments in universities. The system was based on CBR to detect a list of research topics and used programming languages in similar projects. Authors in (Gomez-Albarran & Jimenez-Diaz, 2009) presented a CBR approach for personalized recommendations and learners’ authoring tasks in online repositories of Learning Objects (LOs), combining CB filtering with CF mechanisms. The learners’ authoring tasks included the integration of ratings of the new as well as existing LOs.

2.3 Hybrid RSs

Hybrid RSs have widely shown better outcomes than any stand-alone filtering technique. Several hybrid combinations of CF and DF techniques have been proposed. Such hybridization minimizes the limitations of CF, such as the sparsity and the cold start concerns, because the DF technique does not need the user’s rating history. As presented, most of the related works concentrate on recommending learning content to learners using different techniques, rather than helping students to find the academic path corresponding to their interests. The applied techniques in these works are summarized in Fig. 1 to demonstrate the strengths and weaknesses of the used hybridization approaches.

Fig. 1 Comparison of the hybridization of RS techniques

For instance, (Schafer et al., 2007) proposed the hybridization of CF and DF approaches to improve the movie recommendation quality. In (Xia et al., 2009), an augmentation item-based CF hybrid system was presented using demographic data to predict missing data such as age and occupation information. In (Agarwal et al., 2017), the authors used users’ demographic data instead of users’ rating history to generate accurate movie recommendations and overcome the CF cold-start issue. Only a few hybridization approaches have combined three filtering techniques. In (Benouaret, 2017), the demographic, semantic, and CF core techniques were combined to propose a hybridization strategy to reinforce the experience of visitors in tourist places and museums. Each method was adapted to a specific stage of the museum visit. The demographic approach was applied to overcome the CF cold-start problem, the semantic approach produced recommendations semantically close to the visitors’ previously appreciated visits, whereas the CF approach recommended visits previously liked by similar users.

In education, hybrid RSs have been used in most cases to recommend learning activities or resources to learners (Deschênes, 2020; Tarus et al., 2017). Indeed, the authors in (Farzan & Brusilovsky, 2006) worked on developing an RS based on an adaptive community to recommend appropriate courses to active learners. They analyzed learners’ career goals by implementing a social navigation technique. Protus was presented in (Klašnja-Milićević et al., 2011) as a programming tutoring system based on the learners’ knowledge levels and interests. In (Chavarriaga et al., 2014), a KB with CF technique was introduced in order to advise learning materials, helping learners achieve advanced competence levels using an online course platform, whereas in (Tarus et al., 2017), a KB hybrid RS was proposed to advise e-learning materials to learners based on sequential pattern mining with ontology. Moreover, authors in (Rodríguez et al., 2015) introduced a student-centered LO RS that combined CB, KB, and CF techniques, in which the learner’s model/profile was used to adapt the LOs retrieved from the LO databases, considering the descriptive metadata stored for the objects. Yet, a hybridization of CF, KB, and DF approaches was not tackled.

3 Data collection and preparation

Since our study focuses on assisting high school students in deciding on their higher education choices, the expected recommendations will be based on the university graduates’ educational trajectories. Unfortunately, the data required to construct an adequate knowledge base are not available. Moreover, they are very challenging to obtain online, where users are unwilling to reveal their data. Thus, we gathered the required data by disseminating an online survey including 55 questions. The survey was created in bilingual form (English and French). The survey’s dissemination process included the graduates of the Lebanese university in three governorates: South, North, and Beirut. The online survey was posted on social media for three months and sent by email to many mailing lists. A real-world dataset of 869 university graduate profiles and 20,000 high school course ratings was collected.

In order to evaluate the five investigated RS approaches in generating recommendations for universities, university majors, and career fields, the following four criteria were adopted to create our survey sections and questions. The survey should include graduates':

  1. Family information, demographics, and personal data, such as gender, hobby, language, etc., to personalize the recommendations, forming the "Graduate personal information" section.

  2. High school or vocational school data, such as graduates' school course interests, school sector, school education system, etc., to provide high school students with recommendations based on their high school information, forming the "Graduate high school or vocational school information" section.

  3. University information, such as teaching effectiveness, university major, university name, etc., to provide high school students with recommendations related to university paths, forming the "Graduate first and currently attended university information" sections.

  4. Career information, such as their current occupation, career interests, etc., to recommend career choices to high school students according to their career interests, forming the "Graduate interests and career information" section.

Together, these criteria cover the graduates' trajectories, starting from studying at high school, then studying at the university, followed by entering the career market. Integrating university graduates' trajectory data in our hybrid recommendation process would help to recommend promising university paths and career choices to high school students. Figure 2 shows some samples of the questions in the conducted survey. The university graduates rated their level in 23 high school courses, namely: Arabic language, Biology, Chemistry, Dance, Drawing, Economics, English language, French language, Other foreign languages, Mathematics, Geography, History, Music, Literature, Physical education, Philosophy, Science of engineering, Physics, Psychology, Technology and Computer Science, Religion, Sociology, and Theatre.

Fig. 2 Samples of survey questions

4 The investigated approaches

In this section, we investigate five approaches to choose the most accurate one to recommend a major and career domain to high school students based on the trajectories of university graduates. A case study is formulated based on the dataset collected from the survey conducted in Lebanon. The selection of these approaches was affected by the data types and high dimensionality of attributes in the dataset. These approaches are:

  1. The stand-alone user-based and item-based CF RS.

  2. The stand-alone DF RS.

  3. The stand-alone KB RS supported by CBR.

  4. The stand-alone KB RS supported by ontology and CBR.

  5. The KB Hybrid RS combined with the user-based CF and supported by ontology and CBR.

The overall research process for this study is represented in Fig. 3. Our dataset has been collected from a survey as presented in Section 3. Then, it has been analyzed and stored into the knowledge base (ontology), whereas the ratings were stored into databases. Each of the five considered approaches implements one or more RS techniques, as shown in Fig. 3. Applying these approaches on our dataset has generated different results of recommendations. We compared these results and evaluated them in order to determine the most appropriate approach for our research questions and to present the optimum personalized recommendations to the high school students. Thus, this study provides a comparative analysis considering two main perspectives:

  • The integrity of all data related to the adopted research questions, which is reflected in the four data categories collected in our dataset, as discussed in Section 3.

  • Considering all the evaluation criteria that can be applied to evaluate the five recommender systems and support our investigation to answer the adopted research questions, including the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) accuracy metrics, in addition to other similarity rate metrics as further explained in the upcoming sub-sections.

Fig. 3 The overall conducted research process

A detailed experimental evaluation is demonstrated in the following sub-sections for each approach to examine the accuracy of its recommendations for high school students.

4.1 The stand-alone user-based and item-based CF RS

This approach uses rating data to find similarities in interests between high school students and university graduates, in order to generate career domain recommendations. In our case study, the CF recommendation is based on high school students’ and graduates’ course ratings. The university graduates rated their level in 23 high school courses. Table 1 shows a sample of course ratings for a high school student. The CF RS engine uses these ratings to recommend to him/her a career domain based on the prior university graduates’ ratings.

Table 1 An example for a high school student courses’ rating

Since the experimental evaluation is based on course ratings, we developed the memory-based technique of CF, in which both the user-based and item-based methods are adopted (Ghazarian & Nematbakhsh, 2015). The user-based method associates the active user with similar users, known as neighbor users. Missing ratings are then predicted using various similarity metrics, which calculate the similarity values based on the past users’ ratings. The item-based method focuses on the items instead of the users to find the most similar items based on the active user’s ratings compared to the past users’ rating history. As for the CF experiments, we used a dataset of 469 objects having 39 attributes. The objects represent the graduates of the university and the 39 attributes represent their high school courses and career ratings. The dataset contains 11,000 ratings for 39 attributes, provided by the 469 graduates. All the university graduates in the dataset rated at least 20 attributes.
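As a minimal sketch of the user-based method described above, the following Python example predicts a missing rating as the similarity-weighted average of the ratings given by the nearest neighbors. The conversion of the Euclidean distance into a similarity, 1 / (1 + distance), the function names, and the toy ratings are all assumptions made for illustration; they are not taken from the implementation evaluated in this study.

```python
import math


def euclidean_similarity(a, b):
    """Similarity in (0, 1] derived from the Euclidean distance over co-rated items.

    The mapping 1 / (1 + distance) is an illustrative assumption.
    """
    shared = [item for item in a if item in b]
    if not shared:
        return 0.0
    distance = math.sqrt(sum((a[item] - b[item]) ** 2 for item in shared))
    return 1.0 / (1.0 + distance)


def predict_rating(active, others, item, n_neighbors=2):
    """Predict the active user's rating of `item` from the most similar users."""
    scored = [
        (euclidean_similarity(active, ratings), ratings[item])
        for ratings in others
        if item in ratings
    ]
    # Keep only the N most similar users (the neighborhood).
    neighbors = sorted(scored, key=lambda pair: pair[0], reverse=True)[:n_neighbors]
    total_weight = sum(similarity for similarity, _ in neighbors)
    if total_weight == 0:
        return None
    return sum(similarity * rating for similarity, rating in neighbors) / total_weight


# Hypothetical toy ratings: three graduates and one high school student.
graduates = [
    {"Mathematics": 3, "Physics": 3, "Information Technology": 3},
    {"Mathematics": 3, "Physics": 2, "Information Technology": 3},
    {"Mathematics": 1, "Physics": 1, "Information Technology": 1},
]
student = {"Mathematics": 3, "Physics": 3}

prediction = predict_rating(student, graduates, "Information Technology")
print(prediction)
```

With a neighborhood of two, the third graduate, whose course ratings differ most from the student's, does not contribute to the prediction.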

The dataset was split into two sub-datasets in our experimental study: the training and testing data. The Euclidean distance, City Block, Spearman Correlation, Pearson Correlation, and Uncentered Cosine similarity metrics (Bagchi, 2015) have been used to find similarities between the graduates and high school students based on their ratings. For each similarity metric, an evaluation has been conducted based on the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) accuracy metrics (Bagchi, 2015). Some parameters, such as the neighborhood size N and the training ratio, have been defined as shown in Fig. 4. The neighborhood comprises the N nearest neighbors of a given user, i.e., the users most similar to the selected user. The neighborhood size can affect the prediction quality; varying the number of neighbors shows how sensitive the predictions are to this parameter. The training ratio represents the percentage of each user’s preferences used to produce recommendations; the remaining preferences are compared to the estimated preference values to evaluate the recommender’s accuracy.
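The two accuracy metrics and the role of the training ratio can be sketched as follows. This is an illustrative Python example, not the evaluation code used in the experiments: the helper names are hypothetical, and the split below is deterministic for readability, whereas an evaluator would normally choose the held-out preferences at random.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error between actual and estimated preference values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error between actual and estimated preference values."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def split_preferences(ratings, training_ratio):
    """Split one user's ratings into a training part and a held-out part.

    Assumption: items are split in sorted order to keep the example reproducible.
    """
    items = sorted(ratings)
    cut = int(len(items) * training_ratio)
    training = {item: ratings[item] for item in items[:cut]}
    held_out = {item: ratings[item] for item in items[cut:]}
    return training, held_out


# Hypothetical held-out ratings and the values a recommender estimated for them.
actual_values = [3, 2, 1, 3]
estimated_values = [3, 1, 1, 2]

print(mae(actual_values, estimated_values))
print(rmse(actual_values, estimated_values))
```

Both metrics are 0 for a perfect match; RMSE penalizes large individual errors more heavily than MAE because the errors are squared before averaging.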

Fig. 4 RMSE/MAE for the item-based and user-based similarities

To evaluate the accuracy of this CF RS approach, we implemented the Mahout evaluation method (Giacomelli, 2013). For each user, 90% of the preferences provided by the given data model were set as training data to generate the recommendations, whereas the rest of the data was compared against the estimated preference values to see how well the recommender's predicted preferences match the user's actual preferences. The evaluation returns a score representing how well the recommender's estimated preferences match the actual values. Lower scores mean a better match, and 0 is a perfect match. The results of many experiments show that the user-based CF algorithm with the Euclidean distance similarity metric, a neighborhood size equal to 50, and a training ratio equal to 0.8 generated the lowest MAE and RMSE values, 0.45 and 0.58 respectively, compared to the item-based CF algorithm and its similarity metrics, as shown in Fig. 4. The following is an example of a recommendation based on the user-based CF approach:

  • Recommended university major: Information Technology, similarity rate: 3.0

In the above recommendation, the similarity rate represents the highest rate of the recommended major, which means it is 100% similar, whereas Information Technology represents the recommended career domain. Even though the obtained results are accurate, this approach was applied to only a part of our dataset, namely the ratings. However, our dataset contains more heterogeneous data, such as students’ interests, career information, and demographic data. Thus, this approach is inadequate for the whole dataset.

4.2 The stand-alone DF RS

The DF RS is based on the demographic data and does not take into account the domain knowledge, user interests, and ratings in its recommendation process. Thus, the DF RS can provide recommendations before receiving any rating from the active students. However, for many students, generalizations based on demographic features are too broad for highly personalized recommendations. For example, not all 17-year-old male students who liked scientific courses in high school would prefer the same university major or career in the future. In addition, students with different opinions or unusual interests have low correlation coefficients with other students, so recommendations for such students are very hard to generate. Thus, recommendations that are based only on demographic data, such as a student’s language, gender, and location, may be inaccurate, leading to the grey-sheep limitation (de Campos et al., 2010).

Therefore, the stand-alone DF RS approach was considered unsuitable for the dataset, since it does not take into consideration the domain knowledge, students’ preferences, and rating history. In addition, the experiments revealed no correlation between the demographic data and the course ratings in the university graduates’ dataset. In our case, demographic data are not enough on their own; they must be combined with domain knowledge, course ratings, and students’ preferences to generate more personalized recommendations. To overcome the problems and limitations of the stand-alone DF RS, different RS approaches should be examined, such as the KB and hybrid systems.

4.3 The stand-alone KB RS supported by CBR

How can personalized recommendations be generated for a high school student based on his/her interests and knowledge rather than on his/her course ratings? The KB approach could be a good solution to generate recommendations based on the domain knowledge instead of ratings (Tarus et al., 2018). Thus, we implemented the CBR KB approach, which uses indexes to speed up retrievals from a case base. The indexes apply feature matching to organize and label cases so that appropriate cases can be found when needed. Cases may be indexed by a prefixed or an open vocabulary, and within a hierarchical or a flat index structure (Perner, 2019).
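To illustrate how an index can speed up retrieval through feature matching, the sketch below builds a flat index that maps each (attribute, value) pair to the cases carrying it. It is a simplified Python illustration under our own assumptions: the class name `CaseIndex`, its methods, and the sample cases are hypothetical and do not reproduce the indexing structures of any specific CBR framework.

```python
from collections import defaultdict


class CaseIndex:
    """Flat index over a case base: (attribute, value) -> ids of matching cases."""

    def __init__(self):
        self._cases = {}
        self._index = defaultdict(set)

    def add(self, case_id, features):
        """Store a case and label it under each of its feature values."""
        self._cases[case_id] = features
        for attribute, value in features.items():
            self._index[(attribute, value)].add(case_id)

    def candidates(self, query):
        """Return ids of cases sharing at least one feature value with the query."""
        found = set()
        for attribute, value in query.items():
            found |= self._index.get((attribute, value), set())
        return found


# Hypothetical cases described by a few features.
index = CaseIndex()
index.add("g1", {"language": "French", "hobby": "Reading"})
index.add("g2", {"language": "English", "hobby": "Reading"})
index.add("g3", {"language": "English", "hobby": "Sports"})

matching = index.candidates({"language": "French", "hobby": "Reading"})
print(sorted(matching))
```

Only the candidate cases returned by the index need to be scored in detail, so cases with no feature in common with the query are never compared.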

As the size of the case base increases, it becomes critical that CBR would access the stored cases efficiently (Recio-García et al., 2014). To address this, jColibri provides a persistence mechanism through different “Connectors” and data structures for in-memory organization for case base management. jColibri separates the case storage from the indexing structure, where Connectors know how to access and retrieve cases from the medium and return those cases to the CBR system in a uniform way. In-memory or indexing is the second layer of Case Base management. In-memory case organization is the data structure used to organize the cases once loaded into memory, i.e., linear lists, trees, case retrieval nets, etc. Fig. 5 shows a sample of a high school student’s query via the KB RS’s implemented interface, whereas Fig. 6 shows five recommendations suggested to the high school student based on his/her query. Each recommendation suggests a university/college, university major, and career domain. The high school student can select a suitable recommendation from the retrieved cases that match his/her interests. To evaluate the accuracy of the recommendations, we used NNScoringMethod in jColibri2 to measure the similarity rate (Recio-García et al., 2014). This function performs a Nearest Neighbor numeric scoring comparison of attributes to evaluate the retrieval of the most similar cases. Fig. 7 shows that the first two solutions are 100% similar to the above query and the other three solutions are 80% similar to the same query. Although the CBR-based approach provided good results, it does not take into account the semantic similarity between the concepts.

Fig. 5 A sample of CBR KB RS query

Fig. 6 The CBR knowledge-based RS retrieved recommendations

Fig. 7 The CBR KB RS recommendations’ evaluations

4.4 The stand-alone KB RS supported by ontology and CBR

The KB RS supported by ontology and CBR used career knowledge, higher education knowledge, students’ interests, and demographic data. We constructed “GraduateOnto” as our ontology that encompasses the higher education, school, career, and student profile concepts as shown in Fig. 8. The purple classes and subclasses represent our GraduateOnto concepts, the green classes represent the DBpedia ontology concepts (Lehmann et al., 2015), and the pink classes represent the Schema.org concepts (Patel-Schneider, 2014). Combining ontology and CBR approaches would improve the personalization of recommendations. We linked our ontology to these sources by reusing some concepts already defined online and thus these concepts are linked to open data. The CBR is based on three elements: description, solution, and the case. Thus, our GraduateOnto and the CBR are connected as follows:

  • The graduate concept encompasses the graduates’ cases describing all graduates’ instances in the knowledge base.

  • The student information, school information, and person concepts cover the description of these graduates.

  • The university information and the career information concepts encompass the solution.
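The mapping listed above can be pictured as a small data structure. The following is a minimal Python sketch under stated assumptions, not the GraduateOnto or jColibri2 representation: the class names (Description, Solution, GraduateCase), the field names, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Description:
    """Student, school, and person information (hypothetical fields)."""
    gender: str
    language: str
    hobby: str
    country: str


@dataclass(frozen=True)
class Solution:
    """University and career information (hypothetical fields)."""
    university: str
    major: str
    career_domain: str


@dataclass(frozen=True)
class GraduateCase:
    """One graduate instance: a description paired with its solution."""
    case_id: str
    description: Description
    solution: Solution


# A stored graduate case with illustrative values only.
case = GraduateCase(
    case_id="graduate_001",
    description=Description("female", "French", "drawing", "Country X"),
    solution=Solution("University A", "architecture", "design"),
)

# A high school student's query has a description but no solution yet.
query = Description("female", "French", "drawing", "Country X")
```

In this picture, retrieval compares the query with the description part of each stored case, and the solution part of the retrieved cases is what gets recommended.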

Fig. 8 GraduateOnto design

The experiments are based on the prior graduates’ cases stored as instances in our ontology. The graduates’ cases were extracted from our survey according to several criteria, such as retaining only university graduates whose university major is related to their current job and whose job meets their interests. The final refined case base encompasses 658 graduate cases. This case base is integrated into the ontology design and processed by jColibri2’s NNScoringMethod retrieval function (Recio-García et al., 2014). It uses: (1) global similarity functions, such as Average, to compare compound attributes and (2) local similarity functions, such as Detail, to compare simple attributes. For example, the Graduate case component is a compound attribute composed of several simple attributes (gender, language, hobby, country, etc.). When two cases are compared, the local similarity functions compute the similarity between simple attributes and the global similarity function computes the average over the local similarities. Thus, a global similarity function, such as the average function, is assigned to the description. The method returns a collection of RetrievalResult objects.
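The interplay of local and global similarity described above can be sketched in a few lines of Python. This is an illustrative sketch only and does not reproduce jColibri2’s API: the function names (`equal`, `global_average`), the exact-match local function, and the attribute values are assumptions made for the example.

```python
from typing import Callable, Dict

Attributes = Dict[str, str]
LocalSim = Callable[[str, str], float]


def equal(a: str, b: str) -> float:
    """Local similarity (assumed): 1.0 if two simple values match, else 0.0."""
    return 1.0 if a == b else 0.0


def global_average(query: Attributes, case: Attributes,
                   local_sims: Dict[str, LocalSim],
                   weights: Dict[str, float]) -> float:
    """Global similarity: weighted average of the local similarities."""
    total_weight = sum(weights[name] for name in local_sims)
    if total_weight == 0:
        return 0.0
    score = sum(weights[name] * sim(query[name], case[name])
                for name, sim in local_sims.items())
    return score / total_weight


# Hypothetical compound attribute made of five simple attributes.
names = ["gender", "language", "hobby", "country", "school_type"]
local_sims = {name: equal for name in names}
weights = {name: 1.0 for name in names}

query = {"gender": "f", "language": "en", "hobby": "chess",
         "country": "X", "school_type": "public"}
case = dict(query, hobby="music")  # differs from the query in one attribute

similarity = global_average(query, case, local_sims, weights)  # 4 of 5 match
```

With equal weights, four matching attributes out of five give a global similarity of 0.8; changing a weight shifts how much a mismatch on that attribute costs.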

Once the cases have been scored according to their similarity with the query, only the top k most similar cases are selected. Hence, we apply the k-NN retrieval process, which combines Nearest Neighbor scoring and top k selection. After the similarity function and weight are set for the attributes, the similarity function is executed to obtain a list of retrieval result objects containing the scored cases. Finally, the most similar cases are obtained using the selectTopKRR function (Recio-García et al., 2008). We used the jColibri2 retrieval process to compare high school students’ cases with university graduates’ cases and find the most similar cases in order to provide appropriate recommendations. Fig. 9 shows an example of a high school student’s query. The query’s attributes are selected from the instances saved in the ontology design. The calculation returns a value between zero and one, where zero indicates that the retrieved case is the least similar to the active query case and one indicates that it is the most similar. Fig. 10 presents the first retrieved case with 100% similarity to the active user query case in Fig. 9, while Fig. 11 shows the second most similar case to the active user query, which is approximately 88% similar to the submitted query.
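The two-step k-NN retrieval described above (score every case, then keep the k best) can be illustrated with the following Python sketch. It is not the jColibri2 implementation: the helper names (`score`, `retrieve_top_k`), the exact-match scoring, and the sample cases are assumptions for the example.

```python
from typing import Dict, List, Tuple

Case = Dict[str, str]


def score(query: Case, case: Case) -> float:
    """Fraction of query attributes whose values match the case (0.0 to 1.0)."""
    if not query:
        return 0.0
    matches = sum(1 for name, value in query.items() if case.get(name) == value)
    return matches / len(query)


def retrieve_top_k(query: Case, case_base: List[Case],
                   k: int) -> List[Tuple[Case, float]]:
    """Nearest Neighbor scoring followed by top-k selection."""
    scored = [(case, score(query, case)) for case in case_base]
    # Highest similarity first; the stable sort keeps ties in case-base order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical case base and query.
case_base = [
    {"gender": "m", "language": "fr", "hobby": "music"},
    {"gender": "f", "language": "en", "hobby": "music"},
    {"gender": "f", "language": "en", "hobby": "chess"},
]
query = {"gender": "f", "language": "en", "hobby": "chess"}

top_two = retrieve_top_k(query, case_base, k=2)
```

Here the third stored case matches the query on every attribute and is ranked first, followed by the case that matches on two of three attributes.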

Fig. 9 Active student query example

Fig. 10 Most similar retrieved case to an active student’s query

Fig. 11 Second most similar case

This KB RS generated personalized recommendations for the high school student with the support of the ontology and CBR concepts. The HoldOutEvaluator algorithm was used to evaluate the accuracy of this RS approach (Recio-García et al., 2014). As shown in Table 2, the evaluation yields high accuracy levels when 10% and 15% of the dataset are used for testing and the process is carried out several times with different numbers of cycles.
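The general idea of a repeated hold-out evaluation can be sketched as follows. This is a generic Python illustration, not jColibri2’s HoldOutEvaluator: the function name `hold_out`, its parameters, and the `evaluate` callback (which scores one held-out query against the remaining cases) are assumptions for the example.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def hold_out(cases: Sequence[T], test_fraction: float, repetitions: int,
             evaluate: Callable[[List[T], T], float], seed: int = 0) -> float:
    """Average score over several random train/test splits of the case base."""
    if repetitions < 1 or len(cases) < 2:
        raise ValueError("need at least one repetition and two cases")
    rng = random.Random(seed)
    # Hold out at least one case, and always leave at least one for training.
    n_test = min(len(cases) - 1, max(1, round(len(cases) * test_fraction)))
    scores: List[float] = []
    for _ in range(repetitions):
        shuffled = list(cases)
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        # Each held-out case acts as a query against the remaining cases.
        scores.extend(evaluate(train, query) for query in test)
    return mean(scores)
```

A call such as `hold_out(cases, 0.10, repetitions=5, evaluate=best_similarity)` would hold out 10% of the cases in each of five cycles, where `best_similarity` is a hypothetical function returning the similarity of the closest training case to the query.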

Table 2 HoldOutEvaluator evaluation results (approach 4)

Thus, this system is considered more efficient than the previously tested RSs. The analysis of this approach revealed that the ontology is very useful in supporting CBR KB RSs. The ontology helped to integrate the high school students’ interests, graduates’ knowledge, and high school and higher education knowledge into the KB RS by conceptualizing them in a formal language. Likewise, reusing existing ontology concepts brings the benefit of their reliability and stability. Moreover, throughout the similarity calculation, the ontology bridges the gap between the high school student’s query and the case-base vocabulary. This integration allowed the system to generate personalized recommendations. However, this approach does not take into consideration the course ratings of high school students and graduates.

4.5 The KB hybrid RS combined with the user-based CF and supported by ontology and CBR

This approach presents a hybrid RS based on the CF, CBR, DF, and KB techniques with ontology. It is a combination of approaches 1, 2, and 4, and it generates recommendations based on the domain knowledge, students’ ratings, interests, and demographic data. The demographic data of graduates has been integrated into GraduateOnto. Thus, this hybridization aims to improve the recommendations and the precision of the system. Fig. 12 illustrates the sequence diagram of this approach. First, the high school student enters his/her profile, preferences, and ratings of the high school courses into the system’s GUI. These data are stored in the corresponding databases. Then, the student uses the search system to get personalized recommendations about universities, majors, and career domains. To produce such recommendations, the user-based CF system interrogates the databases to search for similar users (in our case, the graduates) based on the high school courses’ ratings. As mentioned in Section 4.1, the recommendation of our user-based CF is based on the Euclidean distance metric. The CF system selects the most similar result and integrates the career domain information of this result as a new feature in the KB system, using the Feature Augmentation Hybridization strategy (Burke, 2002). This new feature is used in the KB system as supporting knowledge for the query of the high school student.
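The feature augmentation step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors’ implementation: the function names, the dictionary layout of a graduate record, the choice to compare only courses rated by both users, and the sample values are all hypothetical.

```python
import math
from typing import Dict, List

Ratings = Dict[str, float]


def euclidean_distance(a: Ratings, b: Ratings) -> float:
    """Distance over the courses both users rated; inf if none are shared."""
    shared = a.keys() & b.keys()
    if not shared:
        return math.inf
    return math.sqrt(sum((a[course] - b[course]) ** 2 for course in shared))


def most_similar_graduate(student: Ratings, graduates: List[dict]) -> dict:
    """User-based CF step: the graduate with the closest course ratings."""
    return min(graduates,
               key=lambda g: euclidean_distance(student, g["ratings"]))


def augment_query(query: Dict[str, str], student: Ratings,
                  graduates: List[dict]) -> Dict[str, str]:
    """Feature augmentation: add the CF result to the KB query as a feature."""
    nearest = most_similar_graduate(student, graduates)
    augmented = dict(query)  # leave the original query untouched
    augmented["graduate_interest_career_domain"] = nearest["career_domain"]
    return augmented


# Hypothetical data.
graduates = [
    {"ratings": {"math": 5, "physics": 5}, "career_domain": "engineering"},
    {"ratings": {"math": 1, "physics": 2}, "career_domain": "arts"},
]
student_ratings = {"math": 5, "physics": 4}
kb_query = {"language": "en", "hobby": "robotics"}

augmented_query = augment_query(kb_query, student_ratings, graduates)
```

The augmented query is then what the KB retrieval step would compare against the stored cases, so the CF output influences the final ranking without replacing the knowledge-based matching.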

Fig. 12 The sequence diagram of Approach 5

Fig. 13 illustrates a sample query submitted by a high school student to get recommendations for university paths, while Fig. 14 shows an example of the “Graduate Interest Career Domain” feature that is integrated with the high school student’s query, as well as the top N recommendations generated by Approach 5. The KB system interrogates GraduateOnto to search for similar paths between the graduates’ cases (which are saved as instances in the ontology) and the high school student, based on his/her search query and the CF recommendation. The system selects the top K results and presents them to the high school student. Approach 5 integrates a dataset of 658 graduate cases, representing only the university graduates whose university major is related to their current job and whose job meets their interests. By implementing the ontology similarity and CBR retrieval method, this hybrid KB system can retrieve the most similar cases that best fit the high school student’s interests.

Fig. 13 A sample query in Approach 5

Fig. 14 Recommendation results in Approach 5

The HoldOutEvaluator algorithm was applied to evaluate the accuracy of Approach 5 (Recio-García et al., 2014), and trained with 658 university graduates’ cases, as shown in Table 3. Table 4 shows that this approach achieves high accuracy levels on two criteria, namely the “accuracy of retrieving the most similar cases” and the “accuracy of generating appropriate recommendations”. The tests were conducted with a sample of 60 high school students, 40 university students, and 40 university graduates. The university students were asked to participate because they have experienced the transition from school to university, whereas the university graduates were asked to participate because they have already passed this transition and know its outcomes. The results are shown in Tables 5 and 6. The purpose of this test is to find out whether prior graduates’ knowledge can be used to assist current high school students. All experiments demonstrated the efficiency of our user-based CF system supported by CBR, ontology, and KB RS, as presented in Table 4. In addition, our analysis indicated that the hybridization in Approach 5 offers the most adequate solution for our high-dimensional data, which include more than 50 heterogeneous attributes.

Table 3 HoldOutEvaluator evaluation results (Approach 5)
Table 4 The accuracy results of Approach 5
Table 5 Results of interest in the recommendation of Approach 5
Table 6 Results of users’ satisfaction in the recommendation of Approach 5

5 Discussion and insights

Table 7 summarizes the comparative analysis of the five approaches investigated in this paper, showing the advantages and disadvantages of each approach. The experiments demonstrate that the stand-alone user-based and item-based CF approach is not adequate for our dataset, since it considers only the rating history and has many limitations, such as the cold-start problem, data sparsity, and the grey-sheep problem. Additionally, it recommends only career domains. Similarly, the stand-alone DF approach is not adequate for our dataset since it integrates only demographic data. In addition, our analysis revealed the weakness of the DF approach in generating personalized recommendations, as it does not consider the ratings and domain knowledge of graduates. On the other hand, the stand-alone KB approach supported by CBR is a valuable tool when the item is infrequently used because it is not dependent on the ratings. However, like the first two approaches, this technique is not adequate for our dataset as it does not consider the graduates’ ratings. Furthermore, this approach needs to be combined with an ontology-based technique in order to define the graduates’ domain knowledge and to retrieve similar cases semantically. This was the aim of the fourth investigated approach, the stand-alone KB supported by ontology and CBR. This approach generated personalized and accurate recommendations, as illustrated in Table 2. However, like the previous approaches, it is not adequate for our dataset as it does not take into account the graduates’ ratings. Therefore, it should be combined with the CF technique in a hybrid RS in order to exploit both knowledge and ratings, which is the aim of the fifth approach (KB using DF + user-based CF and supported by ontology and CBR).

Table 7 Comparative analysis of the five recommender systems approaches

Therefore, it can be concluded from this comparative analysis that this hybridization is the most suitable recommendation approach to answer our research questions: it generates 98% of similar cases, 95% of which are personalized based on the interests of high school students, since it computes the results based on the domain knowledge, high school student profile, ratings, interests, and demographic data. The average usefulness of the proposed hybrid approach ranged from 92.5% to 95%, as presented in Table 5, whereas the average satisfaction level ranged from 90% to 92.5%, as shown in Table 6. Thus, high-accuracy recommendations are generated by integrating the three core techniques (DF, CF, and KB), as presented in Table 4. Furthermore, the approach shows high precision in treating heterogeneous data types and high-dimensional datasets, as shown in Table 3. Based on this discussion and analysis, we recommend using this hybridization approach in other problems, and we encourage the reuse of the constructed GraduateOnto ontology in other problems related to the educational domain.

6 Conclusion

In this paper, we implemented and evaluated five recommendation approaches in order to select the most appropriate approach to recommend a university, university major (study field), and career domain to high school students based on their profile, preferences, and level at high school. These approaches are: (1) the stand-alone user-based and item-based CF approach, (2) the stand-alone DF approach, (3) the stand-alone KB approach supported by CBR, (4) the stand-alone KB approach supported by ontology and CBR, and (5) the KB hybrid RS combined with the user-based CF technique and supported by ontology and CBR. In addition, the GraduateOnto ontology is introduced, which describes the domain knowledge of university graduates, including their profile, information about their high school studies, university studies, career occupation, and interests. The experimental results indicated the efficiency of the KB hybrid RS combined with the user-based CF and supported by ontology and CBR, which generates 98% of similar cases, 95% of which are personalized, accurate recommendations based on the interests of the high school students. Thus, we deduce that this hybrid approach is promising for guiding high school students towards university paths. The comparative study of the five implemented approaches presented in this paper could help researchers to determine the appropriate hybridization techniques for their work. Furthermore, the introduced combination of CF, DF, and KB supported by ontology and CBR is novel and could be a solution for similar problems, regardless of the application domain. Besides, the uniquely constructed ontology could be reused in other problems in the educational domain.

In the future, we plan to enlarge our dataset in order to verify the results on more cases, as well as to extend the ontology by considering more information about the majors/study fields (description, fees, available grants, privileged majors, percentage of unemployment, etc.). In addition, we intend to apply different ontology acquisition approaches, i.e., automatic or semi-automatic construction of the ontology, acquiring the corresponding terms and relations between the concepts and embedding them in an easy ontology representation for better retrieval and reuse. Finally, we plan to evaluate the adaptability of our system in another country, such as France, by using a new dataset.