1 Introduction

In recent decades, online education has become a global phenomenon, penetrating virtually every country as Internet infrastructure has expanded (Chan et al., 2022; Dhawan, 2020; McCarty et al., 2006; Rye, 2014). This trend is particularly pronounced in China, home to perhaps the largest population of learners in the world. According to a 2018 report, up to 144 million people in China had taken up online education as of June 2017 (Yang & Du, 2018); the number surged to over 300 million (including millions of teachers) in 2020 after the outbreak of the COVID-19 pandemic, when online delivery was almost the only option for educators and students alike (Li, 2020b). This world’s largest ICT-based teaching experiment and reform has continued into the post-pandemic era, offering a wealth of diversified education resources through the Internet (Li, 2020a). Online education, receiving unprecedented and ever-increasing attention from educators, students and the public, has become integral to the whole educational system.

The abundant materials provided by all kinds of online education platforms bring both benefits and challenges to users. On the one hand, teachers and students can draw on online courses to complete pedagogical processes. On the other hand, retrieving the right materials from a huge volume of online resources tends to be rather time-consuming. To some extent, the efficiency of resource search and retrieval may determine the quality and influence of online education. Enriched content therefore calls for enhanced classification methods for the effective and efficient delivery of online education; otherwise many learners may get lost, bored or even daunted when wading through massive volumes of online materials.

In the practice of classifying online education resources, there appear to be problems such as incomplete coverage of fields and unscientific categorisation. In this research, an accurate classification method based on the support vector machine (SVM) is proposed in order to improve the utilisation of online education resources. In a properly built model for autonomous learners, SVM has the potential to enhance the classification and allocation of online education resources. Delving into the optimisation of SVM is hence of significant educational and research value. A brief review of relevant literature is set out below, followed by the proposed algorithm and the results of a comparison experiment evaluating the effectiveness and efficiency of the SVM-based classification algorithm.

2 Literature review

Classification can be seen as a process of summarising features and classification rules from sample data sets of existing resources and establishing rules of discrimination, so that new resources can be categorised according to such established rules (see Diederich, 2008). Classifying the ever-growing body of online education resources may be a task beyond human capacity. Take China as an example: years ago, it was planned that by 2020 there would be over 3,000 national-level courses available online (Yang & Du, 2018); the number of online courses offered by education providers at all levels may now be several times that figure. In addition, each course may include materials in multiple formats: texts, images, audio, video and so on, which exacerbates the complexity and difficulty of data search and retrieval. In this case, training machines to analyse such large and complex datasets seems a sound solution (Shalev-Shwartz & Ben-David, 2014), and document classification has been a traditional task that machine learning can handle satisfactorily (Mehryar et al., 2018).

Historically, international research on classification methods for education resources had an early start. Methods based on word frequency statistics and factor analysis have been widely used in areas such as email and information retrieval (see Joachims, 2002). Other traditional methods for more complex resource classification may have to rely on the integration of multiple neural networks and deep learning (Lam et al., 2012). However, problems such as incomplete coverage of fields and unscientific categorisation can be identified in practice, which ultimately affect the quality of their application. Thus, the support vector machine (SVM) is adopted as the basis for the proposed classifier in this research, since this algorithm may help solve the extremum problem in the traditional methods (Hamel, 2009).

SVM is an established method in natural language processing (NLP). It is one of the most significant kernel-based methods of machine learning and one of the most popular supervised learning algorithms, widely used for classification and regression analysis (Steinwart & Christmann, 2008). A classic definition is set out as follows (Cristianini & Shawe-Taylor, 2000, p. 7):

Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory. This learning strategy introduced by Vapnik and co-workers is a principled and very powerful method that in the few years since its introduction has already outperformed most other systems in a wide variety of applications.

Given the widely agreed outperformance of SVM over other methods, it is not surprising that SVM has been used in research across various fields, including but not limited to financial analysis, medical analysis and biology (Ma & Guo, 2014; Murty & Raghava, 2016; Suykens et al., 2015; Wang, 2005).

A basic explanation of the fundamental principle of SVM for classification is given here, as illustrated in Fig. 1. The central task of SVM is to create a hyperplane between data sets that indicates which class an item probably belongs to. The challenge is to train the machine to learn structure from data and map it to the right class label. For the best result, the hyperplane has the largest distance to the nearest training data points of any class. Thus, classification based on the SVM algorithm can be framed as finding the hyperplane that separates heterogeneous data samples, a problem to which a solution can be worked out. By calculating the maximum spacing between heterogeneous samples, the category of the target sample can be determined, and upon further processing the classification of all samples can be completed.

Fig. 1

The principle of classification using the SVM algorithm

The signs “+” and “−” in Fig. 1 represent positive and negative data samples respectively. As can be seen in the graph, H lies at the centre of the adjacent edge between the two types of samples, serving as the isolation barrier between them; L1 and L2 lie on the verge of the two types of samples, and the distance between them is the classification margin. In solving for the maximum margin, if the dividing line H can completely separate positive and negative samples like a watershed, H is optimal (Liu et al., 2019). If the classification samples are distributed in a multi-dimensional space and there exists a hyperplane H that completely separates the two classes, ‘positive’ and ‘negative’, with the most significant classification interval, then the classification margin is maximised and the hyperplane can also be used to predict other data of the same classes. Deng et al. (2013) provide a more comprehensive and extended elaboration of the mathematical representation of SVM.
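
To make this principle concrete, the following minimal Python sketch (using scikit-learn, which is not part of the original study) fits a linear SVM on hypothetical two-dimensional data and reports the separating hyperplane and the margin 2/||u|| discussed above.

```python
# A minimal sketch of the maximum-margin idea in Fig. 1, using scikit-learn.
# The toy data below is hypothetical and only serves to illustrate the concept.
import numpy as np
from sklearn.svm import SVC

# Two small clusters of "positive" (+) and "negative" (-) samples.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

# A linear SVM finds the separating hyperplane u^T x + b = 0
# with the largest margin 2 / ||u||.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

u, b = clf.coef_[0], clf.intercept_[0]
print("normal vector u:", u)
print("offset b:", b)
print("classification margin 2/||u||:", 2.0 / np.linalg.norm(u))
print("support vectors (points on L1 and L2):", clf.support_vectors_)
```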

Certainly, SVM has not been perfect since its inception; it has undergone continuous improvement through researchers’ efforts. Wiering and Schomaker (2015) point out the limitations of the standard SVM: it is a shallow model with a single layer of support vector coefficients and relies heavily on inflexible kernel functions; instead, they propose a transition from the single-layer SVM to a multi-layer SVM with deep architectures. There have also been proposals to integrate SVM with other established algorithms (e.g. Stoean & Stoean, 2014). In short, iterative progress can make SVM a refined tool for both research and practice. Although it may not be the most novel or popular method, there seems to be much room to optimise it on the basis of previous and existing endeavours.

3 Methodology

This research follows the methodological framework of design science research (DSR), or design research. DSR has been widely used in IT engineering for decades (Gregor, 2021; Hevner et al., 2004; Peffers et al., 2007), and it features the intentional creation and development of artifacts that serve human purposes (Dresch et al., 2015), aiming to “change the state-of-the-world through the introduction of novel artifacts” (Vaishnavi & Kuechler, 2008, p. 18). Apart from tangible and concrete solutions to identified problems in human activities, DSR also seeks new understanding through the dynamic interaction of artifacts and knowledge.

This research aims to develop an optimised classifier with higher classification accuracy for online education resources and to gain new knowledge on how best to classify massive resources for the best delivery of online education. The ultimate goal of the proposed method is to categorise such resources by discipline, specialisation and chapter based on their content. However, different education resources rely on various storage formats, including texts, videos, web pages, images, etc. It is therefore helpful to target online education resources in their different multimedia formats and determine their specific types via content analysis and feature analysis. Prior to the design work, it is necessary to identify and set rules for online resources of different disciplines as the reference standards for classification.

3.1 Automatic collection

Appropriate methods for the automatic collection and processing of online education resources are required. The first step is to collect the target online education resources from the Internet using web crawlers. Compared with traditional methods of data collection, web crawlers can search webpages starting from pre-set URLs and extract the links in those pages. New links can then be retrieved, and education resources downloaded automatically. The initial collection of education resources involves data including texts, videos and webpages, and the main module of the web crawlers supports webpage data download and data parsing.

In the actual collection process, the downloader module of the web crawler is initiated according to the set parameters and reads the first URL. Based on the results, online education resources are searched for on the Internet. Once the location of an online resource is identified, it is compared with the data in the local resource library to check whether the resource has already been stored in the local repository. If it is already included, there is no need to download it again; otherwise it is downloaded and stored in the local repository. When comparing education resources, apart from the resource names, the size and update date/time should also be compared, to ensure that the collected resources are of the latest versions. Upon completion of a round of collection, the corresponding URLs are stored in the list of assessed resources. The next step is to parse the newly downloaded data using the crawler’s parser and extract the URL information. If an extracted URL already exists, it is discarded immediately; otherwise it is stored in the list of resources to access. Once a resource has been downloaded following the above procedure, another URL is extracted and the procedure is repeated. In this way all the online education resources are collected from the Internet, and the collected data are stored separately according to their storage types.
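
As a rough illustration of this collection loop, the sketch below uses the requests and BeautifulSoup libraries; the seed URL, the duplicate check based on size and update time, and the repository structure are hypothetical placeholders rather than the system actually built in this research.

```python
# A simplified sketch of the collection loop described above. The seed URL,
# repository layout and duplicate check are hypothetical placeholders.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

seed_urls = ["https://example-education-platform.cn/courses"]  # placeholder
to_visit = list(seed_urls)          # list of resources to access
visited = set()                     # list of assessed resources
local_repository = {}               # resource URL -> (size, last_modified)

while to_visit:
    url = to_visit.pop(0)
    if url in visited:
        continue
    response = requests.get(url, timeout=10)
    size = len(response.content)
    last_modified = response.headers.get("Last-Modified", "")

    # Download only if the resource is new or has changed (size / update time).
    if local_repository.get(url) != (size, last_modified):
        local_repository[url] = (size, last_modified)
        # ... store response.content by type (text / image / video) ...

    visited.add(url)

    # Parse the newly downloaded page and extract links for the next rounds.
    if "text/html" in response.headers.get("Content-Type", ""):
        soup = BeautifulSoup(response.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in visited and link not in to_visit:
                to_visit.append(link)
```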

3.2 Processing of collected resources

It is helpful to clarify the processing of online resources in text, image and video formats. The purpose of processing text-format educational resources is to extract the target data and convert it into row format. The units of text data are words, phrases, paragraphs and so on, and text information consists precisely of such units of natural language. In text feature representation, a major step is to extract the noun phrases and proper nouns, such as names of people and places, in the texts. There may also be a large number of characters irrelevant to the central ideas of the texts, such as numbers, links, punctuation marks and stop words. In order to reduce the complexity of processing text data, it is helpful to treat the highly conceptual semantic vocabulary of the texts as the textual feature set and replace the original text with these textual features. To ensure the effective information content of the textual feature vector, pre-processing text data is an important procedure worthy of further research. The next step is to filter the stop words in the original texts. Stop words can be divided into two kinds: undistinguishable words and function words. Undistinguishable words are high-frequency words appearing in almost all types of texts. Function words include pronouns, participles and so on. Once the stop word list is established, matched words are deleted, while unmatched words are retained in the keyword list.
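
As one possible illustration of this stop-word filtering step, the sketch below uses the jieba segmenter and a tiny illustrative stop-word list; neither is taken from the original study.

```python
# A hedged sketch of the stop-word filtering step. The jieba tokenizer and the
# stop-word list shown here are assumptions, not the authors' actual resources.
import re
import jieba  # a common Chinese word-segmentation library

stop_words = {"的", "了", "和", "是", "the", "a", "of"}  # illustrative only

def extract_keywords(text: str) -> list[str]:
    # Remove numbers, links and punctuation irrelevant to the central ideas.
    text = re.sub(r"https?://\S+|\d+|[^\w\s]", " ", text)
    tokens = jieba.lcut(text)
    # Keep only tokens that are not matched by the stop-word list.
    return [t for t in tokens if t.strip() and t not in stop_words]

keywords = extract_keywords("该课程的主要内容包括线性代数和概率论。See https://example.org")
print(keywords)
```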

3.2.1 Pre-processing image-format education resources

Pre-processing image-format educational resources takes two steps: the first is screening based on image quality, and the second is the unified processing of image formats. Simple filtering comes first during the quality-based screening of images, where a median filter is employed. If the output image resolution is higher than 75%, the image is retained; otherwise it is removed. The unified processing of image formats covers two aspects: the unification of image storage formats and of image colours. It is set that the storage format of educational resources should be JPG and the colour space should be RGB.
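
A minimal sketch of these two image pre-processing steps is given below, using OpenCV and Pillow; the interpretation of the 75% threshold as a similarity score between the original and the median-filtered image is an assumption made purely for illustration, as are the file paths.

```python
# A minimal sketch of the two image pre-processing steps, using OpenCV and Pillow.
# The quality check below (comparing the median-filtered image with the original)
# is one possible reading of the 75% threshold and is an assumption.
import cv2
from PIL import Image

def preprocess_image(src_path: str, dst_path: str) -> bool:
    img = cv2.imread(src_path)
    if img is None:
        return False

    # Step 1: quality screening with a median filter.
    filtered = cv2.medianBlur(img, 3)
    similarity = 1.0 - (cv2.absdiff(img, filtered).mean() / 255.0)
    if similarity <= 0.75:          # discard low-quality (noisy) images
        return False

    # Step 2: unify colour space (RGB) and storage format (JPG).
    rgb = cv2.cvtColor(filtered, cv2.COLOR_BGR2RGB)
    Image.fromarray(rgb).save(dst_path, format="JPEG")
    return True

kept = preprocess_image("lecture_slide.png", "lecture_slide.jpg")  # placeholder paths
```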

3.2.2 Pre-processing video-format education resources

To guarantee the classification efficiency of educational resources in video format, it is necessary to mine valuable information frame by frame in the video resources, which can be treated as a video-format bag of words. Figure 2 illustrates the text mining process for video-format educational resources.

Fig. 2

Flow chart of text mining of video education resources

The mining of image information in video-format educational resources can be conducted in the same way. Video-format educational resources can then be pre-processed following the methods described above for text-format and image-format educational resources.
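
The frame-by-frame mining can be sketched as follows with OpenCV; the one-frame-per-second sampling rate and the file name are assumptions for illustration only, and each extracted frame would then be handled like an image-format resource.

```python
# A hedged sketch of frame-by-frame mining of a video resource with OpenCV,
# sampling one frame per second; the sampling rate and file name are assumptions.
import cv2

def extract_frames(video_path: str, every_n_seconds: float = 1.0) -> list:
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(fps * every_n_seconds))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)   # each frame can then be pre-processed
                                   # like an image-format resource
        index += 1
    cap.release()
    return frames

frames = extract_frames("linear_algebra_lecture.mp4")  # placeholder file name
```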

3.3 Extracting features from online education resources

This section addresses document frequency, information gain and word frequency. Firstly, document frequency refers to the number of documents in the collection that contain a given term. The larger this number, the more frequently the term appears across documents, and the more the corresponding feature word contributes to classification. This can be used as an important criterion for classification. After word filtering, the dimensionality of the text vector is reduced, with little impact on classification accuracy. The extraction of feature words thus reduces the dimensionality of the vector space and the amount of calculation, which indirectly improves the efficiency and accuracy of text classification.
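
For illustration, the short sketch below counts document frequency over a toy corpus and filters features against a hypothetical threshold; the corpus and threshold are not from the original experiment.

```python
# A minimal sketch of document frequency: the number of documents in a
# collection that contain each term. The toy corpus is illustrative only.
from collections import Counter

documents = [
    ["matrix", "vector", "determinant"],
    ["vector", "probability"],
    ["matrix", "eigenvalue", "vector"],
]

document_frequency = Counter()
for doc in documents:
    document_frequency.update(set(doc))   # count each term once per document

# Terms below a frequency threshold can be filtered out to reduce dimensions.
df_threshold = 2
features = [t for t, df in document_frequency.items() if df >= df_threshold]
print(document_frequency, features)
```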

Secondly, the information gain of textual resources refers to the change in entropy brought about by a given feature of the texts, and is an important element of text classification. The information-gain feature extracted from online educational texts can be expressed as:

$$\begin{array}{l}{\mu }_{IG}\left(W\right)=-\sum\limits_{i=1}^{m}P\left({C}_{i}\right)\lg P\left({C}_{i}\right)+P\left(W\right)\sum\limits_{i=1}^{m}P\left({C}_{i}|W\right)\lg P\left({C}_{i}|W\right)\\ +P\left(\overline{W }\right)\sum\limits_{i=1}^{m}P\left({C}_{i}|\overline{W }\right)\lg P\left({C}_{i}|\overline{W }\right)\end{array}$$
(1)

In Eq. (1), Ci and W stand for the class variables and the feature respectively. The terms P(W) and \(P\left(\overline{W }\right)\) are the probabilities that a text includes or excludes the feature W, while \(P\left({C}_{i}|W\right)\) and \(P\left({C}_{i}|\overline{W }\right)\) are the conditional probabilities of the text falling into category Ci when W is included or excluded, respectively.
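
The following sketch computes the information gain of Eq. (1) on a toy labelled corpus, assuming that lg denotes the base-10 logarithm; the corpus is illustrative only.

```python
# A worked sketch of the information-gain feature score in Eq. (1), assuming
# lg denotes the base-10 logarithm; the toy labelled corpus is illustrative.
import math

def information_gain(docs, w):
    n = len(docs)
    all_labels = [label for _, label in docs]
    classes = set(all_labels)
    with_w = [label for terms, label in docs if w in terms]
    without_w = [label for terms, label in docs if w not in terms]

    def plogp(labels, c):
        # p(c) * lg p(c) within the given subset of documents.
        if not labels:
            return 0.0
        p = labels.count(c) / len(labels)
        return p * math.log10(p) if p > 0 else 0.0

    h_prior = -sum(plogp(all_labels, c) for c in classes)
    gain_with = sum(plogp(with_w, c) for c in classes)
    gain_without = sum(plogp(without_w, c) for c in classes)
    p_w = len(with_w) / n
    return h_prior + p_w * gain_with + (1 - p_w) * gain_without

docs = [({"matrix", "vector"}, "algebra"), ({"integral"}, "calculus"),
        ({"matrix"}, "algebra"), ({"limit", "integral"}, "calculus")]
print(information_gain(docs, "matrix"))
```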

Thirdly, word frequency refers to the frequency of target words in textual educational resources, which can be calculated as follows:

$${\mu }_{W}\left(t, \overrightarrow{d}\right)=\frac{tf\left(t, \overrightarrow{d}\right)\times \lg\left(\frac{N}{{n}_{t}}+0.01\right)}{\sqrt{{\sum }_{t\in \overrightarrow{d}}{\left[tf\left(t, \overrightarrow{d}\right)\times \lg \left(\frac{N}{{n}_{t}}+0.01\right)\right]}^{2}}}$$
(2)

In Eq. (2), \({\mu }_{W}\left(t, \overrightarrow{d}\right)\) is the weight of the word t in the text \(\overrightarrow{d}\); the variable \(tf\left(t, \overrightarrow{d}\right)\) is the word frequency; N and nt stand for the total number of training texts and the number of training texts containing the word t, respectively.
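
A minimal sketch of the normalised word-frequency weight in Eq. (2) is shown below; the term counts and the values of N and nt are illustrative placeholders.

```python
# A minimal sketch of the normalised word-frequency weight in Eq. (2); the toy
# counts (N training texts, n_t texts containing t) are illustrative only.
import math

def tf_weights(doc_term_counts, N, n_t):
    # Raw weight: tf(t, d) * lg(N / n_t + 0.01), then L2-normalised over the text.
    raw = {t: tf * math.log10(N / n_t[t] + 0.01)
           for t, tf in doc_term_counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

doc = {"matrix": 4, "vector": 2, "the": 9}
weights = tf_weights(doc, N=100, n_t={"matrix": 12, "vector": 30, "the": 100})
print(weights)
```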

3.4 Building an accurate classifier of resources using the SVM algorithm

Based on the fundamental SVM illustration in Fig. 1, an optimised accurate classifier can be built through the following steps. Assume that the set of training texts of online educational resources provides the initial input (xn, yn); the dividing hyperplane H can then be expressed by the linear equation in Eq. (3), illustrated in two-dimensional space in Fig. 1.

$${u}^{T}x+b=0$$
(3)

In Eq. (3), u and b are the normal vector and the offset of the linear equation respectively. The samples on L1 and L2 in Fig. 1 are the support vectors, and \(\frac{2}{\Vert u\Vert }\) is the classification margin. The margin-maximisation problem of online educational resource classification can then be converted into the corresponding dual problem: subject to the constraints, the task is to find the maximum value of the dual function.

$$\left\{\begin{array}{l}\sum\limits_{i=1}^{n}{\alpha }_{i}{y}_{i}=0,\quad {\alpha }_{i}\ge 0\\ L\left(\alpha \right)=\sum\limits_{i=1}^{n}{\alpha }_{i}-\frac{1}{2}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}{\alpha }_{i}{\alpha }_{j}{y}_{i}{y}_{j}{x}_{i}^{T}{x}_{j}\end{array}\right.$$
(4)

The vector α in Eq. (4) contains the Lagrange multipliers corresponding to the samples in the training data. Once α is solved, the values of the parameters u and b of the optimal hyperplane function can be determined. Introducing the SVM kernel function K(xi, xj), the function of an accurate classifier for non-linear resources can be developed as:

$$f\left(x\right)=\sum\limits_{i=1}^{n}{\alpha }_{i}{y}_{i}K\left({x}_{i}, x\right)+b$$
(5)
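
For illustration, the sketch below trains a kernel SVM with scikit-learn, which solves the dual problem of Eq. (4) internally and exposes the decision function of Eq. (5); the feature vectors, labels and RBF kernel choice are assumptions rather than the configuration used in this research.

```python
# A hedged sketch of training the kernel classifier in Eq. (5) with scikit-learn,
# which solves the dual problem in Eq. (4) internally. Feature vectors X and
# labels y would come from the pre-processing steps above; here they are dummies.
import numpy as np
from sklearn.svm import SVC

X = np.random.default_rng(1).normal(size=(200, 50))   # placeholder feature vectors
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)            # placeholder labels

clf = SVC(kernel="rbf", C=1.0, gamma="scale")         # K(x_i, x_j) is the RBF kernel here
clf.fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors and intercept_ holds b,
# so decision_function implements f(x) = sum_i alpha_i y_i K(x_i, x) + b.
print(clf.decision_function(X[:5]))
print(clf.predict(X[:5]))
```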

Lastly, the iterative training procedure of the proposed SVM-based optimised classifier is shown in Fig. 3.

Fig. 3

The iterative training procedure of the proposed resource classifier

3.5 Implementing classification of online educational resources

This step is to take the automatically collected and pre-processed online educational resources as entries and load them in chronological order into the SVM-based classifier. According to the features of the loaded resources, with the classifier serving as the running environment, the degree of similarity between the extracted feature vector and the features of the target categories can be calculated as follows:

$$Sim\left({D}_{0}, {D}_{i}\right)=\sum\limits_{k=1}^{p}{\mu }_{ki}\times {\mu }_{kj}$$
(6)

In Eq. (6), μki and μkj are the standard feature vector of a given category D0 and the comprehensive feature vector extracted from the loaded samples, respectively. Comparing the results of Eq. (6) with the pre-set similarity threshold, any category whose calculated value exceeds the threshold can be used to label the educational resource. If more than one category meets the criterion, the one with the highest similarity is used as the final classification result (Beyene et al., 2020).
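
The labelling rule described here can be sketched as follows; the category feature vectors, the resource vector and the similarity threshold are placeholders for illustration.

```python
# A minimal sketch of the similarity-based labelling step in Eq. (6): the dot
# product between a resource's feature vector and each category's standard
# feature vector, compared with a threshold. Vectors and threshold are dummies.
import numpy as np

category_features = {            # standard feature vectors per category (placeholders)
    "algebra":  np.array([0.8, 0.1, 0.1]),
    "calculus": np.array([0.1, 0.9, 0.0]),
}
resource_vector = np.array([0.7, 0.2, 0.1])
threshold = 0.5

similarities = {c: float(np.dot(v, resource_vector)) for c, v in category_features.items()}
candidates = {c: s for c, s in similarities.items() if s > threshold}
# Keep the category with the highest similarity above the threshold, if any.
label = max(candidates, key=candidates.get) if candidates else "unclassified"
print(similarities, label)
```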

4 Results from a comparison experiment

A comparison experiment was designed to test the classification accuracy of the proposed SVM-based accurate classifier of online educational resources. The comparison methods were the resource classification method based on the fusion of multiple neural networks and the method based on deep learning (Lam et al., 2012). In order to reduce the impact of extraneous variables on the results, the online education platform and the raw samples of resources were kept identical across methods.

4.1 Configuration of the online education platform

The experiment used the online education platform of a university located in southwestern China as the environment. The platform consists of several client sides for students and teachers, one server and one database. All the resources of the online education platform are stored in the database. In the experiment, the three classification methods (the SVM-based one and the two based on the fusion of multiple neural networks and on deep learning) were implemented in code and embedded into the online education platform. Figure 4 illustrates the configuration of the classification methods in the experimental environment.

Fig. 4

Configuration of the classification methods of online education resources

The two comparison methods can be imported and configured in the same way. In order to guarantee the independence of the three classification methods, parallel running was adopted to implement the calling of and switching among the different methods.

The samples of online educational resources used in this experiment were taken from two sources: the database of the education platform of the above-mentioned university and a database of the University of California, Irvine. Several experiments were conducted on multiple data sets from these databases to obtain more accurate classification results. After statistical processing, the sample of online educational resources in this experiment amounted to 254.65 GB, all for the discipline of mathematics. The types of resources included texts, tables, images, videos, audio and so on.

After that, three indicators were set to evaluate the classification results: precision ratio, recall ratio and F measurement. Precision ratio refers to the proportion of documents assigned to a category that actually belong to it, while recall ratio refers to the proportion of documents belonging to a category that are correctly assigned to it. The precision ratio and recall ratio are denoted P and R, and their quantitative results can be expressed as:

$$\left\{\begin{array}{l}P=\frac{TP}{TP+FP}\times 100\mathrm{\%}\\ R=\frac{TP}{TP+FN}\times 100\mathrm{\%}\end{array}\right.$$
(7)

In Eq. (7), TP refers to the number of correctly categorised resources, FP is the number of resources incorrectly assigned to the category, and FN refers to the number of resources belonging to the category that were not assigned to it. The last indicator, F measurement, is used to measure the balance between precision ratio and recall ratio, and can be expressed as:

$${F}_{\beta }\left(P, R\right)=\frac{\left({\beta }^{2}+1\right)PR}{{\beta }^{2}P+R}$$
(8)

In Eq. (8), β is the adjustment parameter. In general, the higher the F measurement, the more balanced the precision ratio and recall ratio.
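
As a worked illustration of Eqs. (7) and (8), the sketch below computes P, R and the F measurement (with β = 1) from hypothetical counts; the counts are not taken from Table 1.

```python
# A worked sketch of Eqs. (7) and (8) on hypothetical counts: TP correctly
# categorised, FP incorrectly categorised, FN resources of the category missed.
def precision_recall_f(tp: int, fp: int, fn: int, beta: float = 1.0):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
    return p, r, f

p, r, f = precision_recall_f(tp=490, fp=10, fn=6)   # illustrative counts only
print(f"P = {p:.2%}, R = {r:.2%}, F1 = {f:.4f}")
```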

4.2 Result analysis of the comparison experiment

As designed, the different classification methods were imported into the experimental environment and debugged to ensure that they could run in it. The classification results of the proposed SVM-based classifier, together with those of the other two methods, were worked out following the same procedure and were then evaluated against the quantitative results of the set indicators, as shown in Table 1.

Table 1 Testing results of classification performance of online education resources

Putting the data in Table 1 into Eq. (7), we can work out the average precision ratio and recall ratio of the three classification methods, respectively: 94.01% and 96.29%; 95.51% and 97.26%; and 98.02% and 98.86%. It can be seen that the proposed SVM-based classification method achieved some improvement in both precision and recall ratios. Entering the precision and recall ratios into Eq. (8) and setting the parameter β to 1, the quantitative results of the F measurement on the different data sets can be worked out, as shown in Fig. 5.

Fig. 5

Testing results of F measurement

As can be seen from Fig. 5, the F measurement of the proposed SVM-based accurate classification method for online educational resources is higher than those of the other two methods across all data sets. In other words, the classification performance of the proposed classifier tends to be more balanced. This implies that the proposed classifier excludes the fewest relevant samples while achieving a high degree of classification accuracy.

5 Conclusions

In summary, the SVM-based classifier designed in this research achieves a slight improvement in both the precision ratio and the recall ratio of resource classification compared with the traditional methods. The classifier can reconcile the two seemingly conflicting indicators to a large extent and produce more balanced output with more accurate and inclusive classification results. That is, by using this SVM-based accurate classifier, users may be provided with more resources placed into the correct categories at a time. It is believed that easier, more decentralised and more engaging access to online resources is the trend of ICT-enhanced education (Fox, 2011). As the designed classifier improves the classification of online education resources, it will thus indirectly improve the retrieval efficiency and usability of such resources.

The limitations of this research may invite further effort on this topic. Firstly, all the samples are from mathematics only. If the algorithm is applied to disciplines distinctly different from mathematics, e.g. visual arts or literature, the testing results may differ. Likewise, online materials for cross-disciplinary subjects, such as behavioural finance and computational linguistics, are not considered in this experiment. Secondly, no major methods apart from SVM are involved in this research, while the synergy of hybrid algorithms, e.g. combining SVM with deep learning techniques, may further improve classification results. All in all, optimal classification results entail the ongoing and iterative optimisation of accurate classifiers, including this SVM-based one. To achieve this goal, implementations integrating multiple methods can be tested on materials from multiple disciplines to identify the best solution and best practice.