Assigning video submissions as part of homework has become a trend in education in recent years. Video assignments are normally used for subjective questions. Compared with assessment through text or audio, video allows evaluation of expressions, movements, and intonation, which reflect more directly how well students understand and flexibly apply knowledge. In some subjects, such as nursing and clinical medicine, video assessment has distinct advantages. In some pedagogical experiments, students trained with video assignments achieved higher average scores than those trained with traditional methods [1, 2].

As demand for ability-oriented assessment of students has increased, many teachers prefer to assign homework in the form of video. However, in the context of intelligent tutoring, the number of video assignments becomes very large because of the large numbers of students who attend online classes. Faced with so many video assignments, teachers find it challenging to mark them efficiently and give students timely feedback.

Teachers normally mark video assignments manually. While marking, they have to spend a great deal of time watching each video in its entirety. Marking is sometimes spread over several days, during which the standard applied may drift, resulting in inefficient marking and inconsistent standards. Therefore, an intelligent video assignment marking method is needed in the age of intelligent tutoring.

Previous studies in automatic scoring have mainly been applied to examination or homework in paper form or vocal recording. Most of these studies were designed to examine whether the answer is correct. However, it is more important to examine through video homework whether the students have mastered the knowledge and whether the required ability has been achieved. Therefore, we need to extract features of expression, movement, and voice for indirect evaluation.

Considering the above analysis, we propose a method based on a multi-channel hybrid network combining CNN and LSTM to automatically assess video homework presented with PPT. The method evaluates the performance of students through the image and audio data in the video, considering mainly expressions, actions and tones. We determine whether the student is familiar with the content he or she delivers, or whether the student is repeating mechanically through the tones and expressions of the students (for example, whether the gaze is always focused on the PPT screen or whether the expression is nervous). Additionally, we refer to studies about PowerPoint presentations [3, 4]. Ultimately, video assignments are graded as “qualified” or “unqualified”.

The proposed method is deployed in the cloud, where it can be used by multiple teaching units. These units are isolated by permissions to enforce secure access control over data.

This study offers three contributions as follows:

  1. An intelligent approach is proposed to assess video assignments, which is a new topic. The proposed approach can improve marking efficiency and help promote video assignments. Experiments on practical data samples demonstrate that the method is feasible and potentially valuable.

  2. The proposed automated assessment network simultaneously analyses the images and audio in a video and then combines the information for evaluation, which has rarely been addressed in previous automated grading studies in education. Previous studies mainly use text, audio, images, or text-audio and text-image combinations as input; few consider integrating images and audio. In the experiments, this network was more accurate at analysing video homework than networks using only images or only audio as input.

  3. This study presents CIES, a cloud computing solution that puts the model into use. Processing a large number of videos is difficult for personal computers, so we adopt a cloud-based system. This makes the proposed model practical and makes its promotion and application more convenient and cheaper.

Related work on automated grading is reviewed in the Related work section, and the details of the proposed approach are introduced in the Methodology section. The experimental results are presented in the Experiment section. Finally, we draw conclusions and discuss future research directions in the Conclusion section.

Related work

With the rapid development of computer science, applications of intelligent evaluation techniques have advanced greatly in recent years. In the education field, these applications can be categorised as text-based, audio-based, and image-based. Among text-based applications, the representative ones grade subjective questions, and various models have been proposed for different applications. English writing questions were the earliest application: Project Essay Grade (PEG) was proposed in the 1960s, and another automated scoring system for English writing, E-rater, was put into practice on the GMAT examination in 1999 [5]. Recently, research on subjective questions in Chinese has also made progress. Liu et al. described a method of sentence similarity measurement based on simple word matching [6]. Subsequently, methods based on corpus-based similarity calculation were developed. Additionally, methods aimed at other types of subjective questions, such as automatic scoring of computer programming questions [7], have been proposed. In recent years, subjective evaluation methods using machine learning techniques, such as [8, 9], have also emerged.

Regarding the automatic evaluation of audio, the main applications are spoken English assessment and Mandarin assessment. For spoken English assessment, many approaches have been proposed for assessing various indexes. Liang et al. proposed an algorithm to evaluate pronunciation accuracy and fluency [10]. Huang et al. proposed an approach to inspect fluency and rhythm [11]. SpeechRater, a system that can comprehensively evaluate fluency, pronunciation, rhythm, vocabulary diversity, grammatical accuracy and complexity, was introduced in [12]. SpeechRater has recently been applied to automatically assess the speaking section of the TOEFL exam. For Putonghua assessment, Zhang et al. proposed a method to detect tone errors [13]. A comprehensive method to inspect phonemes, vowels, tones and rhotic accents has also been proposed [14]. China’s Putonghua proficiency test has recently adopted automatic scoring technology.

For intelligent scoring based on images, automatic scoring of pencil-filled answer sheets has been promoted. Regarding the scoring of answer sheets that are not pencil-filled but handwritten, Deng et al. proposed an approach to automatically recognize handwritten characters [15]. Xu-Yao Zhang et al. proposed a more accurate Chinese character handwriting recognition model by combining conventional methods and a deep convolutional neural network [16]. The “All Discipline Machine Scoring Technology” system launched by iFLYTEK CO. LTD. can automatically give relatively accurate scores from scanned test paper images.

Regarding automatic evaluation based on both images and audio, Luo et al. proposed a classroom teaching evaluation system that can evaluate classroom teaching and learning conditions [17]. This research was novel because it integrated images and audio to make evaluations. However, images and audio were used independently for different evaluations: images for students’ learning conditions and audio for teachers’ teaching conditions. Hence, this method cannot be applied to video evaluation. As for other evaluation methods, Wei et al. proposed an approach that recognizes students’ actions and then evaluates their learning conditions [18]. These studies can serve as references, but a proper approach for evaluating video assignments is still needed.

In conclusion, automatic evaluation systems for certain aspects of text, audio or images have been developed. Some systems combine text and audio (such as SpeechRater) or text and images (such as the system launched by iFLYTEK). A few studies combine audio and images to make automatic evaluations (such as [17]), but these methods cannot be directly used to evaluate video. Hence, an approach that simultaneously analyses images and audio to assess video is required.

We also note that many novel video processing algorithms have been proposed recently. In the domain of video segmentation, a new model based on a generative adversarial network has been shown to be more accurate [19]. Ke et al. proposed the Frame Segmentation Network, which improves mAP (mean average precision) and IoU (intersection over union) simultaneously [20]. As for action recognition, the S-TPNet proposed in [21] performs well: its accuracy is around 74% on the HMDB51 dataset and around 95% on the UCF101 dataset. However, although existing models can process video efficiently, they are not appropriate for the automatic evaluation of video homework. Therefore, proper and efficient algorithms are required.


Methodology

CIES is composed of the CIES platform and the CNN-LSTM network. Homework is first uploaded to the cloud. It is then processed and distributed to the CNN-LSTM network by the CIES platform. Finally, the CNN-LSTM network performs the automatic evaluation.

CIES platform

The CIES platform is built to overcome the problem that video processing requires high-performance hardware, and thereby to make the proposed model practical. CIES is a multi-tenant management system with independent services and unified control. The system includes user authentication, authorization, quota management, and resource control. User authentication is realized with the Kerberos protocol and LDAP, which ensure that only identified users can access the system. Authorization ensures that only users who have been granted permission can access the assessment system. Quota management and resource control limit the frequency of access. Together, these three parts ensure the security and availability of the system. The structure of the CIES platform is shown in Fig. 1.

Fig. 1 Structure of CIES platform

CIES reduces users’ hardware procurement, operation and maintenance costs because the cloud servers are built by cloud operators. Furthermore, the cloud computing architecture can flexibly utilize resources such as cloud storage and parallel computing. This architecture improves the operational efficiency of the proposed approach.

With the parallel computing framework of CIES, each video is divided into image and audio parts; the two parts are separately partitioned into data blocks and then pre-processed. The pre-processing of each data block corresponds to one computing task, which is automatically performed on the cluster nodes. Tasks are assigned to nodes and executed, and the results are then collected. The collected pre-processed results are fed into the CNN. Complex details such as distributed data storage, data communication, and fault-tolerant processing are handled by CIES, which improves processing speed and robustness.
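The block-per-task scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: a local thread pool stands in for the cluster nodes, and the names `split_into_blocks`, `preprocess_block` and `preprocess_in_parallel`, as well as the scaling step used as the pre-processing task, are hypothetical and not part of CIES.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(samples, block_size):
    """Partition a sequence of samples into consecutive data blocks."""
    return [samples[i:i + block_size] for i in range(0, len(samples), block_size)]


def preprocess_block(block):
    """Placeholder pre-processing task: scale 8-bit values to [0, 1]."""
    return [value / 255.0 for value in block]


def preprocess_in_parallel(samples, block_size=4, workers=2):
    """Run one pre-processing task per block and collect the results in order."""
    blocks = split_into_blocks(samples, block_size)
    # The thread pool is a stand-in for cluster nodes (an assumption).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so blocks stay aligned.
        return list(pool.map(preprocess_block, blocks))
```

Because each block is processed independently, the same structure carries over to a real cluster scheduler, where fault tolerance and data placement would be handled by the platform rather than by this code.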

Overview of CNN-LSTM network

Long short-term memory (LSTM) is a special kind of recurrent neural network (RNN) that mitigates the vanishing gradient problem of RNNs. For time-correlated samples, the LSTM network captures the sequence characteristics of the samples while the CNN extracts features, improving the system’s accuracy [22]. At present, the CNN-LSTM hybrid network is widely used in video analysis. To extract and classify features (such as expressions, actions, tones, and speech rates) in the video simultaneously, we adopt a multi-channel CNN-LSTM network.

Referring to the methods for processing images proposed in [22, 23, 24], images are first converted to a suitable form (discussed in part A of the Experiment section). We obtain the local spatial features of the video sequences through the CNN’s sliding windows and weight sharing and then use them as inputs to the LSTM layer. The LSTM captures the temporal characteristics of the video data. The two are thus combined to make full use of their respective advantages.

For the audio signals in the videos, we apply the same CNN-LSTM network. Audio and images are input simultaneously to two different channels. Audio and image processing differ only in the pre-processing step (discussed in part A of the Experiment section); the subsequent processing is the same as for images (Fig. 2).

Fig. 2 The architecture of the multi-channel CNN-LSTM network

Convolutional neural network

We adopt the deep residual network ResNet-50 as the convolutional neural network. This multi-channel CNN architecture has 2 separate input channels for audio and images, and they share the same parameter settings. All the strides for the convolutional layers are 2 [25].

In Fig. 3, CONV denotes a convolution operation. Batch Norm denotes batch normalization. ReLU denotes the ReLU activation function.

Fig. 3 Structure diagram of ResNet-50

MAX POOL denotes maximum pooling:

$$ {R}_n=\max \left({Y}_{ij}\right) $$

Rn represents the feature matrix of the nth image in the image sequence after the convolution and pooling operations. These operations are applied to each image in the sequence separately, and the feature matrices of the frames in the sequence can be written as R = (R1, R2, …, Rn).
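As a small illustration of the pooling operation, the sketch below takes the maximum over each window of a 2-D feature map. It reads Y_ij as the activations inside one pooling window, which is our interpretation since the text does not define Y_ij; the 2 × 2 window, the stride of 2 and the function name `max_pool_2d` are assumptions chosen for the example.

```python
def max_pool_2d(feature_map, size=2, stride=2):
    """Max pooling over a 2-D feature map given as a list of rows."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, rows - size + 1, stride):
        pooled_row = []
        for left in range(0, cols - size + 1, stride):
            # Collect the values inside the current pooling window.
            window = [feature_map[top + i][left + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled
```

For a 4 × 4 input with these settings, each output entry is the largest value in one non-overlapping 2 × 2 quadrant.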

Avg POOL denotes mean pooling. Flatten denotes the flattened layer. ID BLOCK denotes a residual block that does not change dimensions, called the identity block, and CONV BLOCK denotes the residual block that adds dimension. Each residual block includes 3 convolution layers (Fig. 4).

Fig. 4 Structure of residual block, identity block (left), and CONV block (right)
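For reference, the two block types can be written compactly in the usual ResNet notation. The symbols x (block input), y (block output), F (the stack of three convolution layers in the block) and W_s (the projection applied on the shortcut) are introduced here only for illustration and do not appear elsewhere in this paper:

$$ y=\mathrm{ReLU}\left(F(x)+x\right)\kern1em \left(\mathrm{identity}\ \mathrm{block}\right) $$
$$ y=\mathrm{ReLU}\left(F(x)+{W}_sx\right)\kern1em \left(\mathrm{CONV}\ \mathrm{block}\right) $$

In the identity block the shortcut adds the input unchanged, so input and output dimensions must match; in the CONV block the shortcut projection W_s maps the input to the new dimensions before the addition.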

Long short-term memory

The entire system contains one LSTM, which follows the CNN. We adopt the LSTM structure given in [22]. An output R of the CNN’s pooling layer corresponds to the LSTM input at time t, and the result of each recursive operation integrates all previous features with the current features. At time t, the components of the LSTM unit are updated as follows:

$$ {i}_t=\sigma \left({W}_{ri}{R}_t+{U}_{hi}{h}_{t-1}+{b}_i\right) $$
$$ {f}_t=\sigma \left({W}_{rf}{R}_t+{U}_{hf}{h}_{t-1}+{b}_f\right) $$
$$ \tilde{c}_{t}=\tanh \left({W}_{rc}{R}_t+{U}_{hc}{h}_{t-1}+{b}_c\right) $$
$$ {c}_t={f}_t\odot {c}_{t-1}+{i}_t\odot \tilde{c}_{t} $$
$$ {o}_t=\sigma \left({W}_{ro}{R}_t+{U}_{ho}{h}_{t-1}+{b}_o\right) $$
$$ {h}_t={o}_t\odot \tanh \left({c}_t\right) $$

where σ denotes the sigmoid activation function and ⊙ denotes element-wise multiplication; Rt denotes the feature matrix input at time t; Wri, Wrf, Wrc, and Wro denote the weight matrices from the input layer to the input gate, forget gate, memory cell, and output gate, respectively; Uhi, Uhf, Uhc, and Uho denote the weight matrices from the hidden layer to the input gate, forget gate, memory cell, and output gate, respectively; and bi, bf, bc, and bo denote the biases of the input gate, forget gate, memory cell, and output gate, respectively.
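A single LSTM update can be sketched in plain Python to make the data flow concrete. This is a minimal sketch, not the implementation used in the experiments: vectors are Python lists, the CNN feature R_t is assumed to be flattened into a vector, and the `params` dictionary (mapping the gate names "i", "f", "c", "o" to their W, U and b, with the memory cell using its own weights Wrc) is a hypothetical container.

```python
import math


def sigmoid(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, R_t, U, h_prev, b):
    """Compute W R_t + U h_{t-1} + b for list-based matrices and vectors."""
    return [
        sum(w * r for w, r in zip(W_row, R_t))
        + sum(u * h for u, h in zip(U_row, h_prev))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]


def lstm_step(R_t, h_prev, c_prev, params):
    """One LSTM update; params maps a gate name to its (W, U, b) triple."""
    def gate(name, activation):
        W, U, b = params[name]
        return [activation(z) for z in affine(W, R_t, U, h_prev, b)]

    i_t = gate("i", sigmoid)        # input gate
    f_t = gate("f", sigmoid)        # forget gate
    c_tilde = gate("c", math.tanh)  # candidate memory cell
    o_t = gate("o", sigmoid)        # output gate
    # Element-wise mix of the old cell state and the new candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

Calling `lstm_step` once per frame, and feeding the returned `h_t` and `c_t` back in as `h_prev` and `c_prev`, reproduces the recursion in which each output integrates the current features with all previous ones.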

Loss function

Because the task is a two-category classification, we adopt logistic regression in this network.

The loss function is defined as

$$ L\left(\hat{y},y\right)=-y\log \left(\hat{y}\right)-\left(1-y\right)\log \left(1-\hat{y}\right) $$

where y denotes the true classification of the sample and \( \hat{y} \) denotes the model recognition result.
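The loss for one sample can be evaluated as in the sketch below. The function name `binary_cross_entropy` and the clipping constant `eps` are assumptions added for numerical safety; they are not part of the definition above.

```python
import math


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Loss for one sample: y in {0, 1}, y_hat = predicted probability of class 1."""
    # Clip the prediction so that log(0) can never occur (an added safeguard).
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -y * math.log(y_hat) - (1 - y) * math.log(1.0 - y_hat)
```

A confident correct prediction gives a loss close to zero, while a confident wrong prediction is penalized heavily, which is the behaviour wanted for a qualified/unqualified decision.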


We adopt the Adam optimizer in this network, and the implementation process is as follows:

Compute the gradient of the loss L with respect to the parameters θ at moment t:

$$ {g}_t={\nabla}_{\theta }L\left({\theta}_{t-1}\right) $$

Update biased first moment estimate:

$$ {m}_t={\beta}_1{m}_{t-1}+\left(1-{\beta}_1\right){g}_t $$

Update biased second raw moment estimate:

$$ {v}_t={\beta}_2{v}_{t-1}+\left(1-{\beta}_2\right){g}_t^2 $$

Compute bias-corrected first moment estimate:

$$ \hat{m_t}={m}_t/\left(1-{\beta}_1^t\right) $$

Compute bias-corrected second raw moment estimate:

$$ \hat{v_t}={v}_t/\left(1-{\beta}_2^t\right) $$

Update parameters:

$$ {\theta}_t={\theta}_{t-1}-\alpha \ast \hat{m_t}/\left(\sqrt{\hat{v_t}}+\varepsilon \right) $$
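The update steps can be collected into one function, following the standard Adam algorithm in which the first moment is bias-corrected with β1 and the second with β2. This is a sketch under assumptions: the default α = 0.01 is the learning rate reported in the experiments, whereas β1 = 0.9, β2 = 0.999 and ε = 1e-8 are the commonly used defaults and are not stated in this paper; the function name `adam_step` is hypothetical.

```python
import math


def adam_step(theta, grad, m, v, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of parameters; t is the 1-based step count."""
    # Biased first and second raw moment estimates.
    m = [beta1 * m_k + (1 - beta1) * g for m_k, g in zip(m, grad)]
    v = [beta2 * v_k + (1 - beta2) * g * g for v_k, g in zip(v, grad)]
    # Bias-corrected estimates.
    m_hat = [m_k / (1 - beta1 ** t) for m_k in m]
    v_hat = [v_k / (1 - beta2 ** t) for v_k in v]
    # Parameter update.
    theta = [th - alpha * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v
```

On the first step the bias correction makes the update magnitude approximately α regardless of the gradient scale, which is one reason the learning rate is easy to interpret with this optimizer.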


Experiment

Because no suitable dataset exists, we constructed our own. The raw data are video assignments submitted by students in the management course offered by Shandong University of Finance and Economics during the 2nd semester of 2018/2019. Each video was assessed independently by 3 experienced professors, and only videos that received the same assessment from all three were retained. These assignments account for 50% of students’ final results. The dataset contains 61 video assignments (38 qualified and 23 unqualified) submitted by 48 different students.
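The labelling rule can be expressed as a simple filter. The sketch assumes that "the same assessments" means unanimous agreement among the three assessors; the function name `keep_unanimous` and the dictionary layout are hypothetical.

```python
def keep_unanimous(assessments):
    """Keep only videos whose assessments all agree; return video -> label."""
    return {
        video: labels[0]
        for video, labels in assessments.items()
        if len(set(labels)) == 1  # every assessor gave the same label
    }
```

Here 1 stands for qualified and 0 for unqualified; a video with a split assessment such as [1, 0, 1] is dropped rather than resolved by majority vote.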


To increase the number of samples and standardize the data, pre-processing is required. The students’ homework contains videos of different lengths; the video length statistics are shown in Table 1. The videos are first split into segments of equal length, and each segment inherits the category of its original video. After this process, the training set contains 9898 samples and the test set contains 122 samples. The test samples comprise 76 qualified segments and 46 unqualified segments.
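The segmentation step amounts to cutting each video into fixed-length pieces that carry the parent label. The sketch below is illustrative only: the segment length is a free parameter because it is not stated here, dropping a trailing remainder shorter than one segment is an assumption, and `segment_video` is a hypothetical name.

```python
def segment_video(duration_s, segment_s, label):
    """Split [0, duration_s) into equal-length segments that inherit the label.

    A trailing remainder shorter than segment_s is dropped (an assumption).
    """
    count = int(duration_s // segment_s)
    return [(k * segment_s, (k + 1) * segment_s, label) for k in range(count)]
```

Each tuple holds the start time, end time and label of one segment, so a single long video contributes several training samples of the same class.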

Table 1 Video duration information

Images are also pre-processed. The first step of image pre-processing is frame extraction: one frame is extracted at intervals of 3 frames. The assessment is based on features such as students’ actions and expressions. However, the PPT occupies a large part of the entire image, as shown in Fig. 5, so a position-sensitive region segmentation method is applied to the frame images to extract the important part, which is then scaled to a fixed size of 224 × 224 pixels. In total, 215,730 images are extracted from qualified videos and 85,230 from unqualified videos.

Fig. 5 Image pre-processing
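The image pipeline (frame sampling, cropping the region of interest, scaling to a fixed size) can be sketched as follows. Several points are assumptions: "intervals of 3 frames" is read as keeping every third frame, the crop box is passed in by hand rather than found by the position-sensitive region segmentation method, nearest-neighbour interpolation is used for scaling, and all function names are hypothetical.

```python
def sample_frame_indices(total_frames, step=3):
    """Indices of the frames kept when one frame is taken every `step` frames."""
    return list(range(0, total_frames, step))


def crop(frame, top, left, height, width):
    """Cut a rectangular region out of a frame stored as a list of pixel rows."""
    return [row[left:left + width] for row in frame[top:top + height]]


def resize_nearest(image, out_h=224, out_w=224):
    """Scale an image to out_h x out_w with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [[image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]
```

In practice the crop box would come from the region segmentation step, so that the student rather than the slide fills the 224 × 224 input.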

To pre-process the audio signals, each video is first divided into segments at a time interval of 1 s. The audio signals are then extracted from the video segments. Finally, the audio signals are subjected to spectrum analysis to obtain spectrograms, as shown in Fig. 6. Each spectrogram is also scaled to a fixed size of 432 × 288 pixels. In total, 13,500 spectrograms are extracted from qualified videos and 5850 from unqualified videos.

Fig. 6 Spectrogram of an audio signal
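A spectrogram is obtained by computing the magnitude spectrum of short, overlapping frames of the signal. The sketch below uses a direct DFT with a Hann window so that it needs only the standard library; the frame length, hop size and window choice are assumptions, since the spectrum analysis settings are not given here, and the rendering of the result as a 432 × 288 image is omitted.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the first N/2 + 1 DFT bins of a Hann-windowed frame."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * k / n))
                for k, x in enumerate(frame)]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * b * k / n)
                for k, x in enumerate(windowed)))
        for b in range(n // 2 + 1)
    ]


def spectrogram(signal, frame_len=64, hop=32):
    """Short-time spectrum: one magnitude column per overlapping frame."""
    return [magnitude_spectrum(signal[s:s + frame_len])
            for s in range(0, len(signal) - frame_len + 1, hop)]
```

The direct DFT is quadratic in the frame length and is meant only to show the computation; a production pipeline would use an FFT routine from a numerical library.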

Training and evaluation

We adopt logistic regression to make a two-category (qualified or unqualified) identification, where 1 denotes qualified, and 0 denotes unqualified. The loss function is defined as

$$ L\left(\hat{y},y\right)=-y\log \left(\hat{y}\right)-\left(1-y\right)\log \left(1-\hat{y}\right) $$

where y denotes the true classification of the sample and \( \hat{y} \) denotes the model recognition result.

The Adam optimizer is adopted in this network, where the learning rate is 0.01.

To evaluate the benefit of using both audio and images as input, we compared the proposed approach with approaches that take only audio or only images as input. The accuracies are shown in Table 2. As shown in Table 2, the proposed approach, which integrates images and audio, is feasible and more accurate.

Table 2 Accuracy of different inputs

To verify the advantages of the CNN-LSTM hybrid architecture, we compared the CNN-LSTM network with some typical CNN networks. In these experiments, all networks used the same dataset, which contained both images and audio. The accuracies are shown in Table 3. According to Table 3, the CNN-LSTM network has the best performance.

Table 3 Accuracy of different networks

According to the experiments, the model can preliminarily distinguish qualified from unqualified videos. The accuracy remains relatively steady as the training set and test set vary. However, the accuracy is not entirely satisfactory, because some videos are too dark for expressions to be recognized or have unclear voices owing to noise, discontinuities, etc. To further improve accuracy, another algorithm is required to detect and process defective videos automatically.


Conclusion

With the promotion of video homework, an efficient and accurate approach to marking these videos is in demand. However, most existing intelligent grading studies focus on text, audio or images, and few methods can be applied to video homework. Hence, we propose an approach using a multi-channel CNN-LSTM network to assess video homework intelligently. A novel method of integrating image features and audio features for the intelligent evaluation of homework is also presented. Experiments demonstrate that this approach can preliminarily classify video homework as qualified or unqualified. In addition, the CIES platform improves computing efficiency and makes the model more convenient to use.

However, it should also be noted that the accuracy of this approach is not yet fully satisfactory. The model introduced in [26, 27] performs automatic text-based evaluation of English essays and reaches high accuracy: in most cases, the scores given by humans and by the model differ by no more than 0.25 points. Similarly, the method introduced in [28] is an automated audio-based speech scoring system; the correlation coefficient between human grading and this method is reported to be 0.97. As for the grading of programs, the method proposed in [29] has an accuracy of 94.48%. Compared with these achievements in automatic grading for other forms of homework, we propose only a preliminary model for video homework, and its accuracy can still be improved.

Additionally, the proposed model performs only two-category classification. A model that can assign more specific grades remains future work.