A privacy-preserving student status monitoring system

Timely feedback on students’ listening status is crucial for teaching. However, it is often difficult for teachers to pay attention to all students at the same time. Surveillance cameras in the classroom can assist with this task. However, existing methods either fail to protect students’ privacy or sacrifice recognition accuracy in order to avoid leaking it. We propose a federated semi-supervised class assistance system to evaluate the listening status of students in the classroom. Rather than training the semi-supervised model in a centralized manner, we train it in a federated manner among various monitors while preserving students’ privacy. We also formulate a new loss function, based on the difference between the pre-trained initial model and the expected model, to constrain training on the unlabeled data. By applying a pseudo-label assignment method to the unlabeled data, the class monitors are able to recognize students’ classroom behavior. Simulation and real-world experimental results demonstrate that the proposed system outperforms the baseline models.


Introduction
Computers increasingly serve as auxiliary systems in many settings, and some of these systems rely on recording equipment such as cameras [1][2][3]. Ubiquitous cameras can access or record large amounts of data, much of which is private in nature. In traditional teaching, feedback on the results of a teacher's work comes only from centralized assessments organized by the school. This takes a long time (one month to half a year), and such delayed feedback can seriously affect students. It is therefore necessary to establish a complete aid system that helps teachers carry out their work [4,5]. On the one hand, the system analyzes students' learning status through the monitoring equipment in the classroom and gives feedback on it. On the other hand, the system evaluates teachers' work performance. To build a robust model, a large number of course videos must be collected as training data. A teaching aid system built from a large amount of data from multiple classrooms can then assist teaching in the classroom. However, it is illegal to collect and store data that contain students' faces [6]. To protect personal privacy, face data should not be stored or transmitted without the owner's consent. Meanwhile, labeling is an important part of data collection: models with a high recognition rate rely heavily on a large amount of labeled data. This means that students' face data need to be labeled and used in model training without being observed, stored, or transmitted. As a result, protecting personal privacy without compromising accuracy is an important issue in model design. Cryptographic solutions secure the data against unauthorized access by attackers.
However, they are not immediately applicable to preventing authorized agents from abusing information, which raises privacy breach concerns [7]. If the server is attacked or the back-end data leak, serious losses follow. Some researchers intentionally reduce the video resolution at collection time and try to analyze behavior from unclear videos. The advantage is that behavior recognition can be performed while protecting (face) privacy [8][9][10]. The shortcomings are also obvious. First, blurred video makes the subsequent recognition more difficult; second, the inability to use facial expression information further increases the difficulty. Some works analyze the behavior of groups, which avoids identifying individuals [11], or use volunteers to train models. In contrast, to avoid privacy violations, computers can conduct most of the data processing and model training independently, without human participation. In this way, more realistic personal behavior data can be used to train the model while protecting privacy. However, this approach requires a large amount of data and labels, and in some cases, a large number of data samples is difficult to obtain.
To solve the aforementioned problems, we investigate a semi-supervised federated learning method that allows every client to benefit collectively from the weights of a public model trained across all clients, without the need to store and label the data centrally. We independently build an initial model with the help of volunteers who are filmed and recorded in actual courses and meetings, and use prior knowledge as much as possible to help follow-up training. Transfer learning is not adopted, because our models are created with the same application background. Commonly used general-purpose initialization models are not used either, because, compared with transfer learning, models created with the same application background share exactly the same domain knowledge [12]. The consistency of the training samples avoids the risk of negative transfer caused by differences between the target and source tasks [13]. The initial model is distributed to each client through the central server, and each client trains on its local dataset, which is never uploaded to the central server, so that personal privacy cannot be attacked in the back-end or during transmission. Federated learning enables multiple clients to train a global model while providing privacy protection. Unlike classical machine learning approaches, federated learning exchanges only machine learning models between the clients and the cloud server. By communicating only the intermediate model updates, the cloud server can improve the global model without sacrificing client privacy. Although the initial model and the target model have closely related domain knowledge, samples are prone to being projected onto an incorrect class space because of the lack of labeled data at the beginning and the differences between individual samples and collection equipment. By minimizing the Maximum Mean Discrepancy (MMD) distance between samples, we are able to adjust the sample labels toward the real target domain through supervised learning.
We invite volunteers to build the classroom performance database (CPD), a dataset based on their real performance in the classroom, and compare the proposed method with several mainstream semi-supervised methods on this dataset. To verify the generalization ability of the proposed scheme, we also test it on the SAVEE database and obtain competitive results.
Our main contributions are summarized as follows: (1) We propose a novel model framework, based on federated learning, that provides immediate feedback on students' classroom conditions without revealing personal privacy. (2) Under the premise of consistency regularization, we supervise the unviewable dataset according to the continuity and invariance of the data. At the same time, a new loss function is proposed to improve the semi-supervised model according to the distribution characteristics of the data. (3) We build CPD. To our knowledge, CPD is the first dataset focusing on students' behavior in the classroom. In addition, we test the proposed approach on CPD.

Classroom facial expression recognition
Facial expression feedback from students in the classroom is very important for teaching. Depending on the threshold setting and the number of categories, AI recognition and expert recognition produce similar results in some cases [14]. Tonguç and Ozkara [15] use a facial movement coding system to analyze students' facial expressions, but they rely on a classic facial expression classification method rather than a customized one. At the same time, because of the application field, researchers must pay close attention to students' facial information, which conflicts with privacy protection. Researchers directly collect facial data for analysis, and attackers can obtain these data from a centralized learning database. Reducing the network parameters and obscuring the training data make it difficult for an attacker to recover the training data. However, the price is reduced network performance, so recognition accuracy is compromised [16]. Gao et al. [17] use non-labeled datasets to protect privacy and explore the advantages of unsupervised learning in the absence of labeled face data.
Recognition becomes more difficult as linear and non-linear natural variations increase. The authors build a classification model based on semi-supervised sparse representation. Faces are represented with two dictionaries, one composed of several faces and the other composed of noise interference. However, the authors do not consider the case in which a face has no labeled image in the training set.

Semi-supervised learning
In pattern recognition, recognition and class estimation on unlabeled datasets have received significant attention [18][19][20]. With a motivation similar to ours, Gao et al. [17] discuss the advantages of semi-supervised learning when few labeled face data are available. Recognition becomes harder when data are corrupted by linear and non-linear natural variations. They establish a model called semi-supervised sparse representation-based classification, in which faces are described by two dictionaries: a gallery dictionary, which includes several face data, and a variation dictionary, which captures linear variations. However, they also do not consider subjects without labels in the training data. Nie et al. [21] propose a model that performs clustering, semi-supervised classification, and local structure learning simultaneously for unreliable and inaccurate examples that contain noise and outlying entries. The generated optimal feature map can be divided into specific clusters. The model sets the optimal weight for each image automatically, without the weights and penalty function having to be set again. Using image sequences as data can alleviate the problem of unbalanced image weight configuration to a certain extent [22].
Miyato et al. [23] propose a regularization method based on virtual adversarial loss. A new measure is used to make the labels of the inputs evenly and smoothly distributed in the class space. The virtual adversarial loss is defined as the robustness of the conditional label distribution around each input data point against local perturbation. The computational cost of virtual adversarial training is relatively small, and the algorithm can be further enhanced using the principle of entropy minimization. The method of Miyato et al. [23] improves the robustness and generalization ability of the model, but it does not consider the correlation between samples and does not refine the virtual labels and their distribution according to the relationships between samples.

Federated learning
Federated learning was first proposed by Brendan McMahan et al. of Google in [24]. To make up for the shortcomings of centralized learning (such as the need to centralize private data), they proposed a new model in which the data participate in model training on the local user's device. As a machine learning framework, federated learning can effectively help multiple institutions use data and build machine learning models while meeting the requirements of user privacy protection, data security, and related laws and regulations [6]. As a distributed machine learning paradigm, federated learning can effectively solve the problem of data islands: participants model jointly without sharing data, which technically breaks data islands and enables AI collaboration.
Wang et al. [25] consider the problem of learning model parameters from data distributed across multiple edge nodes, where a generic class of machine learning models is trained using gradient-descent-based approaches. Using the convergence bound of distributed gradient descent, they determine, from a theoretical point of view, the best trade-off between local updates and global parameter aggregation to minimize the loss function under a given resource budget. The distributed structure of federated learning can effectively protect personal privacy, but the framework faces two further challenges: the training data on the clients may be insufficient or unevenly distributed, and training incurs a significant communication overhead. The federated learning frameworks proposed in many solutions have limitations in practical applications, as most of them compress the upstream communication from the client to the server but not the downstream communication, and some perform well only on independent and identically distributed datasets. Sattler et al. suggest solving this problem with sparse ternary compression (STC), a compression framework specifically designed to meet the requirements of the federated learning environment [26]. STC extends the existing compression technique of top-k gradient sparsification to enable downstream compression as well as ternarization and optimal Golomb encoding of the weight updates. These improvements are aimed at processing the more manageable data inside a device and are usually applied to portable devices such as mobile phones. They are not suitable for a federated learning framework whose application object is very large data, such as videos recorded over a long time.
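As a rough illustration of the sparsification and ternarization steps that STC builds on, the following Python sketch keeps the k largest-magnitude entries of a flattened weight update and replaces them with a shared magnitude and their signs. The function name `sparse_ternarize`, the flat-list representation, and the tie-breaking order are assumptions made for illustration; the Golomb encoding and the error accumulation described in [26] are omitted.

```python
def sparse_ternarize(update, k):
    """Keep the k largest-magnitude entries and ternarize them.

    Returns a list whose entries are -mu, 0.0 or +mu, where mu is the
    mean absolute value of the kept entries.
    """
    if not 0 < k <= len(update):
        raise ValueError("k must be between 1 and len(update)")
    # Indices sorted by decreasing absolute value; ties keep index order.
    order = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    kept = order[:k]
    mu = sum(abs(update[i]) for i in kept) / k
    out = [0.0] * len(update)
    for i in kept:
        if update[i] > 0:
            out[i] = mu
        elif update[i] < 0:
            out[i] = -mu
    return out


# Toy update vector (hypothetical values).
compressed = sparse_ternarize([0.5, -0.1, 0.02, -0.7, 0.0, 0.3], k=2)
```

Only the positions and signs of the kept entries plus the single value `mu` need to be transmitted, which is what makes the representation cheap to encode.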

Proposed framework
In this section, we introduce the proposed semi-supervised federated learning framework in detail.

Pre-trained model with semi-supervised learning
We adopt a popular method to build the initial model. Since labeled data are no longer used when training the model published on the client, establishing the pre-trained model is very important. Although the weight $W_I$ of the pre-trained model cannot be directly shared, training on a similar background provides prior knowledge for the final model and the target weight $W_F$. Choosing an appropriate $W_I$ as the initial weight of the client can effectively help the training of the published model (in some special cases, for example when the number of volunteers is higher than the number of testers, the model published by the server can be used directly). The model must generalize to the task and must not over-fit the volunteer dataset. This means that the resulting pre-trained model must meet these two requirements at the same time. Given the teaching-environment application, the 3D ResNet network [27] is used as the initial model for training. The network is formed by stacking several bottleneck blocks. Figure 1 shows a bottleneck block, where $S^3$ is the size of the convolution kernel, F is the feature dimension, and BN is batch normalization.

Updating of classroom data to model
The server sends the pre-trained model to each client. Each client is trained individually on its own local data. At this stage, the data used by the client are natural and private, and we cannot process or label them. Therefore, the released model is trained independently in a semi-supervised manner. The training process is shown in Fig. 2. The training data are divided into labeled and unlabeled data by the entropy gate (depending on the confidence of the model output). Both kinds of data pass through the network to obtain output confidences, and some of the data are filtered through the entropy gate again to participate in the next round of training.
Consistency regularization is widely studied in semi-supervised learning [28,29]. The main idea is that, for any input to the network, the output (prediction) should remain consistent under a slight perturbation. An L2 penalty term is therefore added to the loss function; its mathematical description is given in Eq. (1), where y is the prediction result, $R_i(\cdot)$ is the ith transformation of the picture, $x_u$ is the unlabeled input, $\theta$ is the model parameter, and $P_{\text{model}}$ is the probability distribution of the model. This loss term is added to the loss function to constrain the training process of the model.
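A form of this penalty that is consistent with the symbols defined above, assuming the commonly used squared-L2 formulation, is sketched below. Comparing two differently perturbed versions of the same input (indices i and j) is an assumption of this sketch rather than a detail confirmed by the text.

```latex
\left\| P_{\text{model}}\big(y \mid R_i(x_u);\, \theta\big)
      - P_{\text{model}}\big(y \mid R_j(x_u);\, \theta\big) \right\|_2^2,
\qquad i \neq j
```

The term is zero when the two predictions agree and grows as the predicted distributions drift apart, which is what penalizes inconsistent outputs.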
The input used in this experiment is a sequence of human images. Since a person's mental state and physical actions cannot change rapidly within a short period of time, picture sequences taken at similar time intervals, especially sequences that contain multiple identical pictures, should produce similar prediction results. The more pictures the sequences share within the same period, the more similar the outputs remain. Therefore, Eq. (1) is changed to Eq. (2), where $x_t$ is the input data at time t; $\Delta t$ is a small neighborhood around t, which can be 0; and $R(\cdot)$ is the input picture sequence at different times. In the experiment, the pictures are captured at 24 FPS and then sampled at a fixed interval (1:4) to form the input samples. That is to say, at least four sets of input samples can be formed at the same time. These four sets of samples carry true, lossless original information and identical labels. This method expands the sample set fourfold without introducing any interference or false information. The data collected on the client are divided into two parts according to the confidence of the output results. Since the model on the client is pre-trained, we have reason to believe that some of the high-confidence outputs are correct. When the entropy of the model's prediction for a piece of data is low enough, an output with a confidence greater than $\lambda_1$ is regarded as the true label of the data. If the output confidence of a piece of data is greater than $\lambda_2$ and less than $\lambda_1$, the data are regarded as unlabeled, and the pseudo-label method is used to let them participate in training. The entropy gate is controlled by $\lambda_1$ and $\lambda_2$. The process of assigning labels to data is shown in Eq. (3), where K is the expansion multiple of the data and $\Delta t$ is the selected time neighborhood. In this experiment, the neighborhood of a piece of data is defined to be within half the duration of the current data.
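The entropy gate described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `entropy_gate`, the use of the maximum class probability as the confidence, the decision to set aside samples at or below $\lambda_2$, and the threshold values in the usage lines are all hypothetical.

```python
def entropy_gate(probabilities, lambda_1, lambda_2):
    """Split model outputs into labeled, unlabeled and set-aside samples.

    probabilities: list of per-sample class-probability lists.
    lambda_1: confidence above which the prediction is taken as the label.
    lambda_2: confidence above which the sample is kept as unlabeled data.
    Returns (labeled, unlabeled, set_aside); labeled holds (index, class)
    pairs, the other two lists hold sample indices.
    """
    if not 0.0 <= lambda_2 < lambda_1 <= 1.0:
        raise ValueError("expected 0 <= lambda_2 < lambda_1 <= 1")
    labeled, unlabeled, set_aside = [], [], []
    for index, probs in enumerate(probabilities):
        confidence = max(probs)  # assumed confidence measure
        if confidence > lambda_1:
            labeled.append((index, probs.index(confidence)))
        elif confidence > lambda_2:
            unlabeled.append(index)
        else:
            set_aside.append(index)  # assumption: not used in this round
    return labeled, unlabeled, set_aside


# Hypothetical outputs of a four-class model for three clips.
outputs = [
    [0.95, 0.03, 0.01, 0.01],
    [0.70, 0.20, 0.05, 0.05],
    [0.30, 0.30, 0.20, 0.20],
]
labeled, unlabeled, set_aside = entropy_gate(outputs, lambda_1=0.9, lambda_2=0.65)
```

With these illustrative thresholds, the first clip is treated as labeled with class 0, the second is kept for pseudo-labeling, and the third is set aside.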
According to the empirical model of reference [30], the L2 loss term and the cross-entropy loss term are proposed for labeled data and unlabeled data, respectively, as given in Eqs. (4) and (5), where $|X|$ is the batch size; $|x|$ is the augmented data; p and q are the real labels and predicted labels; subscript L denotes labeled data; subscript U denotes unlabeled data; N is the number of classification categories; and $H(p, P_{\text{model}})$ is the cross-entropy function. The resulting loss function is the weighted sum of the two, as shown in Eq. (6), where $\lambda$ is the weighting factor of the unsupervised learning loss function.
Since labeled samples on the client are selected automatically based on the output of the model, uneven sample distribution and over-fitting are likely. The Adaptive Representation Consistency (ARC) method [31] is used to solve the problem of uneven sample distribution. It uses unlabeled target samples to guide the training of labeled samples, so that the model generalizes better. The main idea of this method is to bring the data distributions of unlabeled and labeled samples as close together as possible. ARC uses the Maximum Mean Discrepancy (MMD) [32] to measure the difference between the distributions of labeled and unlabeled data. To make the distributions consistent, this difference must be minimized. The difference can be used as a part of the loss, as given in Eq. (7), where $Q_x$ represents the distribution of x and $G(\cdot)$ represents the Gaussian Radial Basis Function (RBF). Integrating Eqs. (6) and (7) gives the final loss function, Eq. (8).
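To make the distribution term concrete, the Python sketch below estimates the squared MMD between a batch of labeled features and a batch of unlabeled features using a Gaussian RBF kernel. It uses the standard biased empirical estimator; the function names, the bandwidth parameter `sigma`, and the toy feature vectors are assumptions for illustration and are not taken from [31] or [32].

```python
import math


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel exp(-||a - b||^2 / (2 * sigma^2))."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(labeled, unlabeled, sigma=1.0):
    """Biased empirical estimate of the squared MMD between two samples."""
    return (
        mean_kernel(labeled, labeled, sigma)
        + mean_kernel(unlabeled, unlabeled, sigma)
        - 2.0 * mean_kernel(labeled, unlabeled, sigma)
    )


# Toy two-dimensional features (hypothetical values).
labeled_features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
near_features = [[0.05, 0.0], [0.0, 0.05], [0.1, 0.1]]
far_features = [[3.0, 3.0], [3.1, 3.0], [3.0, 3.1]]

mmd_same = mmd_squared(labeled_features, labeled_features)
mmd_near = mmd_squared(labeled_features, near_features)
mmd_far = mmd_squared(labeled_features, far_features)
```

The estimate is zero for identical batches, small for nearby batches, and large for well-separated ones, so adding it to the loss pushes the labeled and unlabeled feature distributions toward each other.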

Model allocation and update
The interaction process between the server and the clients is shown in Fig. 3. The server itself collects a dataset composed of volunteers, trains a basic pre-trained model on it, distributes the model to each client, and then no longer maintains the local model. Each client trains a local model according to the semi-supervised method proposed above. Given the application background of this research, and unlike studies in which only some clients are selected [33], all clients participate in the training.
The algorithm flow is shown in Table 1, where $\lambda$ represents the weight coefficient, which is determined by the number of targets in the collected samples (not the total number of samples collected), $W_0$ is the initial weight of the model, and $W_1$ is the weight after each round of updates.
In this investigation, the client's running time, data collection, and number of collection targets depend entirely on the local user. Therefore, if training is carried out in the above manner, the data used for an update at a given time may be too limited, which affects the generalization ability of the model, or too large, which reduces the performance of the original model.
Two new aggregation rules are considered here for the aggregation mode of the server model (Table 1). In our model, when there are too few targets on a client, we either discard the targets or pass them to a third party that encrypts the data, integrates the targets from multiple clients, and performs the same training. The main reason for adopting this method is to address the problems of the application scenario of this paper. As subjects and teachers change, courses take various forms. Some courses are characterized by large numbers of students, long durations, and traditional teaching methods. This form of data accounts for the vast majority of the overall sample (either training data or to-be-identified data). The parameters of the model trained from these data should occupy a larger proportion. Conversely, other courses are characterized by small numbers of students, short durations, and special teaching methods. These data are often unique and targeted, which can affect the robustness of the model on the server. To solve these problems, the corresponding models need to be processed. The processing methods we use are: (1) reducing the weight (line 12); (2) deleting the models trained on extreme data (the recognition result is only given in the application, and the data do not participate in subsequent training) (lines 7-10); (3) mixed training with other data (lines 11-12). The model aggregation rules are shown in Table 2.
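The first two processing methods can be illustrated with a small aggregation routine. The Python sketch below is not a reproduction of Table 1 or Table 2: it assumes that each client reports a flat list of weights together with its number of targets, weights each client by its share of targets, and drops clients whose target count falls below a hypothetical threshold `min_targets`. Mixed training with other data is outside the scope of the sketch.

```python
def aggregate(client_weights, client_targets, min_targets=2):
    """Target-weighted average of client models.

    client_weights: list of flat weight lists, one per client.
    client_targets: number of detection targets seen by each client.
    min_targets: hypothetical cut-off below which a client model is dropped.
    """
    if len(client_weights) != len(client_targets):
        raise ValueError("one target count is required per client")
    # Drop models trained on extremely few targets.
    kept = [
        (weights, targets)
        for weights, targets in zip(client_weights, client_targets)
        if targets >= min_targets
    ]
    if not kept:
        raise ValueError("no client has enough targets to aggregate")
    total_targets = sum(targets for _, targets in kept)
    size = len(kept[0][0])
    global_weights = [0.0] * size
    for weights, targets in kept:
        share = targets / total_targets  # clients with few targets weigh less
        for j in range(size):
            global_weights[j] += share * weights[j]
    return global_weights


# Three hypothetical clients; the last one saw a single target and is dropped.
global_model = aggregate(
    client_weights=[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    client_targets=[30, 10, 1],
)
```

In this toy run the remaining clients contribute with shares 0.75 and 0.25, so the outlier model trained on a single target cannot distort the server model.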

Experiments
In this section, we first introduce the dataset and performance metric. We then show the simulation and real-world experiments.

Dataset configuration
We evaluate the proposed semi-supervised federated learning algorithm. The purpose of the model is to assess and classify the performance of students in the classroom, but at this stage, no recognized dataset exists for this task. Therefore, we build our own dataset, for which some students provided their true performance. The dataset consists mainly of two parts. One part was collected in the classroom of the taught course Introduction to Electrical Engineering. This course is of moderate difficulty; in addition to listening, students also need to think and take notes. The other part was collected from meetings and discussion sessions of the members of the research group, which mainly consist of listening to a researcher's report followed by discussion. We name the resulting database the classroom performance database (CPD). A total of 14 volunteers participated in the recording, and the total length of the unedited database is close to 4 h. After processing, we extracted short videos ranging in length from 30 s to 10 min. To protect privacy, we also removed the sound from the videos. We divide the videos into four categories: listening to lectures, playing with mobile phones, communicating, and writing. Labels come from two sources: expert assessment and labels provided by the volunteers themselves. In the experiment, each input to the network is 8/3 s of video, comprising 16 frames in total. Some video screenshots from the database are shown in Fig. 4.
In addition, to verify the practicality of the model, we also conduct experiments on the public Surrey Audio-Visual Expressed Emotion (SAVEE) database [34][35][36]. The database was captured in the 3D vision laboratory of the Centre for Vision, Speech and Signal Processing (CVSSP) over several months, at different times of the year, from four actors. It contains a total of 480 short videos in which the four actors show seven different emotions: anger, disgust, fear, happiness, neutral, sadness, and surprise. The videos vary in length from 3 to 5 s. The classification accuracy obtained by evaluators for visual and audio-visual data, for the seven emotion classes over the four actors, is given in Table 3. KL, JK, JE, and DC in the table are abbreviations of the actors' names. Table 4 compares the SAVEE database with the database we established.
We need to specify how the data are distributed across the clients. The data are non-independent and identically distributed (non-IID), because the allocation has to reflect the actual setting. In most classrooms, the main behavior of students is listening to the teacher's lecture, so the dataset is unbalanced. A few students play with mobile phones or whisper to each other, while some students never show such behavior. We shuffled the data and divided them among ten clients. Each client received 200 pre-processed samples, some of which are correlated in the time domain. At the same time, each set of data we allocate contains samples of all labels.
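The 8/3 s clip length of the 16-frame network input follows from the sampling scheme: frames captured at 24 FPS and sampled at a 1:4 interval give an effective rate of 6 FPS, so 16 frames span 16/6 = 8/3 s. The Python sketch below reproduces this arithmetic and shows how four interleaved clips can be drawn from the same stretch of recording. The function name `interleaved_clips` and the use of integer frame indices in place of images are assumptions for illustration.

```python
from fractions import Fraction


def interleaved_clips(frames, stride=4, clip_length=16):
    """Return `stride` clips, each taking every `stride`-th frame.

    Clip k starts at frame k, so the clips cover the same time span
    but share no frame.
    """
    needed = stride * clip_length
    if len(frames) < needed:
        raise ValueError("not enough frames for the requested clips")
    return [frames[offset:needed:stride] for offset in range(stride)]


capture_fps = 24
stride = 4
clip_length = 16

effective_fps = Fraction(capture_fps, stride)         # 6 frames per second
clip_seconds = Fraction(clip_length) / effective_fps  # 8/3 seconds

frames = list(range(stride * clip_length))  # stand-in for 64 recorded frames
clips = interleaved_clips(frames, stride, clip_length)
```

Each of the four clips contains 16 original, unmodified frames from the same 8/3 s window, which is why they can share one label.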

Results
In the experiment, we use the above two databases to test the proposed network. We randomly sample 10% of the database, stratified by label, as the server training data, and the remaining 90% is allocated to the clients as training and test data. The initial learning rate is 0.001, and the mini-batch size of both the labeled and unlabeled datasets is 16. We use cosine learning rate decay [37] as the learning rate schedule, given in Eq. (9), where $\eta_0$ is the initial learning rate, $s_t$ is the current number of training steps, and $s_T$ is the total number of training steps. The weight of each client model is set according to the proportion of targets on that client, as shown in Eq. (10), where $s_{ci}$ is the number of targets on the ith client. $C_L$ and $C_U$ are set to 90 and 65, respectively. Table 4 lists the comparison results of our proposed method and some classic methods on CPD. The first line of the table gives the amount of data involved in pre-training and its proportion of the total database. It can be seen that the proposed method achieves better results than the existing methods. In particular, compared with multiple semi-supervised methods that use the same proportion of real labels, our method achieves a higher recognition rate. As the proportion of real labels decreases, the advantage of our method over the others grows; when the labeled data are reduced from 1000 to 500, the accuracy of our method degrades less than that of the other networks. This is because the proposed method uses federated learning to let private data participate in training without the data being obtained, whereas most other semi-supervised methods either put the network to work directly after training on the labeled data or can use private data only by obtaining them. We also experiment with the SAVEE dataset to further verify the proposed algorithm. The results are shown in Table 5.
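The learning rate schedule mentioned in the training setup can be sketched as follows. The sketch assumes the widely used half-cosine form, in which the rate falls from $\eta_0$ to zero over $s_T$ steps; this form may differ in its constants from the schedule used in the experiments. The function name `cosine_lr` and the total of 1000 steps in the usage lines are likewise assumptions.

```python
import math


def cosine_lr(step, total_steps, initial_lr=0.001):
    """Half-cosine decay from initial_lr at step 0 to 0 at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so the rate stays within [0, initial_lr].
    progress = min(max(step, 0), total_steps) / total_steps
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical run of 1000 training steps.
lr_start = cosine_lr(0, 1000)
lr_middle = cosine_lr(500, 1000)
lr_end = cosine_lr(1000, 1000)
```

Under this assumed form the rate starts at the initial value of 0.001, is half of it midway through training, and reaches zero at the final step.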
As a labeled dataset, SAVEE is composed of seven types of short videos and differs considerably from the dataset we established in both application field and image quality. It is worth noting that we do not achieve the highest recognition rate on this dataset. The main reason is that SAVEE is a fully labeled dataset, whereas we conduct semi-supervised training on it with the goal of evaluating the proposed method, so the best accuracy is not achieved on this task. However, it can still be seen that our proposed method has advantages over the other methods. At the same time, the accuracy of all methods decreases under the same proportion of labeled data. This is because SAVEE is an audio-visual database: the sound contains important information, and only the image part is used in the experiment, so a large part of the information is lost.
It can also be observed from the table that the database poses a certain degree of difficulty for semi-supervised learning. When the labeled data drop from 800 to 400, the accuracy drops sharply. This is because some of the labels in the database are similar. Therefore, we compare different values of $C_L$ and $C_U$, as shown in Table 6. It can be seen that when modeling for different application backgrounds, the hyperparameters need to be adjusted appropriately.
We fix the total number of targets to be detected $N_s$ and then vary the number of training rounds on the client $E_c$, the number of detection targets assigned to each client $n_s$, the number of clients needed $N_c$, and the proportion of clients participating in training C. As shown in Fig. 5, when $N_c = 1$, the setting can be regarded as centralized learning. It can be seen from Fig. 5 that when the training is too scattered and the training on each client is insufficient, the accuracy of the model decreases. With $N_s$ kept unchanged, $n_s$ and $N_c$ are inversely proportional, that is, $N_c = N_s / n_s$. As the amount of local training increases, the required communication cost gradually decreases. With sparse data, greater communication costs are required in training. A further, less obvious hazard is that over-fitting occurs if the number of local samples is small. When the number of samples is small enough, the model should be abandoned.
From Fig. 5e, it can be seen that the proposed algorithm achieves recognition accuracy similar to that of centralized learning but requires more training rounds. When $N_c = 20$ and $E_c = 20$, it reaches about 80% after about 150 rounds; when $N_c = 1$, a similar accuracy is reached in 80 rounds.
When a client holds an extreme number of samples and no weighting is applied, the model on the client has a negative effect on the model on the server. Given the application, an extreme number here means a small one; when the number is large, other problems arise, such as occlusion between samples and insufficient resolution. Figure 6 shows a comparison under uneven target allocation, where ten clients contain only one detection target.
In practical applications, the parameters are assigned as follows. C is determined by the application scenario and cannot be assigned. It can be seen from Fig. 5 that as C decreases, the number of clients participating in the training increases, and the trained model performs better. E should be set as large as the application context allows. Considering the performance of the client and the impact on teachers' subsequent courses, the training time of the client should, as far as possible, not exceed one class period. A course in colleges and high schools now lasts about 90-120 min [40]. The training duration selected here should therefore not exceed 2 h, and the maximum used in the experiment is E = 20. It should be noted that the choice of E is affected by the performance of the device and should be adjusted according to the actual scene.

Conclusion
In this article, we propose a semi-supervised federated learning framework. This framework can identify students' class status while protecting their privacy. We take advantage of the high consistency between the self-built database and the detection task and propose a new loss function that constrains the training on unlabeled data on the client. We also recruit volunteers to establish CPD. On the one hand, we use CPD to test the proposed network model; on the other hand, CPD is used to train the initial model of the system. We have established a complete teaching evaluation system based on the proposed algorithm. With the authorization of the school, the system can work in multiple classrooms, and we guarantee that no one's privacy will be violated. To verify the generalization of the proposed framework, experiments were also conducted on the SAVEE database. Experiments show that our proposed method is competitive with the current mainstream semi-supervised algorithms. In addition, the framework does not require the collection of personal information, which effectively protects the personal privacy of students.