1 Introduction

Autism spectrum disorder (ASD) is a neurodevelopmental disorder characterized by deficits in social communication and repetitive behavior (American Psychiatric Association 2013). Attentional deficit is one of the conspicuous deficits of ASD; hence, children with ASD require extra attentional support during learning. Teachers commonly use subjective attentional assessment to monitor the learning progress of their students. The subjective approach usually relies on observation, which is tedious, time-consuming, and requires years of experience, because each child with ASD exhibits unique attentional behavior (Marcu et al. 2013; Douglas et al. 2022; Wood et al. 2016). For example, one child with ASD may prefer to look at stimuli and listen simultaneously, while another may choose to look at the stimuli and not listen. It is, therefore, challenging for teachers to manage both the learning content and the diverse attentional behavior of their students concurrently (Wood et al. 2016). Research has shown that there is a lack of resources and experienced teachers for children with ASD due to the increased rate of ASD prevalence in the United States of America, as reported by the Centers for Disease Control and Prevention (CDC) (CDC 2022; Lee and Meadan 2021). The increase in the ASD population is not limited to the USA but is global, including Qatar, the country of the current study (Alshaban et al. 2019). The experience and feelings of families and friends who have children with ASD show that the children require a great deal of educational and social support from teachers, parents, siblings, relatives, and friends (Sharabi and Marom-Golan 2018). Sometimes, the support needed by children with ASD spans from childhood to adulthood, which is usually overwhelming and psychologically challenging (Russa et al. 2015). Hence, understanding how stakeholders can benefit from objective attention assessment of children with ASD can support them in managing the children's learning experience effectively.

Parents agree that attention impairment is the root cause of the core deficits in children with ASD (Kinnealey et al. 2012; Ridderinkhof et al. 2020). Thus, existing technological interventions for children with ASD often focus on using technology to gain their attention in order to improve core ASD impairments, such as social and communication skills (Almumen and Almuhareb 2020). While some studies have assessed the attention of children with ASD using these interventions, very few studies have explored how technology can be utilized to assess attention objectively, and more importantly, how technologies can be used to measure attention from different perspectives. The common form of attention assessment is mostly cognitive: attention is inferred when students respond correctly to a given attention task. According to the work of Fredricks (2011), engagement, or attention in our context, can be viewed from three dimensions: behavioral, emotional, and cognitive. Several studies have shown that attentional context, especially behavioral and emotional engagement, is measurable with various kinds of sensors such as cameras, pressure sensors, and eye trackers (Dewan et al. 2019). Emotion is a way of expressing inner feelings toward an action. Individuals express emotions in different ways, such as facial expressions, body language, or vocalizations; yet emotions are most prominently expressed through facial expression (Tsiourti et al. 2019). There are seven basic emotions: anger, sadness, disgust, joy, surprise, fear, and contempt (Rosenberg and Ekman 2020). Clusters of facial muscle activities, known as facial action units in the Facial Action Coding System (FACS), describe emotions. Emotions can be categorized as positive, negative, or neutral. Emotional states that depict enjoyment or good states, such as joy and surprise, fall under positive emotion. Negative emotions, such as anger, sadness, disgust, fear, and contempt, imply displeasure with something or someone. The neutral emotional state represents indifference. Educators and psychologists suggest that children's emotions can affect their ability to focus on a task (Escobedo et al. 2014; Neuhaus et al. 2019).

The advent of sensing technologies and machine learning has enabled intelligent systems that monitor and respond to human behavior. Intelligent systems using machine learning algorithms are being applied by experts in different domains such as education (Hutt et al. 2021), transportation (Palazzi et al. 2018), and healthcare (Chen et al. 2021). The development of machine learning models for detecting attention in children with ASD is still in its infancy due to the difficulty of generalizing their attentional behaviors (Jacob et al. 2019). However, there are existing studies investigating the application of machine learning for detecting autism (Alvari et al. 2021; Carpenter et al. 2021; Thabtah and Peebles 2020). Machine learning methods have revolutionized many research fields because they can automatically detect human behavior from extracted features describing a person's reaction (Zanker et al. 2019). It is evident from previous research that the attentional behavior of the typically developing (TD) population can be measured by training machine learning models on multiple behavioral data, such as physiological signals including brain signals and heart rate (Belle et al. 2011; Yulita et al. 2019); brain signals, body movement, and heart rate (Sonne et al. 2015; Mamun et al. 2019); and facial expression and eye gaze (Aslan et al. 2014). The objective of combining different attentional behaviors is to compare the performance of a hybrid method against individual methods of measuring attention. However, the assessment of attention in children with ASD is often done subjectively through video data analysis, where researchers code frames of recorded activity sessions as attention or inattention (Kinnealey et al. 2012; Banire et al. 2015; Higuchi et al. 2018; Lee and Schertz 2020). Very few studies have used automated attention detection, for instance a computer vision algorithm to capture facial movement (Egger et al. 2018) or a machine learning model built with eye-tracking data as a biomarker for detecting autism rather than detecting attention (Yaneva 2020). These studies show the potential of facial and eye-tracking features for behavioral assessment. However, to the best of our knowledge, no study has investigated the performance of facial and eye-tracking features for developing a machine learning model capable of detecting the attention of children with ASD.

Thus, the current study proposes a behavioral-based attentional model using a bimodal approach, face-based and gaze-based models, for children with ASD. The performance of each model type is evaluated and compared. The model comparison can inform the direction of future studies on prominent attentional behavior in children with ASD. Toward the goal of this study, three major research questions are proposed: (1) Is a unimodal attentional model sufficient for detecting attention in children with ASD? (2) What are the prominent facial and eye-gaze features for detecting attention? (3) What differentiates the attention model of children with ASD from that of TD children?

In this paper, a novel machine learning model for detecting the attention of children with ASD using facial and eye-tracking features is proposed. In addition, the performance of the individual channels and the hybrid channel is compared to identify the prominent features for detecting attention in children with ASD. Lastly, the machine learning models using these two channels are compared between children with ASD and TD children.

The structure of this paper is as follows: Sect. 2 describes the work that has been done to detect attention in children with ASD, and Sect. 3 presents the research gap. Section 4 describes the methodology used in this paper. The experimental results are presented in Sect. 5, and Sect. 6 presents the discussion and the direction of future work. Section 7 outlines the practical implications, Sect. 8 presents the limitations of this study, and Sect. 9 summarizes the findings of the paper.

2 Related works

This section describes machine learning and its application to attention detection, also referred to as the attentional model in this paper. The attentional model leverages machine learning to predict the attentional behavior of students. Prior research shows that attentional models can be developed using multiple physiological behaviors (multimodal methods), such as brain signals and heart rate (Belle et al. 2011); brain signals, body movement, and heart rate (Sonne et al. 2015); and facial expression and eye gaze (Aslan et al. 2014), while other studies use only one physiological behavior (unimodal), such as brain signals (Yang et al. 2015; Ghassemi et al. 2009). The objective of combining different physiological behaviors is to compare the performance of unimodal and multimodal-based models in order to identify the better approach for attention detection tasks.

2.1 Unimodal attentional model

The unimodal attentional model applies a single channel of physiological behavior to annotate attention. Bosch et al. (2016) applied facial features using appearance-based methods and body posture to detect learners' engagement. The authors gave 137 students in 8th and 9th grade an educational game about physics and coded the actions of the participants for on-task and off-task behaviors. The findings from their study show that the model performance using the AUC for off-task detection was above chance, i.e., greater than 0.5 for a generalized model (AUC = 0.816). Additionally, the authors found that the face-based model could generalize across time, ethnicity, and gender.

Similarly, Whitehill et al. (2014) used the appearance-based method on facial features to recognize students' engagement. The students played cognitive skills training software, and a Logitech web camera was used to record the entire session. The authors used a team of labelers consisting of undergraduate and graduate students from computer science, cognitive science, and psychology. These labelers viewed the recorded videos and labeled the facial appearance of 34 participants into four levels of engagement: not engaged, nominally engaged, engaged, and very engaged. The findings in this study revealed that the SVM classifier had the highest performance score among the classifiers for the generalized model (AUC = 0.729). The authors also evaluated how a face-based model generalizes across demographic factors, such as ethnicity. The model was trained with Caucasian and Asian-American populations and tested on an African American dataset. They found that the model generalized across ethnicity, and the predictions were above chance (AUC = 0.691).

Aside from facial features, other studies have used eye-tracking features to predict attention-related behaviors. Bixler and D'Mello (2015) used a classification approach to predict mind wandering in a reading task among college students. This study used fixation, saccade, and pupil diameter features with four different classifier algorithms. The best classifier was SVM with 74% accuracy. Lallé and Conati (2018) developed a gaze-based classification model to predict confusion during interactions with a visualization task. The study used gaze patterns, pupil size, and head distance as measurements. The authors achieved an accuracy of 61% using the RF classifier and identified variations in user pupil size and head distance as good predictors of attentional state. Chen and Chen (2017) examined the states of attention and inattention of 15 participants during mathematics, continuous performance tasks (CPT), and reading tasks using SVM classifiers. The gaze-based attentional model achieved 93.1% accuracy. Finally, Shojaeizadeh and Djamasbi (2019) predicted task demand in a visualization task using the RF classifier with eye-tracking features such as fixations, saccades, blinks, and pupillary responses. This study achieved an accuracy of 79%, with pupillary response being the best predictor of task demand.

Overall, these studies adopting unimodal attentional models support the evidence that machine learning algorithms are effective for automated attention assessment even with a single channel of features, such as gaze or facial features. The current study explores the two channels, facial and gaze features, independently. The resulting machine learning model performance will reveal the best channel of features for detecting attention in children with ASD.

2.2 Multimodal attentional model

In multimodal attentional models, two or more sensors are used to record behavioral measures during attention tasks. Asteriadis and Tzouveli (2009) investigated facial features and head pose captured with a webcam to predict the attentive and frustrated states of learners. This study applied a fuzzy rule classification method and achieved an accuracy of 80–88%. D'mello and Graesser (2010) utilized a multimodal method of engagement recognition based on facial features, body posture, and conversational cues. Their model evaluations using kappa scores determined that a two-channel combination performs similarly to a three-channel method.

Another study on multimodal methods exploited facial features and interaction data to predict engagement and frustration. The authors used a multilayer perceptron (MLP) and achieved an accuracy of 78% (Shaker et al. 2013). A similar study by Monkaresi et al. (2016) utilized facial features and heart rate to detect engagement; it applied RF and achieved a prediction power above the chance level (AUC = 0.758). The study by Chen and Tsai (2017) used SVM to train head pose and eye movement data from 10 participants to develop an attention recognition model, achieving an accuracy of 93.1%. Another multimodal study used SVM to develop an engagement recognition system and achieved 85% accuracy (Aslan et al. 2014). The hybrid-based model in our research is based on the multimodal methods used in the reviewed studies.

Aside from studies adopting unimodal attentional models, multimodal models have also proved effective. Thus, comparing the multimodal model with the unimodal models is one of the objectives of the current study.

3 Research gap

Related work on unimodal and multimodal attentional models indicates that both methods have the potential to differentiate attention from inattention. However, the multimodal attentional model has an advantage over unimodal methods, as it relies on multiple attentional behaviors. It is also important to note that the related work relied on the subjective labeling of the attentional behaviors of students who do not have a neurocognitive disorder. One of the drawbacks of the subjective method of labeling attention is inconsistency, due to ambiguity and differences in the background experiences of the experts (Dewan et al. 2019), especially for children with ASD, who exhibit unique attentional behavior. However, combining observable attentional behavior with cognitive processing assessments, such as performance scores, can reduce the ambiguity of annotating attentional behavior in children with ASD. Additionally, the generalizability of the model across demographic information was explored to evaluate the robustness of the attentional model. The generalizability of the behavioral-based attentional model across children with ASD, TD children, and attention task types is worth investigating.

4 Methodology

The framework of the hybrid-based model for attention recognition used in this study is shown in Fig. 1. It consists of the face and eye-tracking data extracted from the video frames captured during the experiment. These two data sources were adopted based on the evidence from the unimodal studies reviewed in the current paper. Furthermore, the methods of obtaining the data are unobtrusive, which prevents children with ASD from experiencing sensory issues due to body contact with the tools. We utilized the geometric distances between 34 facial landmarks and exploited the eye-tracking data of children with ASD for feature extraction. The features were fed into an SVM algorithm to predict attention and inattention. Using the four major steps depicted in Fig. 1, three attentional models, face-based, gaze-based, and hybrid-based (face and gaze), were developed to classify attention and inattention in both children with ASD and TD children. In the first step, a series of attention task experiments was conducted to collect and annotate facial and eye-tracking data that describe attention. In the second step, feature extraction and selection were carried out to identify prominent features for training the attentional models. In the third step, the selected facial and gaze features were fed separately and combined into binary classifiers to train the three types of attentional models: face-based, gaze-based, and hybrid-based. The six binary classifiers considered in this study are SVM, RF, KNN, LR, CART, and GBC. Lastly, the three attentional models were evaluated and compared to identify the best attentional model for children with ASD as well as its robustness for generalization across the autism spectrum. The models for the ASD and TD groups were also compared.

Fig. 1
figure 1

Methods of face, gaze and hybrid-based attentional models

4.1 Data collection and annotation

To collect facial and eye-tracking data describing attentional behavior, an experiment was conducted with both children with ASD and TD children, in which they took a series of attention tasks. Prior to the experiment, approval was obtained from the Qatar Biomedical Research Institute Institutional Review Board. A total of forty-six children between the ages of seven and eleven years participated in the study: twenty children with ASD (n = 20, M = 8.57, SD = 1.40) and twenty-six TD children in the same age range (n = 26, M = 8.58, SD = 1.36). The ASD group had sixteen boys and four girls with mild to moderate ASD, while the TD group had eighteen boys and eight girls, as shown in Table 1. The ASD participants were recruited through a local autism school in Doha and from known families residing in Qatar. All the ASD participants were clinically diagnosed by medical practitioners using the DSM IV-TR criteria (Huang et al. 2010). The TD participants were recruited from mainstream schools. All the parents of the participants read and signed an informed consent form to allow their children to participate in the study. To achieve high data quality, participants were prevented from eating or drinking during the experiment, and the experiment was conducted in a quiet and dimly lit room to avoid distraction and reduce illumination, which may affect the eye-tracking sensor.

Table 1 Description of parameters used in the feature selection (Fig. 4)

The attention tasks adopted in this study were based on the context of the CPT (Rosvold et al. 1956). The CPT is a computer-based assessment that presents stimuli to test users in a repetitive and monotonous pattern to measure how long they can maintain their attention. The CPT attention tasks were simulated in a virtual classroom with distractions to represent a naturalistic learning environment (Parsons 2014; Rizzo et al. 2009). The stimuli for the tasks were random letters displayed on the board. The tasks had four levels of distraction: level 1 (baseline, no distractions), level 2 (minimal), level 3 (medium), and level 4 (extensive). The higher the level, the more the distractions. The distractions included students raising hands, coughing, chatting, and dropping books. The purpose of introducing distractions was to capture robust attentional behavior in different learning scenarios. The participants were instructed to press a clicker on the table when the letter X appeared and to ignore other letters. The number of correct clicks gives the performance score. A perfect performance score for each level was 40, as the letter X appeared 40 times, along with 214 random letters. Participants took a two-minute break after each attention task level. Two participants could not click and watch the screen simultaneously; thus, the experiment was altered such that the letters were verbally called out to them as they randomly appeared. During the task, facial and eye movements were captured in real time with a Logitech C920 webcam attached to the top of the smaller screen in Fig. 2 and an eye-tracking device placed at the base of the same screen. The real-time data capture was monitored and recorded by the researcher on the larger monitor (Fig. 3).

Fig. 2
figure 2

Real-time eye and face tracking analysis

Fig. 3
figure 3

Thirty-four facial landmarks with labels

After each participant completed the attention tasks (levels 1 to 4), the recorded session for each level was divided into separate time slots for annotation. Each time slot corresponds to a random letter displayed for a duration of 1400 ms. To annotate the video, each slot was labeled as attention for a valid response (letter X clicked) or inattention for an invalid response (letter X missed). In the ASD group, 18 out of 20 children took the attention tasks, of whom three did not attempt level 4 as they wanted to end the experiment; all 26 TD children completed level 1 of the attention task. Overall, 95 videos of the attention task sessions were annotated, and each video was 5 min long.
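The slot-level labeling rule can be summarized in a short sketch. This is a minimal illustration, assuming a hypothetical Slot record holding each letter's onset time, the displayed letter, and whether the child clicked during that slot; it is not the authors' annotation tool.

```python
from dataclasses import dataclass

SLOT_MS = 1400  # each letter is displayed for 1400 ms

@dataclass
class Slot:
    start_ms: int   # onset of the letter within the recorded session
    letter: str     # letter shown on the virtual-classroom board
    clicked: bool   # whether the clicker was pressed during this slot

def annotate(slots):
    """Label target slots: a clicked X counts as attention, a missed X as
    inattention; slots showing non-target letters are left unlabeled."""
    labels = []
    for slot in slots:
        if slot.letter != "X":
            continue
        labels.append((slot.start_ms, "attention" if slot.clicked else "inattention"))
    return labels
```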

4.2 Feature extraction and selection

This section discusses first, the feature extraction and selection methods for face tracking. Second, the feature extraction of the eye tracking data is also reported.

4.2.1 Facial features

For facial feature extraction, 34 facial landmarks were extracted frame by frame from each participant's recorded webcam session. These facial landmarks cover five regions of the face: eyes, eyebrows, nose, lips, and jaw, and are represented as a pool of feature vectors consisting of x and y coordinates, as shown in Eq. (1). Let \( f_{n}^{i} \) denote the landmark coordinates spanning video frames i through n.

$$ f_{n}^{i} = \begin{bmatrix} x_{0}^{i}\,y_{0}^{i} & x_{1}^{i}\,y_{1}^{i} & \cdots & x_{33}^{i}\,y_{33}^{i} \\ \vdots & \vdots & \ddots & \vdots \\ x_{0}^{n}\,y_{0}^{n} & x_{1}^{n}\,y_{1}^{n} & \cdots & x_{33}^{n}\,y_{33}^{n} \end{bmatrix} $$
(1)

To explore the temporal variation of muscular activity across landmarks, we estimated the pairwise Euclidean distances between all landmarks using the distance formula in Eq. (2). This method is common in the literature for exploring differences between posed emotions and a neutral face (Huang et al. 2010; Sariyanidi et al. 2014; Chu et al. 2018; Ghimire and Lee 2013).

$$ d\left[ \left( x_{1}, y_{1} \right), \left( x_{2}, y_{2} \right) \right] = \sqrt{\left( x_{2} - x_{1} \right)^{2} + \left( y_{2} - y_{1} \right)^{2}} $$
(2)

where \((x_{1}, y_{1})\) and \((x_{2}, y_{2})\) represent two different facial landmarks. The pairwise distances between the 34 landmarks yield 561 geometric distance-based features. These were reduced from 561 to 20 features by applying a feature selection method (Jain and Singh 2018). Feature selection keeps only the most informative features while maintaining the efficiency of the model; its objective is to reduce computational cost. In this model, we selected the best features from the 561 pairwise distances. The data samples were normalized to ensure a comparable range across features. This normalization is achieved as follows (Eq. 3).

$$ Z = \frac{{x_{i} - {\text{mean}}\left( x \right)}}{{{\text{stdev}}\left( x \right)}} $$
(3)

where Z is the standardized score, and stdev is the standard deviation of the data samples. In particular, the standardization subtracts the mean value of the samples and divides their value by the standard deviation.
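As an illustration of Eqs. (2) and (3), the sketch below computes the 561 pairwise landmark distances per frame and standardizes them. It is a minimal example under the assumption that the landmarks for each frame are available as a 34 × 2 array of (x, y) coordinates; the frame count and the random array are placeholders, not the study's data.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import zscore

def frame_to_distance_features(landmarks_xy):
    """Pairwise Euclidean distances (Eq. 2) between the 34 landmarks of one
    frame; returns 34 * 33 / 2 = 561 distance-based features."""
    return pdist(landmarks_xy, metric="euclidean")

n_frames = 1000                                   # placeholder frame count
frames = np.random.rand(n_frames, 34, 2)          # placeholder for webcam landmarks
dist_feats = np.vstack([frame_to_distance_features(f) for f in frames])

# Standardize each of the 561 distance features to zero mean, unit variance (Eq. 3).
dist_feats_z = zscore(dist_feats, axis=0)
print(dist_feats_z.shape)                         # (1000, 561)
```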

0-Right Top Jaw, 1-Right Jaw Angle, 2-Gnathion, 3-Left Jaw Angle, 4-Left Top Jaw, 5-Outer Right Brow, 6-Right Brow Corner, 7-Inner Right Brow Corner, 8-Inner Left Brow Corner, 9-Left Brow Center, 10-Outer Left Brow Corner, 11-Nose Root, 12-Nose Tip, 13-Nose Lower Right Boundary, 14-Nose Bottom Boundary, 15-Nose Lower Left Boundary, 16-Outer Right Eye, 17-Inner Right Eye, 18-Inner Left Eye, 19-Outer Left Eye, 20-Right Lip Corner, 21-Right Apex Upper Lip, 22-Upper Lip Center, 23-Left Apex Upper Lip, 24-Left Lip Corner, 25-Left Edge Lower Lip, 26-Lower Lip Center, 27-Right Edge Lower Lip, 28-Bottom Lower Lip, 29-Top Lower Lip, 30-Upper Corner Right Eye, 31-Lower Corner Right Eye, 32-Upper Corner Left Eye, 33-Lower Corner Left Eye.

To identify the facial features that best differentiate attention from inattention, the geometric information was estimated pairwise over all landmarks using the Euclidean distance formula (Eq. 4) as follows:

$$ {\text{Euclidean}}\;{\text{distance}} = \sqrt{\left( x_{2} - x_{1} \right)^{2} + \left( y_{2} - y_{1} \right)^{2}} $$
(4)

where \((x_{1}, y_{1})\) and \((x_{2}, y_{2})\) represent two different facial landmarks.

To select the best features from the facial distances, a threshold distance was used. The threshold distance is a measurement that describes the change between a facial expression in a neutral frame and in an expression frame (Al-agha et al. 2017). The threshold distance is an established method for revealing information embedded in a dataset and has been successfully applied to differentiate posed emotions from neutral emotions (Asteriadis et al. 2009; Hulse et al. 2012). In the current study, the features whose threshold distance between attention and inattention had the highest values, represented in Fig. 4, were selected to train several binary classifier algorithms. The parameters describing the feature selection process are described in Table 1.
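A simplified sketch of this selection step is shown below, assuming the threshold distance is taken as the absolute difference between the mean standardized distance of attention frames and that of inattention frames; the exact scoring used in the study may differ, so this is illustrative only.

```python
import numpy as np

def select_top_distance_features(dist_feats_z, labels, k=20):
    """Rank the 561 distance features by the separation between attention and
    inattention frames and keep the indices of the best k.

    dist_feats_z: (n_frames, 561) standardized distance features
    labels:       (n_frames,) 1 for attention, 0 for inattention
    """
    mean_att = dist_feats_z[labels == 1].mean(axis=0)
    mean_inatt = dist_feats_z[labels == 0].mean(axis=0)
    threshold_distance = np.abs(mean_att - mean_inatt)  # per-feature separation
    ranked_idx = np.argsort(threshold_distance)[::-1]   # best feature first
    return ranked_idx[:k], threshold_distance[ranked_idx[:k]]

# The selected columns are then used for training, e.g.:
# top_idx, _ = select_top_distance_features(dist_feats_z, labels)
# X_face = dist_feats_z[:, top_idx]
```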

Fig. 4
figure 4

Facial feature selection process

4.2.2 Eye-tracking features

The gaze-based attentional model is built on six primary eye-tracking features: gaze position, fixation position (FixationX, FixationY), fixation duration (FixationDuration), ocular distance, i.e., head distance to the screen (DistanceLeft, DistanceRight), pupil size (PupilLeft, PupilRight), and interocular distance. The description of these features is provided in Table 2.

Table 2 Description of gaze-based features

Next, we identified the annotated samples labeled as attention, inattention, and unknown. The samples with unknown labels were deleted, leaving only samples labeled as attention and inattention. The annotation column with string values, attention and inattention, was converted to integers 1 and 0, respectively. Lastly, we normalized each feature to the same scale using the StandardScaler class in scikit-learn (Eq. 5) to achieve a relatively normalized sample distribution (Table 3).

$$ Z = \frac{x_{i} - {\text{mean}}\left( x \right)}{{\text{stdev}}\left( x \right)} $$
(5)
Table 3 Data points generated from the ASD group in attention task levels 1–4
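The preprocessing just described can be sketched as follows. It assumes the eye-tracking samples are in a tabular file with one column per feature and an annotation column; the file name is hypothetical, and only the feature names listed in Table 2 are taken from the study.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# One row per sample: eye-tracking features plus an 'annotation' column with
# values 'attention', 'inattention', or 'unknown'.
df = pd.read_csv("gaze_samples.csv")                       # hypothetical file

# 1. Discard samples without a usable label.
df = df[df["annotation"] != "unknown"]

# 2. Convert string labels to integers (attention = 1, inattention = 0).
df["annotation"] = df["annotation"].map({"attention": 1, "inattention": 0})

# 3. Standardize every feature to zero mean and unit variance (Eq. 5).
feature_cols = [c for c in df.columns if c != "annotation"]
X = StandardScaler().fit_transform(df[feature_cols])
y = df["annotation"].to_numpy()
```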

The best features were selected using an embedded method that exploits the inherent characteristics of decision tree algorithms such as random forest and CART (Guyon and Elisseeff 2003); this approach is especially encouraged for imbalanced datasets (Liu et al. 2019). We applied an embedded feature selection method using feature permutations of randomly selected samples and calculated the percentage increase in the misclassification rate to select the best individual eye-tracking features (Lu 2019) (Fig. 5).
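A comparable selection step can be sketched with scikit-learn's permutation importance, which measures how much shuffling each feature degrades a fitted random forest. Note that this scores features by the drop in ROC-AUC rather than by the percentage increase in misclassification rate used in the study, so it is an analogous sketch rather than the exact procedure; X, y, and feature_cols are assumed from the preprocessing step above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Fit a random forest on the standardized eye-tracking features.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
rf.fit(X, y)

# Permute each feature and record the average performance drop.
result = permutation_importance(rf, X, y, scoring="roc_auc", n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"{feature_cols[idx]:<20} importance = {result.importances_mean[idx]:.4f}")
```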

Fig. 5
figure 5

Eye-tracking feature selection process

4.3 Binary classification

Three different attentional models were developed using six different classifier algorithms, SVM, CART, LR, RF, GBC, and KNN, implemented in scikit-learn. The performances of these algorithms were compared using tenfold cross-validation, and SVM outperformed the other classifiers with the highest AUC value. Next, we applied hyper-parameter tuning to optimize the SVM parameters cost (C) and gamma (γ). A large value of C makes the model stricter on classification errors, while γ influences the complexity of the decision boundary: large values of γ lead to an increasingly complex boundary that correctly classifies a higher percentage of the training data. Inappropriate selection of these parameters may therefore lead to poor performance on a new dataset (i.e., overfitting). The candidate values for C were drawn from a predefined set, and γ was drawn from [0.001, 0.01, 0.1, 1, 10].
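The tuning step can be illustrated with a grid search over C and γ under tenfold cross-validation, scored by ROC-AUC. The C candidates below are placeholders, because the original candidate set could not be recovered from the text; X and y are assumed from the earlier preprocessing.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 1, 10, 100],                 # placeholder candidates
    "gamma": [0.001, 0.01, 0.1, 1, 10],     # values listed in the text
}
search = GridSearchCV(
    SVC(kernel="rbf", probability=True),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```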

4.4 Attentional model

Three attentional models were developed using facial, gaze, and hybrid (facial and gaze) features. Each model was trained in two forms: participant-specific and generalized. The participant-specific model was trained on an individual participant's data, while the generalized model was trained on the data from all participants. In total, six attentional models were evaluated using AUC scores to determine their performance in detecting attention. The participant-specific models consistently showed higher performance than the generalized models.
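A sketch of the two evaluation schemes is given below: a leave-one-participant-out loop for the generalized model and a per-child 80/20 split for the participant-specific model, both averaging ROC-AUC across children. It assumes arrays X and y plus a parallel groups array of participant IDs; the SVM hyper-parameters are scikit-learn defaults rather than the tuned values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def generalized_auc(X, y, groups):
    """Leave one participant out: train on all other children, test on the
    held-out child, and average ROC-AUC over children."""
    scores = []
    for pid in np.unique(groups):
        train, test = groups != pid, groups == pid
        if len(np.unique(y[test])) < 2:        # skip children with a single class
            continue
        clf = SVC(kernel="rbf", probability=True).fit(X[train], y[train])
        scores.append(roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1]))
    return float(np.mean(scores))

def participant_specific_auc(X, y, groups, test_size=0.2):
    """Per-child 80/20 split: train and test only on that child's samples."""
    scores = []
    for pid in np.unique(groups):
        Xi, yi = X[groups == pid], y[groups == pid]
        if len(np.unique(yi)) < 2:
            continue
        X_tr, X_te, y_tr, y_te = train_test_split(
            Xi, yi, test_size=test_size, stratify=yi, random_state=0)
        clf = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
        scores.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    return float(np.mean(scores))
```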

5 Results

This section presents the descriptive statistics of the attentional behavior of children with ASD and TD children, and the results of the face-based, gaze-based, and hybrid-based attentional models using six different classifier algorithms, followed by the feature selection and the participant-specific and generalized models. Among the six binary classifiers, SVM had the highest performance value. The facial and gaze data used for developing the attentional models were generated from a series of attention tasks with 46 participants (ASD = 20 and TD = 26). The attention tasks consist of four different levels: baseline, easy, medium, and hard. The baseline comprises static social and nonsocial distractions; the easy level presents the target stimuli with few dynamic distractions, the medium level with more dynamic distractions, and the hard level with the most distractions. The varying distraction levels were used to obtain robust forms of attentional behavior. The attentional models for the ASD group were built with data from attention task levels 1–4, while only level 1 data were used for the comparison between the ASD and TD groups.

5.1 Descriptive statistics of attentional data in ASD and TD participants

The descriptive statistics of the attention and inattention data for both facial and eye-gaze features from the ASD group for all the attention tasks (levels 1–4) are presented in Table 4 and Fig. 6. Next, we present the differences between the two groups (ASD and TD) and within each group using independent-samples and paired t-tests, respectively. Despite the differences between the facial and gaze data, there is no significant difference between the two data types, as shown in Table 5, which indicates that the performance of the face-based and gaze-based models can be compared. The descriptive statistics for facial and gaze features for the ASD and TD groups in Table 6 and Fig. 7 show the variation between the data from the two groups. The independent t-test between the data from the two groups (Table 7) shows no significant difference between the groups except for the inattention data type, where the ASD group has significantly more data samples than the TD group.

Table 4 Descriptive statistics of facial and gaze data in ASD group (Attention task levels 1–4)
Fig. 6
figure 6

Descriptive plots for attention and inattention data generated from facial and gaze features

Table 5 Paired samples T-test
Table 6 Descriptives for facial and gaze features for ASD and TD groups (Attention task level 1 only)
Fig. 7
figure 7

Descriptive plots for attention and inattention facial data extracted from the ASD and TD groups

Table 7 Independent samples T-test for facial and gaze data between ASD and TD groups

5.2 Face-based attentional model

Both the participant-specific and generalized models were evaluated using ROC-AUC scores. The participant-specific model showed a higher performance value than the generalized model.

5.2.1 Evaluation of face-based model

In the evaluation phase, we compared the performance of models developed with different sets of distance-based features (i.e., the best 10, 20, and 30 features). The models with 10 and 30 features each had an accuracy score of 0.873, while the model with 20 features had an accuracy of 0.889, as shown in Fig. 8. Thus, this study used the model with the best 20 distance-based features due to its higher performance with fewer features. Similarly, the SVM model had the best AUC score across the models depicted in Fig. 9. Table 8 describes the 20 distance-based features. Five face regions emerged as prominent facial features in recognizing attention: left jaw, right brow, right eye, nose, and gnathion (chin).
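This comparison can be sketched as a cross-validated sweep over the number of top-ranked distance features, assuming ranked_idx holds the 561 feature indices sorted by their threshold-distance score (as in the selection sketch in Sect. 4.2.1) and dist_feats_z and labels hold the standardized features and annotations.

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Evaluate tenfold cross-validated accuracy for the best 10, 20, and 30 features.
for k in (10, 20, 30):
    X_k = dist_feats_z[:, ranked_idx[:k]]
    acc = cross_val_score(SVC(kernel="rbf"), X_k, labels, cv=10, scoring="accuracy").mean()
    print(f"top-{k} features: accuracy = {acc:.3f}")
```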

Fig. 8
figure 8

Comparison of best features and classifier algorithms

Fig. 9
figure 9

Model comparison with Best 20 features

Table 8 Best 20 distance-based features for face-based attentional model

5.2.2 Generalized model

In this evaluation, the attentional model was trained on data from 17 participants and tested on one participant who was not part of the training data. The average model performance across all participants was slightly above chance (ROC-AUC = 0.541), as shown in Fig. 10. This result implies that a generalized attentional model performs weakly among children with ASD.

Fig. 10
figure 10

Average performance of the generalized model

5.2.3 Participant-specific model

The attentional model was trained and tested only on the data samples from each participant. The training and testing data were split in a ratio of 80% to 20%. The average model performance across all participants was above chance (ROC-AUC = 0.957), as shown in Fig. 11. This result illustrates that the performance of the participant-specific model was better than that of the generalized model. This finding is consistent with the heterogeneity of children with ASD, where each child responds differently to a stimulus (Wetherby and Prizant 2000).

Fig. 11
figure 11

Average performance of the participant-specific model

5.3 Gaze-based attentional model

This section presents the results of the gaze-based attentional model using six different classifier algorithms, followed by the feature selection and the participant-specific and generalized models.

5.3.1 Evaluation of gaze-based model

The evaluation metrics (ROC-AUC) of tenfold cross-validation for six different classifiers, KNN, CART, GBC, SVM, LR, and RF, trained on the 12 eye-tracking features described in Table 2, show that the best classifier was SVM (AUC = 0.990), followed by RF (AUC = 0.989). In the feature selection process, PupilLeft emerged as the best feature among the twelve, as shown in Fig. 12, while GazeRightx was the least relevant feature. The eye-tracking features were then evaluated incrementally, in their order of importance, for both the participant-specific and generalized models: the first feature, PupilLeft, was evaluated first, then PupilLeft and DistanceLeft together, and so on until all 12 features were included. The results show that the participant-specific model (AUC = 0.998) had better performance than the generalized model (AUC = 0.626) and that the combination of the first four features (PupilLeft, DistanceLeft, FixationDuration, and FixationX) achieved the best performance, as shown in Fig. 13.

Fig. 12
figure 12

Model selections for attentional model

Fig. 13
figure 13

Eye-tracking features in incremental order for specific and generalized models

The performance of the generalized and participant-specific gaze-based attentional models was compared between the ASD and TD groups. The participant-specific model had higher performance than the generalized model in the ASD group for all feature combinations. In contrast, the two models had roughly equal performance in the TD group, except that the participant-specific model performed better when fewer than four features were used.

5.3.2 Generalized model

This section presents the results of the SVM trained with the best four features: PupilLeft, DistanceLeft, FixationDuration, and FixationX. We trained the SVM classifier with samples from all participants except one, which was used for testing, and repeated this procedure for all participants before averaging the results. Five participants with only one class label (i.e., only attention or only inattention) were excluded from testing. Using the average score across all participants, the model performance was slightly above chance (ROC-AUC = 0.626), as shown in Fig. 14.

Fig. 14
figure 14

Average generalized model performances with four best eye-tracking features

5.3.3 Participant-specific model

Unlike the generalized model, the participant-specific model was trained and tested only on samples from each participant, using a splitting ratio of 80% to 20%, respectively. The model performance was averaged across all participants and was above chance (AUC = 0.998), as shown in Fig. 15. This result is far higher than the chance level, which implies that attention recognition among children with ASD can be achieved using personalized eye-tracking features.

Fig. 15
figure 15

Average participant-specific model performances with four best eye-tracking features

5.3.4 Comparison of gaze-based attentional model in ASD and TD groups

The performance of the generalized and participant-specific gaze-based attentional models was compared between the ASD and TD groups. We evaluated the incremental combination of all the features, starting by combining the most relevant feature, PupilLeft, with the other 11 features. In the ASD group, the performance of the participant-specific models was higher than that of the generalized models for all combinations. In the TD group, the performance of the participant-specific models was higher only for the first three feature combinations and was equal for the remaining combinations, as shown in Figs. 16 and 17.

Fig. 16
figure 16

Gaze-based model performances for generalized and specific models with children with ASD

Fig. 17
figure 17

Model performances for generalized and specific models with TD children

5.4 Hybrid-based attentional model

This section presents the results of the hybrid-based model for the participant-specific and generalized settings. We also compared the hybrid-based, gaze-based, and face-based models across the participant groups. The results show that the hybrid-based participant-specific model performed better than the generalized model. The comparison across the three models showed that the gaze-based model performed better than the face-based model in the ASD group, whereas the face-based model performed better in the TD group.

5.4.1 Generalized model performance

This section presents the results of the hybrid-based model using the best four eye-tracking features and the best twenty face-tracking features. Using the leave-one-out evaluation, the average score from the ASD participants gave a performance value close to chance level (ROC-AUC = 0.538), as shown in Fig. 18. A k-fold evaluation using a higher percentage of testing samples, with training and testing data in a ratio of 78% to 22%, was iterated 50 times to ensure performance reliability, and the average result was also only slightly above chance (ROC-AUC = 0.561). This result implies that the attentional behavior is diverse and does not have a regular pattern.

Fig. 18
figure 18

Hybrid-based model for generalized model (ASD)

5.4.2 Participant-specific model performance

The participant-specific model was trained on 80% of the data samples and tested with the remaining 20% from each participant. The model performance averaged across all participants was above chance (AUC = 0.996), as shown in Fig. 19. This result is far higher than the chance level, which implies that attention recognition among children with ASD can be achieved with personalized facial and eye-tracking features.

Fig. 19
figure 19

Hybrid-based model for participant-specific model (ASD)

5.5 Confusion matrix metrics for three models

This section presents the evaluation of the attentional model types developed in the current study using a confusion matrix to compare facial and gaze features from children with ASD. The model evaluation for participant 1 is used as a sample to illustrate model performance. Figure 20 shows the generalized and personalized models for the face-based and gaze-based models.

Fig. 20
figure 20

Confusion matrix evaluation metric for P1
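The cells of such a matrix can be computed directly from a model's held-out predictions, as in the brief sketch below; y_test and y_pred are assumed to come from one participant's 80/20 split, with 1 denoting attention and 0 inattention.

```python
from sklearn.metrics import confusion_matrix

# Four cells of the binary confusion matrix plus the derived precision/recall.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}  precision={precision:.2f} recall={recall:.2f}")
```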

5.5.1 Comparison of attentional model types in ASD and TD groups

The attentional model types were compared between the ASD and TD groups across the face-based, gaze-based, and hybrid-based models using data from attention task level 1. Figure 21 shows that the generalized model of every type (face-based, gaze-based, and hybrid-based) works better for the TD group than for the ASD group. Within the ASD group, the generalized gaze-based model had the highest performance score (AUC = 0.677) compared to the face-based (AUC = 0.535) and hybrid-based (AUC = 0.641) models. In contrast, the generalized hybrid-based model performed better (AUC = 0.963) than the gaze-based (AUC = 0.961) and face-based (AUC = 0.958) models in the TD group. With participant-specific models, the performance for the ASD group was higher than that for the TD group. In the ASD group, the gaze-based model had the highest performance score (AUC = 0.998) compared to the face-based (AUC = 0.957) and hybrid-based (AUC = 0.996) models. Likewise, for the TD group, the gaze-based model (AUC = 0.950) performed better than the hybrid-based (AUC = 0.943) and face-based (AUC = 0.934) models.

Fig. 21
figure 21

Comparisons of model types in ASD and TD groups

6 Discussion

6.1 Gaze-based attentional model

This study conducted an attention task experiment with children with ASD to generate eye-tracking features. The feature extraction was based on content- and physiology-based data describing attentional behavior during an attention task. The extracted features were used to train classifier algorithms to recognize attention and inattention behaviors. Two main types of models, participant-specific and generalized, were evaluated for the ASD and TD groups. The results show that the participant-specific model performed better than the generalized model for the ASD group, while both models had similar performance for the TD group with the best four features. This finding aligns with the concept of heterogeneity among children with ASD (Wetherby and Prizant 2000).

The best features identified for the gaze-based attentional model using embedded feature selection were a combination of PupilLeft, DistanceLeft, FixationDuration, and FixationX. These features show that the two eyes work differently during attention tasks. This finding supports the evidence of side glancing in children with ASD, a characteristic where they tend to look off to the side of a stimulus (Noris et al. 2012; Little 2018). Furthermore, the results of this study show that pupil size is the most significant feature for detecting attention.

Our approach to developing a gaze-based attentional model for children with ASD adds to the body of knowledge with two main contributions. First, this study showed that a personalized gaze-based attentional model is more suitable for children with ASD than a generalized model. Second, we identified the pupil size of the left eye as the most significant feature for developing a gaze-based attentional model for children with ASD.

6.2 Face-based attentional model

The face-based attentional model discussed in this study utilizes 34 facial landmarks with x and y coordinates generated in real time during the attention task. The landmark coordinates were transformed into distances between facial landmarks, and twenty distance-based features were identified as distinctive for differentiating attention from inattention. The best facial regions for recognizing attention were identified from these distance-based features: jaw, eyebrows, eyes, nose, and gnathion. Lastly, we evaluated model generalization for the ASD and TD groups and across attention tasks. According to the results of this study, the performance of both the participant-specific and generalized models was above chance; however, the participant-specific model had a better performance score than the generalized model.

This shows that the participant-specific model works better for children with ASD. Similarly, previous studies have concluded that children with ASD exhibit different face-based attentional behaviors (Bieberich and Morgan 2004; Czapinski and Bryson 2003). Findings from this study show that the face-based attentional model generalizes better among the TD group than the ASD group. Overall, the participant-specific model outperformed the generalized model. Therefore, a generalized face-based model for children with ASD and different attention tasks needs to be applied cautiously.

The implication of the face-based attentional model for children with ASD is that the personalized model supports the evidence of heterogeneity in individuals with ASD. This confirms the variation in facial features describing attentional behavior in children with ASD. Thus, our results suggest a personalized face-based attentional model instead of the traditional one-size-fits-all machine learning approach. A similar finding was reported by Rudovic and Lee (2018), who evaluated personalized and generalized deep learning models to detect affective states and engagement during human–robot interaction with children with ASD. Their findings showed that the personalized model outperformed the generalized model due to cultural and individual differences among the participants. Furthermore, recent research, especially in the ASD field, is heading down the path of personalization, such as a personalized autism diagnosis system (Dekhil et al. 2018), personalized and precise intervention for children with ASD (Stevens et al. 2019), and personalized medicine in ASD. The trend toward personalized models is not limited to education and medicine but also extends to commerce and many other fields.

The application of a personalized face-based attentional model can enhance adaptive learning support or human intervention support. The findings from this study support the evidence that facial features have the potential to assess attention (Nezami 2019; Dubbaka and Gopalan 2020). The three contributions of this study are 1) a distance-based feature selection method that differentiates attentional behaviors, 2) the identification of five face regions that best describe attentional behaviors among children with ASD, and 3) evidence that a personalized face-based attentional model outperforms a generalized one.

6.3 Hybrid-based attentional model

The hybrid-based attentional model consists of both face-tracking and eye-tracking data, which were fed into an SVM classifier to predict attentional behaviors. The two main types of models, participant-specific and generalized, were evaluated for the ASD and TD groups. The results show that none of the attentional model types, hybrid-based, face-based, or gaze-based, generalizes well in children with ASD. This finding aligns with the concept of heterogeneity among children with ASD (Wetherby and Prizant 2000). Also, eye-tracking measures aid in the detection of more attentional behaviors than face-tracking measures or the combination of eye-tracking and face-tracking measures. However, it is worth mentioning that our study has shown that face-tracking measures can also be sufficient for attention recognition.

There is a widespread assumption that models with multiple sensors or multimodal methods give better accuracy than models with fewer sensors or single-sensor methods (Aslan et al. 2014; Asteriadis et al. 2009; Shaker et al. 2013). Previous work on this assumption shows that the multimodal approach is not always the best. According to D'mello and Graesser (2010), combining two channels, such as facial features and body posture, achieved performance similar to combining all three channels of facial features, body posture, and interaction dialogue. Similarly, our results show that combining more methods does not guarantee better accuracy. For example, we expected the performance of the hybrid-based attentional model to be higher than that of the gaze-based or face-based models; however, the hybrid-based model performed better than the other models only in the TD group, not in the ASD group. Instead, the best model type for the ASD group was the gaze-based model.

7 Practical implication

The practical implication of this study is that it will benefit teachers and parents by saving time when assessing the attention of their students and children, respectively. Importantly, aside from providing stakeholders with the attention duration of children with ASD, it will also inform them of the types of classroom stimuli that distract their attention. Recently, we used the model developed in the current study to build a web-based application platform for personalized engagement assessment using an integrated webcam for face tracking and eye tracking. The application has four modules. The first module presents attention tasks for the initial calibration of personalized behavior during learning engagement. The attention task consists of target stimuli simulating typical classroom distractions, such as social, nonsocial, audio, and audiovisual distractions, presented at two levels (baseline and hard) to capture the degree of engagement of each user. The second module performs feature extraction from facial and gaze data during the attention task to train the face-based machine learning model and capture the user's gaze pattern. The third module is the intelligent component, which consists of machine learning models that detect when the user pays attention to the target stimuli in the attention task and when they are inattentive. In the fourth module, a personalized attentional report is generated for the stakeholders, showing the duration of attention throughout the learning session as well as the gaze patterns of each student.

8 Limitation

Despite its strengths, this study has several limitations. First, the attentional behavior of children with severe ASD was not considered, as they were not included in the data collection process because it is challenging to get them to sit through the attention task. Therefore, the attentional model may not generalize to children with severe ASD or attention deficit hyperactivity disorder (ADHD), since neither of these groups was included in the experiment. Second, the experiments were tailored to the experience of children with ASD and not adults, so our findings may not apply to techniques for measuring attention in adults with ASD. Third, not all possible eye-tracking measures were covered in this study, for example saccades, smooth pursuit, and blink rate; further studies are needed on how other eye-tracking metrics can influence the development of an attention recognition system for children with ASD. Fourth, the sample size used in this study is limited, as it is challenging to obtain parental consent due to the fear of unknown reactions from the children. Lastly, the facial feature extraction was primarily done on data from children with ASD, and those features were then applied to facial data from TD children; therefore, the attentional model may not give optimal performance with TD children.

9 Conclusion

This study has shown the potential of unobtrusive and non-invasive sensing technologies, such as webcams and eye-tracking devices, for extracting facial and gaze features that best describe the attentional behavior of children with ASD. Thus, designers and researchers can use either a webcam or an eye-tracking device as a tool to track the attention of children with ASD. This objective approach is a step toward solving the problem of attention assessment in children with ASD. The significant advantage of an objective attention detection system over subjective assessment is the provision of individualized attention assessment, which accommodates the heterogeneity of ASD. Further work could investigate behavioral features that describe auditory attention, as facial and eye-tracking features may be limited to visual attention. This work can also be extended to other populations with learning disabilities, such as dyslexia and Down syndrome, to achieve desired learning outcomes.