Table 4 An overview of the method used in the selected articles to address RQ3

From: Automatic engagement estimation in smart education/learning settings: a systematic review of engagement definitions, datasets, and methods

Author

Input device/modality

Input features

Estimation method

Performance metrics

Wang et al. (2010)

Thermal camera

Grayscale image pixels

Feature extraction: PCA, PCA + LDA, AAM, and AAM + LDA. Classification: KNN. Validation: LOOCV

Accuracy

Cocea et al. (2011)

Log file

30 log attributes

WEKA. 8 algorithms: 1) BNs, 2) LR, 3) simple logistic classification (SL), 4) instance-based classification with the IBk algorithm (IBk), 5) attribute-selected classification using the J48 classifier and best-first search (ASC), 6) bagging using a REP (reduced error pruning) tree classifier (B), 7) classification via regression (CvR), 8) DTs

Accuracy (highest 91%)

AlZoubi et al. (2012)

3 sensors (electrocardiogram (ECG), facial electromyogram (EMG), galvanic skin response (GSR)), webcam, screen recorder.

117 features (ECG, corrugator muscle EMG, fingertip GSR)

Pre-process: low-/high-pass filters. Feature extraction: Augsburg Biosignal Toolbox (Wagner et al.). Classification: PRTools 4.0 (de Ridder et al. 2017), a pattern recognition library for Matlab. 9 classifiers: 1) SVM with linear kernel (SVM1), 2) SVM with polynomial kernel (SVM2), 3) KNN (\(k=3\)), 4) KNN (\(k=5\)), 5) KNN (\(k=7\)), 6) NB, 7) Linear Bayes Normal Classifier (LBNC), 8) multinomial LR, 9) C4.5 DT. Validation: 10-fold cross-validation with a 20:7 train:test ratio

Kappa statistic and F1-scores. (KNN and LBNC yielded the best detection)

S-Syun et al. (2012)

Microphone, camera, depth sensor

Oculesic, kinesic, proxemic, vocalic, person identity cue features

Oculesic (gaze direction), kinesic (facial expression, movement, body posture/gesture), proxemic (body posture/gesture, spatial relation), vocalic (user call), and person identity cue (spatial relation, face identification) features. Feature extraction: OpenNI library. Binary classification (attention vs. inattention): fuzzy min-max neural network (FMMNN) classifier with 7 input nodes. Validation: 7:3 train:test split

Accuracy 86%

Whitehill et al. (2014)

Camera

Facial features

Feature extraction: using CERT. Binary Classification: Boost (BF), SVM (Gabor), MLR (CERT). Validation: 4-fold cross-validation

2-alternative forced choice (2AFC)

Schiavo et al. (2014)

Camera

Head movement and face features

Feature extraction: face actions and expression recognition (Joho et al. 2011). 3-class classification: SVM. Validation: LOOCV

Accuracy=73%, F-score = 63%

Woo-Han Yun et al. (2015)

Camera

55 features of face and head information

Pre-processing: median filtering and aggregation (mean, median, max, min, standard deviation (STD), range, rate of zero crossings (ZCR)). 4-class classification: relevance vector classifier (RVC), a sparse version of Bayesian kernel logistic regression or Gaussian process classification (GPC).

Accuracy = 78.53%, Balanced Accuracy = 70.64%

Gupta et al. (2016)

Camera

Image pixels

Classification: InceptionNet, C3D, LRCN.

Accuracy

Zaletelj et al. (2017)

Kinect One sensor

2D and 3D gaze point and body posture data

3-class classification: DT (simple and medium), KNN (coarse, medium, and weighted), bagged trees, subspace KNN

Accuracy = 75.3%

Monkaresi et al. (2017)

Kinect face tracker and ECG sensors (BIOPAC MP150 system)

Kinect face tracker features, LBP-TOP, heart rate data

Pre-process: RELIEF-F for feature selection, Synthetic Minority Oversampling Technique (SMOTE) to handle data imbalance. Classification using WEKA: updateable NB, BN, LR, classification via clustering, rotation forest, dagging. Validation: LOOCV.

AUC = 0.758 and 0.733.

Youssef et al. (2017)

No estimation method; only a dataset is proposed.

Zhalehpour et al. (2017)

Camera

Images

Face tracking: CHEHRA tracker. Classification: SVM

Accuracy: 5-class classification = 75.32%, 8-class = 65.84%

Hussain et al. (2018)

Log file

Number of clicks and activity types

Activity types include dataplus, forumng, glossary, oucollaborate, oucontent, resource, subpage, homepage, and URL. Binary classification: decision tree (DT), J48 (a member of the DT family), CART, JRip decision rules, GBDT, NB. Validation: 10-fold cross-validation

Accuracy, Recall, AUC, Kappa

Psaltis et al. (2018)

Kinect face tracker

Facial expressions, body motion features, average time of responsiveness

Features for emotional engagement: facial expression and body motion. Feature for behavioral engagement: average time of responsiveness. Binary classification: unimodal ANN classifiers. Validation: 4-fold cross-validation. Tested in three primary schools.

Accuracy = 85%

Rudovic et al. (2018b)

Audiovisual sensors from the NAO robot and physiological sensors providing heart rate, electrodermal activity, body temperature, and accelerometer data.

Face, body, physiology features, CARS, the demographic features (culture and gender)

Pre-process: OpenFace, OpenPose, openSMILE (Eyben et al. 2013), and self-built tools for feature extraction; DeepLift for feature selection. Regression: personalized perception of affect network (PPA-net), which is based on ANN and clustering using t-SNE.

Intra-class correlation (ICC) \(= 65\% \pm 24\) (average ± SD)

Ninaus et al. (2019)

Webcam

Image frames

Pre-process: Microsoft’s Emotion-API classifying the prevalence of the 6 basic emotions for each frame of the captured videos (’fear’ and ’disgust’ were excluded to improve data quality). Classification: SVM ensembles using the “classyfire” package in the R statistical environment. Questionnaires were analyzed using separate multivariate ANOVAs

Accuracy \(\approx 64.18\%\)

Yue et al. (2019)

Microsoft LifeCam webcam and Tobii Eye Tracker 4C

Video/images, eye movement, and clickstream data

Fine-tuning parameters by transfer learning for CNNs: VGG16, InceptionResNetV2. Classification: CNN and LSTM. Regression: CART, random forest, GBDT. Validation: 10-fold cross-validation.

Accuracy = 76.08% for facial expression recognition, 81% for eye movement behavior. R\(^{2}\) metric = 0.98 for course performance prediction.

Mollahosseini et al. (2019)

N/A

Images

CNN (AlexNet) and SVR on valence and arousal labels

RMSE, CORR, SAGR, CCC.

Celiktutan et al. (2019)

Cameras (2 static & 2 dynamic), 2 biosensors

Image, sensor data

Binary classification: SVMs. Validation: double LOOCV.

 

Youssef et al. (2019)

Robot’s camera

Distance; head, gaze, and face streams; speech; looking and listening.

Feature extraction: OpenFace and Pepper OKAO software. Binary classification: LR, DNN, GRU, LSTM. Validation: 3-fold cross validation

Accuracy, F1-Score, AUC

Olivetti et al. (2019)

Camera

Images (geometrical description)

3-class classification: SVM

The classification results were compared with the questionnaire responses.

Ashwin et al. (2020b)

Camera

299 x 299 x 3 RGB images with facial expressions, hand gestures, and body postures present

Pre-processing: data augmentation. Classification: transfer learning with Inception v3. Hybrid CNN = CNN-1 + CNN-2: CNN-1 for a single student in a single image frame, CNN-2 for multiple students in a single image frame. Validation: 10-fold cross-validation

Posed: accuracy = 86%, recall = 89%, precision = 91%, F1-score = 84%, AUC = 90%. Spontaneous: accuracy = 70%, recall = 72%, precision = 77%, F1-score = 62%, AUC = 69%

Ashwin et al. (2020a)

Camera

Images with facial expressions, hand gestures and body postures present

Classification: CNN pre-trained on the GoogleNet architecture (Krizhevsky et al. 2017). Validation: 10-fold cross-validation.

Accuracy = 76%

Pabba et al. (2022)

Camera

48x48 image pixels

Additional public datasets: BAUM-1, DAiSEE, and the Yawning Detection Dataset (YawDD)\(^{a}\). Pre-process: face and head detection (using a multi-task cascaded CNN (MTCNN)), face alignment, data augmentation. 6-class classification: CNN.

Accuracy = 76.9%

Duchetto et al. (2020)

Head camera of the robot

RGB frame-by-frame image

Face detection: CNN. Regression: LSTM. The model was built on the TOGURO dataset and evaluated on UE-HRI.

AUC=0.89

Yun et al. (2020)

Camera, Kinect V2

Facial features

Classification: CNN fine-tuned from a pre-trained network (VGG-3D model). Validation: 6-fold cross-validation, leave-one-labeler-out cross-validation (LOLOCV).

Accuracy, AUC of ROC (ROC), AUC of PRs (PRs), MCC, F1-score, balanced accuracy, sensitivity and specificity (true positive and true negative rates).

Zhang et al. (2020)

Camera

Grayscale images (100 x 100 pixels)

Feature extraction: adaptive weighted LGCP. Binary classification: fast sparse representation (AWLGCP & FSR). Validation: 10-fold validation. Comparison of four methods: CLBP-SRC, Gabor-SVM, active shape model-SVM, and AWLGCP & FSR.

 

Liao et al. (2021)

N/A

DAiSEE and EmotiW images

Face detection: MTCNN. Pre-process: resize images to 224\(\times\)224; pre-trained on VGGFace2. 4-class classification and regression: Deep Facial Spatiotemporal Network (DFSTN) = pretrained SE-ResNet-50 (SENet) for extracting facial spatial features and an LSTM network with global attention (GALN). Validation: 5-fold cross-validation.

Accuracy = 58.84% and MSE = 0.0422 on DAiSEE. MSE = 0.0736 on EmotiW.

Li et al. (2021)

Camera, log file

Facial features (Gaze, Pose, FAU) and 8 clinical behaviors

Performance (correctness) labelling of the problem-solving process (to measure cognitive engagement). Feature extraction: OpenFace. The mean and std of each facial feature are calculated. Feature selection: recursive feature elimination with random forest (RFE-RF). Binary classification: NB, KNN, DT, RF, SVM. Validation: 10-fold CV for feature selection. Students' self-reports of cognitive engagement states serve as the ground truth

 

Bhardwaj et al. (2021)

FER-2013 dataset (images) and MES dataset

Images

Face detection: OpenCV. Binary classification: CNN. First, the emotion weight matrix is calculated; MES is then computed and engagement detected.

 

Goldberg et al. (2021)

3 Cameras

Eye gaze, head pose, and facial expressions.

Feature extraction: OpenFace. Regression: Model 1: multiple linear regression. Model 2: two additional linear regressions. Model 3: adds learning prerequisites.

MSE = 0.05. Pearson correlation coefficient between the manual annotations' mean level and the prediction models: r = .70, p = 0

Chatterjee et al. (2021)

Electrocardiography, skin conductance, respiration, and skin temperature sensors; Yeti X microphone; webcams

Electrocardiography, skin conductance, respiration, skin temperature signals

Pre-process: low-pass/high-pass filters using MATLAB/Simulink. Regression: a binary decision tree, least-squares boosting, and random forest implemented in MATLAB 2020b. Validation: LOOCV

 

Youssef et al. (2021)

Robot’s camera

Distance; head, gaze, and face streams; speech; laser

Face detection: NAOqi People Perception. Face extraction: OKAO Vision software. Imbalanced issue: undersampling the “No breakdown” class and oversampling the “Breakdown” class using SMOTE. Binary classification: LR, LDA, RF, and MLP. Validation: 5-fold cross-validation.

AUC \(\approx\) 0.72

Sümer et al. (2021)

Camera

Face features, head pose (without facial landmarks)

Face detection: RetinaFace. Multi-channel setting: training Attention-Net for head pose estimation and Affect-Net for facial expression recognition (CNNs). Pre-process: PCA (for SVM). 3-class classification: SVM (with majority voting), RF, MLP, LSTM, with fine-tuning (transfer learning) on AffectNet for facial expression and Attention-Net (300W-LP) for head pose with ResNet-50. Different fusion strategies were tested with RF engagement classifiers. Self-supervision and representation learning were used on unlabelled classroom data.

AUC = 0.84 (with personalization). Attention-Net performed better than Affect-Net, since the criteria for the manual annotation of engagement are not directly related to gaze direction or facial expression.

Trindade et al. (2021)

Log file

Teacher and students attributes

WEKA. Random Forest generated the best result.

AUC

Ma et al. (2021)

DAiSEE dataset

Eye gaze, facial action unit, head pose (117 dimensions); and body pose (60 dimensions)

Feature extraction: OpenFace 2.0. Pre-process: 640 x 640 resolution at 10 fps. Feature fusion: Neural Turing Machine (NTM) architecture, which contains two basic components: a neural network controller and a memory bank. NTM workflow: read heads and write heads.

Accuracy = 60.2%

Thiruthvanathan et al. (2021)

Indian-origin face datasets: DAiSEE, iSAFE, ISED

508 images from ISED and iSAFE, 5295 images from DAiSEE.

Feature extraction: lightweight ResNet. Classification: ResNet-50 classifier (a CNN 50 layers deep).

Accuracy, Precision, Recall, Sensitivity, Specificity and F1 score

Altuwairqi et al. (2021b)

Camera, mouse, keyboard behaviour

Key frame facial expressions.

Transfer learning using FER2013 and real-world affective faces (RAF). 3-class classification: Naive Bayes (NB) classifier.

Accuracy and MSE.

Vanneste et al. (2021)

Camera

Upper body keypoints, eye gaze direction

Features for individual classification: upper-body keypoints (from 2-s clips); for collective classification: eye gaze direction. Classification: I3D model (CNN-based) (Carreira and Zisserman 2017). Multilevel regression: to investigate how the engagement cues relate to the engagement scores. The CST (collective state transition) is calculated to measure classroom engagement.

Recall and precision. Hand-raising and note-taking were not related to students' individual self-reported engagement scores.

Hasnine et al. (2021)

Camera

Video

Face detection: Dlib. 3-class classification: trained with FER2013; the concentration index (CI) is then calculated from eye gaze and emotion weights: CI = (emotion weight x gaze weight) / 4.5

Accuracy = 68%

Delgado et al. (2021)

Camera

Images

Classification: CNN architectures including MobileNet (efficient convolutional neural networks for mobile vision applications), VGG (very deep convolutional networks for large-scale image recognition), and Xception (deep learning with depth-wise separable convolutions).

 

Engwall et al. (2022)

Cameras and microphone

Audio and visual features

Feature extraction: OpenFace 2.0. Feature selection: verbal classification using bag-of-words representations, acoustic-based classification, video-based classification. Engagement classification from acoustic and visual features: SVM, DT, conditional random fields, KNN, HMM, Gaussian model, BN, and ANN. Engagement classification from vocal arousal: bidirectional LSTM speech emotion recognition implemented in the Matlab Deep Learning Toolbox; output: anger and happiness = High, neutral = Neutral, boredom and sadness = Low. Engagement classification from facial expression: two SVMs with linear and radial basis function (RBF) kernels.

Listener engagement classification reached 65% balanced accuracy

Mehta et al. (2022)

DAiSEE and EmotiW datasets

Images

Pre-processing: Dlib face detector. 4-class classification and regression: 3D CNN with a self-attention module, which enhances the discovery of patterns in the data by allowing the model to learn correlations between spatial or temporal dependencies at any two points in the input feature maps.

Classification accuracy = 63.59% on DAiSEE; regression MSE = 0.0347 on DAiSEE and 0.0877 on EmotiW

Dubovi et al. (2022)

Eye tracker, EDA wearable wristband sensor, and webcam

Facial expression, eye-tracking, and EDA data

The stream data were collected and analysed using iMotions 9.0 with 7 basic emotion annotations (joy, anger, surprise, contempt, fear, sadness, and disgust). Emotional engagement: a linear mixed-effects model (LMM) was established to estimate changes in the PANAS self-report. Cognitive engagement: ANOVA was performed to assess differences in the eye-tracking metrics.

 

Thomas et al. (2022)

Existing datasets\(^{b}\)

Visual and verbal features

Pre-process: slide area and figure detection using RetinaNet, unique slide detection using a Siamese network, text detection using the Character-Region Awareness For Text detection (CRAFT) model. Prediction: pretrained VGG-16 network. Supervised: LR with three classes (visual, verbal, or balanced). Unsupervised: clustering model with two clusters (visual, verbal). Binary classification: sequential modeling using a Temporal Convolutional Network (TCN) pre-trained with the Micro-Macro-Motion (MIMAMO) Net model (Deng et al. 2020).

At the segment level: accuracy = 76%, F1-score = 0.82, MSE = 0.04. At video level (binary classification: engaged/distracted): accuracy = 95%, F1-score = 0.97, MSE = 0.15

Shen et al. (2022)

JAFFE, CK+, and RAF-DB datasets

Images

Pre-process: MK-MMD to calculate the distribution distance between the extracted features. Transfer learning: a domain adaptation technique was used to leverage the additional facial images. Imbalanced issue: undersampling and data augmentation. 4-class classification: lightweight attention convolutional network for facial expression recognition. A soft attention module (SE) was adopted to reduce the impact of complex backgrounds.

Accuracy = 56%

Apicella et al. (2022)

EEG

EEG Signal

Pipeline: filter bank, Common Spatial Pattern, SVM. Pre-process: artifact removal using independent component analysis (ICA), namely the Runica module of the EEGLab tool. Feature extraction: 12-component filter bank. Imbalanced problem: stratified leave-2-trials-out. Binary classification: SVM, Linear Discriminant Analysis (LDA), KNN, shallow ANN, DNN, CNN (pre-trained with Common Spatial Pattern (CSP)).

SVM achieved the highest accuracy: 76.9% for cognitive engagement and 76.7% for emotional engagement.

  1. PCA principal component analysis; LDA linear discriminant analysis; AAM active appearance model; LOOCV leave-one-subject-out cross-validation; KNN K-nearest neighbors; BNs Bayesian nets; LR logistic regression; DTs decision trees; NB naive Bayes; LBP-TOP local binary patterns from three orthogonal planes; GBDT gradient boosting decision trees; CART classification and regression tree; CARS childhood autism rating scale; GRU gated recurrent unit; LSTM long short-term memory; CNN convolutional neural network
  2. LOOCV leave-one-subject-out cross-validation; LR logistic regression; RF random forest; LDA linear discriminant analysis; MLP multi-layer perceptron; DT decision tree; BN Bayesian network; HMM hidden Markov models; LSTM long short-term memory; EDA electrodermal activity; MK-MMD multi-kernel maximum mean discrepancy
  3. \(^{a}\) https://dx.doi.org/10.21227/e1qm-hb90
  4. \(^{b}\) ClassX, LectureVideoDB, IIIT-AR-13K, IIITB Online Lecture, IIITB Classroom Lecture dataset
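
Many of the classical pipelines summarized in the table share the same skeleton: extract hand-crafted features, train a conventional classifier (often an SVM or KNN), and report accuracy under k-fold or leave-one-out cross-validation. The sketch below is purely illustrative and is not taken from any of the reviewed articles; it assumes scikit-learn and uses a hypothetical, randomly generated feature matrix X and label vector y as stand-ins for extracted engagement features and labels.

```python
# Illustrative only: a generic engagement-classification pipeline of the kind
# summarized in Table 4 (features -> classifier -> cross-validation).
# X and y below are hypothetical stand-ins for extracted features
# (e.g., facial action units) and engagement labels.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 35))      # 200 samples, 35 extracted features
y = rng.integers(0, 3, size=200)    # 3 engagement classes (low/medium/high)

# Standardize features, then classify with an RBF-kernel SVM,
# mirroring the SVM-based rows of the table.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# 10-fold stratified cross-validation, the most common protocol in Table 4.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With real data, X would come from a feature extractor such as OpenFace, and accuracy would typically be reported alongside balanced accuracy or F1-score when classes are imbalanced, as several of the reviewed studies do.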