Author | Input device/modality | Input features | Estimation method | Performance metrics |
---|---|---|---|---|
Wang et al. (2010) | Thermal camera | Grayscale image pixels | Feature extraction: PCA, PCA + LDA, AAM, and AAM + LDA. Classification: KNN. Validation: LOOCV | Accuracy |
Cocea et al. (2011) | Log file | 30 log attributes | WEKA. 8 algorithms: 1) BNs, 2) LR, 3) simple logistic classification (SL), 4) instance-based classification with the IBk algorithm (IBk), 5) attribute-selected classification using the J48 classifier and best-first search (ASC), 6) bagging using a REP (reduced error pruning) tree classifier (B), 7) classification via regression (CvR), 8) DTs | Accuracy (highest 91%) |
AlZoubi et al. (2012) | 3 sensors (electrocardiogram (ECG), facial electromyogram (EMG), galvanic skin response (GSR)), webcam, screen recorder | 117 features (ECG, corrugator muscle EMG, fingertip GSR) | Pre-process: low/high-pass filters. Feature extraction: Augsburg Biosignal Toolbox (Wagner et al.). Classification: PRTools 4.0 (de Ridder et al. 2017), a pattern recognition library for Matlab. 9 classifiers: 1) SVM with linear kernel (SVM1), 2) SVM with polynomial kernel (SVM2), 3) KNN (\(k=3\)), 4) KNN (\(k=5\)), 5) KNN (\(k=7\)), 6) NB, 7) Linear Bayes Normal Classifier (LBNC), 8) Multinomial LR, 9) C4.5 DT. Validation: 10-fold cross-validation with 20:7 train:test ratio | Kappa statistic and F1-scores (KNN and LBNC yielded the best detection) |
S-Syun et al. (2012) | Microphone, camera, depth sensor | Oculesic, kinesic, proxemic, vocalic, and person-identity-cue features | Oculesic (gaze direction), kinesic (facial expression, movement, body posture/gesture), proxemic (body posture/gesture, spatial relation), vocalic (user call), person identity cue (spatial relation, face identification). Feature extraction: OpenNI library. Binary classification (inattention vs. attention): fuzzy min-max neural network (FMMNN) classifier with 7 input nodes. Validation: 7:3 train:test split | Accuracy 86% |
Whitehill et al. (2014) | Camera | Facial features | Feature extraction: using CERT. Binary classification: Boost (BF), SVM (Gabor), MLR (CERT). Validation: 4-fold cross-validation | 2-alternative forced choice (2AFC) |
Schiavo et al. (2014) | Camera | Head movement and face features | Feature extraction: face action and expression recognition (Joho et al. 2011). 3-class classification: SVM. Validation: LOOCV | Accuracy = 73%, F-score = 63% |
Woo-Han Yun et al. (2015) | Camera | 55 features of face and head information | Pre-processing: median filtering and aggregation methods (mean, median, max, min, standard deviation (STD), range, rate of zero crossings (ZCR)). 4-class classification: relevance vector classifier (RVC), a sparse version of Bayesian kernel logistic regression or Gaussian process classification (GPC). | Accuracy = 78.53%, Balanced Accuracy = 70.64% |
Gupta et al. (2016) | Camera | Image pixels | Classification: InceptionNet, C3D, LRCN. | Accuracy |
Zaletelj et al. (2017) | Kinect One sensor | 2D and 3D gaze point and body posture data | 3-class classification: DT (simple and medium), KNN (coarse, medium, and weighted), Bagged Trees, Subspace KNN | Accuracy = 75.3% |
Monkaresi et al. (2017) | Kinect face tracker and ECG sensors (BIOPAC MP150 system) | Kinect face tracker features, LBP-TOP, heart rate data | Pre-process: RELIEF-F for feature selection, Synthetic Minority Oversampling Technique (SMOTE) to handle data imbalance. Classification using WEKA: updateable NB, BN, LR, classification via clustering, rotation forest, dagging. Validation: LOOCV. | AUC = 0.758 and 0.733 |
Youssef et al. (2017) | N/A | N/A | No estimation method; dataset proposal only. | N/A |
Zhalehpour et al. (2017) | Camera | Images | Face tracking: CHEHRA tracker. Classification: SVM. | Accuracy: 5-class classification = 75.32%, 8-class = 65.84% |
Hussain et al. (2018) | Log file | Number of clicks and activity types | Activity types include dataplus, forumng, glossary, oucollaborate, oucontent, resource, subpage, homepage, and URL. Binary classification: decision tree (DT), J48 (a DT-family algorithm), CART, JRIP decision rules, GBDT, NB. Validation: 10-fold cross-validation | Accuracy, Recall, AUC, Kappa |
Psaltis et al. (2018) | Kinect face tracker | Facial expression, body motion features, average time of responsiveness | Features for emotional engagement: facial expression and body motion. Feature for behavioral engagement: average time of responsiveness. Binary classification: unimodal ANN classifiers. Validation: 4-fold validation. Tested in three primary schools. | Accuracy = 85% |
Rudovic et al. (2018b) | Audiovisual sensors from NAO robot and physiological sensors providing heart rate, electrodermal activity, body temperature, and accelerometer data | Face, body, physiology features, CARS, demographic features (culture and gender) | Pre-process: OpenFace, OpenPose, openSMILE (Eyben et al. 2013), and self-built tools for feature extraction. DeepLift for feature selection. Regression: personalized perception of affect network (PPA-net), which is based on ANN and clustering using t-SNE. | Intra-class correlation (ICC) \(= 65\% \pm 24\) (average ± SD) |
Ninaus et al. (2019) | Webcam | Image frames | Pre-process: Microsoft's Emotion-API classifying the prevalence of the 6 basic emotions for each frame of the captured videos ('fear' and 'disgust' were excluded to enhance data quality). Classification: SVM ensembles using the "classyfire" package in the R statistical environment. Questionnaires were analyzed using separate multivariate ANOVAs | Accuracy \(\approx 64.18\%\) |
Yue et al. (2019) | Microsoft LifeCam webcam and Tobii Eye Tracker 4C | Video/images, eye movement, and click-stream data | Fine-tuning parameters by transfer learning for CNN: VGG16, InceptionResNetv2. Classification: CNN and LSTM. Regression: CART, random forest, GBDT. Validation: 10-fold cross-validation. | Accuracy = 76.08% for facial expression recognition, 81% for eye movement behavior. R\(^2\) = 0.98 for course performance prediction. |
Mollahosseini et al. (2019) | N/A | Images | CNN (AlexNet) and SVR on Valence and Arousal labels | RMSE, CORR, SAGR, CCC |
Celiktutan et al. (2019) | Cameras (2 static & 2 dynamic), 2 biosensors | Image, sensor data | Binary classification: SVMs. Validation: a double LOOCV. | |
Youssef et al. (2019) | Robot's camera | Distance; head, gaze, and face streams; speech; looking and listening | Feature extraction: OpenFace and Pepper OKAO software. Binary classification: LR, DNN, GRU, LSTM. Validation: 3-fold cross-validation | Accuracy, F1-score, AUC |
Olivetti et al. (2019) | Camera | Images (geometrical description) | 3-class classification: SVM | The classification result was compared with the questionnaire. |
Ashwin et al. (2020b) | Camera | 299×299×3 RGB images with facial expressions, hand gestures, and body postures present | Pre-processing: data augmentation. Classification: transfer learning with Inception v3. Hybrid CNN = CNN-1 + CNN-2: CNN-1 for a single student in a single image frame, CNN-2 for multiple students in a single image frame. Validation: 10-fold cross-validation | Posed: accuracy = 86%, recall = 89%, precision = 91%, F1-score = 84%, AUC = 90%. Spontaneous: accuracy = 70%, recall = 72%, precision = 77%, F1-score = 62%, AUC = 69% |
Ashwin et al. (2020a) | Camera | Images with facial expressions, hand gestures, and body postures present | Classification: CNN pre-trained on the GoogleNet architecture (Krizhevsky et al. 2017). Validation: 10-fold cross-validation. | Accuracy = 76% |
Pabba et al. (2022) | Camera | 48×48 image pixels | Additional public datasets: BAUM-1, DAiSEE, and Yawning Detection Dataset (YawDD)\(^{a}\). Pre-process: face and head detection (using multi-task cascaded CNN (MTCNN)), face alignment, data augmentation. 6-class classification: CNN. | Accuracy = 76.9% |
Duchetto et al. (2020) | Head camera of the robot | RGB frame-by-frame images | Face detection: CNN. Regression: LSTM. Model built on the TOGURO dataset and evaluated on UE-HRI. | AUC = 0.89 |
Yun et al. (2020) | Camera, Kinect V2 | Facial features | Classification: CNN fine-tuned from a pre-trained network (VGG-3D model). Validation: 6-fold cross-validation, leave-one-labeler-out cross-validation (LOLOCV). | Accuracy, AUC of ROC (ROC), AUC of PRs (PRs), MCC, F1-score, balanced accuracy, specificity (true negative rate) |
Zhang et al. (2020) | Camera | Grayscale images (100×100 pixels) | Feature extraction: adaptive weighted LGCP. Binary classification: fast sparse representation (AWLGCP & FSR). Validation: 10-fold validation. Comparison of four methods (CLBP-SRC, Gabor-SVM, active shape model-SVM, and AWLGCP & FSR). | |
Liao et al. (2021) | N/A | DAiSEE and EmotiW images | Face detection: MTCNN. Pre-process: resize images to 224\(\times\)224 and pre-trained on VGGFace2. 4-class classification and regression: Deep Facial Spatiotemporal Network (DFSTN) = pretrained SE-ResNet-50 (SENet) for extracting facial spatial features, and LSTM Network with Global Attention (GALN). Validation: 5-fold cross-validation. | Accuracy = 58.84% and MSE = 0.0422 on DAiSEE. MSE = 0.0736 on EmotiW. |
Li et al. (2021) | Camera, log file | Facial features (gaze, pose, FAU) and 8 clinical behaviors | Performance (correctness) labelling for the problem-solving process (to measure cognitive engagement). Feature extraction: OpenFace. Mean and std calculated for each facial feature. Feature selection: recursive feature elimination with random forest (RFE-RF). Binary classification: NB, KNN, DT, RF, SVM. Validation: 10-fold CV for feature selection. Students' self-reports of cognitive engagement states used as ground truth | |
Bhardwaj et al. (2021) | FER-2013 dataset (images) and MES dataset | Images | Face detection: OpenCV. Binary classification: CNN. First calculate the weight matrix of emotions, then calculate MES and detect engagement. | |
Goldberg et al. (2021) | 3 cameras | Eye gaze, head pose, and facial expressions | Feature extraction: OpenFace. Regression: Model 1: multiple linear regression. Model 2: two additional linear regressions. Model 3: adds learning prerequisites. | MSE = 0.05. Pearson correlation coefficient between the manual annotations' mean level and the prediction models: r = .70, p = 0 |
Chatterjee et al. (2021) | Electrocardiography, skin conductance, respiration, skin temperature, Yeti X microphone, webcams | Electrocardiography, skin conductance, respiration, skin temperature signals | Pre-process: lowpass/highpass filters using MATLAB/Simulink. Regression: a binary decision tree, least-squares boosting, and random forest implemented in MATLAB 2020b. Validation: LOOCV | |
Youssef et al. (2021) | Robot's camera | Distance; head, gaze, and face streams; speech; laser | Face detection: NAOqi People Perception. Face extraction: OKAO Vision software. Imbalanced issue: undersampling the "No breakdown" class, oversampling the "Breakdown" class using SMOTE. Binary classification: LR, LDA, RF, and MLP. Validation: 5-fold cross-validation. | AUC \(\approx\) 0.72 |
Sümer et al. (2021) | Camera | Face features, head pose (without facial landmarks) | Face detection: RetinaFace. Multi-channel setting: training Attention-Net for head pose estimation and Affect-Net for facial expression recognition (CNN). Pre-process: PCA (for SVM). 3-class classification: SVM (with majority voting), RF, MLP, LSTM with fine-tuning (transfer learning) on AffectNet for facial expression and Attention-Net (300W-LP) for head pose with ResNet-50. Tested with different fusion strategies using RF engagement classifiers. Self-supervision and representation learning used on unlabelled classroom data. | AUC = 0.84 (with personalization). Attention-Net outperformed Affect-Net, given that the criteria for the manual annotation of engagement are not directly related to gaze direction or facial expression. |
Trindade et al. (2021) | Log file | Teacher and students attributes | WEKA. Random Forest generated the best result. | AUC |
Ma et al. (2021) | Use DAiSEE | Eye gaze, facial action units, head pose (117 dimensions); and body pose (60 dimensions) | Feature extraction: OpenFace 2.0. Pre-process: 640×640 resolution at 10 fps. Feature fusion: Neural Turing Machine (NTM) architecture, which contains two basic components: a neural network controller and a memory bank. NTM workflow: read heads and write heads. | Accuracy = 60.2% |
Thiruthvanathan et al. (2021) | Indian-origin face datasets DAiSEE, iSAFE, ISED | 508 images from ISED and iSAFE, 5295 images from DAiSEE | Feature extraction: lightweight ResNet. Classification: ResNet classifier (a 50-layer-deep CNN). | Accuracy, Precision, Recall, Sensitivity, Specificity, and F1-score |
Altuwairqi et al. (2021b) | Camera, mouse, keyboard behaviour | Key-frame facial expressions | Transfer learning using FER2013 and Real-world Affective Faces (RAF). 3-class classification: Naive Bayes (NB) classifier. | Accuracy and MSE |
Vanneste et al. (2021) | Camera | Upper-body keypoints, eye gaze direction | Features for individual classification: upper-body keypoints (from 2 s clips); for collective classification: eye gaze direction. Classification: i3D model (CNN-based) (Carreira and Zisserman 2017). Multilevel regression: to investigate how the engagement cues relate to the engagement scores. The CST (collective state transition) is calculated to measure classroom engagement. | Recall and Precision. Hand-raising and note-taking are not related to students' individual self-reported engagement scores. |
Hasnine et al. (2021) | Camera | Video | Face detection: Dlib. 3-class classification: training with FER2013, then calculate the concentration index (CI) based on eye gaze and emotion weights. CI = (Emotion Weight x Gaze Weight) / 4.5 | Accuracy = 68% |
Delgado et al. (2021) | Camera | Images | Classification: CNN family including MobileNet (efficient convolutional neural networks for mobile vision applications), VGG (very deep convolutional networks for large-scale image recognition), and Xception (deep learning with depth-wise separable convolutions). | |
Engwall et al. (2022) | Cameras and microphone | Audio and visual features | Feature extraction: OpenFace 2.0. Feature selection: verbal classification using bag-of-words representations, acoustic-based classification, video-based classification. Engagement classification through acoustic and visual features: SVM, DT, Conditional Random Fields, KNN, HMM, Gaussian model, BN, and ANN. Engagement classification through vocal arousal: bidirectional LSTM Speech Emotion Recognition implementation in the Matlab Deep Learning Toolbox. Output: anger and happiness = High, neutral = Neutral, boredom and sadness = Low. Engagement classification through facial expression: two SVMs with linear and radial basis function (RBF) kernels. | Listener engagement classification reached 65% balanced accuracy |
Mehta et al. (2022) | Use DAiSEE and Emoti-W datasets | Images | Pre-processing: Dlib face detector. 4-class classification and regression: 3D CNN with a self-attention module, which enhances the discovery of new patterns in data by allowing models to learn deeper correlations between spatial or temporal dependencies between any two points in the input feature maps. | Classification accuracy = 63.59% on DAiSEE; regression MSE = 0.0347 on DAiSEE and 0.0877 on Emoti-W |
Dubovi et al. (2022) | Eye tracker, EDA wearable wristband sensor, and webcam | Facial expression, eye-tracking, and EDA data | The streamed data was collected and analysed using iMotions 9.0 with 7 basic emotion annotations (joy, anger, surprise, contempt, fear, sadness, and disgust). Emotional engagement: a Linear Mixed Effects Model (LMM) was established to estimate the self-reported changes in the PANAS self-report. Cognitive engagement: ANOVA was performed to assess differences in the eye-tracking metrics. | |
Thomas et al. (2022) | Use existing dataset\(^{b}\) | Visual and verbal features | Pre-process: slide area and figure detection using RetinaNet, unique slide detection using a Siamese network, text detection using the Character-Region Awareness For Text detection (CRAFT) model. Prediction: pretrained VGG-16 network. Supervised: LR with three classes (visual, verbal, or balanced). Unsupervised: clustering model with two clusters (visual, verbal). Binary classification: sequential modeling using a Temporal Convolutional Network (TCN) pre-trained with the Micro-Macro-Motion (MIMAMO) Net model (Deng et al. 2020). | At the segment level: accuracy = 76%, F1-score = 0.82, MSE = 0.04. At the video level (binary classification: engaged/distracted): accuracy = 95%, F1-score = 0.97, MSE = 0.15 |
Shen et al. (2022) | Use JAFFE, CK+, RAF-DB dataset | Images | Pre-process: MK-MMD to calculate the distribution distance between the extracted features. Transfer learning: Domain adaptation technique was used to explore the additional facial images. Imbalanced issue: undersampling, and data augmentation. 4-class classification: lightweight attention convolutional network for face expression recognition. Soft attention module (SE) was adopted to reduce the impact of the complex background. | Accuracy = 56% |
Apicella et al. (2022) | EEG | EEG signal | Pipeline: filter bank, Common Spatial Pattern, SVM. Pre-process: artifact removal using independent component analysis (ICA), namely the Runica module of the EEGLab tool. Feature extraction: 12-component filter bank. Imbalanced issue: stratified leave-2-trials-out. Binary classification: SVM, Linear Discriminant Analysis (LDA), KNN, shallow ANN, DNN, CNN (pre-trained on Common Spatial Pattern (CSP)). | SVM achieved the highest accuracy = 76.9% for cognitive engagement and 76.7% for emotional engagement. |
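Several rows above pair a KNN classifier with leave-one-out cross-validation (LOOCV), e.g. Wang et al. (2010) and AlZoubi et al. (2012). A minimal sketch of that evaluation loop, assuming toy 2-D feature vectors and engagement labels (both hypothetical, not from any of the surveyed datasets):

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

def loocv_accuracy(X, y, k=3):
    """LOOCV: hold out each sample once, train on the rest, score the average."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_X, train_y, X[i], k) == y[i]:
            correct += 1
    return correct / len(X)

# Hypothetical feature vectors for two well-separated engagement classes.
X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15],
     [0.9, 0.8], [0.8, 0.9], [0.85, 0.85]]
y = ["disengaged"] * 3 + ["engaged"] * 3
print(loocv_accuracy(X, y, k=3))  # → 1.0 on this separable toy set
```

In practice the surveyed papers run this loop over extracted feature vectors (PCA components, OpenFace outputs, biosignal statistics) rather than raw pixels, typically via a toolbox such as WEKA or PRTools rather than hand-rolled code.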
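Cohen's kappa appears as a performance metric in several rows (AlZoubi et al. 2012; Hussain et al. 2018). A short sketch of the standard formula \(\kappa = (p_o - p_e)/(1 - p_e)\), computed from paired label lists; the example labels are made up for illustration:

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    # Observed agreement: fraction of exact matches.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected chance agreement from the marginal label frequencies.
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary engagement annotations vs. classifier output.
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # → 0.5
```

Kappa is preferred over raw accuracy in these studies because engagement labels are usually imbalanced, and chance-corrected agreement is harder to inflate by always predicting the majority class.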
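The concentration index (CI) used by Hasnine et al. (2021) is given explicitly in the table as CI = (Emotion Weight × Gaze Weight) / 4.5. A direct transcription of that formula; the weight values and scales below are hypothetical, as the table does not specify the weighting scheme:

```python
def concentration_index(emotion_weight, gaze_weight):
    """Concentration index per Hasnine et al. (2021):
    CI = (emotion weight * gaze weight) / 4.5."""
    return (emotion_weight * gaze_weight) / 4.5

# Hypothetical per-frame weights derived from emotion and gaze detectors.
print(concentration_index(3.0, 1.5))  # → 1.0
```

The emotion weight would come from the FER2013-trained expression classifier and the gaze weight from the eye-gaze estimate, with the 4.5 divisor acting as a fixed normalization constant.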