Date of Award

2026

Document Type

Thesis

Degree Name

Master of Science in Artificial Intelligence

Department

Digital Engineering

Committee Chair and Members

Reda Nacif Elalaoui

Keywords

Emotion detection, GoEmotions, Mental health and AI, Multi-Label classification, Threshold optimization, Transformer models

Abstract

This study investigates the effectiveness of five community fine-tuned transformer models for fine-grained emotion detection on the GoEmotions dataset: SamLowe/roberta-base-go_emotions (RoBERTa-base), cirimus/modernbert-base-go-emotions (ModernBERT), mrm8488/deberta-v3-base-goemotions (DeBERTa-v3-base), bhadresh-savani/bert-base-go-emotion (BERT-base-cased), and tasinhoque/distilbert-go-emotions (DistilBERT). While the original GoEmotions research by Demszky et al. (2020) established a BERT-base baseline with a macro-F1 of 0.46, this thesis extends that work through independent empirical evaluation of five derivative models, systematic per-label threshold optimization, and comparative analysis of architectural trade-offs across these transformer architectures. Using the GoEmotions simplified test split (5,427 examples across 28 categories), all five models were evaluated at a fixed 0.5 threshold and then subjected to a threshold sweep (0.05–0.95 in increments of 0.05) to identify per-label optimal decision boundaries. At the standard 0.5 threshold, ModernBERT achieved the highest observed macro-F1 (0.471), followed by BERT-base-cased (0.460), RoBERTa (0.450), DistilBERT (0.423), and DeBERTa (0.012). The anomalously low DeBERTa result reflects a threshold calibration mismatch in the mrm8488 community checkpoint, where sigmoid output scores are compressed far below 0.5 across nearly all labels.
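The per-label sweep described above can be illustrated with a minimal sketch. It assumes the model outputs are already available as an array of sigmoid scores with one column per label. The function names and the toy data are hypothetical and are not taken from the thesis code, and the sketch makes no assumption about which split the thresholds are selected on.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for a single label; defined as 0.0 when nothing is true or predicted."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def macro_f1(y_true, probs, thresholds):
    """Unweighted mean of per-label F1; thresholds holds one value per label."""
    preds = (probs >= thresholds).astype(int)
    n_labels = y_true.shape[1]
    return float(np.mean([f1_binary(y_true[:, j], preds[:, j]) for j in range(n_labels)]))

def sweep_thresholds(y_true, probs, grid=None):
    """For each label, keep the grid threshold with the highest F1 (first one on ties)."""
    if grid is None:
        grid = np.round(np.arange(0.05, 0.951, 0.05), 2)  # 0.05, 0.10, ..., 0.95
    n_labels = y_true.shape[1]
    best_t = np.full(n_labels, 0.5)   # fall back to 0.5 if no threshold scores above 0
    best_f1 = np.zeros(n_labels)
    for j in range(n_labels):
        for t in grid:
            score = f1_binary(y_true[:, j], (probs[:, j] >= t).astype(int))
            if score > best_f1[j]:
                best_f1[j], best_t[j] = score, t
    return best_t, best_f1

# Toy data (illustrative only): 4 examples, 2 labels.
# Label 0 has scores compressed below 0.5, so it is never predicted at 0.5.
y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])
probs = np.array([[0.32, 0.12], [0.22, 0.03], [0.12, 0.62], [0.03, 0.22]])

fixed = macro_f1(y_true, probs, np.full(2, 0.5))
best_t, best_f1 = sweep_thresholds(y_true, probs)
tuned = macro_f1(y_true, probs, best_t)
print(fixed, best_t, tuned)
```

In the toy data the first label's scores never reach 0.5, so its F1 at the fixed threshold is zero and the macro average is pulled down; a lower per-label threshold recovers it.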

Per-label threshold optimization improved the macro-F1 of all five models. RoBERTa improved from 0.450 to 0.541 (20.2% relative gain), ModernBERT from 0.471 to 0.554 (17.5%), BERT-base-cased from 0.460 to 0.491 (6.8%), DistilBERT from 0.423 to 0.508 (20.0%), and DeBERTa from 0.012 to 0.084 (615.5%). The findings demonstrate that simple post-hoc threshold calibration can substantially improve fine-grained emotion detection without retraining. ModernBERT achieved the highest observed optimized macro-F1, consistent with the architectural advantages of its modern design. This work contributes original experimental evidence on threshold sensitivity across five architectures in multi-label emotion classification.

Beyond performance gains, this study documents a significant calibration failure in the DeBERTa-v3 community checkpoint, which achieved a macro-F1 of only 0.012 at the default 0.5 threshold. Even after the full per-label threshold sweep, recovery was limited (0.084 macro-F1), making this a cautionary case study on the necessity of independent validation for community-released model weights.
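One way to check for the kind of score compression described above is to count the labels whose sigmoid score never reaches the decision threshold on any example. The following is a minimal sketch under that assumption; the function name, the report fields, and the toy logits are hypothetical and do not come from the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compression_report(logits, threshold=0.5):
    """Flag labels whose highest sigmoid score stays below the threshold.

    Such labels are never predicted at that threshold, so their F1 is zero
    regardless of how well the scores rank the examples.
    """
    probs = sigmoid(np.asarray(logits, dtype=float))
    max_per_label = probs.max(axis=0)
    silent = max_per_label < threshold
    return {
        "max_per_label": max_per_label,
        "silent_labels": np.flatnonzero(silent),
        "silent_fraction": float(silent.mean()),
    }

# Toy logits (illustrative only): 2 examples, 2 labels.
report = compression_report([[-3.0, 1.0], [-2.0, -1.0]])
print(report["silent_labels"], report["silent_fraction"])
```

A high silent fraction at 0.5 suggests that a low macro-F1 comes from the decision threshold and not only from poor ranking, although it does not by itself show that lower thresholds will recover performance.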

Keywords: Emotion Detection, Natural Language Processing, Fine-Grained Emotion Classification, RoBERTa, DistilBERT, GoEmotions, Multi-label Classification, Threshold Optimization, Per-label Calibration, Transformer Models
