Date of Award
2026
Document Type
Thesis
Degree Name
Master of Science in Artificial Intelligence
Department
Digital Engineering
Committee Chair and Members
Reda Nacif Elalaoui
Keywords
Emotion detection, GoEmotions, Mental health and AI, Multi-Label classification, Threshold optimization, Transformer models
Abstract
This study investigates the effectiveness of five community fine-tuned transformer models for fine-grained emotion detection on the GoEmotions dataset: SamLowe/roberta-base-go_emotions (RoBERTa-base), cirimus/modernbert-base-go-emotions (ModernBERT), mrm8488/deberta-v3-base-goemotions (DeBERTa-v3-base), bhadresh-savani/bert-base-go-emotion (BERT-base-cased), and tasinhoque/distilbert-go-emotions (DistilBERT). While the original GoEmotions research by Demszky et al. (2020) established a BERT-base baseline with a macro-F1 of 0.46, this thesis extends that work through independent empirical evaluation of five derivative models, systematic per-label threshold optimization, and comparative analysis of architectural trade-offs across the five architectures. Using the GoEmotions simplified test split (5,427 examples across 28 categories), all five models were evaluated at a fixed 0.5 threshold and then subjected to a threshold sweep (0.05–0.95 in increments of 0.05) to identify per-label optimal decision boundaries. At the standard 0.5 decision boundary, ModernBERT achieved the highest observed macro-F1 (0.471), followed by BERT-base-cased (0.460), RoBERTa (0.450), DistilBERT (0.423), and DeBERTa (0.012). The anomalously low DeBERTa result reflects a threshold calibration mismatch in the mrm8488 community checkpoint, where post-sigmoid scores are compressed far below 0.5 across nearly all labels.
Per-label threshold optimization improved macro-F1 for all five models. RoBERTa improved from 0.450 to 0.541 (20.2% relative gain), ModernBERT from 0.471 to 0.554 (17.5%), BERT-base-cased from 0.460 to 0.491 (6.8%), DistilBERT from 0.423 to 0.508 (20.0%), and DeBERTa from 0.012 to 0.084 (615.5%). The findings demonstrate that simple post-hoc threshold calibration can substantially improve fine-grained emotion detection without retraining. ModernBERT achieved the highest observed optimized macro-F1, consistent with the architectural advantages of its modern design. This work contributes original experimental evidence on threshold sensitivity across five architectures in multi-label emotion classification.
Beyond performance gains, this study documents a significant calibration failure in the DeBERTa-v3 community checkpoint, which achieved a macro-F1 of only 0.012 at default thresholds. Even after exhaustive threshold optimization, recovery was limited (0.084 macro-F1), serving as a cautionary case study on the necessity of independent validation for community-released model weights.
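As a rough illustration of the per-label threshold sweep described above, the following minimal Python sketch picks, for each label, the grid value (0.05 to 0.95 in steps of 0.05) that maximizes that label's F1, then compares macro-F1 at a fixed 0.5 threshold against the swept thresholds. It is not the thesis code: the function names (`f1_binary`, `macro_f1`, `sweep_thresholds`) and the toy data are hypothetical, and the sketch simply tunes on whatever arrays it is given, since the abstract does not state which split was used for tuning.

```python
from typing import List, Optional, Sequence


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for one label; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: List[List[int]], probs: List[List[float]],
             thresholds: Sequence[float]) -> float:
    """Unweighted mean of per-label F1, using one threshold per label."""
    scores = []
    for j, t in enumerate(thresholds):
        col_true = [row[j] for row in y_true]
        col_pred = [int(row[j] >= t) for row in probs]
        scores.append(f1_binary(col_true, col_pred))
    return sum(scores) / len(scores)


def sweep_thresholds(y_true: List[List[int]], probs: List[List[float]],
                     grid: Optional[Sequence[float]] = None) -> List[float]:
    """For each label, return the grid threshold with the highest F1.

    Ties keep the lowest threshold, because only strict improvements
    replace the current best.
    """
    if grid is None:
        grid = [round(0.05 * k, 2) for k in range(1, 20)]  # 0.05 ... 0.95
    best = []
    for j in range(len(y_true[0])):
        col_true = [row[j] for row in y_true]
        col_prob = [row[j] for row in probs]
        best_t, best_f1 = 0.5, -1.0
        for t in grid:
            f1 = f1_binary(col_true, [int(p >= t) for p in col_prob])
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        best.append(best_t)
    return best


# Toy data (hypothetical): label 0 is reasonably calibrated, while
# label 1 has scores compressed well below 0.5.
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
probs = [[0.9, 0.10], [0.2, 0.30], [0.7, 0.25], [0.4, 0.05]]

fixed = macro_f1(y_true, probs, [0.5, 0.5])
tuned_thresholds = sweep_thresholds(y_true, probs)
tuned = macro_f1(y_true, probs, tuned_thresholds)
print(f"macro-F1 at 0.5: {fixed:.3f}; after sweep: {tuned:.3f}")
```

In this toy case the second label is never predicted at 0.5 because its scores all fall below that boundary, which is the same kind of mismatch the abstract attributes to the DeBERTa checkpoint. Tuning and scoring on the same examples, as the sketch does for brevity, gives an optimistic estimate; a held-out split for threshold selection would be the more conservative design.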
Keywords: Emotion Detection, Natural Language Processing, Fine-Grained Emotion Classification, RoBERTa, DistilBERT, GoEmotions, Multi-label Classification, Threshold Optimization, Per-label Calibration, Transformer Models
Recommended Citation
Patchipulusu, Sai Puneet Naga Venkata Subramanyam, "Beyond fixed thresholds: Per-label calibration for fine-grained emotion detection on the GoEmotions dataset" (2026). Selected Full-Text Master Theses 2021-. 54.
https://digitalcommons.liu.edu/brooklyn_fulltext_master_theses/54