Improving Estimation Of Valence And Arousal Emotion Dimensions Based On Emotion Unit

21-04-2021 02:02

The objective of this research is to improve the estimation accuracy of two emotion dimensions: valence and arousal. Previous studies on speech emotion recognition (SER) mostly assumed that the affective content is stable and unchanging throughout an entire utterance, and therefore treated the whole utterance as a single unit when estimating these dimensions. However, this assumption does not hold, especially for long utterances, because emotion is dynamic and may fluctuate over the course of an utterance. Consequently, the low-level descriptors extracted from such utterances are less effective for SER systems, since they are a mixture of different affective states. Most of these studies did not investigate the proper time scale to use when extracting features. Therefore, a novel emotion unit based on voiced segments is proposed to improve the estimation accuracy. The proposed method is evaluated with an SER system based on the dimensional approach using support vector regression, and validated on the EMO-DB database. Accuracy is measured by the mean absolute error (MAE) of the estimated valence and arousal values. The results reveal that emotion units containing three and four voiced segments give the lowest MAE for valence and arousal, respectively. The proposed method using the voiced-segment-based emotion unit outperforms the conventional method using the utterance as the unit for both valence and arousal. The MAE improves from 0.68 to 0.51 for the valence dimension, and from 0.34 to 0.21 for the arousal dimension.
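The idea of an emotion unit built from voiced segments, and the MAE metric used to score it, can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: voiced segment boundaries are taken as already available from some voicing detector, consecutive segments are grouped into non-overlapping units, and a shorter trailing group is kept as its own unit. The function names `group_into_emotion_units` and `mean_absolute_error` and all numeric values are hypothetical and are not taken from EMO-DB or from the authors' implementation.

```python
from typing import List, Sequence, Tuple

# One voiced segment as (start_seconds, end_seconds). Assumed to come from
# an external voicing detector, which is not shown here.
Segment = Tuple[float, float]


def group_into_emotion_units(segments: Sequence[Segment], n: int) -> List[Segment]:
    """Merge every n consecutive voiced segments into one emotion unit.

    Assumption: units do not overlap, and a trailing group with fewer
    than n segments is kept as a shorter unit.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    units: List[Segment] = []
    for i in range(0, len(segments), n):
        chunk = segments[i:i + n]
        # A unit spans from the start of its first segment
        # to the end of its last segment.
        units.append((chunk[0][0], chunk[-1][1]))
    return units


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between annotated and estimated values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only.
voiced = [(0.0, 0.2), (0.3, 0.5), (0.6, 0.9), (1.0, 1.2), (1.3, 1.5)]
units = group_into_emotion_units(voiced, n=3)   # [(0.0, 0.9), (1.0, 1.5)]
mae = mean_absolute_error([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])  # 1.0
```

In a full system, features would be extracted per unit and passed to a regressor such as support vector regression, with the MAE computed separately for valence and for arousal; those steps are outside this sketch.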