The Comparison of Thai Speech Emotional Features for LSTM Classifier

Choopan Rattanapoka
Mongkon Duangdoaw
Noppanut Phetponpun

Abstract

The human voice, in its various tones, can effectively express human emotions. It is undeniable that human emotions significantly influence our daily lives. Many studies have attempted to improve machines' ability to comprehend human emotion in order to develop better Human-Computer Interaction (HCI) applications. As a result, this study presents the design and development of models for emotion recognition from Thai male speech. We examined the use of the chromagram (Chroma), Mel spectrogram, and Mel-frequency cepstral coefficients (MFCC) with seven Long Short-Term Memory (LSTM) networks to distinguish four emotion classes: anger, happy, sad, and neutral. Additionally, we created a dataset of 1,000 audio files recorded from 11 Thai males, with 250 audio files per emotion. We then trained our seven models on this dataset. Our findings revealed that the model using only the MFCC feature yielded the best results, with precision, recall, and F1 scores of 0.730, 0.739, and 0.732, respectively.
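To make the pipeline described above concrete, the following is a minimal sketch of how MFCC features could be extracted and fed to an LSTM classifier for the four emotion classes. It is not the authors' code: it assumes librosa for feature extraction and TensorFlow/Keras for the network, and the hyperparameters (40 MFCC coefficients, a 200-frame sequence length, 128 LSTM units) are illustrative assumptions rather than values reported in the paper.

    # Hypothetical sketch of an MFCC-only LSTM emotion classifier.
    # Paths, label encoding, and hyperparameters are illustrative assumptions.
    import numpy as np
    import librosa
    import tensorflow as tf

    EMOTIONS = ["anger", "happy", "sad", "neutral"]   # four target classes
    N_MFCC = 40                                       # assumed number of coefficients
    MAX_FRAMES = 200                                  # assumed fixed sequence length

    def extract_mfcc(path):
        """Load one audio file and return a (MAX_FRAMES, N_MFCC) MFCC sequence."""
        y, sr = librosa.load(path, sr=None)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T   # (frames, N_MFCC)
        # Pad or truncate so every utterance has the same number of frames.
        if mfcc.shape[0] < MAX_FRAMES:
            pad = np.zeros((MAX_FRAMES - mfcc.shape[0], N_MFCC))
            mfcc = np.vstack([mfcc, pad])
        return mfcc[:MAX_FRAMES]

    def build_model():
        """A single-feature (MFCC-only) LSTM classifier over the four emotions."""
        model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(MAX_FRAMES, N_MFCC)),
            tf.keras.layers.LSTM(128),                 # assumed number of units
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(len(EMOTIONS), activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    # Usage (hypothetical file list and integer labels):
    # X = np.stack([extract_mfcc(p) for p in wav_paths])
    # model = build_model()
    # model.fit(X, labels, epochs=50, validation_split=0.2)

The same structure would extend to the Chroma and Mel-spectrogram variants compared in the paper by substituting librosa.feature.chroma_stft or librosa.feature.melspectrogram in the extraction step and adjusting the input feature dimension accordingly.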

Article Details

How to Cite
[1] C. Rattanapoka, M. Duangdoaw, and N. Phetponpun, “The Comparison of Thai Speech Emotional Features for LSTM Classifier,” ECTI-CIT Transactions, vol. 17, no. 4, pp. 500–509, Nov. 2023.
Section
Research Article
