Bangla dataset and MMFCC in text-dependent speaker identification
Abstract
Automatic Speaker Identification (SID) is a challenging research topic that is typically addressed with either text-dependent or text-independent speech material. Most automatic SID systems, however, are designed for English speech. The main goal of this study is to present a text-dependent dataset based on Bangla speech. Three feature extractors were explored as front-end processors: the Mel-frequency Cepstral Coefficient (MFCC), the Gammatone Frequency Cepstral Coefficient (GFCC), and a newly developed feature, the Modified MFCC (MMFCC). SID accuracies were simulated under both clean and noisy conditions; four types of noise were added to the clean signals to generate noisy signals at signal-to-noise ratios (SNRs) ranging from -5 dB to 15 dB. A standard dataset based on English speech is also presented so that its SID accuracies can be compared with those obtained on the proposed Bangla dataset. The second goal of this study is to examine the MMFCC and establish its novelty in a text-dependent SID system. The results show that the MMFCC-based method significantly outperforms the MFCC- and GFCC-based methods under noisy conditions and produces comparable results in a clean environment.
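To make the noisy-signal generation described above concrete, the Python sketch below mixes a clean utterance with noise at a target SNR and then extracts conventional MFCC features. The file names, 16 kHz sampling rate, 5 dB SNR step, and 13-coefficient MFCC configuration are illustrative assumptions rather than the authors' exact setup; the GFCC and MMFCC front ends would replace the final feature-extraction step.

import numpy as np
import librosa

def mix_at_snr(clean, noise, snr_db):
    # Scale the noise so that clean + scaled noise has the requested SNR (in dB).
    noise = np.resize(noise, clean.shape)              # match signal lengths
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# Hypothetical file names; any clean utterance and noise recording will do.
clean, sr = librosa.load("bangla_utterance.wav", sr=16000)
noise, _ = librosa.load("babble_noise.wav", sr=16000)

for snr_db in (-5, 0, 5, 10, 15):   # spans the study's -5 to 15 dB range; step size assumed
    noisy = mix_at_snr(clean, noise, snr_db)
    mfcc = librosa.feature.mfcc(y=noisy, sr=sr, n_mfcc=13)   # baseline MFCC front end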
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
References
Islam MA, Jassim WA, Cheok NS, Zilany MSA. A robust speaker identification system using the responses from a model of the auditory periphery. PLoS ONE. 2016;11(7):e0158520. doi: 10.1371/journal.pone.0158520.
Reynolds DA, Rose RC. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans Speech Audio Process. 1995;3(1):72-83.
Das D. Utterance based speaker identification using ANN. Int J Comput Sci Eng Appl. 2014;4(4):15-28.
Davis SB, Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans Acoust Speech Signal Process. 1980;28(4):357-66.
Shao Y, Srinivasan S, Wang D. Incorporating auditory feature uncertainties in robust speaker identification. 2007 IEEE International Conference on Acoustics, Speech and Signal Processing; 2007 Apr 15-20; Honolulu, USA. USA: IEEE; 2007. p. 277-80.
Makhoul J. Linear prediction: a tutorial review. Proc IEEE. 1975;63(4):561-80.
Ganapathy S, Thomas S, Hermansky H. Feature extraction using 2-D autoregressive models for speaker recognition. Odyssey 2012: The Speaker and Language Recognition Workshop; 2012 Jun 25-28; Singapore.
Zhao X, Wang D. Analyzing noise robustness of MFCC and GFCC features in speaker identification. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing; 2013 May 26-31; Vancouver, Canada. USA: IEEE; 2013. p. 7204-8.
Islam MA. Modified mel-frequency cepstral coefficients (MMFCC) in robust text-dependent speaker identification. 2017 4th International Conference on Advances in Electrical Engineering (ICAEE); 2017 Sep 28-30; Dhaka, Bangladesh. USA: IEEE; 2017. p. 505-9.
Islam MA, Zilany MSA, Jassim WA. Neural-response-based text-dependent speaker identification under noisy conditions. In: Ibrahim F, Usman J, Mohktar M, Ahmad M, editors. International Conference for Innovation in Biomedical Engineering and Life Sciences, ICIBEL 2015; 2015 Dec 6-8; Putrajaya, Malaysia. Singapore: Springer; 2015. p. 11-4.
Zhao X, Shao Y, Wang D. CASA-based robust speaker identification. IEEE Trans Audio Speech Lang Process. 2012;20(5):1608-16.
Hansen JH, Hasan T. Speaker recognition by machines and humans: a tutorial review. IEEE Signal Process Mag. 2015;32(6):74-99.
Bilmes JA. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. International Computer Science Institute. 1998;4(510):126.
Togneri R, Pullella D. An overview of speaker identification: accuracy and robustness issues. IEEE Circ Syst Mag. 2011;11(2):23-61.
Ghahabi O, Hernando J. I-vector modeling with deep belief networks for multi-session speaker recognition. Odyssey 2014: The Speaker and Language Recognition Workshop; 2014 Jun 16-19; Joensuu, Finland. p. 305-10.
Zilany MS, Bruce IC. Modeling auditory-nerve responses for high sound pressure levels in the normal and impaired auditory periphery. J Acoust Soc Am. 2006;120(3):1446-66.
Zilany MS. A novel neural feature for a text-dependent speaker identification system. Eng Appl Sci Res. 2018;45(2):112-9.
Stevens SS. On the psychophysical law. Psychol Rev. 1957;64(3):153-81.
Stevens SS. Perceived level of noise by Mark VII and decibels (E). J Acoust Soc Am. 1972;51(2B):575-601.
Li Q, Huang Y. An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions. IEEE Trans Audio Speech Lang Process. 2011;19(6):1791-801.