A novel neural feature for a text-dependent speaker identification system


Muhammad S. A. Zilany

Abstract

A novel feature based on the simulated neural response of the auditory periphery was proposed in this study for a speaker identification system. A well-known computational model of the auditory-nerve (AN) fiber by Zilany and colleagues, which incorporates most of the stages and relevant nonlinearities observed in the peripheral auditory system, was employed to simulate neural responses to speech signals from different speakers. Neurograms were constructed from the responses of inner-hair-cell (IHC)-AN synapses with characteristic frequencies spanning the dynamic range of hearing. The synapse responses were passed through an analytical function to incorporate the effects of the absolute and relative refractory periods. The proposed IHC-AN neurogram feature was then used to train and test the text-dependent speaker identification system with standard classifiers. The performance of the proposed method was compared to that of existing baseline methods under both quiet and noisy conditions. While the performance using the proposed feature was comparable to that of existing methods in quiet environments, the neural feature yielded substantially better classification accuracy in noisy conditions, especially under white Gaussian and street noise. In addition, the performance of the proposed system was relatively insensitive to the type of distortion in the acoustic signal and to the choice of classifier. The proposed feature can also be employed to design a robust speech recognition system.
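As a rough illustration of the pipeline summarized above, the sketch below (not the author's implementation) bins simulated IHC-AN synapse outputs into a CF-by-time neurogram and feeds the flattened neurogram to a standard classifier. The run_an_model function is a hypothetical placeholder for the Zilany et al. auditory-periphery model, and the CF grid, frame length, and linear SVM are illustrative assumptions rather than values taken from the paper.

# Minimal sketch of the neurogram-feature pipeline described in the abstract.
# NOTE: run_an_model is a hypothetical placeholder for the Zilany et al. AN model;
# the study itself used the published implementation of that model.
import numpy as np
from sklearn.svm import SVC

def run_an_model(signal, fs, cfs):
    # Placeholder: return one synapse-output time series per characteristic
    # frequency (CF). Replace with calls to the real auditory-periphery model.
    return np.abs(np.random.randn(len(cfs), len(signal)))

def neurogram(signal, fs, cfs, frame_ms=16.0):
    # Average the simulated synapse output within short frames to form a
    # CF-by-time neurogram (the frame length here is an assumed value).
    rates = run_an_model(signal, fs, cfs)                 # (n_cf, n_samples)
    frame = int(fs * frame_ms / 1000.0)
    n_frames = rates.shape[1] // frame
    rates = rates[:, :n_frames * frame].reshape(len(cfs), n_frames, frame)
    return rates.mean(axis=2)                             # (n_cf, n_frames)

fs = 16000                                                # assumed sampling rate
cfs = np.logspace(np.log10(250.0), np.log10(8000.0), 32)  # assumed CF grid

# Toy stand-in data: equal-length "utterances" with speaker labels. Flattening
# each neurogram into one vector assumes equal utterance lengths, as in a
# text-dependent (fixed-phrase) task.
utterances = [np.random.randn(fs) for _ in range(4)]
speakers = [0, 0, 1, 1]

X = [neurogram(u, fs, cfs).flatten() for u in utterances]
clf = SVC(kernel="linear").fit(X, speakers)
print(clf.predict([neurogram(utterances[0], fs, cfs).flatten()]))

In practice, frame-level neurogram features with a GMM or SVM back-end, rather than the toy flattening and random stand-in signals used here, would be the natural way to train and test such a system.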

Article Details

How to Cite
Zilany, M. S. A. (2018). A novel neural feature for a text-dependent speaker identification system. Engineering and Applied Science Research, 45(2), 112–119. Retrieved from https://ph01.tci-thaijo.org/index.php/easr/article/view/81617
Section
ORIGINAL RESEARCH
Author Biography

Muhammad S. A. Zilany, Assistant Professor, Department of Computer Engineering, Faculty of Computer Science and Engineering, University of Hail, Hail 2440, Saudi Arabia

References

Wenndt SJ, Mitchell RL. Machine recognition vs human recognition of voices. In: Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2012. p. 4245-4248.

Davis SB, Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing. 1980; 28:357-366.

Makhoul J. Linear prediction: A tutorial review. Proceedings of the IEEE. 1975;63(4):561-580.

Nakagawa S, Wang L, Ohtsuka S. Speaker identification and verification by combining MFCC and phase information. IEEE Transactions on Audio, Speech, and Language Processing. 2012;20(4):1085-95.

Wu Q, Zhang L. Auditory sparse representation for robust speaker recognition based on tensor structure. EURASIP Journal on Audio, Speech, and Music Processing. 2008; 2008:1-9.

Chi T-S, Lin T-H, Hsu C-C. Spectro-temporal modulation energy based mask for robust speaker identification. The Journal of the Acoustical Society of America. 2012;131(5):EL368-EL374.

Wu Q, Zhang L, Shi G. Robust feature extraction for speaker recognition based on constrained nonnegative tensor factorization. Journal of Computer Science and Technology. 2010;25(4):745-754.

Wang J, Wang C, Chin Y, Chang P. Spectro-temporal receptive fields and MFCC balanced feature extraction for robust speaker recognition. Multimedia Tools and Applications. 2016;76(3).

Shao Y, Srinivasan S, Wang D. Incorporating auditory feature uncertainties in robust speaker identification. In: Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2007. p. IV-277-IV-280.

Ganapathy S, Thomas S, Hermansky H. Feature extraction using 2-D autoregressive models for speaker recognition. In: Odyssey 2012: The Speaker and Language Recognition Workshop; 2012.

Zhao X, Shao Y, Wang D. CASA-based robust speaker identification. IEEE Transactions on Audio, Speech, and Language Processing. 2012;20(5):1608-16.

Zilany MS, Bruce IC, Carney LH. Updated parameters and expanded simulation options for a model of the auditory periphery. The Journal of the Acoustical Society of America. 2014 Jan;135(1):283-6.

Miller MI, Barta PE, Sachs MB. Strategies for the representation of a tone in background noise in the temporal aspects of the discharge patterns of auditory‐nerve fibers. The Journal of the Acoustical Society of America. 1987; 81: 665-679.

Hines A, Harte N. Speech intelligibility prediction using a neurogram similarity index measure. Speech Communication. 2012;54(2):306–20.

Mamun N, Jassim WA, Zilany MS. Prediction of speech intelligibility using a neurogram orthogonal polynomial measure (NOPM). IEEE Transactions on Audio, Speech, and Language Processing. 2015 Apr 1;23(4):760-73.

Alam MS, Zilany MS, Jassim WA, Ahmad MY. Phoneme classification using the auditory neurogram. IEEE Access. 2017. doi: 10.1109/ACCESS.2016.2647229.

Islam MA, Zilany MS, Jassim WA. Neural-response-based text-dependent speaker identification under noisy conditions. In: International Conference for Innovation in Biomedical Engineering and Life Sciences; 2015 Dec 6. Springer Singapore. p. 11-14.

Islam MA, Jassim WA, Cheok NS, Zilany MS. A robust speaker identification system using the responses from a model of the auditory periphery. PLoS ONE. 2016 Jul 8;11(7):e0158520.

Brookes M. VOICEBOX: Speech processing toolbox for MATLAB [software]. Mar 2011.

Zilany MS, Bruce IC. Modeling auditory-nerve responses for high sound pressure levels in the normal and impaired auditory periphery. The Journal of the Acoustical Society of America. 2006 Sep;120(3):1446-66.

Studebaker GA, Sherbecoe RL, McDaniel DM, Gwaltney CA. Monosyllabic word recognition at higher-than-normal speech and noise levels. The Journal of the Acoustical Society of America. 1999;105(4):2431–44.

Dubno JR, Horwitz AR, Ahlstrom JB. Word recognition in noise at higher-than-normal levels: Decreases in scores and increases in masking. The Journal of the Acoustical Society of America. 2005;118(2):914–22.

Krishna BL, Semple MN. Auditory temporal processing: Responses to sinusoidally amplitude-modulated tones in the inferior colliculus. Journal of Neurophysiology. 2000;84:255-273.

Liberman MC. Auditory-nerve response from cats raised in a low-noise chamber. The Journal of the Acoustical Society of America. 1978;63(2):442–55.

Roffo G. Feature Selection Library (MATLAB Toolbox). 2016; arXiv preprint arXiv:1607.01327.

Ellis DP. PLP and RASTA (and MFCC, and inversion) in Matlab. 2005.

Ganapathy S, Thomas S, Hermansky H. Front-end for far-field speech recognition based on frequency domain linear prediction. In: Proceedings of Interspeech 2008; 2008.

Chang CC, Lin CJ. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST). 2011; 2: 27.

Bilmes JA. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. International Computer Science Institute; 1998. Technical Report TR-97-021.

Reynolds DA, Quatieri TF, Dunn RB. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing. 2000;10(1):19-41.