Data-Driven Temporal Filters and Alternatives to GMM in Speaker Verification |
| |
Affiliation: | 1. Oregon Graduate Institute of Science and Technology, Portland, Oregon; 2. International Computer Science Institute, Berkeley, California; 3. Indian Institute of Technology Madras, Chennai, India; 1. Dept. of Signal Theory, Telematics and Communications, University of Granada, Granada, Spain; 2. Dept. of Computer Science, University of Sheffield, Sheffield, UK |
| |
Abstract: | Malayath, Narendranath, Hermansky, Hynek, Kajarekar, Sachin, and Yegnanarayana, B., Data-Driven Temporal Filters and Alternatives to GMM in Speaker Verification, Digital Signal Processing 10 (2000), 55–74. This paper discusses the research directions pursued jointly at the Anthropic Signal Processing Group of the Oregon Graduate Institute and at the Speech and Vision Laboratory of the Indian Institute of Technology Madras. Current methods for speaker verification are based on modeling the speaker characteristics using Gaussian mixture models (GMM). The performance of these systems degrades significantly if the target speakers use a telephone handset that is different from the one used during training. Conventional methods for channel normalization include utterance-based mean subtraction (MS) and RelAtive SpecTrAl (RASTA) filtering. In this paper we introduce a novel method for designing filters that are capable of normalizing the variability introduced by different telephone handsets. The design of the filter is based on the estimated second-order statistics of handset variability. This filter is applied to the logarithmic energy outputs of Mel-spaced filter banks. We also demonstrate the effectiveness of the proposed channel-normalizing filter in improving speaker verification performance in mismatched conditions. GMM-based systems often use thousands of mixture components and hence require a large number of parameters to characterize each target speaker. To address this issue, we propose an alternative to GMM for modeling speaker characteristics. The alternative is based on speaker-specific mapping and relies on a speaker-independent representation of speech. |
| |
Keywords: | |
This article is indexed in ScienceDirect and other databases. |
|