
1.

Text-independent speaker recognition using LSTM-RNN and speech enhancement

El-Moneim Samia Abd Nassar M. A. Dessouky Moawad I. Ismail Nabil A. El-Fishawy Adel S. Abd El-Samie Fathi E. 《Multimedia Tools and Applications》2020,79(33-34):24013-24028

The speaker recognition revolution has led to the inclusion of speaker recognition modules in several commercial products. Most published algorithms for speaker recognition focus on text-dependent speaker recognition. In contrast, text-independent speaker recognition is more advantageous as the client can talk freely to the system. In this paper, text-independent speaker recognition is considered in the presence of some degradation effects such as noise and reverberation. Mel-Frequency Cepstral Coefficients (MFCCs), spectrum and log-spectrum are used for feature extraction from the speech signals. These features are processed with the Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) as a classification tool to complete the speaker recognition task. The network learns to recognize the speakers efficiently in a text-independent manner when the recording circumstances are the same. The recognition rate reaches 95.33% using MFCCs, while it increases to 98.7% when using spectrum or log-spectrum. However, the system has difficulty recognizing speakers from different recording environments. Hence, different speech enhancement techniques, such as spectral subtraction and wavelet denoising, are used to improve the recognition performance to some extent. The proposed approach shows superiority when compared to the algorithm of R. Togneri and D. Pullella (2011).
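As an illustration of the spectral subtraction step mentioned in this abstract, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the speech has already been converted to per-frame magnitude spectra; the function names `estimate_noise` and `spectral_subtraction` and the `floor` parameter are hypothetical and chosen only for this example.

```python
def estimate_noise(frames):
    """Average each frequency bin over frames assumed to be speech-free."""
    count = len(frames)
    return [sum(bins) / count for bins in zip(*frames)]


def spectral_subtraction(noisy_mag, noise_mag, floor=0.01):
    """Subtract a noise magnitude estimate from each frame, bin by bin.

    noisy_mag: list of frames, each a list of magnitude-spectrum bins.
    noise_mag: one list of per-bin noise magnitudes.
    floor:     fraction of the noisy magnitude kept as a lower bound, so
               that no bin becomes negative (an assumption of this sketch).
    """
    cleaned = []
    for frame in noisy_mag:
        cleaned.append([max(y - n, floor * y) for y, n in zip(frame, noise_mag)])
    return cleaned


# Toy data: two noise-only frames, then one frame that contains speech.
noise_frames = [[1.0, 2.0, 1.0], [1.0, 2.0, 3.0]]
noise = estimate_noise(noise_frames)
enhanced = spectral_subtraction([[5.0, 2.5, 1.0]], noise)
print(enhanced)
```

In the last bin the noise estimate exceeds the observed magnitude, so the floor is what keeps the result positive.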


2.

Text-independent speaker verification using ant colony optimization-based selected features

Shahla Nemati Mohammad Ehsan Basiri 《Expert systems with applications》2011,38(1):620-630

With the growing trend toward remote security verification procedures for telephone banking, biometric security measures and similar applications, automatic speaker verification (ASV) has received a lot of attention in recent years. The complexity of an ASV system and its verification time depend on the number of feature vectors, their dimensionality, the complexity of the speaker models and the number of speakers. In this paper, we concentrate on optimizing the dimensionality of the feature space by selecting relevant features. At present there are several methods for feature selection in ASV systems. To improve the performance of the ASV system we present another method that is based on the ant colony optimization (ACO) algorithm. After the feature reduction phase, feature vectors are applied to a Gaussian mixture model universal background model (GMM-UBM), which is a text-independent speaker verification model. The performance of the proposed algorithm is compared to the performance of a genetic algorithm on the task of feature selection on the TIMIT corpus. The results of experiments indicate that with the optimized feature set, the performance of the ASV system is improved. Moreover, the speed of verification is significantly increased, since with ACO the number of features is reduced by over 80%, which consequently decreases the complexity of our ASV system.

3.

Text-independent speaker identification based on selection of the most similar feature vectors

Mohammad Soleymanpour Hossein Marvi 《International Journal of Speech Technology》2017,20(1):99-108

Speaker recognition has been one of the interesting issues in signal and speech processing over the last few decades. Feature selection is one of the main parts of a speaker recognition system and can improve the performance of the system. In this paper, we propose two methods to find the MFCC feature vectors with the highest similarity, which are applied to a text-independent speaker identification system. These feature vectors show individual properties of each person’s vocal tract that are mostly repeated. They are used to build the speaker’s model and to specify the decision boundary. We applied the MFCC of each window over the main signal as a feature vector and used clustering to obtain the feature vectors with the highest similarity. The speaker identification experiments are performed using the ELSDSR database, which consists of 22 speakers (12 male and 10 female), and a neural network is used as the classifier. The effect of three main parameters has been considered in the two proposed methods. Experimental results indicate that the performance of the speaker identification system has been improved in terms of accuracy and time consumption.

4.

Text-independent writer identification and verification using textural and allographic features

Bulacu M Schomaker L 《IEEE transactions on pattern analysis and machine intelligence》2007,29(4):701-717

The identification of a person on the basis of scanned images of handwriting is a useful biometric modality with application in forensic and historic document analysis and constitutes an exemplary study area within the research field of behavioral biometrics. We developed new and very effective techniques for automatic writer identification and verification that use probability distribution functions (PDFs) extracted from the handwriting images to characterize writer individuality. A defining property of our methods is that they are designed to be independent of the textual content of the handwritten samples. Our methods operate at two levels of analysis: the texture level and the character-shape (allograph) level. At the texture level, we use contour-based joint directional PDFs that encode orientation and curvature information to give an intimate characterization of individual handwriting style. In our analysis at the allograph level, the writer is considered to be characterized by a stochastic pattern generator of ink-trace fragments, or graphemes. The PDF of these simple shapes in a given handwriting sample is characteristic for the writer and is computed using a common shape codebook obtained by grapheme clustering. Combining multiple features (directional, grapheme, and run-length PDFs) yields increased writer identification and verification performance. The proposed methods are applicable to free-style handwriting (both cursive and isolated) and have practical feasibility, under the assumption that a few text lines of handwritten material are available in order to obtain reliable probability estimates.

5.

Copy-move forgery detection technique based on discrete cosine transform blocks features

Armas Vega Esteban Alejandro González Fernández Edgar Sandoval Orozco Ana Lucila García Villalba Luis Javier 《Neural computing & applications》2021,33(10):4713-4727

Neural Computing and Applications - With the increasing number of software applications that allow altering digital images and their ease of use, they weaken the credibility of an image. This...

6.

Enhanced approximate discrete cosine transforms for image compression and multimedia applications

Ezhilarasi R. Venkatalakshmi K. Khanth B. Pradeep 《Multimedia Tools and Applications》2020,79(13-14):8539-8552

Multimedia Tools and Applications - Energy compaction property of the Discrete Cosine Transform (DCT) leads its usage in image and video compression applications. Nowadays power consumption is the...

7.

Affect-insensitive speaker recognition systems via emotional speech clustering using prosodic features

Dongdong Li Yubo Yuan Zhaohui Wu Yingchun Yang 《Neural computing & applications》2015,26(2):473-484


8.

Higher order information set based features for text-independent speaker identification

Medikonda Jeevan Madasu Hanmandlu 《International Journal of Speech Technology》2018,21(3):451-461

In this paper, Type-2 Information Set (T2IS) features and Hanman Transform (HT) features are proposed as Higher Order Information Set (HOIS) based features for text-independent speaker recognition. The speech signals of different speakers, represented by Mel Frequency Cepstral Coefficients (MFCC), are converted into T2IS features and HT features by taking account of the cepstral and temporal possibilistic uncertainties. The features are classified by the Improved Hanman Classifier (IHC), Support Vector Machine (SVM) and k-Nearest Neighbours (kNN). The performance of the proposed approaches is tested in terms of speed, computational complexity, memory requirement and accuracy on three datasets, namely NIST-2003, the VoxForge 2014 speech corpus and the VCTK speech corpus, and compared with that of baseline features such as MFCC, ΔMFCC, ΔΔMFCC and GFCC under white Gaussian noise at different signal-to-noise ratios. The proposed features have reduced feature size, computational time and complexity, and their performance is not degraded in the noisy environment.
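The baseline list in this abstract appears to include first- and second-order difference (delta and delta-delta) MFCC features. The sketch below shows the commonly used regression formula for delta features in plain Python. It is not taken from the paper: the function name `delta`, the window `width` and the edge handling (repeating the first and last frame) are assumptions made for illustration.

```python
def delta(frames, width=2):
    """Regression-based delta features over a window of +/- width frames.

    frames: list of feature vectors (one list of floats per frame).
    Edge frames are handled by repeating the first/last frame.
    """
    num = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(num):
        vec = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, num - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            vec.append(acc / denom)
        out.append(vec)
    return out


# A one-dimensional "cepstral" track that rises by 1 per frame.
mfcc = [[0.0], [1.0], [2.0], [3.0], [4.0]]
d1 = delta(mfcc)   # first-order differences (delta)
d2 = delta(d1)     # second-order differences (delta-delta)
print(d1[2], d2[2])
```

For a linear ramp the delta in the middle of the track equals the slope, and the delta-delta there is zero, which is a quick sanity check on the formula.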

9.

Speaker-specific-text based speaker verification system using spectral and phase based features

B. Bharathi 《International Journal of Speech Technology》2017,20(3):465-474

In speaker recognition tasks, one of the reasons for reduced accuracy is closely resembling speakers in the acoustic space. In order to increase the discriminative power of the classifier, the system must be able to use only the unique features of a given speaker with respect to his/her acoustically resembling speaker. This paper proposes a technique to reduce the confusion errors by finding speaker-specific phonemes and formulating a text using the subset of phonemes that are unique, for the speaker verification task using an i-vector based approach. In this paper, spectral features such as linear prediction cepstral coefficients (LPCC) and perceptual linear prediction coefficients (PLP), and a phase feature such as modified group delay, are used in experiments to analyse the importance of speaker-specific text in the speaker verification task. Experiments have been conducted on the speaker verification task using speech data of 50 speakers collected in a laboratory environment. The experiments show that the equal error rate (EER) decreases significantly using the i-vector approach with speaker-specific text when compared to the i-vector approach with random text, using different spectral and phase based features.
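The equal error rate (EER) reported in this abstract is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a rough sketch of how it can be approximated from verification scores. The scores are invented toy values, and the function names `error_rates` and `equal_error_rate` are hypothetical.

```python
def error_rates(genuine, impostor, threshold):
    """Return (false acceptance rate, false rejection rate) at a threshold.

    A trial is accepted when its score is >= threshold.
    """
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate the EER by trying every observed score as a threshold
    and keeping the one where FAR and FRR are closest."""
    best = None
    for thr in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, thr)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, thr)
    return best[1], best[2]


# Toy scores: higher means "more likely the claimed speaker".
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(eer, threshold)
```

With only a handful of trials the FAR and FRR curves are coarse steps, so evaluation toolkits typically interpolate between thresholds rather than picking the closest observed score as done here.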

10.

ELM speaker identification for limited dataset using multitaper based MFCC and PNCC features with fusion score

K P Bharath M Rajesh Kumar 《Multimedia Tools and Applications》2020,79(39-40):28859-28883

In the current scenario, speaker recognition under noisy conditions is a major challenge in the area of speech processing. A noisy environment causes significant degradation in system performance. The major aim of the proposed work is to identify speakers under clean and noisy backgrounds using a limited dataset. In this paper, we propose multitaper based Mel frequency cepstral coefficients (MFCC) and power normalized cepstral coefficients (PNCC) techniques with fusion strategies. Here, we use MFCC and PNCC techniques with different multitapers to extract the desired features from the obtained speech samples. Then, cepstral mean and variance normalization (CMVN) and feature warping (FW) are applied to normalize the features obtained from both techniques. Furthermore, a low-dimensional i-vector model is used as the system model, and different score fusion strategies such as mean, maximum, weighted sum, cumulative and concatenated fusion are utilized. Finally, an extreme learning machine (ELM) is used for classification in order to increase the system identification accuracy (SIA); the ELM is a single-layer feedforward neural network that is less complex and less time consuming than other neural networks. TIMIT and SITW 2016 are the two databases used to evaluate the proposed system under limited data. Both clean and noisy background conditions are used to check the SIA.
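Two of the steps named in this abstract, cepstral mean and variance normalization and simple score fusion, are easy to sketch in plain Python. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the function names `cmvn` and `fuse`, the per-utterance normalization, and the `weight` parameter are hypothetical, and only the mean, maximum and weighted-sum rules are shown.

```python
import math


def cmvn(frames):
    """Normalise each feature dimension to zero mean and unit variance
    across the frames of one utterance."""
    num = len(frames)
    columns = list(zip(*frames))
    means = [sum(col) / num for col in columns]
    # A constant dimension has zero deviation; fall back to 1.0 so the
    # division below is safe and the dimension simply becomes all zeros.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / num) or 1.0
        for col, m in zip(columns, means)
    ]
    return [
        [(x - m) / s for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]


def fuse(score_a, score_b, weight=0.5):
    """Combine two subsystem scores (e.g. one per feature type)."""
    return {
        "mean": (score_a + score_b) / 2,
        "maximum": max(score_a, score_b),
        "weighted_sum": weight * score_a + (1 - weight) * score_b,
    }


features = [[1.0, 10.0], [3.0, 10.0]]   # two frames, two dimensions
normalised = cmvn(features)
fused = fuse(0.6, 0.8, weight=0.25)
print(normalised, fused)
```

The weighted sum only makes sense when both subsystems produce scores on a comparable scale, which is one reason normalization usually precedes fusion.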


11.

Video coding using geometry based block partitioning and reordering discrete cosine transform

Yi-xiong ZHANG Jiang-hong SHI Wei-dong WANG 《Journal of Zhejiang University: Series C (English Edition)》2012,(1):71-82

Geometry based block partitioning (GBP) has been shown to achieve better performance than the tree structure based block partitioning (TSBP) of H.264. However, the residual blocks of GBP mode after motion compensation still present some non-vertical/non-horizontal orientations, and the conventional discrete cosine transform (DCT) may generate many high-frequency coefficients. To solve this problem, in this paper we propose a video coding approach using GBP and reordering DCT (RDCT) techniques. In the proposed approach, GBP is first applied to partition the macroblocks. Then, before performing DCT, a reordering operation is used to adjust the pixel positions of the residual macroblocks based on the partition information. In this way, the partition information of GBP can be used to represent the reordering information of RDCT, and the bitrate can be reduced. Experimental results show that, compared to H.264/AVC, the proposed method achieves on average 6.38% and 5.69% bitrate reductions at low and high bitrates, respectively.

12.

Principles and typical computational limitations of sparse speaker separation based on deterministic speech features

Kern A Stoop R 《Neural computation》2011,23(9):2358-2389

The separation of mixed auditory signals into their sources is an eminent neuroscience and engineering challenge. We reveal the principles underlying a deterministic, neural network-like solution to this problem. This approach is orthogonal to ICA/PCA that views the signal constituents as independent realizations of random processes. We demonstrate exemplarily that in the absence of salient frequency modulations, the decomposition of speech signals into local cosine packets allows for a sparse, noise-robust speaker separation. As the main result, we present analytical limitations inherent in the approach, where we propose strategies of how to deal with this situation. Our results offer new perspectives toward efficient noise cleaning and auditory signal separation and provide a new perspective of how the brain might achieve these tasks.

13.

Multimedia document retrieval using speech and speaker recognition

Mahesh Viswanathan Homayoon S.M. Beigi Satya Dharanipragada Fereydoun Maali Alain Tritschler 《International Journal on Document Analysis and Recognition》2000,2(4):147-162

Speech and speaker recognition systems are rapidly being deployed in real-world applications. In this paper, we discuss the details of a system and its components for indexing and retrieving multimedia content derived from broadcast news sources. The audio analysis component calls for real-time speech recognition for converting the audio to text and concurrent speaker analysis consisting of the segmentation of audio into acoustically homogeneous sections followed by speaker identification. The output of these two simultaneous processes is used to abstract statistics to automatically build indexes for text-based and speaker-based retrieval without user intervention. The real power of multimedia document processing is the possibility of Boolean queries in the form of combined text- and speaker-based user queries. Retrieval for such queries entails combining the results of individual text and speaker based searches. The underlying techniques discussed here can easily be extended to other speech-centric applications and transactions.

14.

Speech fragment decoding techniques for simultaneous speaker identification and speech recognition (total citations: 1; self-citations: 1; citations by others: 1)

Jon Barker Ning Ma André Coy Martin Cooke 《Computer Speech and Language》2010,24(1):94-111

This paper addresses the problem of recognising speech in the presence of a competing speaker. We review a speech fragment decoding technique that treats segregation and recognition as coupled problems. Data-driven techniques are used to segment a spectro-temporal representation into a set of fragments, such that each fragment is dominated by one or other of the speech sources. A speech fragment decoder is used which employs missing data techniques and clean speech models to simultaneously search for the set of fragments and the word sequence that best matches the target speaker model. The paper investigates the performance of the system on a recognition task employing artificially mixed target and masker speech utterances. The fragment decoder produces significantly lower error rates than a conventional recogniser, and mimics the pattern of human performance that is produced by the interplay between energetic and informational masking. However, at around 0 dB the performance is generally quite poor. An analysis of the errors shows that a large number of target/masker confusions are being made. The paper presents a novel fragment-based speaker identification approach that allows the target speaker to be reliably identified across a wide range of SNRs. This component is combined with the recognition system to produce significant improvements. When the target and masker utterance have the same gender, the recognition system has a performance at 0 dB equal to that of humans; in other conditions the error rate is roughly twice the human error rate.

15.

Text-independent speaker recognition based on the Hurst parameter and the multidimensional fractional Brownian motion model

Sant'Ana R. Coelho R. Alcaim A. 《IEEE transactions on audio, speech, and language processing》2006,14(3):931-940

In this paper, a text-independent automatic speaker recognition (ASkR) system is proposed, the SR_Hurst, which employs a new speech feature and a new classifier. The statistical feature pH is a vector of Hurst (H) parameters obtained by applying a wavelet-based multidimensional estimator (M_dim_wavelets) to the windowed short-time segments of speech. The proposed classifier for the speaker identification and verification tasks is based on the multidimensional fBm (fractional Brownian motion) model, denoted by M_dim_fBm. For a given sequence of input speech features, the speaker model is obtained from the sequence of vectors of H parameters, means, and variances of these features. The performance of the SR_Hurst was compared to those achieved with the Gaussian mixture models (GMMs), autoregressive vector (AR), and Bhattacharyya distance (dB) classifiers. The speech database, recorded from fixed and cellular phone channels, was uttered by 75 different speakers. The results have shown the superior performance of the M_dim_fBm classifier and that the pH feature aggregates new information on the speaker identity. In addition, the proposed classifier employs a much simpler modeling structure as compared to the GMM.

16.

Detecting individual role using features extracted from speaker diarization results

Benjamin Bigot Isabelle Ferrané Julien Pinquier Régine André-Obrecht 《Multimedia Tools and Applications》2012,60(2):347-369

In the field of automatic audiovisual content-based indexing and structuring, finding events like interviews, debates, reports, or live commentaries requires bridging the gap between low-level feature extraction and such high-level event detection. In our work, we consider that detecting speaker roles like Anchor, Journalist and Other is a first step to enrich interaction sequences between speakers. Our work relies on the assumption of the existence of clues about speaker roles in temporal, prosodic and basic signal features extracted from audio files and from speaker segmentations. Each speaker is therefore represented by a 36-feature vector. Contrary to most state-of-the-art proposals, we do not use the structure of the document to recognize the roles of the speakers. We investigate the influence of two dimensionality reduction techniques (Principal Component Analysis and Linear Discriminant Analysis) and different classification methods (Gaussian Mixture Models, K-nearest neighbours and Support Vector Machines). Experiments are done on the 13-h corpus of the ESTER2 evaluation campaign. The best result reaches about 82% of well recognized roles. This corresponds to more than 89% of speech duration correctly labelled.

17.

Performance of speaker identification using CSM and TM

R. Visalakshi P. Dhanalakshmi 《International Journal of Speech Technology》2016,19(3):457-465

The main objective of this paper is to develop a speaker identification system. Speaker identification is a technology that allows a computer to automatically identify the person who is speaking, based on the information received from the speech signal. One of the most difficult problems in speaker recognition is dealing with noise. The performance of speaker recognition using a close speaking microphone (CSM) is affected by background noise. To overcome this problem, a throat microphone (TM) is used; its transducer is held at the throat, resulting in a clean signal that is unaffected by background noise. Acoustic features, namely linear prediction coefficients, linear prediction cepstral coefficients, Mel frequency cepstral coefficients and relative spectral transform-perceptual linear prediction, are extracted. These features are classified using RBFNN and AANN and their performance is analyzed. A new method is proposed for identification of speakers in clean and noisy conditions using combined CSM and TM. The identification performance of the combined system is higher than that of either individual system due to the complementary nature of CSM and TM.

18.

Multimodal speaker identification using an adaptive classifier cascade based on modality reliability

Engin Erzin Yemez Y. Tekalp A.M. 《Multimedia, IEEE Transactions on》2005,7(5):840-852

We present a multimodal open-set speaker identification system that integrates information coming from audio, face and lip motion modalities. For fusion of multiple modalities, we propose a new adaptive cascade rule that favors reliable modality combinations through a cascade of classifiers. The order of the classifiers in the cascade is adaptively determined based on the reliability of each modality combination. A novel reliability measure, that genuinely fits to the open-set speaker identification problem, is also proposed to assess accept or reject decisions of a classifier. A formal framework is developed based on probability of correct decision for analytical comparison of the proposed adaptive rule with other classifier combination rules. The proposed adaptive rule is more robust in the presence of unreliable modalities, and outperforms the hard-level max rule and soft-level weighted summation rule, provided that the employed reliability measure is effective in assessment of classifier decisions. Experimental results that support this assertion are provided.

19.

Modified dense convolutional networks based emotion detection from speech using its paralinguistic features

Dhiman Ritika Kang Gurkanwal Singh Gupta Varun 《Multimedia Tools and Applications》2021,80(21-23):32041-32069

Multimedia Tools and Applications - Emotion recognition through speech is one of the fundamental approaches for human interaction. Speech modulations stipulate different emotions and context. In...

20.

Cancellable template generation for speaker recognition based on spectrogram patch selection and deep convolutional neural networks

El-Moneim Samia A. Nassar M. A. Dessouky Moawad I. Ismail Nabil A. El-Fishawy Adel S. El-Samie Fathi E. Abd 《International Journal of Speech Technology》2022,25(3):689-696

International Journal of Speech Technology - Nowadays, biometric systems have replaced password-or token-based authentication systems in many fields to improve the security level. However,...