1.
The speaker recognition revolution has led to the inclusion of speaker recognition modules in several commercial products. Most published algorithms for speaker recognition focus on text-dependent speaker recognition. In contrast, text-independent speaker recognition is more advantageous, as the client can talk freely to the system. In this paper, text-independent speaker recognition is considered in the presence of degradation effects such as noise and reverberation. Mel-Frequency Cepstral Coefficients (MFCCs), spectrum and log-spectrum are used for feature extraction from the speech signals. These features are processed with a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) as the classification tool that completes the speaker recognition task. The network learns to recognize the speakers efficiently in a text-independent manner when the recording circumstances are the same. The recognition rate reaches 95.33% using MFCCs, and increases to 98.7% when using spectrum or log-spectrum. However, the system has difficulty recognizing speakers recorded in different environments. Hence, speech enhancement techniques such as spectral subtraction and wavelet denoising are used to improve the recognition performance to some extent. The proposed approach shows superiority when compared to the algorithm of R. Togneri and D. Pullella (2011).
2.
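In vector quantization (VQ) systems for speaker recognition, K-means codebook design easily becomes trapped in local optima, and the choice of the initial codebook strongly affects the final codebook. To address these problems, a genetic algorithm (GA) is combined with VQ based on a non-parametric model, yielding a GA-K algorithm for VQ codebook design. The algorithm uses the global optimization ability of the GA to obtain an optimal VQ codebook, avoiding the tendency of the LBG algorithm to converge to local optima. Using the GA's own parameters together with the fast convergence of K-means, it searches for the globally optimal codebook in the training vector space. Experimental results show that the GA-K algorithm outperforms the LBG algorithm and balances convergence against recognition rate well.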
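The abstract above does not specify the genetic operators, so the following is only a minimal sketch of how a GA and K-means could be combined for codebook design. The function names (`ga_k_codebook`, `kmeans_step`, `distortion`), the elitist selection, the uniform crossover, the Gaussian mutation and all default parameters are illustrative assumptions, not the authors' GA-K algorithm.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def distortion(codebook, data):
    """Mean squared distance from each training vector to its nearest codeword."""
    return sum(min(sq_dist(x, c) for c in codebook) for x in data) / len(data)

def kmeans_step(codebook, data):
    """One K-means update: nearest-codeword assignment, then centroid recomputation."""
    cells = [[] for _ in codebook]
    for x in data:
        j = min(range(len(codebook)), key=lambda k: sq_dist(x, codebook[k]))
        cells[j].append(x)
    new_codebook = []
    for c, cell in zip(codebook, cells):
        if cell:
            new_codebook.append(tuple(sum(col) / len(cell) for col in zip(*cell)))
        else:
            new_codebook.append(c)  # an empty cell keeps its old codeword
    return new_codebook

def ga_k_codebook(data, size, pop=8, generations=20, sigma=0.5, seed=0):
    """Hypothetical GA + K-means hybrid; operators and defaults are assumptions."""
    rng = random.Random(seed)
    # Each individual is a codebook initialised from randomly chosen training vectors.
    population = [rng.sample(data, size) for _ in range(pop)]
    for _ in range(generations):
        # Local refinement: one K-means step per individual (fast convergence).
        population = [kmeans_step(cb, data) for cb in population]
        # Fitness is the quantization distortion; lower is better.
        population.sort(key=lambda cb: distortion(cb, data))
        survivors = population[: pop // 2]  # elitist selection
        children = []
        for _ in range(pop - len(survivors)):
            p1, p2 = rng.sample(survivors, 2)
            child = [rng.choice(pair) for pair in zip(p1, p2)]  # uniform crossover
            child = [tuple(v + rng.gauss(0, sigma) for v in c) for c in child]  # mutation
            children.append(child)
        population = survivors + children
    return min(population, key=lambda cb: distortion(cb, data))
```

Because the survivors are carried over unchanged and a K-means step never increases distortion, the best codebook in this sketch cannot get worse from one generation to the next; the mutated children supply the global exploration that plain K-means lacks.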
3.
With the growing trend toward remote security verification procedures for telephone banking, biometric security measures and similar applications, automatic speaker verification (ASV) has received a lot of attention in recent years. The complexity of an ASV system and its verification time depend on the number of feature vectors, their dimensionality, the complexity of the speaker models and the number of speakers. In this paper, we concentrate on optimizing the dimensionality of the feature space by selecting relevant features. At present there are several methods for feature selection in ASV systems. To improve the performance of the ASV system, we present another method, based on the ant colony optimization (ACO) algorithm. After the feature reduction phase, the feature vectors are applied to a Gaussian mixture model universal background model (GMM-UBM), which is a text-independent speaker verification model. The performance of the proposed algorithm is compared to that of a genetic algorithm on the task of feature selection on the TIMIT corpus. The experimental results indicate that the optimized feature set improves the performance of the ASV system. Moreover, verification is significantly faster, since ACO reduces the number of features by more than 80%, which in turn decreases the complexity of the ASV system.
4.
Speaker recognition has been one of the interesting issues in signal and speech processing over the last few decades. Feature selection is one of the main parts of a speaker recognition system and can improve the performance of the system. In this paper, we propose two methods to find the MFCC feature vectors with the highest similarity, which are applied to a text-independent speaker identification system. These feature vectors capture individual properties of each person's vocal tract that are mostly repeated. They are used to build the speaker's model and to specify the decision boundary. We use the MFCCs of each window over the main signal as a feature vector and use clustering to obtain the feature vectors with the highest similarity. The speaker identification experiments are performed on the ELSDSR database, which consists of 22 speakers (12 male and 10 female), with a neural network as the classifier. The effect of three main parameters is considered in the two proposed methods. Experimental results indicate that the speaker identification system improves in terms of both accuracy and time consumption.
5.
The identification of a person on the basis of scanned images of handwriting is a useful biometric modality with application in forensic and historic document analysis and constitutes an exemplary study area within the research field of behavioral biometrics. We developed new and very effective techniques for automatic writer identification and verification that use probability distribution functions (PDFs) extracted from the handwriting images to characterize writer individuality. A defining property of our methods is that they are designed to be independent of the textual content of the handwritten samples. Our methods operate at two levels of analysis: the texture level and the character-shape (allograph) level. At the texture level, we use contour-based joint directional PDFs that encode orientation and curvature information to give an intimate characterization of individual handwriting style. In our analysis at the allograph level, the writer is considered to be characterized by a stochastic pattern generator of ink-trace fragments, or graphemes. The PDF of these simple shapes in a given handwriting sample is characteristic for the writer and is computed using a common shape codebook obtained by grapheme clustering. Combining multiple features (directional, grapheme, and run-length PDFs) yields increased writer identification and verification performance. The proposed methods are applicable to free-style handwriting (both cursive and isolated) and have practical feasibility, under the assumption that a few text lines of handwritten material are available in order to obtain reliable probability estimates.
6.
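A speech enhancement algorithm based on a two-stage discrete wavelet transform (DWT) is proposed. The algorithm first applies a DWT to the noisy speech signal, extracts the discrete detail signal, and applies a second DWT to it. Thresholds are then selected according to different rules and used to denoise the signal. Finally, the processed speech signals are recombined. Comparative experimental results show that the method removes noise effectively and improves the clarity and intelligibility of the speech.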
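The two-stage decomposition described above can be sketched in a few lines. The abstract names neither the wavelet nor the threshold rules, so this sketch assumes a single-level Haar wavelet, soft thresholding, and two caller-supplied thresholds; the function names are hypothetical and the input length must be a multiple of 4.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_dwt(x):
    """Single-level Haar DWT: returns (approximation, detail) coefficients."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x) - 1, 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleaves the reconstructed sample pairs."""
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return x

def soft_threshold(coeffs, t):
    """Shrink each coefficient toward zero by t; magnitudes below t become zero."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

def double_dwt_denoise(x, t_approx, t_detail):
    """Two-stage DWT denoising sketch (Haar wavelet and thresholds are assumptions)."""
    a1, d1 = haar_dwt(x)              # first DWT on the noisy signal
    a2, d2 = haar_dwt(d1)             # second DWT on the detail signal only
    a2 = soft_threshold(a2, t_approx) # one threshold per sub-band of the detail
    d2 = soft_threshold(d2, t_detail)
    d1_clean = haar_idwt(a2, d2)      # rebuild the denoised detail signal
    return haar_idwt(a1, d1_clean)    # merge with the untouched approximation
```

With both thresholds at zero the signal is reconstructed exactly; with very large thresholds the detail band vanishes and each output pair collapses to its average, which brackets the behaviour one would tune in between.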
7.
Multimedia Tools and Applications - The energy compaction property of the Discrete Cosine Transform (DCT) leads to its use in image and video compression applications. Nowadays power consumption is the...
9.
Neural Computing and Applications - The increasing number of software applications that allow altering digital images, together with their ease of use, weakens the credibility of an image. This...
10.
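An improved speech enhancement method is proposed. The noisy speech signal is decomposed into subbands, and each subband signal is filtered in the discrete fractional cosine transform (DFRCT) domain, exploiting the good orthogonality of the DFRCT; the adaptive filtering uses the least mean squares (LMS) algorithm. The inverse DFRCT of the filtered signals gives the enhanced subband speech signals, from which the enhanced speech signal is synthesized. Simulation results show that, by reducing the autocorrelation of the input signal, the algorithm improves the convergence speed and reduces the computation time (about 10 s). Both the segmental signal-to-noise ratio (SegSNR) and the PESQ score of the enhanced speech improve, indicating good speech enhancement performance.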
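The LMS update at the core of the adaptive filtering step can be illustrated as follows. This is a plain time-domain LMS filter only: the subband decomposition and the DFRCT of the abstract are not reproduced, and the function name, tap count and step size `mu` are assumptions chosen for illustration.

```python
def lms_filter(x, d, num_taps=4, mu=0.05):
    """Time-domain LMS adaptive filter (a sketch; not the DFRCT-domain variant).

    x: reference input samples; d: desired signal samples.
    Returns (filter outputs, error signal, final weights).
    """
    w = [0.0] * num_taps
    outputs, errors = [], []
    for n in range(len(x)):
        # Most recent num_taps input samples, zero-padded before the start.
        u = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y = sum(wk * uk for wk, uk in zip(w, u))        # filter output
        e = d[n] - y                                    # instantaneous error
        w = [wk + mu * e * uk for wk, uk in zip(w, u)]  # LMS weight update
        outputs.append(y)
        errors.append(e)
    return outputs, errors, w
```

The convergence speed of LMS depends on the eigenvalue spread of the input autocorrelation matrix, which is why a decorrelating transform applied before the filter, as the abstract describes, can speed up convergence.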
11.
In this paper, Type-2 Information Set (T2IS) features and Hanman Transform (HT) features are proposed as Higher Order Information Set (HOIS) based features for text-independent speaker recognition. The speech signals of different speakers, represented by Mel Frequency Cepstral Coefficients (MFCC), are converted into T2IS features and HT features by taking account of the cepstral and temporal possibilistic uncertainties. The features are classified by the Improved Hanman Classifier (IHC), Support Vector Machine (SVM) and k-Nearest Neighbours (kNN). The performance of the proposed approaches is tested in terms of speed, computational complexity, memory requirement and accuracy on three datasets, namely NIST-2003, the VoxForge 2014 speech corpus and the VCTK speech corpus, and compared with that of baseline features such as MFCC, ΔMFCC, ΔΔMFCC and GFCC in a white Gaussian noise environment at different signal-to-noise ratios. The proposed features have a reduced feature size, computational time and complexity, and their performance does not degrade in the noisy environment.
12.
Alcoholic and non-alcoholic fatty liver disease is one of the leading causes of chronic liver disease and mortality in Western countries and Asia. Ultrasound image assessment is the most widely used way to identify Non-Alcoholic Fatty Liver Disease (NAFLD). It is one of the faster and safer non-invasive imaging methods available for NAFLD diagnosis. Diagnosing NAFLD by biopsy is expensive and invasive, and causes anxiety to patients. The advent of advanced image processing and data mining techniques has helped to develop faster, efficient, objective and accurate decision support systems for fatty liver disease using ultrasound images. This paper proposes a novel feature extraction model based on the Radon Transform (RT) and the Discrete Cosine Transform (DCT). First, the RT is performed on the ultrasound images at every 1 degree to capture the low-frequency details. Then the 2D-DCT is applied to the Radon-transformed image to obtain the frequency features (DCT coefficients). The 2D-DCT coefficients are then converted to a 1D coefficient vector in zigzag fashion. This 1D array of DCT coefficients is subjected to Locality Sensitive Discriminant Analysis (LSDA) to reduce the number of features. These features are then ranked using the minimum Redundancy and Maximum Relevance (mRMR) ranking method. Finally, a minimum number of highly ranked features are fused using Decision Tree (DT), k-Nearest Neighbour (k-NN), Probabilistic Neural Network (PNN), Support Vector Machine (SVM), Fuzzy Sugeno (FS) and AdaBoost classifiers to get the highest classification performance. In this work, we obtained an average accuracy, sensitivity and specificity of 100% in the detection of NAFLD using the FS classifier. We have also devised an integrated index, named the Fatty Liver Disease Index (FLDI), by fusing two significant LSDA components to distinguish the normal and FLD classes with a single number.
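The zigzag conversion of a 2D coefficient matrix to a 1D vector, mentioned above, can be sketched as below. The abstract does not say which zigzag convention is used; this sketch assumes the JPEG-style order that starts at the top-left (lowest-frequency) coefficient, and the function name `zigzag` is hypothetical.

```python
def zigzag(matrix):
    """Flatten a 2D list into a 1D list in JPEG-style zigzag order."""
    rows, cols = len(matrix), len(matrix[0])
    out = []
    for s in range(rows + cols - 1):  # s = row + col indexes one anti-diagonal
        # Cells on this anti-diagonal, listed from top-right to bottom-left.
        diag = [(r, s - r) for r in range(rows) if 0 <= s - r < cols]
        if s % 2 == 0:
            diag.reverse()  # even diagonals are traversed bottom-left to top-right
        out.extend(matrix[r][c] for r, c in diag)
    return out

# A 3x3 example: the scan visits low-frequency (top-left) entries first.
print(zigzag([[0, 1, 2],
              [3, 4, 5],
              [6, 7, 8]]))  # prints [0, 1, 3, 6, 4, 2, 5, 7, 8]
```

Ordering the coefficients this way places low-frequency terms at the front of the vector, which is convenient when a later stage keeps or ranks only part of it.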
13.
In speaker recognition tasks, one of the reasons for reduced accuracy is the presence of speakers who closely resemble one another in the acoustic space. To increase the discriminative power of the classifier, the system must be able to use only the features that are unique to a given speaker with respect to his/her acoustically resembling speakers. This paper proposes a technique to reduce confusion errors by finding speaker-specific phonemes and formulating a text from the subset of phonemes that are unique, for a speaker verification task using an i-vector based approach. Spectral features such as linear prediction cepstral coefficients (LPCC) and perceptual linear prediction coefficients (PLP), and phase features such as the modified group delay, are tested to analyse the importance of speaker-specific text in speaker verification. Experiments have been conducted on a speaker verification task using speech data of 50 speakers collected in a laboratory environment. The experiments show that the equal error rate (EER) decreases significantly using the i-vector approach with speaker-specific text compared to the i-vector approach with random text, across different spectral and phase-based features.
14.
At present, speaker recognition under noisy conditions is a major challenge in speech processing. A noisy environment causes significant degradation in system performance. The main aim of the proposed work is to identify speakers against clean and noisy backgrounds using a limited dataset. In this paper, we propose multitaper-based Mel frequency cepstral coefficient (MFCC) and power normalized cepstral coefficient (PNCC) techniques with fusion strategies. MFCC and PNCC techniques with different multitapers are used to extract the desired features from the speech samples. Cepstral mean and variance normalization (CMVN) and feature warping (FW) are then applied to normalize the features obtained from both techniques. A low-dimensional i-vector model is used as the system model, and different score fusion strategies, namely mean, maximum, weighted sum, cumulative and concatenated fusion, are utilized. Finally, an extreme learning machine (ELM) is used for classification in order to increase the system identification accuracy (SIA); the ELM is a single-layer feedforward neural network that is less complex and less time-consuming than other neural networks. Two databases, TIMIT and SITW 2016, are used to evaluate the proposed system under limited data. Both clean and noisy background conditions are used to check the SIA.
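Of the normalization steps named above, CMVN is simple enough to show directly. The sketch below assumes per-utterance statistics and the population (divide-by-N) variance; the abstract does not state either choice, and the function name `cmvn` and the `eps` guard are illustrative.

```python
import math

def cmvn(frames, eps=1e-10):
    """Cepstral mean and variance normalization over one utterance.

    frames: list of feature vectors (one per frame), all of the same length.
    Each feature dimension is shifted to zero mean and scaled to unit variance.
    """
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dim)]
    # eps avoids division by zero for a dimension that is constant over the utterance.
    return [[(f[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
            for f in frames]
```

Removing the per-dimension mean cancels stationary channel effects in the cepstral domain, and scaling the variance reduces the mismatch that additive noise introduces between clean and noisy recordings.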
15.
The separation of mixed auditory signals into their sources is an eminent neuroscience and engineering challenge. We reveal the principles underlying a deterministic, neural network-like solution to this problem. This approach is orthogonal to ICA/PCA, which views the signal constituents as independent realizations of random processes. We demonstrate by example that, in the absence of salient frequency modulations, the decomposition of speech signals into local cosine packets allows for a sparse, noise-robust speaker separation. As the main result, we present analytical limitations inherent in the approach and propose strategies for dealing with them. Our results offer new perspectives on efficient noise cleaning and auditory signal separation, and on how the brain might achieve these tasks.
16.
Geometry based block partitioning (GBP) has been shown to achieve better performance than the tree structure based block partitioning (TSBP) of H.264. However, the residual blocks of GBP mode after motion compensation still present some non-vertical/non-horizontal orientations, and the conventional discrete cosine transform (DCT) may generate many high-frequency coefficients. To solve this problem, in this paper we propose a video coding approach using GBP and reordering DCT (RDCT) techniques. In the proposed approach, GBP is first applied to partition the macroblocks. Then, before performing DCT, a reordering operation is used to adjust the pixel positions of the residual macroblocks based on the partition information. In this way, the partition information of GBP can be used to represent the reordering information of RDCT, and the bitrate can be reduced. Experimental results show that, compared to H.264/AVC, the proposed method achieves on average 6.38% and 5.69% bitrate reductions at low and high bitrates, respectively.
17.
Speech and speaker recognition systems are rapidly being deployed in real-world applications. In this paper, we discuss the details of a system and its components for indexing and retrieving multimedia content derived from broadcast news sources. The audio analysis component calls for real-time speech recognition for converting the audio to text and concurrent speaker analysis consisting of the segmentation of audio into acoustically homogeneous sections followed by speaker identification. The output of these two simultaneous processes is used to abstract statistics to automatically build indexes for text-based and speaker-based retrieval without user intervention. The real power of multimedia document processing is the possibility of Boolean queries in the form of combined text- and speaker-based user queries. Retrieval for such queries entails combining the results of individual text and speaker based searches. The underlying techniques discussed here can easily be extended to other speech-centric applications and transactions.
18.
In this paper, a text-independent automatic speaker recognition (ASkR) system is proposed, the SR_Hurst, which employs a new speech feature and a new classifier. The statistical feature pH is a vector of Hurst (H) parameters obtained by applying a wavelet-based multidimensional estimator (M_dim_wavelets) to the windowed short-time segments of speech. The proposed classifier for the speaker identification and verification tasks is based on the multidimensional fBm (fractional Brownian motion) model, denoted by M_dim_fBm. For a given sequence of input speech features, the speaker model is obtained from the sequence of vectors of H parameters, means, and variances of these features. The performance of the SR_Hurst was compared to that achieved with the Gaussian mixture model (GMM), autoregressive vector (AR), and Bhattacharyya distance (dB) classifiers. The speech database, recorded from fixed and cellular phone channels, was uttered by 75 different speakers. The results have shown the superior performance of the M_dim_fBm classifier and that the pH feature aggregates new information on the speaker identity. In addition, the proposed classifier employs a much simpler modeling structure than the GMM.
19.
This paper addresses the problem of recognising speech in the presence of a competing speaker. We review a speech fragment decoding technique that treats segregation and recognition as coupled problems. Data-driven techniques are used to segment a spectro-temporal representation into a set of fragments, such that each fragment is dominated by one or other of the speech sources. A speech fragment decoder is used which employs missing data techniques and clean speech models to simultaneously search for the set of fragments and the word sequence that best matches the target speaker model. The paper investigates the performance of the system on a recognition task employing artificially mixed target and masker speech utterances. The fragment decoder produces significantly lower error rates than a conventional recogniser, and mimics the pattern of human performance that is produced by the interplay between energetic and informational masking. However, at around 0 dB the performance is generally quite poor. An analysis of the errors shows that a large number of target/masker confusions are being made. The paper presents a novel fragment-based speaker identification approach that allows the target speaker to be reliably identified across a wide range of SNRs. This component is combined with the recognition system to produce significant improvements. When the target and masker utterance have the same gender, the recognition system has a performance at 0 dB equal to that of humans; in other conditions the error rate is roughly twice the human error rate.
20.
In the field of automatic audiovisual content-based indexing and structuring, finding events like interviews, debates, reports, or live commentaries requires bridging the gap between low-level feature extraction and such high-level event detection. In our work, we consider that detecting speaker roles like Anchor, Journalist and Other is a first step toward enriching interaction sequences between speakers. Our work relies on the assumption that clues about speaker roles exist in temporal, prosodic and basic signal features extracted from audio files and from speaker segmentations. Each speaker is therefore represented by a 36-feature vector. Contrary to most state-of-the-art proposals, we do not use the structure of the document to recognize the roles of the speakers. We investigate the influence of two dimensionality reduction techniques (Principal Component Analysis and Linear Discriminant Analysis) and different classification methods (Gaussian Mixture Models, K-nearest neighbours and Support Vector Machines). Experiments are done on the 13-h corpus of the ESTER2 evaluation campaign. The best result is about 82% of roles correctly recognized, which corresponds to more than 89% of speech duration correctly labelled.