Similar Literature
20 similar documents retrieved (search time: 46 ms)
1.
In this paper we propose a method for improving the segmentation of speech waveforms into phonetic units. The method is based on the well-known Viterbi time-alignment algorithm and utilizes phonetic boundary predictions from multiple speech parameterization techniques. Specifically, for each boundary type we select the most appropriate phone transition position prediction and use it as the initial point from which Viterbi time-alignment predicts the successor phonetic boundary. The method was evaluated on the TIMIT database using several Fourier-based and wavelet-based speech parameterization algorithms that are well known in speech processing. For a tolerance of 20 milliseconds, the experimental results showed an improvement in absolute segmentation accuracy of approximately 0.70% over the baseline speech segmentation scheme.
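The core of Viterbi time-alignment is a monotonic dynamic-programming pass that assigns each frame to a phone. The following is a minimal sketch with toy log-scores, not the paper's multi-parameterization method; the scoring matrix here stands in for per-frame phone likelihoods from an acoustic model:

```python
import math

def viterbi_align(log_scores):
    """Align T frames to N phones monotonically (each phone covers >= 1
    consecutive frames), maximizing the total frame log-score.
    log_scores[t][n] = log-likelihood of frame t under phone n.
    Returns the phone index assigned to each frame."""
    T, N = len(log_scores), len(log_scores[0])
    NEG = -math.inf
    # best[t][n]: best score of a path ending with frame t in phone n
    best = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    best[0][0] = log_scores[0][0]
    for t in range(1, T):
        for n in range(N):
            stay = best[t - 1][n]
            move = best[t - 1][n - 1] if n > 0 else NEG
            if stay >= move:
                best[t][n], back[t][n] = stay, n
            else:
                best[t][n], back[t][n] = move, n - 1
            best[t][n] += log_scores[t][n]
    # Backtrace from the final phone at the last frame.
    path, n = [N - 1], N - 1
    for t in range(T - 1, 0, -1):
        n = back[t][n]
        path.append(n)
    return path[::-1]

# Toy example: 6 frames, 2 phones; the boundary should fall where the
# scores switch from favouring phone 0 to favouring phone 1.
scores = [[0.0, -5.0]] * 3 + [[-5.0, 0.0]] * 3
alignment = viterbi_align(scores)  # -> [0, 0, 0, 1, 1, 1]
```

The proposed method would additionally seed such an alignment from the boundary-type-specific prediction rather than from frame 0.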

2.
3.
Non-negative Tucker decomposition (NTD) is applied to unsupervised training of discrete-density HMMs for discovering sequential patterns in data, segmenting sequential data into patterns, and recognizing the discovered patterns in unseen data. Structure constraints are imposed on the NTD such that it shares its parameters with the HMM. Two training schemes are proposed: one uses NTD as a regularizer for Baum–Welch (BW) training of the HMM; the other alternates between initializing the NTD with the BW output and vice versa. On the task of unsupervised spoken pattern discovery from the TIDIGITS database, both training schemes improve over BW training in terms of pattern purity, segmentation boundary accuracy, and speech recognition accuracy. Furthermore, we observe experimentally that the alternating training of NTD and BW outperforms NTD-regularized BW, plain BW training, and BW training with simulated annealing.

4.
Current speech recognition models already achieve good results on phonographic languages such as English and French. Chinese, however, is a typical ideographic language: Chinese characters have no direct correspondence with pronunciation, whereas pinyin, the phonetic annotation of character readings, is intrinsically inter-convertible with the characters. Using pinyin as a decoding constraint in Chinese speech recognition therefore introduces an inductive bias closer to the speech itself. Within a multi-task learning framework, this paper proposes a Chinese speech recognition method based on joint learning with pinyin constraints: end-to-end character-level recognition is the main task and pinyin recognition the auxiliary task. The two tasks share an encoder, and both character and pinyin recognition results serve as supervision signals, strengthening the encoder's representation of Chinese speech. Experimental results show that, compared with the baseline model, the proposed method achieves better recognition, reducing the word error rate by 2.24%.
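The joint supervision described above typically reduces to a weighted sum of the two tasks' losses on top of the shared encoder. This is a minimal sketch with toy posteriors; the weight `alpha` and the tiny vocabularies are illustrative assumptions, not values from the paper:

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])

def joint_loss(char_probs, char_target, pinyin_probs, pinyin_target, alpha=0.7):
    """Multi-task objective: the character task is the main task and the
    pinyin task the auxiliary one; both gradients reach the shared encoder."""
    return (alpha * cross_entropy(char_probs, char_target)
            + (1 - alpha) * cross_entropy(pinyin_probs, pinyin_target))

# Toy posteriors from the two decoding heads over small vocabularies.
char_probs = [0.1, 0.8, 0.1]      # character head
pinyin_probs = [0.05, 0.9, 0.05]  # pinyin head
loss = joint_loss(char_probs, 1, pinyin_probs, 1)
```

In a real end-to-end system both terms would be sequence losses (e.g. CTC or attention cross-entropy) over character and pinyin transcripts.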

5.
In this paper, we propose a novel approach to automatic speech segmentation for unit-selection based text-to-speech systems. Instead of using a single automatic segmentation machine (ASM), we use multiple independent ASMs to produce the final boundary time-mark. Specifically, given multiple boundary time-marks provided by separate ASMs, we first compensate for each time-mark's potential ASM-specific, context-dependent systematic error (bias) and then compute the weighted sum of the bias-removed time-marks, yielding the final time-mark. The bias and weight parameters required by the proposed method are obtained beforehand for each phonetic context (e.g., /p/-/a/) through a training procedure in which manual segmentations serve as references. For training, we first define a cost function that quantifies the discrepancy (error) between the automatic and manual segmentations and then minimize the sum of costs with respect to the bias and weight parameters. When a squared error is used as the cost, the bias parameters are obtained simply by averaging the errors within each phonetic context; with the bias parameters fixed, the weight parameters are then jointly optimized through a gradient projection method, adopted to handle the constraints imposed on the weight parameter space. A decision tree that clusters all phonetic contexts is used to handle unseen contexts. Our experimental results indicate that the proposed method improves the percentage of boundaries deviating less than 20 ms from the reference boundary from 95.06% with an HMM-based procedure and 96.85% with a previous multiple-model based procedure to 97.07%.
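The bias-removal and weighted-combination step for one phonetic context can be sketched as follows. The numbers are toy values (seconds), not from the paper; the abstract's squared-error case is assumed, where the bias is simply the mean signed error:

```python
def estimate_bias(auto_marks, manual_marks):
    """Bias for one phonetic context: mean signed error of an ASM's
    boundaries against the manual reference segmentations."""
    errors = [a - m for a, m in zip(auto_marks, manual_marks)]
    return sum(errors) / len(errors)

def fuse_boundaries(time_marks, biases, weights):
    """Combine boundary time-marks from several automatic segmentation
    machines (ASMs): subtract each ASM's context-dependent bias, then
    take the weighted sum of the bias-removed marks."""
    assert len(time_marks) == len(biases) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-9  # convex combination
    return sum(w * (t - b) for t, b, w in zip(time_marks, biases, weights))

# Toy training data for one context, e.g. a /p/-/a/ transition: two ASMs
# that systematically over- and under-shoot the manual boundary.
bias1 = estimate_bias([1.02, 2.03, 3.01], [1.00, 2.00, 3.00])  # ~ +0.02
bias2 = estimate_bias([0.97, 1.98, 2.99], [1.00, 2.00, 3.00])  # ~ -0.02
fused = fuse_boundaries([5.02, 4.98], [bias1, bias2], [0.5, 0.5])
```

In the paper the weights are not fixed at 0.5 but optimized by gradient projection under the constraints on the weight space.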

6.
In the present study, we propose a regression-based scheme for directly estimating the height of unknown speakers from their speech. In this scheme, every speech input is decomposed via the openSMILE audio parameterization into a single feature vector that is fed to a regression model, which directly estimates the person's height. The focus of this study is evaluating the suitability of several linear and non-linear regression algorithms for automatic height estimation from speech. The performance of the proposed scheme is evaluated on the TIMIT database, and the experimental results show an accuracy of 0.053 meters, in terms of mean absolute error, for the best performing Bagging regression algorithm. This accuracy corresponds to an average relative error of approximately 3%. We believe that direct estimation of the height of unknown people from speech provides an important additional feature for improving various surveillance, profiling, and access authorization applications.

7.
Private predictions on hidden Markov models
Hidden Markov models (HMMs) are widely used in practice to make predictions. They are becoming increasingly popular as part of prediction systems in finance, marketing, bioinformatics, speech recognition, signal processing, and so on. However, traditional HMMs do not allow users and model owners to generate predictions without disclosing their private information to each other. To address the increasing need for privacy, this work identifies and studies the private prediction problem, demonstrated with the following scenario: Bob has a private HMM, while Alice has a private input; she wants to use Bob's model to make a prediction based on her input. However, Alice does not want to disclose her private input to Bob, while Bob wants to prevent Alice from deriving information about his model. How can Alice and Bob perform HMM-based predictions without violating their privacy? We propose privacy-preserving protocols that produce predictions on HMMs without greatly exposing Bob's and Alice's privacy. We then analyze our schemes in terms of accuracy, privacy, and performance. Since these are conflicting goals, some degradation in accuracy or performance is expected due to the privacy constraints. However, our schemes make it possible for Bob and Alice to produce the same predictions efficiently while preserving their privacy.
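The quantity such protocols protect is the ordinary HMM forward likelihood that Alice wants computed under Bob's model. As context, here is the plaintext (non-private) computation on a toy two-state model; the cryptographic protocols themselves are not reproduced here:

```python
def forward_likelihood(obs, pi, A, B):
    """Plain (non-private) HMM forward algorithm: probability of the
    observation sequence under the model (pi, A, B).
    pi[i]: initial state probs; A[i][j]: transitions; B[i][o]: emissions."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# Toy 2-state, 2-symbol model (Bob's side) and observations (Alice's side).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p = forward_likelihood([0, 1], pi, A, B)  # -> 0.209
```

A privacy-preserving variant would evaluate the same recursion over encrypted or secret-shared values so that neither party sees the other's inputs.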

8.
Optimal representation of acoustic features is an ongoing challenge in automatic speech recognition research. As an initial step toward this goal, some approaches propose optimizing filterbanks for the cepstral coefficients using evolutionary optimization methods. However, the large number of optimization parameters required by a filterbank makes it difficult to guarantee that a single optimized filterbank provides the best representation for phoneme classification. Moreover, in many cases, a number of potential solutions are obtained, each discriminating between specific groups of phonemes; in other words, each filterbank has its own particular advantage. Aggregating the discriminative information provided by the filterbanks is therefore a challenging task. In this study, a number of complementary filterbanks are optimized to provide different representations of speech signals for phoneme classification using the hidden Markov model (HMM). Fuzzy information fusion is used to aggregate the decisions provided by the HMMs, as fuzzy theory can effectively handle the uncertainties of classifiers trained with different representations of the speech data. The output of each expert's HMM classifiers is fused using a fuzzy decision fusion scheme, which employs global and local confidence measurements to formulate the reliability of each classifier based on both the global and local context when making overall decisions. Experiments were conducted on clean and noisy phonetic samples. The proposed method outperformed conventional Mel frequency cepstral coefficients under both conditions in terms of overall phoneme classification accuracy, and the fuzzy fusion scheme was shown to be capable of aggregating the complementary information provided by each filterbank.
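At its simplest, confidence-weighted decision fusion sums each expert's class scores scaled by a reliability term and takes the argmax. This is a minimal sketch with toy numbers, not the paper's fuzzy formulation; the single `confidences` vector here stands in for the combined global/local reliability of each classifier:

```python
def fuse_decisions(scores_per_expert, confidences):
    """Weight each expert's class scores by its reliability and sum.
    scores_per_expert[k][c]: score of class c from expert k.
    Returns the index of the winning class."""
    n_classes = len(scores_per_expert[0])
    fused = [sum(conf * scores[c]
                 for scores, conf in zip(scores_per_expert, confidences))
             for c in range(n_classes)]
    return max(range(n_classes), key=fused.__getitem__)

# Toy: two experts (filterbanks) disagree on a 2-class decision;
# the more reliable one (confidence 0.8) should win.
expert_scores = [[0.2, 0.8], [0.6, 0.4]]
decision = fuse_decisions(expert_scores, [0.8, 0.2])  # -> 1
```

A fuzzy scheme would further derive the confidences from membership functions over the local acoustic context rather than fix them in advance.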

9.
Accurate modeling of prosody is a prerequisite for producing high-quality synthetic speech. Phone duration, one of the key prosodic parameters, plays an important role in generating natural-sounding emotional synthetic speech. In the present work we offer an overview of various phone duration modeling techniques and evaluate ten models, based on decision trees, linear regression, lazy-learning algorithms, and meta-learning algorithms, which over the past decades have been successfully used in various modeling tasks. Furthermore, we study the opportunity for performance optimization by applying two feature selection techniques, RReliefF and Correlation-based Feature Selection, to a large set of numerical and nominal linguistic features extracted from text, such as phonetic, phonological, and morphosyntactic features, which have been reported successful in phone and syllable duration modeling. We investigate the practical usefulness of these phone duration modeling techniques on a Modern Greek emotional speech database consisting of five categories of emotional speech: anger, fear, joy, neutral, and sadness. The experimental results demonstrate that feature selection significantly improves the accuracy of phone duration prediction regardless of the type of machine learning algorithm used. Specifically, in four out of the five categories of emotional speech, feature selection improved phone duration modeling compared to the case without feature selection. The M5P tree-based phone duration model achieved the best prediction accuracy in terms of RMSE and MAE.
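The two error measures used above, RMSE and MAE, are computed from predicted and reference durations as follows; the durations here are toy values in milliseconds:

```python
import math

def rmse(pred, ref):
    """Root mean squared error between predictions and references."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

def mae(pred, ref):
    """Mean absolute error between predictions and references."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)

# Toy phone durations (ms): model predictions vs. reference durations.
predicted = [80.0, 120.0, 60.0]
reference = [90.0, 110.0, 60.0]
err_rmse = rmse(predicted, reference)
err_mae = mae(predicted, reference)
```

RMSE penalizes large deviations more strongly than MAE, which is why the two can rank duration models differently.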

10.
An advanced image and video segmentation system is proposed. The system builds on existing work but extends it to achieve efficiency and robustness, the two major shortcomings of segmentation methods developed so far. Six different schemes containing several approaches tailored to diverse applications constitute the core of the system. The first two focus on very-low-complexity image segmentation addressing real-time applications under specific assumptions. The third scheme is a highly efficient implementation of the powerful nonlinear diffusion model. The other three schemes address the more complex task of physical object segmentation using information about scene structure or motion; these techniques are based on an extended diffusion model and morphology. The main objective of this work has been to develop a robust and efficient segmentation system for natural video and still images. This goal has been achieved by advancing the state of the art, pushing current methods to meet the challenges of the segmentation task in different situations at reasonable computational cost. Consequently, more efficient methods and novel strategies for issues where current approaches fail are developed. The performance of the presented segmentation schemes has been assessed by processing several video sequences; qualitative and quantitative results of this assessment are also reported.

11.
12.
As forecasting becomes increasingly important, hidden Markov models (HMMs) are widely used for prediction in many applications such as finance, marketing, bioinformatics, speech recognition, and so on. After creating an HMM, the model owner can start providing predictions. When the model is owned by one party, predictions can be provided easily. It becomes a challenge, however, when the model is horizontally or vertically distributed between various parties, even competing companies. The parties want to integrate the split models they own for better forecasting, but for privacy, financial, and legal reasons they do not want to share their models. We investigate how such parties can produce predictions on the distributed model without violating their privacy. We then analyze our proposed schemes in terms of accuracy, privacy, and performance, and finally present our findings.

13.
14.
15.
This paper presents an integrated approach to spotting spoken keywords in digitized Tamil documents by combining word image matching and spoken word recognition techniques. The work involves the segmentation of document images into words, the creation of an index of keywords, and the construction of a word image hidden Markov model (HMM) and a speech HMM for each keyword. The word image HMMs are constructed using seven-dimensional profile and statistical moment features and are used to recognize a segmented word image for possible inclusion of the keyword in the index. The spoken query word is recognized as the speech HMM with the maximum likelihood, using 39-dimensional mel-frequency cepstral coefficients derived from speech samples of the keywords. The positional details of the search keyword, obtained from the automatically updated index, retrieve the relevant portion of text from the document during word spotting. Performance measures such as recall, precision, and F-measure are calculated for 40 test words from four groups of literary documents to illustrate the ability of the proposed scheme and highlight its worth in the emerging multilingual information retrieval scenario.
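The retrieval measures named above are standard; as a reference point, here they are computed on toy keyword positions (the positions are illustrative, not from the paper):

```python
def precision_recall_f(retrieved, relevant):
    """Standard retrieval metrics for keyword spotting: 'retrieved' is the
    set of word positions the system returned, 'relevant' the set of true
    occurrences of the keyword."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Toy: the spotter returns 4 positions, 3 of which are true occurrences;
# the keyword actually occurs at 5 positions in the document.
p, r, f = precision_recall_f({1, 5, 9, 12}, {1, 5, 9, 20, 31})
```

Here precision is 0.75, recall 0.6, and the F-measure their harmonic mean.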

16.
HMM-based segmentation and classification of broadcast news
This paper proposes an algorithm that uses hidden Markov chains, which are well suited to modeling stochastic time-series data, to segment and classify broadcast news. First, a hidden Markov chain with latent semantic states coarsely classifies the raw broadcast into start/end and speech parts. Next, three hidden Markov chains pre-classify the speech segments by the maximum-likelihood criterion into anchor introductions, commercials, and weather forecasts. Finally, field reports are identified from the rate of semantic change, completing the fine-grained segmentation and classification of the broadcast news.
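The maximum-likelihood pre-classification among the three class models (anchor introduction, commercial, weather forecast) can be sketched as scoring a segment under each model and picking the best. The per-class emission distributions below are toy stand-ins, not the paper's trained hidden Markov chains:

```python
import math

def log_likelihood(obs, emission):
    """Log-likelihood of a discrete symbol sequence under one class model
    (a simplified stand-in for a trained hidden Markov chain)."""
    return sum(math.log(emission[o]) for o in obs)

def classify(obs, models):
    """Maximum-likelihood decision among the candidate class models."""
    return max(models, key=lambda name: log_likelihood(obs, models[name]))

# Toy emission distributions over 3 acoustic symbols per class.
models = {
    "anchor":  [0.7, 0.2, 0.1],
    "ad":      [0.1, 0.8, 0.1],
    "weather": [0.2, 0.2, 0.6],
}
label = classify([1, 1, 0, 1], models)  # -> "ad"
```

In the full algorithm each score would come from a forward pass over a chain with hidden states rather than a single emission distribution.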

17.
To improve recognition performance in noisy environments, multicondition training is usually applied, in which speech signals corrupted by a variety of noise are used in acoustic model training. Published hidden Markov modeling of speech uses multiple Gaussian distributions to cover the spread of the speech distribution caused by noise, which detracts from the modeling of the speech events themselves and may sacrifice performance on clean speech. In this paper, we propose a novel approach which extends the conventional Gaussian mixture hidden Markov model (GMHMM) by modeling the state emission parameters (mean and variance) as polynomial functions of a continuous environment-dependent variable. At recognition time, a set of HMMs specific to the given value of the environment variable is instantiated and used for recognition. The maximum-likelihood (ML) estimation of the polynomial functions of the proposed variable-parameter GMHMM is given within the expectation-maximization (EM) framework. Experiments on the Aurora 2 database show significant improvements of the variable-parameter Gaussian mixture HMMs over the conventional GMHMMs.
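The instantiation step at recognition time amounts to evaluating the polynomial mean and variance functions at the observed environment value. This is a minimal sketch; the polynomial coefficients and the choice of SNR as the environment variable are illustrative assumptions:

```python
def poly_eval(coeffs, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... at environment value x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

def instantiate_state(mean_coeffs, var_coeffs, env):
    """Instantiate one state's Gaussian emission parameters for a given
    value of the continuous environment variable (e.g. SNR in dB)."""
    mean = poly_eval(mean_coeffs, env)
    var = max(poly_eval(var_coeffs, env), 1e-6)  # keep the variance positive
    return mean, var

# Toy trajectories: quadratic mean and linear variance over SNR.
mean_c = [1.0, 0.1, -0.01]  # mean(snr) = 1 + 0.1*snr - 0.01*snr^2
var_c = [0.5, 0.02]         # var(snr) = 0.5 + 0.02*snr
mean_10db, var_10db = instantiate_state(mean_c, var_c, 10.0)
```

In the paper these coefficients are estimated by ML within the EM framework; at test time every Gaussian in the model is instantiated this way before decoding.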

18.
In this paper we report our recent research, whose goal is to improve the performance of a novel speech recognizer based on an underlying statistical hidden dynamic model of phonetic reduction in the production of conversational speech. We have developed a path-stack search algorithm which efficiently computes the likelihood of any observation utterance while optimizing the dynamic regimes in the speech model. The effectiveness of the algorithm is tested on speech data from the Switchboard corpus, where the optimized dynamic regimes computed by the algorithm are compared with those from exhaustive search. We also present speech recognition results on the Switchboard corpus that demonstrate improvements in the recognizer's performance compared with using dynamic regimes heuristically set from the phone segmentation of a state-of-the-art hidden Markov model (HMM) system.

19.
Segmentation plays a vital role in speech recognition systems. Automatic segmentation of Tamil speech into syllables has been carried out using the Vowel Onset Point (VOP) and the Spectral Transition Measure (STM). The VOP is a phonetic event used to identify the beginning of a vowel in a speech signal, and the STM is used to find significant spectral changes in speech utterances. The performance of the proposed syllable segmentation method is measured against manual segmentation and compared with the existing syllable segmentation method based on the VOP and the Vowel Offset Point (VOF). The experimental results show the effectiveness of the proposed system.
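One common way to realize a spectral transition measure is to track the frame-to-frame spectral distance and take local maxima above a threshold as boundary candidates. This is a minimal sketch under that assumption, with toy two-dimensional "spectra" instead of real spectral vectors:

```python
def spectral_transition(frames):
    """Per-frame transition measure: mean squared difference between
    consecutive spectral vectors (0 for the first frame)."""
    stm = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        stm.append(sum((a - b) ** 2 for a, b in zip(prev, cur)) / len(cur))
    return stm

def pick_boundaries(stm, threshold):
    """Boundary candidates: local maxima of the STM above a threshold."""
    return [i for i in range(1, len(stm) - 1)
            if stm[i] > threshold
            and stm[i] >= stm[i - 1] and stm[i] >= stm[i + 1]]

# Toy "spectra": a steady region, an abrupt change, another steady region.
frames = [[1.0, 1.0]] * 3 + [[4.0, 4.0]] * 3
stm = spectral_transition(frames)
boundaries = pick_boundaries(stm, 1.0)  # boundary at the abrupt change
```

The proposed method would further intersect such STM candidates with VOP evidence to keep only syllable-level boundaries.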

20.
Although Hidden Markov Models (HMMs) are still the mainstream approach to speech recognition, their intrinsic limitations, such as the first-order Markov assumption and the assumption of independent and identically distributed frames, lead to the extensive use of higher-level linguistic information to produce satisfactory results. Researchers have therefore begun investigating the incorporation of various discriminative techniques at the acoustic level to induce more discrimination between speech units. As is known, k-nearest-neighbour (k-NN) density estimation is discriminant by nature and is widely used in the pattern recognition field, yet its application to speech recognition has been limited to a few experiments. In this paper, we introduce a new segmental k-NN-based phoneme recognition technique. In this approach, a group-delay-based method generates phoneme boundary hypotheses, and an approximate version of k-NN density estimation is used for the classification and scoring of variable-length segments. During decoding, the construction of the phonetic graph starts from the best phoneme boundary setting and proceeds by splitting and merging segments using the remaining boundary hypotheses and constraints such as phoneme duration and broad-class similarity information. To perform the k-NN search, we take advantage of a similarity search algorithm called the Spatial Approximate Sample Hierarchy (SASH). One major advantage of the SASH algorithm is that its computational complexity is independent of the dimensionality of the data, which allows us to use high-dimensional feature vectors to represent phonemes. Because phonemes are used as the units of speech, the search space is very limited and the decoding process fast. Evaluating the proposed algorithm using only the best hypothesis for every segment, and excluding phoneme transition probabilities, context-based information, and language model information, yields an accuracy of 58.5% with correctness of 67.8% on the TIMIT test dataset.
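The segmental k-NN classification step can be sketched as mapping a variable-length segment to a fixed-dimensional vector and voting among its nearest labeled examples. Frame averaging below is a crude illustrative stand-in for the paper's segment features, and an exact (non-SASH) search is used for clarity:

```python
def segment_embedding(frames):
    """Collapse a variable-length segment into a fixed-dimensional vector
    by averaging its frames (illustrative stand-in for segment features)."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def knn_label(query, examples, k):
    """k-NN classification with squared Euclidean distance;
    'examples' are (vector, label) pairs."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)  # majority vote

# Toy labeled phoneme exemplars and a 2-frame query segment.
examples = [([0.0, 0.0], "aa"), ([0.1, 0.1], "aa"), ([1.0, 1.0], "iy")]
query = segment_embedding([[0.0, 0.1], [0.2, 0.1]])
label = knn_label(query, examples, k=3)  # -> "aa"
```

The SASH structure replaces the exhaustive sort here with an approximate neighbour search whose cost does not grow with the feature dimensionality.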


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司)  京ICP备09084417号