Similar Documents
20 similar documents found (search time: 156 ms)
1.
The focus of this paper is to automatically segment and label continuous speech into syllable-like units for Indian languages. In this approach, the continuous speech signal is first segmented into syllable-like units using a group-delay-based algorithm. Similar syllable segments are then grouped together using an unsupervised and incremental training (UIT) technique, and isolated-style HMMs are generated for each cluster during training. During testing, the speech signal is segmented into syllable-like units, which are then scored against the HMMs obtained during training. This yields a syllable recognition performance of 42.6% and 39.94% for Tamil and Telugu, respectively. A new feature extraction technique that uses features extracted at multiple frame sizes and frame rates during both training and testing is explored for the syllable recognition task, raising performance to 48.7% and 45.36% for Tamil and Telugu, respectively. The performance of segmentation followed by labelling is superior to that of a flat-start syllable recogniser (27.8% and 28.8% for Tamil and Telugu, respectively).
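A rough sketch of the group-delay segmentation idea, assuming the standard construction in which the short-time energy contour is treated as a magnitude spectrum and the peaks of its minimum-phase group delay mark syllable nuclei; window sizes and the boundary rule are illustrative, not the paper's settings:

```python
import numpy as np
from scipy.signal import find_peaks

def minphase_group_delay(mag):
    """Group delay of the minimum-phase sequence whose magnitude spectrum is
    `mag`: fold the real cepstrum to a causal one, then take the negative
    derivative of the resulting phase."""
    c = np.fft.ifft(np.log(mag + 1e-12)).real      # real cepstrum
    lifter = np.zeros(len(c))
    lifter[0] = 1.0
    lifter[1:len(c) // 2] = 2.0                    # causal (minimum-phase) folding
    theta = np.fft.fft(c * lifter).imag            # minimum-phase phase response
    return -np.diff(theta)                         # group delay = -d(theta)/d(omega)

def syllable_boundaries(y, sr, win=0.02, hop=0.01):
    wlen, hlen = int(win * sr), int(hop * sr)
    energy = np.array([np.sum(y[i:i + wlen] ** 2)
                       for i in range(0, len(y) - wlen, hlen)])
    # Treat the symmetrised energy contour as a magnitude spectrum; peaks of
    # its minimum-phase group delay align with syllable nuclei, so midpoints
    # between successive peaks are taken as segment boundaries (in samples).
    gd = minphase_group_delay(np.concatenate([energy, energy[::-1]]))[:len(energy)]
    nuclei, _ = find_peaks(gd)
    return [(nuclei[k] + nuclei[k + 1]) // 2 * hlen
            for k in range(len(nuclei) - 1)]
```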

2.
Tibetan has very limited resources for conventional automatic speech recognition: for some dialects it lacks sufficient data, sub-word units, lexicons, and word inventories. Moreover, most prior work has treated speech content recognition and dialect classification as two independent tasks and modeled them separately, even though the two tasks are highly correlated. In this paper, we present a multi-task WaveNet model that performs Tibetan multi-dialect speech recognition and dialect identification simultaneously. It avoids building a pronunciation dictionary and word segmentation for new dialects while allowing speech recognition and dialect identification to be trained in a single model. The experimental results show that our method can simultaneously recognize speech content for different Tibetan dialects and identify the dialect with high accuracy using a unified model. Including the dialect information in the training output improves multi-dialect speech recognition accuracy, and the low-resource dialects achieve higher content recognition rates and dialect classification accuracy with the multi-dialect, multi-task model than with task-specific models.
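A minimal sketch of the shared-trunk multi-task structure the abstract describes: a stack of dilated 1-D convolutions feeds a per-frame character head (for CTC training) and an utterance-level dialect head. All layer sizes, the character inventory, and the pooling choice are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class MultiTaskWaveNet(nn.Module):
    """Shared dilated-convolution trunk with two task heads (sketch)."""
    def __init__(self, n_mels=80, channels=128, n_chars=500, n_dialects=3):
        super().__init__()
        layers, in_ch = [], n_mels
        for d in (1, 2, 4, 8):                     # dilated "WaveNet-like" trunk
            layers += [nn.Conv1d(in_ch, channels, 3, dilation=d, padding=d),
                       nn.ReLU()]
            in_ch = channels
        self.trunk = nn.Sequential(*layers)
        self.char_head = nn.Conv1d(channels, n_chars + 1, 1)   # +1: CTC blank
        self.dialect_head = nn.Linear(channels, n_dialects)

    def forward(self, mels):                       # mels: (B, n_mels, T)
        h = self.trunk(mels)
        char_logits = self.char_head(h)            # (B, n_chars+1, T)
        dialect_logits = self.dialect_head(h.mean(dim=2))  # pooled over time
        return char_logits, dialect_logits

# Joint training (sketch): nn.CTCLoss on log-softmaxed char_logits permuted to
# (T, B, C), plus cross-entropy on dialect_logits, summed with a task weight.
```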

3.
Voiced-segment endpoint detection plays an important role in speech processing; it is needed in speech coding, speech recognition, and speech enhancement. Conventional methods that use short-time energy and zero-crossing rate as decision features cannot meet application requirements at low signal-to-noise ratios. This paper detects voiced endpoints based on the signal's formants and pitch period: the algorithm first extracts the first formant and the pitch period of the speech signal and uses them as the decision criteria for the onset and offset of voiced segments. Experiments show that, in noisy environments, this method achieves a higher detection accuracy than traditional energy-based endpoint detection or the endpoint detection algorithm in the AMR-WB standard.
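A toy version of a formant-plus-pitch voicing decision, not the paper's exact algorithm: all thresholds and search ranges below are illustrative assumptions, and `y` is assumed to be a float waveform (e.g. from librosa.load):

```python
import numpy as np
import librosa

def voiced_endpoints(y, sr, win=0.025, hop=0.010,
                     f0_range=(60, 400), f1_range=(200, 1000)):
    """Flag a frame as voiced when an autocorrelation pitch estimate falls in
    the F0 range AND an LPC root gives a plausible first formant; the first
    and last flagged frames give the endpoints (in samples)."""
    wlen, hlen = int(win * sr), int(hop * sr)
    flags = []
    for i in range(0, len(y) - wlen, hlen):
        frame = y[i:i + wlen] * np.hamming(wlen)
        # Pitch: highest autocorrelation peak inside the F0 search range.
        ac = np.correlate(frame, frame, mode="full")[wlen - 1:]
        lo, hi = int(sr / f0_range[1]), int(sr / f0_range[0])
        lag = lo + np.argmax(ac[lo:hi])
        periodic = ac[lag] > 0.3 * ac[0]
        # First formant: lowest-frequency LPC pole close to the unit circle.
        a = librosa.lpc(frame, order=10)
        roots = [r for r in np.roots(a) if r.imag > 0 and abs(r) > 0.8]
        cand = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)
        f1 = cand[0] if cand else 0.0
        flags.append(periodic and f1_range[0] < f1 < f1_range[1])
    idx = np.nonzero(flags)[0]
    return (idx[0] * hlen, idx[-1] * hlen + wlen) if len(idx) else None
```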

4.
Speech recognition offers a convenient control method for operators whose hands are occupied and for people with disabilities. This paper proposes a method that combines FF2 (second-order frequency filtering) and RASTA (RelAtive SpecTrAl) processing to improve the robustness of speech recognition, and applies it to the control system of a robotic nursing bed, improving recognition robustness in non-stationary noise environments such as hospitals and factories. An HMM/GMM recognizer using conventional Mel-frequency cepstral coefficient features is compared with an HMM/GMM recognizer using RASTA-FF2 features, tested on both clean and noisy speech. The results show that passing second-order frequency-filtered (FF2) features through a RASTA filter yields a higher recognition rate than the conventional system, especially under non-stationary noise. This indicates that combining FF2 features with RASTA filtering, one operating in the frequency domain and the other in the time domain, can effectively remove different noise components from the speech signal.
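A sketch of the two filtering steps named in the abstract, using the classic RASTA transfer function; the log filter-bank front end and frame bookkeeping are assumed:

```python
import numpy as np
from scipy.signal import lfilter

def rasta_ff2(log_fbank):
    """log_fbank: (frames, bands) array of log filter-bank energies.
    FF2 step: second-order frequency filtering, the FIR filter h = [1, 0, -1]
    applied along the frequency axis, which decorrelates adjacent bands.
    RASTA step: the classic band-pass IIR filter applied along the time axis
    of each band trajectory, suppressing slowly varying convolutional
    (channel) components and very fast frame-to-frame fluctuations."""
    padded = np.pad(log_fbank, ((0, 0), (1, 1)), mode="edge")
    ff2 = padded[:, 2:] - padded[:, :-2]              # h = [1, 0, -1] across frequency
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])   # RASTA numerator
    a = np.array([1.0, -0.98])                        # RASTA denominator
    return lfilter(b, a, ff2, axis=0)                 # filter along time
```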

5.
朱敏, 姜芃旭, 赵力. 《声学技术》, 2021, 40(5): 645-651
Speech emotion recognition is an active research area in human-computer interaction. However, the time-frequency correlations in speech have received little attention, so emotional information is not mined deeply enough. To better exploit these correlations, this paper proposes a fully convolutional recurrent neural network model that combines different sub-models through parallel multi-input, extracting complementary features from two modules. A fully convolutional network (FCN) learns the time-frequency correlations in the spectrogram features, while a long short-term memory (LSTM) network learns frame-level speech features to supply the temporal information missing from the FCN branch. The fused features are then passed to a classifier. Tests on two public emotion datasets verify the superiority of the proposed algorithm.
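A minimal sketch of the parallel two-branch structure described above; the class name and all layer sizes are assumptions, not the paper's network:

```python
import torch
import torch.nn as nn

class FCNLSTMSER(nn.Module):
    """Parallel FCN (spectrogram) + LSTM (frame features) branches whose
    embeddings are concatenated and classified (sketch)."""
    def __init__(self, n_frame_feats=39, n_emotions=4):
        super().__init__()
        self.fcn = nn.Sequential(                       # input: (B, 1, freq, time)
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))                    # global pooling, no FC layers
        self.lstm = nn.LSTM(n_frame_feats, 64, batch_first=True)
        self.cls = nn.Linear(64 + 64, n_emotions)

    def forward(self, spec, frame_feats):
        a = self.fcn(spec).flatten(1)                   # (B, 64) spectrogram embedding
        _, (h, _) = self.lstm(frame_feats)              # frame_feats: (B, T, 39)
        return self.cls(torch.cat([a, h[-1]], dim=1))   # fused, then classified
```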

6.
Research on dereverberation of speech signals in the complex cepstrum domain
Speech dereverberation can markedly improve the performance of speech communication and recognition systems. This paper briefly reviews the principle of complex-cepstrum dereverberation and studies the filtering characteristics of cepstral-domain dereverberation by simulation. Guided by several dereverberation quality measures, the parameters of the cepstral-domain "low-pass filter" (its highest cut-off point, transition bandwidth, and the shape of the transition curve) are determined. Within the usual range of reverberation times, the highest cut-off point of the "low-pass filter" is found to be independent of the reverberation time, and applying a Gaussian window before cepstral filtering improves the dereverberation result.
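A minimal sketch of cepstral-domain low-pass liftering with a raised-cosine transition band, where `cutoff` and `trans` (in quefrency samples) stand in for the paper's cut-off point and transition bandwidth, and the optional Gaussian pre-window follows the paper's observation that it helps:

```python
import numpy as np

def cepstral_lowpass_dereverb(x, cutoff, trans, sigma=None):
    """Keep the low-quefrency part of the complex cepstrum (vocal tract and
    direct path), attenuate higher quefrencies where reverberation echoes
    concentrate, and invert. Assumes cutoff + trans < len(x) // 2."""
    if sigma is not None:                             # optional Gaussian pre-window
        n = np.arange(len(x)) - len(x) / 2
        x = x * np.exp(-0.5 * (n / sigma) ** 2)
    X = np.fft.fft(x)
    cep = np.fft.ifft(np.log(np.abs(X) + 1e-12)       # complex cepstrum:
                      + 1j * np.unwrap(np.angle(X)))  # log|X| + j*unwrapped phase
    lift = np.zeros(len(x))
    lift[:cutoff] = 1.0                               # pass band
    lift[cutoff:cutoff + trans] = 0.5 * (1 + np.cos(np.linspace(0, np.pi, trans)))
    lift = np.maximum(lift, lift[::-1])   # mirror: negative quefrencies wrap to the end
    cep *= lift
    return np.fft.ifft(np.exp(np.fft.fft(cep))).real  # cepstrum -> spectrum -> time
```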

7.
Speech perception is one of the main topics of phonetics. To understand how learners perceive Mandarin vowels and consonants, provide guidance for Chinese teaching, extend the application of the Speech Learning Model, and reduce blind spots in instruction, identification and discrimination experiments were designed following second-language acquisition theory, using methods from experimental phonetics and statistics. The perception of vowels, stops, fricatives, and affricates was studied in 20 Uyghur university students at advanced and elementary levels of Mandarin. The identification experiment measured reaction time and identification accuracy for vowels and consonants. In the discrimination experiment, the spectral distances of vowel pairs and consonant pairs and their duration differences were computed to analyze the learners' discrimination ability. The results show that advanced learners perceive vowels and consonants markedly better than elementary learners. Learners respond faster and more accurately to back vowels than to front vowels, identify affricates most accurately, and identify fricatives least accurately. The spectral distance of vowel pairs and the duration difference of consonant pairs affect discrimination ability, but no close relationship is found between the spectral distance of consonant pairs and discrimination performance.

8.
Phillips RL, Andrews LC. Applied Optics, 1983, 22(23): 3833-3836
Atmospheric turbulence causes random fading in any open-air optical communication channel. When the transmitted message is in the form of a block code, a binary union decoding system consisting of one storage register can be used to enhance the reliability of this type of fading channel. To illustrate the effectiveness of the binary union decoder, we compare the probabilities of detecting one word out of N + 1 received words for a binary union decoder and for a simple word-recognition decoder. Finally, the binary union decoder is analyzed for three different fading conditions of the channel, corresponding to atmospheric turbulence typical of weak, moderate, and superstrong scattering. Our findings show that the worst channel conditions for an optical communication system occur when the turbulence is moderate.
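A Monte Carlo sketch under a deliberately simplified erasure model (independent per-slot pulse loss, an all-ones codeword), not the paper's turbulence fading statistics; it only illustrates why OR-accumulating N + 1 received words in one register can beat recognizing any single word:

```python
import numpy as np

rng = np.random.default_rng(0)

def detect_prob(n_words=8, word_len=32, p_miss=0.3, trials=20000):
    """Toy model: each of the n_words received copies of a word loses each
    pulse independently with probability p_miss (True = pulse survived).
    - simple recognizer: succeeds if ANY single received copy is intact;
    - binary union decoder: ORs all copies into one storage register and
      succeeds if the accumulated word is intact."""
    words = rng.random((trials, n_words, word_len)) > p_miss
    p_simple = words.all(axis=2).any(axis=1).mean()
    p_union = words.any(axis=1).all(axis=1).mean()
    return p_simple, p_union

print(detect_prob())  # union decoding recovers words no single copy delivers
```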

9.
A multistage decoder is considered for the recently introduced internally convolutionally coded fibre-optic time-hopping code-division multiple-access system. In this system, the decoder consists of several stages. The first stage is implemented using one of the single-user decoders introduced previously; the following stages are maximum-likelihood (ML) decoders, each of which uses the decisions made by the previous stage. The performance of the proposed decoder is evaluated by Monte Carlo simulation. Numerical results show that a multistage decoder with only two stages greatly outperforms the single-stage decoder with a negligible increase in complexity. The authors also derive the Chernoff bound for the ML decoder with known interference, which is the ultimate performance of the multistage decoder.

10.
The author discusses the applications of voice technology in modern automobiles. A major goal of this technology is to add artificial intelligence to the instruments of the vehicle to improve its safety and reliability. Two aspects of speech technology are discussed: speech synthesis and speech recognition. Speech synthesis is mainly used to generate verbal warning messages whenever a problem is detected in the vehicle; verbal messages can also advise the driver to perform certain tasks to improve the automobile's performance. The basic goal of the second application, speech recognition, is to create hands-free control of some of the automobile functions, taking direction from the driver's voice.

11.
In the field of information security, a gap exists in the study of coreference resolution of entities. A hybrid method is proposed to solve this problem. The work consists of two parts: the first extracts all candidates (noun phrases, pronouns, entities, and nested phrases) from a given document and classifies them; the second performs coreference resolution on the selected candidates. In the first part, a method combining rules with a deep learning model (Dictionary BiLSTM-Attention-CRF, or DBAC) is proposed to extract and classify all candidates in the text. The DBAC model introduces a domain-dictionary matching mechanism that derives new features for words and their contexts from the dictionary. In this way, full use is made of the entities and entity-type information contained in the domain dictionary, which helps with the recognition of both rare and long entities. In the second part, candidates are divided by part of speech into pronoun candidates and noun-phrase candidates; coreference resolution of pronoun candidates is handled by hand-crafted rules, and that of noun-phrase candidates by machine learning. Finally, a dataset is created with which to evaluate the methods on information security data. The experimental results show that the proposed model outperforms the baseline models.
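A sketch of the dictionary-matching mechanism as a greedy longest match that emits a BIO-style feature per token for the downstream sequence model; the key format of `domain_dict` (lowercase, space-joined token strings mapped to entity types) is an assumption:

```python
def dictionary_features(tokens, domain_dict):
    """Greedy longest match of token spans against a domain dictionary,
    e.g. domain_dict = {"buffer overflow": "VULNERABILITY"}.
    Returns one feature tag per token: B-TYPE / I-TYPE / O."""
    max_len = max((len(k.split()) for k in domain_dict), default=1)
    feats = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            cand = " ".join(tokens[i:i + span]).lower()
            if cand in domain_dict:
                etype = domain_dict[cand]
                feats[i] = "B-" + etype            # span start
                for j in range(i + 1, i + span):
                    feats[j] = "I-" + etype        # span continuation
                i += span
                break
        else:
            i += 1
    return feats
```

These tags would be embedded and concatenated with the word embeddings before the BiLSTM layer, which is one common way to realize the mechanism the abstract names.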

12.
Automatic recognition of human emotions in a continuous dialog model remains challenging, since a speaker's utterance may include several sentences that do not all carry a single emotion. Only limited work on standalone speech emotion recognition (SER) systems for continuous speech has been reported, whereas over the recent decade various effective SER systems have been proposed for discrete speech, i.e., short speech phrases. It would be more helpful if these systems could also recognize emotions from continuous speech. However, applied directly to continuous speech, their recognition performance falls short of that achieved for discrete speech, due to the mismatch between the training data (from discrete training speech) and the test data (from continuous speech). The problem can be alleviated by enhancing an existing SER system for discrete speech. Thus, in this work the authors' existing SER system for multilingual and mixed-lingual discrete speech is enhanced by enriching the cepstral feature set with bi-spectral speech features and a functional set of Mel-frequency cepstral coefficient features derived from a sine filter bank. Data augmentation is applied to counter the skew of the SER system toward certain emotions, and classification is performed with a random forest. The enhanced system predicts emotions from continuous speech using a uniform segmentation method. Due to data scarcity, audio samples of discrete speech from the SAVEE database, recorded in English, are concatenated into multi-emotional speech samples. Anger, fear, sadness, and neutrality, emotions that are vital during the initial investigation of mentally disordered individuals, are selected to build six categories of multi-emotional samples. Experimental results demonstrate the suitability of the proposed method for recognizing emotions from continuous as well as discrete speech.
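A sketch of the uniform-segmentation step, with plain mean MFCCs standing in for the paper's enriched cepstral/bi-spectral feature set; the helper name and the 2-second segment length are assumptions:

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

# clf = RandomForestClassifier(n_estimators=300).fit(X_train, y_train)
# (trained on discrete-speech feature vectors with emotion labels)

def segment_emotions(clf, y, sr, seg_sec=2.0):
    """Slice continuous speech into fixed-length segments, extract one
    feature vector per segment, and let the trained random forest label
    each segment with an emotion."""
    seg = int(seg_sec * sr)
    starts = range(0, max(len(y) - seg + 1, 1), seg)
    feats = [librosa.feature.mfcc(y=y[i:i + seg], sr=sr, n_mfcc=13).mean(axis=1)
             for i in starts]
    return clf.predict(np.vstack(feats))    # one emotion label per segment
```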

13.
In recent years, classification based on sparse representation has had some success in pattern recognition. In this framework, dictionary learning and classifier training are usually two independent modules, which lowers recognition accuracy. To address this, an improved discriminative dictionary-learning model that integrates feature extraction and pattern recognition is proposed: the reconstruction-error term, a sparse-coding discriminative term, and the classification-error term are combined into one objective, which is optimized with the K-SVD algorithm so that the dictionary and the classifier are learned jointly. The method first applies empirical mode decomposition (EMD) to the raw signal and extracts time- and frequency-domain features from the resulting intrinsic mode functions to form fault samples; the training samples are then fed into the improved model and optimized by K-SVD; finally, the learned dictionary and classifier weights are used to classify test samples. Experimental results show that the algorithm not only suits small-sample fault diagnosis problems but also clearly surpasses other algorithms in robustness and classification performance.
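A sketch of the EMD feature-extraction stage using the PyEMD package; the specific statistics below are common choices for fault samples, not necessarily the paper's exact feature list:

```python
import numpy as np
from PyEMD import EMD
from scipy.stats import kurtosis

def imf_features(x, sr, n_imfs=5):
    """Decompose the raw signal with EMD and take simple time/frequency
    statistics from the first few intrinsic mode functions (IMFs) to form
    one fault sample (feature vector)."""
    imfs = EMD()(x)[:n_imfs]
    feats = []
    for imf in imfs:
        spec = np.abs(np.fft.rfft(imf))
        dom = np.argmax(spec) * sr / len(imf)     # dominant frequency of the IMF
        feats += [np.sqrt(np.mean(imf ** 2)),     # RMS (time domain)
                  kurtosis(imf),                  # impulsiveness (time domain)
                  dom,                            # frequency domain
                  np.sum(spec ** 2) / len(spec)]  # mean spectral energy
    return np.array(feats)
```

The resulting vectors would then be sparse-coded over the jointly learned dictionary, with the classifier weights applied to the sparse codes.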

14.
This paper describes a natural-language interface system that uses spoken natural language to control a space telerobot manipulator for remote operation. The system is built around isolated-word speech recognition, with emphasis on reliability and practicality. During development, several factors were taken into account: the auditory-perception characteristics of human speech interaction, the one-character-one-syllable property of Chinese, the fact that the acoustic model used in actual recognition need not strictly match its linguistic model, and the influence of environmental noise on system performance. Transition segments plus final (rhyme) segments are adopted as the recognition units, and a multi-layer recognition strategy is used. Recognition and simulation experiments show that the system reaches the expected performance.

15.
Robust speech recognition has important applications in human-computer interaction, smart homes, and speech translation systems. To improve recognition performance in complex acoustic environments with noise and interfering speech, a robust speech recognition algorithm based on binaural source separation and missing-data techniques is proposed, motivated by the masking and cocktail-party effects of the human auditory system and exploiting the spatial directions of different sources. Using the spatial direction of the target speech, the algorithm first separates the mixture within equivalent rectangular bandwidth (ERB) subbands of the binaural signals to obtain the target speech data stream. Because the separated target speech has missing spectral data in the frequency domain, missing-data techniques are used to modify the probability computation of the hidden-Markov-model-based recognizer before recognition. Simulations show that, since the target speech obtained by binaural separation is free of the noise and interference, the proposed algorithm significantly improves recognition performance in complex acoustic environments.
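A sketch of the missing-data likelihood modification for one diagonal-covariance Gaussian-mixture state: dimensions the separation mask marks unreliable are marginalized (dropped) rather than scored. This is the standard marginalization variant of the technique; the paper's exact formulation is not reproduced here:

```python
import numpy as np

def marginal_log_likelihood(x, reliable, weights, means, variances):
    """Log-likelihood of spectral observation x under a diagonal-covariance
    GMM state, evaluated only over the dimensions flagged reliable by the
    binaural separation mask."""
    r = np.asarray(reliable, dtype=bool)
    comps = [np.log(w) - 0.5 * np.sum(np.log(2 * np.pi * var[r])
                                      + (x[r] - mu[r]) ** 2 / var[r])
             for w, mu, var in zip(weights, means, variances)]
    return np.logaddexp.reduce(comps)   # log-sum-exp over mixture components
```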

16.
Speaker separation in complex acoustic environments is one of the challenging tasks in speech separation. In practice, speakers often remain still or move only slowly during normal communication. In this case, the spatial features of consecutive speech frames become highly correlated, which aids speaker separation by providing additional spatial information. To fully exploit this information, we design a separation system on a recurrent neural network (RNN) with long short-term memory (LSTM), which effectively learns the temporal dynamics of the spatial features. Specifically, the proposed LSTM-based speaker separation algorithm extracts the spatial features in each time-frequency (TF) unit to form a feature vector, and speaker separation is treated as a supervised learning problem in which a modified ideal ratio mask (IRM) serves as the training target during LSTM learning. Simulations show that the proposed system achieves attractive separation performance in noisy and reverberant environments. In particular, in untrained acoustic tests with limited priors, e.g., unmatched signal-to-noise ratio (SNR) and reverberation, the proposed LSTM-based algorithm still outperforms an existing DNN-based method in PESQ and STOI, indicating that our method is more robust in untrained conditions.
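A minimal sketch of the supervised setup: an LSTM maps per-frame spatial feature vectors to a ratio mask per TF unit. Layer sizes, the feature choice (stacked interaural cues), and the mask definition in the comment are illustrative assumptions:

```python
import torch
import torch.nn as nn

class IRMEstimator(nn.Module):
    """LSTM mask estimator: spatial features in, per-TF-unit mask out."""
    def __init__(self, n_feats, n_bins, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, num_layers=2, batch_first=True)
        self.out = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, feats):           # feats: (B, frames, n_feats)
        h, _ = self.lstm(feats)
        return self.out(h)              # mask in [0, 1]: (B, frames, n_bins)

# Training sketch: minimize MSE between the predicted mask and a modified IRM
# target, e.g. (|S|^2 / (|S|^2 + |N|^2)) ** 0.5 per TF unit.
```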

17.
Application of the Hilbert marginal spectrum in speech emotion recognition
谢珊, 曾以成, 蒋阳波. 《声学技术》, 2009, 28(2): 148-152
Emotional speech is processed with the Hilbert-Huang transform (HHT) to obtain its marginal spectrum. Comparing the marginal spectra of speech in four emotional states (happy, angry, disgusted, and neutral), four feature sets are proposed for emotion recognition: sub-band energy (SE), the first difference of sub-band energy (DSE), sub-band energy cepstral coefficients (SECC), and the first difference of SECC (DSECC). Used for speaker-independent, text-independent speech emotion recognition, they achieve a recognition rate of up to 90%, 22 percentage points higher than Fourier-based Mel-frequency cepstral coefficients (MFCC). The results show that features based on the HHT marginal spectrum capture the emotional information in speech well.
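A sketch of the four features, assuming the PyEMD package for the decomposition and a linear band layout; the paper's band layout may differ:

```python
import numpy as np
from PyEMD import EMD
from scipy.signal import hilbert
from scipy.fftpack import dct

def marginal_spectrum_features(x, sr, n_bands=24):
    """SE / DSE / SECC / DSECC from the Hilbert marginal spectrum: decompose
    with EMD, accumulate instantaneous amplitude over instantaneous frequency
    to form the marginal spectrum, then take sub-band energies and their
    cepstral (DCT) transform."""
    freqs, amps = [], []
    for imf in EMD()(x):
        z = hilbert(imf)
        phase = np.unwrap(np.angle(z))
        freqs.append(np.diff(phase) * sr / (2 * np.pi))  # instantaneous frequency
        amps.append(np.abs(z)[1:])                       # instantaneous amplitude
    f, a = np.concatenate(freqs), np.concatenate(amps)
    ms, _ = np.histogram(f, bins=n_bands, range=(0, sr / 2), weights=a ** 2)
    se = ms / (ms.sum() + 1e-12)                   # sub-band energy (SE)
    dse = np.diff(se)                              # first difference (DSE)
    secc = dct(np.log(se + 1e-12), norm="ortho")   # cepstral coefficients (SECC)
    dsecc = np.diff(secc)                          # first difference (DSECC)
    return np.concatenate([se, dse, secc, dsecc])
```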

18.
毛维, 曾庆宁, 龙超. 《声学技术》, 2018, 37(3): 253-260
To address the marked drop in recognition performance in complex noise environments, a dual-microphone-array speech enhancement algorithm is proposed for the front end of a speaker recognition system. The algorithm combines coherence filtering with a frequency-domain wideband minimum variance distortionless response (MVDR) beamformer, followed by an improved Wiener filter. It first computes the coherence function between two adjacent channels of the dual-microphone array signal and uses the inter-channel coherence for initial noise suppression. The frequency-domain wideband MVDR beamformer then preserves the signal from the target source direction while suppressing interference from other directions, and the improved Wiener filter removes residual noise to raise speech quality. Finally, Mel-frequency cepstral coefficients (MFCC) and gammatone filter-bank frequency cepstral coefficients (GFCC) are extracted from the enhanced speech for speaker recognition. In simulations using a binaural acoustic manikin to collect the data, the algorithm achieves good enhancement in complex noise environments and effectively raises the recognition rate of the speaker recognition system.
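A sketch of the per-frequency-bin MVDR core of the beamforming stage, with an assumed far-field steering model for the two-microphone pair; the coherence and Wiener stages are not reproduced here:

```python
import numpy as np

def steering_vector(freq, mic_dist, angle, c=343.0):
    """Far-field steering vector for a two-microphone pair (assumed geometry:
    `angle` in radians measured from the array axis, `mic_dist` in metres)."""
    tau = mic_dist * np.cos(angle) / c
    return np.array([1.0, np.exp(-2j * np.pi * freq * tau)])

def mvdr_weights(R_noise, d):
    """MVDR weights w = R^{-1} d / (d^H R^{-1} d): unit gain toward the
    target direction d, minimum output power otherwise. A small diagonal
    load keeps the 2x2 noise covariance invertible."""
    Ri_d = np.linalg.solve(R_noise + 1e-6 * np.eye(len(d)), d)
    return Ri_d / (d.conj() @ Ri_d)

# Per bin f: y[f] = mvdr_weights(R[f], steering_vector(f_hz, 0.15, theta)).conj() @ x[f]
```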

19.
Objective: To improve the accuracy of continuous sign language recognition and ease the communication barrier between hearing-impaired and hearing people. Methods: A continuous sign language recognition algorithm based on a global attention mechanism and LSTM is proposed. The video data are preprocessed with inter-frame differencing to remove redundant frames, and a ResNet extracts the feature sequence. Attention weighting yields global sign-state features, and an LSTM performs the temporal analysis, together forming the recognition algorithm. Results: On the Chinese continuous sign language dataset CSL, the algorithm achieves an average recognition rate of 90.08% and an average word error rate of 41.2%; compared with five other algorithms, it has advantages in recognition accuracy and translation performance. Conclusion: The proposed algorithm realizes continuous sign language recognition with good accuracy and translation performance, and is of positive significance for the barrier-free integration of hearing-impaired people into society.
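A sketch of the inter-frame-difference preprocessing that removes redundant video frames before ResNet feature extraction; the threshold value is illustrative:

```python
import numpy as np

def drop_redundant_frames(frames, thresh=12.0):
    """Keep a frame only if its mean absolute difference from the last kept
    frame exceeds a threshold (in grey-level units), discarding
    near-duplicate frames in the sign video."""
    kept = [frames[0]]
    for f in frames[1:]:
        if np.mean(np.abs(f.astype(float) - kept[-1].astype(float))) > thresh:
            kept.append(f)
    return np.stack(kept)   # reduced frame sequence for feature extraction
```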

20.
Speech and image coding are core technologies of modern digital communication, and they are usually handled separately. This paper proposes a speech coding algorithm that uses a two-dimensional DCT spectrogram together with wavelet-based image coding. The key step in implementing the algorithm, forming the 2-D DCT spectrogram of the speech signal, is analyzed in depth, and the coding algorithm is implemented in simulation. For one speech passage, 83.1941% of the wavelet decomposition coefficients are set to zero while 92.0867% of the energy is preserved after compression, and the synthesized speech is rated "excellent…
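A sketch of the coding pipeline under stated assumptions (db4 wavelet, 3 decomposition levels, a keep-ratio chosen to zero roughly the reported share of coefficients); the paper's frame size and wavelet are not given here:

```python
import numpy as np
import pywt
from scipy.fftpack import dct, idct

def code_speech(y, frame=256, keep=0.17):
    """Reshape speech frames into a 2-D 'spectrogram image' with a 2-D DCT,
    wavelet-decompose that image, zero the smallest coefficients (keep ~17%,
    i.e. zero ~83% as the abstract reports), and reconstruct."""
    n = (len(y) // frame) * frame
    img = y[:n].reshape(-1, frame)
    spec = dct(dct(img, axis=0, norm="ortho"), axis=1, norm="ortho")  # 2-D DCT
    arr, slices = pywt.coeffs_to_array(pywt.wavedec2(spec, "db4", level=3))
    t = np.quantile(np.abs(arr), 1 - keep)       # threshold: keep largest coeffs
    arr[np.abs(arr) < t] = 0.0
    coeffs = pywt.array_to_coeffs(arr, slices, output_format="wavedec2")
    spec_hat = pywt.waverec2(coeffs, "db4")[:spec.shape[0], :spec.shape[1]]
    rec = idct(idct(spec_hat, axis=1, norm="ortho"), axis=0, norm="ortho")
    return rec.reshape(-1)                       # synthesized speech samples
```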
