Similar Documents (20 results)
1.
Whispered speech is a mode of speaking in which the vocal folds vibrate only slightly or not at all. Building on a previously collected whispered-speech corpus, this paper computes the formant positions and bandwidths of normal and whispered speech, derives the corresponding ratios of change between the two, and summarizes the basic formant characteristics of whispered speech.
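
Formant frequencies and bandwidths of the kind compared here are conventionally estimated from the roots of an LPC polynomial. The following is a minimal sketch of that standard technique, not the authors' exact procedure; the pre-emphasis coefficient, model order, and the 90 Hz / 400 Hz pruning thresholds are illustrative assumptions (Python with numpy/scipy):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_formants(frame, fs, order=12):
    """Estimate formant frequencies/bandwidths (Hz) of one voiced frame
    from the roots of an LPC polynomial (autocorrelation method)."""
    # Pre-emphasis, then a Hamming window
    x = np.append(frame[0], frame[1:] - 0.97 * frame[:-1]) * np.hamming(len(frame))
    # Yule-Walker equations: Toeplitz autocorrelation system R a = r
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    roots = np.roots(np.concatenate(([1.0], -a)))   # A(z) = 1 - sum_k a_k z^-k
    roots = roots[np.imag(roots) > 0]               # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)
    bws = -np.log(np.abs(roots)) * fs / np.pi       # 3 dB bandwidth estimate
    keep = (freqs > 90) & (bws < 400)               # drop spurious broad/low roots
    idx = np.argsort(freqs[keep])
    return freqs[keep][idx], bws[keep][idx]
```

Running this frame-by-frame on paired normal and whispered recordings yields per-formant frequency and bandwidth ratios of the kind the abstract describes.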

2.
The authors take a modeling approach to studying representation of formant frequencies of spoken speech and speech in noise in the temporal responses of the peripheral auditory system. On the basis of the properties of the representation, they have devised and evaluated a cross-channel correlation algorithm and an interpeak interval analysis for automatic formant extraction of speech that is strongly dynamic in acoustic characteristics and embedded in noise. The basilar membrane model used in this study contains laterally coupled damping elements, which are made monotonically dependent on the spatial distribution of the short-term power in the outputs of the model. Efficient digital implementation and the related salient numerical properties of the model are described. Simulation results from the model in response to speech and speech in noise illustrate temporal response patterns that are tonotopically organized in relation to speech formant parameters, with little influence by the noise level. By exploiting these relations, the cross-channel correlation algorithm is shown to be capable of accurately tracking formant movements in spoken syllables and sentences.
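
The cross-channel correlation idea can be illustrated without the full coupled basilar-membrane model: channels of a band-pass filter bank that straddle a formant respond coherently, so adjacent-channel correlation peaks near formant frequencies. A toy sketch with an ordinary Butterworth bank standing in for the cochlear model (all parameters are assumptions):

```python
import numpy as np
from scipy.signal import butter, lfilter

def cross_channel_correlation(x, fs, n_ch=32, frame=400, hop=160):
    """Short-time correlation between adjacent band-pass channels:
    high values at band k suggest a formant near that band's frequency."""
    edges = np.logspace(np.log10(100), np.log10(min(5000, fs / 2 - 1)), n_ch + 1)
    chans = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        chans.append(lfilter(b, a, x))
    chans = np.asarray(chans)                       # (n_ch, n_samples)
    n_frames = 1 + (chans.shape[1] - frame) // hop
    corr = np.zeros((n_ch - 1, n_frames))
    for t in range(n_frames):
        seg = chans[:, t * hop:t * hop + frame]
        seg = seg - seg.mean(axis=1, keepdims=True)
        num = (seg[:-1] * seg[1:]).sum(axis=1)
        den = np.sqrt((seg[:-1] ** 2).sum(axis=1) * (seg[1:] ** 2).sum(axis=1)) + 1e-12
        corr[:, t] = num / den
    return corr, edges   # ridges of corr over time trace formant movements
```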

3.
A high-performance algorithm for Mandarin continuous-digit speech recognition that incorporates prosodic information is proposed. The algorithm is based on continuous hidden Markov models (CHMMs), uses Mel-frequency cepstral coefficients (MFCCs) as the primary acoustic features, and exploits prosody to segment continuous digit strings precisely and to distinguish easily confused digits. A two-stage recognition framework raises the recognition rate: the first stage segments the digit string, recognizes the digits, and outputs candidate results; the second stage identifies easily confused digit pairs among the candidates and uses prosodic information to select the correct one. Experiments show a large improvement in Mandarin continuous-digit recognition accuracy.
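
A minimal numpy/scipy sketch of the MFCC front end named as the primary feature here (textbook pipeline; frame sizes and filter counts are illustrative, and the prosody and HMM stages are not shown):

```python
import numpy as np
from scipy.fft import dct

def mfcc(x, fs, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Frames -> power spectrum -> triangular mel filter bank -> log -> DCT."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop] * np.hamming(n_fft)
    pspec = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel-spaced triangular filters
    pts = imel(np.linspace(mel(0), mel(fs / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(pspec @ fb.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_ceps]
```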

4.
Sinusoidal-model feature analysis and auditory identification of Mandarin speech (cited by 1: 0 self, 1 other)
张毅楠  肖熙 《电声技术》2011,35(8):38-41
To study the acoustic features of Mandarin speech, a sinusoidal model of the speech signal is applied to feature extraction and analysis. Applying a peak-matching algorithm to the model parameters yields a sinusoidal-model spectrogram that directly displays the fundamental frequency and the formants, including their fine structure and evolution, providing a visualization tool for speech signal analysis. On this basis, the first two formants of Mandarin single-vowel syllables are analyzed; under the control of a small number of dominant…
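
The peak-matching step of sinusoidal-model analysis can be sketched as per-frame spectral peak picking followed by nearest-frequency linking across frames, with track births and deaths; plotting the resulting tracks gives a sinusoidal-model "spectrogram" of the sort described. This is a generic McAulay-Quatieri-style illustration, not the paper's algorithm; the 5% height and 60 Hz jump thresholds are assumptions:

```python
import numpy as np
from scipy.signal import stft, find_peaks

def sinusoidal_tracks(x, fs, nperseg=512, max_jump=60.0):
    """Link STFT peaks into frequency tracks by nearest-frequency matching."""
    f, t, Z = stft(x, fs, nperseg=nperseg)
    mag = np.abs(Z)
    tracks, active = [], []                     # track = list of (time, freq, amp)
    for j in range(mag.shape[1]):
        pk, _ = find_peaks(mag[:, j], height=mag[:, j].max() * 0.05)
        freqs, amps = list(f[pk]), list(mag[pk, j])
        survivors = []
        for tr in active:
            if freqs:
                i = int(np.argmin([abs(q - tr[-1][1]) for q in freqs]))
                if abs(freqs[i] - tr[-1][1]) < max_jump:
                    tr.append((t[j], freqs.pop(i), amps.pop(i)))
                    survivors.append(tr)        # track continues
                    continue
            tracks.append(tr)                   # death: no continuation found
        for q, a in zip(freqs, amps):           # births from leftover peaks
            survivors.append([(t[j], q, a)])
        active = survivors
    return tracks + active
```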

5.
An improved hidden Markov model for speech recognition (cited by 7: 1 self, 6 others)
The hidden Markov model widely used in speech recognition is a first-order Markov model and therefore cannot fully describe the temporal dependence of a speech signal. Although the HMM can in principle be extended to a higher-order Markov model, the exponential growth in computation and storage makes that impractical. This paper therefore proposes a new model that combines an HMM with a multi-dimensional Gaussian density function capable of describing the temporal dependence of speech, and argues theoretically for the new model's soundness. Recognition experiments on all 409 Mandarin monosyllables (ignoring tone) show that the new model's recognition rate…
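
For reference, the baseline being improved is the standard first-order Gaussian-emission HMM; a minimal Viterbi decoder for that baseline (textbook code, not the paper's extended model):

```python
import numpy as np

def viterbi_gaussian_hmm(obs, pi, A, means, covs):
    """Most likely state path for a Gaussian-emission HMM.
    obs: (T, d); pi: (N,); A: (N, N); means: (N, d); covs: (N, d, d)."""
    T, d = obs.shape
    N = len(pi)
    logB = np.zeros((T, N))
    for s in range(N):                           # per-state Gaussian log-likelihoods
        inv = np.linalg.inv(covs[s])
        _, logdet = np.linalg.slogdet(covs[s])
        diff = obs - means[s]
        logB[:, s] = -0.5 * (d * np.log(2 * np.pi) + logdet
                             + np.einsum("ti,ij,tj->t", diff, inv, diff))
    delta = np.log(pi) + logB[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)      # scores[i, j]: state i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                # backtrack
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```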

6.
关存太  陈永彬 《电子学报》1995,23(12):52-58
This paper presents a very-low-rate speech coding system operating at 60 b/s: a Mandarin recognition vocoder. In experiments on 32 sentences totalling 267 syllables, the average syllable recognition rate was 74.14% and the average sentence intelligibility was 91.9%. The system architecture is described and experimental results are given.

7.
赵毅  尹雪飞  陈克安 《信号处理》2012,28(3):352-360
Formants are an important feature of the speech signal and matter greatly for improving speech recognition rates in hearing-impaired listeners. However, the loudness compensation algorithms commonly used in digital hearing aids (multichannel loudness compensation and wide dynamic range compression) both damage the formant structure to some degree, which hampers a patient's ability to understand speech. Combining formant detection with compensation, this paper proposes a multichannel loudness compensation algorithm based on formant extraction: on top of conventional multichannel loudness compensation, the filter bank is redesigned and a formant-extraction module is added to protect the formants. Simulations show that the algorithm achieves satisfactory compensation for four common types of impaired ears and, compared with multichannel loudness compensation and wide dynamic range compression, better preserves the integrity of the formant structure.
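
The wide-dynamic-range compression baseline the authors compare against can be sketched as envelope-dependent gain applied per band; the band edges, compression ratio, and target level below are illustrative, and the paper's formant-extraction/protection module is not shown:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def wdrc(x, fs, edges=(200, 500, 1000, 2000, 4000), ratio=2.0, target_db=-20.0):
    """Per-band static compression: soft sounds are amplified, loud sounds
    attenuated, relative to a target level -- the kind of per-channel gain
    the abstract says can distort formant structure."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band", output="sos")
        band = sosfilt(sos, x)
        # Short-term envelope in dB, then a static compression gain
        env = np.convolve(np.abs(band), np.ones(256) / 256, mode="same") + 1e-9
        lev = 20 * np.log10(env)
        gain_db = (target_db + (lev - target_db) / ratio) - lev
        out += band * 10 ** (gain_db / 20)
    return out
```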

8.
To overcome the slow processing, computational complexity, and difficult operation of traditional helium speech processing techniques, a machine-learning approach to helium speech recognition is proposed: a deep network learns high-dimensional information and extracts multiple features, avoiding overfitting while offering a low word error rate (WER) and fast convergence. Databases of isolated-word and continuous helium speech were first built; after preprocessing, the extracted features include formant features, pitch-period features, and filter-bank (FBank) features. These features are fed into an acoustic model composed of a deep convolutional neural network (DCNN) and connectionist temporal classification (CTC) to map speech to Pinyin, and a Transformer language model then produces the Chinese-character output. Compared with a model using FBank features alone, adding formant and pitch-period features reduced the WER of the isolated-word model by 7.91% and that of the continuous model by 14.95%. The best WER was 1.53% for isolated words and 36.89% for continuous helium speech. The results show that the proposed method recognizes helium speech effectively.
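
A toy PyTorch rendering of the DCNN + CTC acoustic model described (layer sizes and the pinyin-token inventory size are assumptions; the Transformer language-model stage is omitted):

```python
import torch
import torch.nn as nn

class DCNNCTC(nn.Module):
    """Conv layers over (frequency, time) features, then per-frame pinyin logits."""
    def __init__(self, n_feats=80, n_tokens=1200):  # n_tokens: assumed pinyin set + CTC blank
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.fc = nn.Linear(32 * (n_feats // 4), n_tokens)

    def forward(self, feats):                              # feats: (batch, time, n_feats)
        z = self.conv(feats.transpose(1, 2).unsqueeze(1))  # (B, 32, F/4, T)
        z = z.permute(0, 3, 1, 2).flatten(2)               # (B, T, 32*F/4)
        return self.fc(z).log_softmax(-1)                  # CTC expects log-probs

# Training-step sketch: CTC aligns frame logits to pinyin label sequences.
model, ctc = DCNNCTC(), nn.CTCLoss(blank=0)
x = torch.randn(4, 200, 80)                    # 4 utterances, 200 frames, 80-dim features
logp = model(x).transpose(0, 1)                # (T, B, n_tokens) for CTCLoss
labels = torch.randint(1, 1200, (4, 30))       # dummy pinyin-token targets
loss = ctc(logp, labels,
           torch.full((4,), 200, dtype=torch.long),
           torch.full((4,), 30, dtype=torch.long))
```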

9.
Research on excitation sources and a tone model for a Mandarin speech synthesis system (cited by 1)
刘志坚  刘加 《通信学报》1998,19(4):55-60
The excitation source plays a crucial role in the quality of synthesized speech. This paper analyzes and compares several voiced excitation sources and their synthesis results, and studies the fine dynamic variation of the excitation source. Since tone strongly affects the quality of synthesized Mandarin, a Mandarin tone model is established through analysis of the variation of fundamental frequency, duration, and intensity. On this basis a parallel formant synthesizer was developed that can synthesize speech with good clarity and naturalness.
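
A parallel formant synthesizer driven by an F0 (tone) contour, the structure this abstract describes, can be sketched as a pulse train feeding second-order resonators in parallel. The F0 and formant values below are textbook placeholders for /a/, not the paper's measured parameters:

```python
import numpy as np
from scipy.signal import lfilter

def synth_vowel(fs=16000, dur=0.5, f0=(120, 90),
                formants=((730, 90), (1090, 110), (2440, 170))):
    """Impulse-train excitation with a linearly falling F0 (a crude 'tone'
    contour), passed through parallel second-order resonators."""
    n = int(fs * dur)
    f0_track = np.linspace(f0[0], f0[1], n)
    phase = np.cumsum(f0_track / fs)
    exc = np.diff(np.floor(phase), prepend=0.0)          # one impulse per period
    out = np.zeros(n)
    for fc, bw in formants:
        r = np.exp(-np.pi * bw / fs)                     # pole radius from bandwidth
        th = 2 * np.pi * fc / fs                         # pole angle from frequency
        a = [1.0, -2 * r * np.cos(th), r * r]
        out += lfilter([1.0 - r], a, exc)                # one parallel branch
    return out / (np.abs(out).max() + 1e-9)
```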

10.
赵越  林玮 《电声技术》2016,40(11):48-52
The acoustic features of whispered speech are an important part of research on whispered speech recognition and speaker identification. This paper introduces the characteristics of whispered speech and discusses its acoustic features. Because whispered speech has no fundamental frequency, formant and duration characteristics can serve as important acoustic parameters for recognition. Analysis of the six Mandarin whispered vowels shows that formant frequencies and duration can be used as feature parameters for whispered speech recognition.

11.
Uyghur belongs to the Turkic branch of the Altaic language family, and its formant frequency parameters are an important basis for speech recognition and speech synthesis. Applying the basic theory and methods of experimental phonetics for the first time to the office-environment recordings of the "Uyghur Speech Acoustic Parameter Database", this paper statistically analyzes four-syllable vowel-harmony words, reports the formant frequency parameters and distribution patterns of the Uyghur vowels, and uses the experimental data from the four-syllable vowel-harmony words to verify the formant frequency distributions known from traditional auditory phonetics. The results provide an important reference for adjusting vowel harmony before synthesis in parametric or waveform-concatenation speech synthesis systems.

12.
The authors discuss a method for spectral analysis of noise-corrupted signals using statistical properties of the zero-crossing intervals. It is shown that an initial stage of filter-bank analysis is effective for achieving noise robustness. The technique is compared with currently popular spectral analysis techniques based on singular value decomposition and is found to provide generally better resolution and lower variance at low signal-to-noise ratios (SNRs). These techniques, along with three established methods and three variations of these methods, are further evaluated for their effectiveness in formant frequency estimation of noise-corrupted speech. The theoretical results predict, and experimental results confirm, that the zero-crossing method performs well for estimating low frequencies and hence for first-formant frequency estimation in speech at high noise levels (~0 dB SNR). Otherwise, J.A. Cadzow's high performance method (1983) is found to be a close alternative for reliable spectral estimation. As expected, the overall performance of all techniques degrades for speech data. The standard autocorrelation-LPC method is found best for clean speech, and all methods deteriorate roughly equally in noise.
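
The core of the zero-crossing technique fits in a few lines: band-pass the signal first (the filter-bank stage the authors find crucial for noise robustness), then convert the mean interval between upward zero crossings into a per-band dominant-frequency estimate. The band layout is an assumption:

```python
import numpy as np
from scipy.signal import butter, lfilter

def zc_dominant_freq(x, fs):
    """Mean interval between successive upward zero crossings ~ one period."""
    idx = np.where((x[:-1] < 0) & (x[1:] >= 0))[0]
    if len(idx) < 2:
        return np.nan
    return fs / np.mean(np.diff(idx))

def zc_spectrum(x, fs, n_bands=16, fmax=4000):
    """Filter-bank front end, then a zero-crossing frequency estimate per band."""
    edges = np.linspace(100, fmax, n_bands + 1)
    est = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        est.append(zc_dominant_freq(lfilter(b, a, x), fs))
    return np.array(est), edges   # estimates clustering at one value mark a formant
```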

13.
Dupree  B.C. 《Electronics letters》1984,20(7):279-280
An algorithm is proposed which will obtain, from an input speech signal, formant parameter data to control a parallel formant speech synthesiser. By allowing some delay and employing variable-frame-rate techniques, the parameter data can be obtained at a low frame rate (typically 20 frames per second) suitable for transmission or storage.
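
The variable-frame-rate idea is to transmit an analysis frame only when its parameters have moved sufficiently since the last transmitted frame; a minimal sketch (the Euclidean distance and the threshold are assumptions, not Dupree's criterion):

```python
import numpy as np

def vfr_select(params, thresh=0.1):
    """Variable-frame-rate thinning: keep frame t only when its parameter
    vector differs enough from the last kept frame; the receiver
    interpolates between kept frames."""
    params = np.asarray(params, dtype=float)   # shape (n_frames, n_params)
    kept = [0]
    for t in range(1, len(params)):
        if np.linalg.norm(params[t] - params[kept[-1]]) > thresh:
            kept.append(t)
    return kept                                # indices of frames to transmit
```

At a 100 Hz analysis rate, a threshold that lets roughly one frame in five through yields the ~20 frames per second the abstract cites.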

14.
In this paper, alternative dynamic features for speech recognition are proposed. The goal of this work is to improve speech recognition accuracy by deriving the representation of distinctive dynamic characteristics from a speech spectrum. This work was inspired by two temporal dynamics of a speech signal. One is the highly non-stationary nature of speech, and the other is the inter-frame change of a speech spectrum. We adopt the use of a sub-frame spectrum analyzer to capture very rapid spectral changes within a speech analysis frame. In addition, we attempt to measure spectral fluctuations in a more complex manner than traditional dynamic features such as delta or double-delta. To evaluate the proposed features, speech recognition tests over smartphone environments were conducted. The experimental results show that feature streams simply combined with the proposed features are effective for improving the recognition accuracy of a hidden Markov model-based speech recognizer.
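
For contrast, the traditional dynamic features mentioned (delta, double-delta) are regression-based frame differences; the standard formula in numpy, with double-delta obtained by applying the function twice:

```python
import numpy as np

def delta(feats, N=2):
    """Regression-based delta features over a window of +/-N frames:
    d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2*sum_n n^2). feats: (T, D)."""
    pad = np.pad(feats, ((N, N), (0, 0)), mode="edge")   # replicate edge frames
    T = len(feats)
    num = sum(n * (pad[N + n:T + N + n] - pad[N - n:T + N - n])
              for n in range(1, N + 1))
    return num / (2 * sum(n * n for n in range(1, N + 1)))

# Usage: stack static, delta, and double-delta streams
# feats_full = np.hstack([feats, delta(feats), delta(delta(feats))])
```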

15.
Lipreading provides a limited amount of information about speech signals to profoundly deaf people. Visual displays using peripheral vision as an alternative sensory modality can provide supplementary speech information. The utility of a cosmetically acceptable peripheral vision display was explored. A pair of eyeglasses with a commercially available two-dimensional red LED array (5 x 7) and its associated electronics was developed. The display is visible only to the wearer, and is located in the temporal field and the horizontal meridian of the right eye. Selected speech features were encoded as visual patterns for presentation to the lipreader. These features of the speech signal (the fundamental frequency of the speech, high-frequency energy, and the low-passed speech signal or total energy envelope) were presented with the objective of providing information about voicing and plosion/frication. Experiments demonstrate the capability of the peripheral display in conveying speech information. When vowel-consonant-vowel syllables were presented, performance exceeded 76% with aided lipreading, compared with 41% by lipreading alone.

16.
The quality of synthetic speech is affected by two factors: intelligibility and naturalness. At present, synthesized speech may be highly intelligible, but often sounds unnatural. Speech intelligibility depends on the synthesizer's ability to reproduce the formants, the formant bandwidths, and formant transitions, whereas speech naturalness is thought to depend on the excitation waveform characteristics for voiced and unvoiced sounds. Voiced sounds may be generated by a quasiperiodic train of glottal pulses of specified shape exciting the vocal tract filter. It is generally assumed that the glottal source and the vocal tract filter are linearly separable and do not interact. However, this assumption is often not valid, since it has been observed that appreciable source-tract interaction can occur in natural speech. Previous experiments in speech synthesis have demonstrated that the naturalness of synthetic speech does improve when source-tract interaction is simulated in the synthesis process. The purpose of this paper is two-fold: (1) to present an algorithm for automatically measuring source-tract interaction for voiced speech, and (2) to present a simple speech production model that incorporates source-tract interaction into the glottal source model. This glottal source model controls: (1) the skewness of the glottal pulse, and (2) the amount of first-formant ripple superimposed on the glottal pulse. A major application of the results of this paper is the modeling of vocal disorders.
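
A simplified stand-in for such a glottal source model: a Rosenberg-style pulse exposing the two controls the abstract names, pulse skewness and the amount of superimposed first-formant ripple. All shape constants below are assumptions, not the authors' measured values:

```python
import numpy as np

def glottal_pulse(fs=16000, f0=110, skew=0.6, ripple_amp=0.0, f1=500):
    """One period of a Rosenberg-style glottal pulse.
    skew shifts the pulse peak within the open phase (asymmetry);
    ripple_amp adds a sinusoid at the first-formant frequency during
    the open phase, mimicking source-tract interaction ripple."""
    n = int(fs / f0)
    t = np.arange(n) / n
    open_frac = 0.6                        # assumed open-phase fraction
    rise = skew * open_frac                # peak position inside the open phase
    g = np.zeros(n)
    up = t < rise
    down = (t >= rise) & (t < open_frac)
    g[up] = 0.5 * (1 - np.cos(np.pi * t[up] / rise))            # opening
    g[down] = np.cos(0.5 * np.pi * (t[down] - rise) / (open_frac - rise))  # closing
    g += ripple_amp * np.sin(2 * np.pi * f1 * np.arange(n) / fs) * (t < open_frac)
    return g
```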

17.
Speech analysis/synthesis algorithms utilizing linear prediction coefficients have certain advantages over those employing formant-based techniques. For example, 4-kHz speech samples may be synthesized using a basic sequence of 10 multiply/adds followed by a single addition of the current sample of the excitation function. Real-time software synthesis of 4-kHz speech is possible (using this technique) on certain 16-b minicomputers, but the central processing unit (CPU) overhead may approach 100 percent. We describe an economical (under $600) hardware realization of a 4-kHz digital linear predictive speech synthesizer which requires, at most, a CPU overhead of about 40 percent real time. The device is constructed of standard TTL/MOS logic and consists (essentially) of a high-speed 2's complement multiplier/adder capable of calculating a 26-b product (10-b speech samples, 16-b coefficients) in 0.33 μs, and a dual shift register. In addition, a procedure is discussed which enables the device to be used both as a formant synthesizer for vowel or voiced-consonant production, and as a predictive synthesizer for other speech sounds. This procedure, hybrid synthesis, permits the utilization of formant concatenation techniques and reduces the coefficient storage required to specify vowels/voiced consonants by about 60 percent.
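
The per-sample recursion being counted (for order 10: ten multiply/adds, then one addition of the excitation sample) is the all-pole difference equation; a direct numpy transcription:

```python
import numpy as np

def lpc_synthesize(a, excitation):
    """All-pole synthesis: s[n] = sum_{k=1..p} a[k-1]*s[n-k] + e[n].
    For p = 10 this is the 10-multiply/add inner loop the abstract describes."""
    p = len(a)
    s = np.zeros(len(excitation) + p)            # p zeros of history at the front
    for n, e in enumerate(excitation):
        s[n + p] = np.dot(a, s[n:n + p][::-1]) + e   # most recent sample first
    return s[p:]
```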

18.
Through experiments, this paper studies and analyzes the formant characteristics of Mandarin syllables and finds that the individual formants can be tracked and located. Combining this with general regularities of Mandarin pronunciation, a speech enhancement algorithm based on formant tracking is proposed. The algorithm effectively distinguishes speech frames from non-speech frames in noisy speech, removes all noise from non-speech frames simply and effectively, and clearly suppresses the noise within speech frames. The algorithm has low computational complexity and is portable across noise environments.
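
The frame classification idea can be caricatured in a few lines: treat a frame as speech if its spectrum contains prominent peaks in the formant region, and zero it entirely otherwise (the "remove all noise from non-speech frames" step). The prominence test is an assumed stand-in for the paper's Mandarin-specific rules, and within-frame noise suppression is not shown:

```python
import numpy as np
from scipy.signal import stft, istft, find_peaks

def formant_gate(x, fs, nperseg=512, prominence_db=12.0):
    """Keep only STFT frames that show prominent formant-region peaks;
    zero all other frames, then resynthesize."""
    f, t, Z = stft(x, fs, nperseg=nperseg)
    out = np.zeros_like(Z)
    mag_db = 20 * np.log10(np.abs(Z) + 1e-9)
    for j in range(Z.shape[1]):
        pk, _ = find_peaks(mag_db[:, j], prominence=prominence_db)
        pk = pk[(f[pk] > 200) & (f[pk] < 4000)]   # formant candidate region
        if len(pk) >= 2:                          # looks like a speech frame
            out[:, j] = Z[:, j]
    _, y = istft(out, fs, nperseg=nperseg)
    return y
```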

19.
In this paper, we describe a group delay-based signal processing technique for the analysis and detection of hypernasal speech. Our preliminary acoustic analysis of nasalized vowels shows that, even though additional resonances are introduced at various frequency locations, the introduction of a new resonance in the low-frequency region (around 250 Hz) is found to be consistent. This observation is further confirmed by a perceptual analysis carried out on vowel sounds that are modified by introducing different nasal resonances, and by an acoustic analysis of hypernasal speech. Based on this, the subsequent experiments focus only on the low-frequency region. The additive property of the group delay function can be exploited to resolve two closely spaced formants. However, when the formants are very close and have considerably wider bandwidths, as in hypernasal speech, the group delay function also fails to resolve them. To overcome this, we suggest a band-limited approach to estimating the locations of the formants. Using the band-limited group delay spectrum, we define a new acoustic measure for the detection of hypernasality. Experiments are carried out on the phonemes /a/, /i/, and /u/ uttered by 33 hypernasal speakers and 30 normal speakers. Using the group delay-based acoustic measure, the performance on a hypernasality detection task is found to be 100% for /a/, 88.78% for /i/, and 86.66% for /u/. The effectiveness of this acoustic measure is further cross-verified on speech data collected in an entirely different recording environment.
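
The group delay function at the heart of this measure can be computed without phase unwrapping from two FFTs, via tau(w) = (Xr*Yr + Xi*Yi)/|X|^2 with y[n] = n*x[n]; a minimal sketch, band-limited to the low-frequency region (around the ~250 Hz nasal resonance) the authors focus on:

```python
import numpy as np

def group_delay_spectrum(frame, fs, n_fft=1024, fmax=1000.0):
    """Group delay of one windowed frame, computed from two FFTs
    (no phase unwrapping), returned only for the low-frequency band."""
    x = frame * np.hamming(len(frame))
    y = np.arange(len(x)) * x                   # time-weighted signal
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(y, n_fft)
    tau = (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-12)
    freqs = np.fft.rfftfreq(n_fft, 1 / fs)
    band = freqs < fmax                         # band-limited, low-frequency view
    return freqs[band], tau[band]
```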

20.
This paper describes the implementation of a Speech Understanding System component which tracks the formants of pseudo-syllabic nuclei containing voiced consonants. The nuclei are isolated from continuous speech after a precategorical classification in which feature extraction is carried out by modules organized in a hierarchy of levels. FFT and LPC spectra are the input to the formant tracking system. It works under the control of rules specifying the possible formant evolutions given previously hypothesized phonetic features, and it produces fuzzy graphs rather than the usual formant patterns, because formants are not always evident in the spectrogram pattern.
