1.
Vishnu Vidyadhara Raju Vegesna Krishna Gurugubelli Anil kumar Vuppala 《International Journal of Speech Technology》2018,21(3):521-532
The performance of automatic speech recognition (ASR) systems degrades under mismatched training and testing conditions. One reason for this degradation is the presence of emotions in the speech. The main objective of this work is to improve the performance of ASR in the presence of emotional conditions using prosody modification. The influence of different emotions on the prosody parameters is exploited in this work. Emotion conversion methods are employed to generate word-level, non-uniformly prosody-modified speech. Modification factors for prosodic components such as pitch, duration and energy are used. The prosody modification is done in two ways. Firstly, emotion conversion is done at the testing stage to generate neutral speech from the emotional speech. Secondly, the ASR is trained with emotional speech generated from the neutral speech. In this work, the presence of emotions in speech is studied for Telugu ASR systems. A new database, the IIIT-H Telugu speech corpus, is collected to build the large vocabulary neutral Telugu speech ASR system. The emotional speech samples from the IITKGP-SESC Telugu corpus are used for testing it. The emotions of anger, happiness and compassion are considered during the evaluation. An improvement in the performance of ASR systems is observed with the prosody modified speech.
2.
The zero frequency resonator (ZFR) was proposed earlier for the extraction of glottal closure instants (GCIs) (Murty and Yegnanarayana 2008). The output of the ZFR is an exponentially growing/decaying signal. The trend of this signal can be removed to get the required resolution for detecting relevant information. With a window size of typically 1–2 pitch periods, the trend-removed signal mainly exhibits information related to GCIs. This work proposes two methods for the detection of glottal opening instants (GOIs) using the ZFR. In the first method, the window size for trend removal is reduced to a lower level (say, 0.33 × pitch period), and the possibility of hypothesizing GOIs is demonstrated. In the second method, the window size remains in the range of 1–2 pitch periods, but the input to the ZFR is modified to remove GCI information. The proposed methods are evaluated using the CMU-Arctic database and compared with existing methods for GOI detection. GOI detection performance is comparable to that of GCI detection and to existing methods.
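The filtering and trend-removal steps described in this abstract can be sketched roughly as follows. This is a minimal illustration of generic zero-frequency filtering, not the GOI methods the paper proposes. The function name `zff_epochs`, the default `window_factor` of 1.5 pitch periods, the three trend-removal passes, and the synthetic impulse-train input are all assumptions made for the example. Whether positive-going or negative-going zero crossings line up with closure instants depends on signal polarity, so the returned indices are only candidate epochs.

```python
import numpy as np


def zff_epochs(signal, fs, pitch_period_s, window_factor=1.5, passes=3):
    """Return sample indices of positive-going zero crossings of the
    zero-frequency filtered signal (candidate epochs)."""
    x = np.asarray(signal, dtype=np.float64)
    # Difference the signal to remove any DC offset before filtering.
    y = np.diff(x, prepend=x[0])
    # Each zero-frequency resonator has a double pole at z = 1, i.e.
    # y[n] = x[n] + 2*y[n-1] - y[n-2], which equals two cumulative sums.
    # Two resonators in cascade therefore amount to four cumulative sums.
    for _ in range(4):
        y = np.cumsum(y)
    # Remove the polynomial trend by repeatedly subtracting a local mean
    # computed over roughly `window_factor` pitch periods (assumed value).
    half = max(1, int(round(window_factor * pitch_period_s * fs / 2)))
    kernel = np.ones(2 * half + 1) / (2 * half + 1)
    for _ in range(passes):
        y = y - np.convolve(y, kernel, mode="same")
    # Samples near the edges are corrupted by zero padding; ignore them.
    margin = passes * half
    crossings = np.flatnonzero((y[:-1] < 0) & (y[1:] >= 0)) + 1
    return crossings[(crossings > margin) & (crossings < len(y) - margin)]


# Hypothetical test input: an impulse train with a 10 ms period at 8 kHz.
fs = 8000
sig = np.zeros(fs)
sig[::80] = 1.0
epochs = zff_epochs(sig, fs, pitch_period_s=0.01)
print("median epoch interval (samples):", np.median(np.diff(epochs)))
```

The `window_factor` argument is the quantity the abstract's first method reduces (to about 0.33 of a pitch period); the sketch exposes it as a parameter but makes no claim about reproducing the reported GOI results.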
3.
《IEEE transactions on audio, speech, and language processing》2009,17(1):138-149
Automatic speech recognition (ASR) systems rely almost exclusively on short-term segment-level features (MFCCs), while ignoring higher level suprasegmental cues that are characteristic of human speech. However, recent experiments have shown that categorical representations of prosody, such as those based on the Tones and Break Indices (ToBI) annotation standard, can be used to enhance speech recognizers. Such categorical prosody models are nevertheless severely limited in scope and coverage due to the lack of large corpora annotated with the relevant prosodic symbols (such as pitch accent, word prominence, and boundary tone labels). In this paper, we first present an architecture for augmenting a standard ASR with symbolic prosody. We then discuss two novel, unsupervised adaptation techniques for improving, respectively, the quality of the linguistic and acoustic components of our categorical prosody models. Finally, we implement the augmented ASR by enriching ASR lattices with the adapted categorical prosody models. Our experiments show that the proposed unsupervised adaptation techniques significantly improve the quality of the prosody models; the adapted prosodic language and acoustic models reduce binary pitch accent (presence versus absence) classification error rate by 13.8% and 4.3%, respectively (relative to the seed models) on the Boston University Radio News Corpus, while the prosody-enriched ASR exhibits a 3.1% relative reduction in word error rate (WER) over the baseline system.
4.
Inge Troch 《Automatica》1973,9(1):117-124
For linear multivariable control systems the question of observing the state by means of sampling with arbitrary but fixed choice of the sampling instants is discussed. The freedom due to this arbitrary choice is used to derive criteria for a choice of the sampling instants which is as advantageous as possible as far as the propagation of measuring errors is concerned. The extended principle of duality is formulated and the dual problem of control by means of step functions is covered. Further topics mentioned are e.g. the identification problem and the influence of uncertainties of parameters.
5.
Debadatta Pati S. R. Mahadeva Prasanna 《International Journal of Speech Technology》2012,15(2):241-257
In this work we develop a speaker recognition system based on the excitation source information and demonstrate its significance by comparing it with a vocal tract information based system. The speaker-specific excitation information is extracted by the subsegmental, segmental and suprasegmental processing of the LP residual. The speaker-specific information from each level is modeled independently using Gaussian mixture model-universal background model (GMM-UBM) modeling and then combined at the score level. The significance of the proposed speaker recognition system is demonstrated by conducting speaker verification experiments on the NIST-03 database. Two different tests, namely a Clean test and a Noisy test, are conducted. In the Clean test, the test speech signal is used as it is for verification. In the Noisy test, the test speech is corrupted by factory noise (9 dB) and then used for verification. Although in the Clean test the proposed source based speaker recognition system performs worse than the vocal tract information based system, its performance is better in the Noisy test. Finally, for both clean and noisy cases, by providing different and robust speaker-specific evidence, the proposed system helps the vocal tract system to further improve the overall performance.
6.
A.I.C. Monaghan 《International Journal of Speech Technology》2003,6(1):73-81
The model of prosody used in the Aculab TTS system is unusual in several respects. Firstly, it is based firmly on current metrical theories of prosody. Secondly, it is entirely knowledge-based: there are no stochastic components in the model. Thirdly, it makes use of a quasi-random element to avoid the predictability of conventional synthetic prosody. Fourthly, it is specifically designed for multilingual use: it currently handles several Germanic and Romance languages.
7.
Gábor Olaszy 《International Journal of Speech Technology》2000,3(3-4):165-176
Prosody is the change of F0 and intensity in time and the speed of articulation. The presence or absence of the realization of word accent is also examined as an important feature in prosody generation. During verbal communication various prosody forms contribute to the expression of the textual content of the message on the one hand and of the personal intention of the speaker on the other. In many cases in dialogues the same text can be (must be) pronounced with different intentions. Our goal was to find what kind of prosody patterns and rules are characteristic of these utterance types and what the acoustic relationship among them is for Hungarian. In this article the prosody structures of the most important dialogue components are described, and invariant structures are derived and verified by speech synthesis. Rules are also stated as generalized function structures to show the acoustic relationship of the prosody of these expressions to the prosody of statements. Using these rules, it is possible to convert the prosody of a given utterance type to another one by preserving the naturalness of the speech. The rules can be used in text to speech (TTS) conversion to generate spoken dialogues.
8.
Optimal control of switched systems based on parameterization of the switching instants Cited by: 2 (self-citations: 0, citations by others: 2)
This paper presents a new approach for solving optimal control problems for switched systems. We focus on problems in which a prespecified sequence of active subsystems is given. For such problems, we need to seek both the optimal switching instants and the optimal continuous inputs. In order to search for the optimal switching instants, the derivatives of the optimal cost with respect to the switching instants need to be known. The most important contribution of the paper is a method which first transcribes an optimal control problem into an equivalent problem parameterized by the switching instants and then obtains the values of the derivatives based on the solution of a two point boundary value differential algebraic equation formed by the state, costate, stationarity equations, the boundary and continuity conditions, along with their differentiations. This method is applied to general switched linear quadratic problems and an efficient method based on the solution of an initial value ordinary differential equation is developed. An extension of the method is also applied to problems with internally forced switching. Examples are shown to illustrate the results in the paper.
9.
10.
《International Journal of Computer Mathematics》2012,89(1):69-80
Association rule mining is a data mining technique for discovering information that represents associations among data. Data in a database sometimes appear infrequently but are highly associated with specific data. This paper proposes a technique for significant rare data that introduces a second support for discovering the association rules of such data. We show that the proposed approach provides better performance than standard association rule techniques.
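The abstract does not define its second support, so the following is only a rough sketch of the general idea: a single minimum support discards rare but strongly associated items, and a second, lower threshold can readmit them. The function names, the `rare_confidence` parameter, the rule that rare pairs must meet a stricter confidence, and the toy transactions are all hypothetical and are not taken from the paper.

```python
from itertools import permutations


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def mine_pair_rules(transactions, min_support, second_support,
                    min_confidence, rare_confidence):
    """Mine single-item rules a -> b.

    Frequent rules need support >= min_support and
    confidence >= min_confidence. Rules below min_support are kept as
    "rare" only if support >= second_support and
    confidence >= rare_confidence (a hypothetical criterion).
    """
    transactions = [set(t) for t in transactions]
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        s = support(transactions, {a, b})
        if s < second_support:
            continue  # too rare even for the second threshold
        conf = s / support(transactions, {a})
        if s >= min_support and conf >= min_confidence:
            rules.append((a, b, s, conf, "frequent"))
        elif s < min_support and conf >= rare_confidence:
            rules.append((a, b, s, conf, "rare"))
    return rules


# Toy data: caviar and vodka are rare but always bought together.
baskets = ([{"bread", "milk"}] * 6 + [{"bread"}] * 2
           + [{"caviar", "vodka"}] * 2)
for rule in mine_pair_rules(baskets, 0.5, 0.1, 0.7, 0.9):
    print(rule)
```

Setting `second_support` equal to `min_support` reduces this to ordinary single-threshold mining, which makes the effect of the extra threshold easy to compare.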
11.
An important result in the robust adaptive control of continuous-time systems, using the persistent excitation of the reference input, was recently given by Narendra and Annaswamy (1986, IEEE Trans. Aut. Control, AC-31, 306–315). According to this result, the global boundedness of all the signals in the adaptive system can be assured if the degree of persistent excitation of the reference input is larger than an appropriate bound on the external disturbance. The main theorem in Narendra and Annaswamy (1986) is proved for a class of plants characterized by the property that the reference model used in the adaptive controller could be chosen to be strictly positive real, a condition which involves constraints on the relative degree of the plant. This paper presents a generalization of the above result to plants of arbitrary relative degree. Together with the work reported in the earlier paper, it demonstrates that the boundedness of all the signals in an adaptive system in the presence of bounded disturbances and arbitrary initial conditions can be assured by increasing the degree of persistent excitation of the reference input.
12.
Janez Stergar Vladimir Hozjan Bogomir Horvat 《International Journal of Speech Technology》2003,6(3):289-299
This paper presents data-driven prediction of word level prosody breaks for the Slovenian language. Automatic learning techniques depend on the construction of a large corpus labeled appropriately. This labeling can be done either automatically, or by hand. While automatic labeling can be less accurate than hand labeling, the latter is very time consuming and, in some cases, inconsistent. Therefore, a new interactive tool for word level prosody labeling (major/minor breaks) is presented together with a new semi-automatic approach for determining prosody breaks. This interactive tool combines the advantages of hand labeling and automatic labeling by achieving a high consistency in labeling and reducing the time needed for hand labeling. The labeled Slovenian corpus has been used to train our phrase break prediction module, implementing a neural network (NN) structure. Experiments for the data-driven prediction of major = minor and major/minor phrase breaks were performed. The prediction accuracy achieved marks state-of-the-art word level prosody breaks prediction for the Slovenian language and is comparable to the prediction accuracy of other approaches in which more complex NN structures (Müller et al., 2000) or other prediction methods (Black and Taylor, 1997) were applied, and a much larger corpus was used for training. The overall prediction accuracy achieved is 94% for major = minor breaks and over 98/92% for major/minor phrase breaks, respectively.
13.
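Taking focus as it occurs in natural running speech as the object of study, this work investigates the acoustic realization of focus in Chinese. The results show that: (1) The influence of focus on the prosodic features of a syllable is closely related to the syllable's higher-level prosodic environment (contextual information). For syllables in different higher-level prosodic environments, the magnitude and direction of the focus-induced change in prosodic features differ. (2) The perceived strength of a focus can, to some extent, be rendered by linearly adjusting the increments of the acoustic parameters of speech. (3) In speech synthesis, the prosodic features of focus can be predicted in two steps. Experiments confirm that, when the focus position is known, this method can synthesize Chinese sentence focus with high naturalness.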
14.
We use Stem-ML to build an automatic learning system for Mandarin prosody that allows us to make quantitative measurements of prosodic strengths. Stem-ML is a phenomenological model of the muscle dynamics and planning process that controls the tension of the vocal folds. Because Stem-ML describes the interactions between nearby tones or accents, we were able to use a highly constrained model with only one accent template for each lexical tone category, and a single prosodic strength per word. The model accurately reproduces the intonation of the speaker, capturing 87% of the variance of the speech's fundamental frequency, f0. The result reveals strong alternating metrical patterns in words, and suggests that the speaker uses word strength to mark a hierarchy of sentence, clause, phrase, and word boundaries.
15.
16.
The perceived quality of synthetic speech strongly depends on its prosodic naturalness. Departing from earlier works by Mixdorff on a linguistically motivated model of German intonation based on the Fujisaki model, an integrated approach to predicting F0 along with syllable duration and energy was developed. The current paper first presents some statistical results concerning the relationship between linguistic and phonetic information underlying an utterance and its prosodic features. These results were employed for training the MFN-based integrated prosodic model predicting syllable duration and energy along with syllable-aligned Fujisaki control parameters. The paper then focusses on the method of perceptual evaluation developed, comparing resynthesis stimuli created by controlled prosodic degrading of natural speech with stimuli created using the integrated model. The results indicate that the integrated model generally receives better ratings than degraded stimuli with comparable durational and F0 deviations from the original. An important outcome is the observation that the accuracy of the predicted syllable durations appears to be a stronger factor with respect to the perceived quality than the accuracy of the predicted F0 contour.
17.
Modifying prosody parameters such as pitch, duration and strength of excitation by a desired factor is termed prosody modification. The objective of this work is to develop a dynamic prosody modification method based on the zero frequency filtered signal (ZFFS), a byproduct of zero frequency filtering (ZFF). The existing epoch based prosody modification techniques use epochs as pitch markers, and the required prosody modification is achieved by the interpolation of the epoch intervals plot. Alternatively, this work proposes a method for prosody modification by the resampling of the ZFFS. The existing epoch based prosody modification method is also further refined for modifying the prosodic parameters at every epoch level, thus providing more flexibility for prosody modification. The general framework for deriving the modified epoch locations can also be used for obtaining dynamic prosody modification from existing PSOLA and epoch based prosody modification methods. The quality of the prosody modified speech is evaluated using waveforms, spectrograms and subjective studies. The usefulness of the proposed dynamic prosody modification is demonstrated for a neutral to emotional conversion task. The subjective evaluations performed for the emotion conversion indicate the effectiveness of dynamic prosody modification over fixed prosody modification for emotion conversion. The dynamic prosody modified speech files synthesized using the proposed, epoch based and TD-PSOLA methods are available at http://www.iitg.ac.in/eee/emstlab/demos/demo5.php.
18.
Cancer classification using ensemble of neural networks with multiple significant gene subsets Cited by: 2 (self-citations: 1, citations by others: 2)
Molecular level diagnostics based on microarray technologies can offer the methodology of precise, objective, and systematic cancer classification. Genome-wide expression patterns generally consist of thousands of genes. It is desirable to extract some significant genes for accurate diagnosis of cancer because not all genes are associated with a cancer. In this paper, we have used representative gene vectors that are highly discriminatory for cancer classes and extracted multiple significant gene subsets based on those representative vectors respectively. Also, an ensemble of neural networks learned from the multiple significant gene subsets is proposed to classify a sample into one of several cancer classes. The performance of the proposed method is systematically evaluated using three different cancer types: Leukemia, colon, and B-cell lymphoma.
19.
Xiang-Yang Wang Hong-Ying Yang Yong-Wei Li Fang-Yu Yang 《Digital Signal Processing》2013,23(4):1136-1153
In this paper, we propose a robust color image retrieval method using the visual interest point feature of significant bit-planes. We first extract the visually significant bit-plane image from the original color image according to bit-plane theory and noise attack characteristics. We then extract the visual interest points from the original color image by using the significant bit-plane image and a multi-scale Harris–Laplace detector, and construct the fuzzy color histogram of the visual interest points. We finally compute the similarity between color images by using the fuzzy color histogram of visual interest points. Experiments on large databases show that the proposed algorithm is significantly more effective than the state-of-the-art approaches. In particular, it can effectively retrieve noise-attacked images (including blurring, sharpening, and illumination changes).
20.
Niu Pan-Pan Wang Xiang-Yang Liu Yu-Nan Yang Hong-Ying 《Multimedia Tools and Applications》2017,76(3):3403-3433
Desynchronization attacks, which cause displacement between embedding and detection, are usually difficult for a watermark to survive. Designing a robust image watermarking scheme against desynchronization attacks is challenging, especially for color images. In this paper, we propose a robust color image watermarking approach based on a local invariant significant bitplane histogram. The novelty of the proposed approach includes: 1) A fast and effective color image feature point detector is constructed, in which probability density and a color invariance model are used; 2) The fully affine invariant local feature regions are built based on the probability density Hessian matrix; and 3) The invariant significant bitplane histograms are introduced to embed the digital watermark. Extensive experiments are carried out on a color image set collected from the Internet, and the preliminary results show that the proposed watermarking approach can survive numerous kinds of distortions, including common image processing operations and desynchronization attacks.