Similar literature
 20 similar documents found
1.
This article describes experiments on speech segmentation using long short-term memory recurrent neural networks. The main part of the paper deals with multi-lingual and cross-lingual segmentation, that is, segmentation performed on a language different from the one on which the model was trained. The experimental data comprise large Czech, English, German, and Russian speech corpora designed for speech synthesis. For optimal multi-lingual modeling, a compact phonetic alphabet was proposed by sharing and clustering the phones of the individual languages. Many experiments were performed exploring various experimental conditions and data combinations. We proposed a simple procedure that iteratively adapts the inaccurate default model to the new voice/language. The segmentation accuracy was evaluated by comparison with a reference segmentation created by a well-tuned hidden Markov model-based framework with additional manual corrections. The resulting segmentation was also employed in a unit selection text-to-speech system. The generated speech quality was compared with that obtained from the reference segmentation in a preference listening test.

2.
The process of counting stuttering events could be carried out more objectively through the automatic detection of stop-gaps, syllable repetitions and vowel prolongations. The alternative would be based on subjective evaluations of speech fluency and may depend on the subjective evaluation method used. Meanwhile, the automatic detection of intervocalic intervals, stop-gaps, voice onset time and vowel durations may depend on the speaker, and rules derived for a single speaker might be unreliable when treated as universal. This implies that learning algorithms with strong generalization capabilities could be applied to solve the problem. Nevertheless, such a system requires vectors of parameters that characterize the distinctive features in a subject's speech patterns. In addition, an appropriate selection of the parameters and feature vectors during learning may improve the performance of an automatic detection system. The paper reports on automatic recognition of stuttered speech in normal and frequency altered feedback speech. It presents several methods of analyzing stuttered speech and describes attempts to establish the parameters that represent stuttering events. It also reports results of some experiments on automatic detection of speech disorder events that were based on both rough sets and artificial neural networks.

3.
4.
This paper describes a multichannel acoustic data collection recorded under the European DICIT project, during Wizard of Oz (WOZ) experiments carried out at the FAU and FBK-irst laboratories. The application of interest in DICIT is a distant-talking interface for control of interactive TV working in a typical living room, with many interfering devices. The objective of the experiments was to collect a database supporting efficient development and tuning of acoustic processing algorithms for signal enhancement. In DICIT, techniques for sound source localization, multichannel acoustic echo cancellation, blind source separation, speech activity detection, speaker identification and verification as well as beamforming are combined to achieve the maximum possible reduction of the user speech impairments typical of distant-talking interfaces. The collected database made it possible to simulate a realistic scenario at a preliminary stage and to tailor the algorithms involved to the observed user behaviors. In order to match the project requirements, the WOZ experiments were recorded in three languages: English, German and Italian. Besides the user inputs, the database also contains non-speech related acoustic events, room impulse response measurements and video data, the latter used to compute three-dimensional positions of each subject. Sessions were manually transcribed and segmented at word level, with specific labels also introduced for acoustic events.

5.
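Because the expression of human emotion is shaped by culture and society, the emotional features of speech differ considerably across languages, so speech emotion recognition models trained on a single language generalize poorly. To address this problem, a multilingual speech emotion recognition method based on multi-task attention is proposed. By introducing language identification as an auxiliary task, the model learns the emotional features shared across languages while also learning the emotional characteristics unique to each language, which improves the cross-language generalization of the multilingual emotion recognition model. Experiments on dimensional emotion corpora in two languages show that, compared with the baseline methods, the proposed method improves the relative mean UAR by 3.66%~5.58% on the Valence task and by 1.27%~6.51% on the Arousal task; experiments on discrete emotion corpora in four languages show that the proposed method improves the relative mean UAR by 13.43%~15.75% over the baseline methods. The proposed method can therefore effectively extract language-related emotional features and improve the performance of multilingual emotion recognition.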
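The abstract above reports its gains as relative improvements in UAR. Assuming UAR carries its usual meaning of unweighted average recall (the mean of per-class recalls), the following minimal sketch shows how such a figure could be computed. The function names and the toy labels are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class counts equally regardless of size."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def relative_gain(baseline, proposed):
    """Relative improvement of `proposed` over `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / baseline


# Toy labels (hypothetical, not data from the paper).
y_true = ["high", "high", "high", "high", "low", "low"]
y_pred = ["high", "high", "high", "low", "low", "high"]

# Recall is 3/4 for "high" and 1/2 for "low", so UAR is their plain average.
uar = unweighted_average_recall(y_true, y_pred)
```

Unlike plain accuracy, this measure does not let a large class dominate the score, which is why it is often preferred for imbalanced emotion corpora.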

6.
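Automatic recognition of hypernasality grades in cleft palate speech has important clinical value for the assessment of velopharyngeal function. This work studies algorithms for the automatic recognition of hypernasality grades in cleft palate speech and proposes a recognition algorithm based on vocal tract characteristics. High- and low-order linear prediction cepstrum coefficients (LPCC) are combined with cepstrum coefficients to form the LPCC-Cep feature set, which serves as the acoustic feature parameters, and a sparse representation based classification (SRC) classifier is used to automatically recognize four hypernasality grades in cleft palate speech (normal, mild, moderate, and severe). Experimental results show that the proposed algorithm achieves a high correct recognition rate for hypernasality classes. In particular, the LPCC-Cep feature set achieves a correct recognition rate of 83.38% for hypernasality grades.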

7.
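Shi Xiayang, Zhang Fengyuan, Yuan Jiaqi, Huang Min. Journal of Computer Applications (《计算机应用》), 2022, 42(11): 3379-3385
Offensive speech can seriously harm social stability, but automatic detection of offensive speech currently concentrates on a few high-resource languages, and the lack of sufficient annotated offensive speech corpora makes detection difficult for low-resource languages. To address this, a cross-lingual unsupervised offensiveness transfer detection method is proposed. First, a multilingual BERT (mBERT) model learns offensive features on a high-resource English dataset, yielding an original model. Then, by analyzing the linguistic similarity between English and Danish, Arabic, Turkish, and Greek, the original model is transferred to these four low-resource languages to detect offensive speech in them automatically. Experimental results show that, compared with four methods, namely BERT, linear regression (LR), support vector machine (SVM), and multi-layer perceptron (MLP), the proposed method improves both the accuracy and the F1 score of offensive speech detection in Danish, Arabic, Turkish, and Greek by nearly 2 percentage points, approaching current supervised detection. Combining cross-lingual model transfer learning with transfer detection can therefore achieve unsupervised offensive speech detection for low-resource languages.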

8.
Automatic recognition of children's speech is a challenging topic in computer-based speech recognition systems. The conventional feature extraction method, Mel-frequency cepstral coefficients (MFCC), is not efficient for children's speech recognition. This paper proposes a novel fuzzy-based discriminative feature representation to address the recognition of Malay vowels uttered by children. Because acoustic speech parameters vary with age, the performance of automatic speech recognition (ASR) systems degrades on children's speech. To solve this problem, this study addresses the representation of relevant and discriminative features for children's speech recognition. The methods include extraction of MFCC with a narrower filter bank followed by a fuzzy-based feature selection method. The proposed feature selection provides relevant, discriminative, and complementary features. For this purpose, conflicting objective functions for measuring the goodness of the features have to be fulfilled. To this end, a fuzzy formulation of the problem and fuzzy aggregation of the objectives are used to address the uncertainties involved. The proposed method can reduce the dimensionality without compromising the speech recognition rate. To assess the capability of the proposed method, the study analyzed six Malay vowels from recordings of 360 children, ages 7 to 12. After feature extraction, two well-known classification methods, namely MLP and HMM, were employed for the speech recognition task. Optimal parameter adjustment was performed for each classifier to adapt it to the experiments. The experiments were conducted in a speaker-independent manner. The proposed method performed better than conventional MFCC and a number of conventional feature selection methods in the children's speech recognition task. The fuzzy-based feature selection allowed flexible selection of the MFCCs with the best discriminative ability to enhance the difference between the vowel classes.

9.
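Cross-lingual sentiment analysis models based on deep learning rely on a pre-trained bilingual word embedding (BWE) dictionary to obtain text vector representations of the source and target languages. To address the difficulty of obtaining BWE dictionaries, this paper proposes a cross-lingual text sentiment analysis method based on sentiment feature representations of word vectors, which introduces sentiment supervision information from the source language to obtain sentiment-aware word vector representations of the source language, so that the word vectors...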

10.
This paper describes a language-independent method for automatic syllabification of the speech signal. The method uses the valleys in the short time energy (STE) contour and the locations of vowel onset points (VOPs) to mark syllable boundaries. In the proposed method, automatic syllabification is performed in three steps. First, long silence/pause regions are marked with the help of speech/non-speech detection. Then VOPs are located from the Hilbert envelope of the LP residual. The existence of more than one VOP in a continuous speech region (identified using speech/non-speech detection in the first step) indicates syllable boundaries within the region. The location with minimum energy in the STE contour between two consecutive VOPs is identified as the syllable boundary. Since the automatic VOP detection algorithm fails to detect some of the VOPs, certain syllable boundaries will be missed. Therefore, in the third step, additional syllable boundaries are detected from the STE contour by fixing a valley threshold equal to the mean value of the STE in each speech region between two consecutive syllable boundaries. The method is evaluated on 50 sentences each of read, extempore and conversational speech in Malayalam and Bengali. An overall accuracy of 80% is obtained with ± 50 ms tolerance with reference to manually marked syllable boundaries for this database. The method also shows good accuracy on TIMIT and NTIMIT data without tuning of thresholds and other parameters. It is useful for applications that do not require exact syllable boundaries but rather a meaningful separation of syllables. Application of this technique to prosody-based emotion recognition is illustrated using the Emo-DB German emotional database.
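The second step described above, placing a boundary at the STE minimum between consecutive VOPs, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: VOP positions are assumed to be already available as frame indices (the paper derives them from the Hilbert envelope of the LP residual, which is not reproduced here), and the function names, frame sizes, and toy values are hypothetical.

```python
def short_time_energy(signal, frame_len, hop):
    """Sum of squared samples in each frame of `frame_len` samples."""
    return [
        sum(x * x for x in signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def boundaries_between_vops(ste, vop_frames):
    """For each pair of consecutive VOPs, pick the frame with minimum
    STE strictly between them as the syllable boundary."""
    boundaries = []
    for left, right in zip(vop_frames, vop_frames[1:]):
        segment = ste[left + 1:right]
        if not segment:
            continue  # adjacent VOPs leave no room for a boundary
        offset = min(range(len(segment)), key=segment.__getitem__)
        boundaries.append(left + 1 + offset)
    return boundaries


# Toy STE contour and VOP frame indices (hypothetical values).
ste = [1, 9, 6, 2, 7, 10, 5, 1, 8]
vop_frames = [1, 5, 8]
boundaries = boundaries_between_vops(ste, vop_frames)  # frames 3 and 7
```

The third step of the method (recovering boundaries missed by VOP detection with a mean-STE valley threshold) is not covered by this sketch.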

11.
This paper investigates the temporal excitation patterns of creaky voice. Creaky voice is a voice quality frequently used as a phrase-boundary marker, but also as a means of portraying attitude, affective states and even social status. Consequently, the automatic detection and modelling of creaky voice may have implications for speech technology applications. The acoustic characteristics of creaky voice are, however, rather distinct from modal phonation. Further, several acoustic patterns can bring about the perception of creaky voice, thereby complicating the strategies used for its automatic detection, analysis and modelling. The present study is carried out using a variety of languages, speakers, and on both read and conversational data and involves a mutual information-based assessment of the various acoustic features proposed in the literature for detecting creaky voice. These features are then exploited in classification experiments where we achieve an appreciable improvement in detection accuracy compared to the state of the art. Both experiments clearly highlight the presence of several creaky patterns. A subsequent qualitative and quantitative analysis of the identified patterns is provided, which reveals a considerable speaker-dependent variability in the usage of these creaky patterns. We also investigate how creaky voice detection systems perform across creaky patterns.

12.
This paper presents the design and evaluation of a multi-lingual fingerspelling recognition module that is designed for an information terminal. Through the use of multimodal input and output methods, the information terminal acts as a communication medium between deaf and blind people. The system converts fingerspelled words to speech and vice versa using fingerspelling recognition, fingerspelling synthesis, speech recognition and speech synthesis in Czech, Russian, and Turkish. We describe an adaptive skin color based fingersign recognition system with close to real-time performance and present recognition results on 88 different letters signed by five different signers, using more than four hours of training and test videos.

13.
14.
This paper compares the performance of different feature representations of the speech signal and of different classification procedures for Slovene phoneme recognition. Recognition results are obtained on a database of continuous Slovene speech consisting of short Slovene sentences spoken by female speakers. MEL-cepstrum and LPC-cepstrum features combined with the normalized frame loudness were found to be the most suitable feature representations for Slovene speech. Determining the MEL-cepstrum using linear spacing of bandpass filters gave significantly better results for speaker-dependent recognition. The comparison of classification procedures favours Bayes classification assuming a normal distribution of the feature vectors (BNF) over classification based on quadratic discriminant functions (DF) for minimum mean-square error and over the subspace method (SM), which does not confirm the results obtained in some previous studies for German and Finnish speech. Additionally, classification procedures based on hidden Markov models (HMM) and the Kohonen Self-Organizing Map (KSOM) were tested on a smaller amount of speech data (1 speaker only). Their classification results are comparable with those of classification using BNF.

15.
This article presents a cross-lingual study for Hungarian and Finnish on the segmentation of continuous speech at the word and phrase level by examination of supra-segmental parameters. A word-level segmenter has been developed which can indicate word boundaries with acceptable precision for both languages. The ultimate aim is to increase the robustness of speech recognition at the language modelling level by detecting word and phrase boundaries, which can significantly reduce the search space during the decoding process. Search space reduction is highly important in the case of agglutinative languages. In Hungarian and in Finnish, if stress is present, it always falls on the first syllable of the stressed word. Thus if stressed syllables can be detected, they must be at the beginning of a word. We have developed different algorithms based on either a rule-based or a data-driven approach. The rule-based algorithms and HMM-based methods are compared. The best results were obtained by data-driven algorithms using the time series of fundamental frequency and energy together. Syllable length was found to be much less effective and was therefore discarded. By use of supra-segmental features, word boundaries can be marked with high accuracy, even if not all of them can be found. The method we evaluated is easily adaptable to other fixed-stress languages. To investigate this we adapted our data-driven method to the Finnish language and obtained similar results.

16.
Many language identification (LID) systems are based on language models using techniques that consider the fluctuation of speech over time. Considering these fluctuations necessitates longer recording intervals to obtain reasonable accuracy. Our research extracts features from short recording intervals to enable successful classification of spoken language. The feature extraction process is based on frames of 20 ms, whereas most previous LID systems presented results based on much longer frames (3 s or longer). We defined and implemented 200 features divided into four feature sets: cepstrum features, RASTA features, spectrum features, and waveform features. We applied eight machine learning (ML) methods to the features extracted from a corpus containing speech files in 10 languages from the Oregon Graduate Institute (OGI) telephone speech database and compared their performance in an extensive experimental evaluation. The best optimized classification results were achieved by random forest (RF): from 76.29% on 10 languages to 89.18% on 2 languages. These results are better than or comparable to the state-of-the-art results for the OGI database. Another set of experiments addressed gender classification on 2 to 10 languages. The accuracy and the F-measure values for the RF method in all the language experiments were greater than or equal to 90.05%.
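To make the 20 ms framing concrete, here is a minimal sketch that cuts a signal into non-overlapping 20 ms frames and computes one simple waveform feature per frame. The 8000 Hz sampling rate, the choice of zero-crossing rate as the feature, and all names are assumptions for illustration; the abstract does not list the 200 features the authors actually used.

```python
def frame_signal(samples, sample_rate, frame_ms=20):
    """Split `samples` into non-overlapping frames of `frame_ms` milliseconds."""
    frame_len = sample_rate * frame_ms // 1000
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


# Toy signal: 100 ms of alternating samples at an assumed 8000 Hz rate.
samples = [1, -1] * 400
frames = frame_signal(samples, sample_rate=8000)  # 160 samples per frame
features = [zero_crossing_rate(f) for f in frames]
```

Each frame yields one feature value here; a classifier such as a random forest would then be trained on vectors of many such per-frame features.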

17.
18.
This paper describes the language component of FASTY, a text prediction system designed to improve text input efficiency for disabled users. The FASTY language component is based on state-of-the-art n-gram-based word-level and part-of-speech-level prediction and on a number of innovative modules (morphological analysis, collocation-based prediction, compound prediction) that are meant to enhance performance in languages other than English. Together with its modular architecture, these novel techniques make it adaptable to a wide range of languages without sacrificing performance. Currently, versions for Dutch, German, French, Italian, and Swedish are supported. The system can be parameterized to be used with different user interfaces and for a range of different applications. In this paper, we discuss each of the modules in detail and we present a series of experimental evaluations of the system.

19.
Adaptation in statistical pattern recognition using tangent vectors   (total citations: 1; self-citations: 0; citations by others: 1)
We integrate the tangent method into a statistical framework for classification, both analytically and practically. The resulting consistent framework for adaptation allows us to efficiently estimate the tangent vectors representing the variability. The framework improves classification results on two real-world pattern recognition tasks from the domains of handwritten character recognition and automatic speech recognition.

20.
Thanks to the various techniques used in experimental phonetics and to language inventories, more and more has been learned about the nature of stops in the world's languages. Stop consonants occur in all languages, with voiceless unaspirated stops being the most common. The differences in voice onset time (VOT) have been termed lead vs. short lag, where VOT itself is defined as the timing between the onset of phonation and the release of the occlusion of the vocal tract. For Hungarian, no systematic analysis of the stops has been carried out thus far. This paper aims to investigate the acoustic and perceptual properties of the VOTs of the three Hungarian voiceless stops when they appear in isolation (in syllables and in words) and also when they occur in spontaneous speech. The results of the acoustic analysis show a clear difference between careful and spontaneous speech. VOTs of bilabials and velars are significantly shorter in fluent speech than in careful speech (18.51 msec and 35.31 msec respectively, as opposed to 24.64 msec and 50.17 msec), while those of dentals seem to be unchanged (23.3 msec as opposed to 26.59 msec). Therefore, the actual duration of VOT is characteristic of the place of articulation of stops in spontaneous speech, and VOTs of bilabials and dentals do not differ from each other in careful speech. Vowels following the stops influence them more in careful than in spontaneous speech, which can also be explained by the experimentally confirmed shift of present-day Hungarian vowels towards the neutral vowel. Voice onset time is a specific feature of the Hungarian unaspirated plosive consonants. A further experiment was carried out to define the actual function of the VOTs of the voiceless stops in Hungarian listeners' perception.
