Similar Documents

20 similar documents found.
1.

Emotion recognition from speech signals is an active research area with several applications, such as smart healthcare, autonomous voice-response systems, assessing situational seriousness through caller affective-state analysis in emergency centers, and other smart affective services. In this paper, we present a study of speech emotion recognition based on features extracted from spectrograms using a deep convolutional neural network (CNN) with rectangular kernels. Typically, CNNs have square-shaped kernels and pooling operators at various layers, which are suited to 2D image data. In spectrograms, however, the information is encoded differently: time is represented along the x-axis, the y-axis shows the frequency of the speech signal, and amplitude is indicated by the intensity value at a particular position. To analyze speech through spectrograms, we propose rectangular kernels of varying shapes and sizes, along with max pooling in rectangular neighborhoods, to extract discriminative features. The proposed scheme effectively learns discriminative features from speech spectrograms and performs better than many state-of-the-art techniques when evaluated on the Emo-DB and Korean speech datasets.
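As a rough illustration of the idea (not the authors' code), the following PyTorch sketch builds a small CNN whose kernels and pooling windows are rectangular, so convolutions can span different extents along the time and frequency axes of a spectrogram; the layer sizes, kernel shapes, and seven-class output are assumptions.

```python
import torch
import torch.nn as nn

class RectKernelCNN(nn.Module):
    """Toy CNN with rectangular kernels for spectrogram input (batch, 1, freq, time)."""
    def __init__(self, n_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            # Wide-in-time kernel: 3 frequency bins x 9 time frames.
            nn.Conv2d(1, 16, kernel_size=(3, 9), padding=(1, 4)), nn.ReLU(),
            # Rectangular pooling: pool more aggressively along the time axis.
            nn.MaxPool2d(kernel_size=(2, 4)),
            # Tall-in-frequency kernel: 9 frequency bins x 3 time frames.
            nn.Conv2d(16, 32, kernel_size=(9, 3), padding=(4, 1)), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(spec).flatten(1))

logits = RectKernelCNN()(torch.randn(2, 1, 128, 256))  # two dummy 128x256 spectrograms
```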


2.

Emotion detection is a hot topic nowadays because of its potential application to intelligent systems in fields such as neuromarketing, dialogue systems, friendly robotics, vending platforms, and amiable banking. Nevertheless, the lack of a benchmarking standard makes it difficult to compare results produced by different methodologies, which could otherwise help the research community improve existing approaches and design new ones. There is also the added problem of producing accurate datasets: most emotional speech databases and their associated documentation are either proprietary or not publicly available. Therefore, in this work, two stress-elicited databases containing speech from male and female speakers were compiled, and four classification methods are compared in order to detect and classify speech under stress. Results from each method are presented to show their performance, alongside the final scores attained, in what is a novel approach to this field of study.
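Since the four classification methods are not named in the abstract, the sketch below shows only the general comparison setup, with four common classifiers and random placeholder features standing in for the stress-elicited data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X = np.random.randn(200, 40)          # placeholder acoustic feature vectors
y = np.random.randint(0, 2, 200)      # 1 = stressed speech, 0 = neutral

for name, clf in [("svm", SVC()), ("rf", RandomForestClassifier()),
                  ("knn", KNeighborsClassifier()),
                  ("logreg", LogisticRegression(max_iter=1000))]:
    # Cross-validated accuracy of each candidate method on the same features.
    scores = cross_val_score(make_pipeline(StandardScaler(), clf), X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```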


3.

Adopting high-accuracy speech recognition algorithms without an effective evaluation of their impact on the target computational resource is impractical for mobile and embedded systems. In this paper, techniques are adopted to minimise the computational resources required for an effective mobile-based speech recognition system. A Dynamic Multi-Layer Perceptron speech recognition technique, capable of running in real time on a state-of-the-art mobile device, is introduced. Although a conventional hidden Markov model slightly outperformed our approach when applied to the same dataset, its processing time is much higher. The Dynamic Multi-Layer Perceptron presented here achieves an accuracy of 96.94% and runs significantly faster than similar techniques.
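The "Dynamic" aspect of the paper's perceptron is not detailed in the abstract, so the following is merely a generic sketch of a compact MLP of the kind that could fit a mobile resource budget; the feature type and layer size are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X_train = np.random.randn(500, 13)        # e.g. 13 MFCCs per frame (placeholder)
y_train = np.random.randint(0, 10, 500)   # e.g. 10 word/phone classes (placeholder)

# A single small hidden layer keeps the parameter count (and inference cost) low.
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300)
mlp.fit(X_train, y_train)
print(mlp.predict(X_train[:5]))
```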


4.
Abstract

Techniques of speech synthesis potentially suitable for machine voice output were demonstrated in research laboratories 20 years ago (see, for example, Holmes et al. 1964), but have so far been restricted in application by the difficulty of generating acceptable speech with a sufficiently flexible vocabulary. JSRU's current laboratory system produces highly intelligible speech from an unlimited English vocabulary. The technique of speech synthesis by rule enables synthetic speech to be generated from conventionally spelled English text, with provision for using modified spelling or phonetic symbols for the small proportion of words that would otherwise be pronounced incorrectly.

Recent advances in electronic technology have made it feasible to implement the most advanced systems for flexible speech synthesis in low-cost equipment. In addition to research towards improving speech quality, JSRU shortly expects to demonstrate synthesis by rule in a self-contained voice-output peripheral based on inexpensive microprocessor and signal-processing integrated circuits. This paper considers some of the operational constraints which must be placed on the use of such a device if speech synthesis is to take its place as a general-purpose man-machine communication medium.

5.
Abstract

Since the 1970s, many improvements have been made in the technology available for automatic speech recognition (ASR). Changes in the methods of analysing the incoming speech have resulted in larger, more complex vocabularies being used with greater recognition accuracy. Despite this enhanced performance and substantial research activity, the introduction of voice input into the office is still largely unrealized. This paper reviews the state-of-the-art of office applications of ASR, dividing them into the areas of voice messaging and word processing activities, data entry and information retrieval systems, and environmental control. Within these areas, cartographic computer-aided-design systems are identified as an application with proven success. The slow growth of voice input in the office is discussed in the light of constraints imposed by existing speech technology, and the need for human factors evaluation of potential applications.

6.

Biometric applications are highly sensitive to the quality of processing because of the complexity of presenting unstructured input to the system. Existing image processing applications are built from separate programming stages such as image acquisition, segmentation, feature extraction, and final output. The proposed model is designed with 2 convolution layers and 3 dense layers. We evaluated the module on 5 datasets: 3 benchmark datasets (CASIA, UBIRIS, and MMU), a random dataset, and live video. We calculated the FPR, FNR, precision, recall, and accuracy for each dataset. The accuracy of the proposed system is 82.8% on CASIA, 86% on UBIRIS, 84% on MMU, and 84% on the random dataset; on low-resolution live video, the accuracy is 72.4%. The proposed system achieves better accuracy than existing state-of-the-art systems.
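A minimal PyTorch sketch matching the stated topology (2 convolution layers followed by 3 dense layers); the input resolution, channel counts, and two-class output are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 128), nn.ReLU(),   # dense layer 1 (assumes 64x64 input)
    nn.Linear(128, 64), nn.ReLU(),             # dense layer 2
    nn.Linear(64, 2),                          # dense layer 3: match / non-match
)
out = model(torch.randn(1, 1, 64, 64))         # one dummy 64x64 grayscale iris image
```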


7.

Like many other professions, the medical field has undergone immense automation during the past decade. The complexity and growth of healthcare data have led to a surge in artificial intelligence applications. Despite increased automation, such applications lack the desired accuracy and efficiency for healthcare problems. To address this issue, this study presents an automatic healthcare system that can effectively substitute for a doctor at the initial stage of diagnosis and save time by recommending the necessary precautions. The proposed approach comprises two modules. Module-1 trains machine learning models on a disease-symptoms dataset with the corresponding symptoms and precautions; preprocessing and feature extraction are performed as prerequisite steps, and several algorithms are applied to the disease dataset, such as support vector machine, random forest, extra trees classifier, logistic regression, multinomial naive Bayes, and decision tree. Module-2 interacts with the user (patient), who can describe the illness symptoms through a microphone. The voice data are transformed into text using the Google speech recognizer, and the transformed data are then used with the trained model to predict the disease and recommend precautions. The proposed approach achieves an accuracy of 99.9% during real-time evaluation.
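A hedged sketch of the two-module flow, assuming the SpeechRecognition package for the Google recognizer step and a random forest standing in for Module-1's several models; the symptom vocabulary and disease labels are placeholders.

```python
import numpy as np
import speech_recognition as sr
from sklearn.ensemble import RandomForestClassifier

SYMPTOMS = ["fever", "cough", "headache", "fatigue"]   # placeholder vocabulary
X = np.random.randint(0, 2, (300, len(SYMPTOMS)))      # Module-1: one-hot symptom vectors
y = np.random.randint(0, 5, 300)                       # 5 placeholder disease labels
model = RandomForestClassifier().fit(X, y)

def predict_from_voice() -> int:
    """Module-2: capture speech, transcribe it, and map symptoms to a disease."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        text = recognizer.recognize_google(recognizer.listen(source))
    # Turn the transcript into the same one-hot symptom encoding as training.
    features = [[int(word in text.lower()) for word in SYMPTOMS]]
    return model.predict(features)[0]
```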


8.

Highly expensive capturing devices and the near absence of high-resolution palmprint datasets have slowed the development of forensic palmprint biometric systems in comparison with civilian systems. These issues are addressed in this work. The feasibility of using document scanners as a cheaper option for acquiring palmprints for minutiae-based matching systems is explored. A new high-resolution palmprint dataset was established using an industry-standard Green Bit MC517 scanner and an HP Scanjet G4010 document scanner. Furthermore, a new enhancement algorithm is proposed to attenuate the negative effect of creases on minutiae extraction. Experimental results highlight the potential of document scanners for forensic applications. The advantages and disadvantages of both technologies are also discussed in this context.


9.
This paper describes recent developments at NTT in the areas of speech recognition, speech synthesis, and interactive voice systems as they relate to telecommunications applications. Speaker-independent large-vocabulary speech recognition based on context-dependent phone models and an LR parser, and high-quality text-to-speech (TTS) conversion using the waveform concatenation method, both realized in software, have enabled interactive voice systems for fast and easy prototyping of telephone-based applications. Practical applications are discussed with examples.

10.
Wu Xing, Ji Sihui, Wang Jianjia, Guo Yike. Applied Intelligence, 2022, 52(13): 14839-14852.

Human beings can imagine a person's voice from his or her appearance because different people have different voice characteristics. Although researchers have made great progress in single-view speech synthesis, there are few studies on multi-view speech synthesis, especially speech synthesis from face images. On the basis of the implicit relationship between a speaker's face image and his or her voice, we propose a multi-view speech synthesis method called SSFE (Speech Synthesis with Face Embeddings). The proposed SSFE consists of three parts: a voice encoder, a face encoder, and an improved multi-speaker text-to-speech (TTS) engine. The voice encoder generates voice embeddings from the speaker's speech, while the face encoder extracts voice features from the speaker's face as f-voice embeddings; the multi-speaker TTS engine then synthesizes speech using the voice embeddings and f-voice embeddings. We have conducted extensive experiments to evaluate the proposed SSFE in terms of synthesized speech quality and face-voice matching degree: the Mean Opinion Score of SSFE exceeds 3.7 and the matching degree is about 1.7. The experimental results show that the proposed SSFE method outperforms state-of-the-art methods in terms of speech quality and face-voice matching degree.
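The sketch below illustrates only the face-encoder idea: mapping a face image into the same embedding space as the voice encoder so that it can drive the TTS engine. The shapes, the 256-dimensional embedding, and the MSE alignment loss are assumptions, not the paper's training objective.

```python
import torch
import torch.nn as nn

face_encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 256),                     # 256-d "f-voice" embedding (assumed size)
)

face = torch.randn(4, 3, 112, 112)          # dummy face images
voice_emb = torch.randn(4, 256)             # embeddings from a (pretrained) voice encoder

# Train the face encoder to land in the voice-embedding space of the same speaker.
loss = nn.functional.mse_loss(face_encoder(face), voice_emb)
loss.backward()
```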


11.
Designing text-to-speech systems capable of producing natural-sounding speech in different Indian languages is a challenging and ongoing problem. Because of the large number of possible pronunciations in different Indian languages, many speech segments must be stored in the speech database when a concatenative speech synthesis technique is used to achieve highly natural speech. However, the large database size makes such systems unusable on small handheld devices or human-computer interaction systems with limited storage. In this paper, we propose a fraction-based waveform concatenation technique to produce intelligible speech segments from a small-footprint speech database. The results of all experiments performed show the effectiveness of the proposed technique in producing intelligible speech in different Indian languages, with far less storage and computation overhead than the existing syllable-based technique.
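As a loose illustration of concatenating stored waveform pieces, the numpy sketch below joins fragments with a short linear cross-fade at each boundary; the toy fragments and fade length are invented for the example and do not reproduce the paper's fraction-based technique.

```python
import numpy as np

def concatenate(fragments: list, fade: int = 64) -> np.ndarray:
    """Join waveform fragments, cross-fading `fade` samples at each boundary."""
    out = fragments[0].astype(float)
    ramp = np.linspace(0.0, 1.0, fade)
    for frag in fragments[1:]:
        frag = frag.astype(float)
        # Blend the tail of the output with the head of the next fragment.
        out[-fade:] = out[-fade:] * (1 - ramp) + frag[:fade] * ramp
        out = np.concatenate([out, frag[fade:]])
    return out

# Toy sine-wave "units" standing in for stored speech fractions (8 kHz).
units = [np.sin(2 * np.pi * f * np.arange(800) / 8000) for f in (220, 330)]
speech = concatenate(units)
```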

12.
Speech synthesis by rule has made considerable advances and is used today in numerous text-to-speech synthesis systems. Current systems are able to synthesise pleasant-sounding voices at high intelligibility levels. However, because their synthetic speech quality is still inferior to that of fluently produced human speech, it has not found wide acceptance and has instead been restricted mainly to useful applications for the handicapped or to restricted tasks in telecommunications. The problems with automatic speech synthesis are related to the methods of controlling speech synthesizer models so as to mimic the varying properties of the human speech production system during discourse. In this paper, artificial neural networks are developed for the control of a formant synthesizer. A set of common words comprising larynx-produced phonemes was analysed and used to train a neural network cluster. The system was able to produce intelligible speech for certain phonemic combinations of new and unfamiliar words.
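A toy version of the control idea: a small network maps an encoded phonemic context to formant-synthesizer parameters (here just F1-F3 targets in Hz). The input encoding and training data below are placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.random.rand(400, 30)            # encoded phonemic context (placeholder)
# Placeholder formant targets roughly in the ranges of F1, F2, F3.
Y = np.random.rand(400, 3) * [800, 2000, 3000] + [200, 600, 1500]

controller = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500).fit(X, Y)
f1, f2, f3 = controller.predict(X[:1])[0]   # formant targets for one frame
```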

13.

Depth-image-based rendering (DIBR) is widely used in 3DTV, free-viewpoint video, and interactive 3D graphics applications. Typically, synthetic images generated by DIBR-based systems incorporate various distortions, particularly geometric distortions induced by object dis-occlusion. Ensuring the quality of synthetic images is critical to maintaining adequate system service. However, traditional 2D image quality metrics are ineffective for evaluating synthetic images as they are not sensitive to geometric distortion. In this paper, we propose a novel no-reference image quality assessment method for synthetic images based on convolutional neural networks, introducing local image saliency as prediction weights. Due to the lack of existing training data, we construct a new DIBR synthetic image dataset as part of our contribution. Experiments were conducted on both the public benchmark IRCCyN/IVC DIBR image dataset and our own dataset. Results demonstrate that our proposed metric outperforms traditional 2D image quality metrics and state-of-the-art DIBR-related metrics.
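A minimal sketch of the saliency-weighting idea: a CNN scores local patches of the synthetic image, and the image-level quality is the saliency-weighted average of the patch scores. The tiny patch network and random saliency values below are stand-ins for the paper's learned components.

```python
import torch
import torch.nn as nn

patch_scorer = nn.Sequential(                      # tiny stand-in quality CNN
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1),
)

patches = torch.randn(64, 3, 32, 32)               # 64 patches from one synthetic image
saliency = torch.rand(64)                          # per-patch saliency weights (placeholder)

scores = patch_scorer(patches).squeeze(1)          # one quality score per patch
quality = (scores * saliency).sum() / saliency.sum()  # saliency-weighted pooling
```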


14.
The highest-quality synthetic voices remain scarce in both parametric and concatenative synthesis systems. Much synthetic speech lacks naturalness, pleasantness, and flexibility. While great strides have been made over the past few years in the quality of synthetic speech, much work remains to be done. The major challenges now facing developers are how to provide optimal size, performance, extensibility, and flexibility, together with developing improved signal processing techniques. This paper focuses on issues of performance and flexibility against a background containing a brief evolution of speech synthesis; some acoustic, phonetic, and linguistic issues; and the merits and demerits of two commonly used synthesis techniques: parametric and concatenative. Shortcomings of both techniques are reviewed. Methodological developments in the variable size, selection, and specification of the speech units used in concatenative systems are explored and shown to provide a more positive outlook for more natural, bearable synthetic speech. Differentiating considerations in making and improving concatenative systems are explored and evaluated. Acoustic and sociophonetic criteria for the improvement of variable synthetic voices are reviewed, and a ranking of their relative importance is suggested. Future rewards are weighed against current technical and developmental challenges. The conclusion indicates some current and future applications of TTS.

15.

Parkinson’s disease (PD) is a degenerative disorder of the central nervous system. The diagnosis of PD is difficult, as there is no standard diagnostic test or single system that gives accurate results. Automated diagnostic systems are therefore required to assist the neurologist. In this study, we have developed a new hybrid diagnostic system for the PD diagnosis problem. The main novelty of this paper lies in the proposed approach, which combines the k-means clustering-based feature weighting (KMCFW) method with a complex-valued artificial neural network (CVANN). A Parkinson dataset comprising features obtained from speech and sound samples was used for the diagnosis of PD. PD attributes are weighted using the KMCFW method, the new feature values are converted into complex-number format, and these values are presented as input to the CVANN. The efficiency and effectiveness of the proposed system have been rigorously evaluated on the PD dataset using five different evaluation methods. Experimental results demonstrate that the proposed hybrid system, KMCFW-CVANN, significantly outperforms the other methods detailed in the literature and achieves the highest classification result reported so far, with a classification accuracy of 99.52%. The proposed system therefore appears promising for a more accurate diagnosis of PD. The application also confirms that a complex-valued algorithm can classify a real-valued dataset with high reliability.
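A hedged sketch of k-means clustering-based feature weighting as it is commonly described in the literature: each feature is rescaled by the ratio of its overall mean to the mean of its k-means cluster centres, compressing within-feature scatter. The exact formulation in the paper may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmc_feature_weighting(X: np.ndarray, n_clusters: int = 2) -> np.ndarray:
    """Rescale each feature by the ratio of its mean to its cluster-centre mean."""
    Xw = X.astype(float).copy()
    for j in range(X.shape[1]):
        col = X[:, [j]]                            # cluster one feature at a time
        centers = KMeans(n_clusters=n_clusters, n_init=10).fit(col).cluster_centers_
        Xw[:, j] *= col.mean() / centers.mean()    # ratio-of-means weight
    return Xw

X = np.random.rand(50, 4) * 10                     # placeholder Parkinson features
X_weighted = kmc_feature_weighting(X)              # input to the (complex-valued) classifier
```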


16.
Data Processing, 1984, 26(2): 49-51.
Systems that incorporate advanced speech technology derive great benefits from allowing people to interact with them in a manner that requires no previous training. The article describes the state of the art in speech recognition and voice synthesis, as well as applications such as voice store-and-forward systems, and reviews future developments in the field. The speech technology market is on the verge of a large-scale expansion, bringing continuous-speech, speaker-independent recognition systems within reach by the 1990s.

17.

Automatic speech emotion recognition has numerous applications nowadays. One of the important steps in these systems is feature selection. Because it is not known which acoustic features of a person's speech are related to its emotional content, much effort has been made to introduce a variety of acoustic features. However, since employing all of these features lowers the learning efficiency of classifiers, it is necessary to select a subset. Moreover, when there are several speakers, speaker-independent features must be chosen. For this reason, the present paper attempts to select features that are not only related to the emotion of speech but are also speaker-independent. To this end, the current study proposes a multi-task approach that selects the proper speaker-independent features for each pair of classes. The selected features are then given to the classifier, and the outputs of the classifiers are appropriately combined to produce the output of a multi-class problem. Simulation results reveal that the proposed approach outperforms other methods and offers higher efficiency in terms of detection accuracy and runtime.
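The pairwise structure can be sketched as below: one feature subset and one binary classifier per class pair, combined by majority vote. The ANOVA F-score selector is a stand-in; it does not by itself enforce speaker independence.

```python
import numpy as np
from itertools import combinations
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

X = np.random.randn(300, 60)                 # placeholder acoustic features
y = np.random.randint(0, 4, 300)             # 4 emotion classes (placeholder)

pair_models = {}
for a, b in combinations(np.unique(y), 2):
    mask = np.isin(y, [a, b])
    # Each class pair gets its own feature subset and binary classifier.
    clf = make_pipeline(SelectKBest(f_classif, k=10), SVC())
    pair_models[(a, b)] = clf.fit(X[mask], y[mask])

def predict(x):
    votes = [m.predict(x.reshape(1, -1))[0] for m in pair_models.values()]
    return np.bincount(votes).argmax()        # majority vote over pair classifiers

print(predict(X[0]))
```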


18.
Fake speech produced by modern speech synthesis and voice conversion systems poses a serious threat to automatic speaker recognition systems. Most existing fake speech detection systems perform well against attack types seen during training, but their performance drops significantly against the unknown attack types encountered in practice. Therefore, building on the recently proposed dual-path Res2Net (DP-Res2Net), a semi-supervised end-to-end fake speech detection method operating on time-domain waveforms is proposed. First, to address the large distribution gap between the training and test sets, semi-supervised learning is used for domain transfer. Then, for feature engineering, time-domain sample points are fed directly into DP-Res2Net, which adds local multi-scale information and makes full use of the dependencies between audio segments. Finally, the input features pass through a shallow convolution module, a feature fusion module, and a global average pooling module to obtain an embedding tensor used to discriminate natural speech from fake, forged speech. The performance of the proposed method was evaluated on the publicly available ASVspoof 2021 Speech Deep Fake evaluation set and the VCC datasets; the experimental results show an equal error rate (EER) of 19.97%, a 10.8% reduction relative to the official best baseline system. The semi-supervised end-to-end method for detecting fake speech from time-domain waveforms is effective against unknown attacks and generalizes better.
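The sketch below shows only the Res2Net-style multi-scale block applied to raw-waveform features: channels are split into groups, and each group's convolution also receives the previous group's output, yielding several receptive-field scales within one block. It is not the paper's full dual-path, semi-supervised system.

```python
import torch
import torch.nn as nn

class Res2Block1d(nn.Module):
    """Simplified Res2Net-style multi-scale residual block for 1-D features."""
    def __init__(self, channels: int = 32, scales: int = 4):
        super().__init__()
        self.width = channels // scales
        self.convs = nn.ModuleList(
            nn.Conv1d(self.width, self.width, kernel_size=3, padding=1)
            for _ in range(scales - 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = list(torch.split(x, self.width, dim=1))
        out, prev = [chunks[0]], None
        for conv, chunk in zip(self.convs, chunks[1:]):
            # Each scale sees its own chunk plus the previous scale's output.
            prev = conv(chunk if prev is None else chunk + prev)
            out.append(torch.relu(prev))
        return torch.cat(out, dim=1) + x          # residual connection

frontend = nn.Conv1d(1, 32, kernel_size=128, stride=64)   # raw waveform -> features
wave = torch.randn(2, 1, 16000)                            # two 1-second clips at 16 kHz
feats = Res2Block1d()(frontend(wave))
```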

19.
In this paper, we investigate the contribution of tone in Hidden Markov Model (HMM)-based speech synthesis of Ibibio (ISO 639-3: nic; Ethnologue: IBB), an under-resourced language. We review the language's speech characteristics, required for building the front-end components of the design, and propose a finite state transducer (FST) for modelling the language's tonetactics. The existing Ibibio speech database is also studied, and the quality of the synthetic speech is examined through a spectral analysis of voices obtained from two synthesis experiments, with and without tone feature labels. A confusion matrix classifying the results of a controlled listening test for both experiments is constructed, and statistics comparing their performance are presented. The results reveal that synthesis systems with tone feature labels outperformed those without, as listeners perceived more tone confusions in the latter.
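A toy dictionary-based transducer illustrating how an FST can encode tonetactics, with states tracking the previous tone and transitions emitting surface tones; the tone inventory and the downstep rule used here are simplified placeholders, not the Ibibio grammar from the paper.

```python
H, L, DS = "H", "L", "!H"                      # high, low, downstepped high

TRANSITIONS = {                                # (state, input tone) -> (next state, output)
    ("start", H): (H, H), ("start", L): (L, L),
    (H, H): (H, H),       (H, L): (L, L),
    (L, H): (DS, DS),     (L, L): (L, L),      # toy rule: H after L surfaces as !H
    (DS, H): (DS, DS),    (DS, L): (L, L),
}

def apply_tonetactics(tones):
    """Run the input tone sequence through the transducer, emitting surface tones."""
    state, output = "start", []
    for t in tones:
        state, out = TRANSITIONS[(state, t)]
        output.append(out)
    return output

print(apply_tonetactics([H, L, H, H]))          # ['H', 'L', '!H', '!H']
```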

20.

Identity authentication based on Automatic Speaker Verification (ASV) has attracted extensive attention, and voice can substitute for a password in many applications. However, the security of current ASV systems is seriously challenged by malicious spoofing attacks. Among these, the replay attack is one of the biggest threats to ASV systems: an adversary uses a pre-recorded speech sample of the legitimate user to access the system. In this paper, we present a replay attack detection (RAD) scheme to distinguish normal speech from replayed speech. We focus on the distortion caused by the loudspeaker, namely low-frequency attenuation and high-frequency harmonics, and present a suite of RAD features, DL-RAD, comprising the Harmonic Energy Ratio (HER), Low Spectral Ratio (LSR), Low Spectral Variance (LSV), and Low Spectral Difference Variance (LSDV), to describe the differing characteristics of normal and replayed speech signals. An SVM is adopted as the classifier to evaluate the performance of these features. Experimental results show that the True Positive Rate (TPR) and True Negative Rate (TNR) of the proposed method are about 98.15% and 98.75%, respectively, which is significantly better than the existing scheme. The proposed scheme can be applied to both text-dependent and text-independent ASV systems.
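A hedged sketch of the feature intuition: compare low- and high-band spectral energy to expose loudspeaker distortion, then train an SVM. The band edges and statistics below are assumptions standing in for the exact HER/LSR/LSV/LSDV definitions.

```python
import numpy as np
from sklearn.svm import SVC

def band_features(signal: np.ndarray, fs: int = 16000) -> np.ndarray:
    """Crude spectral-band statistics in the spirit of the DL-RAD features."""
    spec = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    low, high = spec[freqs < 500], spec[freqs >= 4000]
    low_ratio = low.sum() / spec.sum()          # cf. Low Spectral Ratio
    harm_ratio = high.sum() / spec.sum()        # cf. Harmonic Energy Ratio
    return np.array([low_ratio, harm_ratio, low.var(), np.diff(low).var()])

X = np.stack([band_features(np.random.randn(16000)) for _ in range(100)])
y = np.random.randint(0, 2, 100)                # 1 = replayed, 0 = genuine (placeholder)
clf = SVC().fit(X, y)                           # SVM classifier, as in the paper
```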

