Emotion recognition from speech signals is an active research area with several applications, such as smart healthcare, autonomous voice response systems, assessing situational seriousness by analyzing the caller's affective state in emergency centers, and other smart affective services. In this paper, we present a study of speech emotion recognition based on features extracted from spectrograms using a deep convolutional neural network (CNN) with rectangular kernels. Typically, CNNs have square-shaped kernels and pooling operators at various layers, which are well suited to 2D image data. In spectrograms, however, the information is encoded differently: time is represented along the x-axis, the y-axis shows the frequency of the speech signal, and the amplitude is indicated by the intensity value at a particular position. To analyze speech through spectrograms, we propose rectangular kernels of varying shapes and sizes, along with max pooling over rectangular neighborhoods, to extract discriminative features. The proposed scheme effectively learns discriminative features from speech spectrograms and outperforms many state-of-the-art techniques when evaluated on the Emo-DB and a Korean speech dataset.
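As a rough illustration of the idea (not the authors' exact architecture; the layer count, kernel shapes, channel widths, and class count below are all placeholder assumptions), a rectangular-kernel CNN over spectrograms might be sketched in PyTorch as follows:

```python
import torch
import torch.nn as nn

class RectKernelCNN(nn.Module):
    """Toy spectrogram classifier with rectangular kernels and pooling.

    Kernel/pool shapes are illustrative: a tall kernel (freq x time = 12x3)
    emphasizes spectral structure, a wide one (3x8) temporal structure.
    """
    def __init__(self, n_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(12, 3), padding=(6, 1)),  # tall kernel
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(4, 2)),                       # rectangular pooling
            nn.Conv2d(16, 32, kernel_size=(3, 8), padding=(1, 4)),  # wide kernel
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, freq_bins, time_frames) spectrogram
        return self.classifier(self.features(x).flatten(1))

# Example: a batch of two 128x256 spectrograms
logits = RectKernelCNN()(torch.randn(2, 1, 128, 256))
print(logits.shape)  # torch.Size([2, 7])
```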
Similar literature

Emotion detection is currently a hot topic because of its potential application to intelligent systems in fields such as neuromarketing, dialogue systems, friendly robotics, vending platforms, and amiable banking. Nevertheless, the lack of a benchmarking standard makes it difficult to compare results produced by different methodologies, which could otherwise help the research community improve existing approaches and design new ones. There is also the problem of accurate dataset production: most emotional speech databases and their associated documentation are either proprietary or not publicly available. Therefore, in this work, two stress-elicited databases containing speech from male and female speakers were collected, and four classification methods are compared in order to detect and classify speech under stress. Results from each method are presented to show their performance, along with the final scores attained, in what is a novel approach to this field of study.
The adoption of high-accuracy speech recognition algorithms without an effective evaluation of their impact on the target computational resource is impractical for mobile and embedded systems. In this paper, techniques are adopted to minimise the computational resources required for an effective mobile-based speech recognition system. A Dynamic Multi-Layer Perceptron speech recognition technique, capable of running in real time on a state-of-the-art mobile device, is introduced. Although a conventional hidden Markov model slightly outperformed our approach on the same dataset, its processing time is much higher. The Dynamic Multi-Layer Perceptron presented here achieves an accuracy of 96.94% and runs significantly faster than similar techniques.
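The internals of the Dynamic Multi-Layer Perceptron are not given in this abstract; purely as a hedged placeholder, a small feed-forward classifier over fixed-length acoustic feature vectors, with inference timed in the spirit of the paper's speed comparison, could look like:

```python
import time
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder data: 1000 utterances, 40-dim acoustic features, 10 word classes
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 40)), rng.integers(0, 10, size=1000)

clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=200).fit(X, y)

t0 = time.perf_counter()
clf.predict(X)  # measure batch inference latency
print(f"inference: {(time.perf_counter() - t0) * 1e3:.1f} ms for {len(X)} samples")
```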
Biometric applications are highly sensitive to the processing pipeline because of the complexity of presenting unstructured input to it. Existing image processing applications are built from separate programming segments such as image acquisition, segmentation, extraction, and final output. The proposed model is designed with 2 convolution layers and 3 dense layers. We evaluated the model on 5 datasets: 3 benchmark datasets (CASIA, UBIRIS, and MMU), a random dataset, and live video. We calculated the FPR, FNR, precision, recall, and accuracy on each dataset. The accuracy of the proposed system is 82.8% on CASIA, 86% on UBIRIS, 84% on MMU, and 84% on the random dataset. On low-resolution live video, the accuracy is 72.4%. The proposed system achieves better accuracy than existing state-of-the-art systems.
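A minimal sketch matching the stated topology of 2 convolution layers and 3 dense layers (the input size, channel widths, and class count are assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class IrisBiometricNet(nn.Module):
    """Sketch of the stated topology: 2 convolution + 3 dense layers."""
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.dense = nn.Sequential(
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        # x: (batch, 1, 64, 64) grayscale iris crop (assumed input size)
        return self.dense(self.conv(x).flatten(1))

print(IrisBiometricNet()(torch.randn(1, 1, 64, 64)).shape)  # torch.Size([1, 10])
```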
Similar to many other professions, the medical field has undergone immense automation during the past decade. The complexity and growth of healthcare data have led to a surge in artificial intelligence applications. Despite increased automation, such applications lack the desired accuracy and efficiency for healthcare problems. To address this issue, this study presents an automatic healthcare system that can effectively substitute for a doctor at the initial stage of diagnosis and save time by recommending the necessary precautions. The proposed approach comprises two modules. Module-1 trains machine learning models on a disease dataset with the corresponding symptoms and precautions, with preprocessing and feature extraction performed as prerequisite steps; several algorithms are applied, including support vector machine, random forest, extra trees classifier, logistic regression, multinomial naive Bayes, and decision tree. Module-2 interacts with the user (patient), who describes the illness symptoms through a microphone. The voice data are transformed into text using the Google speech recognizer, and the transformed data are then used with the trained model to predict the disease and recommend precautions. The proposed approach achieves an accuracy of 99.9% during real-time evaluation.
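A minimal sketch of Module-1, assuming a binary symptom matrix and the six named classifiers as implemented in scikit-learn (the actual dataset, feature extraction, and Module-2 speech interface are only indicated in comments):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the disease dataset: 1 = symptom present, labels = diseases
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 20))   # 500 records, 20 binary symptoms
y = rng.integers(0, 5, size=500)         # 5 disease classes

models = {
    "svm": SVC(), "rf": RandomForestClassifier(), "extra": ExtraTreesClassifier(),
    "logreg": LogisticRegression(max_iter=1000), "mnb": MultinomialNB(),
    "dt": DecisionTreeClassifier(),
}
for name, m in models.items():
    print(name, cross_val_score(m, X, y, cv=5).mean())

# Module-2 would transcribe the patient's speech (e.g. via a speech-to-text
# service), map the transcript to the same binary symptom vector, and call
# models["rf"].predict(...) to return a disease and its precautions.
```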
Highly expensive capturing devices and barely existent high-resolution palmprint datasets have slowed the development of forensic palmprint biometric systems in comparison with civilian systems. This work addresses these issues. The feasibility of using document scanners as a cheaper way to acquire palmprints for minutiae-based matching systems is explored. A new high-resolution palmprint dataset was established using an industry-standard Green Bit MC517 scanner and an HP Scanjet G4010 document scanner. Furthermore, a new enhancement algorithm is proposed to attenuate the negative effect of creases on minutiae extraction. Experimental results highlight the potential of document scanners for forensic applications, and the advantages and disadvantages of both technologies are discussed in this context.
Human beings can imagine a person's voice from his or her appearance because different people have different voice characteristics. Although researchers have made great progress in single-view speech synthesis, there are few studies on multi-view speech synthesis, especially speech synthesis from face images. Based on the implicit relationship between a speaker's face image and his or her voice, we propose a multi-view speech synthesis method called SSFE (Speech Synthesis with Face Embeddings). The proposed SSFE consists of three parts: a voice encoder, a face encoder, and an improved multi-speaker text-to-speech (TTS) engine. The voice encoder generates voice embeddings from the speaker's speech, and the face encoder extracts voice features from the speaker's face as f-voice embeddings; the multi-speaker TTS engine then synthesizes speech conditioned on the voice embeddings or the f-voice embeddings. We conducted extensive experiments to evaluate the proposed SSFE in terms of synthesized speech quality and face-voice matching degree: the Mean Opinion Score of the SSFE is more than 3.7, and the matching degree is about 1.7. The experimental results show that the proposed SSFE method outperforms state-of-the-art methods in terms of speech quality and face-voice matching degree.
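A hedged skeleton of the two encoders (the embedding size, network internals, and input shapes are assumptions; the paper's actual SSFE components are not specified here):

```python
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """Maps a mel-spectrogram to a fixed-size voice embedding (dims assumed)."""
    def __init__(self, d: int = 256):
        super().__init__()
        self.gru = nn.GRU(input_size=80, hidden_size=d, batch_first=True)
    def forward(self, mel):                 # mel: (batch, frames, 80)
        _, h = self.gru(mel)
        return nn.functional.normalize(h[-1], dim=-1)

class FaceEncoder(nn.Module):
    """Maps a face image to an 'f-voice' embedding in the same space."""
    def __init__(self, d: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(32, d))
    def forward(self, img):                 # img: (batch, 3, H, W)
        return nn.functional.normalize(self.net(img), dim=-1)

# A multi-speaker TTS engine would then condition synthesis on either
# embedding, e.g. tts(text, speaker_embedding=face_encoder(img)).
f_emb = FaceEncoder()(torch.randn(1, 3, 112, 112))
v_emb = VoiceEncoder()(torch.randn(1, 200, 80))
print(torch.cosine_similarity(f_emb, v_emb))  # face-voice matching signal
```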
Depth-image-based rendering (DIBR) is widely used in 3DTV, free-viewpoint video, and interactive 3D graphics applications. Synthetic images generated by DIBR-based systems typically incorporate various distortions, particularly geometric distortions induced by object dis-occlusion. Ensuring the quality of synthetic images is critical to maintaining adequate system service. However, traditional 2D image quality metrics are ineffective for evaluating synthetic images because they are not sensitive to geometric distortion. In this paper, we propose a novel no-reference image quality assessment method for synthetic images based on convolutional neural networks, introducing local image saliency as prediction weights. Due to the lack of existing training data, we construct a new DIBR synthetic image dataset as part of our contribution. Experiments were conducted on both the public benchmark IRCCyN/IVC DIBR image dataset and our own dataset. Results demonstrate that the proposed metric outperforms traditional 2D image quality metrics and state-of-the-art DIBR-related metrics.
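The weighting idea, stripped to its core (the CNN producing the per-patch quality scores and the saliency model are assumed, not reproduced from the paper):

```python
import numpy as np

def weighted_quality(patch_scores: np.ndarray, saliency: np.ndarray) -> float:
    """Aggregate per-patch CNN quality predictions with saliency weights.

    Illustrates only the weighting step: salient regions contribute more
    to the final image score than non-salient ones.
    """
    w = saliency / (saliency.sum() + 1e-12)   # normalize weights
    return float((w * patch_scores).sum())

# 64 patches: predicted local quality in [0, 1], local saliency weights
rng = np.random.default_rng(0)
print(weighted_quality(rng.uniform(0, 1, 64), rng.uniform(0, 1, 64)))
```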
Parkinson's disease (PD) is a degenerative disorder of the central nervous system. The diagnosis of PD is difficult, as there is no standard diagnostic test and no single system that gives accurate results. Automated diagnostic systems are therefore required to assist the neurologist. In this study, we develop a new hybrid diagnostic system for the PD diagnosis problem. The main novelty of this paper lies in the proposed approach, which combines the k-means clustering-based feature weighting (KMCFW) method with a complex-valued artificial neural network (CVANN). A Parkinson dataset comprising features obtained from speech and sound samples was used for the diagnosis of PD. PD attributes are weighted using the KMCFW method, the resulting features are converted into complex-number format, and these feature values are presented as input to the CVANN. The efficiency and effectiveness of the proposed system were rigorously evaluated on the PD dataset using five different evaluation methods. Experimental results demonstrate that the proposed hybrid system, named KMCFW-CVANN, significantly outperforms the other methods reported in the literature and achieves the highest classification results reported so far, with a classification accuracy of 99.52%. The proposed system therefore appears promising for a more accurate diagnosis of PD. The application also confirms that a complex-valued algorithm can classify a real-valued dataset with high reliability.
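One plausible reading of the KMCFW step and the real-to-complex conversion (the exact weighting formula and complex encoding used in the paper may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmcfw(X: np.ndarray, n_clusters: int = 2) -> np.ndarray:
    """k-means clustering-based feature weighting, one plausible variant:
    scale each attribute by the ratio of its overall mean to the mean of
    the k-means cluster centres along that attribute."""
    centers = KMeans(n_clusters=n_clusters, n_init=10).fit(X).cluster_centers_
    weights = X.mean(axis=0) / (centers.mean(axis=0) + 1e-12)
    return X * weights

# Toy stand-in for the speech/sound feature matrix
rng = np.random.default_rng(0)
Xw = kmcfw(rng.normal(loc=5.0, size=(40, 6)))

# One common real-to-complex mapping before feeding a CVANN: encode each
# min-max-normalized value v in [0, 1] as a point on the unit circle.
v = (Xw - Xw.min(0)) / (Xw.max(0) - Xw.min(0) + 1e-12)
Xc = np.exp(1j * np.pi * v)
print(Xc.dtype, Xc.shape)  # complex128 (40, 6)
```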
Nowadays, automatic speech emotion recognition has numerous applications. One of the important steps in these systems is feature selection. Because it is not known which acoustic features of a person's speech are related to speech emotion, much effort has been made to introduce a wide range of acoustic features. However, since employing all of these features lowers the learning efficiency of classifiers, it is necessary to select a subset. Moreover, when there are several speakers, speaker-independent features must be chosen. For this reason, the present paper attempts to select features that are not only related to the emotion of speech but are also speaker-independent. To this end, the current study proposes a multi-task approach that selects the proper speaker-independent features for each pair of classes. The selected features are then given to the classifier, and the outputs of the classifiers are appropriately combined to produce the output of a multi-class problem. Simulation results reveal that the proposed approach outperforms other methods in terms of detection accuracy and runtime.
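A toy sketch of the pairwise scheme, with SelectKBest standing in for the paper's speaker-independent selection criterion and majority voting as the combination rule (both are assumptions):

```python
from itertools import combinations
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

def pairwise_fs_predict(X, y, X_test, k=10):
    """Select features per class pair, train a binary classifier on them,
    then majority-vote across all pairs for the multi-class decision."""
    classes = np.unique(y)
    votes = np.zeros((len(X_test), len(classes)), dtype=int)
    for i, j in combinations(range(len(classes)), 2):
        mask = np.isin(y, [classes[i], classes[j]])
        sel = SelectKBest(f_classif, k=k).fit(X[mask], y[mask])
        clf = SVC().fit(sel.transform(X[mask]), y[mask])
        pred = clf.predict(sel.transform(X_test))
        for c in (i, j):                       # each pair casts one vote
            votes[pred == classes[c], c] += 1
    return classes[votes.argmax(axis=1)]

rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 40)), rng.integers(0, 4, size=300)
print(pairwise_fs_predict(X, y, X[:5]))
```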
Identity authentication based on Automatic Speaker Verification (ASV) has attracted extensive attention, and voice can substitute for a password in many applications. However, the security of current ASV systems is seriously challenged by malicious spoofing attacks. Among these, the replay attack is one of the biggest threats: an adversary can use a pre-recorded speech sample of the legitimate user to access the ASV system. In this paper, we present a replay attack detection (RAD) scheme to distinguish normal speech from replayed speech. We focus on the distortion caused by the loudspeaker, namely low-frequency attenuation and high-frequency harmonics, and present a suite of RAD features, DL-RAD, comprising Harmonic Energy Ratio (HER), Low Spectral Ratio (LSR), Low Spectral Variance (LSV), and Low Spectral Difference Variance (LSDV), to describe the differing characteristics of normal and replayed speech signals. An SVM is adopted as the classifier to evaluate the performance of these features. Experimental results show that the True Positive Rate (TPR) and True Negative Rate (TNR) of the proposed method are about 98.15% and 98.75%, respectively, which is significantly better than the existing scheme. The proposed scheme can be applied to both text-dependent and text-independent ASV systems.
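Assumed forms of two of the low-band features, to make the idea concrete (the paper's exact definitions of LSR and LSV are not given here and may differ):

```python
import numpy as np
from scipy.signal import stft

def low_spectral_features(x, fs=16000, cutoff_hz=1000.0):
    """Compute assumed variants of Low Spectral Ratio (share of energy below
    a cutoff) and Low Spectral Variance (variance of that share over frames).
    Replayed speech tends to lose low-frequency energy through the loudspeaker,
    shifting both statistics relative to normal speech."""
    f, _, Z = stft(x, fs=fs, nperseg=512)
    power = np.abs(Z) ** 2
    low = power[f < cutoff_hz].sum(axis=0)     # low-band energy per frame
    total = power.sum(axis=0) + 1e-12
    lsr = float((low / total).mean())          # Low Spectral Ratio (assumed)
    lsv = float(np.var(low / total))           # Low Spectral Variance (assumed)
    return lsr, lsv

rng = np.random.default_rng(0)
print(low_spectral_features(rng.normal(size=16000)))  # 1 s of noise
```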