1.

Human emotion recognition from videos using spatio-temporal and audio features

Munaf Rashid S. A. R. Abu-Bakar Musa Mokji 《The Visual computer》2013,29(12):1269-1275

In this paper, we present a human emotion recognition system based on audio and spatio-temporal visual features. The proposed system has been tested on an audio-visual emotion data set with different subjects of both genders. Mel-frequency cepstral coefficient (MFCC) and prosodic features are first identified and then extracted from emotional speech. For facial expressions, spatio-temporal features are extracted from the visual streams. Principal component analysis (PCA) is applied to reduce the dimensionality of the visual features while capturing 97% of the variance. A codebook is constructed for both audio and visual features in Euclidean space. Histograms of codeword occurrences are then fed to a state-of-the-art SVM classifier to obtain the judgment of each classifier. The judgments from the classifiers are combined using the Bayes sum rule (BSR) as a final decision step. The proposed system is tested on a public data set to recognize human emotions. Experimental results and simulations show that visual features alone yield 74.15% accuracy on average, while audio features alone give an average recognition accuracy of 67.39%. Combining audio and visual features improves the overall system accuracy significantly, up to 80.27%.
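
The final decision step described in this abstract, combining the audio and visual classifiers with a sum rule, can be sketched as follows. This is a minimal illustration only: it assumes each SVM's output has already been mapped to per-class posterior estimates, and the function name, labels, and numbers are hypothetical rather than taken from the paper.

```python
def sum_rule_fusion(posteriors_per_classifier):
    """Fuse per-class posteriors from several classifiers with the sum rule.

    posteriors_per_classifier: list of dicts mapping label -> posterior.
    Returns (winning_label, fused_scores).
    """
    labels = posteriors_per_classifier[0].keys()
    # Sum the posterior each classifier assigns to a label.
    fused = {
        label: sum(p[label] for p in posteriors_per_classifier)
        for label in labels
    }
    # The decision is the label with the largest summed posterior.
    best = max(fused, key=fused.get)
    return best, fused


# Hypothetical posteriors for one clip (illustrative values only).
audio_posteriors = {"happy": 0.50, "sad": 0.30, "angry": 0.20}
visual_posteriors = {"happy": 0.25, "sad": 0.60, "angry": 0.15}

label, scores = sum_rule_fusion([audio_posteriors, visual_posteriors])
print(label, scores)
```

In this made-up example the audio classifier alone would pick "happy", but the visual classifier is confident enough in "sad" that the fused decision changes, which is the kind of complementarity the reported accuracy gain relies on.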

2.

Improvement to speech-music discrimination using sinusoidal model based features

Jalil Shirazi Shahrokh Ghaemmaghami 《Multimedia Tools and Applications》2010,50(2):415-435

This paper addresses model-based audio content analysis for classifying mixed speech-music audio signals into speech and music. A set of new features based on sinusoidal modeling of audio signals is presented and evaluated. The new feature set, which includes the variance of the birth frequencies and the duration of the longest frequency track in the sinusoidal model as measures of harmony and signal continuity, is introduced and discussed in detail. These features are used as inputs to an audio classifier and compared with typical features. Their performance is evaluated by classifying audio into speech and music using both GMM (Gaussian Mixture Model) and SVM (Support Vector Machine) classifiers. Experimental results show that the proposed features are quite successful in speech/music discrimination. Using a set of only two sinusoidal model features, extracted from 1-s segments of the signal, we achieved 96.84% accuracy in audio classification. Experimental comparisons also confirm the superiority of the sinusoidal model features over the popular time-domain and frequency-domain features in audio classification.

3.

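Digital speech forensics algorithm for traces of multiple processing operations

向立 严迪群 王让定 李孝文 《计算机应用》2019,39(1):126-130

Existing research on digital speech forensics focuses mainly on detecting a single type of operation and cannot identify unrelated operations. To address this problem, a digital speech forensic method is proposed that simultaneously detects four operations: pitch shifting, low-pass filtering, high-pass filtering, and noise addition. First, statistical-moment features of the normalized mel-frequency cepstral coefficients (MFCC) of the speech are computed. Next, multiple binary classifiers are trained on these features and combined by voting to form a multi-class classifier. Finally, this multi-class classifier is used to classify the speech under test. Experiments on the TIMIT and UME speech corpora show that the normalized MFCC statistical-moment features achieve detection rates above 97% in every intra-database experiment, and that the detection rate remains above 96% in robustness tests against MP3 compression.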
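
The first step of this method, turning a sequence of MFCC frames into statistical-moment features, might look roughly like the sketch below. The abstract does not say which moments are used or how the coefficients are normalized, so the choice of mean, variance, skewness, and kurtosis here is an assumption, and the input frames are made-up numbers rather than real MFCCs.

```python
import math


def statistical_moments(values):
    """Return mean, variance, skewness, and kurtosis of a sequence."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    variance = sum(c ** 2 for c in centered) / n
    std = math.sqrt(variance)
    if std == 0.0:
        # A constant track has no spread; report zero higher moments.
        return mean, 0.0, 0.0, 0.0
    skewness = sum(c ** 3 for c in centered) / (n * std ** 3)
    kurtosis = sum(c ** 4 for c in centered) / (n * std ** 4)
    return mean, variance, skewness, kurtosis


def moment_feature_vector(mfcc_frames):
    """Concatenate the moments of every coefficient track into one vector."""
    tracks = zip(*mfcc_frames)  # one tuple per coefficient index
    features = []
    for track in tracks:
        features.extend(statistical_moments(list(track)))
    return features


# Hypothetical 4 frames x 2 coefficients (illustrative numbers only).
frames = [[1.0, 2.0], [2.0, 2.0], [3.0, 2.0], [4.0, 2.0]]
features = moment_feature_vector(frames)
print(features)
```

Summarizing each coefficient track by a few moments gives a fixed-length vector regardless of utterance duration, which is what a bank of binary classifiers needs as input.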

4.

Automatic speaker age and gender recognition using acoustic and prosodic level information fusion

Ming Li Kyu J. Han Shrikanth Narayanan 《Computer Speech and Language》2013,27(1):151-167

The paper presents a novel automatic speaker age and gender identification approach which combines seven different methods at both acoustic and prosodic levels to improve the baseline performance. The three baseline subsystems are (1) Gaussian mixture model (GMM) based on mel-frequency cepstral coefficient (MFCC) features, (2) Support vector machine (SVM) based on GMM mean supervectors and (3) SVM based on 450-dimensional utterance level features including acoustic, prosodic and voice quality information. In addition, we propose four subsystems: (1) SVM based on UBM weight posterior probability supervectors using the Bhattacharyya probability product kernel, (2) Sparse representation based on UBM weight posterior probability supervectors, (3) SVM based on GMM maximum likelihood linear regression (MLLR) matrix supervectors and (4) SVM based on the polynomial expansion coefficients of the syllable level prosodic feature contours in voiced speech segments. Contours of pitch, time domain energy, frequency domain harmonic structure energy and formant for each syllable (segmented using energy information in the voiced speech segment) are considered for analysis in subsystem (4). The four proposed subsystems are shown to be effective and to achieve competitive results in classifying different age and gender groups. To further improve the overall classification performance, weighted summation based fusion of these seven subsystems at the score level is demonstrated. Experimental results are reported on the development and test sets of the 2010 Interspeech Paralinguistic Challenge aGender database. Compared to the SVM baseline system (3), which is the baseline system suggested by the challenge committee, the proposed fusion system achieves 5.6% absolute improvement in unweighted accuracy for the age task and 4.2% for the gender task on the development set. On the final test set, we obtain 3.1% and 3.8% absolute improvement, respectively.
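
The results above are stated in unweighted accuracy, which is commonly understood as the mean of the per-class recalls, so that small classes count as much as large ones. The sketch below contrasts it with plain accuracy on a made-up, imbalanced example; the labels and counts are hypothetical and do not come from the aGender database.

```python
from collections import defaultdict


def unweighted_accuracy(true_labels, predicted_labels):
    """Mean of per-class recalls, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def plain_accuracy(true_labels, predicted_labels):
    """Fraction of all samples that were labelled correctly."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


# Hypothetical imbalanced test set: a classifier that always says "adult".
truth = ["adult"] * 8 + ["child"] * 2
guess = ["adult"] * 10
ua = unweighted_accuracy(truth, guess)
wa = plain_accuracy(truth, guess)
print(ua, wa)
```

A classifier that ignores the minority class still scores well on plain accuracy but only reaches chance level on the unweighted measure, which is why the latter suits age and gender groups of unequal size.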

5.

Analysis and detection of mimicked speech based on prosodic features

Leena Mary K. K. Anish Babu Aju Joseph 《International Journal of Speech Technology》2012,15(3):407-417

This paper describes work aimed at understanding how professional mimicry artists imitate the speech characteristics of known persons, and explores the possibility of detecting whether a given speech sample is genuine or from an impostor. It includes a systematic approach to collecting three categories of speech data, namely the original speech of the mimicry artists, their speech while mimicking chosen celebrities, and the original speech of those celebrities, in order to analyze the variations in prosodic features. A method is described for the automatic extraction of relevant prosodic features in order to model speaker characteristics. Speech is automatically segmented into intonation phrases using speech/nonspeech classification. Further segmentation is done using valleys in the energy contour. Intonation, duration and energy features are extracted for each of these segments. The intonation curve is approximated using Legendre polynomials. Other useful prosodic features include average jitter, average shimmer, total duration, voiced duration and change in energy. The prosodic features extracted from the original speech of celebrities and mimicry artists are used to create speaker models with a Support Vector Machine (SVM), and detection of a given speech sample as genuine or impostor is attempted using a speaker verification framework of SVM models.
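
Approximating an intonation curve with Legendre polynomials reduces a contour of arbitrary length to a handful of coefficients. The sketch below is one way to do this, by projecting a uniformly sampled contour onto the first few polynomials with trapezoid-rule integration; the abstract does not say how the fit is computed or which order is used, so both are assumptions, and the contour is synthetic.

```python
def legendre_values(order, x):
    """Return [P_0(x), ..., P_order(x)] via the Bonnet recurrence."""
    values = [1.0, x]
    for n in range(1, order):
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: order + 1]


def legendre_coefficients(contour, order):
    """Project a uniformly sampled contour onto P_0..P_order on [-1, 1]."""
    count = len(contour)
    step = 2.0 / (count - 1)
    coefficients = [0.0] * (order + 1)
    for i, value in enumerate(contour):
        x = -1.0 + i * step
        # Trapezoid rule: end points carry half weight.
        weight = step / 2 if i in (0, count - 1) else step
        basis = legendre_values(order, x)
        for n in range(order + 1):
            coefficients[n] += (2 * n + 1) / 2 * value * basis[n] * weight
    return coefficients


# Hypothetical pitch contour in Hz: a linear rise from 100 to 140.
contour = [120.0 + 20.0 * (-1.0 + i * 0.01) for i in range(201)]
coeffs = legendre_coefficients(contour, 3)
print([round(c, 2) for c in coeffs])
```

For this contour the zeroth coefficient approximates the mean pitch (about 120 Hz) and the first the slope (about 20 Hz over half the segment), while the higher-order terms stay near zero, so the coefficients read directly as level, tilt, and curvature.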

6.

Detection of audio covert channels using statistical footprints of hidden messages

《Digital Signal Processing》2006,16(4):389-401

We address the problem of detecting the presence of hidden messages in audio. The detector is based on the characteristics of the denoised residuals of the audio file, which may consist of a mixture of speech and music data. A set of generalized moments of the audio signal is measured in terms of objective and perceptual quality measures. The detector discriminates between cover and stego files using a selected subset of features and an SVM classifier. The proposed scheme achieves on average 88% discrimination performance on individual steganographic algorithms and 98.5% on individual watermarking algorithms. Between 75 and 90% discrimination performance is achieved in universal tests. Correct detection performance for individual embedding algorithms is roughly 90% when the detector can encounter any one in an ensemble of different embedding algorithms.

7.

A Method for Automatic Detection of Vocal Fry

Ishi C.T. Sakakibara K.-I. Ishiguro H. Hagita N. 《IEEE transactions on audio, speech, and language processing》2008,16(1):47-56

Vocal fry (also called creak, creaky voice, and pulse register phonation) is a voice quality that carries important linguistic or paralinguistic information, depending on the language. We propose a set of acoustic measures and a method for automatically detecting vocal fry segments in speech utterances. A glottal pulse-synchronized method is proposed to deal with the very low fundamental frequency properties of vocal fry segments, which cause problems in the classic short-term analysis methods. The proposed acoustic measures characterize power, aperiodicity, and similarity properties of vocal fry signals. The basic idea of the proposed method is to scan for local power peaks in a "very short-term" power contour for obtaining glottal pulse candidates, check for periodicity properties, and evaluate a similarity measure between neighboring glottal pulse candidates for deciding the possibility of being vocal fry pulses. In the periodicity analysis, autocorrelation peak properties are taken into account for avoiding misdetection of periodicity in vocal fry segments. Evaluation of the proposed acoustic measures in the automatic detection resulted in 74% correct detection, with an insertion error rate of 13%.
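
The first stage described above, scanning a very short-term power contour for local peaks to obtain glottal pulse candidates, can be illustrated with the sketch below. The frame length, hop size, and peak criterion are assumptions chosen for readability, not the paper's settings, and the later periodicity and similarity checks are not shown.

```python
def short_term_power(signal, frame_length, hop):
    """Mean squared amplitude of each frame."""
    powers = []
    for start in range(0, len(signal) - frame_length + 1, hop):
        frame = signal[start:start + frame_length]
        powers.append(sum(s * s for s in frame) / frame_length)
    return powers


def local_power_peaks(powers, min_rise):
    """Indices of frames whose power exceeds both neighbours by min_rise."""
    peaks = []
    for i in range(1, len(powers) - 1):
        rises = powers[i] - powers[i - 1] >= min_rise
        falls = powers[i] - powers[i + 1] >= min_rise
        if rises and falls:
            peaks.append(i)
    return peaks


# Synthetic signal: silence with two isolated pulses (illustrative only).
signal = [0.0] * 40
signal[9] = 1.0
signal[29] = 0.8
powers = short_term_power(signal, frame_length=4, hop=4)
candidates = local_power_peaks(powers, min_rise=0.05)
print(candidates)
```

Very short frames matter here because vocal fry pulses are widely spaced; a conventional analysis window would average several pulses and the silence between them into one value.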

8.

Improvement of phone recognition accuracy using speech mode classification

Kumud Tripathi K. Sreenivasa Rao 《International Journal of Speech Technology》2018,21(3):489-500

In this work, we develop a speech mode classification (SMC) model to improve the performance of a phone recognition system (PRS). We explore vocal tract system, excitation source, and prosodic features for building the SMC model. These features are extracted from voiced regions of the speech signal. Conversation, extempore, and read speech are considered as three different modes of speech. The vocal tract component of speech is captured using Mel-frequency cepstral coefficients (MFCCs). The excitation source features are captured through Mel power differences of spectrum in sub-bands (MPDSS) and residual Mel-frequency cepstral coefficients (RMFCCs) of the speech signal. The prosodic information is extracted from pitch and intensity. Speech mode classification models are developed using these features both independently and in fusion. Experiments are carried out on a Bengali speech corpus to analyze the accuracy of the speech mode classification model using an artificial neural network (ANN), naive Bayes, support vector machines (SVMs), and k-nearest neighbor (KNN). The four classification models are combined using a maximum voting approach for optimal performance. The results show that the speech mode classification model developed using the fusion of vocal tract system, excitation source, and prosodic features yields the best performance of 98%. Finally, the proposed speech mode classifier is integrated into the PRS, and the accuracy of the phone recognition system is observed to improve by 11.08%.
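
Combining four classifiers by maximum voting amounts to picking the label that most of them agree on. A minimal sketch follows; the per-classifier outputs are hypothetical, and the tie-breaking behaviour (the label seen first wins) is a property of this sketch, since the abstract does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers."""
    # most_common(1) yields [(label, count)] for the top-voted label.
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs of the four classifiers for one utterance.
votes = {
    "ann": "read",
    "naive_bayes": "conversation",
    "svm": "read",
    "knn": "read",
}
mode = majority_vote(list(votes.values()))
print(mode)
```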

9.

SVM speaker verification based on prosodic features

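黄肖忠 李辉 许东星 郭伟 《计算机工程与应用》2011,47(15):148-151

A text-independent speaker verification system based on prosodic features and SVM is proposed. Wavelet analysis is used to extract suprasegmental prosodic features from the MFCC, F0, and energy trajectories of the speech signal. The best complementary feature-level fusion of the three sets of prosodic features is determined experimentally, yielding the prosodic feature PMFCCFE. The GMM mean supervectors of the prosodic features are then used as parameters to train an SVM model of the target speaker, so as to separate target speakers from impostors more effectively. Experiments on the NIST06 8side-1side database show that, relative to a baseline GMM-UBM system using short-time cepstral parameters, the GMM-SVM system with suprasegmental prosodic features reduces the EER by 57.9% and the MinDCF by 41.4% in relative terms.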

10.

Psychophysically determined user acceptable oral reading speed (UAORS) for an 8-h work day

《International Journal of Industrial Ergonomics》2015

Speech input as a hands-free input modality has gained popularity over the last decade. At present, some professionals (e.g., in medicine and law) use this technology (continuous speech) for extended periods of time. Prolonged use of voice-enabled applications might result in voice fatigue and potentially vocal nodules. However, studies providing guidelines on acceptable oral reading speed for an 8-h duration are not available. An experiment using a psychophysical methodology was conducted to determine the user acceptable oral reading speed (UAORS) for an 8-h day. Testing conducted on 10 males indicated that a mean speed of 121 words per min could be considered an acceptable oral reading speed for an 8-h day with a 40 dB background noise. A period of 2 h and 20 min of oral reading was also found to be adequate to exert the same vocal load as 8 h. Designers of speech software could use the results of this study to set thresholds that prevent vocal fatigue. The acceptable oral reading speed derived in this study could also be used as a guideline for professionals who use their vocal cords for extended periods of time, such as teachers, lawyers, clergy, and cheerleaders.

11.

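SVM-based recognition of singing voice in popular music

石自强 李海峰 孙佳音 《计算机工程与应用》2008,44(25):126-128

To detect singing voice in popular music, an SVM classifier is trained and applied on MFCC features. Exploiting the continuity of audio features, the classification results are then low-pass filtered as a post-processing step. Experimental results show that the method reaches a frame-level recognition rate of 85.76%. The experiments also reveal large statistical differences between singers of different languages in their pronunciation, particularly in their MFCC features. The song classification results from the experiments can serve as one basis for further work on measuring music similarity.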
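
The post-processing step, low-pass filtering the frame-level decisions, relies on the fact that singing does not switch on and off from one frame to the next. The abstract does not name the filter, so the sketch below assumes a simple moving average followed by re-thresholding; the window size and the input labels are illustrative.

```python
def smooth_decisions(frame_labels, window):
    """Moving-average low-pass filter over 0/1 frame labels, then re-threshold."""
    half = window // 2
    smoothed = []
    for i in range(len(frame_labels)):
        # Clip the window at the sequence boundaries.
        lo = max(0, i - half)
        hi = min(len(frame_labels), i + half + 1)
        average = sum(frame_labels[lo:hi]) / (hi - lo)
        smoothed.append(1 if average > 0.5 else 0)
    return smoothed


# Hypothetical raw SVM decisions per frame (1 = voice, 0 = no voice).
raw = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0]
clean = smooth_decisions(raw, window=3)
print(clean)
```

The isolated "voice" frame at index 2 and the isolated gap at index 7 are both removed, leaving one contiguous voiced region.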

12.

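Audio steganalysis method based on fuzzy C-means clustering and one-class support vector machine

王昱洁 蒋薇薇 《计算机应用》2016,36(3):647-652

To address the poor adaptability of traditional two-class audio steganalysis methods to unknown steganographic methods, an audio steganalysis method based on fuzzy C-means (FCM) clustering and one-class support vector machines (OC-SVM) is proposed. In training, features are first extracted from the training audio, including statistical features of the short-time Fourier transform (STFT) spectrum and features based on audio quality measures; the extracted features are then grouped into C clusters by FCM clustering; finally, the clusters are fed to an OC-SVM classifier with multiple hyperspheres for training. In detection, features are extracted from the test audio, which is then classified according to the boundaries of the multi-hypersphere OC-SVM classifier. Experimental results show that the method detects several typical audio steganographic methods fairly accurately: at full embedding capacity the overall detection rate on the test audio reaches 85.1%, and the detection accuracy is at least 2% higher than with K-means clustering. The method is more general than two-class steganalysis and better suited to detecting stego audio when the steganographic method is not known in advance.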

13.

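Digital audio watermarking based on support vector machine (SVM) (total citations: 3; self-citations: 2; citations by others: 1)

王剑 林福宗 《计算机研究与发展》2005,42(9):1605-1611

A new digital audio watermarking algorithm based on the support vector machine (SVM) is proposed. The main idea is to embed a segment of template information in the host audio and to define a correspondence between the template information and the host audio, which turns watermark detection into a two-class classification problem that an SVM can handle. By learning this prior knowledge (the correspondence), the SVM can correctly classify and detect unknown digital audio watermarks. Simulation results show that the watermark is robust and imperceptible: it is still detected correctly after common signal processing operations such as MP3 compression, low-pass filtering, resampling/requantization, and added noise. Detection does not require the original audio, so the watermark is detected blindly.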

14.

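Audio classification method based on APR-SVM

王晓峰 蒋先涛 《微机发展》2012,(10):59-61,65

Audio classification is widely used in multimedia applications and relies mainly on time-domain and frequency-domain analysis. This paper proposes an audio classification method based on the adaptive interval ratio (APR) algorithm and the support vector machine (SVM) algorithm: the APR algorithm first separates speech from non-speech, and the non-speech audio is then classified by the SVM. The APR algorithm distinguishes speech from non-speech by comparing the PR parameter with a threshold, and is closely tied to the signal-to-noise ratio. Non-speech audio is divided into four groups (music, car, meeting, and rain), from which feature factors are extracted. Experimental results show that the classifier designed in the paper reaches an accuracy above 93.75% and separates the audio types well.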

15.

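Research on structural simplification of support vector machine multi-class classifiers (total citations: 8; self-citations: 0; citations by others: 8)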

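王立国 张晔 谷延锋 《中国图象图形学报》2005,10(5):571-574

Because support vector machines (SVMs) have distinctive advantages in pattern recognition and regression analysis, chiefly in handling nonlinear and high-dimensional data, they have become a focus of recent research. The original SVM is particularly suited to two-class classification; a multi-class problem must be converted into several two-class problems, with a corresponding two-class sub-classifier built for each. This makes the classifier structure overly complex and slows down decisions. To classify quickly, a multi-class classifier with a simplified structure is proposed, which greatly reduces the number of sub-classifiers and markedly increases classification speed. Its classification accuracy and complexity are also compared and analyzed. Experimental results show that the classifier is effective.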
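
To see why decomposing a multi-class problem into two-class sub-classifiers becomes unwieldy, it helps to count the sub-classifiers. The sketch below does this for two common decompositions, one-vs-one and one-vs-rest. The abstract does not say which decomposition it starts from or what its simplified structure looks like, so this only illustrates the growth it refers to.

```python
def one_vs_one_count(num_classes):
    """One binary classifier for every unordered pair of classes."""
    return num_classes * (num_classes - 1) // 2


def one_vs_rest_count(num_classes):
    """One binary classifier per class, each against all other classes."""
    return num_classes


for k in (3, 5, 10):
    print(k, one_vs_one_count(k), one_vs_rest_count(k))
```

With ten classes the pairwise scheme already needs 45 sub-classifiers, all of which are evaluated for every decision, so reducing their number translates directly into faster classification.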

16.

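Design and implementation of an SVM-based audio classification system

孙文静 李士强 《计算机科学》2010,37(12):209-210

The time-domain features of audio and methods for extracting them are analyzed; the workflow and architecture of a support-vector-machine-based speech classification system and the design of the SVM speech classifier are studied; and related experiments are carried out. The results show that the SVM-based audio classification system classifies audio effectively, with an average recognition accuracy above 90%.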

17.

Automatic Prosodic Event Detection Using Acoustic, Lexical, and Syntactic Evidence

Ananthakrishnan S. Narayanan S.S. 《IEEE transactions on audio, speech, and language processing》2008,16(1):216-228

With the advent of prosody annotation standards such as tones and break indices (ToBI), speech technologists and linguists alike have been interested in automatically detecting prosodic events in speech. This is because the prosodic tier provides an additional layer of information over the short-term segment-level features and lexical representation of an utterance. As the prosody of an utterance is closely tied to its syntactic and semantic content in addition to its lexical content, knowledge of the prosodic events within and across utterances can assist spoken language applications such as automatic speech recognition and translation. On the other hand, corpora annotated with prosodic events are useful for building natural-sounding speech synthesizers. In this paper, we build an automatic detector and classifier for prosodic events in American English, based on their acoustic, lexical, and syntactic correlates. Following previous work in this area, we focus on accent (prominence, or "stress") and prosodic phrase boundary detection at the syllable level. Our experiments achieved a performance rate of 86.75% agreement on the accent detection task, and 91.61% agreement on the phrase boundary detection task on the Boston University Radio News Corpus.

18.

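SVM-based automatic color classification of wall and floor tiles (total citations: 1; self-citations: 0; citations by others: 1)

苏彩红 朱学峰 刘笛 《计算机仿真》2004,21(12):179-181

The support vector machine (SVM) is a new machine learning method based on the structural risk minimization principle, with a complete theoretical foundation. This paper applies SVM to the automatic classification of wall and floor tiles for the first time. The RGB channels of the tile image are first decomposed by wavelets; because the channels are correlated, their covariation signals are extracted as the feature set. A decision tree in binary-tree form is then built to realize multi-class SVM classification. Color classification experiments on wall and floor tiles, compared against kNN classification, show that the SVM classifier achieves higher classification accuracy.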

19.

Cross-covariance-based features for speech classification in film audio

《Journal of Visual Languages and Computing》2015

As multimedia becomes the dominant form of entertainment through an ever increasing range of digital formats, there has been a growing interest in obtaining information from entertainment media. Speech is one of the core resources in multimedia, providing a foundation for the extraction of semantic information. Thus, detecting speech is a critical first step for speech-based information retrieval systems. This work focuses on speech detection in one of the dominant forms of entertainment media: feature films. A novel approach for voice activity detection (VAD) in film audio is proposed. The approach uses correlation to analyze associations of Mel Frequency Cepstral Coefficient (MFCC) pairs in speech and non-speech data. This information then drives feature selection for the creation of MFCC cross-covariance feature vectors (MFCC-CCs), which are used to train a random forest classifier to solve a binary speech/non-speech classification problem on audio data from entertainment media. The classifier performance is evaluated on a number of test sets and achieves a classification accuracy of up to 94%. The approach is also compared with state-of-the-art and contemporary VAD algorithms, and demonstrates competitive results.
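
A cross-covariance feature for a pair of MFCCs measures how the two coefficients vary together over a window of frames. The sketch below computes a zero-lag population covariance for selected coefficient pairs. The paper's exact definition (lags, window length, normalization) is not given in the abstract, so these are assumptions, and the frames and pair choices are illustrative.

```python
def cross_covariance(track_a, track_b):
    """Population covariance between two equally long coefficient tracks."""
    n = len(track_a)
    mean_a = sum(track_a) / n
    mean_b = sum(track_b) / n
    return sum((a - mean_a) * (b - mean_b) for a, b in zip(track_a, track_b)) / n


def cross_covariance_features(frames, pairs):
    """One covariance value per selected (i, j) coefficient pair."""
    tracks = list(zip(*frames))  # tracks[i] is coefficient i over time
    return [cross_covariance(tracks[i], tracks[j]) for i, j in pairs]


# Hypothetical 3 frames x 3 coefficients (not real MFCC values).
frames = [[1.0, 2.0, 5.0], [2.0, 4.0, 3.0], [3.0, 6.0, 1.0]]
features = cross_covariance_features(frames, pairs=[(0, 1), (0, 2)])
print(features)
```

Here coefficients 0 and 1 rise together and give a positive value, while coefficients 0 and 2 move in opposite directions and give a negative one. A correlation analysis like the one the abstract describes would keep the pairs whose behaviour differs most between speech and non-speech.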

20.

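Development of a voice conversion system for a reading robot

邓杰 房宁 赵群飞 《微型电脑应用》2012,28(4):50-52

To give the reading robot (JoyT0n) a more varied reading voice, a voice conversion system based on a single speech database is designed. The initial voice synthesized by the reading robot's TTS (text-to-speech) engine is decomposed into an excitation signal and a vocal-tract filter signal, which are transformed to the frequency domain for modification. Speaking rate, pitch, and timbre are changed by reconstructing the excitation signal from the short-time Fourier magnitude spectrum and by modifying the vocal-tract filter parameters. The modified excitation and vocal-tract filter signals are then resynthesized into a new voice signal. The system lets the reading robot read with rich emotion and intonation without enlarging the speech database.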