Similar Articles
20 similar articles found.
1.
Audio event detection (AED) and recognition is a signal processing and analysis domain used in a wide range of applications, including surveillance, home automation and behavioral assessment. The field presents numerous challenges to the current state of the art due to its highly nonlinear nature. High false alarm rates (FARs) in such applications particularly limit the capabilities of vision-based perimeter monitoring systems by inducing high operator dependence. On the other hand, conventional fence-based vibration detectors and pressure-driven “taut wires” offer high sensitivity at the cost of a high FAR due to debris, animals and weather. This work reports an audio event identification methodology implemented as a test-bed system for a surveillance application to reduce the FAR, maximize blind-spot coverage and improve audio event classification accuracy. The first phase utilizes a nonlinear autoregressive classifier to locate and classify discrete audio events via an exogenous sound-direction variable that improves classifier confidence. The second phase implements a time-series-based system to distinguish various audio activity groups from nominal everyday sound events such as traffic and muffled speech. The discretely labeled data are then used to train HMM and Conditional Random Field classifiers, yielding a substantial improvement in classification accuracy for indoor human activities.
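The first-phase design — autoregressive lags of an audio feature plus an exogenous sound-direction input — can be sketched as a feature-construction step. All names, sizes, and the toy signals below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def narx_features(signal, direction, n_lags):
    """Stack past feature values (autoregressive part) with the current
    exogenous sound-direction estimate into one feature vector per frame."""
    rows = []
    for t in range(n_lags, len(signal)):
        rows.append(np.concatenate([signal[t - n_lags:t], [direction[t]]]))
    return np.array(rows)

sig = np.sin(np.linspace(0, 8 * np.pi, 200))   # toy frame-level audio feature
doa = np.repeat([0.0, 1.0], 100)               # toy direction-of-arrival estimate
X = narx_features(sig, doa, n_lags=5)
print(X.shape)   # (195, 6): 5 lagged values + 1 exogenous direction
```

A nonlinear classifier (the paper's NARX network, or any discriminative model) would then be trained on rows of `X`.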

2.
Automatic audio content recognition has attracted increasing attention in the development of multimedia systems, where the most popular approaches combine frame-based features with statistical models or discriminative classifiers. Existing methods are effective for clean single-source event detection but may not perform well for unstructured environmental sounds, which have a broad, noise-like flat spectrum and a diverse variety of compositions. We present an automatic acoustic scene understanding framework that detects audio events through two hierarchies, acoustic scene recognition and audio event recognition, in which the former proceeds by tracking dominant audio sources and in turn helps infer non-dominant audio events within the same scene by modeling their occurrence correlations. On the scene recognition hierarchy, we perform adaptive segmentation and feature extraction for every input acoustic scene stream through an Eigen-audiospace and an optimized feature subspace, respectively. After filtering the background, scene streams are recognized by modeling the observation density of dominant features using a two-level hidden Markov model. On the audio event recognition hierarchy, scene knowledge is characterized by an audio context model that describes the occurrence correlations of dominant and non-dominant audio events within the scene. Monte Carlo integration and gradient descent techniques are employed to maximize the likelihood and correctly tag each audio event. To the best of our knowledge, this is the first work that models event correlations as scene context for robust audio event detection in complex and noisy environments. Note that, according to a recent report, the mean accuracy of human listeners on the acoustic scene classification task is only around 71% on data collected in office environments from the DCASE dataset. None of the existing methods performs well on all scene categories, and the average accuracy of the best performances of 11 recent methods is 53.8%. The proposed method achieves an average accuracy of 62.3% on the same dataset. Additionally, we create a 10-CASE dataset by manually collecting 5,250 audio clips of 10 scene types and 21 event categories. Our experimental results on 10-CASE show that the proposed method achieves an improved average accuracy of 78.3%, and that the average accuracy of audio event recognition can be effectively improved by capturing dominant audio sources and reasoning about non-dominant events from the dominant ones through acoustic context modeling. In future work, exploring the interactions between acoustic scene recognition and audio event detection and incorporating other modalities will be required to further advance the proposed framework.
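The core idea of inferring non-dominant events from the dominant ones can be illustrated with a toy co-occurrence matrix. The numbers below are invented; in the paper these correlations are learned and combined with Monte Carlo integration and gradient descent:

```python
import numpy as np

# Hypothetical context matrix C[i, j]: how often non-dominant event j
# co-occurs when dominant event i is active in a scene (rows sum to 1).
C = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])

p_dominant = np.array([0.8, 0.15, 0.05])   # classifier posterior over dominant events
p_nondominant = p_dominant @ C             # context-weighted prior over co-occurring events
print(p_nondominant.round(3))
```

Weighting a non-dominant event detector's scores by such a context prior is one simple way scene knowledge can raise event-tagging accuracy.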

3.
As multimedia becomes the dominant form of entertainment through an ever-increasing range of digital formats, there has been growing interest in obtaining information from entertainment media. Speech is one of the core resources in multimedia, providing a foundation for the extraction of semantic information. Thus, detecting speech is a critical first step for speech-based information retrieval systems. This work focuses on speech detection in one of the dominant forms of entertainment media: feature films. A novel approach for voice activity detection (VAD) in film audio is proposed. The approach uses correlation to analyze associations of Mel Frequency Cepstral Coefficient (MFCC) pairs in speech and non-speech data. This information then drives feature selection for the creation of MFCC cross-covariance feature vectors (MFCC-CCs), which are used to train a random forest classifier to solve a binary speech/non-speech classification problem on audio data from entertainment media. The classifier performance is evaluated on a number of test sets and achieves a classification accuracy of up to 94%. The approach is also compared with state-of-the-art and contemporary VAD algorithms, demonstrating competitive results.
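The MFCC-CC idea — cross-covariances between selected pairs of MFCC coefficient trajectories — can be sketched as follows. The pair choices, lag range, and random input are illustrative assumptions, not the paper's selected configuration:

```python
import numpy as np

def mfcc_cc(mfcc, pairs, max_lag=2):
    """Cross-covariance features between selected MFCC coefficient pairs.
    mfcc: (n_coeffs, n_frames) matrix; pairs: list of (i, j) coefficient indices."""
    feats = []
    for i, j in pairs:
        a = mfcc[i] - mfcc[i].mean()
        b = mfcc[j] - mfcc[j].mean()
        for lag in range(-max_lag, max_lag + 1):
            if lag >= 0:
                v = np.dot(a[lag:], b[:len(b) - lag]) / len(a)
            else:
                v = np.dot(a[:lag], b[-lag:]) / len(a)
            feats.append(v)
    return np.array(feats)

rng = np.random.default_rng(0)
m = rng.standard_normal((13, 100))          # stand-in for a 13-coefficient MFCC matrix
f = mfcc_cc(m, pairs=[(0, 1), (2, 5)])
print(f.shape)   # (10,): 2 pairs x 5 lags
```

In the paper's pipeline, correlation analysis on speech vs. non-speech data decides which coefficient pairs are worth encoding; the resulting vectors feed a random forest.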

4.
The traditional von Neumann architecture is energy-inefficient when processing complex information such as speech; neuromorphic circuits are better suited to the intelligent processing of such information. The long-term and short-term features used in common acoustic scene recognition approaches each have shortcomings, whereas convolutional neural networks can learn, through training, features suited to the downstream classification task, giving them an advantage in feature extraction. This work studies acoustic scene recognition on spectrograms using the feature extraction and analysis of a four-layer convolutional neural network, and verifies that acoustic scene recognition can be realized on a neuromorphic, brain-inspired computing chip.
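The core operation of such a CNN — convolving a spectrogram with learned kernels and applying a nonlinearity — can be illustrated with one handcrafted-kernel layer in plain NumPy. The kernel and input sizes are invented for illustration; a real network stacks four trained layers:

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D convolution (strictly, cross-correlation, as in CNN layers)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

spec = np.random.rand(64, 100)             # toy log-mel spectrogram (mel bins x frames)
edge = np.array([[1.0], [-1.0]])           # hypothetical kernel: spectral-edge detector
fmap = np.maximum(conv2d(spec, edge), 0)   # convolution + ReLU = one CNN layer
print(fmap.shape)   # (63, 100)
```

Stacking such layers with pooling yields the learned features that replace hand-designed long- and short-term descriptors.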

5.
Multimedia event detection (MED) is a challenging problem because of the heterogeneous content and variable quality found in large collections of Internet videos. To study the value of multimedia features and fusion for representing and learning events from a set of example video clips, we created SESAME, a system for video SEarch with Speed and Accuracy for Multimedia Events. SESAME includes multiple bag-of-words event classifiers based on single data types: low-level visual, motion, and audio features; high-level semantic visual concepts; and automatic speech recognition. Event detection performance was evaluated for each event classifier. The performance of low-level visual and motion features was improved by the use of difference coding. The accuracy of the visual concepts was nearly as strong as that of the low-level visual features. Experiments with a number of fusion methods for combining the event detection scores from these classifiers revealed that simple fusion methods, such as arithmetic mean, perform as well as or better than other, more complex fusion methods. SESAME’s performance in the 2012 TRECVID MED evaluation was one of the best reported.
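Score-level fusion by arithmetic mean — the simple scheme the authors found competitive with more complex methods — amounts to averaging each classifier's per-event scores. The classifier names and score values below are invented:

```python
# Late fusion of per-event detection scores from single-data-type classifiers.
scores = {
    "visual": [0.9, 0.2, 0.4],   # hypothetical scores for three candidate events
    "motion": [0.7, 0.1, 0.6],
    "audio":  [0.8, 0.3, 0.2],
}
n_events = 3
fused = [sum(s[i] for s in scores.values()) / len(scores) for i in range(n_events)]
print([round(v, 3) for v in fused])
```

A practical attraction of mean fusion is that it needs no held-out data to train fusion weights, unlike learned combination methods.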

6.
Multi-class audio example recognition based on fractal Brownian motion and AdaBoost
This paper proposes an audio feature extraction and recognition method based on fractal Brownian motion. The method uses a fractal Brownian motion model to compute the fractal dimension of audio examples, which serves as their fractal feature. Since the audio fractal features follow a Gaussian distribution, the AdaBoost algorithm is used for feature reduction. An Ada-weighted Gaussian classifier and a support vector machine then classify the reduced features, and a multi-class model is constructed on top of the two-class classifiers. Experiments show that, after feature reduction, the audio fractal features outperform other audio features in classifying music and speech.

7.
The fundamental frequency (also called pitch or F0) and its variation are important characteristics of a speech signal. Since speech is an approximately periodic signal, accurately extracting the F0 parameter matters greatly for downstream processing such as speech recognition. Many researchers have worked on this problem and proposed algorithms that achieve good results. This article studies F0 extraction algorithms for speech signals and gives a systematic overview and brief introduction.
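One classic family of F0 extractors covered by such surveys is autocorrelation-based: the lag of the autocorrelation peak within a plausible pitch range gives the period. A minimal sketch (parameters and function names are illustrative):

```python
import numpy as np

def f0_autocorr(frame, sr, fmin=80.0, fmax=400.0):
    """Estimate F0 of a voiced frame from the autocorrelation peak."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0..N-1
    lo, hi = int(sr / fmax), int(sr / fmin)       # restrict to the plausible pitch range
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 200 * t)                # 200 Hz test tone
f0 = f0_autocorr(tone[:800], sr)
print(f0)   # → 200.0
```

Real algorithms add voicing decisions, octave-error correction, and smoothing across frames on top of this core step.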

8.
The performance of speech recognition in distant-talking environments is severely degraded by the reverberation that can occur in enclosed spaces (e.g., meeting rooms). To mitigate this degradation, dereverberation techniques such as network structure-based denoising autoencoders and multi-step linear prediction are used to improve the recognition accuracy of reverberant speech. Regardless of the reverberant conditions, a novel discriminative bottleneck feature extraction approach has been demonstrated to be effective for speech recognition under a range of conditions. As bottleneck feature extraction is not primarily designed for dereverberation, we are interested in whether it can complement other carefully designed dereverberation approaches. In this paper, we propose three schemes covering both front-end processing (cascaded combination and parallel combination) and back-end processing (system combination). Each of these schemes integrates bottleneck feature extraction with dereverberation. The effectiveness of these schemes is evaluated via a series of experiments using the REVERB challenge dataset.
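Multi-step (delayed) linear prediction, one of the dereverberation front-ends mentioned, predicts the late reverberant tail from samples at least `delay` steps in the past and keeps the prediction residual. The toy full-band version below is a rough sketch under invented sizes; practical systems apply long prediction filters per sub-band:

```python
import numpy as np

def delayed_lp_dereverb(x, order=10, delay=4):
    """Fit a delayed linear predictor of x[n] from x[n-delay-1..n-delay-order]
    and return the residual, which retains the direct-path component."""
    N = len(x)
    rows, targets = [], []
    for n in range(order + delay, N):
        rows.append(x[n - delay - order:n - delay][::-1])   # delayed past samples
        targets.append(x[n])
    A, b = np.array(rows), np.array(targets)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)               # least-squares predictor
    resid = b - A @ w
    return np.concatenate([x[:order + delay], resid])       # keep warm-up samples as-is

rng = np.random.default_rng(0)
dry = rng.standard_normal(500)
rev = dry + 0.6 * np.concatenate([np.zeros(8), dry[:-8]])   # toy single echo at lag 8
out = delayed_lp_dereverb(rev, order=10, delay=4)
print(out.shape)
```

The delay ensures the predictor cannot cancel the direct sound itself, only the late reflections.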

9.
Obtaining speech features through feature selection is currently an effective approach to speaker recognition. However, the optimal speech features vary with the specific application environment. This paper therefore proposes a wrapper-style genetic feature selection algorithm over four types of speech features (FSF-WrGAF). The algorithm extracts four types of speech feature parameters and performs wrapper-based dynamic feature selection via a chained-agent genetic algorithm and a GMM-UBM, achieving high recognition accuracy. Multiple metrics were used to test the algorithm's performance. Experimental results show that the algorithm is simple to implement, clearly effective, and significantly outperforms comparable algorithms on several metrics (recognition rate, EER, and DET curve).
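Wrapper-style genetic feature selection encodes a feature subset as a bit mask and evolves it against the recognizer's score. The sketch below is a plain GA under invented toy data; a cheap Fisher-like score stands in for the GMM-UBM recognizer, and the chained-agent structure of the paper is omitted:

```python
import numpy as np

rng = np.random.default_rng(42)

def fitness(mask, X, y, penalty=0.1):
    """Wrapper fitness: per-feature separability score minus a size penalty
    (a stand-in for evaluating the GMM-UBM recognizer on the subset)."""
    sel = mask.astype(bool)
    if not sel.any():
        return -1e9
    Xs = X[:, sel]
    mu0, mu1 = Xs[y == 0].mean(0), Xs[y == 1].mean(0)
    s = Xs[y == 0].var(0) + Xs[y == 1].var(0) + 1e-9
    return float(((mu0 - mu1) ** 2 / s).sum() - penalty * sel.sum())

def ga_select(X, y, pop=20, gens=30, p_mut=0.1):
    n = X.shape[1]
    P = rng.integers(0, 2, size=(pop, n))
    for _ in range(gens):
        f = np.array([fitness(m, X, y) for m in P])
        parents = P[np.argsort(f)[-pop // 2:]]               # keep the fitter half (elitist)
        kids = []
        while len(parents) + len(kids) < pop:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = int(rng.integers(1, n))
            child = np.concatenate([a[:cut], b[cut:]])       # one-point crossover
            child = child ^ (rng.random(n) < p_mut)          # bit-flip mutation
            kids.append(child.astype(int))
        P = np.vstack([parents] + kids)
    f = np.array([fitness(m, X, y) for m in P])
    return P[int(np.argmax(f))]

# Toy task: only features 0 and 1 separate the two classes.
y = rng.integers(0, 2, 200)
X = rng.standard_normal((200, 8))
X[:, 0] += 3 * y
X[:, 1] -= 3 * y
best = ga_select(X, y)
print(best)
```

Swapping the surrogate fitness for a real recognizer's accuracy turns this into the full wrapper method, at much higher evaluation cost.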

10.
Content-based audio signal classification into broad categories such as speech, music, or speech with noise is the first step before any further processing such as speech recognition, content-based indexing, or surveillance systems. In this paper, we propose an efficient content-based audio classification approach to classify audio signals into broad genres using a fuzzy c-means (FCM) algorithm. We analyze different characteristic features of audio signals in the time, frequency, and coefficient domains and select the optimal feature vector by applying a novel analytical scoring method to each feature. We utilize an FCM-based classification scheme and apply it to the extracted normalized optimal feature vector to achieve an efficient classification result. Experimental results demonstrate that the proposed approach outperforms existing state-of-the-art audio classification systems by more than 11% in classification performance.
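Fuzzy c-means alternates between updating soft memberships and cluster centers. A minimal NumPy implementation on a toy 2-D feature space (the data and cluster count are illustrative, not the paper's feature set):

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means: returns memberships U (n x c) and centers V."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]              # fuzzily weighted centers
        d = np.linalg.norm(X[:, None, :] - V[None], axis=2) + 1e-9
        U = 1.0 / (d ** (2.0 / (m - 1.0)))                    # membership update
        U /= U.sum(axis=1, keepdims=True)
    return U, V

# Two toy "genres" in a 2-D feature space.
X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (50, 2)),
               np.random.default_rng(2).normal(3, 0.3, (50, 2))])
U, V = fuzzy_cmeans(X, c=2)
labels = U.argmax(axis=1)
```

Unlike hard k-means, each signal keeps graded memberships in all classes, which suits ambiguous categories such as "speech with noise".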

11.
To further improve speech-based deception detection, a lie detection algorithm based on a denoising autoencoder (DAE) and a long short-term memory (LSTM) network is proposed. The algorithm first builds PDL, an optimized parallel structure of the DAE and the LSTM. Hand-crafted features extracted from the speech are fed into the DAE to obtain more robust features, while the Mel spectrum, extracted frame by frame from windowed speech, is fed into the LSTM to learn frame-level deep features. The two kinds of features are then fused through a fully connected layer with batch normalization, and a softmax classifier performs deception recognition. Experimental results on the CSC (Columbia-SRI-Colorado) corpus and a self-built corpus show fused-feature recognition accuracies of 65.18% and 68.04%, respectively, up to 5.56 and 7.22 percentage points higher than the compared algorithms, indicating that the proposed algorithm can effectively improve deception recognition accuracy.

12.
As an important branch of computer vision, abnormal-behavior recognition and detection has been widely applied in intelligent security, medical monitoring, traffic control, and other fields. How abnormal behavior is defined and judged is closely tied to the scene, so choosing feature extraction and recognition/detection methods appropriate to the characteristics of the application scenario is crucial in practice for ensuring alert accuracy. On this basis, this paper surveys video-based human abnormal-behavior recognition and detection methods. It first gives the definition, characteristics, and classification of abnormal human behavior. It then summarizes feature extraction methods, since the choice and quality of extracted features directly affect subsequent judgments. Next, it analyzes and discusses judgment methods from the two angles of abnormal-behavior recognition and abnormal-behavior detection, listing commonly used abnormal-behavior detection datasets and the performance of related algorithms. Finally, it offers an outlook on future research directions in this field.

13.
The process of counting stuttering events could be carried out more objectively through the automatic detection of stop-gaps, syllable repetitions and vowel prolongations. The alternative would be based on subjective evaluations of speech fluency and may depend on the evaluation method used. Meanwhile, the automatic detection of intervocalic intervals, stop-gaps, voice onset time and vowel durations may depend on the speaker, and rules derived for a single speaker might be unreliable when treated as universal ones. This implies that learning algorithms with strong generalization capabilities could be applied to solve the problem. Nevertheless, such a system requires vectors of parameters that characterize the distinctive features in a subject's speech patterns. In addition, an appropriate selection of the parameters and feature vectors during learning may augment the performance of an automatic detection system. The paper reports on automatic recognition of stuttered speech in normal and frequency-altered feedback speech. It presents several methods of analyzing stuttered speech and describes attempts to establish the parameters that represent stuttering events. It also reports the results of experiments on automatic detection of speech disorder events based on both rough sets and artificial neural networks.

14.
This paper surveys the research framework of speech emotion recognition, which covers four aspects: emotion description models, emotional speech databases, feature extraction and dimensionality reduction, and emotion classification and regression algorithms. The paper summarizes discrete emotion models, dimensional emotion models, and the one-way mapping between the two; sets out criteria for selecting emotional speech databases; refines the taxonomy of speech emotion features and lists commonly used feature extraction tools; and finally distills the characteristics of common algorithms for feature extraction and for emotion classification and regression, summarizes progress in deep learning research, raises new problems that the field of emotional speech recognition needs to solve, and predicts development trends.

15.
Research on key technologies of content-based audio retrieval
Zhu Aihong, Li Lian. Modern Computer, 2003(11): 37-40, 51
Audio is an important medium containing rich auditory features. Based on current progress in audio retrieval research, this paper surveys content-based audio retrieval methods and discusses several key technologies in audio retrieval research: audio feature extraction, audio classification, speech recognition, and others. It closes with an outlook on the development prospects of audio retrieval technology.

16.
The majority of pieces of music, including classical and popular music, are composed using musical scales, such as keys. The key or scale information of a piece provides important clues about its high-level musical content, such as harmonic and melodic context. Automatic key detection from music data can be useful for music classification, retrieval, or further content analysis. Many researchers have addressed key finding from symbolically encoded music (MIDI); however, work on key detection in musical audio is still limited. Techniques for key detection from musical audio mainly consist of two steps: pitch extraction and key detection. The pitch feature typically characterizes the weight of each pitch class present in the audio. In existing approaches to pitch extraction, little consideration has been given to pitch mistuning and interference from noisy percussion sounds in the audio signal, which inevitably affects the accuracy of key detection. In this paper, we present a novel technique for precise pitch profile feature extraction that deals with pitch mistuning and noisy percussive sounds. The extracted pitch profile feature characterizes the pitch content of the signal more accurately than previous techniques, leading to higher key detection accuracy. Experiments were conducted on classical and popular music data. The results show that the proposed method achieves higher key detection accuracy than previous methods, especially for popular music with many noisy drum sounds.
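Once a 12-bin pitch-class profile has been extracted, the key-detection step is commonly done by correlating it against rotated key templates. The sketch below uses the standard Krumhansl-Kessler major-key profile as the template; the paper's own contribution is the more robust profile extraction that precedes this step:

```python
import numpy as np

# Krumhansl-Kessler major-key profile (a standard perceptual template).
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])

def detect_key(chroma):
    """Return the major key (0 = C, 1 = C#, ..., 7 = G, ...) whose rotated
    template correlates best with the observed pitch-class profile."""
    corr = [np.corrcoef(chroma, np.roll(MAJOR, k))[0, 1] for k in range(12)]
    return int(np.argmax(corr))

# A noisy G-major-like profile (toy input standing in for extracted chroma).
g_major = np.roll(MAJOR, 7) + np.random.default_rng(0).normal(0, 0.1, 12)
print(detect_key(g_major))   # → 7 (G major)
```

Adding the twelve rotations of a minor-key template extends this to 24-key detection.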

17.
Audio classification based on wavelet transform and support vector machines
Audio feature extraction is the basis of audio classification, and audio classification is in turn key to content-based audio retrieval. This paper analyzes the features that distinguish speech from music and proposes an audio feature extraction and classification method based on the wavelet transform and support vector machines, applied to classifying pure speech, music, speech with background music, and environmental sound; the new feature set is evaluated with an SVM classifier. Experimental results show that the proposed audio features are effective and reasonable, with good classification performance.

18.
Audio-visual speech recognition (AVSR) has shown impressive improvements over audio-only speech recognition in the presence of acoustic noise. However, region-of-interest detection and feature extraction may limit recognition performance when the visual speech information is obtained from planar video data alone. In this paper, we depart from traditional visual speech information and propose an AVSR system integrating 3D lip information. The Microsoft Kinect multi-sensory device was adopted for data collection. Different feature extraction and selection algorithms were applied to the planar images and the 3D lip information, so as to fuse them into a joint visual-3D lip feature. For automatic speech recognition (ASR), fusion methods were investigated and the audio-visual speech information was integrated into a state-synchronous two-stream Hidden Markov Model. The experimental results demonstrate that our AVSR system integrating 3D lip information improves the recognition performance of traditional ASR and AVSR systems in acoustically noisy environments.

19.
Decision support systems have become a very important part of our lives, and technology makes them applicable almost everywhere. These systems can simplify solutions to numerous problems, speed up the analysis of medical research results, and contribute to the rapid classification of input patterns. Practical applications may not only contribute to efficient technology but also improve important aspects of our lives. In this article, we propose a model of a decision support system for speech processing. The proposed mechanism can be used in various applications in which a voice sample is evaluated using the proposed methodology. The solution is based on analyzing the speech signal with an intelligent technique in which the signal is processed by a composed mathematical transform cooperating with a bio-inspired algorithm and a spiking neural network to evaluate possible voice problems. A novelty of our idea is that it approaches the topic from a different angle, composing graphical representations of audio signals with heuristic methods for feature extraction. The results are discussed after extensive comparisons of the advantages and disadvantages of the proposed approach. As part of the conducted research, we demonstrate which transformations and heuristic algorithms work better in the process of voice analysis.

20.
Efficient and accurate instrument recognition can effectively advance research on source separation, automatic music transcription, and music genre classification, and can be widely applied in playlist generation, acoustic environment classification, intelligent instrument teaching, interactive multimedia, and many other fields. In recent years, instrument recognition systems have improved substantially in performance, but many problems remain: some instruments are still hard to recognize, instrument audio features are difficult to extract, and the accuracy of polyphonic instrument recognition is low. How to recognize instruments efficiently and accurately with the help of artificial intelligence has become a research hotspot and challenge. In view of the current state of research, this paper surveys the field from three angles: commonly used audio features for instrument recognition, instrument recognition models and methods, and commonly used datasets. It then summarizes the limitations of current research and future development trends, providing a reference for instrument recognition research.


Copyright©北京勤云科技发展有限公司  京ICP备09084417号