20 similar documents found (search time: 0 ms)
1.
This paper addresses the problem of parameterization for speech/music discrimination. The currently successful parameterization based on cepstral coefficients uses the Fourier transform (FT), which is well adapted to stationary signals. To take the non-stationarity of music/speech signals into account, this work studies wavelet-based signal decomposition instead of the FT. Three wavelet families and several numbers of vanishing moments have been evaluated. Different types of energy, calculated for each frequency band obtained from the wavelet decomposition, are studied. Static, dynamic and long-term parameters were evaluated. The proposed parameterizations are integrated into two class/non-class classifiers: one for speech/non-speech and one for music/non-music. Experiments on realistic corpora covering different styles of speech and music (Broadcast News, Entertainment, Scheirer) illustrate the performance of the proposed parameterization, especially for music/non-music discrimination. Our parameterization yielded a significant reduction of the error rate: more than 30% relative improvement over MFCC parameterization on the envisaged tasks.
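The band-energy idea in this abstract can be sketched in a few lines. Below is a minimal Haar-wavelet version in NumPy; it is illustrative only, since the paper evaluates several wavelet families and vanishing-moment counts (which would call for a library such as PyWavelets), and its exact energy definitions were tuned experimentally:

```python
import numpy as np

def haar_band_energies(signal, levels=4):
    """Log-energy per frequency band from a Haar wavelet decomposition.

    A minimal sketch of wavelet-based parameterization: each level splits
    the signal into a lowpass (approximation) and highpass (detail) band,
    and one log-energy is kept per band.
    """
    energies = []
    a = np.asarray(signal, dtype=float)
    for _ in range(levels):
        approx = (a[0::2] + a[1::2]) / np.sqrt(2)  # lowpass half-band
        detail = (a[0::2] - a[1::2]) / np.sqrt(2)  # highpass half-band
        energies.append(np.log(np.sum(detail ** 2) + 1e-12))
        a = approx
    energies.append(np.log(np.sum(a ** 2) + 1e-12))  # final approximation band
    return energies

# One 512-sample frame of a 440 Hz tone at 16 kHz -> levels + 1 features.
frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000.0)
feats = haar_band_energies(frame, levels=4)
```

Stacking such per-frame band energies (plus deltas for the dynamic parameters) yields a feature vector analogous to the cepstral ones it replaces.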
2.
Machine-learning based classification of speech and music (cited by 2: 0 self-citations, 2 by others)
The need to classify audio into categories such as speech or music is an important aspect of many multimedia document retrieval systems. In this paper, we investigate audio features that have not previously been used in music-speech classification, such as the mean and variance of the discrete wavelet transform, the variance of Mel-frequency cepstral coefficients, the root mean square of a lowpassed signal, and the difference of the maximum and minimum zero-crossings. We then employ fuzzy C-means clustering to select a viable set of features that enables better classification accuracy. Three different classification frameworks have been studied: Multi-Layer Perceptron (MLP) neural networks, radial basis function (RBF) neural networks, and hidden Markov models (HMM); the results of each framework are reported and compared. Our extensive experimentation has identified a subset of features that contributes most to accurate classification, and has shown that MLP networks are the most suitable classification framework for the problem at hand.
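Two of the simpler features listed above, the RMS of a lowpassed signal and the difference of maximum and minimum zero-crossings across frames, can be sketched as follows. The moving-average filter, frame length and function names here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def frame_features(x, frame_len=256):
    """Per-frame (RMS-of-lowpass, zero-crossing-count) pairs, plus the
    max-min zero-crossing spread across frames.

    Sketch only: the paper's lowpass filter and its DWT/MFCC statistics
    are not reproduced here.
    """
    frames = [x[i:i + frame_len]
              for i in range(0, len(x) - frame_len + 1, frame_len)]
    feats = []
    for f in frames:
        lowpass = np.convolve(f, np.ones(8) / 8, mode="same")  # crude lowpass
        rms = np.sqrt(np.mean(lowpass ** 2))
        # Count sign changes = zero-crossings within the frame.
        zc = int(np.sum(np.abs(np.diff(np.signbit(f).astype(int)))))
        feats.append((rms, zc))
    zcs = [z for _, z in feats]
    return feats, max(zcs) - min(zcs)

x = np.sin(2 * np.pi * 440 * np.arange(1024) / 16000.0)
feats, zc_spread = frame_features(x)
```

Speech tends to show a larger zero-crossing spread than steady music, which is what makes this feature discriminative.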
3.
《Ergonomics》2012,55(11):1068-1091
The aim of this study was to determine the effects of three different sound environments on the performance of cognitive tasks of varying complexity. The three sound environments were ‘speech’, ‘masked speech’ and ‘continuous noise’, corresponding to poor, acceptable and perfect acoustical privacy in an open-plan office, respectively. The speech transmission indices were 0.00, 0.30 and 0.80, respectively, and all sound environments were presented at 48 dBA. The laboratory experiment on 36 subjects lasted 4 h per subject. Proofreading performance deteriorated in the ‘speech’ environment (p < 0.05) compared to the other two. Reading comprehension and computer-based tasks (simple and complex reaction time, subtraction, proposition, Stroop and vigilance) remained unaffected. Subjects assessed ‘speech’ as the most disturbing, most disadvantageous and least pleasant environment (p < 0.01); ‘continuous noise’ annoyed the least. Subjective arousal was highest in ‘masked speech’ and lowest in ‘continuous noise’ (p < 0.05). Performance in real open-plan offices could be improved by reducing speech intelligibility, e.g. by attenuating the speech level and using an appropriate masking sound.
4.
Hidetoshi Miyao Minoru Maruyama 《International Journal on Document Analysis and Recognition》2007,9(1):49-58
The objective of this study is to produce a system that allows music symbols to be written by hand on a pen-based computer, simulating the feeling of writing on sheets of paper, while also accurately recognizing the music symbols. To accomplish these objectives, the following methods are proposed: (1) two features, time-series data and an image of a handwritten stroke, are used to recognize strokes; and (2) the strokes are combined, as efficiently as possible, and output automatically as a music symbol. As a result, recognition rates of 97.60% and 98.80% were obtained in tests with strokes and music symbols, respectively.
5.
Peter Grogono 《Software》1973,3(4):369-383
MUSYS is a system of programs used to create electronic music at the computer studio of Electronic Music Studios, London. This paper describes the programming language employed by composers, and the implementation of its compiler and of other programs in the system. It is shown that by the use of a macrogenerator, an efficient and useful system can be built from simple software on a small computer.
6.
This paper presents a new speech enhancement system that works in the wavelet domain. The core of the system is an improved WaveShrink module. First, different parameters of WaveShrink are studied; then, based on the features of the speech signal, an improved wavelet-based speech enhancement system is proposed. The system uses a novel thresholding algorithm and introduces a new method for threshold selection. Moreover, the efficiency of the system has been increased by selecting more suitable parameters separately for voiced, unvoiced and silence regions. The proposed system has been evaluated on different sentences under various noise conditions. The results show an appreciable improvement in the performance of the system in comparison with similar approaches.
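The baseline WaveShrink operation that this paper improves on is soft thresholding of wavelet coefficients with Donoho's universal threshold. A minimal sketch of that baseline (the paper's novel thresholding rule and its region-dependent parameter selection are not reproduced here):

```python
import numpy as np

def soft_threshold(coeffs, sigma, n):
    """Soft-threshold wavelet coefficients with the universal threshold
    t = sigma * sqrt(2 * log(n)), where sigma is the noise standard
    deviation and n the signal length: coefficients below t are zeroed,
    the rest are shrunk toward zero by t.
    """
    t = sigma * np.sqrt(2.0 * np.log(n))
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)

# Small coefficients (likely noise) are removed; large ones are shrunk.
coeffs = np.array([3.0, -0.1, 0.5])
shrunk = soft_threshold(coeffs, sigma=0.1, n=1024)
```

Denoising then amounts to: wavelet-transform the noisy speech, shrink the coefficients, and invert the transform.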
7.
This paper considers the separation and recognition of overlapped speech sentences assuming single-channel observation. A system based on a combination of several different techniques is proposed. The system uses a missing-feature approach for improving crosstalk/noise robustness, a Wiener filter for speech enhancement, hidden Markov models for speech reconstruction, and speaker-dependent/-independent modeling for speaker and speech recognition. We develop the system on the Speech Separation Challenge database, which involves separating and recognizing two mixed sentences without assuming advance knowledge of the identity of the speakers or of the signal-to-noise ratio. The paper is an extended version of a previous conference paper submitted for the challenge.
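The Wiener-filter enhancement step mentioned above reduces, per frequency bin, to applying a gain G = SNR / (1 + SNR). A generic sketch with a crude a-priori SNR estimate (the paper's exact estimator is not given in the abstract):

```python
import numpy as np

def wiener_gain(noisy_power, noise_power):
    """Per-bin Wiener gain G = snr / (1 + snr), with the a-priori SNR
    roughly estimated as max(noisy/noise - 1, 0).

    A textbook sketch of the enhancement step, not the paper's full
    separation system.
    """
    snr = np.maximum(noisy_power / np.maximum(noise_power, 1e-12) - 1.0, 0.0)
    return snr / (1.0 + snr)

# Equal noisy/noise power -> gain 0; strong speech -> gain near 1.
g_eq = wiener_gain(np.array([1.0]), np.array([1.0]))
g_hi = wiener_gain(np.array([100.0]), np.array([1.0]))
```

The gain is multiplied onto the noisy short-time spectrum before inverse transformation.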
8.
N. Ruiz-Reyes P. Vera-Candeas J. E. Muñoz S. García-Galán F. J. Cañadas 《Multimedia Tools and Applications》2009,41(2):253-286
Automatic discrimination of speech and music is an important tool in many multimedia applications. The paper presents a robust and effective approach for speech/music discrimination, which relies on a set of features derived from fundamental frequency (F0) estimation. Comparison between the proposed set of features and some commonly used timbral features is performed, aiming to assess the good discriminatory power of the proposed F0-based feature set. The classification scheme is composed of a classical Statistical Pattern Recognition classifier followed by a Fuzzy Rules Based System. Comparison with other well-proven classification schemes is also performed. Experimental results reveal that our speech/music discriminator is robust enough, making it suitable for a wide variety of multimedia applications.
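One common way to obtain the F0 values from which such features are derived is autocorrelation peak picking. A minimal sketch (the paper's own F0 estimator is not specified in the abstract, so this is a generic stand-in):

```python
import numpy as np

def estimate_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate F0 of one frame by picking the autocorrelation peak in
    the lag range corresponding to [fmin, fmax] Hz.
    """
    frame = frame - np.mean(frame)
    # Autocorrelation for non-negative lags.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

# A 200 Hz tone at 8 kHz should yield an F0 estimate near 200 Hz.
fs = 8000
frame = np.sin(2 * np.pi * 200 * np.arange(800) / fs)
f0 = estimate_f0(frame, fs)
```

Voiced speech shows smooth F0 contours while music often holds stable pitches; statistics over such estimates give discriminative features.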
9.
Brad H. Story Ingo R. Titze Darrell Wong 《Engineering Applications of Artificial Intelligence》1997,10(6):593-601
This paper explores a model that reduces speech production to the specification of four time-varying parameters: F1 and F2, the voice fundamental frequency (F0), and a relative amplitude of the voice. The trajectory of the first two formants, F1 and F2, is treated as a series of coordinate pairs that are mapped from the F1-F2 plane into a two-dimensional plane of coefficients. These coefficients are multipliers of two empirically-based orthogonal basis vectors which, when added to a neutral vowel area function, produce a new area function with the desired locations of F1 and F2. Thus, area functions and voice parameters extracted at appropriate time intervals can be fed into a speech simulation model to recreate the original speech. A transformation of the speech can also be imposed by manipulating the area function and voice characteristics prior to the recreation of speech by simulation. The model has initially been developed for vowel-like speech utterances, but the effect of consonants on the F1-F2 trajectory is also briefly addressed.
10.
In this paper, we present intermediate results of continuing research into the utility of generalised hierarchical structures for the representation of musical information. We build on an abstract data type presented in Wiggins et al. (1989), using constituents, which are structurally significant groupings of musical events. We suggest that a division into such groupings can be musically meaningful, and that it can be more flexible than similar approaches. We demonstrate our representation system at work in both analysis and composition, with output from computer programs. We conclude that it is possible and useful to represent music in a way independent of the particular style, tonal system, etc., of the music itself. (The authors work in the Department of Artificial Intelligence, University of Edinburgh, Scotland.)
11.
Research on spectral subtraction speech enhancement algorithms based on MATLAB (cited by 5: 0 self-citations, 5 by others)
Spectral subtraction is an important speech enhancement algorithm: it is computationally light, amenable to fast processing, and effective. To address the residual "musical noise" problem of its classic form, various improved forms have been proposed. Taking this family of speech enhancement algorithms as its subject, this paper studies methods for eliminating "musical noise". It introduces the basic principles of the classic form of spectral subtraction and of several improved forms, and explains the concrete implementation of the algorithms in detail based on MATLAB. Using recorded speech samples, the paper presents enhancement results for the different forms of spectral subtraction, compares the effectiveness of the various methods, and summarizes practical experience; on this basis, an improved scheme is proposed for raising the signal-to-noise ratio of the processed noisy speech.
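The classic form of spectral subtraction discussed above, with the two standard tweaks against musical noise (over-subtraction and a spectral floor), can be sketched as follows. Parameter values and the non-overlapping frame scheme are illustrative assumptions, not the MATLAB implementations the paper describes:

```python
import numpy as np

def spectral_subtract(noisy, noise_mag, nfft=256, over=2.0, floor=0.01):
    """Magnitude spectral subtraction, frame by frame:
    subtract an over-estimated noise magnitude spectrum, clamp to a
    spectral floor (both reduce 'musical noise'), and resynthesize
    with the noisy phase.
    """
    out = np.zeros_like(noisy, dtype=float)
    for i in range(0, len(noisy) - nfft + 1, nfft):
        spec = np.fft.rfft(noisy[i:i + nfft])
        mag, phase = np.abs(spec), np.angle(spec)
        clean = np.maximum(mag - over * noise_mag, floor * mag)
        out[i:i + nfft] = np.fft.irfft(clean * np.exp(1j * phase), nfft)
    return out

# With a zero noise estimate the signal passes through unchanged.
noisy = np.sin(2 * np.pi * 500 * np.arange(512) / 8000.0)
noise_mag = np.zeros(nfft // 2 + 1 if (nfft := 256) else 0)
enhanced = spectral_subtract(noisy, noise_mag)
```

Real implementations use overlapping windows and estimate `noise_mag` from speech-free segments.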
12.
Microphone array speech enhancement techniques and their applications (cited by 3: 5 self-citations, 3 by others)
This paper briefly describes the principles and methods of speech enhancement using microphone arrays. Because microphone arrays offer good speech pick-up capability and noise robustness in practical speech processing, the paper also surveys applications of the technique in in-vehicle systems, robot speech recognition, conference recording in large venues, hearing aids, and sound source localization systems.
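The simplest of the microphone-array enhancement methods surveyed above is delay-and-sum beamforming: align each channel toward the source and average, so the target adds coherently while noise averages down. A minimal sketch assuming known integer sample delays:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Delay-and-sum beamformer: undo each channel's (integer) sample
    delay toward the source, then average across microphones.

    Sketch only: real arrays use fractional delays estimated from the
    source direction and microphone geometry.
    """
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)

# Two copies of the same tone, the second delayed by 3 samples:
s = np.sin(2 * np.pi * np.arange(64) / 16.0)
out = delay_and_sum([s, np.roll(s, 3)], [0, 3])
```

After alignment the averaged output equals the source signal, while uncorrelated noise on each microphone would be attenuated by the averaging.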
13.
Approximate pattern matching algorithms have become an important tool in computer assisted music analysis and information retrieval. The number of different problem formulations has greatly expanded in recent years, not least because of the subjective nature of measuring musical similarity. From an algorithmic perspective, the complexity of each problem depends crucially on the exact definition of the difference between two strings. We present an overview of advances in approximate string matching in this field focusing on new measures of approximation.
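The baseline "difference between two strings" from which the specialised musical-similarity measures depart is the classic dynamic-programming edit distance (e.g. over pitch-interval strings, with music-specific substitution weights in the variants the survey covers):

```python
def edit_distance(a, b):
    """Levenshtein edit distance via a single-row dynamic program:
    dp[j] holds the distance between a[:i] and b[:j] as rows advance.
    """
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i  # prev = dp[i-1][0]
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]
```

Musical variants typically replace the unit substitution cost `(ca != cb)` with a weight reflecting perceptual pitch or rhythm similarity.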
14.
The design and implementation of the Harbin Institute of Technology Digital Music Library (HIT-DML) is presented in this paper. Firstly, a novel framework, a music data model, and a query language are proposed as the theoretical foundation of the library. Secondly, the music computing algorithms used in the library for feature extraction and matching are described. In addition, indices are introduced both for mining themes of music objects and for accelerating content-based information retrieval. Finally, experimental results on the indices and the current state of development of the library are provided. HIT-DML is distinguished by the following points. First, it is inherently based on database systems and combines database technologies with multimedia technologies seamlessly; musical data are stored in a structured way. Second, it has a solid theoretical foundation, from framework and data model to query language. Last, it can retrieve musical information based on content, against different kinds of musical instruments. The indices also power the library.
15.
Support vector machine active learning for music retrieval (cited by 7: 0 self-citations, 7 by others)
16.
This study examined the effects of music tempo and task difficulty on the performance of multi-attribute decision-making according to two alternative perspectives: background music as an arousal inducer vs. as a distractor. An eye-tracking based experiment was conducted. Our results supported the arousal-inducer perspective: with the same decision time, participants made decisions more accurately under faster than under slower tempo music. Further, faster tempo music was found to improve the accuracy of harder decision-making only, not that of easier decision-making. More interestingly, our exploratory analysis of eye fixations revealed adaptive behavior: participants' search patterns became more intra-dimensional under faster tempo music than under slower tempo music.
17.
Jan Kleindienst Tom Macek Ladislav Serdi Jan Šedivý 《Image and vision computing》2007,25(12):1836-1847
In this article we describe an interaction framework that uses speech recognition and computer vision to model a new generation of interfaces in the residential environment. We outline the blueprints of the architecture and describe the main building blocks. We show a concrete prototype platform on which this novel architecture has been deployed and tested in user field trials. The work is co-funded by the EC as part of the “HomeTalk” IST-2001-33507 project.
18.
Groupware and collaborative tools have been proposed to support cooperative work, but they suffer from some rather severe limitations. Multi-agent systems can be proposed as an alternative to improve the situation; in that case, the user normally interacts with the system through a special agent called a personal assistant. In this paper, we describe the design of an ontology-based speech interface for personal assistants applied in the context of cooperative projects. We believe that this type of interface will improve the quality of assistance and increase collaboration between project members. We present the interface and its insertion into a multi-agent system designed for research and development projects, and describe the design of the interface, highlighting the role of ontologies in semantic interpretation. As a result of this conversational speech interface, we expect an increase in the quality of assistance and a reduction in the time needed to answer users' requests.
19.
In this paper, we propose to use a variational Bayesian (VB) method to learn the clean speech signal directly from the noisy observation. It models the probability distribution of the clean signal using a Gaussian mixture model (GMM) and minimizes the misfit between the true probability distributions of the hidden variables and model parameters and their approximate distributions. Experimental results demonstrate that the proposed algorithm outperforms several other methods.
20.
《Computer Speech and Language》2014,28(5):1209-1232
This paper presents a study on the importance of short-term speech parameterizations for expressive statistical parametric synthesis. Assuming a source-filter model of speech production, the analysis is conducted over spectral parameters, here defined as features which represent a minimum-phase synthesis filter, and some excitation parameters, which are features used to construct a signal that is fed to the minimum-phase synthesis filter to generate speech. In the first part, different spectral and excitation parameters that are applicable to statistical parametric synthesis are tested to determine which ones are the most emotion dependent. The analysis is performed through two methods proposed to measure the relative emotion dependency of each feature: one based on K-means clustering, and another based on Gaussian mixture modeling for emotion identification. Two commonly used forms of parameters for the short-term speech spectral envelope, the Mel cepstrum and the Mel line spectrum pairs are utilized. As excitation parameters, the anti-causal cepstrum, the time-smoothed group delay, and band-aperiodicity coefficients are considered. According to the analysis, the line spectral pairs are the most emotion dependent parameters. Among the excitation features, the band-aperiodicity coefficients present the highest correlation with the speaker's emotion. The most emotion dependent parameters according to this analysis were selected to train an expressive statistical parametric synthesizer using a speaker and language factorization framework. Subjective test results indicate that the considered spectral parameters have a bigger impact on the synthesized speech emotion when compared with the excitation ones.