Query returned 20 similar documents (search time: 15 ms).
1.
A computational auditory scene analysis system for speech segregation and robust speech recognition
Yang Shao, Soundararajan Srinivasan, Zhaozhang Jin, DeLiang Wang 《Computer Speech and Language》2010,24(1):77-93
A conventional automatic speech recognizer does not perform well in the presence of multiple sound sources, while human listeners are able to segregate and recognize a signal of interest through auditory scene analysis. We present a computational auditory scene analysis system for separating and recognizing target speech in the presence of competing speech or noise. We estimate, in two stages, the ideal binary time–frequency (T–F) mask which retains the mixture in a local T–F unit if and only if the target is stronger than the interference within the unit. In the first stage, we use harmonicity to segregate the voiced portions of individual sources in each time frame based on multipitch tracking. Additionally, unvoiced portions are segmented based on an onset/offset analysis. In the second stage, speaker characteristics are used to group the T–F units across time frames. The resulting masks are used in an uncertainty decoding framework for automatic speech recognition. We evaluate our system on a speech separation challenge and show that our system yields substantial improvement over the baseline performance.
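The ideal binary mask defined in the abstract above reduces to a per-unit energy comparison. A minimal sketch in Python, with made-up target and interference energy grids (not data from the paper):

```python
# Hedged sketch of the ideal binary mask: a T-F unit is retained (mask = 1)
# iff the local target energy exceeds the local interference energy.
# The toy energy grids below are invented illustrative numbers.

def ideal_binary_mask(target_energy, interference_energy):
    """Return a 0/1 mask over a time-frequency grid (lists of lists)."""
    return [
        [1 if t > i else 0 for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(target_energy, interference_energy)
    ]

target = [[3.0, 0.5], [1.2, 2.0]]        # target energy per T-F unit
interference = [[1.0, 1.0], [1.5, 0.2]]  # competing-source energy

mask = ideal_binary_mask(target, interference)
print(mask)  # [[1, 0], [0, 1]]
```

In practice the two energy grids come from a cochleagram-style T–F decomposition of the premixed signals; the ideal mask is an oracle that a real system, such as the two-stage one described here, can only estimate.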
2.
The paper deals with the divergence of the information filter in a multi-sensor target tracking problem using bearing-only measurements. The information filter has a number of advantages in terms of computational requirements over the Kalman filter for target tracking applications. Compared to the Kalman filter, it also has the advantage that estimation can start even without an initial estimate. However, this filter is seen to diverge after tracking for a short period of time, even when the target is moving at a constant velocity. A technique to overcome this problem is discussed in this paper. The information update equations of the conventional information filter are modified in terms of a fuzzy function of the error and the change of error, and the results are encouraging. The efficacy of the technique in preventing divergence is also demonstrated in the context of tracking a maneuvering target.
3.
4.
Application of artificial neural networks to sensor data fusion
To address the cross-sensitivity of pressure sensors to temperature, a BP (back-propagation) artificial neural network is applied to fuse the sensor data. Eliminating the influence of temperature on the pressure sensor greatly improves the sensor's stability and accuracy, with good results.
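The fusion structure described above can be sketched as a small feed-forward network that maps the raw pressure reading and the temperature to a compensated pressure. The network shape and all weights below are hypothetical placeholders, not trained values from the paper:

```python
import math

# Sketch of a BP-style network used for sensor data fusion: inputs are the
# raw pressure reading and the temperature; the output is a
# temperature-compensated pressure. Weights here are invented, not trained.

def mlp(inputs, w_hidden, w_out):
    """Forward pass of a one-hidden-layer tanh network with bias terms."""
    hidden = [math.tanh(sum(w * x for w, x in zip(ws, inputs + [1.0])))
              for ws in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden + [1.0]))

w_hidden = [[0.8, -0.2, 0.0], [0.1, 0.5, -0.3]]  # 2 hidden units, +bias
w_out = [1.2, -0.4, 0.05]                        # output layer, +bias
raw_pressure, temperature = 0.6, 0.3             # normalised toy readings
print(mlp([raw_pressure, temperature], w_hidden, w_out))
```

In the paper's setting the weights would be fitted by back-propagation on calibration data spanning the sensor's temperature range; the sketch only shows the input/output structure of the fusion.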
5.
To address the problem that the distance between the observation platform and a moving target affects a sensor's random measurement error, a measurement-fusion algorithm for active and passive sensors based on a fuzzy distance threshold is proposed. A method for selecting the active/passive fusion tracking mode according to the distance parameter is discussed; using an exponential function and fuzzy processing, the weights of the active and passive sensors in the measurement-fusion process are adjusted in real time from the available information. Simulation results show that when the effect of sensor-to-target distance on random measurement error cannot be ignored, the proposed variable-weight fusion algorithm is more stable than the traditional fixed-weight fusion algorithm and fully exploits the complementary characteristics of active and passive sensors.
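The variable-weight idea can be sketched with an exponential weighting function of distance. The threshold and decay constants below are invented for illustration; the paper's fuzzy processing would replace the hard threshold with membership functions:

```python
import math

def active_weight(distance, threshold=1000.0, decay=500.0):
    """Weight given to the active sensor's measurement; the passive sensor
    receives the complement. Below the (hypothetical) distance threshold
    the active sensor dominates; beyond it the weight decays exponentially
    as range-dependent measurement error grows."""
    if distance <= threshold:
        return 1.0
    return math.exp(-(distance - threshold) / decay)

def fuse(active_meas, passive_meas, distance):
    """Distance-dependent convex combination of the two measurements."""
    w = active_weight(distance)
    return w * active_meas + (1.0 - w) * passive_meas

print(fuse(10.0, 12.0, 500.0))  # below threshold: active sensor alone -> 10.0
```

A fixed-weight scheme would freeze `w`; the sketch shows why a distance-driven `w` lets the passive sensor take over smoothly at long range.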
6.
《Behaviour & Information Technology》2012,31(6):845-851
The Internet has rarely been used in auditory perception studies due to concerns about standardisation and calibration across different systems and settings. However, not all auditory research is based on the investigation of fine-grained differences in auditory thresholds. Where meaningful ‘real-world’ listening, for instance the perception of speech, is concerned, the Internet may be a more appropriate and ecologically valid setting to collect data. This study compared affective ratings of low-pass-filtered infant-, foreigner- and British adult-directed speech obtained with traditional methods in the laboratory, with those obtained from an Internet sample. Dropout rates and demographic distribution of participants in the Internet condition were also assessed. The results show that affective ratings were similar for both the Internet and laboratory samples. These findings indicate the viability of Internet-based research into affective speech perception and suggest that precise acoustic environmental control may not always be necessary.
7.
Bruce E. Balentine, Colin M. Ayer, Clint L. Miller, Brian L. Scott 《International Journal of Speech Technology》1997,2(1):7-19
When human beings converse, they alternate between talking and listening. Participating in such turn-taking behaviors is more difficult for machines that use speech recognition to listen and speech output to talk. This paper describes an algorithm for managing such turn-taking through the use of a sliding capture window. The device is specific to discrete speech recognition technologies that do not have access to echo cancellation. As such, it addresses those inexpensive applications that suffer the most from turn-taking errors, providing a "speech button" that stabilizes the interface. Correcting for short-lived turn-taking errors can be thought of as "debouncing" the button. An informal study based on ten subjects using a voice dialing application illuminates the design.
8.
To help more people learn about and use minority-language speech products, effectively overcome the language barrier between China's ethnic-minority regions and other areas, and promote communication between ethnic groups, this survey takes domestic speech products for Uyghur, Mongolian, and Tibetan based on speech recognition technology as its subject and reviews their development and application. The products developed so far fall mainly into three categories: speech input methods, speech translation software, and transcription products. On this basis, the impact of using these products is analysed and the prospects for related speech products are discussed.
9.
Most work on multi-biometric fusion is based on static fusion rules. One prominent limitation of static fusion is that it cannot respond to the changes of the environment or the individual users. This paper proposes context-aware multi-biometric fusion, which can dynamically adapt the fusion rules to the real-time context. As a typical application, the context-aware fusion of gait and face for human identification in video is investigated. Two significant context factors that may affect the relationship between gait and face in the fusion are considered, i.e., view angle and subject-to-camera distance. Fusion methods adaptable to these two factors based on either prior knowledge or machine learning are proposed and tested. Experimental results show that the context-aware fusion methods perform significantly better than not only the individual biometric traits, but also those widely adopted static fusion rules including SUM, PRODUCT, MIN, and MAX. Moreover, context-aware fusion based on machine learning shows superiority over that based on prior knowledge.
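For reference, the static score-fusion rules the paper compares against (SUM, PRODUCT, MIN, MAX) can be sketched directly. The gait and face match scores below are made up:

```python
# Illustrative static score-fusion rules over per-trait match scores in [0, 1].
# These are the fixed baselines that context-aware fusion adapts away from.

def fuse_scores(scores, rule):
    if rule == "SUM":        # mean of the scores
        return sum(scores) / len(scores)
    if rule == "PRODUCT":    # product of the scores
        out = 1.0
        for s in scores:
            out *= s
        return out
    if rule == "MIN":        # most pessimistic trait wins
        return min(scores)
    if rule == "MAX":        # most optimistic trait wins
        return max(scores)
    raise ValueError(rule)

gait_score, face_score = 0.6, 0.9   # hypothetical match scores
for rule in ("SUM", "PRODUCT", "MIN", "MAX"):
    print(rule, fuse_scores([gait_score, face_score], rule))
```

A context-aware scheme in the paper's spirit would instead weight the two scores as a function of view angle and subject-to-camera distance rather than applying one fixed rule.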
10.
This paper proposes a noise-robust speech feature. The power spectrum of the speech signal is first passed through a bank of band-pass filters, and the differences between the filter outputs are then computed. Theoretical analysis and experiments both show that using this as the speech feature substantially improves the performance of a speech recognition system in noisy environments.
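A toy version of the proposed feature, assuming a power spectrum already computed and rectangular filter bands (both invented here): differencing adjacent band energies cancels an additive noise floor shared by neighbouring filters, which is the intuition behind the feature's noise robustness:

```python
# Sketch of the band-pass + differential feature. Filter bands are modelled
# as simple bin ranges over a toy power spectrum; real filters would be
# smooth (e.g. triangular or mel-spaced) rather than rectangular.

def filterbank_energies(power_spectrum, bands):
    """Sum spectral power inside each (lo, hi) bin range."""
    return [sum(power_spectrum[lo:hi]) for lo, hi in bands]

def differential_features(energies):
    """Difference between adjacent filter outputs; a noise level common
    to neighbouring bands largely cancels in the subtraction."""
    return [b - a for a, b in zip(energies, energies[1:])]

spectrum = [1.0, 2.0, 4.0, 3.0, 1.0, 0.5]   # made-up power spectrum bins
bands = [(0, 2), (2, 4), (4, 6)]            # made-up filter bin ranges
feats = differential_features(filterbank_energies(spectrum, bands))
print(feats)  # [4.0, -5.5]
```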
11.
Harouna Kabré 《International Journal of Speech Technology》1997,2(2):133-143
This paper describes speech recognition software called ECHO (Environnement de Communication Homme-Ordinateur) which is devoted to the design of usable interactive speech-based applications. Usability refers to the stability of the Speech Recognition system performance under ordinary usage conditions (i.e., different acoustic environments) rather than high performance in a limited set of well-known conditions. In order to reach this objective, the system must be able to anticipate any change in the environment and adapt itself to the different usage conditions. One solution for this problem has been achieved by connecting some specialized modules of speech signal pre-processing and post-processing to the automatic speech recognition engine. These different modules interact as a mirror with the speech engine in order to force it to adapt its runtime parameters, thus improving its performance and its robustness.
12.
To give an automated guided vehicle (AGV) system sufficient flexibility, real-time digital image processing and coordinated sensor control are applied to the system design. Image information is used to recognise the travel path, while sensor information controls speed and start/stop actions; fusing the two realises flexible automatic guidance. Experiments show that a flexible self-guided vehicle based on this design is wireless and automated, and has considerable practical value in modern logistics.
13.
Visual perception is typically performed in the context of a task or goal. Nonetheless, visual processing has traditionally been conceptualized in terms of a fixed, task-independent hierarchy of feature detectors. We explore the computational implications of allowing early visual processing to be task modulated. Using artificial neural networks, we show that significant improvements in task accuracy can be obtained by allowing the weights to be modulated by task. The primary benefits are obtained under resource-limited processing. A relatively modest task-based modulation of weights and activities can lead to a large performance boost, suggesting an efficient means of increasing effective cortical capacity.
14.
Brad H. Story, Ingo R. Titze, Darrell Wong 《Engineering Applications of Artificial Intelligence》1997,10(6):593-601
This paper explores a model that reduces speech production to the specification of four time-varying parameters: the first two formant frequencies (F1 and F2), the voice fundamental frequency (F0), and a relative amplitude of the voice. The trajectory of the first two formants, F1 and F2, is treated as a series of coordinate pairs that are mapped from the F1F2 plane into a two-dimensional plane of coefficients. These coefficients are multipliers of two empirically-based orthogonal basis vectors which, when added to a neutral vowel area function, will produce a new area function with the desired locations of F1 and F2. Thus, area functions and voice parameters extracted at appropriate time intervals can be fed into a speech simulation model to recreate the original speech. A transformation of the speech can also be imposed by manipulating the area function and voice characteristics prior to the recreation of speech by simulation. The model has initially been developed for vowel-like speech utterances, but the effect of consonants on the F1F2 trajectory is also briefly addressed.
15.
16.
《Advanced Engineering Informatics》2014,28(1):102-110
Dysarthria is a neurological impairment of control of the motor speech articulators that compromises the speech signal. Automatic Speech Recognition (ASR) can be very helpful for speakers with dysarthria because the disabled persons are often physically incapacitated. Mel-Frequency Cepstral Coefficients (MFCCs) have been proven to be an appropriate representation of dysarthric speech, but the question of which MFCC-based feature set represents dysarthric acoustic features most effectively has not been answered. Moreover, most current dysarthric speech recognisers are either speaker-dependent (SD) or speaker-adaptive (SA), and they generalise poorly as speaker-independent (SI) models. First, by comparing the results of 28 dysarthric SD speech recognisers, this study identifies the best-performing set of MFCC parameters for representing dysarthric acoustic features in Artificial Neural Network (ANN)-based ASR. Next, the paper studies the application of ANNs as a fixed-length isolated-word SI ASR for individuals who suffer from dysarthria. The results show that recognisers trained on the conventional 12-coefficient MFCC features, without delta and acceleration features, provided the best accuracy, and the proposed SI ASR recognised the speech of unforeseen dysarthric evaluation subjects with a word recognition rate of 68.38%.
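For context, the delta (first-order regression) coefficients that the study found unnecessary are conventionally computed from a sliding regression window over the static MFCC frames. The window size and the toy frames below are illustrative, not the study's configuration:

```python
# Sketch of standard delta-coefficient computation over MFCC frames.
# frames: list of T frames, each a list of static coefficients.
# n: regression half-window (n = 2 is a common convention).

def delta(frames, n=2):
    denom = 2 * sum(k * k for k in range(1, n + 1))
    T = len(frames)
    out = []
    for t in range(T):
        vec = []
        for d in range(len(frames[0])):
            # Edge frames are handled by clamping indices to the sequence.
            num = sum(k * (frames[min(t + k, T - 1)][d] -
                           frames[max(t - k, 0)][d])
                      for k in range(1, n + 1))
            vec.append(num / denom)
        out.append(vec)
    return out

# A linearly rising 1-D "coefficient" yields a delta of 1.0 in the interior.
print(delta([[0.0], [1.0], [2.0], [3.0], [4.0]])[2])  # [1.0]
```

Dropping this stage, as the best system in the study does, simply means feeding the 12 static coefficients to the ANN directly.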
17.
This communication discusses how automatic speech recognition (ASR) can support universal access to communication and learning through the cost-effective production of text synchronised with speech, and describes achievements and planned developments of the Liberated Learning Consortium to: support preferred learning and teaching styles; assist those who for cognitive, physical or sensory reasons find notetaking difficult; assist learners to manage and search online digital multimedia resources; provide automatic captioning of speech for deaf learners or when speech is not available or suitable; assist blind, visually impaired or dyslexic people to read and search material; and assist speakers to improve their communication skills.
18.
B. Mann, P. Newhouse, J. Pagram, A. Campbell, H. Schulz 《Journal of Computer Assisted Learning》2002,18(3):296-308
This research focused on the prediction that children in their school setting would learn more from educational multimedia when critical information was presented as spoken instead of textual cues. Analyses of a study (n = 42) showed that 12-year-olds did not learn any more from temporal speech cueing than from temporal text cueing. The findings suggest that multimedia learning for children is a different kind of learning experience than it is for adults or older adolescents. The results indicate underdeveloped executive control of the referential connections in the children's working memory between reading screen text while listening to spoken cues, and between watching on-screen animations play while listening to spoken cues. Further study is warranted. Implications may be derived for educational multimedia research in school settings.
19.
We are addressing the novel problem of jointly evaluating multiple speech patterns for automatic speech recognition and training. We propose solutions based on both the non-parametric dynamic time warping (DTW) algorithm, and the parametric hidden Markov model (HMM). We show that a hybrid approach is quite effective for the application of noisy speech recognition. We extend the concept to HMM training wherein some patterns may be noisy or distorted. Utilizing the concept of “virtual pattern” developed for joint evaluation, we propose selective iterative training of HMMs. Evaluating these algorithms for burst/transient noisy speech and isolated word recognition, significant improvement in recognition accuracy is obtained using the new algorithms over those which do not utilize the joint evaluation strategy.
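The non-parametric half of the hybrid above, dynamic time warping, can be sketched compactly for 1-D feature sequences. The toy inputs are invented, not the paper's patterns:

```python
# Compact DTW distance between two sequences using the standard
# dynamic-programming recurrence with absolute-difference local cost.

def dtw(a, b):
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of: insertion, deletion, match along the warping path.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: warping absorbs the repeated 2
```

Joint evaluation of multiple patterns, as proposed in the paper, generalises this pairwise alignment; real recognisers would align multidimensional feature vectors (e.g. MFCC frames) with a vector distance in place of `abs`.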
20.
《Ergonomics》2012,55(9):1841-1850
Very little is known about the magnitudes and sources of errors associated with the visual estimation of postural classification displayed on TV screens. This study was conducted to address this issue. Sixty-three subjects participated in the experiments. The findings indicate that: (1) subjects found it difficult to evaluate upper extremity postures (particularly the elbow and the wrist), while the postures around the lower back were the easiest to evaluate; (2) the lower extremity positions affected the ability of the subjects to accurately classify postures around the wrist, elbow, shoulder, neck, and lower back, with the estimates being > 70% for sitting and > 60% for standing (except for the elbow); and (3) in general, flexion and extension are easier to evaluate than neutral and non-neutral postures.