Similar Documents
20 similar documents found (search time: 31 ms)
1.
This article presents an overview of different approaches for providing automatic speech recognition (ASR) technology to mobile users. Three principal system architectures with respect to the employment of a wireless communication link are analyzed: Embedded Speech Recognition Systems, Network Speech Recognition (NSR) and Distributed Speech Recognition (DSR). An overview of the solutions standardized to date is given, together with a critical analysis of the latest developments in the field of speech recognition in mobile environments. Open issues, pros and cons of the different methodologies and techniques are highlighted. Special emphasis is placed on the constraints and limitations that ASR applications are confronted with under the different architectures.

2.
When an Automatic Speech Recognition (ASR) system is applied in noisy environments, Voice Activity Detection (VAD) is crucial to the performance of the overall system. Employing VAD for ASR on embedded mobile systems minimizes physical distractions and makes the system convenient to use. Conventional VAD algorithms are either too complex, which makes them unsuitable for embedded mobile devices, or too fragile, which holds back their application in noisy mobile environments. In this paper, we propose a robust VAD algorithm specifically designed for ASR on embedded mobile devices. The architecture of the proposed algorithm is based on a two-level decision-making strategy, in which a lower, feature-based level interacts with a subsequent decision logic based on a finite-state machine. Several discriminating features are employed at the lower level to improve the robustness of the VAD. The two-level decision strategy allows different features to be used in different states and reduces the cost of the algorithm, which makes it suitable for embedded mobile devices. Evaluation experiments show that the proposed VAD algorithm is robust and contributes to the overall performance gain of the ASR system in various acoustic environments.
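The two-level architecture described above can be sketched in a minimal form. The features (frame energy, zero-crossing rate), thresholds, state names and hangover length below are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

def frame_features(frame):
    """Lower level: cheap discriminating features for one frame."""
    energy = float(np.mean(frame ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)
    return energy, zcr

def vad(frames, e_on=0.01, e_off=0.005, hangover=3):
    """Upper level: finite-state decision logic over the frame features.

    Different features are consulted in different states: an energy (or
    high-ZCR unvoiced) onset test in `silence`, an energy offset test in
    `speech`, and a hangover counter that bridges short pauses."""
    state, tail, decisions = "silence", 0, []
    for frame in frames:
        energy, zcr = frame_features(frame)
        if state == "silence":
            if energy > e_on or (zcr > 0.4 and energy > e_off):
                state = "speech"
        elif state == "speech":
            if energy < e_off:
                state, tail = "hangover", hangover
        else:  # hangover: wait a few frames before declaring silence
            if energy > e_on:
                state = "speech"
            else:
                tail -= 1
                if tail == 0:
                    state = "silence"
        decisions.append(state != "silence")
    return decisions

# Hypothetical input: 10 quiet frames, 10 tone frames, 10 quiet frames.
rng = np.random.default_rng(0)
quiet = [rng.normal(0.0, 1e-3, 160) for _ in range(10)]
tone = [0.5 * np.sin(2 * np.pi * 200 * np.arange(160) / 8000) for _ in range(10)]
decisions = vad(quiet + tone + quiet)
```

The hangover state is what keeps the detector from clipping weak word endings, at the cost of a few extra frames passed to the recognizer.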

3.

Speech recognition is a fascinating process that offers the opportunity to interact with and command machines in the field of human-computer interaction. A speech recognition system is language-dependent, constructed directly on the linguistic and textual properties of the target language. Automatic speech recognition (ASR) systems can now transcribe speech to text with high accuracy, and while ASR is well established for major international languages, its implementation for the Bengali language has not yet reached an acceptable state. In this research work, we survey the current status of research on Bengali ASR systems. We then present the challenges most commonly encountered while constructing a Bengali ASR system, split them into language-dependent and language-independent challenges, and provide guidance on how each may be overcome. Following this rigorous investigation, we conclude that Bengali ASR systems require ASR architectures constructed specifically around the grammatical and phonetic structure of the Bengali language.


4.
Automatic speech recognition (ASR) systems suffer from variation in the acoustic quality of input speech: speech may be produced in noisy environments, different speakers have their own speaking styles, and variation can be observed even within the same utterance or from the same speaker in different moods. All of these uncertainties and variations should be normalized to obtain a robust ASR system. In this paper, we apply and evaluate different approaches to acoustic quality normalization within an utterance for robust ASR. Several HMM (hidden Markov model)-based systems using utterance-level, word-level, and monophone-level normalization are evaluated against an HMM-SM (subspace method)-based system using monophone-level normalization. The SM can represent variations of fine structures in sub-words as a set of eigenvectors, and so performs better than the HMM at the monophone level. Experimental results show that word accuracy is significantly improved by the HMM-SM-based system with monophone-level normalization compared with the typical HMM-based system with utterance-level normalization, in both clean and noisy conditions. The results also suggest that monophone-level normalization using the SM outperforms that using the HMM.
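The subspace-method idea of representing a class as a set of eigenvectors can be sketched as a CLAFIC-style classifier. The feature dimension, class directions and function names below are invented for illustration, not taken from the paper:

```python
import numpy as np

def build_subspace(vectors, dim):
    """Class subspace: top `dim` left singular vectors of the training
    feature vectors, i.e. their principal variation directions."""
    u, _, _ = np.linalg.svd(np.asarray(vectors).T, full_matrices=False)
    return u[:, :dim]

def similarity(x, basis):
    """Subspace-method score: fraction of ||x||^2 captured by the subspace."""
    proj = basis.T @ x
    return float(proj @ proj) / float(x @ x)

def classify(x, subspaces):
    """Assign x to the class whose subspace explains it best."""
    return max(subspaces, key=lambda name: similarity(x, subspaces[name]))

# Two hypothetical monophone classes lying along different directions.
rng = np.random.default_rng(1)
cls_a = [s * np.array([1.0, 0.0, 0.0]) + rng.normal(0, 0.01, 3)
         for s in rng.uniform(0.5, 1.5, 50)]
cls_b = [s * np.array([0.0, 1.0, 0.0]) + rng.normal(0, 0.01, 3)
         for s in rng.uniform(0.5, 1.5, 50)]
subspaces = {"a": build_subspace(cls_a, 1), "b": build_subspace(cls_b, 1)}
```

Because the score is scale-invariant, gain differences between utterances are normalized away, which is one intuition for why the subspace representation helps at the monophone level.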

5.
This paper deals with a speaker-independent Automatic Speech Recognition (ASR) system for continuous speech. The system has been developed for Modern Standard Arabic (MSA) using recordings of six regions taken from the ALGerian Arabic Speech Database (ALGASD) and has been designed using Hidden Markov Models. The main purpose of this study is to investigate the effect of regional accent on speech recognition rates. The experiment first assessed the general performance of the model on the speech data of the six regions; the recognition results were then analysed in detail to observe how ASR performance deteriorates with the regional variation present in the speech material. The results show that ASR performance is clearly impacted by the regional accents of the speakers.

6.
Speech text entry can be problematic even under ideal dictation conditions, but difficulties are magnified when external conditions deteriorate. Motion during speech is one such condition that can have detrimental effects on automatic speech recognition. This research examined speech text entry while mobile. Speech enrollment profiles were created by participants in both a seated and a walking environment, and dictation tasks were completed in both conditions. Although results from an earlier study suggested that completing the enrollment process under more challenging conditions may lead to improved recognition accuracy under both challenging and less challenging conditions, the current study provided contradictory results. A detailed review of error rates confirmed that some participants minimized errors by enrolling under more challenging conditions, while others benefited from enrolling under less challenging conditions. Still others minimized errors when the enrollment model from the opposing condition was used. Leveraging these insights, we developed a decision model to minimize recognition error rates regardless of the conditions experienced while completing dictation tasks. When the model was applied to the existing data, error rates were reduced significantly, but additional research is necessary to validate the proposed solution.
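The decision model itself is not specified in the abstract. One minimal reading, sketched here purely as an assumption, is a per-user lookup that picks the enrollment profile with the lowest observed error rate for each dictation condition (the table values and condition names are hypothetical):

```python
def best_profile(error_rates):
    """For each dictation condition, pick the enrollment condition whose
    observed word error rate (WER) is lowest for this user."""
    choice = {}
    for (enroll, dictate), wer in error_rates.items():
        if dictate not in choice or wer < error_rates[(choice[dictate], dictate)]:
            choice[dictate] = enroll
    return choice

# Hypothetical per-user table: (enrollment condition, dictation condition) -> WER.
user_wer = {
    ("seated", "seated"): 0.08,
    ("walking", "seated"): 0.06,   # cross-condition model wins while seated
    ("seated", "walking"): 0.12,
    ("walking", "walking"): 0.09,
}
profiles = best_profile(user_wer)
```

This matches the observation above that the best enrollment condition differs per user and per dictation condition, so the selection must be data-driven rather than fixed.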

7.
Speech recognition has a number of potential advantages over traditional manual controls for the operation of in-car and other mobile devices. Two laboratory experiments aimed to test these proposed benefits and to optimise the design of future speech interfaces. Participants carried out tasks with a phone or in-car entertainment system while engaged in a concurrent driving task. Speech input reduced the adverse effects of system operation on driving performance, but manual control led to faster transaction times and improved task accuracy. Explicit feedback of the recognition results was found to be necessary, with audio-only feedback leading to better task performance than combined audio-plus-visual feedback. It is recommended that speech technology be incorporated into the user interface as a redundant alternative to manual operation. However, the importance of good human factors in the design of speech dialogues is emphasised.

8.
Dysarthria is a neurological impairment of control of the motor speech articulators that compromises the speech signal. Automatic Speech Recognition (ASR) can be very helpful for speakers with dysarthria because they are often also physically incapacitated. Mel-Frequency Cepstral Coefficients (MFCCs) have proven to be an appropriate representation of dysarthric speech, but the question of which MFCC-based feature set represents dysarthric acoustic features most effectively has not been answered. Moreover, most current dysarthric speech recognisers are either speaker-dependent (SD) or speaker-adaptive (SA), and they generalise poorly as speaker-independent (SI) models. First, by comparing the results of 28 dysarthric SD speech recognisers, this study identifies the best-performing set of MFCC parameters for representing dysarthric acoustic features in Artificial Neural Network (ANN)-based ASR. Next, the paper studies the application of ANNs as a fixed-length isolated-word SI ASR for individuals who suffer from dysarthria. The results show that recognisers trained on the conventional 12 MFCC coefficients, without delta and acceleration features, provided the best accuracy, and the proposed SI ASR recognised the speech of unseen dysarthric evaluation subjects with a word recognition rate of 68.38%.
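The winning feature set (12 static MFCCs with no delta or acceleration features) can be computed roughly as follows. This is a compact textbook-style MFCC pipeline, not the authors' exact front end; the frame length, hop size and filter count are assumptions:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=12):
    """12 static MFCCs per frame (no deltas/accelerations)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    fb = mel_filterbank(n_filters, frame_len, sr)
    feats = []
    for t in range(n_frames):
        frame = signal[t * hop: t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2 / frame_len
        logmel = np.log(fb @ power + 1e-10)
        # DCT-II decorrelates the log filterbank; keep c1..c12 (drop c0).
        n = np.arange(n_filters)
        ceps = np.array([np.sum(logmel * np.cos(np.pi * k * (n + 0.5) / n_filters))
                         for k in range(1, n_ceps + 1)])
        feats.append(ceps)
    return np.array(feats)

# Hypothetical one-second 16 kHz signal.
signal = np.sin(2 * np.pi * 300 * np.arange(16000) / 16000)
feats = mfcc(signal)
```

Each row of `feats` would be one 12-dimensional observation vector fed to the ANN recogniser.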

9.
Users often have tasks that can be accomplished with the aid of multiple media, for example text, sound and pictures: communicating an urban navigation route, for instance, can be expressed with pictures and text. Today's mobile devices have multimedia capabilities; cell phones have cameras, displays, sound output, and (soon) speech recognition. Potentially, these capabilities can be used for multimedia-intensive tasks, but two things stand in the way. First, recognition of visual input and speech recognition remain unreliable. Second, the mechanics of integrating multiple media and recognition systems remain daunting for users. We address both issues in MARCO, a multimodal agent for route construction. MARCO collects route information by taking pictures of landmarks, accompanied by verbal directions. We combine results from off-the-shelf speech recognition and optical character recognition to achieve better recognition of route landmarks than either recognition system alone. MARCO automatically produces an illustrated, step-by-step guide to the route.

10.
Automatic Speech Recognition (ASR) is the process of mapping an acoustic speech signal into a human-readable text format. Traditional systems model the acoustic component of ASR using the Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) approach. Deep Neural Networks (DNNs) open up new possibilities for overcoming the shortcomings of such conventional statistical algorithms, and recent studies have modeled the acoustic component using DNNs in the so-called hybrid DNN-HMM approach. Among the activation functions used to model non-linearity in DNNs, Rectified Linear Units (ReLU) and maxout units are the most widely used in ASR systems. This paper concentrates on the acoustic component of a hybrid DNN-HMM system and proposes an efficient activation function for the DNN. Inspired by previous work, a Euclidean-norm activation function is proposed to model the non-linearity of the network. This non-linearity is shown to belong to the family of Piecewise Linear (PWL) functions, which have distinct features and can capture deep hierarchical features of the pattern. The relevance of the proposal is examined in depth both theoretically and experimentally. The performance of the developed ASR system is evaluated in terms of Phone Error Rate (PER) on the TIMIT database, and experimental results show a relative performance improvement over conventional activation functions.
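The abstract does not give the exact form of the Euclidean-norm activation. One plausible reading, sketched here purely as an assumption, is L2-norm pooling over disjoint groups of pre-activations, analogous to maxout pooling; with group size 1 it reduces to |x|, which is piecewise linear:

```python
import numpy as np

def norm_activation(z, group_size=2):
    """L2-norm pooling over disjoint groups of pre-activations."""
    z = z.reshape(z.shape[0], -1, group_size)
    return np.sqrt(np.sum(z ** 2, axis=-1))

def maxout(z, group_size=2):
    """Maxout baseline over the same groups, for comparison."""
    return z.reshape(z.shape[0], -1, group_size).max(axis=-1)

pre = np.array([[3.0, 4.0, -1.0, 0.0]])
norm_out = norm_activation(pre)   # groups (3, 4) and (-1, 0)
max_out = maxout(pre)
```

Like maxout, such group pooling halves the layer width per group while letting the network learn the shape of its own non-linearity.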

11.
Current speech recognition models achieve strong results for phonographic writing systems such as English and French. Chinese, however, is a typical ideographic language: Chinese characters have no direct correspondence with pronunciation, although pinyin, the phonetic notation for character readings, is intrinsically convertible to and from characters. Using pinyin as a constraint during decoding in Chinese speech recognition therefore introduces an inductive bias closer to the speech signal. Based on a multi-task learning framework, this paper proposes a pinyin-constrained joint learning method for Chinese speech recognition, with end-to-end character-level speech recognition as the main task and pinyin-level speech recognition as an auxiliary task. By sharing the encoder and using both character and pinyin recognition results as supervision signals, the encoder's ability to represent Chinese speech is strengthened. Experimental results show that the proposed method outperforms the baseline model, reducing the word error rate by 2.24%.
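The shared-encoder multi-task objective can be sketched as follows. A framewise cross-entropy stands in for the sequence losses of a real end-to-end system, and all shapes, weights and the auxiliary-loss weight `lam` are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, targets):
    return -float(np.log(probs[np.arange(len(targets)), targets] + 1e-12).mean())

def joint_loss(feats, W_enc, W_char, W_pinyin, y_char, y_pinyin, lam=0.3):
    """Shared-encoder multi-task loss: character recognition is the main
    task, pinyin recognition the auxiliary task; both gradients flow
    back into the shared encoder weights W_enc."""
    h = np.tanh(feats @ W_enc)                       # shared acoustic encoding
    loss_char = cross_entropy(softmax(h @ W_char), y_char)
    loss_pinyin = cross_entropy(softmax(h @ W_pinyin), y_pinyin)
    return loss_char + lam * loss_pinyin

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 20))                     # 5 frames of features
W_enc, W_char, W_pinyin = (rng.normal(size=s) * 0.1
                           for s in [(20, 32), (32, 100), (32, 60)])
loss = joint_loss(feats, W_enc, W_char, W_pinyin,
                  rng.integers(0, 100, 5), rng.integers(0, 60, 5))
```

The pinyin head is discarded at inference time; its only role is to regularise the encoder toward pronunciation-aware representations during training.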

12.
Many recent studies show that Augmented Reality (AR) and Automatic Speech Recognition (ASR) technologies can be used to help people with disabilities, though most of these studies have been confined to their own specialized fields. Audio-Visual Speech Recognition (AVSR) is an advance in ASR technology that combines audio, video, and facial expressions to capture a narrator's voice. In this paper, we combine AR and AVSR technologies in a new system to help deaf and hard-of-hearing people. The proposed system captures a narrator's speech instantly, converts it into readable text, and shows the text directly on an AR display, so deaf users can read the narrator's speech easily and others do not need to learn sign language to communicate with them. The evaluation results show that the system has a lower word error rate than audio-only ASR and visual-only speech recognition (VSR) under different noisy conditions, and that the AVSR techniques improve recognition accuracy in noisy places. Furthermore, a survey conducted with 100 deaf people shows that more than 80% of them are very interested in using the system on portable devices as an assistant for communicating with other people.

13.
This paper proposes a real-time lip reading system (consisting of a lip detector, lip tracker, lip activation detector, and word classifier) that can recognize isolated Korean words. Lip detection is performed in several stages: face detection, eye detection, mouth detection, mouth end-point detection, and active appearance model (AAM) fitting. Lip tracking is then undertaken via a novel two-stage method, in which the model-based Lucas-Kanade feature tracker tracks the outer lip and a fast block-matching algorithm tracks the inner lip. Lip activation detection is undertaken with a neural network classifier whose input is a combination of the lip-motion energy function and the first dominant shape feature. In the last step, input words are defined and recognized by three different classifiers: HMM, ANN, and K-NN. We combine the proposed lip reading system with an audio-only automatic speech recognition (ASR) system to improve word recognition performance in noisy environments, and demonstrate the potential applicability of the combined system within hands-free in-vehicle navigation devices. Results from experiments on 30 isolated Korean words using the K-NN classifier at 15 fps show that the proposed lip reading system achieves a 92.67% word correct rate (WCR) in person-dependent tests and a 46.50% WCR in person-independent tests. The combined audio-visual ASR system increases the WCR from 0% to 60% in a noisy environment.
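The final K-NN word-classification step can be sketched as follows, assuming each utterance has already been reduced to a fixed-length lip-feature vector; the feature dimension, cluster centres and word labels are hypothetical:

```python
import numpy as np

def knn_word(x, train_feats, train_labels, k=3):
    """K-NN word classifier: majority label among the k nearest
    training lip-feature vectors (Euclidean distance)."""
    dists = np.linalg.norm(train_feats - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

# Hypothetical lip-feature vectors for two word classes.
rng = np.random.default_rng(2)
train_feats = np.vstack([rng.normal(0.0, 0.1, (5, 4)),
                         rng.normal(1.0, 0.1, (5, 4))])
train_labels = ["word_a"] * 5 + ["word_b"] * 5
```

K-NN needs no training phase beyond storing exemplars, which makes it a convenient baseline against the HMM and ANN classifiers mentioned above.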

14.
Conventional Hidden Markov Model (HMM)-based Automatic Speech Recognition (ASR) systems generally use cepstral features as acoustic observations and phonemes as basic linguistic units, and some of the most powerful features currently used are Mel-Frequency Cepstral Coefficients (MFCCs). Speech recognition is inherently complicated by variability in the speech signal, including within- and across-speaker variability, which leads to several kinds of mismatch between acoustic features and acoustic models and hence degrades system performance. The sensitivity of MFCCs to this variability motivates the investigation of new speech feature parameters that make the acoustic models more robust, and combining diverse acoustic feature sets has great potential to enhance ASR performance. This paper is part of ongoing research efforts aiming to build an accurate Arabic ASR system for teaching and learning purposes. It addresses the integration of complementary features into standard HMMs in order to make them more robust and thus improve their recognition accuracy. The complementary features investigated in this work are voiced formants and pitch, in combination with conventional MFCC features. A series of experiments under various combination strategies was performed to determine which of these integrated features significantly improve system performance. The Cambridge HTK tools were used as the development environment, and experimental results showed that the error rate was successfully decreased; the achieved results seem very promising, even without using language models.
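One simple combination strategy of the kind investigated here is framewise concatenation of normalised feature streams. The stream shapes, value ranges and helper names below are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

def normalise(stream):
    """Per-dimension mean/variance normalisation so streams with very
    different ranges (cepstra vs. Hz) become comparable."""
    return (stream - stream.mean(axis=0)) / (stream.std(axis=0) + 1e-8)

def combine(mfccs, pitch, formants):
    """Append normalised pitch and formant values to each frame's MFCC
    vector, giving an augmented acoustic observation for the HMM."""
    return np.hstack([normalise(mfccs),
                      normalise(pitch[:, None]),
                      normalise(formants)])

rng = np.random.default_rng(3)
mfccs = rng.normal(size=(10, 13))            # 13 MFCCs per frame
pitch = rng.uniform(80, 250, 10)             # one F0 value per frame (Hz)
formants = rng.uniform(300, 3000, (10, 3))   # F1-F3 per frame (Hz)
obs = combine(mfccs, pitch, formants)
```

The HMM observation dimension simply grows from 13 to 17 here; other strategies (e.g. separate streams with per-stream weights, as HTK supports) keep the sources distinct instead.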

15.
Human activity recognition systems are currently implemented by hundreds of applications, and in recent years several technology manufacturers have introduced new wearable devices for this purpose. Battery consumption is a critical issue in these systems since most are powered by a rechargeable battery. In this paper, an innovative approach for human activity recognition on mobile devices is presented, using discretization techniques based on the Ameva algorithm. Unlike other systems in current use, this proposal enables the recognition of fine-grained activities using accelerometer sensors, so the accuracy of activity recognition can be increased without sacrificing efficiency. A comparison is carried out between the proposed approach and one based on well-known neural networks.

16.
In this paper, a spoken query system is demonstrated that can be used to access the latest agricultural commodity prices and weather information in the Kannada language using a mobile phone. The system consists of Automatic Speech Recognition (ASR) models, an Interactive Voice Response System (IVRS) call flow, and the Agricultural Marketing Network (AGMARKNET) and India Meteorological Department (IMD) databases. The ASR models are developed using the Kaldi speech recognition toolkit, with task-specific speech data collected from the different dialect regions of Karnataka (a state in India where Kannada is spoken). A web crawler is used to obtain the commodity price and weather information from the AGMARKNET and IMD websites, and the PostgreSQL database management system is used to manage the crawled data. 80% and 20% of the validated speech data are used for system training and testing, respectively. The accuracy and Word Error Rate (WER) of the ASR models are reported, and an end-to-end spoken query system is developed for the Kannada language.

17.
With the rapid development of mobile and related technologies, electronic devices such as mobile phones, personal digital assistants (PDAs), and laptop computers have become popular and are now indispensable companions for work and entertainment. For the adoption of such mobile devices in China, Chinese character input is an issue that must be addressed. Traditional input methods mostly rely on keyboards: whether the standard keyboard of a laptop or the simplified keypads designed by handset manufacturers, characters are entered by collecting keystrokes and completing input through pinyin or stroke-based schemes. For small embedded devices, keyboard-based designs occupy considerable space and make Chinese character input inefficient. Touch-screen handwritten Chinese character input has become an increasingly popular way to solve these problems while preserving sufficient display space and avoiding additional complex hardware. Using Windows CE 5.0 as the runtime platform and Embedded Visual C++ 4.0 as the development environment, this work designs and implements an on-screen handwriting recognition system that not only recognises existing Chinese characters effectively but also allows users to extend the character library as needed, helping to improve the character recognition rate.

18.
Speech recognition can be a powerful tool when physical disabilities, environmental factors, or the tasks in which an individual is engaged hinder the individual's ability to use traditional input devices. While state-of-the-art speech-recognition systems typically provide mechanisms for both data entry and cursor control, speech-based interactions continue to be slow compared with similar keyboard- or mouse-based interactions. Although numerous researchers continue to investigate methods of improving speech-based interactions, most of these efforts focus on the underlying technologies or dictation-oriented applications; as a result, the efficacy of speech-based cursor control has received little attention. In this article, we describe two experiments that provide insight into the issues involved in speech-based cursor control. The first compares two variations of a common speech-based cursor-control mechanism: one employs the standard mouse cursor, while the second provides a predictive cursor designed to help users compensate for the delays often associated with speech recognition. As expected, larger targets and shorter distances resulted in shorter target selection times, and larger targets also resulted in fewer errors. Interestingly, there were no differences between the standard and predictive cursors. The second experiment investigates the delays associated with spoken input, explains why the original predictive-cursor implementation failed to provide the expected benefits, and provides insight that guided the design of a new predictive cursor. Published online: 6 November 2002

19.
In this paper we investigate Artificial Neural Network (ANN)-based Automatic Speech Recognition (ASR) using limited Arabic vocabulary corpora: digits, and vowels carried by specific carrier words. In addition, Hidden Markov Model (HMM)-based ASR systems are designed and compared with two ANN-based systems, a Multilayer Perceptron (MLP) and a recurrent architecture, on the same corpora. All systems are isolated-word speech recognizers. The ANN-based system achieved 99.5% correct digit recognition, whereas the HMM-based system achieved 98.1%. On the vowel carrier words, the MLP and recurrent ANN-based systems achieved 92.13% and 98.06% correct vowel recognition, respectively, while the HMM-based system achieved 91.6%.

20.
With the rapid development of mobile devices, speech recognition systems are migrating in large numbers from laboratory PC platforms to embedded devices. Combining embedded speech recognition with the existing application software of embedded platforms can add a convenient spoken human-machine interface to such software, including the operating system itself. This paper describes the customisation and porting of the WinCE operating system on an Intel PXA270 embedded microprocessor development platform, and the successful development of an embedded speech recognition system using the WinCE 5.0 Speech Application Programming Interface (SAPI 5.0) together with Embedded Visual C++ 4.0 (EVC).
