Similar Documents
20 similar documents were found.
1.
With the arrival of the mobile Internet era and the growing maturity of speech recognition technology, interacting with mobile devices through speech has become a trend. The traditional approach to understanding users' spontaneous spoken input is to hand-write context-free grammar rules, but writing such rules requires substantial human effort and the rules are difficult to maintain and update. This paper proposes a method that combines a support vector machine (SVM) and a conditional random field (CRF) in series, decomposing spoken task understanding into two stages, task detection and information extraction, and finally representing the task as a semantic vector. The method was tested on data covering eight task categories returned by users of the "讯飞语点" voice assistant; with noise mixed in at a 1:1 ratio, task semantic representations were recognized with 90.29% precision and 88.87% recall.
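A minimal sketch of the serial SVM-plus-CRF pipeline the abstract describes, assuming scikit-learn and sklearn-crfsuite are available; the toy utterances, features, and BIO slot labels are illustrative placeholders, not the paper's actual design:

```python
# Two-stage spoken task understanding: stage 1 detects the task
# class with an SVM, stage 2 extracts slot information with a CRF.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import sklearn_crfsuite

# Stage 1: task detection (utterance -> task category).
utterances = ["call mom", "set an alarm for seven"]
tasks = ["call", "alarm"]
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
svm = LinearSVC().fit(vectorizer.fit_transform(utterances), tasks)

# Stage 2: information extraction (token -> slot label, BIO scheme).
def token_features(tokens, i):
    return {"word": tokens[i], "is_first": i == 0}

X = [[token_features(u.split(), i) for i in range(len(u.split()))]
     for u in utterances]
y = [["O", "B-contact"], ["O", "O", "O", "O", "B-time"]]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs").fit(X, y)

def understand(utterance):
    """Serial combination: SVM output plus CRF slots, as a frame."""
    tokens = utterance.split()
    task = svm.predict(vectorizer.transform([utterance]))[0]
    slots = crf.predict([[token_features(tokens, i)
                          for i in range(len(tokens))]])[0]
    return {"task": task, "slots": slots}  # semantic-vector-like output
```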

2.
A Chinese Spoken Dialog System for Real-Time Stock Quote Queries
This paper introduces a spoken human-machine dialog system for real-time stock quote queries that integrates speech recognition, language understanding, and dialog control. A situational semantic frame model is defined that handles several of the difficult problems in spoken language understanding.
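A hedged illustration of what a situational semantic frame for this domain might look like; the slot names and the context-inheritance rule are assumptions for illustration, not the paper's actual model:

```python
# Toy situational semantic frame for stock-quote queries: unfilled
# slots are inherited from the previous turn's frame, one common way
# such frames resolve elliptical follow-ups like "And its volume?".
from dataclasses import dataclass
from typing import Optional

@dataclass
class StockQueryFrame:
    stock_name: Optional[str] = None   # e.g. a ticker or company name
    attribute: Optional[str] = None    # e.g. "price", "volume"
    time: Optional[str] = None         # e.g. "now", "yesterday"

def merge_with_context(current: StockQueryFrame,
                       previous: StockQueryFrame) -> StockQueryFrame:
    """Fill missing slots of the current frame from dialog context."""
    return StockQueryFrame(
        stock_name=current.stock_name or previous.stock_name,
        attribute=current.attribute or previous.attribute,
        time=current.time or previous.time,
    )
```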

3.
We are interested in the problem of robust understanding from noisy spontaneous speech input. With the advances in automatic speech recognition (ASR), there has been increasing interest in spoken language understanding (SLU). A key challenge in large-vocabulary spoken language understanding is robustness to ASR errors. State-of-the-art spoken language understanding relies on the best ASR hypotheses (ASR 1-best). In this paper, we propose methods for a tighter integration of ASR and SLU using word confusion networks (WCNs). WCNs obtained from ASR word graphs (lattices) provide a compact representation of multiple aligned ASR hypotheses along with word confidence scores, without compromising recognition accuracy. We present our work on exploiting WCNs instead of simply using ASR 1-best hypotheses. We focus on the tasks of named entity detection and extraction and call classification in a spoken dialog system, although the idea is more general and applicable to other spoken language processing tasks. For named entity detection, using both word lattices and WCNs improved the F-measure by 6–10% absolute, and processing WCNs was 25 times faster than processing lattices, which is very important for real-life applications. For call classification, WCNs yielded a 5–10% relative reduction in error rate compared to ASR 1-best output.
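A sketch of the word confusion network idea, assuming a simple list-of-bins representation; the words and confidence scores are made-up toy values:

```python
# A WCN as a sequence of "bins": each bin holds aligned competing
# word hypotheses with posterior-like confidence scores. The 1-best
# path is the top word of every bin, but downstream SLU can inspect
# all alternatives instead of committing to 1-best early.
wcn = [
    [("call", 0.9), ("fall", 0.1)],
    [("john", 0.6), ("joan", 0.3), ("dawn", 0.1)],
]

def one_best(wcn):
    return [max(bin_, key=lambda ws: ws[1])[0] for bin_ in wcn]

def phrase_confidence(wcn, phrase):
    """Score a phrase by multiplying per-bin word confidences."""
    score = 1.0
    for bin_, word in zip(wcn, phrase):
        score *= dict(bin_).get(word, 0.0)
    return score

print(one_best(wcn))                             # ['call', 'john']
print(phrase_confidence(wcn, ["call", "joan"]))  # 0.9 * 0.3 = 0.27
```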

4.
This paper presents WebGALAXY, a flexible multi-modal user interface system that allows wide access to selected information on the World Wide Web (WWW) by integrating spoken and typed natural language queries with hypertext navigation. WebGALAXY extends our GALAXY spoken language system, a distributed client-server system for retrieving information from online sources through speech and natural language. WebGALAXY supports a spoken user interface via a standard telephone line as well as a graphical user interface via a standard Web browser using either a Java/JavaScript or a cgi-bin/forms front end. Natural language understanding is performed by the system, and information servers retrieve the requested information from various online resources including WWW servers, Gopher servers, and CompuServe. Currently, queries about three domains are supported: weather, air travel, and points of interest around Boston.
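A toy dispatcher in the spirit of such a hub-and-servers architecture: a query is routed to a domain server by its detected domain. The keyword-based domain detection and handler names are our illustrative assumptions, not GALAXY internals:

```python
# Route a natural language query to one of the three supported
# domain servers; cue words and handlers are placeholders.
def detect_domain(query: str) -> str:
    cues = {"weather": ["rain", "forecast"],
            "air_travel": ["flight", "fare"],
            "boston_poi": ["museum", "restaurant"]}
    for domain, words in cues.items():
        if any(w in query.lower() for w in words):
            return domain
    return "unknown"

def route(query: str) -> str:
    handlers = {"weather": lambda q: "querying weather server...",
                "air_travel": lambda q: "querying air travel server...",
                "boston_poi": lambda q: "querying points-of-interest server..."}
    return handlers.get(detect_domain(query), lambda q: "cannot route")(query)

print(route("What is the forecast for Boston?"))
```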

5.
This paper concentrates on the problem of designing and developing a spoken query retrieval (SQR) system for accessing large document databases via voice. The main challenge is to identify and address issues related to the adaptation and scalability of integrating automatic speech recognition (ASR) and information retrieval (IR). The paper presents a Context Aware Language Model (CALM) framework that allows information retrieval from large document databases via voice, and discusses findings from a research study using the framework.

6.
7.
In this paper we demonstrate the use of prosody models for developing speech systems in Indian languages. Duration and intonation models developed using feedforward neural networks are considered as prosody models. Labelled broadcast news data in Hindi, Telugu, Tamil, and Kannada is used for training the neural network models to predict duration and intonation. Features representing positional, contextual, and phonological constraints are used for developing the prosody models. The use of prosody models is illustrated with speech recognition, speech synthesis, speaker recognition, and language identification applications. Autoassociative neural networks and support vector machines are used as classification models for developing the speech systems. The performance of the speech systems is shown to improve when the prosodic features are combined with a popular spectral feature set consisting of Weighted Linear Prediction Cepstral Coefficients (WLPCCs).
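A minimal sketch of a feedforward duration model of the kind described, assuming PyTorch; the feature count, layer sizes, and training data are placeholder assumptions:

```python
# Feedforward prosody model: positional/contextual/phonological
# features of a phone go in, a predicted duration (ms) comes out.
import torch
import torch.nn as nn

duration_model = nn.Sequential(
    nn.Linear(25, 50),   # 25 assumed input features per phone
    nn.Tanh(),
    nn.Linear(50, 1),    # predicted phone duration in ms
)

features = torch.randn(8, 25)      # a toy batch of 8 phones
targets = torch.rand(8, 1) * 200   # fake durations for illustration
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(duration_model.parameters(), lr=1e-3)

# One gradient step of the usual regression training loop.
optimizer.zero_grad()
loss = loss_fn(duration_model(features), targets)
loss.backward()
optimizer.step()
```

An intonation model would follow the same pattern with F0 targets in place of durations.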

8.
We present a methodology for controlling machines using spoken language commands. Two major problems for speech interfaces to machines are investigated: the interpretation of words with fuzzy implications, and out-of-vocabulary (OOV) words in natural conversation. The system proposed in this paper is designed to overcome both problems. It consists of a hidden Markov model (HMM) based automatic speech recognizer (ASR) with a keyword spotting system that captures the machine-sensitive words from running utterances, and a fuzzy-neural network (FNN) based controller that represents the words with fuzzy implications in spoken language commands. The significance of a word, i.e., its contextual meaning given the machine's current state, is introduced so that the output more closely matches the user's intent. The system is modular, which generalizes the methodology to systems with heterogeneous functions without degrading performance. The proposed system is tested experimentally by navigating a mobile robot in real time using spoken language commands.
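A toy illustration of interpreting a fuzzy command word in context; the membership function, step sizes, and state-dependent scaling are our assumptions, standing in for the paper's FNN controller:

```python
# Interpreting "faster" in a command like "move a little faster":
# the same word causes a smaller change when the robot is already
# fast, mimicking the state-dependent "significance" of the word.
def membership_fast(speed):
    """Triangular membership of 'fast' on normalized speed [0, 1]."""
    return max(0.0, (speed - 0.5) / 0.5)

def interpret_faster(current_speed, degree_word):
    step = {"a little": 0.1, "": 0.2, "much": 0.4}[degree_word]
    headroom = 1.0 - membership_fast(current_speed)
    return min(1.0, current_speed + step * headroom)

print(interpret_faster(0.3, "a little"))  # 0.4: full step, robot is slow
print(interpret_faster(0.9, "a little"))  # smaller step, already fast
```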

9.
The design of spoken dialog systems cannot be reduced to the simple combination of speech processing technologies; speech-based interface design has long been an expert job, requiring good skills in speech technologies and low-level programming. Moreover, rapid development and reuse of previously designed systems remain difficult, which makes optimal design and its objective evaluation very hard. The design process is therefore a cycle of prototype releases, user satisfaction surveys, bug reports, and refinements. Human intervention for testing is time-consuming and, above all, very expensive, which is one reason for the recent interest in dialog simulation for evaluation as well as for design automation and optimization. In this paper we present a probabilistic framework for realistic simulation of spoken dialogs in which the major components of a dialog system are modeled and parameterized using independent data or expert knowledge. In particular, an Automatic Speech Recognition (ASR) system model and a User Model (UM) have been developed. The ASR model, based on articulatory similarities in language models, provides task-adaptive performance prediction and Confidence Level (CL) distribution estimation. The user model relies on the Bayesian Network (BN) paradigm and is used both for user behavior modeling and for Natural Language Understanding (NLU) modeling. The complete simulation framework has been used to train a reinforcement-learning agent on two different tasks, and these experiments helped point out several potentially problematic dialog scenarios.
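A sketch of one simulated exchange in such a framework: a user model emits a response, an ASR model corrupts it and draws a confidence level. The distributions and reward are illustrative stand-ins for the paper's BN user model and ASR model:

```python
# One simulated turn for training a dialog policy: sample user
# behavior, simulate ASR errors with a confidence level (CL), and
# emit a toy reward an RL agent could learn from.
import random

def user_model(system_prompt):
    # Assumed probability of a cooperative, on-topic answer.
    return random.choices(["give_slot", "off_topic"], weights=[0.8, 0.2])[0]

def asr_model(true_intent, word_error_rate=0.15):
    recognized_ok = random.random() > word_error_rate
    # CLs tend to be high on correct results, low on errors.
    cl = random.betavariate(8, 2) if recognized_ok else random.betavariate(2, 5)
    return (true_intent if recognized_ok else "misrecognized", cl)

intent = user_model("Which city are you leaving from?")
hypothesis, confidence = asr_model(intent)
reward = 1.0 if hypothesis == intent else -1.0   # toy RL reward signal
print(hypothesis, round(confidence, 2), reward)
```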

10.
11.
As mobile computing devices grow smaller and in-car computing platforms become more common, we must augment traditional methods of human-computer interaction. Although speech interfaces have existed for years, the constrained resources of pervasive devices, such as limited memory and processing capability, present new challenges. We provide an overview of embedded automatic speech recognition (ASR) on pervasive devices and discuss its ability to help us develop pervasive applications that meet today's marketplace needs. ASR recognizes spoken words and phrases. State-of-the-art ASR uses a phoneme-based approach to speech modeling: each phoneme (elementary speech sound) in the language under consideration is given a statistical representation expressing its acoustic properties.
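A toy version of such a statistical representation: each phoneme as a diagonal Gaussian over acoustic features, with a frame scored against each model. Real embedded ASR chains such models into HMMs; the phonemes and numbers here are fabricated for illustration:

```python
# Score one acoustic feature frame against per-phoneme diagonal
# Gaussians and pick the most likely phoneme.
import numpy as np

phoneme_models = {              # phoneme -> (mean, variance) per feature
    "aa": (np.array([1.0, 0.2]), np.array([0.5, 0.3])),
    "iy": (np.array([-0.8, 1.1]), np.array([0.4, 0.6])),
}

def log_likelihood(frame, phoneme):
    mean, var = phoneme_models[phoneme]
    return float(-0.5 * np.sum(np.log(2 * np.pi * var)
                               + (frame - mean) ** 2 / var))

frame = np.array([0.9, 0.3])
best = max(phoneme_models, key=lambda p: log_likelihood(frame, p))
print(best)   # most likely phoneme for this frame
```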

12.
This paper presents a new technique to enhance the performance of the input interface of spoken dialogue systems, based on a procedure that, during speech recognition, combines the advantages of prompt-dependent language models with those of a language model independent of the prompts generated by the dialogue system. The technique creates a new speech recognizer, termed a contextual speech recognizer, that uses a prompt-independent language model to recognize any sentence permitted in the application domain while also using contextual information (in the form of prompt-dependent language models) to reflect the fact that some sentences are more likely to be uttered than others at a particular moment of the dialogue. The experiments show that the technique clearly enhances the performance of the input interface of a previously developed dialogue system based exclusively on prompt-dependent language models. Most importantly, in comparison with a standard speech recognizer that uses a single prompt-independent language model without contextual information, the proposed recognizer increases word accuracy and sentence understanding rates by 4.09% and 4.19% absolute, respectively. These scores are slightly better than those obtained using linear interpolation of the prompt-independent and prompt-dependent language models used in the experiments.
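The linear-interpolation baseline mentioned at the end is easy to make concrete. A sketch, with made-up toy probabilities and an assumed interpolation weight:

```python
# Linear interpolation of a prompt-dependent LM (chosen by the
# current dialog prompt) with a prompt-independent LM; lambda
# controls how much the dialog context is trusted.
prompt_independent_lm = {"yes": 0.05, "no": 0.05, "to boston": 0.01}
prompt_dependent_lms = {
    "confirm_city": {"yes": 0.40, "no": 0.35, "to boston": 0.05},
}

def interpolated_prob(sentence, prompt, lam=0.7):
    p_dep = prompt_dependent_lms[prompt].get(sentence, 1e-6)
    p_ind = prompt_independent_lm.get(sentence, 1e-6)
    return lam * p_dep + (1 - lam) * p_ind

print(interpolated_prob("yes", "confirm_city"))  # 0.7*0.40 + 0.3*0.05 = 0.295
```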

13.
We describe the Lwazi corpus for automatic speech recognition (ASR), a new telephone speech corpus containing data from the eleven official languages of South Africa. Because of practical constraints, the amount of speech per language is relatively small compared to major corpora in world languages, so we report on our investigation of the stability of the ASR models derived from the corpus. We also report phoneme distance measures across languages and describe initial phone recognisers developed using the data. We find that a surprisingly small number of speakers (fewer than 50) and around 10 to 20 h of speech per language are sufficient for acceptable phone-based recognition.
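The abstract does not say which phoneme distance measure was used; one common choice for comparing phoneme models across languages, shown here purely as an assumed example, is the Bhattacharyya distance between per-phoneme Gaussians:

```python
# Bhattacharyya distance between two diagonal-Gaussian phoneme
# models; smaller distance means the two languages realize the
# phoneme more similarly. Toy means/variances for illustration.
import numpy as np

def bhattacharyya(mu1, var1, mu2, var2):
    avg_var = (var1 + var2) / 2
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / avg_var)
    term2 = 0.5 * np.sum(np.log(avg_var / np.sqrt(var1 * var2)))
    return float(term1 + term2)

mu_a, var_a = np.array([0.0, 1.0]), np.array([1.0, 0.5])
mu_b, var_b = np.array([0.3, 0.8]), np.array([0.9, 0.6])
print(bhattacharyya(mu_a, var_a, mu_b, var_b))
```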

14.
We present MARS (Multilingual Automatic tRanslation System), a research prototype speech-to-speech translation system. MARS is aimed at two-way conversational spoken language translation between English and Mandarin Chinese for limited domains, such as air travel reservations. In MARS, machine translation is embedded within a complex speech processing task, and translation performance is strongly affected by the performance of other components, such as the recognizer and the semantic parser. All components in the proposed system are statistically trained using an appropriate training corpus. The speech signal is first recognized by an automatic speech recognizer (ASR). Next, the ASR-transcribed text is analyzed by a semantic parser, which uses a statistical decision-tree model that requires no hand-crafted grammars or rules; the parser also provides semantic information that helps re-score the speech recognition hypotheses. The semantic content extracted by the parser is formatted into a language-independent tree structure, which is used for interlingua-based translation. A Maximum Entropy based sentence-level natural language generation (NLG) approach generates sentences in the target language from the semantic tree representations. Finally, the generated target sentence is synthesized into speech by a speech synthesizer.

Many new features and innovations have been incorporated into MARS: the translation is based on understanding the meaning of the sentence; the semantic parser uses a statistical model trained on a semantically annotated corpus; the output of the semantic parser is used to select a more specific language model that refines speech recognition performance; and the NLG component uses a statistical model trained on the same annotated corpus. These features give MARS robustness to speech disfluencies and recognition errors, tighter integration of semantic information into speech recognition, and portability to new languages and domains, advantages that are verified by our experimental results.
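A skeleton of the cascade just described, with each stage as a stub standing in for a trained statistical component; the function names and canned outputs are ours, not MARS internals:

```python
# MARS-style pipeline: ASR -> semantic parse -> language-independent
# tree (interlingua) -> target-language NLG -> speech synthesis.
def asr(audio):                 # speech -> text hypothesis (stub)
    return "i want to fly to beijing"

def semantic_parse(text):       # text -> language-independent tree (stub)
    return {"intent": "book_flight", "destination": "beijing"}

def generate(tree, language):   # tree -> target-language sentence (stub)
    if language == "zh":
        return "我想飞往北京"
    return "I want to fly to Beijing"

def synthesize(sentence):       # text -> speech (stub)
    return f"<audio for: {sentence}>"

tree = semantic_parse(asr(b"..."))
print(synthesize(generate(tree, "zh")))
```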

15.
Building a continuous speech recognizer for the Bangla (also known as Bengali) language is a challenging task due to the language's inherent features, such as long and short vowels and many instances of allophones. Stress and accent vary in spoken Bangla from region to region, but in formal read Bangla speech they are ignored. There are three approaches to continuous speech recognition (CSR) according to the choice of unit: word, phoneme, and syllable. Pronunciation of words and sentences is strictly governed by a set of linguistic rules. Many attempts have been made to build continuous speech recognizers for Bangla for small and restricted tasks, but medium- and large-vocabulary CSR for Bangla is relatively new and unexplored. In this paper, the authors build an automatic speech recognition (ASR) method based on context-sensitive triphone acoustic models. The method comprises three stages: the first extracts phoneme probabilities from acoustic features using a multilayer neural network (MLN), the second designs triphone models that capture context on both sides, and the final stage generates word strings based on triphone hidden Markov models (HMMs). The objective of this research is to build a medium-vocabulary triphone-based continuous speech recognizer for the Bangla language. In experiments using a Bangla speech corpus prepared by the authors, the recognizer provides higher word accuracy and word correct rate for trained and tested sentences with fewer mixture components in the HMMs.
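A sketch of the second stage's core idea: expanding a phone string into context-sensitive triphone labels (left-phone, phone, right-phone), the unit the HMMs then model. The HTK-style `l-p+r` notation and the toy phone set are our assumptions for illustration:

```python
# Expand a phone sequence into triphone labels that capture context
# on both sides, padding with silence at the utterance boundaries.
def to_triphones(phones, boundary="sil"):
    padded = [boundary] + phones + [boundary]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["b", "a", "ng", "l", "a"]))
# ['sil-b+a', 'b-a+ng', 'a-ng+l', 'ng-l+a', 'l-a+sil']
```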

16.
Many automatic speech recognition (ASR) systems rely solely on pronunciation dictionaries and language models to take information about language into account. Morphology and syntax are, implicitly, embedded to some extent in the language models, but the richness of such linguistic knowledge is not exploited. This paper studies the use of morpho-syntactic (MS) information in a post-processing stage of an ASR system, by reordering N-best lists. Each sentence hypothesis is first part-of-speech tagged. A morpho-syntactic score is computed over the tag sequence with a long-span language model and combined with the acoustic and word-level language model scores. This new sentence-level score is finally used to rescore N-best lists by reranking or consensus. Experiments on a French broadcast news task show that morpho-syntactic knowledge improves the word error rate and confidence measures. In particular, the corrected errors are not only agreement errors and errors on short grammatical words, but also errors on lexical words where the hypothesized lemma was modified.
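A sketch of the reranking step: each N-best hypothesis already carries acoustic and word-LM log scores, and a morpho-syntactic log score over its POS-tag sequence is added with a tunable weight. All scores and the weight below are toy values:

```python
# Rerank an N-best list by a combined sentence-level score:
# acoustic + word LM + weighted morpho-syntactic (MS) score.
nbest = [
    {"text": "the cats is here", "ac": -120.0, "lm": -18.0, "ms": -9.0},
    {"text": "the cats are here", "ac": -121.0, "lm": -18.5, "ms": -4.0},
]

def combined_score(hyp, ms_weight=1.5):
    return hyp["ac"] + hyp["lm"] + ms_weight * hyp["ms"]

best = max(nbest, key=combined_score)
print(best["text"])   # agreement error demoted: "the cats are here"
```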

17.
This paper describes the preparation, recording, analysis, and evaluation of a new speech corpus for Modern Standard Arabic (MSA). The speech corpus contains a total of 415 sentences recorded by 40 Arabic native speakers (20 male, 20 female) from 11 different Arab countries representing three major regions (Levant, Gulf, and Africa). Three hundred and sixty-seven sentences are considered phonetically rich and balanced and are used for training Arabic Automatic Speech Recognition (ASR) systems: rich in the sense that they contain all phonemes of the Arabic language, and balanced in the sense that they preserve the phonetic distribution of the language. The remaining 48 sentences were created for testing purposes; they are mostly disjoint from the training sentences, with hardly any word overlap. To evaluate the speech corpus, Arabic ASR systems were developed using the Carnegie Mellon University (CMU) Sphinx 3 tools at both the training and testing/decoding levels. The speech engine uses 3-emitting-state Hidden Markov Models (HMM) for triphone-based acoustic models. Based on experimental analysis of about 8 h of training speech data, the best acoustic model uses a continuous observation probability model with 16 Gaussian mixture distributions, and the state distributions were tied to 500 senones. The language model contains uni-grams, bi-grams, and tri-grams. For the same speakers with different sentences, the Arabic ASR systems obtained an average Word Error Rate (WER) of 9.70%; for different speakers with the same sentences, an average WER of 4.58%; and for different speakers with different sentences, an average WER of 12.39%.
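For reference, WER as reported above is the word-level Levenshtein distance between reference and hypothesis, normalized by the reference length. A minimal self-contained implementation:

```python
# Word Error Rate: (substitutions + insertions + deletions) divided
# by the number of reference words, via dynamic programming.
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i-1][j-1] + (r[i-1] != h[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the test set has new words", "the test set has few words"))
# 1 substitution / 6 reference words = 0.1667
```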

18.
This paper proposes a new technique to test the performance of spoken dialogue systems by artificially simulating the behaviour of three types of user (very cooperative, cooperative, and not very cooperative) interacting with a system through spoken dialogues. Experiments using the technique were carried out to test the performance of a previously developed dialogue system designed for the fast-food domain, working with two kinds of language model for automatic speech recognition: one based on 17 prompt-dependent language models, and the other based on one prompt-independent language model. The simulated user enables the identification of problems relating to the speech recognition, spoken language understanding, and dialogue management components of the system. In particular, these experiments revealed problems with the recognition and understanding of postal codes and addresses, and with the lengthy sequences of repetitive confirmation turns required to correct these errors. By employing a simulated user in a range of experimental conditions, sufficient data can be generated to support a systematic analysis of potential problems and to enable fine-grained tuning of the system.
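A sketch of how the three user types might be parameterized: a single cooperativeness level controlling how often the simulated user answers the slot actually asked. The probabilities and slot names are illustrative, not the paper's calibration:

```python
# Simulated user with a cooperativeness parameter: a cooperative
# turn answers the asked slot; an uncooperative turn digresses.
import random

COOPERATIVENESS = {"very_cooperative": 0.95,
                   "cooperative": 0.75,
                   "not_very_cooperative": 0.45}

def simulated_user(user_type, asked_slot, goal):
    if random.random() < COOPERATIVENESS[user_type]:
        return {asked_slot: goal[asked_slot]}
    return {random.choice(list(goal)): "..."}   # off-topic or wrong slot

goal = {"postal_code": "18001", "address": "Gran Via 1"}
print(simulated_user("cooperative", "postal_code", goal))
```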

19.
20.