20 similar documents found (search time: 0 ms)
1.
Shinji Watanabe Tomoharu Iwata Takaaki Hori Atsushi Sako Yasuo Ariki 《Computer Speech and Language》2011,25(2):440-461
In a real environment, acoustic and language features often vary depending on the speakers, speaking styles and topic changes. To accommodate these changes, speech recognition approaches that include the incremental tracking of changing environments have attracted attention. This paper proposes a topic tracking language model that can adaptively track changes in topics based on current text information and previously estimated topic models in an on-line manner. The proposed model is applied to language model adaptation in speech recognition. We use the MIT OpenCourseWare corpus and Corpus of Spontaneous Japanese in speech recognition experiments, and show the effectiveness of the proposed method.
2.
3.
A cache-based natural language model for speech recognition (cited 4 times: 0 self-citations, 4 by others)
Kuhn R. De Mori R. 《IEEE transactions on pattern analysis and machine intelligence》1990,12(6):570-583
Speech-recognition systems must often decide between competing ways of breaking up the acoustic input into strings of words. Since the possible strings may be acoustically similar, a language model is required; given a word string, the model returns its linguistic probability. Several Markov language models are discussed. A novel kind of language model, which reflects short-term patterns of word use by means of a cache component (analogous to cache memory in hardware terminology), is presented. The model also contains a trigram component of the traditional type. The combined model and a pure trigram model were tested on samples drawn from the Lancaster-Oslo/Bergen (LOB) corpus of English text. The relative performance of the two models is examined, and suggestions for future improvements are made.
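The cache mechanism described above can be sketched as a linear interpolation between a static model and a decaying-window unigram cache. This is a minimal sketch: the class name, the interpolation weight, the cache size, and the toy static model are illustrative assumptions, not details from the paper.

```python
from collections import deque, Counter

class CacheLM:
    """Interpolates a static model with a cache of recently seen words:
    P(w | h) = lam * P_static(w | h) + (1 - lam) * P_cache(w)."""

    def __init__(self, static_prob, cache_size=200, lam=0.9):
        self.static_prob = static_prob       # callable: (word, history) -> prob
        self.cache = deque(maxlen=cache_size)  # oldest words fall out automatically
        self.lam = lam

    def observe(self, word):
        self.cache.append(word)

    def prob(self, word, history=()):
        counts = Counter(self.cache)
        p_cache = counts[word] / len(self.cache) if self.cache else 0.0
        return self.lam * self.static_prob(word, history) + (1 - self.lam) * p_cache

# toy static model: uniform over a 1000-word vocabulary
lm = CacheLM(lambda w, h: 1e-3, lam=0.9)
p_before = lm.prob("recognition")
for w in ["speech", "recognition", "speech", "recognition"]:
    lm.observe(w)
p_after = lm.prob("recognition")   # boosted: the word was used recently
```

Once a word has appeared in the recent window, its probability rises well above the static estimate, which is exactly the short-term pattern the cache component is meant to capture.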
4.
Despite the significant progress of automatic speech recognition (ASR) over the past three decades, it has not reached the level of human performance, particularly under adverse conditions. To improve ASR performance, various approaches have been studied that differ in feature extraction method, classification method, and training algorithms. Different approaches often exploit complementary information; therefore, combining them can be a better option. In this paper, we propose a novel approach that uses the best characteristics of conventional, hybrid, and segmental HMMs by integrating them with the ROVER system combination technique. In the proposed framework, three different recognizers are created and combined, each with its own feature set and classification technique. For the design and development of the complete system, three separate acoustic models are used with three different feature sets and two language models. Experimental results show that the word error rate (WER) can be reduced by about 4% with the proposed technique compared to conventional methods. The various modules were implemented and tested for Hindi-language ASR, in typical field conditions as well as in noisy environments.
5.
We show the results of studying models of the Russian language constructed with recurrent artificial neural networks for systems of automatic recognition of continuous speech. We construct neural network models with different numbers of elements in the hidden layer and perform linear interpolation of the neural network models with the baseline trigram language model. The resulting models were used at the stage of rescoring the N-best list. In our experiments on the recognition of continuous Russian speech with an extra-large vocabulary (150 thousand word forms), the relative reduction in word error rate obtained after rescoring the 50-best list with the neural network language models interpolated with the trigram model was 14%.
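The rescoring step described above can be sketched as follows: each hypothesis in the N-best list is re-ranked by its acoustic score plus a weighted language-model score, where each word probability is a linear interpolation of the neural-network and trigram estimates. The interpolation weight, LM weight, and toy probabilities below are illustrative assumptions, not values from the paper.

```python
import math

def interpolate(p_rnn, p_tri, lam):
    """Linear interpolation of two LM probabilities for one word."""
    return lam * p_rnn + (1 - lam) * p_tri

def rescore_nbest(nbest, lam=0.5, lm_weight=10.0):
    """nbest: list of (hypothesis, acoustic_score, rnn_word_probs, tri_word_probs).
    Returns hypotheses sorted by combined score (higher is better)."""
    scored = []
    for hyp, acoustic, p_rnn, p_tri in nbest:
        lm_logprob = sum(math.log(interpolate(r, t, lam))
                         for r, t in zip(p_rnn, p_tri))
        scored.append((acoustic + lm_weight * lm_logprob, hyp))
    scored.sort(reverse=True)
    return [h for _, h in scored]

# two toy hypotheses with per-word probabilities under each model
nbest = [
    ("speech recognition",   -100.0, [0.20, 0.10], [0.10, 0.05]),
    ("peach wreck ignition",  -98.0, [0.01, 0.01], [0.02, 0.01]),
]
best = rescore_nbest(nbest)[0]
```

Even though the second hypothesis has the better acoustic score, the interpolated language model pulls the linguistically plausible string to the top of the list.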
6.
Speech production errors characteristic of dysarthria are chiefly responsible for the low accuracy of automatic speech recognition (ASR) when used by people diagnosed with it. A person with dysarthria produces speech in a rather reduced acoustic working space, causing typical measures of speech acoustics to have values in ranges very different from those characterizing unimpaired speech. It is unlikely then that models trained on unimpaired speech will be able to adjust to this mismatch when acted on by one of the currently well-studied adaptation algorithms (which make no attempt to address this extent of mismatch in population characteristics). In this work, we propose an interpolation-based technique for obtaining a prior acoustic model from one trained on unimpaired speech, before adapting it to the dysarthric talker. The method computes a ‘background’ model of the dysarthric talker's general speech characteristics and uses it to obtain a more suitable prior model for adaptation (compared to the speaker-independent model trained on unimpaired speech). The approach is tested with a corpus of dysarthric speech acquired by our research group, on speech of sixteen talkers with varying levels of dysarthria severity (as quantified by their intelligibility). This interpolation technique is tested in conjunction with the well-known maximum a posteriori (MAP) adaptation algorithm, and yields improvements of up to 8% absolute and up to 40% relative over the standard MAP-adapted baseline.
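Under the usual MAP formulation for Gaussian means, the interpolation idea might look like this minimal sketch: the prior mean is a blend of the speaker-independent and background means before the standard relevance-factor update. All names, values, and the synthetic adaptation data are hypothetical, not taken from the corpus or paper.

```python
import numpy as np

def interpolated_prior(mu_si, mu_bg, alpha):
    """Prior mean: interpolate speaker-independent and 'background' means."""
    return alpha * mu_si + (1 - alpha) * mu_bg

def map_update_mean(mu_prior, frames, tau=10.0):
    """Standard MAP mean update with relevance factor tau:
    mu_map = (tau * mu_prior + sum(frames)) / (tau + n)."""
    n = len(frames)
    return (tau * mu_prior + frames.sum(axis=0)) / (tau + n)

mu_si = np.array([0.0, 0.0])   # trained on unimpaired speech
mu_bg = np.array([4.0, 4.0])   # background model of the dysarthric talker
rng = np.random.default_rng(0)
# synthetic adaptation frames whose true mean sits between the two priors
frames = rng.normal(loc=3.0, scale=0.1, size=(50, 2))

mu_plain = map_update_mean(mu_si, frames, tau=10.0)
mu_interp = map_update_mean(interpolated_prior(mu_si, mu_bg, 0.5), frames, tau=10.0)
```

With limited adaptation data, starting from the interpolated prior lands the adapted mean closer to the talker's true statistics than starting from the unimpaired-speech prior alone.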
7.
Command and control (C&C) speech recognition allows users to interact with a system by speaking commands or asking questions restricted to a fixed grammar containing pre-defined phrases. Whereas C&C interaction has been commonplace in telephony and accessibility systems for many years, only recently have mobile devices had the memory and processing capacity to support client-side speech recognition. Given the personal nature of mobile devices, statistical models that can predict commands based in part on past user behavior hold promise for improving C&C recognition accuracy. For example, if a user calls a spouse at the end of every workday, the language model could be adapted to weight the spouse more heavily than other contacts at that time. In this paper, we describe and assess statistical models learned from a large population of users for predicting the next user command of a commercial C&C application. We explain how these models were used for language modeling and evaluate their performance in terms of task completion. The best-performing model achieved a 26% relative reduction in error rate compared to the base system. Finally, we investigate the effects of personalization on performance at different learning rates via online updating of model parameters based on individual user data. Personalization significantly increased the relative reduction in error rate, by an additional 5%.
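A minimal sketch of such personalization, assuming a simple blend of a population prior with exponentially smoothed per-user counts. The blending rule, learning rate, and command names are illustrative, not the commercial system's actual model.

```python
class PersonalizedCommandModel:
    """Blends a population prior over commands with per-user evidence
    updated online; eta controls the learning rate."""

    def __init__(self, prior, eta=0.1):
        self.prior = dict(prior)             # population P(command)
        self.user = {c: 0.0 for c in prior}  # per-user evidence, starts empty
        self.eta = eta

    def update(self, command):
        # exponential decay toward the latest observed command
        for c in self.user:
            self.user[c] *= (1 - self.eta)
        self.user[command] += self.eta

    def prob(self, command):
        total = sum(self.user.values())
        blend = total / (1.0 + total)        # trust user data as it accumulates
        p_user = self.user[command] / total if total else 0.0
        return (1 - blend) * self.prior[command] + blend * p_user

prior = {"call spouse": 0.2, "play music": 0.5, "check email": 0.3}
m = PersonalizedCommandModel(prior, eta=0.2)
for _ in range(10):          # user repeatedly issues the same command
    m.update("call spouse")
p = m.prob("call spouse")    # now well above the population prior of 0.2
```

With no user data the model falls back to the population prior; as evidence accumulates, the blend shifts toward the individual's habits, which is the intuition behind the online-updating experiments above.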
8.
In this paper, the architecture of the first Iranian Farsi continuous speech recognizer and syntactic processor is introduced. In this system, by extracting suitable features of the speech signal (cepstral, delta-cepstral, energy and zero-crossing rate) and using a hybrid architecture of neural networks (a Self-Organizing Feature Map, SOFM, at the first stage and a Multi-Layer Perceptron, MLP, at the second stage), the Iranian Farsi phonemes are recognized. The string of phonemes is then corrected, segmented and converted to formal text using a non-stochastic method. For syntactic processing, symbolic (artificial intelligence techniques) and connectionist (artificial neural network) approaches are both used to determine the correctness, position and kind of syntactic errors in Iranian Farsi sentences.
9.
《Computer Speech and Language》2007,21(1):1-25
Recently, minimum perfect hashing (MPH)-based language model (LM) lookup methods have been proposed for fast access of N-gram LM scores in lexical-tree based LVCSR (large vocabulary continuous speech recognition) decoding. Methods of node-based LM cache and LM context pre-computing (LMCP) have also been proposed to combine with MPH for further reduction of LM lookup time. Although these methods are effective, LM lookup still takes a large share of overall decoding time when trigram LM lookahead (LMLA) is used for lower word error rate than unigram or bigram LMLAs. Besides computation time, memory cost is also an important performance aspect of decoding systems. Most speedup methods for LM lookup obtain higher speed at the cost of increased memory demand, which makes system performance unpredictable when running on computers with smaller memory capacities. In this paper, an order-preserving LM context pre-computing (OPCP) method is proposed to achieve both fast speed and small memory cost in LM lookup. By reducing hashing operations through order-preserving access of LM scores, OPCP cuts down LM lookup time effectively. At the same time, OPCP significantly reduces memory cost because of the reduced size of hashing keys and the need to store only the last-word index of each N-gram. Experimental results are reported on two LVCSR tasks (Wall Street Journal 20K and Switchboard 33K) with three sizes of trigram LMs (small, medium, large). In comparison with the above-mentioned existing methods, OPCP reduced LM lookup time from about 30–80% of total decoding time to about 8–14%, without any increase of word error rate. Except for the small LM, the total memory cost of OPCP for LM lookup and storage was about the same or less than the original N-gram LM storage, and much less than the compared methods. The time and memory savings in LM lookup from using OPCP became more pronounced with increasing LM size.
10.
Multimedia Tools and Applications - This paper investigates language modeling with topical and positional information for large vocabulary continuous speech recognition. We first compare among a...
11.
Robust speech recognition based on joint model and feature spaceoptimization of hidden Markov models
Seokyong Moon Jenq-Neng Hwang 《Neural Networks, IEEE Transactions on》1997,8(2):194-204
The hidden Markov model (HMM) inversion algorithm, based on either the gradient search or the Baum-Welch reestimation of input speech features, is proposed and applied to robust speech recognition tasks under general types of mismatch conditions. This algorithm stems from the gradient-based inversion algorithm of an artificial neural network (ANN) by viewing an HMM as a special type of ANN. Given input speech features s, the forward training of an HMM finds the model parameters lambda subject to an optimization criterion. Conversely, the inversion of an HMM finds speech features s, subject to an optimization criterion, with given model parameters lambda. The gradient-based HMM inversion and the Baum-Welch HMM inversion algorithms can be successfully integrated with model space optimization techniques, such as the robust MINIMAX technique, to compensate for the mismatch in the joint model and feature space. The joint-space mismatch compensation technique achieves better performance than single-space (i.e. either model space or feature space alone) mismatch compensation techniques. It is also demonstrated that a signal-to-noise ratio (SNR) gain of approximately 10 dB is obtained in low-SNR environments when the joint model and feature space mismatch compensation technique is used.
12.
Jia Zeng Zhi-Qiang Liu 《Fuzzy Systems, IEEE Transactions on》2006,14(3):454-467
This paper presents an extension of hidden Markov models (HMMs) based on the type-2 (T2) fuzzy set (FS), referred to as type-2 fuzzy HMMs (T2 FHMMs). Membership functions (MFs) of T2 FSs are three-dimensional, and this new third dimension offers additional degrees of freedom to evaluate the HMMs' fuzziness. Therefore, T2 FHMMs are able to handle both the random and fuzzy uncertainties that exist universally in sequential data. We derive the T2 fuzzy forward-backward algorithm and Viterbi algorithm using T2 FS operations. In order to investigate the effectiveness of T2 FHMMs, we apply them to phoneme classification and recognition on the TIMIT speech database. Experimental results show that T2 FHMMs can effectively handle noise and dialect uncertainties in speech signals and achieve better classification performance than classical HMMs.
13.
To recognize Shanghainese speech and enable Shanghainese voice interaction with a home service robot, this work analyzes the phonetic, intonational, and grammatical characteristics of Shanghainese and proposes a method for modeling its recognition units. The method generates a new set of initials and finals as recognition units and builds a task-specific Shanghainese speech corpus; preliminary acoustic and 3-gram language models for Shanghainese are then constructed with HTK. The system model has seen preliminary application in a home service robot; the system uses V...
14.
15.
Nora Barroso Karmele López de Ipiña Carmen Hernández Aitzol Ezeiza Manuel Graña 《International Journal of Speech Technology》2012,15(1):41-47
This paper describes the development of a Language Identification (LID) system oriented to robust multilingual speech recognition in the Basque context, where three languages coexist: Basque, Spanish, and French. The LID system is integrated in GorUP, a semantic speech recognition system for complex industrial environments described in Part I. The work presents hybrid strategies for LID based on the selection of system elements by several classifiers (Support Vector Machines and Multilayer Perceptrons) and Discriminant Analysis improved with robust regularized covariance-matrix estimation methods oriented to under-resourced languages, together with stochastic methods for speech recognition tasks (Hidden Markov Models and n-grams). The LID tool manages the main elements of the Automatic Speech Recognition system (Acoustic Phonetic Decoder, Language Model and Lexicons).
16.
Rose R. C. 《Computer Speech and Language》1995,9(4)
This paper describes a set of modeling techniques for detecting a small vocabulary of keywords in running conversational speech. The techniques are applied in the context of a hidden Markov model (HMM) based continuous speech recognition (CSR) approach to keyword spotting. The word spotting task is derived from the Switchboard conversational speech corpus, and involves unconstrained conversational speech utterances spoken over the public switched telephone network. The utterances in this task contain many of the artifacts that are characteristic of unconstrained speech as it appears in many telecommunications-based automatic speech recognition (ASR) applications. Results are presented for an experimental study performed on this task. Performance was measured by computing the percentage of correct keyword detections over a range of false alarm rates, evaluated over 2.2 h of speech for a 20-keyword vocabulary. The results of the study demonstrate the importance of several techniques: the use of decision-tree based allophone clustering for defining acoustic subword units, different representations for non-vocabulary words appearing in the input utterance, and the definition of simple language models for keyword detection. Decision-tree based allophone clustering resulted in a significant increase in keyword detection performance over that obtained using triphone-based subword units, while at the same time reducing the size of the inventory of subword acoustic models by 40%. More complex representations of non-vocabulary speech were also found to significantly improve keyword detection performance; however, these representations also resulted in a significant increase in computational complexity.
17.
The use of a statistical language model to improve the performance of an algorithm for recognizing digital images of handwritten or machine-printed text is discussed. A word recognition algorithm first determines a set of words (called a neighborhood) from a lexicon that are visually similar to each input word image. Syntactic classifications for the words and the transition probabilities between those classifications are input to the Viterbi algorithm. The Viterbi algorithm determines the sequence of syntactic classes (the states of an underlying Markov process) for each sentence that has the maximum a posteriori probability, given the observed neighborhoods. The performance of the word recognition algorithm is improved by removing words from neighborhoods whose classes are not included on the estimated state sequence. An experimental application is demonstrated with a neighborhood generation algorithm that produces a number of guesses about the identity of each word in a running text. The use of zero-, first- and second-order transition probabilities and of different levels of noise in estimating the neighborhood is explored.
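The decoding described above can be sketched as a standard Viterbi pass over syntactic classes, where the emission at each position is the probability of the observed neighborhood given each class. The toy grammar, class names, and all probabilities below are invented for illustration.

```python
def viterbi(obs_likelihoods, trans, init):
    """obs_likelihoods: per position, dict class -> P(observed neighborhood | class).
    trans: dict (prev, cur) -> transition prob; init: dict class -> prior.
    Returns the maximum a posteriori class sequence."""
    states = list(init)
    # delta[s] = probability of the best path ending in state s
    delta = {s: init[s] * obs_likelihoods[0].get(s, 0.0) for s in states}
    back = []
    for like in obs_likelihoods[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            prev, score = max(((p, delta[p] * trans.get((p, s), 0.0)) for p in states),
                              key=lambda x: x[1])
            new_delta[s] = score * like.get(s, 0.0)
            pointers[s] = prev
        delta, back = new_delta, back + [pointers]
    # trace back from the best final state
    s = max(delta, key=delta.get)
    path = [s]
    for pointers in reversed(back):
        s = pointers[s]
        path.append(s)
    return path[::-1]

# toy grammar: a determiner tends to precede a noun, which precedes a verb
init = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans = {("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
         ("NOUN", "VERB"): 0.8, ("NOUN", "NOUN"): 0.2,
         ("VERB", "NOUN"): 0.5, ("VERB", "VERB"): 0.5}
obs = [{"DET": 0.9, "NOUN": 0.1},    # neighborhood likelihoods per position
       {"NOUN": 0.7, "VERB": 0.3},
       {"VERB": 0.8, "NOUN": 0.2}]
seq = viterbi(obs, trans, init)
```

Words whose classes do not appear on the decoded sequence would then be pruned from their neighborhoods, as the abstract describes.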
18.
Björn Schuller Zixing Zhang Felix Weninger Felix Burkhardt 《International Journal of Speech Technology》2012,15(3):313-323
Recognizing speakers in emotional conditions remains a challenging issue, since speaker states such as emotion affect the acoustic parameters used in typical speaker recognition systems. Thus, it is believed that knowledge of the current speaker emotion can improve speaker recognition in real-life conditions. Conversely, speech emotion recognition still has to overcome several barriers before it can be employed in realistic situations, as is already the case with speech and speaker recognition. One of these barriers is the lack of suitable training data, both in quantity and quality—especially data that allow recognizers to generalize across application scenarios (the ‘cross-corpus’ setting). In previous work, we have shown that in principle, the use of synthesized emotional speech for model training can be beneficial for recognition of human emotions from speech. In this study, we aim at consolidating these first results in a large-scale cross-corpus evaluation on eight of the most frequently used human emotional speech corpora, namely ABC, AVIC, DES, EMO-DB, eNTERFACE, SAL, SUSAS and VAM, covering natural, induced and acted emotion as well as a variety of application scenarios and acoustic conditions. Synthesized speech is evaluated standalone as well as in joint training with human speech. Our results show that the use of synthesized emotional speech in acoustic model training can significantly improve recognition of arousal from human speech in the challenging cross-corpus setting.
19.
The use of speaker-independent speech recognition in the development of Northern Telecom's automated alternate billing service (AABS) for collect calls, third-number-billed calls, and calling-card-billed calls is discussed. The AABS system automates a collect call by recording the calling party's name, placing a call to the called party, playing back the calling party's name to the called party, informing the called party that he or she has a collect call from that person, and asking, 'Will you pay for the call?' The operation of AABS, the architecture of the voice interface, and the speech recognition algorithm are described, and the accuracy of the recognizer is discussed. AABS relies on isolated-word recognition, although more advanced techniques that can recognize continuous speech are being pursued.
20.
Cookhwan Kim Sungsik Park Kwiseok Kwon Woojin Chang 《Expert systems with applications》2012,39(1):117-128
Online marketplaces, which take the form of “open markets” where very large numbers of buyers and sellers participate, have rapidly gained ground in e-commerce, resulting in sellers' increasing investment in online advertising. Hence, there is a growing need to identify the effectiveness of online advertising in online marketplaces such as eBay.com. However, it is problematic to directly apply existing online-advertising effect models to click-through data from online marketplaces. Therefore, a model is needed that estimates the effectiveness of online advertising in an online marketplace while accounting for its characteristics. In this paper, we develop an analytical Bayesian approach to modeling click-through data by employing the Poisson-gamma distribution. Our results have implications for measuring online advertising effects, and may help guide advertisers in decision-making.
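The conjugacy that makes such a model tractable can be sketched in a few lines: a Gamma(α, β) prior on a Poisson click rate yields a closed-form Gamma posterior after observing count data. The prior parameters and click counts below are hypothetical, not values from the paper.

```python
def gamma_poisson_posterior(alpha, beta, clicks):
    """Conjugate update: with a Gamma(alpha, beta) prior on a Poisson
    click rate, observing the counts gives the posterior
    Gamma(alpha + sum(clicks), beta + len(clicks))."""
    return alpha + sum(clicks), beta + len(clicks)

def posterior_mean(alpha, beta):
    """Mean of a Gamma(alpha, beta) distribution (rate parameterization)."""
    return alpha / beta

# hypothetical prior: expected click rate of 2 per day
alpha0, beta0 = 2.0, 1.0
clicks = [5, 7, 4, 6, 8]   # observed daily click counts for one listing
a, b = gamma_poisson_posterior(alpha0, beta0, clicks)
rate = posterior_mean(a, b)  # posterior estimate pulled toward the data
```

The posterior mean sits between the prior mean and the empirical daily average, with the data dominating as more days of click-through counts are observed.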