Similar Documents
20 similar documents found.
1.
This paper describes a set of modeling techniques for detecting a small vocabulary of keywords in running conversational speech. The techniques are applied in the context of a hidden Markov model (HMM) based continuous speech recognition (CSR) approach to keyword spotting. The word spotting task is derived from the Switchboard conversational speech corpus, and involves unconstrained conversational speech utterances spoken over the public switched telephone network. The utterances in this task contain many of the artifacts that are characteristic of unconstrained speech as it appears in many telecommunications-based automatic speech recognition (ASR) applications. Results are presented for an experimental study performed on this task. Performance was measured by computing the percentage of correct keyword detections over a range of false alarm rates, evaluated over 2.2 h of speech for a 20-keyword vocabulary. The results of the study demonstrate the importance of several techniques: the use of decision-tree-based allophone clustering for defining acoustic subword units, different representations for non-vocabulary words appearing in the input utterance, and the definition of simple language models for keyword detection. Decision-tree-based allophone clustering resulted in a significant increase in keyword detection performance over that obtained using triphone-based subword units, while at the same time reducing the size of the inventory of subword acoustic models by 40%. More complex representations of non-vocabulary speech were also found to significantly improve keyword detection performance; however, these representations also resulted in a significant increase in computational complexity.
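
The evaluation described here (percentage of correct keyword detections over a range of false alarm rates) can be made concrete with a small scoring sketch. The matching rule, tolerance window, and all names below are illustrative assumptions, not the paper's exact protocol: a putative hit counts as correct if a reference occurrence of the same keyword lies within a small time tolerance, and the operating point is swept by thresholding the detector score.

```python
# Sketch: score keyword-spotting output against a reference, sweeping the
# detection threshold to trace detection rate vs. false alarms per hour.
# All names and the matching tolerance are illustrative assumptions.

def score_spotting(hyps, refs, hours_of_speech, tolerance=0.5):
    """hyps: list of (keyword, time_sec, score); refs: list of (keyword, time_sec)."""
    results = []
    for thresh in sorted({s for _, _, s in hyps}, reverse=True):
        kept = [h for h in hyps if h[2] >= thresh]
        matched_refs = set()
        false_alarms = 0
        for kw, t, _ in kept:
            hit = next((i for i, (rkw, rt) in enumerate(refs)
                        if i not in matched_refs and rkw == kw and abs(rt - t) <= tolerance),
                       None)
            if hit is None:
                false_alarms += 1
            else:
                matched_refs.add(hit)
        detection_rate = len(matched_refs) / max(len(refs), 1)
        results.append((false_alarms / hours_of_speech, detection_rate))
    return results

if __name__ == "__main__":
    refs = [("account", 3.2), ("card", 7.9), ("account", 15.1)]
    hyps = [("account", 3.3, 0.9), ("card", 8.1, 0.7), ("card", 11.0, 0.4)]
    for fa_per_hour, det in score_spotting(hyps, refs, hours_of_speech=2.2):
        print(f"{fa_per_hour:5.2f} FA/h -> {100 * det:5.1f}% detected")
```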

2.
The paralinguistic information in a speech signal includes clues to the geographical and social background of the speaker. This paper is concerned with the automatic extraction of this information from a short segment of speech. A state-of-the-art language identification (LID) system is applied to the problems of regional accent recognition for British English and ethnic group recognition within a particular accent. We compare the results with human performance and, for accent recognition, with the ‘text-dependent’ ACCDIST accent recognition measure. For the 14 regional accents of British English in the ABI-1 corpus (good-quality read speech), our LID system achieves a recognition accuracy of 89.6%, compared with 95.18% for our best ACCDIST-based system and 58.24% for human listeners. The “Voices across Birmingham” corpus contains significant amounts of conversational telephone speech for the two largest ethnic groups in the city of Birmingham (UK), namely the ‘Asian’ and ‘White’ communities. Our LID system distinguishes between these two groups with an accuracy of 96.51%, compared with 90.24% for human listeners. Although direct comparison is difficult, our LID system appears to perform much better on the standard 12-class NIST 2003 Language Recognition Evaluation task and the two-class ethnic group recognition task than on the 14-class regional accent recognition task. We conclude that automatic accent recognition is a challenging task for speech technology, and speculate that the use of natural conversational speech may be advantageous for these types of paralinguistic tasks.
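
The abstract does not detail the LID system itself, but a common baseline of this flavour is a Gaussian mixture model per class over frame-level acoustic features, with classification by summed log-likelihood. The sketch below is a minimal, hypothetical version of that idea; the feature shapes, class labels, and mixture sizes are assumptions, not the paper's system.

```python
# Minimal GMM-per-class classifier of the kind often used as an LID/accent
# baseline. Frame features (e.g. MFCCs) are stand-ins; shapes are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_accent_models(frames_by_accent, n_components=8, seed=0):
    """frames_by_accent: dict accent -> (n_frames, n_dims) array of features."""
    models = {}
    for accent, frames in frames_by_accent.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              random_state=seed).fit(frames)
        models[accent] = gmm
    return models

def classify(models, frames):
    """Pick the accent whose GMM gives the highest total frame log-likelihood."""
    return max(models, key=lambda a: models[a].score_samples(frames).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = {"accent_a": rng.normal(0.0, 1.0, (500, 13)),   # placeholder feature data
             "accent_b": rng.normal(0.8, 1.0, (500, 13))}
    models = train_accent_models(train)
    test = rng.normal(0.8, 1.0, (200, 13))
    print("predicted accent:", classify(models, test))
```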

3.
This paper presents a new method of constructing phonetic decision trees (PDTs) for acoustic model state tying based on implicitly induced prior knowledge. Our hypothesis is that knowledge of pronunciation variation in spontaneous, conversational speech, contained in a relatively large corpus, can be used for building domain-specific or speaker-dependent PDTs. Viewed as tree-structure adaptation, this method transforms the tree topology, in contrast to the fixed tree structure kept by traditional methods of speaker adaptation. A Bayesian learning framework is proposed to incorporate prior knowledge on decision rules into a greedy search for new decision trees, where the prior is generated by a decision-tree growing process on a large data set. Experimental results on the telemedicine automatic captioning task demonstrate that the proposed approach yields consistent improvements in model quality and recognition accuracy.
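
The abstract does not give the exact objective used in the Bayesian framework, so the sketch below only illustrates the general shape of such a scheme: a MAP-style node-splitting score that adds a weighted log-prior for each candidate question to the usual likelihood gain, with the prior estimated from how often each question was chosen when growing trees on a large corpus. The scoring form, weights, and names are assumptions.

```python
# Sketch of MAP-style greedy question selection for phonetic decision trees:
# score(q) = likelihood_gain(q) + tau * log prior(q), where the prior is
# estimated from question usage in trees grown on a large corpus.
# The scoring form and all names are assumptions, not the paper's exact recipe.
import math

def question_prior(question, prior_counts, total, smoothing=1.0):
    # Smoothed relative frequency of the question in the prior trees.
    return (prior_counts.get(question, 0) + smoothing) / (total + smoothing * len(prior_counts))

def select_question(node_stats, questions, likelihood_gain, prior_counts, tau=5.0):
    total = sum(prior_counts.values())
    best, best_score = None, -math.inf
    for q in questions:
        score = likelihood_gain(node_stats, q) + tau * math.log(
            question_prior(q, prior_counts, total))
        if score > best_score:
            best, best_score = q, score
    return best

if __name__ == "__main__":
    prior_counts = {"L_is_nasal": 40, "R_is_vowel": 25, "L_is_stop": 5}
    gain = lambda stats, q: {"L_is_nasal": 10.0, "R_is_vowel": 11.0, "L_is_stop": 12.0}[q]
    # A slightly larger likelihood gain for a rarely used question can be
    # outweighed by the prior induced from the large corpus.
    print(select_question({}, list(prior_counts), gain, prior_counts))
```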

4.
While Hidden Markov Models (HMMs) have been successful in many speech recognition tasks, performance on conversational speech is somewhat less successful, arguably due in part to the greater variation in the timing of articulatory events. Loosely Coupled or Factorial HMMs (FHMMs) represent a family of models that have more flexibility for modeling such variation in speech, but there are tradeoffs to be studied in terms of computation and potential added confusability. This paper investigates two specific instances, Mixed-Memory and Parameter-Tied FHMMs, both of which can be thought of as loosely coupled HMMs for modeling multiple time series. The Parameter-Tied FHMM, introduced here, has a potential advantage for speech modeling since it allows a left-to-right topology across the product state space. Experimental results on the ISOLET task show that both models are feasible for speech recognition; TI-DIGITS recognition results show that the Parameter-Tied FHMM is competitive with multiband models. State occupancy and pruning analyses show trends related to asynchrony that hold across the different models.
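
The trade-off between loose coupling and the size of the product state space can be seen by explicitly forming the composite model of two streams. The sketch below builds the product state space of two small left-to-right chains and, for a mixed-memory flavour, approximates one stream's transition as a convex mixture of conditionals on its own and on the other stream's previous state; the parameterisation, weights, and cross-stream table are illustrative assumptions rather than the paper's models.

```python
# Sketch: product state space of two left-to-right HMM chains.
# A fully factored coupling is the Kronecker product of the two transition
# matrices; a mixed-memory variant mixes conditionals on each stream's own
# previous state and on the other stream's previous state.
import numpy as np

def left_to_right(n, stay=0.6):
    A = np.zeros((n, n))
    for i in range(n):
        A[i, i] = stay
        if i + 1 < n:
            A[i, i + 1] = 1.0 - stay
        else:
            A[i, i] = 1.0          # absorbing final state
    return A

A1, A2 = left_to_right(3), left_to_right(4)

# Fully factored coupling: P((i, j) -> (k, l)) = A1[i, k] * A2[j, l].
A_product = np.kron(A1, A2)
print("product states:", A_product.shape[0])   # 3 * 4 = 12

# Mixed-memory flavour: stream 1's next state depends on its own previous
# state with weight w and on stream 2's previous state with weight 1 - w,
# via an (assumed) cross-stream conditional B12 of shape (n2, n1).
w = 0.8
rng = np.random.default_rng(0)
B12 = rng.dirichlet(np.ones(3), size=4)   # rows: stream-2 state -> dist over stream-1 states
def mixed_memory_step(i, j, k):
    return w * A1[i, k] + (1 - w) * B12[j, k]
print("P(stream1: 0->1 | stream2 in state 2) =", round(mixed_memory_step(0, 2, 1), 3))
```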

5.
We have developed a broadcasting agent system, the public opinion channel (POC) caster, which generates an understandable conversational form from text-based documents. The POC caster circulates the opinions of community members by using conversational form in an Internet broadcasting system. We evaluated its transformation rules in two experiments. In experiment 1, we examined our transformation rules for conversational form in relation to sentence length. Twenty-four participants listened to two types of sentence (long sentences and short sentences) presented either in conversational form or as a single speaker's speech. In experiment 2, we investigated the relationship between conversational form and the user's knowledge level. Forty-two participants (21 with a high knowledge level and 21 with a low knowledge level), selected on the basis of a knowledge task, listened to two kinds of sentence (sentences about a well-known topic or sentences about an unfamiliar topic). Our results indicate that the conversational form aided comprehension, especially for long sentences and when users had little knowledge about the topic. We explore possible explanations and implications of these results with regard to human cognition and text comprehension.

6.
An input device should be natural and convenient for a user to transmit information to a computer, and should be designed from an understanding of the task to be performed and of the interrelationship between the task and the device from the perspective of the user. In order to investigate the potential of speech input as a reality-based interaction device, this paper presents the findings of a study of speech input in a VR application. Two independent user trials were combined within the same experimental design to evaluate the commands that users employed when given free speech, i.e. when they were not restricted to a specific vocabulary. The study also investigated the effect of telling participants that they were either talking to a machine (e.g. a speech recognition system) or instructing another person to complete a VR-based task. Previous research has illustrated that limiting users to a specific vocabulary can alter the interaction style employed. The findings from this research illustrate that the interaction style users employ is very different depending on whether they are told they are talking to a machine or to another person. Using this knowledge, recommendations can be drawn for the development of speech input vocabularies for future VR applications.

7.
8.
Detecting laughter in spontaneous speech by constructing laughter bouts (total citations: 1, self-citations: 0, citations by others: 1)
Laughter frequently occurs in spontaneous speech (e.g. conversational speech, meeting speech). Detecting laughter is important for semantic analysis, highlight extraction, spontaneous speech recognition, and related tasks. In this paper, we first analyze the characteristic differences between speech and laughter, and then propose an approach for detecting laughter in spontaneous speech. In the proposed approach, non-silence signal segments are first extracted from spontaneous speech using voice activity detection, and are then split into syllables. Afterward, possible laughter bouts are constructed by merging adjacent syllables (using a symmetrical Itakura distance measure and a duration threshold) instead of using a sliding fixed-length window. Finally, hidden Markov models (HMMs) are used to recognize the possible laughter bouts as laughs, speech sounds or other sounds. Experimental evaluations show that the proposed approach achieves satisfactory results in detecting two types of audible laughs (audible solo and group laughs): precision rate, recall rate, and F1-measure (the harmonic mean of precision and recall) are 83.4%, 86.1%, and 84.7%, respectively. Compared with the sliding-window-based approach, a 4.9% absolute improvement in F1-measure is obtained. In addition, the laughter boundary errors obtained by the proposed approach are smaller than those obtained by the sliding-window-based approach.
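
The bout-construction step (merging adjacent syllables by acoustic similarity and temporal proximity rather than sliding a fixed window) can be sketched as below. The distance stand-in, thresholds, and data structures are placeholders, since the abstract does not give the exact values or features.

```python
# Sketch: merge adjacent syllables into candidate laughter bouts when they are
# acoustically similar (small symmetric distance) and close in time.
# Thresholds, the distance stand-in, and all names are illustrative.
import numpy as np

def symmetric_distance(feat_a, feat_b):
    # Placeholder for the symmetrical Itakura distance between LPC-style features:
    # here just a Euclidean distance between mean feature vectors.
    return float(np.linalg.norm(feat_a.mean(0) - feat_b.mean(0)))

def build_bouts(syllables, dist_thresh=1.0, gap_thresh=0.2):
    """syllables: list of dicts with 'start', 'end', 'feat' ((n_frames, dim) array)."""
    bouts = []
    for syl in syllables:
        if bouts:
            prev = bouts[-1][-1]
            close_in_time = syl["start"] - prev["end"] <= gap_thresh
            similar = symmetric_distance(prev["feat"], syl["feat"]) <= dist_thresh
            if close_in_time and similar:
                bouts[-1].append(syl)
                continue
        bouts.append([syl])
    return bouts   # each bout would then be recognized by HMMs as laugh / speech / other

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    syls = [{"start": 0.0, "end": 0.2, "feat": rng.normal(0, 0.1, (20, 12))},
            {"start": 0.25, "end": 0.45, "feat": rng.normal(0, 0.1, (20, 12))},
            {"start": 1.2, "end": 1.4, "feat": rng.normal(2, 0.1, (20, 12))}]
    print([len(b) for b in build_bouts(syls)])   # e.g. [2, 1]
```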

9.
The performance of isolated word speech recognition systems has steadily improved over time as we learn more about how to represent the significant events in speech, and how to capture these events via appropriate analysis procedures and training algorithms. In particular, algorithms based on both template matching (via dynamic time warping (DTW) procedures) and hidden Markov models (HMMs) have been developed which yield high accuracy on several standard vocabularies, including the 10 digits (zero to nine) and the set of 26 letters of the English alphabet (A-Z). Results are given showing the currently attainable performance of a laboratory system for both template-based (DTW) and HMM-based recognizers, operating in both speaker-trained and speaker-independent modes, on the digit and alphabet vocabularies using telephone recordings. We show that the average error rates of these systems, on standard vocabularies, are significantly lower than those reported several years earlier on the exact same databases, reflecting the progress which has been made in all aspects of the speech recognition process.
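
The template-matching side of these results rests on dynamic time warping. A compact version of the standard DTW recursion over two feature sequences is sketched below; this is the generic textbook formulation, not the specific laboratory system, and the feature data is synthetic.

```python
# Sketch: classic dynamic time warping between two feature sequences,
# D[i, j] = d(x_i, y_j) + min(D[i-1, j], D[i, j-1], D[i-1, j-1]).
import numpy as np

def dtw_distance(x, y):
    """x: (n, dim), y: (m, dim) feature sequences (e.g. MFCC frames)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    template = rng.normal(0, 1, (40, 13))                        # stored reference for one word
    same_word = template[::2] + rng.normal(0, 0.1, (20, 13))     # time-compressed repetition
    other_word = rng.normal(3, 1, (35, 13))
    print("same word :", round(dtw_distance(template, same_word), 1))
    print("other word:", round(dtw_distance(template, other_word), 1))
```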

10.
Current high-accuracy speech understanding systems achieve their performance at the cost of highly constrained grammars over relatively small vocabularies. Less-constrained systems will need to compensate for their loss of top-down constraint by improving bottom-up performance. To do this, they will need to eliminate from consideration, at each place in the utterance, most words in their vocabularies solely on the basis of acoustic information and the expected pronunciations of the words. Towards this goal, we present the design and performance of Noah, a bottom-up word hypothesizer which is capable of handling large vocabularies of more than 10,000 words. Noah takes (machine) segmented and labeled speech as input and produces word hypotheses. The primary concern of this work is the problem of word hypothesizing from large vocabularies. Particular attention has been paid to accuracy, knowledge representation, knowledge acquisition, and flexibility. In this paper we discuss the problem of word hypothesizing, describe how the design of Noah addresses these problems, and present the performance of Noah as a function of vocabulary size.
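
The core operation in a bottom-up word hypothesizer of this kind is matching expected pronunciations against a machine-labeled segment sequence at every position and keeping the words whose match score clears a threshold. The sketch below uses a plain edit-distance scorer over phone labels; the tiny lexicon, scoring, and threshold are illustrative assumptions, not Noah's knowledge representation.

```python
# Sketch: hypothesize words from a labeled phone sequence by matching each
# pronunciation against every starting position with an edit-distance score.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def hypothesize(phones, lexicon, max_errors=1):
    """phones: list of labels; lexicon: dict word -> pronunciation (list of phones)."""
    hyps = []
    for word, pron in lexicon.items():
        for start in range(len(phones) - len(pron) + 2):
            window = phones[start:start + len(pron)]
            errs = edit_distance(window, pron)
            if errs <= max_errors:
                hyps.append((word, start, errs))
    return sorted(hyps, key=lambda h: h[2])

if __name__ == "__main__":
    phones = ["s", "ih", "k", "s", "t", "iy", "n"]      # noisy labels for "sixteen"
    lexicon = {"six": ["s", "ih", "k", "s"],
               "sixteen": ["s", "ih", "k", "s", "t", "iy", "n"],
               "ten": ["t", "eh", "n"]}
    print(hypothesize(phones, lexicon))
```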

11.
Multiple-cluster schemes, such as cluster adaptive training (CAT) or eigenvoice systems, are a popular approach for rapid speaker and environment adaptation. Interpolation weights are used to transform a multiple-cluster canonical model into a standard hidden Markov model (HMM) set representative of an individual speaker or acoustic environment. Maximum likelihood training for CAT has previously been investigated. However, in state-of-the-art large vocabulary continuous speech recognition systems, discriminative training is commonly employed. This paper investigates applying discriminative training to multiple-cluster systems. In particular, minimum phone error (MPE) update formulae for CAT systems are derived. In order to use MPE in this case, modifications to the standard MPE smoothing function and to the prior distribution associated with MPE training are required. A more complex adaptive training scheme combining both interpolation weights and linear transforms, the structured transform (ST), is also discussed within the MPE training framework. Discriminatively trained CAT and ST systems were evaluated on a state-of-the-art conversational telephone speech task. These multiple-cluster systems were found to outperform both standard and adaptively trained systems.
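
The interpolation weights mentioned here combine the cluster means of the canonical model into a speaker-specific HMM set. In its simplest mean-only form this is just a weighted sum per Gaussian, as sketched below; the shapes and values are illustrative, and the sketch omits linear transforms and the MPE updates themselves.

```python
# Sketch: cluster adaptive training at adaptation time in its simplest
# mean-only form: the adapted mean of each Gaussian is a weighted combination
# of the cluster means, mu_adapted = sum_c lambda_c * mu_c.
import numpy as np

def adapt_means(cluster_means, weights):
    """cluster_means: (n_clusters, n_gaussians, dim); weights: (n_clusters,)."""
    weights = np.asarray(weights, dtype=float)
    return np.tensordot(weights, cluster_means, axes=1)   # -> (n_gaussians, dim)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.normal(0, 1, (4, 1000, 39))      # 4 clusters, 1000 Gaussians, 39-dim features
    lam = [0.5, 0.3, 0.1, 0.1]               # speaker-specific interpolation weights
    mu = adapt_means(M, lam)
    print(mu.shape)                           # (1000, 39)
```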

12.
13.
14.
Some applications of speech recognition, such as automatic directory information services, require very large vocabularies. In this paper, we focus on the task of recognizing surnames in an Interactive telephone-based Directory Assistance Services (IDAS) system, which exceeds other large-vocabulary applications in terms of complexity and vocabulary size. We present a method for building compact networks that reduce the search space in very large vocabularies using Directed Acyclic Word Graphs (DAWGs). Furthermore, trees, graphs and full forms (whole words with no merging of nodes) are compared in a straightforward way under the same conditions, using the same decoder and the same vocabularies. Experimental results showed that, as we move from full-form lexicons to trees and then to graphs, the size of the recognition network is reduced, as is the recognition time. However, recognition accuracy is retained, since the same phoneme combinations are involved. Subsequently, we refine the N-best hypothesis list provided by the speech recognizer by applying context-dependent phonological rules. Thus, a small number N in the N-best list produces multiple solutions, sufficient to retain high accuracy while achieving real-time response. Recognition tests with a vocabulary of 88,000 surnames corresponding to 123,313 distinct pronunciations proved the efficiency of the approach. For N = 3 (a value that ensures fast performance), recognition accuracy before the application of the rules was 70.27%; after applying the phonological rules, it rose to 86.75%.
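
The shrinkage obtained when moving from full-form lexicons to trees and then to graphs can be illustrated by building a pronunciation trie and then merging nodes with identical right languages (suffix sets) into a DAWG-like minimal network. The construction below is a generic simplification for illustration, not the IDAS system's network builder, and the tiny pronunciation list is made up.

```python
# Sketch: build a pronunciation trie, then count the states left after merging
# nodes with identical right languages, i.e. a DAWG-like minimal network.
class Node:
    def __init__(self):
        self.children = {}
        self.final = False

def build_trie(prons):
    root = Node()
    for pron in prons:
        node = root
        for phone in pron:
            node = node.children.setdefault(phone, Node())
        node.final = True
    return root

def all_nodes(root):
    stack, out, visited = [root], [], set()
    while stack:
        node = stack.pop()
        if id(node) in visited:
            continue
        visited.add(id(node))
        out.append(node)
        stack.extend(node.children.values())
    return out

def signature(node, memo):
    # Canonical, memoised description of the sub-network below `node`; two trie
    # nodes with equal signatures can be merged into one DAWG state.
    if id(node) not in memo:
        memo[id(node)] = (node.final, tuple(sorted(
            (phone, signature(child, memo)) for phone, child in node.children.items())))
    return memo[id(node)]

if __name__ == "__main__":
    prons = [("p", "a", "p", "a", "s"), ("p", "a", "p", "o", "s"),
             ("l", "a", "p", "a", "s"), ("l", "a", "p", "o", "s")]
    trie = build_trie(prons)
    nodes, memo = all_nodes(trie), {}
    print("trie nodes :", len(nodes))                                # 15
    print("DAWG states:", len({signature(n, memo) for n in nodes}))  # 6
```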

15.
16.
Oral discourse is the primary form of human–human communication; hence, computer interfaces that communicate via unstructured spoken dialogues will presumably provide a more efficient, meaningful, and naturalistic interaction experience. Within the context of learning environments, there are theoretical positions supporting a speech facilitation hypothesis, which predicts that spoken tutorial dialogues will increase learning more than typed dialogues. We evaluated this hypothesis in an experiment in which 24 participants learned computer literacy via both a spoken and a typed conversation with AutoTutor, an intelligent tutoring system with conversational dialogues. The results indicated that (a) enhanced content coverage was achieved in the spoken condition; (b) learning gains for both modalities were on par and greater than a no-instruction control; (c) although speech recognition errors were unrelated to learning gains, they were linked to participants' evaluations of the tutor; (d) participants adjusted their conversational styles when speaking compared to typing; (e) semantic and statistical natural language understanding approaches to comprehending learners' responses were more resilient to speech recognition errors than syntactic and symbolic-based approaches; and (f) simulated speech recognition errors had differential impacts on the fidelity of different semantic algorithms. We discuss the impact of our findings on the speech facilitation hypothesis and on human–computer interfaces that support spoken dialogues.
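
Finding (e), that statistical/semantic matchers degrade more gracefully under recognition errors than exact symbolic matching, can be illustrated with a toy comparison. The expected answer, the simulated substitution error, and the bag-of-words cosine matcher below are illustrative assumptions, not AutoTutor's actual algorithms.

```python
# Toy illustration: an exact match breaks on a single ASR substitution, while
# a bag-of-words cosine similarity degrades only slightly.
from collections import Counter
import math

def cosine(a, b):
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

expected = "the operating system manages memory and processes"
clean    = "the operating system manages memory and processes"
asr      = "the operating system manages memory and practices"   # one substitution error

for name, hyp in [("clean", clean), ("ASR", asr)]:
    print(f"{name:5s}  exact match: {hyp == expected}   cosine: {cosine(expected, hyp):.2f}")
```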

17.
SpeechActs is a prototype testbed for developing spoken natural language applications. In developing SpeechActs, our primary goal was to enable software developers without special expertise in speech or natural language to create effective conversational speech applications, that is, applications with which users can speak naturally, as if they were conversing with a personal assistant. We also wanted SpeechActs applications to work with one another without requiring that each have specific knowledge of the other applications running in the same suite. A discourse management component is necessary to embody the information that allows such a natural conversational flow. Because technology changes so rapidly, we also did not want to tie developers to specific speech recognizers or synthesizers; we wanted them to be able to use these speech technologies as plug-in components.
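
The goal of not tying developers to a particular recognizer or synthesizer is usually met with a thin engine-neutral interface that concrete engines implement. The sketch below shows that shape in a purely hypothetical form; none of these class or method names come from SpeechActs, whose actual API is not described in the abstract.

```python
# Sketch of an engine-neutral speech interface so applications never call a
# specific recognizer or synthesizer directly. All names are hypothetical.
from abc import ABC, abstractmethod

class Recognizer(ABC):
    @abstractmethod
    def recognize(self, audio: bytes) -> str: ...

class Synthesizer(ABC):
    @abstractmethod
    def speak(self, text: str) -> bytes: ...

class EchoRecognizer(Recognizer):
    """Stand-in engine for testing; a real plug-in would wrap a vendor SDK."""
    def recognize(self, audio: bytes) -> str:
        return audio.decode("utf-8", errors="ignore")

class LoggingSynthesizer(Synthesizer):
    def speak(self, text: str) -> bytes:
        print("TTS:", text)
        return text.encode("utf-8")

class ConversationalApp:
    def __init__(self, recognizer: Recognizer, synthesizer: Synthesizer):
        self.recognizer, self.synthesizer = recognizer, synthesizer
    def turn(self, audio: bytes) -> bytes:
        utterance = self.recognizer.recognize(audio)
        return self.synthesizer.speak(f"You said: {utterance}")

if __name__ == "__main__":
    app = ConversationalApp(EchoRecognizer(), LoggingSynthesizer())
    app.turn(b"read my mail")
```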

18.
In the last few years, growing attention has been paid to the problem of human-human communication, with attempts to devise artificial systems able to mediate a conversational setting between two or more people. In this paper, we propose an automatic system based on a generative structure able to classify dialog scenarios. The generative model is composed by integrating a Gaussian mixture model and an (observed) Markovian influence model, and it is fed with a novel low-level acoustic feature termed the steady conversational period (SCP). SCPs are built from the durations of continuous slots of silence or speech, also taking conversational turn-taking into account. The interactional dynamics built upon the transitions among SCPs provide a behavioral blueprint of conversational settings without relying on segmental or continuous phonetic features, and may be important for predicting the evolution of typical conversational situations in different dialog scenarios. The model has been tested on an extensive set of real dyadic and multi-person conversational settings, including a recent dyadic dataset and the AMI meeting corpus. Comparative tests made using conventional acoustic features and classification methods show that the proposed scheme provides superior classification performance for all conversational settings in our datasets. Moreover, we show that our approach is able to characterize the nature of multi-person conversation (namely, the roles of the participants) very accurately, demonstrating great versatility.
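
The steady conversational period is built from durations of contiguous silence/speech slots and their turn-taking context, but the abstract does not give its exact definition. The sketch below therefore only extracts run-length duration features from per-speaker voice-activity labels as a rough, assumed approximation of that idea; the joint-state coding and frame rate are placeholders.

```python
# Rough sketch: derive duration features from per-speaker voice-activity labels
# in a dyadic conversation, in the spirit of steady conversational periods.
# The joint-state coding and frame rate are assumptions, not the paper's SCPs.
FRAME_SEC = 0.01

def joint_states(vad_a, vad_b):
    # 0: both silent, 1: only A speaks, 2: only B speaks, 3: overlap
    return [a + 2 * b for a, b in zip(vad_a, vad_b)]

def steady_periods(states):
    """Collapse the frame-level joint state into (state, duration_sec) runs."""
    runs, start = [], 0
    for i in range(1, len(states) + 1):
        if i == len(states) or states[i] != states[start]:
            runs.append((states[start], (i - start) * FRAME_SEC))
            start = i
    return runs

if __name__ == "__main__":
    vad_a = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
    vad_b = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
    runs = steady_periods(joint_states(vad_a, vad_b))
    print(runs)   # [(1, 0.03), (0, 0.01), (2, 0.03), (3, 0.01), (1, 0.01), (0, 0.01)]
    # Duration sequences like these would then be modelled, e.g. with a Gaussian
    # mixture plus a Markov model over the run transitions.
```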

19.
In this survey, we argue that using structured vocabularies is crucial to the success of image annotation. We analyze the literature on image annotation uses and user needs, and we stress the need for automatic annotation. We briefly describe the difficulties this task poses for machines and how it relates to controlled vocabularies. We survey contributions in the field, showing how structures are introduced. First we present studies that use unstructured vocabularies, focusing on those that introduce links between categories or between features. Then we review work using structured vocabularies as an input and analyze how the structure is exploited.

20.
An accurate estimation of sentence units (SUs) in spontaneous speech is important for (1) helping listeners to better understand speech content and (2) supporting other natural language processing tasks that require sentence information. There has been much research on automatic SU detection; however, most previous studies have used only lexical and prosodic cues, not nonverbal cues such as gesture. Gestures play an important role in human conversations, including providing semantic content, expressing emotional status, and regulating conversational structure. Given the close relationship between gestures and speech, gestures may provide additional contributions to automatic SU detection. In this paper, we have investigated the use of gesture cues for enhancing SU detection. In particular, we have focused on: (1) collecting multimodal data resources involving gestures and SU events in human conversations, (2) analyzing the collected data sets to enrich our knowledge about the co-occurrence of gestures and SUs, and (3) building statistical models for detecting SUs using speech and gestural cues. Our data analyses suggest that some gesture patterns influence a word boundary's probability of being an SU. On the basis of these analyses, a set of novel gestural features was proposed for SU detection. A combination of speech and gestural features was found to provide more accurate SU predictions than using only speech features in discriminative models. The findings in this paper support the view that human conversations are processes involving multimodal cues, and so they are more effectively modeled using information from both verbal and nonverbal channels.
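
Combining speech and gestural cues in a discriminative model amounts to classifying each word boundary from a concatenated feature vector. The sketch below uses synthetic data and a logistic-regression classifier purely for illustration; the feature set, labels, and model choice are assumptions, not the paper's system.

```python
# Sketch: classify each word boundary as SU / non-SU from concatenated speech
# (e.g. pause duration, pitch reset) and gesture (e.g. hands-at-rest) features.
# The features, synthetic data, and classifier choice are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400
pause = rng.exponential(0.2, n)            # seconds of silence after the word
pitch_reset = rng.normal(0, 1, n)          # normalised F0 change across the boundary
gesture_rest = rng.integers(0, 2, n)       # 1 if the hands return to rest near the boundary

# Synthetic labels: boundaries with long pauses and a gesture rest are more
# often sentence-unit boundaries.
logit = 4 * pause + 0.5 * pitch_reset + 1.5 * gesture_rest - 2.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_speech = np.column_stack([pause, pitch_reset])
X_all = np.column_stack([pause, pitch_reset, gesture_rest])

for name, X in [("speech only", X_speech), ("speech + gesture", X_all)]:
    clf = LogisticRegression().fit(X[:300], y[:300])
    print(f"{name:17s} accuracy: {clf.score(X[300:], y[300:]):.2f}")
```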
