Similar Articles
20 similar articles retrieved.
1.
This paper designs and implements a fully automatic Chinese news caption generation system whose input is a news video and whose output is the caption text for that video. Using 《新闻联播》 as the corpus, the system implements audio extraction, audio classification and segmentation, speaker identification, large-vocabulary continuous speech recognition, video playback, and automatic generation of text captions. Automatic caption generation avoids the laborious and time-consuming process of adding captions by hand. Experiments show that the system achieves a high recognition rate and can meet the needs of the hearing-impaired and other...
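A minimal, hypothetical sketch of the pipeline shape this abstract describes (audio extraction → segmentation → LVCSR → caption text). Everything below is illustrative: the segmentation and recognition functions are stubs standing in for the real modules, and only the ffmpeg invocation is concrete.

```python
import subprocess
from typing import List, Tuple

def extract_audio(video_path: str, wav_path: str) -> str:
    """Extract a 16 kHz mono audio track from the news video via ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )
    return wav_path

def segment_audio(wav_path: str) -> List[Tuple[float, float]]:
    """Split audio into speech segments (start, end) in seconds.
    Placeholder for the audio classification/segmentation module."""
    return [(0.0, 5.0), (5.0, 11.2)]

def recognize(wav_path: str, segment: Tuple[float, float]) -> str:
    """Placeholder for large-vocabulary continuous speech recognition."""
    return "<recognized text>"

def generate_captions(video_path: str) -> List[Tuple[float, float, str]]:
    """Return (start, end, caption) triples for the whole video."""
    wav = extract_audio(video_path, "news_audio.wav")
    return [(s, e, recognize(wav, (s, e))) for s, e in segment_audio(wav)]
```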

2.
Understanding the scene content of a video sequence is very important for content-based indexing and retrieval in multimedia databases. Research in this area in the past several years has focused on the use of speech recognition and image analysis techniques. As a complementary effort to the prior work, we have focused on using the associated audio information (mainly the nonspeech portion) for video scene analysis. As an example, we consider the problem of discriminating five types of TV programs, namely commercials, basketball games, football games, news reports, and weather forecasts. A set of low-level audio features is proposed for characterizing the semantic contents of short audio clips. The linear separability of different classes under the proposed feature space is examined using a clustering analysis. The effective features are identified by evaluating the intracluster and intercluster scattering matrices of the feature space. Using these features, a neural net classifier was successful in separating the above five types of TV programs. By evaluating the changes between the feature vectors of adjacent clips, we can also identify scene breaks in an audio sequence quite accurately. These results demonstrate the capability of the proposed audio features for characterizing the semantic content of an audio sequence.
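The intracluster/intercluster scatter analysis mentioned above has a standard textbook form; the sketch below computes both matrices and a common separability score on synthetic stand-in features (the feature values and five-class labels are invented for illustration).

```python
import numpy as np

def scatter_matrices(X: np.ndarray, y: np.ndarray):
    """Return the within-class (Sw) and between-class (Sb) scatter matrices."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)
        diff = (mu_c - mu).reshape(-1, 1)
        Sb += len(Xc) * (diff @ diff.T)
    return Sw, Sb

# Synthetic 4-dim "audio features" for five TV-program classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(i, 1.0, size=(20, 4)) for i in range(5)])
y = np.repeat(np.arange(5), 20)
Sw, Sb = scatter_matrices(X, y)
# A common separability score: larger trace(Sw^-1 Sb) means the classes
# scatter apart relative to their internal spread.
print(np.trace(np.linalg.pinv(Sw) @ Sb))
```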

3.
4.
To perform audio-visual speech recognition while accurately segmenting the audio and video streams at the phoneme level, this paper proposes a new multi-stream asynchronous triphone dynamic Bayesian network (MM-ADBN-TRI) model, which describes the asynchrony between the audio and video streams at the word level. Both streams adopt a word-triphone-state-observation hierarchy; the recognition unit is the triphone, which captures coarticulation in continuous speech. Experimental results show that the model performs well in audio-visual speech recognition, in phoneme-level segmentation of the audio and video streams, and in determining the asynchrony between the two streams.

5.
Speech and language technologies for audio indexing and retrieval
With the advent of essentially unlimited data storage capabilities and with the proliferation of the use of the Internet, it becomes reasonable to imagine a world in which it would be possible to access any of the stored information at will with a few keystrokes or voice commands. Since much of this data will be in the form of speech from various sources, it becomes important to develop the technologies necessary for indexing and browsing such audio data. This paper describes some of the requisite speech and language technologies that would be required and introduces an effort aimed at integrating these technologies into a system, called Rough `n' Ready, which indexes speech data, creates a structural summarization, and provides tools for browsing the stored data. The technologies highlighted in the paper include speaker-independent continuous speech recognition, speaker segmentation and identification, name spotting, topic classification, story segmentation, and information retrieval. The system automatically segments the continuous audio input stream by speaker, clusters audio segments from the same speaker, identifies speakers known to the system, and transcribes the spoken words. It also segments the input stream into stories, based on their topic content, and locates the names of persons, places, and organizations. These structural features are stored in a database and are used to construct highly selective search queries for retrieving specific content from large audio archives.
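One building block the abstract names, clustering audio segments from the same speaker, can be sketched as agglomerative clustering over per-segment speaker embeddings. The embeddings below are random stand-ins, and the assumption that the number of speakers is known (three) is for illustration only; a real system would estimate it.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
# Pretend each row is a speaker embedding of one audio segment;
# three synthetic "speakers" with ten segments each.
segments = np.vstack([rng.normal(m, 0.3, size=(10, 16)) for m in (0.0, 2.0, 4.0)])

Z = linkage(segments, method="ward")
speaker_ids = fcluster(Z, t=3, criterion="maxclust")  # assume 3 speakers
print(speaker_ids)  # segments sharing an id are attributed to one speaker
```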

6.
In stress speech recognition, a recognition model capable of processing speech under multiple stress conditions needs to be designed with both accuracy and add-ability in mind. This paper proposes addable stress speech recognition with a multiplexing hidden Markov model (HMM). To handle multiple stress conditions, we propose a multiplexing topology that combines multiple stress speech models. Since each stress affects speech in a different way, a speech recognition model specifically trained to recognize words affected by that stress helps improve recognition rates. However, since each stress speech model yields its own independently recognized word, an effective decision module is needed to choose the correct word. In each stress speech model, MFCCs are computed from the input speech and fed into an HMM that is segmented into N parts. Each segment provides its own tentative recognized word, which in turn is an input to the proposed non-training decision module. Based on the tentative recognized words from the segments of all stress speech models, the final word is decided coarse-to-fine by a majority vote, a segment-weighted difference-square score, and a next-best score, respectively. Besides neutral speech, the proposed method was verified on three stresses: angry, loud, and Lombard. The results showed that the proposed method achieved a 94.7% recognition rate, compared with 94.2% for the training-based decision method.
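A minimal sketch of just the coarse stage of that decision module, a majority vote over the tentative words from all segments of all stress models. The strict-majority threshold and the deferral behavior are assumptions; the finer stages (segment-weighted difference-square score, next-best score) are only referenced in comments.

```python
from collections import Counter
from typing import List, Optional

def majority_vote(tentative_words: List[str]) -> Optional[str]:
    """Return the word with a strict majority, or None to defer to the
    finer stages (segment-weighted difference-square score, next-best)."""
    word, count = Counter(tentative_words).most_common(1)[0]
    return word if count > len(tentative_words) / 2 else None

# Tentative words from, say, N=3 segments of four stress-specific models.
votes = ["left", "left", "lift", "left", "left", "lend",
         "left", "lift", "left", "left", "left", "left"]
print(majority_vote(votes))  # -> "left"
```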

7.
The rapid increase in the amount of digital audio collections, which present various formats, types, durations, and other parameters in the digital multimedia world, demands a generic framework for robust and efficient indexing and retrieval based on aural content. Moreover, from the content-based multimedia retrieval point of view, the audio information can be even more important than the visual part, as it is mostly unique and significantly stable over the entire duration of the content. A generic and robust audio-based multimedia indexing and retrieval framework, developed and tested under the MUVIS system, is presented. This framework supports the dynamic integration of audio feature extraction modules during the indexing and retrieval phases and therefore provides a test-bed platform for developing robust and efficient aural feature extraction techniques. Furthermore, the proposed framework is designed around high-level content classification and segmentation in order to improve the speed and accuracy of aural retrievals. Both theoretical and experimental results are presented, including comparative measures of retrieval performance with respect to the visual counterpart.

8.
To address the problem that radio signal reception still depends on manned monitoring and has not yet been automated, this paper proposes, on the theoretical basis of software-defined radio, an overall technical scheme for a radio audio acquisition and management system, and designs an external audio processing terminal for the radio. While the radio operates, the terminal automatically acquires, records, recognizes, and plays back the received audio signals, improving reception efficiency. The terminal consists of a general-purpose software-defined radio module, a Morse audio signal processing module, and a speech recognition module; the modules are mutually independent, giving the design a degree of generality and practicality.

9.
The primary advances in speech and audio signal processing that contributed to the maturing of multimedia applications are discussed in the areas of speech and audio signal compression, speech synthesis, acoustic processing and echo control, and network echo cancellation.

10.
Automatic segmentation and classification algorithms for broadcast news transcription
吕萍  颜永红 《电子与信息学报》2006,28(12):2292-2295
This paper introduces the automatic segmentation and classification algorithms used in a Chinese broadcast news transcription task. A three-stage automatic segmentation system is proposed, which splits the audio stream into easily recognizable segments through coarse segmentation, fine segmentation, and smoothing. For the fine-segmentation stage, two algorithms are proposed: a dynamic noise-tracking segmentation algorithm and a segmentation algorithm based on monophone decoding. Following methods used in speaker identification, a classification algorithm based on Gaussian mixture models is proposed, which handles the multi-class decision problem for audio segments well. Experimental results on 《新闻联播》 test data show that the proposed automatic segmentation and classification algorithms perform almost as well as manual segmentation and classification.
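A hedged sketch of the GMM classification idea: fit one Gaussian mixture per audio class and assign a segment to the class whose model scores its frames highest. The MFCC-like features, class names, and mixture size below are all invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
classes = ["speech", "music", "noise"]
train = {c: rng.normal(i * 3.0, 1.0, size=(300, 13)) for i, c in enumerate(classes)}

# One GMM per audio class, fit on that class's (fake) training frames.
models = {c: GaussianMixture(n_components=2, random_state=0).fit(X)
          for c, X in train.items()}

def classify(segment_frames: np.ndarray) -> str:
    """Assign the segment to the class whose GMM gives the highest
    average log-likelihood over the segment's frames."""
    return max(models, key=lambda c: models[c].score(segment_frames))

test_segment = rng.normal(3.0, 1.0, size=(50, 13))  # drawn like "music"
print(classify(test_segment))
```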

11.
12.
13.
Visual feature extraction is a hot topic in audio-visual speech recognition research. This paper introduces a robust dynamic mouth-shape feature based on visemic LDA, which fully accounts for the variation of mouth contours during articulation and for the partition into visual visemes. The paper also proposes a method that uses speech recognition results to automatically label the LDA training data, which removes the heavy manual annotation workload and avoids labeling errors. Experiments show that introducing the visemic LDA visual features into audio-visual speech recognition greatly improves recognition rates under noisy conditions; after combining these visual features with a multi-stream HMM, the recognition rate still exceeds 80% under strong noise at a signal-to-noise ratio of 10 dB.
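A minimal sketch, on synthetic data, of the LDA step implied above: projecting mouth-contour descriptors into a viseme-discriminative space. The descriptor dimensionality, viseme count, and labels are invented; per the paper's idea, the labels would come from ASR-based automatic annotation rather than manual labeling.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
n_visemes = 6
# Fake mouth-contour descriptors (e.g. sampled lip-contour coordinates).
X = np.vstack([rng.normal(i, 1.0, size=(80, 20)) for i in range(n_visemes)])
y = np.repeat(np.arange(n_visemes), 80)

lda = LinearDiscriminantAnalysis(n_components=n_visemes - 1).fit(X, y)
visual_features = lda.transform(X)  # viseme-discriminative dynamic features
print(visual_features.shape)        # (480, 5)
```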

14.
This paper presents a novel audio and video information fusion approach that greatly improves automatic recognition of people in video sequences. To that end, audio and video information is first used independently to obtain confidence values that indicate the likelihood that a specific person appears in a video shot. Finally, a post-classifier is applied to fuse audio and visual confidence values. The system has been tested on several news sequences and the results indicate that a significant improvement in the recognition rate can be achieved when both modalities are used together.
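A hypothetical sketch of the late-fusion step: a post-classifier that maps per-shot (audio confidence, visual confidence) pairs to a final decision about whether the person appears. Logistic regression is one plausible choice of post-classifier, not necessarily the authors'; the confidence values are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 300
labels = rng.integers(0, 2, size=n)  # 1 = target person appears in the shot
# Fake per-shot modality confidences that loosely track the label.
audio_conf = 0.6 * labels + 0.4 * rng.random(n)
video_conf = 0.7 * labels + 0.3 * rng.random(n)
X = np.column_stack([audio_conf, video_conf])

post_classifier = LogisticRegression().fit(X, labels)
# Fused probability that the person appears in a new shot:
print(post_classifier.predict_proba([[0.8, 0.7]])[0, 1])
```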

15.
马勇  鲍长春 《信号处理》2013,29(9):1190-1199
Speaker diarization (segmentation and clustering) is a speech signal processing research direction that has emerged in recent years. It studies how to determine the start and end times of each of multiple speakers in a continuous speech stream and how to label each speech segment with its speaker. This research is important for automatic speech recognition, multi-speaker recognition, and content-based audio analysis. According to how segmentation and clustering are carried out, this paper reviews, from the perspectives of asynchronous and synchronous strategies, the mainstream algorithms, techniques, and representative systems of the past decade at home and abroad, compares the results of representative systems in recent NIST Rich Transcription evaluations, and finally discusses remaining problems and future directions.
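One classic building block of the diarization systems such a survey covers is Bayesian information criterion (BIC) change-point detection, which asks whether a feature window is better modeled by one Gaussian or by two Gaussians split at a candidate point. The sketch below uses synthetic features and a standard ΔBIC formula; it is illustrative, not any particular surveyed system.

```python
import numpy as np

def delta_bic(X: np.ndarray, split: int, lam: float = 1.0) -> float:
    """Positive values favor a speaker change at frame `split`."""
    n, d = X.shape

    def logdet(A: np.ndarray) -> float:
        return np.linalg.slogdet(np.cov(A, rowvar=False))[1]

    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet(X)
            - 0.5 * split * logdet(X[:split])
            - 0.5 * (n - split) * logdet(X[split:])
            - lam * penalty)

rng = np.random.default_rng(5)
two_speakers = np.vstack([rng.normal(0, 1, (200, 12)), rng.normal(2, 1, (200, 12))])
one_speaker = rng.normal(0, 1, (400, 12))
print(delta_bic(two_speakers, 200) > 0)  # True: change detected
print(delta_bic(one_speaker, 200) > 0)   # typically False: no change
```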

16.
Results from a series of experiments that use neural networks to process the visual speech signals of a male talker are presented. In these preliminary experiments, the results are limited to static images of vowels. It is demonstrated that these networks are able to extract speech information from the visual images and that this information can be used to improve automatic vowel recognition. The structure of speech and its corresponding acoustic and visual signals are reviewed. The specific data that was used in the experiments along with the network architectures and algorithms are described. The results of integrating the visual and auditory signals for vowel recognition in the presence of acoustic noise are presented.
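A toy version, on synthetic features, of the integration experiment the abstract describes: a small neural network classifying five vowels from noisy "acoustic" features alone versus acoustic plus "visual" features. Feature dimensions, noise levels, and network size are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(6)
n_per_vowel, n_vowels = 100, 5
y = np.repeat(np.arange(n_vowels), n_per_vowel)
# "Acoustic" features made heavily noisy, "visual" features cleaner.
audio = np.vstack([rng.normal(i, 4.0, (n_per_vowel, 10)) for i in range(n_vowels)])
visual = np.vstack([rng.normal(i, 1.0, (n_per_vowel, 8)) for i in range(n_vowels)])

for name, X in (("audio only", audio), ("audio+visual", np.hstack([audio, visual]))):
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
    net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                        random_state=0).fit(Xtr, ytr)
    print(name, round(net.score(Xte, yte), 3))  # fusion should score higher
```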

17.
18.
This paper describes a study on automatic music segmentation and summarization from audio signals. The paper inquires scientifically into the nature of human perception of music and offers a practical solution to difficult problems of machine intelligence for automated multimedia content analysis and information retrieval. Specifically, three problems are addressed: segmentation based on tonality analysis, segmentation based on recurrent structural analysis, and summarization. Experimental results are evaluated quantitatively, demonstrating the promise of the proposed methods.
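The recurrent-structure analysis mentioned above is often visualized with a self-similarity matrix, where off-diagonal stripes mark repeated sections. A minimal sketch on fake frame features (a real system might use chroma vectors) with an idealized A-B-A structure:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(40, 12))   # pretend section A (e.g. a verse)
B = rng.normal(size=(30, 12))   # pretend section B (e.g. a chorus)
frames = np.vstack([A, B, A])   # a "song" with idealized A-B-A structure

# Cosine self-similarity matrix over frame features.
normed = frames / np.linalg.norm(frames, axis=1, keepdims=True)
ssm = normed @ normed.T

# The stripe starting at (70, 0) shows frames 70..109 repeating frames
# 0..39, i.e. the A section recurs; segmentation follows such stripes.
stripe = np.mean([ssm[70 + i, i] for i in range(40)])
print(f"repeat stripe: {stripe:.2f} vs background: {ssm.mean():.2f}")
```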

19.
This paper studies a new language identification algorithm that combines acoustic information with phonotactic information. First, the speech is automatically segmented during preprocessing. At the feature level, a segment-level feature parameter carrying long-term information, the segment-level shifted delta cepstrum, is introduced. At the model level, a Gaussian mixture model (GMM) automatically tokenizes the speech signal into a symbol sequence, and a multi-gram language model (MLM) is then introduced to model the phonotactic information. Finally, the GMM score and the MLM score are fed into a back-end multi-class support vector machine to obtain the final recognition result. Experiments show that the new system requires no manually labeled corpus, recognizes quickly, and achieves an open-set correct identification rate of 78.84% on five languages from the OGI standard corpus.
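A hedged sketch of the shifted delta cepstra (SDC) feature named above, using the common N-d-P-k parameterization; the parameter values (7-1-3-7) and the random stand-in cepstra are assumptions, not necessarily the paper's settings.

```python
import numpy as np

def sdc(cepstra: np.ndarray, d: int = 1, P: int = 3, k: int = 7) -> np.ndarray:
    """At each frame, stack k delta-cepstra blocks, each shifted by P frames
    (edge frames are clamped to index 0)."""
    T, N = cepstra.shape
    out = []
    for t in range(T - (k - 1) * P - d):
        blocks = [cepstra[t + i * P + d] - cepstra[max(t + i * P - d, 0)]
                  for i in range(k)]
        out.append(np.concatenate(blocks))
    return np.array(out)

frames = np.random.default_rng(8).normal(size=(200, 7))  # stand-in 7-dim cepstra
print(sdc(frames).shape)  # (181, 49): k*N-dimensional long-span features
```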

20.
于春雪 《电声技术》2012,36(1):55-59,73
An embedded system is built around the ARM processor S3C2440A, with the UDA1341TS audio chip encoding and decoding the speech signal, and speech recognition technology is applied to achieve voice control. The paper introduces the system's design principles and working mechanism, describes the hardware and software design of the control menu and the principle of the recognition algorithm, and gives the test method. Experimental results show that the system achieves voice control of specific commands with a high recognition rate and good real-time performance, and can adapt to complex working environments.
