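This paper designs and implements a fully automatic Chinese news subtitle generation system that takes a news video as input and outputs the corresponding subtitle text. Using 《新闻联播》 (Xinwen Lianbo) as the corpus, the system implements audio extraction, audio classification and segmentation, speaker identification, large-vocabulary continuous speech recognition, video playback, and automatic generation of text subtitles. Automatic subtitle generation avoids the laborious and time-consuming process of adding subtitles by hand. Experiments show that the system achieves a high recognition rate and can satisfy hearing-impaired and other ...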

A system capable of producing near video-realistic animation of a speaker given only speech inputs is presented. The audio input is a continuous speech signal, requires no phonetic labelling and is speaker-independent. The system requires only a short video training corpus of a subject speaking a list of viseme-targeted words in order to achieve convincing realistic facial synthesis. The system learns the natural mouth and face dynamics of a speaker to allow new facial poses, unseen in the training video, to be synthesised. To achieve this the authors have developed a novel approach which utilises a hierarchical and nonlinear principal components analysis (PCA) model which couples speech and appearance. Animation of different facial areas, defined by the hierarchy, is performed separately and merged in post-processing using an algorithm which combines texture and shape PCA data. It is shown that the model is capable of synthesising videos of a speaker using new audio segments from both previously heard and unheard speakers.

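With the rapid rise of streaming media applications, content-based media retrieval and management has become an active research topic. News programs are a common form of television programming, and automatically extracting and retrieving their topics is of real practical value. Starting from the audio of TV news programs, this paper combines studio/non-studio speech classification, speaker change point detection, and speaker clustering to retrieve and cluster the topics of TV news programs. Experiments show that the method finds more than 96% of the studio segments in news programs and classifies them accurately, demonstrating its feasibility and potential value.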

This paper describes an indexing system that automatically creates metadata for multimedia broadcast news content by integrating audio, speech, and visual information. The automatic multimedia content indexing system includes acoustic segmentation (AS), automatic speech recognition (ASR), topic segmentation (TS), and video indexing features. The new spectral-based features and smoothing method in the AS module improved speech detection in the audio stream of the input news content. In the speech recognition module, automatic selection of acoustic models achieved both a low WER, as with parallel recognition using multiple acoustic models, and fast recognition, as with a single acoustic model. The TS method using word concept vectors achieved more accurate results than the conventional method using local word frequency vectors. The information integration module integrates the results from the AS module, TS module, and SC module. Story boundary detection accuracy improved when the TS results were combined with the AS results and the SC results, compared with the TS results alone.

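To achieve audio-visual speech recognition together with accurate phone segmentation of both the audio and video streams, this paper proposes a new multi-stream asynchronous triphone dynamic Bayesian network (MM-ADBN-TRI) model. The model describes the asynchrony between the audio and video streams at the word level. Both streams use a word-triphone-state-observation vector hierarchy, and the recognition unit is the triphone, which captures coarticulation in continuous speech. Experimental results show that the model performs well in audio-visual speech recognition, in phone segmentation of the audio and video streams, and in determining the asynchrony between the two streams.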

马勇  鲍长春 《信号处理》2013,29(9):1190-1199
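Speaker diarization (speaker segmentation and clustering) is a research direction in speech signal processing that has emerged in recent years. It studies how to determine the start and end times of multiple speakers in a continuous speech stream and how to label each speech segment with its speaker. This research matters for automatic speech recognition, multi-speaker recognition, and content-based audio analysis. According to how segmentation and clustering are carried out, this paper reviews the mainstream algorithms, techniques, and representative systems developed in China and abroad over the past decade from the perspective of asynchronous and synchronous strategies, compares the results of representative systems in recent NIST Rich Transcription evaluations, discusses the remaining open problems, and gives an outlook on future development.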

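Automatic segmentation and classification algorithms for broadcast news corpus recognition   Total citations: 1, self-citations: 0, citations by others: 1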
吕萍  颜永红 《电子与信息学报》2006,28(12):2292-2295
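This paper introduces automatic segmentation and automatic classification algorithms for the Chinese broadcast news recognition task. A three-stage automatic segmentation system is proposed: coarse segmentation, fine segmentation, and smoothing split the audio stream into segments that are easy to recognize. For the fine segmentation stage, two algorithms are proposed: a dynamic noise tracking segmentation algorithm and a segmentation algorithm based on monophone decoding. Following methods used in speaker identification, a classification algorithm based on Gaussian mixture models is proposed, which handles the multi-class decision problem for audio segments well. Experiments on 《新闻联播》 test data show that the proposed automatic segmentation and classification algorithms perform almost as well as manual segmentation and classification.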

CNN-based speaker voiceprint recognition for continuous speech   Total citations: 1, self-citations: 0, citations by others: 1
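In recent years, as living standards have risen, the demands placed on machine recognition of human voices have grown. The Gaussian mixture model-hidden Markov model (GMM-HMM) is the most important model in speaker recognition research. Because it models large speech datasets poorly and is not robust to noise, its development has hit a bottleneck. To address this, researchers have turned to deep learning. This paper applies a CNN deep learning model to continuous-speech speaker recognition and proposes the CNN continuous speaker recognition (continuous speaker recognition of convolutional neural network, CSR-CNN) algorithm. The model extracts fixed-length speech segments in their original order to form an ordered sequence of spectrograms along the time axis, extracts feature sequences with the CNN, and measures combinations of feature sequences continuously through a reward-penalty function. Experimental results show that the CSR-CNN algorithm achieves better recognition than GMM-HMM in continuous-segment speaker recognition.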

This paper describes an audio response unit used in data communication services. The speech segments needed for responses are stored on a large-capacity magnetic drum as partial autocorrelation (PARCOR) coefficients and excitation source information. The PARCOR coefficient is a new parameter that accurately expresses the spectrum envelope of speech signals. Multiple speech signals can be synthesized simultaneously by means of a time-multiplexed digital filter composed of a high-speed arithmetic unit. The unit is able to respond with more than 7000 speech segments of 1-s duration.
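As a rough illustration of what PARCOR coefficients are, the sketch below derives them from an autocorrelation sequence with the Levinson-Durbin recursion. This is a textbook procedure, not the method of the paper above, which gives no implementation details. The function name, the example autocorrelation values, and the sign convention (reflection coefficients of the prediction-error filter A(z) = 1 + a[1] z^-1 + ...) are assumptions made for illustration; some texts define PARCOR coefficients with the opposite sign.

```python
from typing import List, Tuple


def levinson_durbin(r: List[float], order: int) -> Tuple[List[float], List[float], float]:
    """Return (reflection coefficients, predictor polynomial, residual energy).

    r holds autocorrelation lags r[0..order]. The predictor polynomial is
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.
    """
    if len(r) < order + 1:
        raise ValueError("need autocorrelation lags 0..order")
    if r[0] <= 0:
        raise ValueError("r[0] must be positive")

    a = [1.0] + [0.0] * order
    energy = float(r[0])
    reflection = []
    for i in range(1, order + 1):
        # Correlation between the current prediction error and lag i.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / energy
        # Order update of the predictor polynomial.
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        # Each stage removes a fraction k^2 of the remaining error energy.
        energy *= 1.0 - k * k
        reflection.append(k)
    return reflection, a, energy


# Hypothetical input: autocorrelation of a first-order process, r[n] = 0.5 ** n.
ks, a, e = levinson_durbin([1.0, 0.5, 0.25], order=2)
print(ks, a, e)
```

For this first-order example the first reflection coefficient is -0.5 and the second is zero, which reflects why such coefficients describe the spectral envelope compactly: stages beyond the true model order contribute nothing.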

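With the rapid development of networking and speech signal processing technology, applying speaker recognition systems to the Internet as a means of identity verification has become imperative. This paper presents a real-time speaker verification system based on TCP/IP. It follows the client/server (C/S) model and uses TCP/IP, with the aim of enabling a voice login system over the Internet. The system framework and the specific algorithms are described, and experimental results and their analysis are given.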

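Building on a study of traditional voice recording and playback circuits, this paper proposes an audio signal acquisition, storage, and processing system based on the AT89S52. The system uses the AT89S52 microcontroller as the controller, a keypad and LCD as the human-machine interface, and an ADC0809 to sample the audio signal. An 8 MB flash memory, K9F6408UOA, is added to store the digitized audio, and noise is removed by software filtering. Sound is produced by PWM: the audio data stored in flash sets the duty cycle of each PWM waveform, a low-pass filter separates the sound from the PWM pulses, and the result drives a loudspeaker. Experiments show that an 8 kHz sampling rate with 8-bit samples gives clear speech and reasonably good music, with a speech storage time of up to 15 min.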
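The figures quoted in the abstract above can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check, assuming uncompressed 8-bit mono samples and an 8 MB flash counted in binary megabytes; the linear mapping from a sample value to a PWM duty cycle is likewise an assumption, since the abstract does not give the exact mapping.

```python
SAMPLE_RATE_HZ = 8_000           # 8 kHz sampling rate (from the abstract)
BITS_PER_SAMPLE = 8              # 8-bit samples (from the abstract)
FLASH_BYTES = 8 * 1024 * 1024    # 8 MB flash; binary megabytes are an assumption


def bytes_per_second(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Storage rate of uncompressed mono PCM audio."""
    return sample_rate_hz * bits_per_sample // 8


def max_recording_seconds(flash_bytes: int, sample_rate_hz: int, bits_per_sample: int) -> float:
    """Longest recording that fits if the whole flash holds audio."""
    return flash_bytes / bytes_per_second(sample_rate_hz, bits_per_sample)


def pwm_duty_cycle(sample: int) -> float:
    """Assumed linear mapping of an 8-bit sample (0..255) to a duty cycle (0.0..1.0)."""
    if not 0 <= sample <= 255:
        raise ValueError("sample must be an 8-bit value")
    return sample / 255


rate = bytes_per_second(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)
limit_min = max_recording_seconds(FLASH_BYTES, SAMPLE_RATE_HZ, BITS_PER_SAMPLE) / 60
print(f"{rate} bytes/s, at most {limit_min:.1f} min of audio")
```

Under these assumptions the flash holds roughly 17.5 minutes of audio, so the reported 15 min (about 7.2 million bytes) fits with some space left over for other uses.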

Due to the recent popularization of mobile multimedia broadcasting, broadcasting continuous media data such as audio and video has attracted a great deal of attention. In general continuous media data broadcasting, clients have to wait to receive data before playing it, so various schemes to reduce waiting time have been studied. Some reduce the waiting time by dividing the data into several segments and broadcasting the preceding segments frequently over a single channel. However, by dividing the data into numerous segments and producing an effective broadcast schedule, the waiting time can be further reduced. In this paper, we propose a scheduling protocol to reduce waiting time with large-scale data segmentation.

高畅  李海峰  马琳 《信号处理》2012,28(6):851-858
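Compressed sensing theory performs compressive measurement based on the sparsity of a signal, elevating signal acquisition from sampling the signal to sensing its information; it is a revolution in signal processing. This paper proposes a method for sparse representation of speech signals based on an Uncertainty Basis Dictionary (UBD), applies compressed sensing theory to compress the sparse representation of speech, and proposes an algorithm that reconstructs the speech signal by solving a linear programming problem. Through speech recognition, speaker recognition, and emotion recognition experiments, the paper examines, from the viewpoint of content analysis, whether this compressed-sensing-based information sensing method preserves the main content of the speech signal. The results show that the accuracies of speech recognition, speaker recognition, and emotion recognition are broadly consistent with those obtained by current methods in these fields, indicating that the method captures the semantic, speaker, and emotional information in speech well.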

Considerable progress has been made in speech-recognition technology and nowhere has this progress been more evident than in the area of large-vocabulary recognition (LVR). Laboratory systems are capable of transcribing continuous speech from any speaker with average word-error rates between 5% and 10%. If speaker adaptation is allowed, then after 2 or 3 minutes of speech, the error rate will drop well below 5% for most speakers. LVR systems had been limited to dictation applications since the systems were speaker dependent and required words to be spoken with a short pause between them. However, the capability to recognize natural continuous-speech input from any speaker opens up many more applications. This article discusses the principles and architecture of LVR systems and identifies the key issues affecting their future deployment. To illustrate the various points raised, the Cambridge University HTK system is described. This system is a modern design that gives state-of-the-art performance, and it is typical of the current generation of recognition systems.
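The word-error rates quoted in the abstract above are conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the recognizer output and a reference transcript, divided by the number of reference words. The sketch below shows that standard calculation; it is not taken from the article or from HTK, the function name and example sentences are illustrative, and it assumes whitespace-separated words with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, this ratio can exceed 100% when the hypothesis is much longer than the reference.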

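Eigenphone speaker adaptation works well when plenty of adaptation data is available but overfits severely when data is scarce. To overcome this, the paper proposes a speaker adaptation algorithm based on an eigenphone speaker subspace. First, the basic principle of eigenphone speaker adaptation in speech recognition systems based on the hidden Markov model-Gaussian mixture model (HMM-GMM) is presented. Second, a speaker subspace is introduced to model the correlation between the eigenphone matrices of different speakers. Then a new eigenphone speaker subspace adaptation algorithm is obtained by estimating speaker-dependent coordinate vectors. Finally, the eigenphone speaker subspace adaptation algorithm is compared with the conventional speaker subspace adaptation algorithm. Mandarin continuous speech recognition experiments on a Microsoft corpus show that, compared with eigenphone speaker adaptation, the algorithm substantially improves performance when adaptation data is very scarce and largely overcomes overfitting. Compared with eigenvoice adaptation, it achieves lower space complexity at a small cost in performance, which makes it more practical.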

杨毅  宋辉  刘加 《电子与信息学报》2011,33(5):1234-1237
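Targeting the evaluations run by the US National Institute of Standards and Technology (NIST), this paper builds a speech processing system for speaker diarization and localization with multiple distant microphones. For the speaker diarization problem posed in the NIST Rich Transcription evaluation, an improved diarization method combining time delay estimation and clustering is proposed, which reduces complexity and improves accuracy while maintaining stability. A new algorithm that constructs a matrix equation from the time delays between adjacent array elements is also proposed, which yields the direction angles of multiple speakers. The algorithms are validated on real speech collected in a standard meeting environment: the accuracy of the diarization algorithm approaches that of current mainstream diarization systems, and the direction angle error is within 3. The results indicate that, under suitable conditions, a multiple distant microphone system can serve as a suitable speech input device for multi-speaker, multi-party meetings.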

Audio-visual integration in multimodal communication   Total citations: 7, self-citations: 0, citations by others: 7
We review recent research that examines audio-visual integration in multimodal communication. The topics include bimodality in human speech, human and automated lip reading, facial animation, lip synchronization, joint audio-video coding, and bimodal speaker verification. We also study the enabling technologies for these research topics, including automatic facial-feature tracking and audio-to-visual mapping. Recent progress in audio-visual research shows that joint processing of audio and video provides advantages that are not available when the audio and video are processed independently.

Digital speech technology is reviewed, with the emphasis on applications demanding high-quality reproduction of the speech signal. Examples of such applications are network telephony, ISDN terminals for audio teleconferencing, and systems for the storage of audio signals, which include the important subclass of wideband speech. Depending on the application, the bandwidth of input speech can vary from about 3 kHz to nearly 20 kHz. Coding for digital telephony at 4 and 8 kb/s, network quality coding at 16 kb/s, and coding for audio at 7 and 20 kHz are examined. Future directions in the field are discussed with respect to anticipated technology applications and the algorithms needed to support these technologies.
