Similar Literature
20 similar documents found (search time: 421 ms)
1.
Digital audio and video data have become an integral part of multimedia information systems. To reduce storage and bandwidth requirements, they are commonly stored in a compressed format, such as MPEG-1. Increasing amounts of MPEG-encoded audio and video documents are available online and in proprietary collections. In order to utilise them effectively, we need tools and techniques to automatically analyse, segment, and classify MPEG video content. Several techniques have been developed in both the audio and visual domains to analyse videos. This paper presents a survey of audio and visual analysis techniques on MPEG-1 encoded media that are useful in supporting a variety of video applications. Although audio and visual feature analyses have been carried out extensively, they become useful to applications only when they convey the semantic meaning of the video content. Therefore, we also present a survey of works that provide semantic analysis of MPEG-1 encoded videos.

2.
An overview of audio information retrieval (cited 28 times: 0 self-citations, 28 by others)
The problem of audio information retrieval is familiar to anyone who has returned from vacation to find an answering machine full of messages. While there is not yet an “AltaVista” for the audio data type, many workers are finding ways to automatically locate, index, and browse audio using recent advances in speech recognition and machine listening. This paper reviews the state of the art in audio information retrieval, and presents recent advances in automatic speech recognition, word spotting, speaker and music identification, and audio similarity with a view towards making audio less “opaque”. A special section addresses intelligent interfaces for navigating and browsing audio and multimedia documents, using automatically derived information to go beyond the tape recorder metaphor.

3.
This paper presents a study on implementing the ASR‐based CALL (computer‐assisted language learning based upon automatic speech recognition) system embedded with both formative and summative feedback approaches and using implicit and explicit strategies to enhance adult and young learners' English pronunciation. Two groups of learners including 18 adults and 16 seventh graders participated in the study. The results indicate that the formative feedback had a positive impact on improving the learners' speaking articulation, and the summative feedback aided the learners' self‐reflection and helped them to track their speaking progress. Furthermore, the implicit information such as model pronunciation with full sentences and audio recast benefitted the adult learners, whereas the young learners preferred the explicit learning information such as textual information of individual words for self‐correction. In addition, the results of this study also confirm that learners have different perceptions of the media modalities designed with implicit and explicit strategies in the feedback. Feedback with audio modality is more suitable for adults, whereas juxtaposed textual and audio modalities are better for young learners.

4.
In two experiments, the multimedia contradiction paradigm was used to investigate whether learners map information conveyed through the audio and picture tracks of a video. In Experiment 1 (N = 85), the information conveyed through the audio track and the picture track was always consistent (control group) or was made inconsistent by changing the audio track at one point in time (text-wrong group). Experiment 2 (N = 143) added a second inconsistent condition by changing the picture track at one point in time (picture-wrong group). In both experiments, when inconsistent information was presented, the gaze behaviour of learners in the experimental groups differed from that of the control group. This indicates that mapping processes, an indispensable part of integration, occur when learners process videos. Regarding learning outcomes, no differences between groups were observed. In addition, only a few learners remembered the conflict after learning. Further, recall shifted towards the pictorial information when learners encountered a conflict.

5.

In voice recognition, the two main problems are speech recognition (what was said) and speaker recognition (who was speaking). The usual method for speaker recognition is to postulate a model in which the speaker identity corresponds to the model's parameters, whose estimation can be time-consuming when the number of candidate speakers is large. In this paper, we model the speaker as a high-dimensional point cloud of entropy-based features extracted from the speech signal. The method allows indexing, and hence it can manage large databases. We experimentally assessed the quality of the identification with a publicly available database formed by extracting audio from a collection of YouTube videos of 1,000 different speakers. With 20-second audio excerpts, we were able to identify a speaker with 97% accuracy when the recording environment is not controlled, and with 99% accuracy in controlled recording environments.


6.
Indexing and Retrieval of Audio: A Survey (cited 3 times: 0 self-citations, 3 by others)
With more and more audio being captured and stored, there is a growing need for automatic audio indexing and retrieval techniques that can retrieve relevant audio pieces quickly on demand. This paper provides a comprehensive survey of audio indexing and retrieval techniques. We first describe the main audio characteristics and features and discuss techniques for classifying audio into speech and music based on these features. Indexing and retrieval of speech and music are then described separately. Finally, the significance of audio in multimedia indexing and retrieval is discussed.

7.
The Cambridge University Multimedia Document Retrieval (CU-MDR) Demo System is a web-based application that allows the user to query a database of radio broadcasts that are available on the Internet. The audio from several radio stations is downloaded and transcribed automatically. This gives a collection of text and audio documents that can be searched by a user. The paper describes how speech recognition and information retrieval techniques are combined in the CU-MDR Demo System and shows how the user can interact with it.

8.
One of the most common impairments among aphasia patients is difficulty recalling names or words. Typically, word retrieval problems can be treated through word-naming therapeutic exercises. In fact, the frequency and intensity of speech therapy are key factors in the recovery of lost communication abilities. In this sense, speech and language technology can make a relevant contribution to the development of automatic therapy methods. In this work, we present an online system designed to behave as a virtual therapist, incorporating automatic speech recognition technology that allows aphasia patients to perform word-naming training exercises. We focus on the study of the automatic word-naming detector module and on its utility for both global evaluation and treatment. For that purpose, a database consisting of word-naming therapy sessions of aphasic Portuguese native speakers has been collected. In spite of the varied patient characteristics and speech quality conditions of the collected data, encouraging results have been obtained thanks to a calibration method that uses the patients' word-naming ability to adapt automatically to the particularities of each patient's speech.

9.
Smart speakers with voice assistants like Google Home or Amazon Alexa are increasingly popular and essential in our daily lives due to the convenience of issuing voice commands. Ensuring that these voice assistants are equitable to different population subgroups is crucial. In this paper, we present the first framework, AudioAcc, to help evaluate their performance across accents. AudioAcc takes in videos from YouTube and generates composite commands. We further propose a new metric called Consistency of Results (COR) that developers can use to avoid incorrect translation of the produced results by rewriting the skill to improve Word Error Rate (WER) performance. We evaluate AudioAcc on complete sentences extracted from YouTube videos. The results reveal that the composite sentences generated by AudioAcc are close to the complete sentences. Our evaluation across diverse audiences shows, first, that speech from native speakers, particularly Americans, exhibits the best WER performance, by 9.5%, in comparison to speech from other native and nonnative speakers; second, that speech from American professional speakers is significantly fairer and has the best WER performance, by 8.3%, in comparison to speech from German professional speakers and German and American amateur speakers. Moreover, we show that using the COR metric can help developers rewrite the skill to improve WER accuracy, which we used to improve the accuracy of the Russian accent.
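As a point of reference for the WER figures quoted above, the following is a minimal sketch of the textbook Levenshtein-based Word Error Rate computation; it is not AudioAcc's own implementation, and the example command transcripts are invented.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words (Levenshtein distance on words).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Invented command transcripts: one deletion ("the") and one substitution
# ("lights" -> "light") over 6 reference words gives WER = 2/6 ≈ 0.33.
print(wer("turn on the living room lights", "turn on living room light"))
```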

10.
This paper presents an integrated approach to spotting spoken keywords in digitized Tamil documents by combining word-image matching and spoken-word recognition techniques. The work involves segmenting document images into words, creating an index of keywords, and constructing a word-image hidden Markov model (HMM) and a speech HMM for each keyword. The word-image HMMs are constructed using seven-dimensional profile and statistical moment features and are used to recognize a segmented word image for possible inclusion of the keyword in the index. The spoken query word is recognized by maximum likelihood over the speech HMMs, using 39-dimensional mel frequency cepstral coefficients derived from speech samples of the keywords. The positional details of the search keyword, obtained from the automatically updated index, retrieve the relevant portion of text from the document during word spotting. Performance measures such as recall, precision, and F-measure are calculated for 40 test words from four groups of literary documents to illustrate the ability of the proposed scheme and highlight its worth in the emerging multilingual information retrieval scenario.
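For context, 39-dimensional MFCC features are conventionally 13 static cepstral coefficients stacked with their first- and second-order derivatives. Below is a minimal sketch of that common recipe using the librosa library; the file name and sample rate are placeholder assumptions, and this is not the paper's own feature-extraction code.

```python
# Common 39-dimensional MFCC recipe: 13 static coefficients plus first- and
# second-order deltas. "keyword.wav" and the 16 kHz rate are placeholders.
import librosa
import numpy as np

y, sr = librosa.load("keyword.wav", sr=16000)       # assumed sample rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
delta1 = librosa.feature.delta(mfcc)                # first-order deltas
delta2 = librosa.feature.delta(mfcc, order=2)       # second-order deltas
features = np.vstack([mfcc, delta1, delta2])        # shape: (39, n_frames)
print(features.shape)
```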

11.
Many children with speech sound disorders cannot pronounce sibilant consonants correctly. We have developed a serious game, controlled by the children's voices in real time, to help children practice producing European Portuguese (EP) sibilant consonants. For this, the game uses a sibilant-consonant classifier. Since the game does not require any adult supervision, children can practice producing these sounds more often, which may lead to faster improvement of their speech. Recently, deep neural networks have yielded considerable improvements in classification across a variety of use cases, from image classification to speech and language processing. Here, we propose using deep convolutional neural networks to classify sibilant phonemes of EP in our serious game for speech and language therapy. We compared the performance of several different artificial neural networks that used Mel frequency cepstral coefficients or log Mel filterbanks. Our best deep learning model achieves a classification score of 95.48% using a 2D convolutional model with log Mel filterbanks as input features. These results are further improved for specific classes with simple binary classifiers.
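To make the kind of model concrete, here is a toy 2D convolutional network over log Mel filterbank patches, written in PyTorch; the input shape (40 Mel bands by 100 frames), the layer sizes, and the four-class EP sibilant set (/s/, /z/, /ʃ/, /ʒ/) are illustrative assumptions, not the authors' published architecture.

```python
import torch
import torch.nn as nn

class SibilantCNN(nn.Module):
    """Toy 2D-CNN over log-Mel spectrogram patches (assumed 1 x 40 x 100 input)."""
    def __init__(self, n_classes: int = 4):  # assumed classes: /s/, /z/, /ʃ/, /ʒ/
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Two 2x2 poolings shrink a 40 x 100 input to 32 x 10 x 25 = 8000 values.
        self.classifier = nn.Linear(32 * 10 * 25, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, n_frames); returns per-class logits.
        return self.classifier(self.features(x).flatten(1))

logits = SibilantCNN()(torch.randn(8, 1, 40, 100))  # batch of 8 log-Mel patches
print(logits.shape)  # torch.Size([8, 4])
```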

12.
Downloading large binary files with the InternetReadFile API function (cited 1 time: 0 self-citations, 1 by others)
When writing applications, the InternetReadFile API function can be used to download binary files, such as image, video, and audio files, from a server to the client. Through worked programming examples, this article explains in detail how to use this function to download large binary files and how to resolve the problems that arise.
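For illustration, here is a minimal, Windows-only sketch of the chunked download loop the article describes, calling WinINet's InternetReadFile; the article presumably uses C/C++, so invoking the same API from Python via ctypes is an assumption made to keep this document's examples in one language, and the URL and output path are placeholders.

```python
# Minimal Windows-only sketch: download a large binary file in chunks with
# WinINet's InternetReadFile, called via ctypes. URL and path are placeholders.
import ctypes
from ctypes import wintypes

wininet = ctypes.WinDLL("wininet")
HINTERNET = wintypes.LPVOID
INTERNET_OPEN_TYPE_PRECONFIG = 0

wininet.InternetOpenW.restype = HINTERNET
wininet.InternetOpenW.argtypes = [wintypes.LPCWSTR, wintypes.DWORD,
                                  wintypes.LPCWSTR, wintypes.LPCWSTR, wintypes.DWORD]
wininet.InternetOpenUrlW.restype = HINTERNET
wininet.InternetOpenUrlW.argtypes = [HINTERNET, wintypes.LPCWSTR, wintypes.LPCWSTR,
                                     wintypes.DWORD, wintypes.DWORD, wintypes.LPVOID]
wininet.InternetReadFile.restype = wintypes.BOOL
wininet.InternetReadFile.argtypes = [HINTERNET, wintypes.LPVOID, wintypes.DWORD,
                                     ctypes.POINTER(wintypes.DWORD)]
wininet.InternetCloseHandle.restype = wintypes.BOOL
wininet.InternetCloseHandle.argtypes = [HINTERNET]

def download(url: str, dest: str, chunk_size: int = 64 * 1024) -> None:
    h_session = wininet.InternetOpenW("downloader", INTERNET_OPEN_TYPE_PRECONFIG,
                                      None, None, 0)
    if not h_session:
        raise ctypes.WinError()
    try:
        h_url = wininet.InternetOpenUrlW(h_session, url, None, 0, 0, None)
        if not h_url:
            raise ctypes.WinError()
        try:
            buf = ctypes.create_string_buffer(chunk_size)
            read = wintypes.DWORD(0)
            with open(dest, "wb") as out:
                # Loop until InternetReadFile reports 0 bytes read, so the
                # whole file never has to fit in memory at once.
                while True:
                    if not wininet.InternetReadFile(h_url, buf, chunk_size,
                                                    ctypes.byref(read)):
                        raise ctypes.WinError()
                    if read.value == 0:
                        break
                    out.write(buf.raw[:read.value])
        finally:
            wininet.InternetCloseHandle(h_url)
    finally:
        wininet.InternetCloseHandle(h_session)

# download("http://example.com/video.mp4", "video.mp4")  # placeholder URL
```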

13.
Effectiveness of hypermedia annotations for foreign language reading (cited 2 times: 0 self-citations, 2 by others)
This study first explores intermediate-level English learners' preferences for hypermedia annotations while they are engaged in reading a hypermedia text. Second, it examines whether multimedia annotations facilitate reading comprehension in the second language. The participants were 44 adult learners of English as a foreign language studying English for Academic Purposes. Data were collected through a tracking tool, a reading comprehension test, a questionnaire, and interviews. Results indicate that learners preferred visual annotations significantly more than textual and audio annotations. On the other hand, a negative relationship was found between annotation use and reading comprehension. In particular, pronunciations, audio recordings, and videos were found to affect reading comprehension negatively. However, the qualitative data revealed that the participants had positive attitudes towards annotations and hypermedia reading in general.

14.
15.
16.
17.
18.
Because of media digitization, a large amount of information, such as speech, audio, and video data, is produced every day. In order to retrieve data from these databases quickly and precisely, multimedia technologies for structuring and retrieving speech, audio, and video data are strongly required. In this paper, we give an overview of the multimedia technologies that exist today for TV news databases, such as structuring and retrieval of speech, audio, and video data, speaker indexing, audio summarization, and cross-media retrieval. The main purpose of structuring is to produce tables of contents and indices from audio and video data automatically. To make these technologies feasible, processing units such as words in audio data and shots in video data are first extracted. In a second step, they are meaningfully integrated into topics. Furthermore, the units extracted from different types of media are integrated for higher-level functions. Yasuo Ariki, Ph.D.: He is a Professor in the Department of Electronics and Informatics at Ryukoku University. He received his B.E., M.E., and Ph.D. in information science from Kyoto University in 1974, 1976, and 1979, respectively. He was an Assistant at Kyoto University from 1980 to 1990, and stayed at Edinburgh University as a visiting academic from 1987 to 1990. His research interests are in speech and image recognition and in information retrieval and databases. He is a member of IPSJ, IEICE, ASJ, Soc. Artif. Intel., and IEEE.

19.
Existing sentiment analysis methods fail to take full account of the information in short videos, which leads to inappropriate sentiment analysis results. The audio-visual multimodal sentiment analysis (AV-MSA) model was developed to address this: it performs sentiment analysis on short videos by exploiting visual features from video-frame images together with audio information. The model has two branches, visual and audio; the audio branch adopts a convolutional neural network (CNN) architecture to extract emotional features from audio spectrograms for sentiment analysis; ...

20.
Construction of a Chinese bimodal speech corpus based on video triphones (cited 2 times: 0 self-citations, 2 by others)
To enable visual speech synthesis and bimodal speech recognition, a suitable bimodal corpus must be built. This paper proposes a method for constructing a Chinese bimodal corpus. Based on the lip articulation features observed in video, existing triphone models are clustered to form video triphones. On top of the video triphones, an evaluation function scores the sentences in the raw corpus, and sentences are then selected automatically. Compared with other bimodal corpora, the corpus built in this paper shows considerable improvements in coverage, coverage efficiency, and the distribution of high-frequency words, and reflects bimodal language phenomena in Chinese more faithfully.
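The automatic selection step described above lends itself to a greedy, coverage-driven loop. The sketch below illustrates that idea with a scoring function that simply counts newly covered video triphones; this stands in for the paper's actual evaluation function, which is not reproduced here, and the sample data are hypothetical.

```python
# Toy greedy corpus selection: repeatedly pick the sentence that covers the
# most not-yet-covered video triphones. Scoring by raw coverage gain is an
# assumption standing in for the paper's evaluation function.
def select_corpus(sentences: dict[str, set[str]], target_size: int) -> list[str]:
    covered: set[str] = set()
    chosen: list[str] = []
    pool = dict(sentences)
    while pool and len(chosen) < target_size:
        # Score each remaining sentence by how many new triphones it covers.
        best = max(pool, key=lambda s: len(pool[s] - covered))
        if not pool[best] - covered:
            break  # nothing new left to cover
        covered |= pool.pop(best)
        chosen.append(best)
    return chosen

sentences = {  # hypothetical sentence -> set of video triphones it contains
    "s1": {"a-b+c", "b-c+d"},
    "s2": {"a-b+c"},
    "s3": {"c-d+e", "b-c+d", "d-e+f"},
}
print(select_corpus(sentences, target_size=2))  # ['s3', 's1']
```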
