20 similar documents found (search time: 0 ms)
2.
Source recording device recognition is an important emerging research field in digital media forensics. The literature has mainly focused on the source recording device identification problem, whereas few studies have addressed the source recording device verification problem. Sparse-representation-based classification methods have shown promise in many applications. This paper proposes a source cell phone verification scheme based on sparse representation. It can be divided into three schemes that use an exemplar dictionary, an unsupervised learned dictionary, and a supervised learned dictionary, respectively. Specifically, a discriminative dictionary learned by a supervised learning algorithm, which considers representational and discriminative power simultaneously (unlike unsupervised learning), is used to further improve the performance of verification systems based on sparse representation. Gaussian supervectors (GSVs) based on MFCCs, which have been shown to be effective in capturing the intrinsic characteristics of recording devices, are used for constructing and learning the dictionary. SCUTPHONE, a corpus of speech recordings from 15 cell phones, is presented. Evaluation experiments are conducted on three corpora of speech recordings from cell phones and demonstrate the effectiveness of the proposed methods for cell phone verification. In addition, the influence of the number of target examples in the exemplar dictionary and of the size of the unsupervised learned dictionary on verification performance is analyzed.
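The verification rule described above can be made concrete. Below is a minimal Python sketch of SRC-style scoring over an exemplar dictionary, with random vectors standing in for GSVs; the GSV extraction, the learned dictionaries, and all dimensions are assumptions for illustration only:

```python
# Minimal sketch of sparse-representation-based verification over an
# exemplar dictionary (hypothetical data; GSV extraction not shown).
import numpy as np
from sklearn.linear_model import orthogonal_mp
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
dim = 512                                      # assumed GSV dimensionality
target_gsvs = rng.normal(size=(20, dim))       # GSVs from the claimed phone
background_gsvs = rng.normal(size=(100, dim))  # GSVs from other phones

# Columns of the exemplar dictionary are l2-normalized supervectors.
D = normalize(np.vstack([target_gsvs, background_gsvs]), axis=1).T
n_target = target_gsvs.shape[0]

def verification_score(test_gsv, D, n_target, n_nonzero=10):
    """SRC-style score: residual using background atoms minus residual
    using target atoms; larger means more likely the claimed device."""
    x = test_gsv / np.linalg.norm(test_gsv)
    coef = orthogonal_mp(D, x, n_nonzero_coefs=n_nonzero)
    coef_t = coef.copy(); coef_t[n_target:] = 0.0   # keep target atoms
    coef_b = coef.copy(); coef_b[:n_target] = 0.0   # keep background atoms
    res_t = np.linalg.norm(x - D @ coef_t)
    res_b = np.linalg.norm(x - D @ coef_b)
    return res_b - res_t

score = verification_score(rng.normal(size=dim), D, n_target)
print(f"verification score: {score:.3f}")  # threshold for accept/reject
```

A positive score means the target atoms reconstruct the test supervector better than the background atoms; thresholding the score yields the accept/reject decision.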
3.
Schultz T., Black A. W., Vogel S., Woszczyna M. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(2): 403-411
Speech translation research has made significant progress over the years, with many high-visibility efforts showing that translation of spontaneously spoken speech from and to diverse languages is possible and applicable in a variety of domains. As languages and domains continue to expand, practical concerns such as portability and reconfigurability come into play: system maintenance becomes a key issue, and data is never sufficient to cover changing domains across varying languages. In this paper, we discuss strategies to overcome the limits of today's speech translation systems. In the first part, we describe our layered system architecture, which allows for easy component integration, resource sharing across components, comparison of alternative approaches, and migration toward hybrid desktop/PDA or stand-alone PDA systems. In the second part, we show how flexibility and reconfigurability are implemented by relying more radically on learning approaches, using our English-Thai two-way speech translation system as a concrete example.
4.
As communication becomes increasingly automated and transnational, the need for rapid, computer-aided speech translation grows. The Janus-II system uses paraphrasing and interactive error correction to boost performance. Janus-II operates on spontaneous conversational human dialogue in limited domains with vocabularies of 3,000 or more words; current experiments involve 10,000- to 40,000-word vocabularies. It accepts English, German, Japanese, Spanish, and Korean input, which it translates into any other of these languages. Beyond translating syntactically well-formed speech or carefully structured human-to-machine utterances, Janus-II research has focused on the more difficult task of translating spontaneous conversational speech between humans, which naturally requires a suitable database and task domain.
5.
Vidal E., Casacuberta F., Rodriguez L., Civera J., Hinarejos C. D. M. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(3): 941-951
Current machine translation systems are far from perfect. However, such systems can be used in computer-assisted translation to increase the productivity of the (human) translation process. The idea is to use a text-to-text translation system to produce portions of target-language text that can be accepted or amended by a human translator using text or speech. These user-validated portions are then used by the text-to-text translation system to produce further, hopefully improved, suggestions. There are different alternatives for using speech in a computer-assisted translation system, from pure dictated translation to simple acceptance of partial translations by reading parts of the suggestions made by the system. In all cases, information from the text to be translated can be used to constrain the speech decoding search space. While pure dictation seems to be among the most attractive settings, perfect speech decoding is unfortunately not possible with current speech processing technology, and human error correction would still be required. Therefore, approaches that achieve higher speech recognition accuracy by using increasingly constrained models in the recognition process are explored here, all within the statistical framework. Empirical results support the potential usefulness of speech within the computer-assisted translation paradigm.
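To make the idea of constraining speech decoding with translation-side information concrete, here is a toy Python sketch that filters an ASR n-best list by prefix consistency with MT suggestions. The strings and scores are invented, and the n-best rescoring is a shortcut: a real system would constrain the decoder search itself, as the paper describes.

```python
# Toy sketch: among ASR hypotheses, keep only those consistent with being
# a prefix of some MT suggestion, then pick the best acoustic/LM score.
def select_hypothesis(asr_nbest, mt_suggestions):
    """asr_nbest: list of (hypothesis, log_score) from the recognizer.
    mt_suggestions: target-text suggestions from the MT system."""
    def is_prefix(hyp, suggestion):
        h, s = hyp.split(), suggestion.split()
        return h == s[:len(h)]
    valid = [(h, sc) for h, sc in asr_nbest
             if any(is_prefix(h, s) for s in mt_suggestions)]
    return max(valid, key=lambda t: t[1], default=(None, float("-inf")))

nbest = [("the house is", -4.1), ("the horse is", -3.9), ("a house", -5.0)]
suggestions = ["the house is white", "the house is very white"]
print(select_hypothesis(nbest, suggestions))  # ('the house is', -4.1)
```

Note how the constraint overrules the raw score ranking: "the horse is" scores best acoustically but is rejected because no suggestion supports it.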
6.
As multimedia becomes the dominant form of entertainment through an ever-increasing range of digital formats, there has been growing interest in obtaining information from entertainment media. Speech is one of the core resources in multimedia, providing a foundation for the extraction of semantic information; detecting speech is thus a critical first step for speech-based information retrieval systems. This work focuses on speech detection in one of the dominant forms of entertainment media: feature films. A novel approach for voice activity detection (VAD) in film audio is proposed. The approach uses correlation to analyze associations of Mel-frequency cepstral coefficient (MFCC) pairs in speech and non-speech data. This information then drives feature selection for the creation of MFCC cross-covariance feature vectors (MFCC-CCs), which are used to train a random forest classifier on a binary speech/non-speech classification problem on audio data from entertainment media. The classifier is evaluated on a number of test sets, achieves a classification accuracy of up to 94%, and demonstrates competitive results against state-of-the-art and contemporary VAD algorithms.
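As a rough illustration of the MFCC-CC pipeline, the following Python sketch builds cross-covariance features from MFCC trajectories and trains a random forest on synthetic clips; the paper's correlation-driven feature selection and its film-audio data are not reproduced here, and the synthetic "speech" is only a crude voiced-signal proxy:

```python
# Minimal sketch of MFCC cross-covariance (MFCC-CC) features feeding a
# random forest speech/non-speech classifier on synthetic clips.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def mfcc_cc(y, sr, n_mfcc=13):
    """Upper-triangular cross-covariances of MFCC trajectories per clip."""
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    cov = np.cov(m)                                      # pairwise covariances
    iu = np.triu_indices(n_mfcc)
    return cov[iu]

sr = 16000
rng = np.random.default_rng(1)
clips, labels = [], []
for _ in range(40):
    noise = rng.normal(scale=0.1, size=sr)               # "non-speech"
    t = np.arange(sr) / sr                               # crude voiced proxy
    buzz = 0.5 * np.sin(2 * np.pi * 150 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
    clips += [noise, buzz + rng.normal(scale=0.05, size=sr)]
    labels += [0, 1]

X = np.array([mfcc_cc(c, sr) for c in clips])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print("train accuracy:", clf.score(X, labels))
```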
7.
Sean A. Ramprashad. International Journal of Speech Technology, 1999, 2(4): 359-372
A two-stage hybrid embedded speech/audio coding structure and algorithm are proposed. The first stage consists of a core speech coder that provides a minimum output bit rate and acceptable performance on clean speech inputs. The second stage is a perceptual/transform-based coder that provides a separate, optional bitstream for enhancing the core stage output. The two-stage structure can be used to enhance the quality of an existing codec without modifying the original coding algorithm; in this regard it can be considered a value-added option for a standard (existing) system. The structure can also be used in systems in which many users/systems force the coding algorithm to work simultaneously under multiple constraints of bit rate, complexity, delay, and coding quality. Informal testing of the algorithm has been done using ITU-T standard G.723.1 at 5.3 kb/s as the core coder, with a maximum combined bit rate of 16 kb/s from the core and enhancement stages. The tests show that the second stage significantly improves the quality of the core output for music and for speech with background noise. Compared to the non-embedded fixed-rate standard LD-CELP G.728 at 16 kb/s, the quality of the two-stage structure is generally lower on these inputs; the embedded feature does affect quality. On clean speech, the quality of the two-stage structure at 16 kb/s is close to, if not better than, that of G.728 at 16 kb/s.
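The embedded core-plus-enhancement idea can be shown with a toy Python example in which a coarse uniform quantizer stands in for the core coder and a finer quantizer codes its residual. This is purely illustrative, not G.723.1 or the perceptual transform coder used in the paper:

```python
# Toy two-stage embedded coder: a coarse "core" layer plus an optional
# enhancement layer coding the core residual. A decoder that receives
# only the core bits still reconstructs a usable (coarser) signal.
import numpy as np

def quantize(x, step):
    return np.round(x / step).astype(int)       # encoder side

def dequantize(q, step):
    return q * step                             # decoder side

rng = np.random.default_rng(0)
signal = rng.normal(size=160)                   # one 20 ms frame at 8 kHz

core_step, enh_step = 0.5, 0.1
core_bits = quantize(signal, core_step)         # always transmitted
core_hat = dequantize(core_bits, core_step)     # core-only reconstruction
enh_bits = quantize(signal - core_hat, enh_step)  # optional enhancement
full_hat = core_hat + dequantize(enh_bits, enh_step)

mse = lambda a, b: float(np.mean((a - b) ** 2))
print(f"core-only MSE: {mse(signal, core_hat):.4f}")
print(f"core+enhancement MSE: {mse(signal, full_hat):.4f}")
```

The key property, as in the paper, is that the enhancement bitstream can be dropped at any point without breaking the core decoder.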
8.
Kikui G., Yamamoto S., Takezawa T., Sumita E. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(5): 1674-1682
This paper investigates issues in preparing corpora for developing speech-to-speech translation (S2ST). It is impractical to create a broad-coverage parallel corpus from dialog speech alone. An alternative approach is to have bilingual experts write conversational-style texts in the target domain, together with translations, but this risks losing fidelity to actual utterances. This paper focuses on balancing the tradeoff between these two kinds of corpora through the analysis of two newly developed corpora in the travel domain: a bilingual parallel corpus with 420 K utterances and a collection of in-domain dialogs using actual S2ST systems. We found that the first corpus effectively covers the utterances in the second corpus if complemented with a small number of utterances taken from monolingual dialogs. We also found that the characteristics of in-domain utterances become closer to those of the first corpus when more restrictive conditions and instructions are given to speakers. These results suggest the possibility of bootstrap-style development of corpora and S2ST systems, where an initial S2ST system is developed from parallel texts and then gradually improved with in-domain utterances collected by the system as restrictions are relaxed.
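The coverage analysis mentioned above can be sketched as a simple computation in Python. The utterances below are invented, and real analyses would use looser matching than exact string equality:

```python
# Sketch: what fraction of in-domain dialog utterances is covered by a
# written parallel corpus (exact-match at the utterance level)?
def coverage(dialog_utterances, corpus_utterances):
    corpus = {u.strip().lower() for u in corpus_utterances}
    hits = sum(u.strip().lower() in corpus for u in dialog_utterances)
    return hits / len(dialog_utterances)

corpus = ["how much is this?", "where is the station?", "thank you"]
dialogs = ["Where is the station?", "Is breakfast included?", "Thank you"]
print(f"coverage: {coverage(dialogs, corpus):.0%}")  # 67%
```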
9.
JMF Technology and the Implementation of Real-Time Voice Communication
This article introduces the basic concepts of streaming media, describes the support for real-time media streams provided by the JMF RTP APIs, and explains how to send and receive streaming media data over a network.
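The JMF RTP APIs (Java) hide the packet-level details; as a language-agnostic illustration of what such a stack does on the wire, here is a Python sketch that hand-packs a minimal RFC 3550 RTP header around a G.711-style payload. The addresses, port, SSRC, and payload contents are all illustrative, not taken from the article:

```python
# Minimal RTP (RFC 3550) packetization by hand: 12-byte header followed
# by a 20 ms G.711 mu-law payload, sent over UDP.
import socket
import struct

def rtp_packet(payload, seq, timestamp, ssrc, payload_type=0):
    """Build a version-2 RTP packet: no padding/extension/CSRC, marker 0."""
    header = struct.pack(
        "!BBHII",
        0x80,                  # V=2, P=0, X=0, CC=0
        payload_type & 0x7F,   # M=0, PT=0 (PCMU)
        seq & 0xFFFF,
        timestamp & 0xFFFFFFFF,
        ssrc,
    )
    return header + payload

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payload = bytes(160)           # 20 ms of 8 kHz mu-law silence
for seq in range(3):
    pkt = rtp_packet(payload, seq, seq * 160, ssrc=0x1234ABCD)
    sock.sendto(pkt, ("127.0.0.1", 5004))
```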
10.
Single-channel speech enhancement using implicit Wiener filter for high-quality speech communication
Jaiswal Rahul Kumar, Yeduri Sreenivasa Reddy, Cenkeramaddi Linga Reddy. International Journal of Speech Technology, 2022, 25(3): 745-758
Speech enables easy human-to-human communication as well as human-to-machine interaction. However, the quality of speech degrades due to background...
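For reference, a classic single-channel Wiener-gain enhancer can be sketched as follows in Python. This is the textbook formulation, with the noise spectrum estimated from leading frames assumed to be speech-free; it is not the paper's implicit-Wiener variant:

```python
# Classic Wiener-gain speech enhancement: estimate noise PSD from the
# first frames, compute a per-bin gain SNR/(SNR+1), apply, and resynthesize.
import numpy as np
from scipy.signal import stft, istft

def wiener_enhance(noisy, sr, noise_frames=10):
    f, t, Y = stft(noisy, fs=sr, nperseg=512)
    noise_psd = np.mean(np.abs(Y[:, :noise_frames]) ** 2, axis=1, keepdims=True)
    snr = np.maximum(np.abs(Y) ** 2 / noise_psd - 1.0, 0.0)  # ML a priori SNR
    gain = snr / (snr + 1.0)                                  # Wiener gain
    _, enhanced = istft(gain * Y, fs=sr, nperseg=512)
    return enhanced

sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 300 * t)
noisy = clean + 0.3 * np.random.default_rng(0).normal(size=sr)
enhanced = wiener_enhance(noisy, sr)
print("noisy MSE:", float(np.mean((noisy - clean) ** 2)))
print("enhanced MSE:", float(np.mean((enhanced[:sr] - clean) ** 2)))
```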
14.
Verónica López-Ludeña, Rubén San-Segundo, Carlos González Morcillo, Juan Carlos López, José M. Pardo Muñoz. Expert Systems with Applications, 2013, 40(4): 1312-1322
This paper describes a new version of a speech-to-sign-language translation system with new tools and characteristics for increasing its adaptability to a new task or semantic domain. The system is made up of a speech recognizer (for decoding the spoken utterance into a word sequence), a natural language translator (for converting the word sequence into a sequence of signs from the sign language), and a 3D avatar animation module (for playing back the signs). To increase adaptability, this paper presents improvements in all three main modules for automatically generating the task-dependent information from a parallel corpus: automatic generation of Spanish variants when building the vocabulary and language model for the speech recognizer, an acoustic adaptation module for the recognizer, data-oriented language and translation models for the machine translator, and a list of signs to design. The avatar animation module includes a new editor for rapid design of the required signs. These developments reduce the effort of adapting a Spanish to Spanish Sign Language (LSE: Lengua de Signos Española) translation system to a new domain. The whole translation system achieves a Sign Error Rate (SER) below 10% and a BLEU score above 90%, while the effort for adapting the system to a new domain has been reduced by more than 50%.
15.
Nicolas Staelens, Jonas De Meulenaere, Lizzy Bleumers, Glenn Van Wallendael, Jan De Cock, Koen Geeraert, Nick Vercammen, Wendy Van den Broeck, Brecht Vermeulen, Rik Van de Walle, Piet Demeester. Multimedia Systems, 2012, 18(6): 445-457
Lip synchronization is considered a key parameter during interactive communication. For video conferencing and television broadcasting, the differential delay between audio and video should remain below certain thresholds, as recommended by several standardization bodies. However, research has also shown that these thresholds can be relaxed, depending on the targeted application and use case. In this article, we investigate the influence of lip sync on the ability to perform real-time language interpretation during video conferencing, and we determine lip sync visibility thresholds applicable to this use case. We conducted a subjective experiment with expert interpreters, who were required to perform simultaneous translation, and with non-experts. Our results show that significant differences are obtained when conducting subjective experiments with expert interpreters. Because interpreters are primarily focused on performing the simultaneous translation, their lip sync detectability thresholds are higher than existing recommended thresholds. Primary focus and the targeted application and use case are therefore important factors when selecting lip sync acceptability thresholds.
17.
Theories of computer-mediated communication typically rest on the assumption that communication via computers lacks visual and auditory cues. However, recent technological advances such as webcams and microphones, and their increased use, call this assumption into question. Moreover, the question arises of what characterizes individuals who use such devices. Drawing on a survey of 1,060 adolescents, we found that 57% of adolescents at least occasionally used webcams during instant messaging, while 32% at least sometimes used microphones. Adolescents who perceived the lack of visual cues in online communication as important used webcams less frequently. For early and middle adolescents, greater social anxiety reduced webcam use, whereas higher private self-consciousness increased it. Our results suggest that the nature of computer-mediated communication may change considerably in the coming years, and theories of computer-mediated communication need to integrate these changes more strongly into theory building.
18.
This article introduces the silence suppression algorithm used in speech compression and multimedia technology. Its voice activity detection algorithm and comfort noise generator algorithm reduce the transmission bit rate during speech gaps, achieving discontinuous transmission.
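A minimal Python sketch of the scheme follows, assuming an energy-threshold voice activity detector and a Gaussian comfort noise generator; real systems use standardized SID frames and spectrally shaped noise rather than the scalar noise descriptor used here:

```python
# Sketch of silence suppression / DTX: an energy-based VAD gates
# transmission, and the receiver fills gaps with comfort noise matched
# to the last reported noise level.
import numpy as np

def dtx_transmit(frames, threshold=0.01):
    """Yield (is_speech, payload): speech frames verbatim, silence as a
    single noise-power scalar (a stand-in for a SID frame)."""
    for frame in frames:
        energy = float(np.mean(frame ** 2))
        if energy > threshold:
            yield True, frame                   # transmit the speech frame
        else:
            yield False, energy                 # transmit descriptor only

def receive(stream, frame_len=160, rng=np.random.default_rng(0)):
    out = []
    for is_speech, payload in stream:
        if is_speech:
            out.append(payload)
        else:                                   # comfort noise generation
            out.append(rng.normal(scale=np.sqrt(payload), size=frame_len))
    return np.concatenate(out)

rng = np.random.default_rng(1)
speech = [rng.normal(scale=0.3, size=160) for _ in range(5)]
silence = [rng.normal(scale=0.01, size=160) for _ in range(5)]
received = receive(dtx_transmit(silence + speech + silence))
print("reconstructed samples:", received.shape[0])
```

During the silent stretches only a scalar is sent per frame instead of 160 samples, which is the bit-rate saving the article describes.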
20.
Moon-Sang Lee, Sang-Kwon Lee, Joonwon Lee, Seung-Ryoul Maeng. Computer Architecture Letters, 2006, 5(1): 26-29
User-level communication alleviates the software overhead of the communication subsystem by allowing applications to access the network interface directly. For that purpose, efficient translation of virtual addresses to physical addresses is critical. In this study, we propose a system-call-based address translation scheme in which every translation is done by the kernel, instead of by a translation cache on the network interface controller as in previous cache-based address translation. In our experiments, the scheme achieves up to a 4.5% reduction in application execution time compared to the previous cache-based approach.
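The tradeoff can be illustrated with a toy Python simulation: an LRU translation cache whose misses must be filled by the kernel versus a kernel system call on every translation. The cycle costs and access trace below are invented purely for illustration, not measurements from the paper:

```python
# Toy comparison of NIC translation-cache lookups (misses fall back to
# the kernel) versus a system call per translation.
from collections import OrderedDict

def cache_cost(trace, capacity, hit_cost=1, miss_cost=300):
    cache, total = OrderedDict(), 0
    for page in trace:
        if page in cache:
            cache.move_to_end(page)
            total += hit_cost
        else:
            total += miss_cost                 # kernel fills the entry
            cache[page] = True
            if len(cache) > capacity:
                cache.popitem(last=False)      # LRU eviction
    return total

def syscall_cost(trace, per_call=200):
    return per_call * len(trace)               # kernel translates every time

trace = [i % 64 for i in range(10_000)]        # cyclic access to 64 pages
print("cache (32 entries):", cache_cost(trace, 32))
print("syscall every time:", syscall_cost(trace))
```

Under this cyclic trace, the 32-entry LRU cache misses on every access and loses to the constant-cost system call, which mirrors the paper's argument that kernel-based translation can beat a thrashing translation cache on the NIC.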