1.
Perakakis M. Potamianos A. 《IEEE transactions on audio, speech, and language processing》2008,16(6):1194-1206
The usage patterns of speech and visual input modes are investigated as a function of relative input mode efficiency for both desktop and personal digital assistant (PDA) working environments. For this purpose the form-filling part of a multimodal dialogue system is implemented and evaluated; three multimodal modes of interaction are implemented: “Click-to-Talk,” “Open-Mike,” and “Modality-Selection.” “Modality-Selection” implements an adaptive interface where the system selects the most efficient input mode at each turn, effectively alternating between a “Click-to-Talk” and “Open-Mike” interaction style as proposed in “Modality tracking in the multimodal Bell Labs Communicator,” in Proceedings of the Automatic Speech Recognition and Understanding Workshop, by A. Potamianos, , 2003. The multimodal systems are evaluated and compared with the unimodal systems. Objective and subjective measures used include task completion, task duration, turn duration, and overall user satisfaction. Turn duration is broken down into interaction time and inactivity time to better measure the efficiency of each input mode. Duration statistics and empirical probability density functions are computed as a function of interaction context and user. Results show that the multimodal systems outperform the unimodal systems in terms of objective and subjective criteria. Also, users tend to use the most efficient input mode at each turn; however, biases towards the default input modality and a general bias towards the speech modality also exist. Results demonstrate that although users exploit some of the available synergies in multimodal dialogue interaction, further efficiency gains can be achieved by designing adaptive interfaces that fully exploit these synergies.
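The turn-duration breakdown described in this abstract can be illustrated with a small sketch. The abstract does not give the system's actual selection rule, so the rule below (pick the mode with the lowest mean logged turn duration) is an assumption, and the function names and log layout are hypothetical.

```python
from statistics import mean


def turn_duration(interaction_s, inactivity_s):
    """Turn duration as the sum of interaction time and inactivity time (seconds)."""
    return interaction_s + inactivity_s


def most_efficient_mode(turn_log):
    """Return the input mode with the lowest mean turn duration.

    turn_log maps a mode name to a list of (interaction_s, inactivity_s)
    pairs. This selection rule is an illustrative assumption, not the
    rule used by the system in the paper.
    """
    averages = {
        mode: mean(turn_duration(a, b) for a, b in turns)
        for mode, turns in turn_log.items()
        if turns  # skip modes with no logged turns
    }
    return min(averages, key=averages.get)


# Hypothetical log: two turns per mode.
log = {
    "speech": [(2.0, 1.0), (3.0, 1.0)],
    "gui": [(4.0, 0.5), (5.0, 0.5)],
}
selected = most_efficient_mode(log)
```

Keeping interaction and inactivity time as separate fields lets each component be analyzed on its own, which is the motivation the abstract gives for the breakdown.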
2.
Zhenqiu Zhang Gerasimos Potamianos Andrew W. Senior Thomas S. Huang 《Signal, Image and Video Processing》2007,1(2):163-178
The paper introduces a novel detection and tracking system that provides both frame-view and world-coordinate human location
information, based on video from multiple synchronized and calibrated cameras with overlapping fields of view. The system
is developed and evaluated for the specific scenario of a seminar lecturer presenting in front of an audience inside a “smart
room”, its aim being to track the lecturer’s head centroid in the three-dimensional (3D) space and also yield two-dimensional
(2D) face information in the available camera views. The proposed approach is primarily based on a statistical appearance
model of human faces by means of well-known AdaBoost-like face detectors, extended to address the head pose variation observed
in the smart room scenario of interest. The appearance module is complemented by two novel components and assisted by a simple
tracking drift detection mechanism. The first component of interest is the initialization module, which employs a spatio-temporal
dynamic programming approach with appropriate penalty functions to obtain optimal 3D location hypotheses. The second is an
adaptive subspace learning based 2D tracking scheme with a novel forgetting mechanism, introduced to reduce tracking drift
and increase robustness. System performance is benchmarked on an extensive database of realistic human interaction in the
lecture smart room scenario, collected as part of the European integrated project “CHIL”. The system consistently achieves
excellent tracking precision, with a 3D mean tracking error of less than 16 cm, and is demonstrated to outperform four alternative
tracking schemes. Furthermore, the proposed system performs relatively well in detecting frontal and near-frontal faces in
the available frame views.
This work was performed while Zhenqiu Zhang was on a summer internship with the Human Language Technology Department at the
IBM T.J. Watson Research Center.
3.
Djamel Mostefa Nicolas Moreau Khalid Choukri Gerasimos Potamianos Stephen M. Chu Ambrish Tyagi Josep R. Casas Jordi Turmo Luca Cristoforetti Francesco Tobia Aristodemos Pnevmatikakis Vassilis Mylonakis Fotios Talantzis Susanne Burger Rainer Stiefelhagen Keni Bernardin Cedrick Rochet 《Language Resources and Evaluation》2007,41(3-4):389-407
4.
Potamianos G.G. Goutsias J.K. 《IEEE transactions on information theory / Professional Technical Group on Information Theory》1993,39(4):1322-1332
A Monte Carlo simulation technique for estimating the partition function of a general Gibbs random field image is proposed. By expressing the partition function as an expectation, an importance sampling approach for estimating it using Monte Carlo simulations is developed. As expected, the resulting estimators are unbiased and consistent. Computations can be performed iteratively by using simple Monte Carlo algorithms with remarkable success, as demonstrated by simulations. The work concentrates on binary, second-order Gibbs random fields defined on a rectangular lattice. However, the proposed methods can be easily extended to more general Gibbs random fields. Their potential contribution to optimal parameter estimation and hypothesis testing problems for general Gibbs random field images using a likelihood approach is anticipated.
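The idea of writing the partition function as an expectation can be sketched on a toy problem. The code below is not the estimator from the paper: it assumes a uniform proposal distribution, a simplified binary field with nearest-neighbor pair interactions only, and a lattice small enough to check against brute-force enumeration. All names and parameter values are hypothetical. With a uniform proposal over the 2^N configurations, Z = 2^N · E[exp(-U(X))].

```python
import itertools
import math
import random


def energy(x, rows, cols, beta):
    """Energy U(x) = -beta * sum of s_i * s_j over horizontal and vertical
    neighbor pairs, with each site s in {-1, +1} (simplified toy model)."""
    u = 0.0
    for r in range(rows):
        for c in range(cols):
            s = x[r * cols + c]
            if c + 1 < cols:  # right neighbor
                u -= beta * s * x[r * cols + c + 1]
            if r + 1 < rows:  # lower neighbor
                u -= beta * s * x[(r + 1) * cols + c]
    return u


def exact_partition(rows, cols, beta):
    """Brute-force Z by enumerating all configurations (tiny lattices only)."""
    return sum(
        math.exp(-energy(x, rows, cols, beta))
        for x in itertools.product((-1, 1), repeat=rows * cols)
    )


def estimate_partition(rows, cols, beta, n_samples, seed=0):
    """Importance-sampling estimate of Z with a uniform proposal:
    Z = 2^N * E_uniform[exp(-U(X))]."""
    rng = random.Random(seed)
    n = rows * cols
    total = 0.0
    for _ in range(n_samples):
        x = [rng.choice((-1, 1)) for _ in range(n)]
        total += math.exp(-energy(x, rows, cols, beta))
    return (2 ** n) * total / n_samples


z_exact = exact_partition(3, 3, 0.2)
z_hat = estimate_partition(3, 3, 0.2, n_samples=20000)
relative_error = abs(z_hat - z_exact) / z_exact
```

A uniform proposal is only workable for weak interactions; as beta grows, the weights exp(-U) become dominated by a few configurations and the variance rises sharply, which is why a better-matched proposal distribution matters in practice.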
5.
P Potamianos AA Amis AJ Forester M McGurk M Bircher 《Canadian Metallurgical Quarterly》1998,212(5):383-393
During pregnancy, mouse uterine epithelial cells produce and secrete a large amount of macrophage colony-stimulating factor (M-CSF/CSF-1). Macrophages accumulate and proliferate in the undecidualized endometrium of the pregnant uterus. Observations showed that macrophages expressed scavenger receptor class A (type I and type II) and class C (macrosialin). Scavenger receptors appeared to be involved in the removal of apoptotic cells in the degenerated decidual tissue. The expression of class A and class C scavenger receptor mRNAs in the uterus of pregnant mice was elevated but the expression of class B scavenger receptor (CD36) mRNA was similar to that of non-pregnant mice. The expression of various cytokines and chemokines, including M-CSF, monocyte chemoattractant protein-1 (MCP-1) and macrophage inflammatory protein 1-alpha (MIP1-alpha), was enhanced in the uterus of pregnant mice, suggesting that these molecules regulate macrophage chemotaxis and immunological function in the uterus. These findings imply that the pregnant uterus provides a microenvironment for the recruitment, differentiation, and proliferation of macrophages and the regulation of scavenger receptor and cytokine expression for a successful pregnancy.
6.
Recent advances in the automatic recognition of audiovisual speech (total citations: 11; self-citations: 0; citations by others: 11)
Potamianos G. Neti C. Gravier G. Garg A. Senior A.W. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》2003,91(9):1306-1326
Visual speech information from the speaker's mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability in the human computer interface. In this paper, we review the main components of audiovisual automatic speech recognition (ASR) and present novel contributions in two main areas: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and subsequently, audiovisual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audiovisual speech asynchrony, and incorporating modality reliability estimates into the bimodal recognition process. We also briefly touch upon the issue of audiovisual adaptation. We apply our algorithms to three multisubject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves ASR over all conditions and data considered, though less so for visually challenging environments and large vocabulary tasks.
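Decision fusion with a modality reliability weight can be illustrated with a minimal sketch. The abstract does not specify the paper's fusion formula, so the code below assumes a common formulation: a convex combination of per-class audio and visual log-likelihoods with a single audio weight. The function names, class labels, and scores are hypothetical.

```python
def fuse_scores(audio_loglik, visual_loglik, audio_weight):
    """Combine per-class log-likelihoods from the two modalities.

    audio_weight in [0, 1] stands in for an audio reliability estimate;
    the visual stream receives weight 1 - audio_weight. This weighting
    scheme is an illustrative assumption.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return {
        label: audio_weight * audio_loglik[label]
        + (1.0 - audio_weight) * visual_loglik[label]
        for label in audio_loglik
    }


def classify(audio_loglik, visual_loglik, audio_weight):
    """Return the class label with the highest fused score."""
    fused = fuse_scores(audio_loglik, visual_loglik, audio_weight)
    return max(fused, key=fused.get)


# Hypothetical scores: audio weakly prefers "ba", video clearly prefers "ga".
audio = {"ba": -1.0, "ga": -1.2}
visual = {"ba": -2.0, "ga": -0.5}

clean_audio_decision = classify(audio, visual, audio_weight=0.9)
noisy_audio_decision = classify(audio, visual, audio_weight=0.3)
```

When the audio stream is trusted (weight 0.9) the audio preference wins; when it is treated as unreliable (weight 0.3) the visual evidence decides the class.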
7.
Serdar Yildirim Shrikanth Narayanan Alexandros Potamianos 《Computer Speech and Language》2011,25(1):29-44
The automatic recognition of a user’s communicative style within a spoken dialog system framework, including the affective aspects, has received increased attention in the past few years. For dialog systems, it is important to know not only what was said but also how something was communicated, so that the system can engage the user in a richer and more natural interaction. This paper addresses the problem of automatically detecting “frustration”, “politeness”, and “neutral” attitudes from a child’s speech communication cues, elicited in spontaneous dialog interactions with computer characters. Several information sources such as acoustic, lexical, and contextual features, as well as their combinations, are used for this purpose. The study is based on a Wizard-of-Oz dialog corpus of 103 children, 7–14 years of age, playing a voice activated computer game. Three-way classification experiments, as well as pairwise classification of polite vs. others and frustrated vs. others, were performed. Experimental results show that lexical information has more discriminative power than acoustic and contextual cues for detection of politeness, whereas context and acoustic features perform best for frustration detection. Furthermore, the fusion of acoustic, lexical and contextual information provided significantly better classification results. Results also showed that classification performance varies with age and gender. Specifically, for the “politeness” detection task, higher classification accuracy was achieved for females and 10–11-year-olds, compared to males and other age groups, respectively.
8.
Gerasimos Potamianos Chalapathy Neti Giridharan Iyengar Andrew W. Senior Ashish Verma 《International Journal of Speech Technology》2001,4(3-4):193-208
We propose a three-stage pixel-based visual front end for automatic speechreading (lipreading) that results in significantly improved recognition performance of spoken words or phonemes. The proposed algorithm is a cascade of three transforms applied on a three-dimensional video region-of-interest that contains the speaker's mouth area. The first stage is a typical image compression transform that achieves a high-energy, reduced-dimensionality representation of the video data. The second stage is a linear discriminant analysis-based data projection, which is applied on a concatenation of a small number of consecutive image transformed video data. The third stage is a data rotation by means of a maximum likelihood linear transform that optimizes the likelihood of the observed data under the assumption of their class-conditional multivariate normal distribution with diagonal covariance. We applied the algorithm to visual-only 52-class phonetic and 27-class visemic classification on a 162-subject, 8-hour long, large vocabulary, continuous speech audio-visual database. We demonstrated significant classification accuracy gains by each added stage of the proposed algorithm which, when combined, can achieve up to 27% improvement. Overall, we achieved a 60% (49%) visual-only frame-level visemic classification accuracy with (without) use of test set viseme boundaries. In addition, we report improved audio-visual phonetic classification over the use of a single-stage image transform visual front end. Finally, we discuss preliminary speech recognition results.
9.
Potamianos A. Fosler-Lussier E. Ammicht E. Perakakis M. 《Multimedia, IEEE Transactions on》2007,9(3):550-566
For Part I, see ibid., vol. 9, p. 3 (2007). In this paper, the task and user interface modules of a multimodal dialogue system development platform are presented. The main goal of this work is to provide a simple, application-independent solution to the problem of multimodal dialogue design for information seeking applications. The proposed system architecture clearly separates the task and interface components of the system. A task manager is designed and implemented that consists of two main submodules: the electronic form module that handles the list of attributes that have to be instantiated by the user, and the agenda module that contains the sequence of user and system tasks. Both the electronic forms and the agenda can be dynamically updated by the user. Next, a spoken dialogue module is designed that implements the speech interface for the task manager. The dialogue manager can handle complex error correction and clarification user input, building on the semantics and pragmatic modules presented in Part I of this paper. The spoken dialogue system is evaluated for a travel reservation task of the DARPA Communicator research program and shown to yield over 90% task completion and good performance for both objective and subjective evaluation metrics. Finally, a multimodal dialogue system, which combines graphical and speech interfaces, is designed, implemented and evaluated. Minor modifications to the unimodal semantic and pragmatic modules were required to build the multimodal system. It is shown that the multimodal system significantly outperforms the unimodal speech-only system both in terms of efficiency (task success and time to completion) and user satisfaction for a travel reservation task.
10.
Ioannou D.E. Cristoloveanu S. Potamianos C.N. Zhong X. McLarty P.K. Hughes H.L. 《Electron Devices, IEEE Transactions on》1991,38(3):463-468
A comprehensive electrical characterization study which was conducted to optimize the fabrication of SIMOX substrates for VLSI is discussed. The oxygen implantation was carried out using medium-current and high-current implanters. The wafers were annealed at 1275°C and 1300°C to produce high-quality, precipitate-free material. The effect of dose, the effect of multiple implantation (by sequentially implanting and annealing), and the effect of the anneal ambient gas and the capping layer during annealing were studied. MOSFETs of various geometries with a gate oxide of ~20 nm were fabricated by a CMOS process incorporating the addition of a thin epitaxial Si layer. A general evaluation of each transistor was conducted by studying its static characteristics. The interface states, bulk traps, and carrier generation phenomena were studied. Good-quality interfaces were obtained. Better implantation control reduced contamination and suppressed deep traps below the detection limit. Multiple implantation resulted in superior material quality, as evidenced by very long generation lifetime values (> 100 μs).