Found 20 similar documents (search time: 31 ms)
1.
Prosodic and other Long-Term Features for Speaker Diarization (total citations: 1; self-citations: 0; citations by others: 1)
《IEEE transactions on audio, speech, and language processing》2009,17(5):985-993
2.
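Speaker Segmentation and Clustering Algorithm Using EHMM and CLR (total citations: 1; self-citations: 0; citations by others: 1)
In traditional speaker segmentation and clustering systems, insufficient speaker information at clustering time degrades segmentation accuracy. To address this problem, the paper proposes a multi-level speaker segmentation and clustering algorithm based on the evolutive hidden Markov model (EHMM) and the cross log-likelihood ratio (CLR) distance measure. Building on the traditional algorithm, it introduces re-segmentation and re-clustering mechanisms, together with a hierarchical clustering algorithm based on distance measures and the Bayesian information criterion (BIC), which effectively relieves the constraint that limited speaker information places on segmentation accuracy. Experiments on the NIST (National Institute of Standards and Technology) 2003 Spring RT database show that the proposed algorithm improves system performance by 41% relative to the traditional algorithm.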
3.
4.
5.
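李敬阳 李锐 王莉 王晓笛 《数据采集与处理》2017,32(1):54-61
Speaker clustering is an important step in speaker diarization. However, traditional hierarchical clustering that uses the Bayesian information criterion (BIC) as its distance measure lets clustering errors propagate upward through the hierarchy. This paper proposes a stage-wise enhancement mechanism. When the minimum BIC distance between segments exceeds a set threshold, or when the number of clusters reaches a certain level, the current clustering result is taken as the initial cluster centers and the segments in each cluster are refined by variational Bayes iteration. Finally, the number of speakers is determined by a probabilistic linear discriminant analysis (PLDA) score threshold. Experiments on the NIST 08 summed test set show that the method improves both "cluster purity" and "speaker purity" over the traditional algorithm, and improves overall speaker diarization performance by 27.6% relative.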
6.
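As the volume of audio data keeps growing, speaker recognition has become increasingly difficult. This paper proposes a novel method that builds on an existing speaker recognition system (a GMM-UBM system) and combines Index and Simulation to greatly increase the speed of speaker recognition at very low cost, making speaker search feasible. Specifically, it uses a two-pass search strategy: first, an index is built and the Euclidean distances between indexes are compared in the index space to roughly select a number of candidate target speakers; then, a finer Simulation model match over those candidates finds the best recognition result. Experimental results show that the method significantly speeds up speaker recognition at very low cost.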
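The two-pass strategy in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the index vectors, the `fine_score` callback, and the toy univariate Gaussian scorer are all hypothetical stand-ins, since the abstract does not specify how the index or the fine model match is built.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length index vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def two_pass_search(query_index, query_features, speakers, fine_score,
                    shortlist_size=2):
    """Coarse-to-fine speaker search.

    speakers maps a speaker name to (index_vector, model).
    fine_score(features, model) is a hypothetical, more expensive scorer
    where a higher value means a better match.
    """
    # Pass 1: cheap ranking by Euclidean distance in the index space.
    ranked = sorted(speakers,
                    key=lambda name: euclidean(query_index, speakers[name][0]))
    shortlist = ranked[:shortlist_size]
    # Pass 2: expensive scoring, but only for the shortlisted candidates.
    best = max(shortlist,
               key=lambda name: fine_score(query_features, speakers[name][1]))
    return best, shortlist


def gaussian_avg_loglik(features, model):
    """Toy fine scorer (assumption): mean log-likelihood under N(mean, var)."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in features) / len(features)


# Illustrative data only: three enrolled speakers with made-up indexes/models.
enrolled = {
    "A": ([0.0, 0.0], (0.0, 1.0)),
    "B": ([1.0, 1.0], (1.0, 1.0)),
    "C": ([5.0, 5.0], (5.0, 1.0)),
}
winner, candidates = two_pass_search([0.9, 0.8], [0.9, 1.1, 1.0],
                                     enrolled, gaussian_avg_loglik)
```

The point of the design is that the expensive scorer runs on `shortlist_size` candidates rather than on every enrolled speaker, at the risk of the coarse pass discarding the true speaker.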
7.
Anguera X. Wooters C. Hernando J. 《IEEE transactions on audio, speech, and language processing》2007,15(7):2011-2022
When performing speaker diarization on recordings from meetings, multiple microphones of different qualities are usually available and distributed around the meeting room. Although several approaches have been proposed in recent years to take advantage of multiple microphones, they are either too computationally expensive and not easily scalable or they cannot outperform the simpler case of using the best single microphone. In this paper, the use of classic acoustic beamforming techniques is proposed together with several novel algorithms to create a complete frontend for speaker diarization in the meeting room domain. New techniques we are presenting include blind reference-channel selection, two-step time delay of arrival (TDOA) Viterbi postprocessing, and a dynamic output signal weighting algorithm, together with using such TDOA values in the diarization to complement the acoustic information. Tests on speaker diarization show a 25% relative improvement on the test set compared to using a single most centrally located microphone. Additional experimental results show improvements using these techniques in a speech recognition task.
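As a rough illustration of the classic beamforming idea the abstract above builds on, the following Python sketch estimates a per-channel delay by plain time-domain cross-correlation and then averages the aligned channels (delay-and-sum). It is a simplified sketch under stated assumptions: the reference channel is fixed rather than blindly selected, all channels get equal weight, and there is no Viterbi postprocessing, so it does not reproduce the paper's novel components. The function names are hypothetical.

```python
def estimate_delay(reference, signal, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation
    between `reference` and `signal`; a positive lag means `signal`
    arrives later than `reference`."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(reference):
            m = n + lag
            if 0 <= m < len(signal):
                score += r * signal[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_and_sum(channels, reference_index=0, max_lag=5):
    """Align every channel to the reference and average them.

    Assumptions: fixed reference channel and equal channel weights.
    """
    reference = channels[reference_index]
    lags = [estimate_delay(reference, ch, max_lag) for ch in channels]
    output = []
    for n in range(len(reference)):
        total, count = 0.0, 0
        for ch, lag in zip(channels, lags):
            m = n + lag
            if 0 <= m < len(ch):  # skip samples shifted out of range
                total += ch[m]
                count += 1
        output.append(total / count)
    return output, lags


# Illustrative data: the second channel is the first one delayed by 2 samples.
pulse = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
delayed = [0, 0, 0, 0, 1, 2, 1, 0, 0, 0]
enhanced, tdoa = delay_and_sum([pulse, delayed], max_lag=3)
```

The estimated lags play the role of TDOA values; in the paper these are additionally fed to the diarization stage as a complementary feature.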
8.
《Multimedia, IEEE Transactions on》2009,11(4):658-669
9.
10.
Nikolaos Sarafianos Theodoros Giannakopoulos Sergios Petridis 《Multimedia Tools and Applications》2016,75(1):115-130
Speaker diarization aims to automatically answer the question “who spoke when” given a speech signal. In this work, we have focused on applying the FLsD approach, a semi-supervised version of Fisher Linear Discriminant analysis, both in the audio and the video signals to form a complete multimodal speaker diarization system. Extensive experiments have proven that the FLsD method boosts the performance of the face diarization task (i.e. the task of discovering faces over time given only the visual signal). In addition, we have proven through experimentation that applying the FLsD method for discriminating between faces is also independent of the initial feature space and remains relatively unaffected as the number of faces increases. Finally, a fusion method is proposed that leads to performance improvement in comparison to the best individual modality, which is the audio signal.
11.
12.
《Digital Signal Processing》2000,10(1-3):93-112
Dunn, Robert B., Reynolds, Douglas A., and Quatieri, Thomas F., Approaches to Speaker Detection and Tracking in Conversational Speech, Digital Signal Processing 10 (2000), 93–112. Two approaches to detecting and tracking speakers in multispeaker audio are described. Both approaches use an adapted Gaussian mixture model, universal background model (GMM-UBM) speaker detection system as the core speaker recognition engine. In one approach, the individual log-likelihood ratio scores, which are produced on a frame-by-frame basis by the GMM-UBM system, are used to first partition the speech file into speaker homogeneous regions and then to create scores for these regions. We refer to this approach as internal segmentation. Another approach uses an external segmentation algorithm, based on blind clustering, to partition the speech file into speaker homogeneous regions. The adapted GMM-UBM system then scores each of these regions as in the single-speaker recognition case. We show that the external segmentation system outperforms the internal segmentation system for both detection and tracking. In addition, we show how different components of the detection and tracking algorithms contribute to the overall system performance.
13.
Ananth N. Iyer Uchechukwu O. Ofoegbu Robert E. Yantorno Brett Y. Smolenski 《International Journal of Speech Technology》2007,10(2-3):95-107
Speaker discrimination is a vital aspect of speaker recognition applications such as speaker identification, verification, clustering, indexing and change-point detection. These tasks are usually performed using distance-based approaches to compare speaker models or features from homogeneous speaker segments in order to determine whether or not they belong to the same speaker. Several distance measures and features have been examined for all the different applications; however, no single distance or feature has been reported to perform optimally for all applications in all conditions. In this paper, a thorough analysis is made to determine the behavior of some frequently used distance measures, as well as features, in distinguishing speakers for different data lengths. Measures studied include the Mahalanobis distance, Kullback-Leibler (KL) distance, T² statistic, Hellinger distance, Bhattacharyya distance, Generalized Likelihood Ratio (GLR), Levenne distance, and L2 and L∞ distances. The Mel-Scale Frequency Cepstral Coefficient (MFCC), Linear Predictive Cepstral Coefficients (LPCC), Line Spectral Pairs (LSP) and the Log Area Ratios (LAR) comprise the features investigated. The usefulness of these measures is studied for different data lengths. Generally, a larger data size for each speaker results in better speaker differentiating capability, as more information can be taken into account. However, in some applications such as segmentation of telephone data, speakers change frequently, making it impossible to obtain large speaker-consistent utterances (especially when speaker change-points are unknown). A metric is defined for determining the probability of speaker discrimination error obtainable for each distance measure using each feature set, and the effect of data size on this probability is observed.
Furthermore, simple distance-based speaker identification and clustering systems are developed, and the performances of each distance and feature for various data sizes are evaluated on these systems in order to illustrate the importance of choosing the appropriate distance and feature for each application. Results show that for tasks which do not involve any limitation of data length, such as speaker identification, the Kullback-Leibler distance with the MFCCs yields the highest speaker differentiation performance, which is comparable to results obtained using more complex state-of-the-art speaker identification systems. Results also indicate that the Hellinger and Bhattacharyya distances with the LSPs yield the best performance for small data sizes.
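To make a few of the distance measures named in the abstract above concrete, here is a minimal Python sketch of the KL, Bhattacharyya and Hellinger distances between two Gaussians. As a loud simplification, it assumes univariate Gaussians fitted to one-dimensional features, whereas the paper compares multivariate feature sets (MFCC, LPCC, LSP, LAR); the function names are hypothetical.

```python
import math


def fit_gaussian(samples):
    """Maximum-likelihood mean and variance of a 1-D sample list."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var


def kl_divergence(p, q):
    """KL(p || q) for univariate Gaussians given as (mean, variance)."""
    (mp, vp), (mq, vq) = p, q
    return 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)


def symmetric_kl(p, q):
    """Symmetrized KL, often used so the measure behaves like a distance."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def bhattacharyya(p, q):
    """Bhattacharyya distance for univariate Gaussians (mean, variance)."""
    (mp, vp), (mq, vq) = p, q
    mean_term = 0.25 * (mp - mq) ** 2 / (vp + vq)
    var_term = 0.5 * math.log((vp + vq) / (2.0 * math.sqrt(vp * vq)))
    return mean_term + var_term


def hellinger(p, q):
    """Hellinger distance, via H^2 = 1 - exp(-Bhattacharyya distance)."""
    squared = 1.0 - math.exp(-bhattacharyya(p, q))
    return math.sqrt(max(0.0, squared))  # guard against rounding below zero


# Illustrative segments only: two short made-up 1-D feature sequences.
segment_a = fit_gaussian([0.1, -0.2, 0.0, 0.3, -0.1])
segment_b = fit_gaussian([1.2, 0.9, 1.1, 0.8, 1.0])
distances = {
    "kl": symmetric_kl(segment_a, segment_b),
    "bhattacharyya": bhattacharyya(segment_a, segment_b),
    "hellinger": hellinger(segment_a, segment_b),
}
```

Note that the Hellinger distance is bounded by 1 while the KL distance is unbounded, which is one reason the measures can behave differently when the variance estimates come from very short segments.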
14.
《Multimedia, IEEE Transactions on》2008,10(8):1541-1552
15.
Barrington L. Chan A. B. Lanckriet G. 《IEEE transactions on audio, speech, and language processing》2010,18(3):602-612
16.
Wichern G. Xue J. Thornburg H. Mechtley B. Spanias A. 《IEEE transactions on audio, speech, and language processing》2010,18(3):688-707
17.
《Digital Signal Processing》2000,10(1-3):113-132
Koolwaaij, Johan, and Boves, Lou, Local Normalization and Delayed Decision Making in Speaker Detection and Tracking, Digital Signal Processing 10 (2000), 113–132. This paper describes A2RT's speaker detection and tracking system and its performance on the 1999 NIST speaker recognition evaluation data. The system does not consist of concatenated modules such as, for instance, silence–speech detection, handset and gender detection, and finally speaker detection or tracking, where each module builds on the hard decisions from previous modules, but rather applies the principle of delayed decision making and postpones all hard decisions until the final stage of the detection process. This paper focuses on two important locality issues in detecting or tracking speakers in a telephone conversation, for which the speaker change frequency is usually high. First, channel estimation needs sufficiently long but homogeneous segments. Several kinds of local channel normalization are compared in this paper. Second, local estimation of speaker likelihoods critically depends on the segmentation of the conversation. Our experiments show that a global level of segmentation really improves speaker tracking performance, whereas a more detailed segmentation is needed for speaker detection, because likelihood computation over clusters of segments depends on the purity of the segments. Furthermore, choosing the appropriate type of channel normalization can give a small but consistent improvement in speaker tracking performance.
18.
19.
20.
Kotti M. Benetos E. Kotropoulos C. 《IEEE transactions on audio, speech, and language processing》2008,16(5):920-933
An algorithm for automatic speaker segmentation based on the Bayesian information criterion (BIC) is presented. BIC tests are not performed for every window shift, as previously, but when a speaker change is most probable to occur. This is done by estimating the next probable change point thanks to a model of utterance durations. It is found that the inverse Gaussian fits best the distribution of utterance durations. As a result, fewer BIC tests are needed, making the proposed system less computationally demanding in time and memory, and considerably more efficient with respect to missed speaker change points. A feature selection algorithm based on branch and bound search strategy is applied in order to identify the most efficient features for speaker segmentation. Furthermore, a new theoretical formulation of BIC is derived by applying centering and simultaneous diagonalization. This formulation is considerably more computationally efficient than the standard BIC, when the covariance matrices are estimated by other estimators than the usual maximum-likelihood ones. Two commonly used pairs of figures of merit are employed and their relationship is established. Computational efficiency is achieved through the speaker utterance modeling, whereas robustness is achieved by feature selection and application of BIC tests at appropriately selected time instants. Experimental results indicate that the proposed modifications yield a superior performance compared to existing approaches.
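For orientation, the standard BIC change-point test that the abstract above takes as its starting point can be sketched in a few lines of Python. This is the exhaustive baseline (one test per candidate split), not the paper's method: it omits the utterance-duration model, the feature selection, and the reformulated BIC. It also assumes one-dimensional features with a single Gaussian per segment, and the variance floor and function names are assumptions made for the sketch.

```python
import math


def variance(values):
    """Maximum-likelihood variance, floored to keep the logarithm finite."""
    n = len(values)
    mean = sum(values) / n
    return max(sum((x - mean) ** 2 for x in values) / n, 1e-12)


def delta_bic(window, split, penalty_weight=1.0):
    """Delta-BIC for splitting `window` at index `split`.

    Positive values favor two separate Gaussians (a speaker change)
    over a single Gaussian for the whole window.
    """
    left, right = window[:split], window[split:]
    n, n1, n2 = len(window), len(left), len(right)
    gain = 0.5 * (n * math.log(variance(window))
                  - n1 * math.log(variance(left))
                  - n2 * math.log(variance(right)))
    # A 1-D Gaussian has two free parameters (mean and variance), and the
    # split model has one Gaussian more than the unsplit model.
    penalty = penalty_weight * 0.5 * 2 * math.log(n)
    return gain - penalty


def best_change_point(window, min_segment=3, penalty_weight=1.0):
    """Test every admissible split; return (index or None, best Delta-BIC)."""
    candidates = range(min_segment, len(window) - min_segment + 1)
    score, split = max((delta_bic(window, s, penalty_weight), s)
                       for s in candidates)
    return (split, score) if score > 0 else (None, score)


# Illustrative data: the feature level jumps after the fifth frame.
frames = [0.1, -0.1, 0.0, 0.2, -0.2, 5.1, 4.9, 5.0, 5.2, 4.8]
change_index, change_score = best_change_point(frames)
```

Running one test per candidate split is what makes the baseline expensive on long recordings; the abstract's proposal is to test only where a duration model predicts a change is likely.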