Similar literature
 Found 20 similar documents (search time: 15 ms)
1.
Several algorithms have been developed for tracking formant frequency trajectories of speech signals; however, most of these algorithms are either not robust in real-life noise environments or not suitable for real-time implementation. The algorithm presented in this paper obtains formant frequency estimates from voiced segments of continuous speech by using a time-varying adaptive filterbank to track individual formant frequencies. The formant tracker incorporates an adaptive voicing detector and a gender detector for formant extraction from continuous speech, for both male and female speakers. The algorithm has a low signal delay and provides smooth and accurate estimates for the first four formant frequencies at moderate and high signal-to-noise ratios. Thorough testing of the algorithm has shown that it is robust over a wide range of signal-to-noise ratios for various types of background noise.

2.
3.
4.
In this paper, we provide a comparative study of spectral front-end features used as representations for speech signals by processing multitaper magnitude and phase spectra, for speaker verification with expressive speech. In particular, the multitaper modified group delay function (MT-MOGDF) and multitaper magnitude (MT-MAG) spectra of the speech signals are employed to obtain low-variance estimates of speech spectra. We observe that the cues that aid in the representation of expressive speech are more evident in the MT-MOGDF spectrum than in the MT-MAG spectrum in terms of mean formant value and formant bandwidth. Our extensive experimental study on a speaker verification system with a Gaussian mixture model based universal background model classifier on expressive speech, using the IITKGP-SESC and EMODB databases, shows that MT-MOGDF performs better than the MT-MAG technique in terms of equal error rate and minimum decision cost function. This improvement is attributed to a better representation and a low-variance estimate of the speech spectrum. Our results highlight the utility of MT-MOGDF as a potential alternative to the MT-MAG representation for speaker verification problems in general.

5.
The fine spectral structure related to pitch information is conveyed in Mel cepstral features, with variations in pitch causing variations in the features. For speaker recognition systems, this phenomenon, known as "pitch mismatch" between training and testing, can increase error rates. Likewise, pitch-related variability may increase error rates in speech recognition systems for languages such as English in which pitch does not carry phonetic information. In addition, for both speech recognition and speaker recognition systems, the parsing of the raw speech signal into frames is traditionally performed using a constant frame size and a constant frame offset, without aligning the frames to the natural pitch cycles. As a result, the power spectral estimation that is done as part of the Mel cepstral computation may include artifacts. Pitch synchronous methods have addressed this problem in the past, at the expense of adding some complexity by using a variable frame size and/or offset. This paper introduces Pseudo Pitch Synchronous (PPS) signal processing procedures that attempt to align each individual frame to its natural cycle and avoid truncation of pitch cycles while still using a constant frame size and frame offset, in an effort to address the above problems. Text-independent speaker recognition experiments performed on NIST speaker recognition tasks demonstrate a performance improvement when the scores produced by systems using PPS are fused with traditional speaker recognition scores. In addition, a better distribution of errors across trials may be obtained for similar error rates, and some insight regarding the role of the fundamental frequency in speaker recognition is revealed. Speech recognition experiments run on the Aurora-2 noisy digits task also show improved robustness and better accuracy for extremely low signal-to-noise ratio (SNR) data.

6.
We study trust management, and more precisely reputation management and revocation of malicious nodes, in the context of ad hoc networks used for emergency communications. Unlike in centralized systems, reputation management and revocation in ad hoc networks are non-trivial. The difficulty is that the nodes have to collaboratively calculate the reputation value of a particular node and then revoke the node if the reputation value falls below a threshold. A major challenge in this scheme is to prevent a malicious node from discrediting other genuine nodes. The decision to revoke a node has to be communicated to all the nodes of the network. In traditional ad hoc networks the overhead of broadcasting the message throughout the network may be very high. We solve the problem of reputation management and node revocation in ad hoc networks of cell phones by using a threshold cryptography based scheme. Each node of the network has a set of anonymous referees, which store the reputation information of the node and issue timestamped reputation certificates to it. The misbehavior of a particular cell phone is reported to its anonymous referees, who issue certificates that reflect the positive and negative recommendations.

7.
8.
Speaker recognition is a major challenge for researchers in various languages. For an automatic speaker recognition system trained on normal speech, shouting creates a mismatch between enrollment and test, which reduces identification performance because shouting requires extreme vocal effort. The main problems are the long time needed to classify the data, limited accuracy, and the difficulty of achieving a low root-mean-square error. The objective of this work is to develop an efficient speaker recognition system. In this work, an improved Wiener filter algorithm is applied for better noise reduction. To obtain the essential feature vectors, Mel-frequency cepstral coefficient feature extraction is applied to the noise-removed signals. Input samples are then created from these extracted features after their dimensionality has been reduced using probabilistic principal component analysis. Finally, a recurrent neural network with bidirectional long short-term memory is used for classification to improve prediction accuracy. To check its effectiveness, the proposed work is compared with existing methods on accuracy, sensitivity, and error rate. The results obtained with the proposed method demonstrate an accuracy of 95.77%.

9.
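In emergency scenarios, organizing the evacuation of people from subway stations properly is an effective way to reduce casualties and property losses. This paper reviews the problem of emergency evacuation in subway stations from three perspectives: simulation studies of evacuation, evacuation models under different rules, and the application of intelligent algorithms to evacuation modeling. It also analyzes the shortcomings of current evacuation models, solution algorithms, and pedestrian characteristic data. The paper argues for the value of introducing pedestrians' psychological and behavioral characteristics and building facility factors into the subway station evacuation problem, and of using an improved ant colony algorithm to find the optimal evacuation path. Finally, it looks ahead to future research trends in emergency evacuation of subway stations.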

10.
This paper presents the Irish Political Speech Database, an English-language database collected from Irish political recordings. The database is collected with automated indexing and content retrieval in mind, and is therefore gathered from real-world recordings (such as television interviews and election rallies) that represent the nature and quality of recordings encountered in practical applications. The database is labelled for six speaker attributes: boring, charismatic, enthusiastic, inspiring, likeable, and persuasive. Each of these traits is linked to the perceived ability or appeal of the speaker, and as such is relevant to a range of content retrieval and speech analysis tasks. The six base attributes are combined to form a metric of Overall Speaker Appeal. A set of baseline experiments is presented, which demonstrates the potential of this database for affective computing studies. Classification accuracies of up to 76% are achieved, with little feature or system optimisation.

11.
This paper presents a new feature extraction technique for speaker recognition using the Radon transform (RT) and the discrete cosine transform (DCT). The spectrogram is a compact, efficient representation that carries information about acoustic features in the form of a pattern. In the proposed method, speaker-specific features are extracted by applying image processing techniques to the pattern available in the spectrogram. The Radon transform is used to derive effective acoustic features from the speech spectrogram: it adds up the pixel values in the given image along a straight line in a particular direction and at a specific displacement. The proposed technique computes Radon projections for seven orientations and captures the acoustic characteristics of the spectrogram. The DCT applied to the Radon projections yields a low-dimensional feature vector. The technique is computationally efficient, text-independent, robust to session variations and insensitive to additive noise. The performance of the proposed algorithm has been evaluated using the Texas Instruments and Massachusetts Institute of Technology (TIMIT) database and the authors' own Shri Guru Gobind Singhji (SGGS) database. The recognition rate of the proposed algorithm is 96.69% on the TIMIT database (630 speakers) and 98.41% on the SGGS database (151 speakers). These results highlight the superiority of the proposed method over some of the existing algorithms.
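
The Radon-plus-DCT idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-bin projection, the choice of seven evenly spaced angles over 0 to 180 degrees, the number of retained coefficients, and all function names (`radon_projection`, `dct2`, `radon_dct_features`) are assumptions made for illustration, and the input is a plain 2-D list standing in for a spectrogram image.

```python
import math


def radon_projection(image, angle_deg):
    """Sum pixel values along parallel lines at one orientation.

    Nearest-bin approximation (an assumption): each pixel is assigned to the
    bin of its signed displacement from the image centre.
    """
    height, width = len(image), len(image[0])
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    # Enough bins to cover the image diagonal at any angle.
    n_bins = int(math.ceil(math.hypot(height, width))) + 1
    offset = (n_bins - 1) / 2.0
    projection = [0.0] * n_bins
    for r in range(height):
        for c in range(width):
            # Displacement of this pixel along the projection axis.
            s = (c - cx) * cos_t + (r - cy) * sin_t
            projection[int(math.floor(s + offset + 0.5))] += image[r][c]
    return projection


def dct2(signal):
    """Unnormalised DCT-II of a 1-D sequence."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]


def radon_dct_features(image, n_angles=7, n_coeffs=4):
    """Concatenate the first DCT coefficients of each Radon projection.

    Evenly spaced angles and n_coeffs=4 are hypothetical choices.
    """
    features = []
    for a in range(n_angles):
        angle = 180.0 * a / n_angles
        features.extend(dct2(radon_projection(image, angle))[:n_coeffs])
    return features
```

Keeping only the leading DCT coefficients of each projection is what makes the feature vector low-dimensional; a real system would likely use an interpolating Radon transform from an image processing library rather than this nearest-bin version.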

12.
Multimedia Tools and Applications - Smartphones are evolving in various ways, and recording technology has also developed. However, most research is the study of the Call Record and high quality....

13.

In emergency situations, actions that save lives and limit the impact of hazards are crucial. In order to act, situational awareness is needed to decide what to do. Geolocalized photos and video of situations as they evolve can be crucial for understanding them better and making decisions faster. Cameras are almost everywhere these days, whether in smartphones, installed CCTV systems, UAVs or other devices. However, this poses challenges of big data and information overload. Moreover, most of the time there is no disaster at any given location, so humans trying to detect sudden situations may not be as alert as needed at any point in time. Consequently, computer vision tools can be an excellent decision support. The range of emergencies for which computer vision tools have been considered or used is very wide, and there is a great overlap across related emergency research. Researchers tend to focus on state-of-the-art systems that cover the same emergency they are studying, overlooking important research in other fields. To unveil this overlap, the survey is divided along four main axes: the types of emergencies that have been studied in computer vision, the objectives that the algorithms can address, the type of hardware needed, and the algorithms used. This review therefore provides a broad overview of the progress of computer vision across all sorts of emergencies.


14.
15.
The recognition of a person by his or her voice, or “speaker recognition”, is a biometric specialty increasingly used in electronic commerce, electronic banking transactions and forensic investigations, among others. Speaker recognition relies on the discriminative information contained in a person's speech, and its main challenge is “session variability”: the variability that exists between different speech samples of the same person used for training and evaluation. When speech is transmitted over the internet, for example, the coding-decoding process (“codec”) causes loss of such information and reduces the effectiveness of speaker recognition. Some methods have been proposed to mitigate this effect. This work studies how strongly this information is affected by some commonly used codec types and proposes a solution to compensate for the session variability introduced by the codec. The influence of some types of codec on sample quality was evaluated first with a set of synthesized speech samples. Experiments were then carried out with speech samples from international competitions, retransmitted over two different codecs, and the effect on speaker recognition effectiveness was checked. Finally, the variability compensation was applied, improving recognition effectiveness, measured by the equal error rate, by 20.8% for the g.722 codec and 27.8% for the gsm 6.20 codec.
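
The equal error rate (EER) used as the effectiveness measure here is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how it can be estimated from verification scores; the function name `equal_error_rate`, the threshold sweep over observed scores, and the example scores are assumptions for illustration and do not come from the paper.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER by sweeping a decision threshold over all observed scores.

    A trial is accepted when its score is >= threshold (an assumed convention).
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False acceptance rate: impostor trials wrongly accepted.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: genuine trials wrongly rejected.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Hypothetical scores: one impostor (0.6) outscores one genuine trial (0.4).
print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))  # 0.25
```

With few trials the two error curves rarely cross exactly, which is why the sketch averages FAR and FRR at the closest point; evaluation toolkits typically interpolate instead.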

16.

In this paper, we propose a hybrid speech enhancement system that exploits deep neural networks (DNNs) for speech reconstruction and Kalman filtering for further denoising, with the aim of improving performance under unseen noise conditions. First, two separate DNNs are trained to learn the mapping from noisy acoustic features to the clean speech magnitudes and line spectrum frequencies (LSFs), respectively. The estimated clean magnitudes are then combined with the phase of the noisy speech to reconstruct the estimated clean speech, while the LSFs are converted to linear prediction coefficients (LPCs) to implement Kalman filtering. Finally, the reconstructed speech is Kalman-filtered to further remove residual noise. The proposed hybrid system takes advantage of both DNN-based reconstruction and traditional Kalman filtering, and can work reliably in either matched or unmatched acoustic environments. Computer-based experiments are conducted to evaluate the proposed hybrid system against traditional iterative Kalman filtering and several state-of-the-art DNN-based methods under both seen and unseen noise. Compared with the DNN-based methods, the hybrid system achieves similar performance under seen noise but notably better performance under unseen noise, in terms of both speech quality and intelligibility.
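
The reconstruction step in this abstract, pairing estimated clean magnitudes with the phase of the noisy speech, can be sketched for a single spectral frame as follows. This is only an illustration under assumptions: the function name `combine_magnitude_with_noisy_phase` and the example values are hypothetical, and the DNNs, the LSF-to-LPC conversion and the Kalman filter are not shown.

```python
import cmath


def combine_magnitude_with_noisy_phase(estimated_magnitudes, noisy_spectrum):
    """Build an enhanced complex spectrum for one frame.

    Each output bin takes its magnitude from the (assumed) DNN estimate and
    its phase from the corresponding bin of the noisy spectrum.
    """
    return [
        cmath.rect(magnitude, cmath.phase(noisy_bin))
        for magnitude, noisy_bin in zip(estimated_magnitudes, noisy_spectrum)
    ]


# Hypothetical two-bin frame: the phases (pi/4 and pi) are kept,
# the magnitudes are replaced by 2.0 and 1.0.
enhanced = combine_magnitude_with_noisy_phase([2.0, 1.0], [1 + 1j, -2 + 0j])
```

An inverse short-time Fourier transform over all frames would then give the time-domain signal that is passed on to the Kalman filter.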


17.
Automatic Speaker Recognition (ASR) refers to the task of identifying a person from his or her voice with the help of machines. ASR has potential applications in telephone-based financial transactions, credit card purchases, forensic science, and social anthropology for the study of different cultures and languages. ASR results are highly dependent on the database, i.e., the results obtained are meaningless if the recording conditions are not known. This paper describes a methodology and a typical experimental setup used to develop corpora for various text-independent speaker identification tasks in different Indian languages, viz., Marathi, Hindi, Urdu and Oriya. Finally, an ASR system is presented to evaluate the corpora.

18.
The issue of input variability resulting from speaker changes is one of the most crucial factors influencing the effectiveness of speech recognition systems. A solution to this problem is adaptation or normalization of the input before recognition, so that all the parameters of the input representation are adapted to those of a single speaker and the input pattern is normalized against speaker changes. This paper proposes three such methods, in which some effects of speaker changes on the speech recognition process are compensated. In all three methods, a feed-forward neural network is first trained to map the input into codes representing the phonetic classes and speakers. Then, among the 71 speakers used in training, the one with the highest phone recognition accuracy is selected as the reference speaker, so that the representation parameters of the other speakers are converted to the corresponding speech uttered by that speaker. In the first method, the error back-propagation algorithm is used to find the optimal point of every decision region relating to each phone of each speaker in the input space, for all the phones and all the speakers. The distances between these points and the corresponding points of the reference speaker are used to offset the effects of speaker changes and to adapt the input signal to the reference speaker. In the second method, using the error back-propagation algorithm and keeping the reference speaker data as the desired speaker output, we correct all the speech signal frames, i.e., the training and test datasets, so that they coincide with the corresponding speech of the reference speaker. In the third method, another feed-forward neural network is applied inversely to map the phonetic classes and speaker information to the input representation. 
The phonetic output retrieved from the direct network, along with the reference speaker data, is given to the inverse network. Using this information, the inverse network yields an estimate of the input representation adapted to the reference speaker. In all three methods, the final speech recognition model is trained on the adapted training data and tested on the adapted test data. Implementing these methods and combining the final network results with the un-adapted network based on the highest confidence level gives increases of 2.1%, 2.6% and 3% in phone recognition accuracy on clean speech for the three methods, respectively.

19.
Recently, the phrase “Edge of Innovation” has come into common use, driven by the deployment of smartphones and various hand-held devices, including wearable devices. In most cases these devices can easily communicate with other people or devices, and most modern IT services are converged services. This is now common even in disaster and safety management. Emergencies generally involve extreme conditions such as severe congestion, bottlenecks and damaged infrastructure. Nonetheless, the infrastructure should still support communication for users in these situations. Here, P2P communication and networking can be one of the best alternatives, and a P2P method based on device location would be better still. However, such solutions are still under development and the technology is at an early stage, especially for measuring indoor location. We therefore propose an efficient peer-to-peer context-aware data forwarding scheme based on device location. The proposed P2P scheme has two operation modes: a normal mode and an emergency mode. In normal mode the scheme is almost the same as existing P2P communication and networking schemes, except that it is based on synchronization between peers when they need to communicate with each other. In addition, when it is aware of the P2P context, the scheme dynamically assigns bandwidth to users according to traffic type, with the aim of increasing user throughput and guaranteeing a minimum user throughput. 
In emergency mode, on the other hand, the proposed P2P scheme operates not only to find the best path but also to transfer emergency messages, and the operation mode changes automatically to emergency mode when the scheme detects an emergency through P2P context awareness. With the proposed scheme, users can communicate with other people and relay messages to the outside in emergency mode. Simulation results show that the proposed P2P scheme outperforms the legacy scheme in various aspects.

20.