20 similar documents found; search took 15 ms.
1.
This paper presents a novel over-sampling method based on document content to handle the class imbalance problem in text classification. The new technique, COS-HMM (Content-based Over-Sampling HMM), includes an HMM that is trained with a corpus in order to create new samples according to current documents. The HMM is treated as a document generator that can produce synthetic instances modeled on its training data. To demonstrate its effectiveness, COS-HMM is tested with a Support Vector Machine (SVM) on two medical document corpora (OHSUMED and TREC Genomics), and is then compared with the Random Over-Sampling (ROS) and SMOTE techniques. Results suggest that applying over-sampling strategies increases the global performance of the SVM in classifying documents. Based on the empirical and statistical studies, the new method clearly outperforms the baseline method (ROS), and offers greater performance than SMOTE in the majority of tested cases.
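The ROS baseline the paper compares against simply duplicates minority-class documents until the classes are balanced. A minimal sketch of that baseline (not of COS-HMM itself, whose HMM generator is not specified here):

```python
import random

def random_over_sample(docs, labels, seed=0):
    """Random Over-Sampling (ROS): duplicate minority-class samples
    at random until every class matches the majority-class count."""
    rng = random.Random(seed)
    by_class = {}
    for d, y in zip(docs, labels):
        by_class.setdefault(y, []).append(d)
    target = max(len(group) for group in by_class.values())
    out_docs, out_labels = [], []
    for y, group in by_class.items():
        # draw extra duplicates from the existing samples of this class
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for d in group + extra:
            out_docs.append(d)
            out_labels.append(y)
    return out_docs, out_labels
```

COS-HMM replaces the duplication step with samples drawn from a trained HMM, so the synthetic minority documents are new rather than exact copies.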
2.
Julián D. Arias-Londoño, Juan I. Godino-Llorente, Nicolás Sáenz-Lechón, Germán Castellanos-Domínguez 《Pattern recognition》2010, 43(9): 3100-3112
This paper presents a new feature transformation technique applied to improve the screening accuracy for the automatic detection of pathological voices. The statistical transformation is based on Hidden Markov Models, obtaining a transformation and classification stage simultaneously and adjusting the parameters of the model with a criterion that minimizes the classification error. The original feature vectors are built up using classic short-term noise parameters and mel-frequency cepstral coefficients. With respect to conventional approaches found in the literature on automatic detection of pathological voices, the proposed feature space transformation technique demonstrates a significant improvement in performance without adding new features to the original input space. In view of the results, it is expected that this technique could also provide good results in other areas such as speaker verification and/or identification.
3.
Text-to-speech conversion is a key technology for human-computer interaction. Current HMM-based speech synthesis systems can already synthesize speech of fairly high naturalness and intelligibility, but compared with natural speech the prosodic rhythm is weak, mainly because of the influence of duration. This paper proposes to jointly optimize the likelihoods of three model levels (state, phone, and syllable) when generating state durations, and to reduce state-duration errors during re-estimation by taking both state-level and long-term duration information into account. Experiments on a Mandarin corpus show that the optimized duration model produces more accurate state durations: compared with the state-level baseline system, the root mean square error improved from 19.90 to 17.45. Subjective evaluation also shows that the improved model outperforms the baseline.
4.
Julian Fierrez, Javier Ortega-Garcia, Daniel Ramos, Joaquin Gonzalez-Rodriguez 《Pattern recognition letters》2007, 28(16): 2325-2334
A function-based approach to on-line signature verification is presented. The system uses a set of time sequences and Hidden Markov Models (HMMs). Development and evaluation experiments are reported on a subcorpus of the MCYT bimodal biometric database comprising more than 7000 signatures from 145 subjects. The system is compared to other state-of-the-art systems based on the results of the First International Signature Verification Competition (SVC 2004). A number of practical findings related to feature extraction and modeling are obtained.
5.
Traditional approaches for text data stream classification usually require the manual labeling of a number of documents, which is an expensive and time-consuming process. In this paper, to overcome this limitation, we propose to classify text streams by keywords without labeled documents, so as to reduce the burden of manual labeling. We build our base text classifiers with the help of keywords and unlabeled documents to classify text streams, and utilize classifier ensemble algorithms to cope with concept drift in text data streams. Experimental results demonstrate that the proposed method can build good classifiers from keywords without manual labeling, and that when the ensemble-based algorithm is used, concept drift in the streams can be well detected and adapted to, performing better than the single-window algorithm.
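The core idea can be illustrated with a toy sketch: a weak base classifier seeded only by per-class keyword sets, combined by majority vote across classifiers built on successive stream chunks. This is an illustrative simplification, not the paper's exact algorithm:

```python
from collections import Counter

def keyword_classifier(keywords):
    """Build a weak base classifier from per-class keyword sets:
    a document is assigned to the class whose keywords it matches most."""
    def classify(doc):
        tokens = set(doc.lower().split())
        scores = {c: len(tokens & kw) for c, kw in keywords.items()}
        return max(scores, key=scores.get)
    return classify

def ensemble_predict(classifiers, doc):
    """Majority vote over base classifiers; in a streaming setting each
    classifier would be built from a different (recent) chunk of the stream."""
    votes = Counter(clf(doc) for clf in classifiers)
    return votes.most_common(1)[0][0]
```

In the stream setting, old classifiers are retired and new ones trained as chunks arrive, which is what lets the ensemble track concept drift.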
6.
Thanks to a wide and dynamic research community on short-term production scheduling, a large number of modelling options and solving methods have been developed in recent years, in both the chemical production and manufacturing domains. This trend is expected to grow in the future, as the number of publications is constantly increasing because of industrial interest in the current economic context. The frame of this work is the development of a decision-support system to work out an assignment strategy between scheduling problems, mathematical modelling options and appropriate solving methods. The system must answer the question of which model and which solution method should be applied to solve a new scheduling problem in the most convenient way. The decision-support system is built on the foundations of Case Based Reasoning (CBR). CBR relies on a case base which encompasses previously successful experiences. The three major contributions of this paper are: (i) the proposition of an extended and more exhaustive classification and notation scheme (based on previous ones) in order to obtain an efficient scheduling case representation, (ii) a bibliographic-analysis method used both to populate the case base and to examine which topics in the scheduling domain have been more or less studied and how they have evolved over time, and (iii) the proposition of criteria to extract relevant past experiences during the retrieval step of the CBR. The capabilities of our decision-support system are illustrated through a case study with typical constraints related to process engineering production in the beer industry.
7.
In knowledge discovery in a text database, extracting and returning a subset of information highly relevant to a user's query is a critical task. In a broader sense, this is essentially identification of certain personalized patterns that drives such applications as Web search engine construction, customized text summarization and automated question answering. A related problem of text snippet extraction has been previously studied in information retrieval. In these studies, common strategies for extracting and presenting text snippets to meet user needs either process document fragments that have been delimited a priori or use a sliding window of a fixed size to highlight the results. In this work, we argue that text snippet extraction can be generalized if the user's intention is better utilized. Our approach overcomes the rigidness of existing methods by dynamically returning more flexible start-end positions of text snippets, which are also semantically more coherent. This is achieved by constructing and using statistical language models which effectively capture the commonalities between a document and the user intention. Experiments indicate that our proposed solutions provide effective personalized information extraction services.
8.
This paper presents a new test to distinguish between meaningful and non-meaningful HMM-modeled activity patterns in human activity recognition systems. Operating as a hypothesis test, alternative models are generated from available classes and the decision is based on a likelihood ratio test (LRT). The proposed test differs from traditional LRTs in two aspects. Firstly, the likelihood ratio, which is called pairwise likelihood ratio (PLR), is based on each pair of HMMs. Models for non-meaningful patterns are not required. Secondly, the distribution of the likelihood ratios, rather than a fixed threshold, is used as the measurement. Multiple measurements from multiple PLR tests are combined to improve the rejection accuracy. The advantage of the proposed test is that the establishment of such a test relies only on the meaningful samples.
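A pairwise likelihood ratio reduces to the difference of log-likelihoods of the same observation sequence under two candidate HMMs, each computed with the forward algorithm. A minimal discrete-HMM sketch (toy parameters, not the paper's models):

```python
import math

def forward_loglik(obs, start, trans, emit):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the forward algorithm (no scaling; fine for short sequences)."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [emit[s][o] * sum(alpha[r] * trans[r][s] for r in range(n))
                 for s in range(n)]
    return math.log(sum(alpha))

def pairwise_llr(obs, hmm_i, hmm_j):
    """Pairwise likelihood ratio statistic: positive values favour hmm_i.
    Each hmm is a (start, trans, emit) tuple of probability tables."""
    return forward_loglik(obs, *hmm_i) - forward_loglik(obs, *hmm_j)
```

The paper's test then looks at the distribution of such ratios over pairs of class models, rather than thresholding a single ratio.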
9.
The improvement of safety and dependability in systems that physically interact with humans requires investigation with respect to the possible states of the user’s motion and an attempt to recognize these states. In this study, we propose a method for real-time visual state classification of a user with a walking support system. The visual features are extracted using principal component analysis and classification is performed by hidden Markov models, both for real-time fall detection (one-class classification) and real-time state recognition (multi-class classification). The algorithms are used in experiments with a passive-type walker robot called “RT Walker” equipped with servo brakes and a depth sensor (Microsoft Kinect). The experiments are performed with 10 subjects, including an experienced physiotherapist who can imitate the walking pattern of the elderly and people with disabilities. The results of the state classification can be used to improve fall-prevention control algorithms for walking support systems. The proposed method can also be used for other vision-based classification applications, which require real-time abnormality detection or state recognition.
10.
11.
Research on a Fault Diagnosis Method for a Vehicle Power Supply System (Cited: 1 in total; self-citations: 0; by others: 1)
By analyzing the signal characteristics of a vehicle power supply system, a fault diagnosis method combining wavelet packet analysis with hidden Markov models is proposed. Wavelet packet decomposition is used to extract signal features of the power system under various operating states; a K-means algorithm improved with simulated annealing selects the initial HMM values; continuous HMMs are trained with the feature vectors, and the trained HMMs are then used for condition monitoring and fault diagnosis of the power system. Experimental results show that good diagnostic performance can be achieved with only a small number of samples.
12.
Traditional biomedical named entity recognition methods require large amounts of annotated data from the target domain, but such annotation is expensive. To reduce the need for target-domain annotated data in biomedical named entity recognition, the problem is cast as a hidden Markov model problem based on transfer learning. Large-scale annotation of the target-domain dataset is not required; recognition and classification in the target domain are achieved through transfer learning. Data from a related domain serve as an auxiliary dataset, and a data-gravitation method evaluates how much each sample in the auxiliary dataset contributes to learning in the target domain; weights computed over the auxiliary and target datasets drive the transfer learning. Based on this weight-learning model, a transfer-learning hidden Markov model algorithm, BioTrHMM, is constructed. Experiments on the GENIA corpus show that BioTrHMM outperforms the traditional hidden Markov model algorithm, and that good named entity recognition performance is achieved with only a small amount of labeled target-domain data.
13.
Automated text classification of near-misses from safety reports: An improved deep learning approach
Examining past near-miss reports can provide us with information that can be used to learn about how we can mitigate and control hazards that materialise on construction sites. Yet analysing near-miss reports can be a time-consuming and labour-intensive process. However, automatic text classification using machine learning and ontology-based approaches can be used to mine reports of this nature. Such approaches tend to suffer from the problem of weak generalisation, which can adversely affect the classification performance. To address this limitation and improve classification accuracy, we develop an improved deep learning-based approach to automatically classify near-miss information contained within safety reports using Bidirectional Encoder Representations from Transformers (BERT). Our proposed approach is designed to pre-train deep bi-directional representations by jointly extracting context features in all layers. We validate the effectiveness and feasibility of our approach using a database of near-miss reports derived from actual construction projects that were used to train and test our model. The results demonstrate that our approach can accurately classify ‘near misses’, and outperform prevailing state-of-the-art automatic text classification approaches. Understanding the nature of near-misses can provide site managers with the ability to identify work areas and instances where the likelihood of an accident may occur.
14.
A Text Classification System Based on a Category-Based Feature Selection Algorithm (Cited: 1 in total; self-citations: 0; by others: 1)
Most current index-term selection algorithms are frequency-based and do not exploit the class information in the training samples, so a new category-based feature selection algorithm is proposed. The algorithm measures a term's discriminative power, that is, its ability to distinguish between documents, by the difference in within-class document similarity depending on whether the term occurs in a document, and uses this discriminative power as the importance score for keyword selection. Based on this algorithm, an automatic English text classification system was designed, tested, and its results analyzed.
15.
Liang-Teh Lee, Chen-Feng Wu 《Computers &amp; Electrical Engineering》2007, 33(3): 153-165
Multimedia services are becoming the major trend in next-generation cellular networks. Call admission control (CAC) plays the key role in guaranteeing quality of service (QoS) in cellular networks. Keeping both the call dropping probability (CDP) and the call blocking probability (CBP) below a given level is difficult owing to users' unpredictable mobility. In this paper, the Hidden Markov Model (HMM) concept, which is suitable for handling dynamic situations, is introduced and applied to the call admission control policy. The prediction of user mobility can be modeled and resolved as the decoding problem of the HMM. According to the prediction result, the proposed CAC method can reserve appropriate bandwidth for a handoff call beforehand, so the call dropping probability can be kept at a lower level. Moreover, the call blocking probability is not sacrificed too much, since the proposed method reserves suitable bandwidth in the appropriate cells rather than the fixed reservations that traditional CAC methods always adopt. Therefore, the proposed method not only satisfies both the CDP and CBP requirements but also improves system utilization.
16.
With the rapid growth of textual content on the Internet, automatic text categorization is a comparatively more effective solution for information organization and knowledge management. Feature selection, one of the basic phases in statistical-based text categorization, crucially depends on the term weighting methods. In order to improve the performance of text categorization, this paper proposes four modified frequency-based term weighting schemes, namely mTF, mTFIDF, TFmIDF, and mTFmIDF. The proposed term weighting schemes take the amount of missing terms into account when calculating the weight of existing terms. The proposed schemes show the highest performance for an SVM classifier, with a micro-averaged F1 classification performance value of 97%. Moreover, benchmarking results on the Reuters-21578, 20Newsgroups, and WebKB text-classification datasets, using different classifying algorithms such as SVM and KNN, show that the proposed schemes mTF, mTFIDF, and mTFmIDF outperform other weighting schemes such as TF, TFIDF, and Entropy. Additionally, statistical significance tests show a significant enhancement of the classification performance based on the modified schemes.
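For context, the standard TF-IDF weighting that the mTF/mTFIDF variants modify can be sketched as follows. The exact "missing terms" adjustment is not given in this abstract, so only the unmodified baseline is shown:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Standard TF-IDF weights per document for a tokenised corpus.
    The paper's mTF/mTFIDF schemes additionally account for the number of
    documents a term is missing from; that adjustment is not shown here."""
    df = Counter()                       # document frequency of each term
    for doc in corpus:
        df.update(set(doc))
    n = len(corpus)
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights
```

Any frequency-based modification (mTF, mIDF, etc.) slots into the `tf` or `log(n / df[t])` factors above.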
17.
Monojit Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, Anupam Basu 《International Journal on Document Analysis and Recognition》2007, 10(3-4): 157-174
Language usage over computer-mediated discourses, such as chats, emails and SMS texts, significantly differs from the standard form of the language and is referred to as texting language (TL). The presence of intentional misspellings significantly decreases the accuracy of existing spell-checking techniques for TL words. In this work, we formally investigate the nature and types of compressions used in SMS texts, and develop a Hidden Markov Model based word model for TL. The model parameters have been estimated through standard machine learning techniques from a word-aligned SMS and standard English parallel corpus. The accuracy of the model in correcting TL words is 57.7%, which is almost a threefold improvement over the performance of Aspell. The use of a simple bigram language model results in a 35% reduction in the relative word-level error rate.
18.
In order to achieve an optimum and successful operation of an industrial process, it is important firstly to detect upsets, equipment malfunctions or other abnormal events as early as possible, and secondly to identify and remove the cause of those events. Univariate and multivariate statistical process control methods have been widely applied in process industries for early fault detection and localization. The primary objective of the proposed research is the design of an anomaly detection and visualization tool that is able to present to the shift operator, and to the various levels of plant operation and company management, an early, global, accurate and consolidated picture of the operation of major subgroups or of the whole plant, aided by a graphical form. Piecewise Aggregate Approximation (PAA) and Symbolic Aggregate Approximation (SAX) are considered two of the most popular representations for time series data mining, including clustering, classification, pattern discovery and visualization in time series datasets. However, SAX is preferred since it is able to transform a time series into a set of discrete symbols, e.g. into alphabet letters, and is thus far more appropriate for a graphical representation of the corresponding information, especially for the shift operator. The methods are applied on individual time records of each process variable, as well as on entire groups of time records of process variables in combination with Hidden Markov Models. In this way, the proposed visualization tool is not only associated with a process defect, but also allows identifying which specific abnormal situation occurred and whether it has also occurred in the past. Case studies based on the benchmark Tennessee Eastman process demonstrate the effectiveness of the proposed approach. The results indicate that the proposed visualization tool captures meaningful information hidden in the observations and shows superior monitoring performance.
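SAX itself is a short pipeline: z-normalise the series, reduce it with PAA (segment means), then map each mean to a letter using breakpoints that divide the standard normal into equiprobable regions. A minimal sketch with the standard breakpoint tables for alphabet sizes 3 and 4:

```python
import statistics

# Breakpoints dividing N(0,1) into equiprobable regions,
# as tabulated in the SAX literature (alphabet sizes 3 and 4).
BREAKPOINTS = {3: [-0.43, 0.43], 4: [-0.67, 0.0, 0.67]}

def sax(series, n_segments, alphabet_size=3):
    """Convert a numeric time series to a SAX word:
    z-normalise, apply PAA, then discretise segment means into letters."""
    mu = statistics.mean(series)
    sigma = statistics.pstdev(series) or 1.0   # guard against constant series
    z = [(x - mu) / sigma for x in series]
    seg = len(z) // n_segments                 # assumes len divisible by n_segments
    paa = [statistics.mean(z[i * seg:(i + 1) * seg]) for i in range(n_segments)]
    cuts = BREAKPOINTS[alphabet_size]
    # count how many breakpoints each mean exceeds -> letter index
    return "".join(chr(ord("a") + sum(v > c for c in cuts)) for v in paa)
```

The resulting letter strings are what the tool displays to the operator and feeds into the HMMs for group-level monitoring.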
19.
20.
Map matching is a key preprocessing step in location-based services: it matches GPS trajectory points onto the actual road network. Analysis and mining built on the matched data can help solve related problems in urban computing, such as building intelligent transportation systems and assisting users with trip planning. This paper classifies and summarizes the results achieved by researchers in this field at home and abroad, finding that existing matching algorithms can handle high-sampling-rate map matching well. However, with the rapid development of urban traffic, the cost of acquiring and processing vehicle location information keeps rising, low-frequency sampling points are becoming more and more common, and the matching accuracy of existing algorithms drops sharply. In recent years, map-matching algorithms based on the Hidden Markov Model (HMM) have therefore emerged. An HMM can smoothly integrate noisy data with road-network constraints and select a maximum-likelihood path from among the many possible candidate paths. This paper focuses on HMM-based map-matching algorithms, comparing and summarizing them mainly from the perspective of their characteristics and experimental results; some experiments report accuracy of up to 90% under certain conditions, demonstrating the effectiveness of HMM-based map matching at low sampling rates. Finally, possible directions for future research are discussed.
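The maximum-likelihood path selection these surveyed methods rely on is the Viterbi algorithm: each GPS point has a set of candidate road segments (hidden states), an emission score for how well the point fits each segment, and a transition score between consecutive segments. A minimal sketch with caller-supplied scoring functions (the concrete scores, e.g. GPS-error and route-distance models, vary by paper):

```python
def viterbi(candidates, emission_logp, transition_logp):
    """Return the maximum-likelihood sequence of road segments.
    candidates[t]        -- candidate segments for GPS point t
    emission_logp(t, s)  -- log-score of point t matching segment s
    transition_logp(a,b) -- log-score of moving from segment a to b"""
    best = {s: emission_logp(0, s) for s in candidates[0]}
    back = []
    for t in range(1, len(candidates)):
        new_best, ptr = {}, {}
        for s in candidates[t]:
            prev, score = max(
                ((p, best[p] + transition_logp(p, s)) for p in candidates[t - 1]),
                key=lambda x: x[1])
            new_best[s] = score + emission_logp(t, s)
            ptr[s] = prev
        back.append(ptr)
        best = new_best
    # backtrack from the best final segment
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

At low sampling rates the transition score typically compares route distance between candidates with the straight-line distance between fixes, which is what lets the HMM bridge long gaps between points.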