Similar Documents
 20 similar documents found; search time: 171 ms
1.
Rapidly, accurately, and comprehensively locating sentiment orientation in massive amounts of Internet text is a major challenge in the field of big data. Text sentiment classification methods fall roughly into two categories: those based on semantic understanding and those based on supervised machine learning. The advantage of semantic understanding is that it can classify sentiment in texts from different domains, but it is easily affected by the varied sentence patterns and collocations of Chinese, so its classification accuracy is low. Supervised machine learning can achieve relatively high sentiment classification accuracy, but a classifier that performs well in one domain does not adapt well to a new domain. On the basis of information-gain feature reduction for high-dimensional text, this work combines optimized semantic understanding with machine learning and designs a new hybrid machine learning framework for Chinese sentiment classification. Multiple comparative experiments based on this framework verify high and stable classification accuracy on text from different domains.
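The information-gain feature reduction mentioned above can be sketched as follows (a minimal stdlib illustration of ranking terms by information gain, not the paper's implementation; the toy corpus and labels are invented):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """IG of a term: entropy drop when splitting docs by term presence."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    h_cond = sum(len(part) / len(docs) * entropy(part)
                 for part in (with_t, without_t) if part)
    return entropy(labels) - h_cond

# Toy corpus: tokenized reviews with sentiment labels (invented data).
docs = [{"good", "great"}, {"good", "bad"}, {"bad", "awful"}, {"awful"}]
labels = ["pos", "pos", "neg", "neg"]

# Keep the top-k terms by information gain as the reduced feature set.
vocab = set().union(*docs)
ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                reverse=True)
print(ranked[:2])
```

Here "good" and "awful" perfectly split the two classes, so they receive the maximal gain and survive the reduction, while the uninformative "bad" scores zero.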

2.
Locality preserving projections (LPP) is a new dimensionality reduction technique, but it is inherently an unsupervised learning algorithm and performs poorly on classification problems. Based on adaptive nearest neighbors combined with the LPP algorithm, a supervised locality preserving projection algorithm (ANNLPP) is proposed. By modifying the weight matrix in the LPP algorithm, the method incorporates class information while reducing dimensionality, making it a supervised learning algorithm. Two-dimensional data visualization and face recognition experiments on the UMIST and ORL databases show that the method achieves good dimensionality reduction for classification problems.
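The general idea of injecting class information into the LPP weight matrix can be sketched as follows (a minimal numpy illustration, not the paper's ANNLPP; the neighbor rule, regularization, and toy data are invented for the example):

```python
import numpy as np

def supervised_lpp(X, y, k=2, d=1):
    """Supervised LPP sketch: the weight matrix keeps only same-class
    neighbors, so class information enters the projection.
    X: (n_samples, n_features), y: labels, d: target dimensionality."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(dist[i])[1:k + 1]:   # k nearest neighbors of i
            if y[i] == y[j]:                     # supervision: same class only
                W[i, j] = W[j, i] = 1.0
    D = np.diag(W.sum(axis=1))
    L = D - W                                    # graph Laplacian
    A = X.T @ L @ X
    B = X.T @ D @ X + 1e-6 * np.eye(X.shape[1])  # regularize for stability
    # Generalized eigenproblem X'LX a = lambda X'DX a; take smallest eigenvalues.
    vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
    order = np.argsort(vals.real)
    return vecs[:, order[:d]].real

# Two tight clusters in 3-D (invented data); project to one dimension.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 3)), rng.normal(1, 0.1, (10, 3))])
y = np.array([0] * 10 + [1] * 10)
P = supervised_lpp(X, y, k=3, d=1)
Z = X @ P
print(Z.shape)
```

Minimizing the Laplacian quadratic form keeps same-class neighbors close after projection, which is what makes the supervised variant more useful for classification.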

3.
Nonlinear dimensionality reduction and semi-supervised learning are both recent hot topics in machine learning. Applying semi-supervised methods to nonlinear dimensionality reduction, a graph-based semi-supervised dimensionality reduction algorithm is proposed. The algorithm uses an equality-fusion method to derive an alternative formulation of the label propagation algorithm, takes the label propagation result as the initial data mapping, and then searches the linear space spanned by the graph spectrum for the data closest to that initial mapping as the final semi-supervised reduction result. Experiments show that the proposed algorithm produces a smooth data mapping that is closer to the ideal reduction. Comparisons with label propagation, graph-spectrum approximation, and unsupervised dimensionality reduction algorithms also demonstrate the superiority of the algorithm.

4.
A comparison of classification techniques applied to automatic ECG diagnosis models (Total citations: 2; self-citations: 0; citations by others: 2)
吴萍  黄勇 《计算机应用》2003,23(11):63-65,105
The key to improving the effectiveness and accuracy of ECG diagnosis lies in the quality of ECG classification. This paper discusses in detail how various classification techniques can be applied to extracted ECG feature data and, after comparing the classification algorithms, proposes a structural model for a CBR-based automatic ECG diagnosis system.

5.
As a dimensionality reduction algorithm, locality preserving projections (LPP) is widely used in machine learning and pattern recognition. To make better use of class information in recognition and classification, to preserve the local structure of sample points while effectively extracting low-dimensional face image information from high-dimensional data, and to improve face recognition rate and speed, an improved locality preserving projection algorithm (reformative locality preserving projections, RLPP) is proposed by combining the LPP algorithm with manifold learning ideas and constructing an attraction-vector scheme. After classifying the data sets with an extreme learning machine classifier, experiments on standard face databases show that the improved algorithm outperforms LPP, the locality-preserving average neighborhood margin maximization algorithm, and a robust linear dimensionality reduction algorithm in recognition rate, with strong generalization ability.

6.
《软件》2016,(9):27-33
Machine learning is one of the main branches of artificial intelligence, and text classification is a typical supervised learning scenario within it; applying machine learning to online education platforms is a current development trend. This paper first introduces the background and significance of text classification. For the text preprocessing stage of a text classification system, it introduces the information gain algorithm, principal component analysis, and related techniques; for the classification stage, it mainly introduces AdaBoost. Following the standard text classification workflow, a three-module text classification system is designed: (1) a Chinese word segmentation and stop-word removal module; (2) a text vectorization and feature reduction module; (3) a classifier module. The system is implemented entirely with open-source tools, using Ansj for module 1 and Weka for modules 2 and 3. Experiments were carried out with the system following the text classification workflow, and the resulting data were analyzed and summarized. To improve the final classification performance, a hybrid IG-LSA (information gain plus latent semantic analysis) dimensionality reduction method was added to the feature reduction step.
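The LSA half of the IG-LSA step reduces the term-document matrix with a truncated SVD; a minimal numpy sketch (the tiny count matrix is invented for illustration, and the IG filtering step is omitted):

```python
import numpy as np

def lsa_reduce(term_doc, k):
    """Latent semantic analysis: project documents into a k-dimensional
    latent space via truncated SVD of the term-document matrix."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Document coordinates in the latent space: rows of (S_k V_k^T)^T.
    return (np.diag(s[:k]) @ Vt[:k]).T

# Rows = terms, columns = documents (toy counts with two topic blocks).
term_doc = np.array([[2, 0, 1, 0],
                     [1, 0, 2, 0],
                     [0, 3, 0, 1],
                     [0, 1, 0, 2]], dtype=float)
docs_2d = lsa_reduce(term_doc, k=2)
print(docs_2d.shape)   # each of the 4 documents now has 2 latent features
```

Documents 1 and 3 share one topic block and documents 2 and 4 the other, so they land near each other in the latent space, which is the effect the reduced classifier benefits from.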

7.
Objective: Feature dimensionality reduction is a hot research topic in machine learning. Existing low-rank sparsity preserving projection methods ignore the information loss between the original data space and the reduced low-dimensional space, and they cannot effectively handle the case of few labelled data and many unlabelled data. To address these two problems, a semi-supervised feature selection method based on low-rank sparse graph embedding (LRSE) is proposed. Method: LRSE consists of two steps. First, the labelled and unlabelled data are fully exploited to learn their respective low-rank sparse representations. Second, the objective function jointly considers the information difference between the data before and after reduction and the preservation of structural information during reduction: minimizing an information loss function retains as much useful information as possible, and embedding the low-rank sparse graph, which captures both the global structure and the intrinsic geometric structure of the data, into the low-dimensional space preserves the structural information of the original data space, so that more discriminative features can be selected. Results: The method was tested on six public data sets, using KNN classification on the reduced data to validate accuracy and comparing against existing dimensionality reduction algorithms. The proposed method improved classification accuracy in every case and achieved the highest accuracy on five of the data sets: 11.19% higher than the second-best algorithm, robust unsupervised feature selection (RUFS), on Wine; 0.57% higher than RUFS on Breast; 1% higher than the second-best algorithm, multi-cluster feature selection (MCFS), on Orlraws10P; 1.07% higher than MCFS on Coil20; and 2.5% higher than MCFS on Orl64. Conclusion: The proposed semi-supervised feature selection algorithm based on low-rank sparse graph embedding retains as much of the original information as possible after reduction and effectively handles few labelled samples together with many unlabelled samples. Experimental results show better classification than existing algorithms; however, since the method assumes all features lie on a linear manifold, it applies only to data on linear manifolds.

8.
To evaluate military software obsolescence scientifically and systematically, a machine-learning-based software obsolescence assessment model is proposed. First, machine learning preprocessing and scaling techniques are applied to the relevant feature data; then a principal component analysis model is used for feature extraction and dimensionality reduction, removing noise from the feature data and selecting the important obsolescence features of military software. A support vector machine model improved by particle swarm optimization is used for classification and assessment modelling, and the model's precision is evaluated with a confusion matrix. Finally, a case study verifies the validity, applicability, and scientific soundness of the model.
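The PCA extraction-and-reduction step above can be sketched as follows (a minimal numpy illustration of principal component analysis; the toy feature table is invented and the PSO-SVM stage is omitted):

```python
import numpy as np

def pca_reduce(X, n_components):
    """PCA feature extraction: project centered data onto the top
    principal components (directions of largest variance)."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal axes.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Toy feature table: 6 samples, 4 features where two are near-duplicates
# of the other two (invented data with built-in redundancy/noise).
rng = np.random.default_rng(1)
base = rng.normal(size=(6, 2))
X = np.hstack([base, base + rng.normal(0, 0.01, (6, 2))])
Z = pca_reduce(X, n_components=2)
print(Z.shape)
```

Because two of the four columns are redundant, two principal components capture almost all of the variance, which is exactly the noise-removal effect the model relies on.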

9.
To address the unsatisfactory performance of manifold learning in supervised classification, a supervised macro-manifold learning algorithm is proposed. The algorithm constructs sub-manifolds from the given training samples, which are glued together along their boundaries to form a parent manifold. Making full use of the class label information and within-class neighborhood information of the training set, the algorithm computes an optimal nonlinear mapping function to reduce the high-dimensional features of the training samples, and uses nonlinear kernel regression to handle the out-of-sample learning problem, so that the resulting low-dimensional embedding is more favorable for classification. The proposed algorithm and several classical dimensionality reduction algorithms were compared in classification experiments on two typical test data sets, a 21-class land-cover data set and a UCI data set. The results show that the proposed algorithm achieves better classification performance.

10.
To address the lack of intuitiveness in pattern classification algorithms, a method for analyzing high-dimensional data based on radial coordinate visualization is proposed. The intrinsic dimensionality of the high-dimensional data is estimated by the maximum likelihood principle, and the data are visually reduced using a small number of variables combined with radial coordinate visualization. The radial coordinates reveal the relationships between classes and features in the high-dimensional data set; the optimal mapping over different feature orderings is sought, and the data sets are classified with several machine learning methods. Results on six data sets from the UCI repository show that the method achieves good visualization and classification performance.
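The radial-coordinate (RadViz-style) mapping underlying the method can be sketched as follows (a minimal stdlib illustration; the even anchor placement and [0, 1] normalization are assumptions of the sketch):

```python
import math

def radviz_point(x):
    """Map one normalized feature vector x (values in [0, 1]) to 2-D
    radial coordinates: a weighted average of anchor points placed
    evenly on the unit circle, with weights proportional to the
    feature values."""
    m = len(x)
    anchors = [(math.cos(2 * math.pi * j / m), math.sin(2 * math.pi * j / m))
               for j in range(m)]
    total = sum(x)
    if total == 0:
        return (0.0, 0.0)               # all-zero vector sits at the origin
    u = sum(xi * ax for xi, (ax, ay) in zip(x, anchors)) / total
    v = sum(xi * ay for xi, (ax, ay) in zip(x, anchors)) / total
    return (u, v)

# A sample dominated by feature 0 is pulled toward feature 0's anchor (1, 0).
print(radviz_point([1.0, 0.0, 0.0, 0.0]))
```

Reordering the features moves the anchors, which changes how classes separate on screen; that is why the method searches for the best feature ordering.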

11.
The objective of this paper is to construct a lightweight Intrusion Detection System (IDS) aimed at detecting anomalies in networks. The crucial part of building a lightweight IDS lies in the preprocessing of network data, the identification of important features, and the design of an efficient learning algorithm that classifies normal and anomalous patterns. Therefore, in this work the design of the IDS is investigated from these three perspectives. The goals of this paper are (i) removing redundant instances that cause the learning algorithm to be biased, (ii) identifying a suitable subset of features by employing a wrapper-based feature selection algorithm, and (iii) realizing the proposed IDS with a neurotree to achieve better detection accuracy. The lightweight IDS has been developed by using a wrapper-based feature selection algorithm that maximizes the specificity and sensitivity of the IDS, as well as by employing a neural ensemble decision tree iterative procedure to evolve optimal features. An extensive experimental evaluation of the proposed approach with a family of six decision tree classifiers, namely Decision Stump, C4.5, Naive Bayes Tree, Random Forest, Random Tree and Representative Tree, to perform the detection of anomalous network patterns is presented.

12.
Software defect prediction is an important decision support activity in software quality assurance. The scarcity of labelled modules usually makes prediction difficult, and the class-imbalance characteristic of software defect data negatively influences classifier decisions. Semi-supervised learning can build high-performance classifiers by using a large amount of unlabelled modules together with the labelled modules. Ensemble learning achieves better prediction for class-imbalanced data by using a series of weak classifiers to reduce the bias generated by the majority class. In this paper, we propose a new semi-supervised software defect prediction approach, non-negative sparse-based SemiBoost learning. The approach can exploit both labelled and unlabelled data and is formulated in a boosting framework. To enhance prediction ability, we design a flexible non-negative sparse similarity matrix, which fully exploits the similarity of historical data by incorporating a non-negativity constraint into sparse learning, better capturing the latent clustering relationships among software modules. Widely used datasets from NASA projects are employed as test data to evaluate the performance of all compared methods. Experimental results show that non-negative sparse-based SemiBoost learning outperforms several representative state-of-the-art semi-supervised software defect prediction methods. Copyright © 2016 John Wiley & Sons, Ltd.

13.
The performance of supervised classification algorithms is highly dependent on the quality of training data. Ambiguous training patterns may misguide the classifier, leading to poor classification performance. Further, the manual exploration of class labels is an expensive and time-consuming process. An automatic method is needed to identify noisy samples in the training data and improve the decision making process. This article presents a new classification technique combining an unsupervised learning technique (fuzzy c-means clustering, FCM) and a supervised learning technique (back-propagation artificial neural network, BPANN) to categorize benign and malignant tumors in breast ultrasound images. Unsupervised learning is employed to identify ambiguous examples in the training data. Experiments were conducted on 178 B-mode breast ultrasound images containing 88 benign and 90 malignant cases on the MATLAB® software platform. A total of 457 features were extracted from the ultrasound images, followed by feature selection to determine the most significant features. Accuracy, sensitivity, specificity, area under the receiver operating characteristic curve (AUC) and the Matthews correlation coefficient (MCC) were used to assess the performance of different classifiers. The results show that the proposed approach achieves a classification accuracy of 95.862% when all 457 features were used for classification. The accuracy is reduced to 94.138% when only the 19 most relevant features selected by a multi-criterion feature selection approach were used. The results were discussed in light of some recently reported studies. The empirical results suggest that eliminating doubtful training examples can improve the decision making performance of expert systems. The proposed approach shows promising results and needs further evaluation in other applications of expert and intelligent systems.
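The idea of using fuzzy c-means memberships to flag ambiguous training samples can be sketched as follows (a minimal numpy illustration with fixed cluster centers; the data, centers, and 0.7 cutoff are invented, and the FCM center-update iterations are omitted):

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    """Fuzzy c-means membership matrix for given cluster centers:
    u[i, k] = 1 / sum_j (d_ik / d_ij)^(2/(m-1))."""
    d = np.linalg.norm(X[:, None] - centers[None, :], axis=2) + 1e-12
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

# Toy 1-D training set: two clear clusters plus one ambiguous point
# halfway between them (invented data).
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [0.55]])
centers = np.array([[0.1], [1.05]])
U = fcm_memberships(X, centers)

# Flag samples whose highest membership is weak as ambiguous training data.
ambiguous = np.flatnonzero(U.max(axis=1) < 0.7)
print(ambiguous)
```

The point at 0.55 sits between both clusters, so neither membership dominates and it is flagged; in the article's pipeline such doubtful examples are removed before training the supervised classifier.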

14.
Coronary Artery Disease (CAD), caused by the buildup of plaque on the inside of the coronary arteries, has a high mortality rate. To efficiently detect this condition from echocardiography images, with less inter-observer variability and fewer visual interpretation errors, computer-based data mining techniques may be exploited. We have developed and presented one such technique in this paper for the classification of normal and CAD-affected cases. A multitude of grayscale features (fractal dimension, entropies based on the higher order spectra, features based on image texture and local binary patterns, and wavelet-based features) were extracted from echocardiography images belonging to a large database of 400 normal cases and 400 CAD patients. Only the features with good discriminating capability were selected using the t-test. Several combinations of the resultant significant features were used to evaluate many supervised classifiers to find the combination that gives good accuracy. We observed that the Gaussian Mixture Model (GMM) classifier trained with a feature subset of nine significant features presented the highest accuracy, sensitivity, specificity, and positive predictive value of 100%. We have also developed a novel, highly discriminative HeartIndex, a single number calculated from the combination of the features, in order to objectively classify images from either of the two classes. Such an index allows for easier implementation of the technique for automated CAD detection on the computers in hospitals and clinics.
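The t-test feature selection step can be sketched as follows (a stdlib illustration using Welch's t statistic; the feature values and the fixed |t| threshold are invented stand-ins for the paper's actual significance test):

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

# Two toy feature columns measured on normal vs CAD groups (invented
# numbers): feature A separates the groups, feature B does not.
normal = {"A": [1.0, 1.1, 0.9, 1.2], "B": [5.0, 4.8, 5.3, 5.1]}
cad    = {"A": [2.0, 2.2, 1.9, 2.1], "B": [5.1, 4.9, 5.2, 5.0]}

# Keep features whose |t| exceeds a chosen cutoff (a stand-in for the
# p-value criterion used in practice).
selected = [f for f in normal if abs(welch_t(normal[f], cad[f])) > 4.0]
print(selected)
```

Only feature A shifts between the groups, so only it survives the filter; the surviving features are then fed to the candidate classifiers.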

15.
An important tool for heart disease diagnosis is the analysis of electrocardiogram (ECG) signals, owing to the non-invasive nature and simplicity of the ECG exam. Depending on the application, ECG data analysis consists of steps such as preprocessing, segmentation, feature extraction and classification, aiming to detect cardiac arrhythmias (i.e., cardiac rhythm abnormalities). To make the cardiac arrhythmia signal classification process fast and accurate, we apply and analyze a recent and robust supervised graph-based pattern recognition technique, the optimum-path forest (OPF) classifier. To the best of our knowledge, this is the first time the OPF classifier has been applied to the ECG heartbeat signal classification task. We then compare the performance (in terms of training and testing time, accuracy, specificity, and sensitivity) of the OPF classifier with those of three other well-known expert system classifiers, i.e., the support vector machine (SVM), Bayesian and multilayer artificial neural network (MLP) classifiers, using features extracted from six main approaches considered in the literature for ECG arrhythmia analysis. In our experiments, we use the MIT-BIH Arrhythmia Database and the evaluation protocol recommended by The Association for the Advancement of Medical Instrumentation. A discussion of the obtained results shows that the OPF classifier presents a robust performance, i.e., no parameter setup is needed, as well as high accuracy at an extremely low computational cost. Moreover, on average, the OPF classifier yielded greater performance than the MLP and SVM classifiers in terms of classification time and accuracy, and produced performance quite similar to the Bayesian classifier, showing it to be a promising technique for ECG signal analysis.

16.
Detecting malicious URLs is important for defending against network attacks. Since supervised learning requires large amounts of labelled data, this paper trains a malicious URL detection model with semi-supervised learning, reducing the cost of labelling the data. The algorithm improves on traditional semi-supervised co-training: two classifiers are trained on data preprocessed by expert knowledge and by Doc2Vec respectively, and the data on which both classifiers agree and predict with high confidence are given pseudo-labels and used for further classifier learning. Experimental results show that with only 0.67% labelled data, the method trains two classifiers of different types whose detection precision reaches 99.42% and 95.23% respectively, performance close to supervised learning and better than self-training and co-training.
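The pseudo-labelling rule described above, agreement between the two classifiers plus high confidence on both sides, can be sketched as follows (a minimal illustration; all predictions, confidences, and the 0.9 threshold are invented):

```python
def select_pseudo_labels(preds1, preds2, conf1, conf2, threshold=0.9):
    """Co-training style filter: keep unlabelled samples on which two
    classifiers agree and both predict with confidence >= threshold.
    Returns {sample_index: pseudo_label}."""
    pseudo = {}
    for i, (p1, p2, c1, c2) in enumerate(zip(preds1, preds2, conf1, conf2)):
        if p1 == p2 and c1 >= threshold and c2 >= threshold:
            pseudo[i] = p1
    return pseudo

# Invented outputs from an expert-feature classifier and a Doc2Vec-based
# classifier over five unlabelled URLs (1 = malicious, 0 = benign).
preds1 = [1, 0, 1, 1, 0]
preds2 = [1, 0, 0, 1, 0]
conf1  = [0.98, 0.95, 0.97, 0.60, 0.99]
conf2  = [0.96, 0.91, 0.93, 0.95, 0.80]
print(select_pseudo_labels(preds1, preds2, conf1, conf2))
```

Samples 2 (disagreement), 3 and 4 (one low-confidence vote each) are rejected; only samples 0 and 1 receive pseudo-labels and re-enter training.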

17.
Active learning is understood as any form of learning in which the learning algorithm has some control over the input samples through a specific sample selection process, based on which it builds up the model. In this paper, we propose a novel active learning strategy for data-driven classifiers, based on an unsupervised criterion during the off-line training phase, followed by a supervised certainty-based criterion during incremental on-line training. In this sense, we call the new strategy hybrid active learning. Sample selection in the first phase is conducted from scratch (i.e. no initial labels/learners are needed) based on purely unsupervised criteria obtained from clusters: samples lying near cluster centers and near the borders of clusters are expected to be the most informative ones regarding the distribution characteristics of the classes. In the second phase, the task is to update the already trained classifiers during on-line mode with the most important samples, in order to dynamically guide the classifier to greater predictive power. Both strategies are essential for reducing the annotation and supervision effort of operators in off-line and on-line classification systems, as operators only have to label a small subset of the off-line training data and give feedback only on specific occasions during the on-line phase. The new active learning strategy is evaluated on real-world data sets from the UCI repository and data collected from on-line quality control systems. The results show that an active learning based selection of training samples (1) does not weaken the classification accuracies compared to using all samples in the training process and (2) can outperform classifiers built on randomly selected data samples.

18.
The paper presents a novel approach for voice activity detection. The main idea behind the presented approach is to use, alongside the likelihood ratio of a statistical model-based voice activity detector, a set of informative distinct features in order to enhance detection performance via a supervised learning approach. The statistical model-based voice activity detector, chosen based on a comparison to other similar detectors in an earlier work, models the spectral envelope of the signal, and we derive the likelihood ratio thereof. Furthermore, the likelihood ratio, together with 70 other features, was meticulously analyzed with an input variable selection algorithm based on partial mutual information. The resulting analysis produced a reduced 13-element input vector which, compared to the full input vector, did not undermine the detector performance. The evaluation is performed on a speech corpus consisting of recordings made by six different speakers, corrupted with three different types of noises and noise levels. Finally, we tested three different supervised learning algorithms for the task, namely, support vector machine, Boost, and artificial neural networks. The experimental analysis was performed by 10-fold cross-validation, from which threshold-averaged receiver operating characteristic curves were constructed. Also, the area under the curve score and the Matthews correlation coefficient were calculated for both the three supervised learning classifiers and the statistical model-based voice activity detector. The results showed that the classifier with the reduced input vector significantly outperformed the standalone detector based on the likelihood ratio, and that among the three classifiers, Boost showed the most consistent performance.

19.
Prototype classifiers have been studied for many years. However, few methods can realize incremental learning. On the other hand, most prototype classifiers require users to predetermine the number of prototypes; an improper prototype number might undermine the classification performance. To deal with these issues, in this paper we propose an online supervised algorithm named Incremental Learning Vector Quantization (ILVQ) for classification tasks. The proposed method has three contributions. (1) By designing an insertion policy, ILVQ incrementally learns new prototypes, covering both between-class and within-class incremental learning. (2) By employing an adaptive threshold scheme, ILVQ automatically and dynamically learns the number of prototypes needed for each class according to the distribution of the training data. Therefore, unlike most current prototype classifiers, ILVQ needs no prior knowledge of the number of prototypes or their initial values. (3) A technique for removing useless prototypes is used to eliminate noise introduced into the input data. Experimental results show that the proposed ILVQ can accommodate an incremental data environment and provides good recognition performance and storage efficiency.
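The insertion policy in contribution (1) can be sketched as follows (a simplified illustration of an ILVQ-style policy; the fixed distance threshold stands in for the paper's adaptive threshold scheme, and the data are invented):

```python
import math

def classify(protos, x):
    """Nearest-prototype label; protos is a list of (vector, label) pairs."""
    return min(protos, key=lambda p: math.dist(p[0], x))[1]

def ilvq_fit(samples, threshold=0.5):
    """Incremental sketch: a new prototype is inserted whenever a sample
    is misclassified (between-class insertion, including unseen classes)
    or lies farther than `threshold` from its nearest prototype
    (within-class insertion)."""
    protos = []
    for x, y in samples:
        if not protos:
            protos.append((x, y))
            continue
        nearest = min(protos, key=lambda p: math.dist(p[0], x))
        if nearest[1] != y or math.dist(nearest[0], x) > threshold:
            protos.append((x, y))
    return protos

# Stream of toy 2-D samples from two classes, processed one at a time.
stream = [((0.0, 0.0), "a"), ((0.1, 0.0), "a"),
          ((2.0, 2.0), "b"), ((2.1, 2.0), "b"), ((0.05, 0.05), "a")]
protos = ilvq_fit(stream)
print(len(protos), classify(protos, (1.9, 1.9)))
```

Only two prototypes survive the stream, one per class, so the prototype count emerges from the data rather than being fixed in advance; the real ILVQ additionally adapts the threshold and prunes useless prototypes.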

20.
This paper introduces different classification systems based on artificial neural networks for the automatic detection of epileptic spikes in electroencephalogram records. Different multilayer perceptron networks are constructed and trained with different algorithms. The inputs of the networks consist of either raw data or extracted features. To improve the generalization performance of the classifiers, a "training with noise" method is used, whereby new training data are constructed by adding uncorrelated Gaussian noise to real data. The performance of the constructed classifiers is examined and compared both with each other and with other similar systems found in the literature, based on sensitivity, specificity and selectivity measures.
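The "training with noise" augmentation can be sketched as follows (a minimal stdlib illustration; the feature vectors, labels, and noise level are invented):

```python
import random

def augment_with_noise(X, y, copies=2, sigma=0.05, seed=0):
    """'Training with noise': enlarge the training set by adding
    uncorrelated Gaussian noise to each real sample, keeping its label."""
    rng = random.Random(seed)
    X_aug, y_aug = list(X), list(y)
    for _ in range(copies):
        for x, label in zip(X, y):
            X_aug.append([v + rng.gauss(0.0, sigma) for v in x])
            y_aug.append(label)
    return X_aug, y_aug

# Toy EEG feature vectors with spike / no-spike labels (invented data).
X = [[0.2, 0.8], [0.9, 0.1]]
y = ["no-spike", "spike"]
X_aug, y_aug = augment_with_noise(X, y)
print(len(X_aug), len(y_aug))   # original 2 samples -> 6 after augmentation
```

Training the perceptrons on the enlarged, slightly perturbed set discourages them from memorizing the exact training points, which is the regularizing effect the paper exploits.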


Copyright©北京勤云科技发展有限公司  京ICP备09084417号