Similar Literature
Found 20 similar documents (search time: 31 ms)
1.
The nearest neighbor (NN) classifier is one of the most popular non-parametric classification approaches and has been applied successfully to several pattern recognition problems. Its two main limitations are its computational complexity and its sensitivity to outliers in the training set. The first problem has been partially overcome thanks to inexpensive memory and high processing speeds, but the second persists, and several editing and condensing techniques have been proposed to select a proper set of prototypes from the training set. This work proposes an editing technique based on rewarding the patterns that contribute to a correct classification and punishing those that contribute to a wrong one. The analysis is carried out at both local and global levels by examining the training set at different scales. A score is calculated for each pattern, and patterns whose score is lower than a predefined threshold are edited out. Extensive experiments on several classification problems evaluate the efficacy of the proposed technique relative to other editing approaches and investigate the advantage of using reward–punishment editing in combination with condensing techniques, or as a pre-processing stage when classifiers other than the NN are adopted.
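The reward and punishment idea in the abstract above can be illustrated with a minimal single-scale sketch. The abstract does not give the score formula or the multi-scale analysis, so everything here is an assumption made for illustration: the function name `reward_punish_edit`, the +1/-1 scoring, the Euclidean distance, and the parameters `k` and `threshold` are hypothetical and are not the authors' method.

```python
import math


def reward_punish_edit(points, labels, k=3, threshold=0.0):
    """Return the indices of the patterns kept after one simplified editing pass.

    Assumption: each pattern j earns +1 whenever it is among the k nearest
    neighbours of a pattern i with the same label (its vote would be right),
    and -1 whenever the labels differ (its vote would be wrong).
    """
    n = len(points)
    scores = [0] * n
    for i in range(n):
        # Neighbours of pattern i, excluding i itself, nearest first.
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in order[:k]:
            scores[j] += 1 if labels[j] == labels[i] else -1
    # Patterns whose score falls below the threshold are edited out.
    return [i for i in range(n) if scores[i] >= threshold]


# Two well-separated clusters plus one mislabeled pattern (index 8).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (0.5, 0.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]
kept = reward_punish_edit(points, labels, k=3, threshold=0.0)
```

In this toy set the mislabeled pattern sits in the middle of the class-0 cluster, so it is only ever punished and ends up below the threshold, while every correctly labeled pattern keeps a positive score.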

2.
The logical analysis of data (LAD) is one of the most promising data mining and machine learning techniques developed to date for extracting knowledge from data. LAD is based on the concepts of combinatorics, optimization, and Boolean functions. Its key feature is the capability of detecting hidden patterns in the data. Since patterns are essentially combinations of certain attributes, they can be used to build a decision boundary for classification in LAD, providing the information needed to distinguish observations in one class from those in the other. Because patterns are robust to measurement errors, using them may give more stable performance in classifying both positive and negative classes. The patterns are also interpretable and can serve as an essential tool for understanding the problem. These desirable properties motivate the use of LAD patterns as input variables to other classification techniques to achieve more stable and accurate performance. In this paper, the patterns generated by LAD are used as the input variables to the decision tree and k-nearest neighbor classification methods. The applicability and usefulness of LAD patterns for classification are investigated experimentally. The classification accuracy and sensitivity of different classifiers in the original and pattern spaces are compared using several public data sets. The experimental results show that classification in the pattern space can yield better and more stable accuracy than classification in the original space when the classification accuracy of LAD is relatively good (i.e., the LAD patterns are of good quality), the ratio of the number of patterns to the total number of attributes is small, or the data set is balanced between the two classes.

3.
Outlier detection is an important field of data mining with several applications in medical research. Mining outliers based on the notion of rare patterns can be a promising solution for medical diagnosis, as it attempts to identify the unconventional and abnormal risk patterns present in medical data. A crucial issue in medical data analysis is the continuous growth of medical databases as new records are added. Existing outlier detection techniques can handle only static data and must therefore re-execute from scratch to identify outliers in incremental medical data. This paper introduces an efficient rare pattern based outlier detection (RPOD) method that identifies outliers by mining rare patterns from incremental data. To avoid the multiple database scans and expensive candidate generation steps performed by existing rare pattern mining techniques, and to facilitate incremental mining, a single-pass prefix tree-based rare pattern mining technique is proposed. This technique is a modification of the well-known FP-Growth frequent pattern mining algorithm. An outlier detection technique is also presented to identify the outliers from the set of generated rare patterns. The significance of the proposed RPOD approach is demonstrated on several well-known medical datasets. A comparative performance evaluation substantiates the superiority of the RPOD approach over existing outlier mining methods.

4.
From a data mining perspective, sequence classification means building a classifier from frequent sequential patterns. However, mining a complete set of sequential patterns on a large dataset can be extremely time-consuming, and the large number of patterns discovered also makes pattern selection and classifier building very time-consuming. In sequence classification, discovering discriminative patterns matters much more than discovering a complete pattern set. In this paper, we propose a novel hierarchical algorithm to build sequential classifiers using discriminative sequential patterns. First, we mine the sequential patterns that are most strongly correlated with each target class. In this step, an aggressive strategy is employed to select a small set of sequential patterns. Second, pattern pruning and a serial coverage test are applied to the mined patterns. The patterns that pass the serial test are used to build the sub-classifier at the first level of the final classifier. Third, the training samples that cannot be covered are fed back to the sequential pattern mining stage with updated parameters. This process continues until predefined interestingness measure thresholds are reached or all samples are covered. The patterns generated in each loop form the sub-classifier at the corresponding level of the final classifier. Within this framework, the search space can be reduced dramatically while good classification performance is achieved. The proposed algorithm is tested in a real-world business application for debt prevention in the social security area. The algorithm proves effective and efficient for predicting debt occurrences based on customer activity sequence data.

5.
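To address the problem that irrelevant attributes and redundant data in high-dimensional data sets prevent outliers from being detected, a new local outlier detection algorithm based on sparse subspaces (SSLOD) is proposed. An outlier factor is defined for each object according to the local density of the data objects in each dimension; attributes unrelated to local outliers, as well as redundant data objects, are pruned from the data set according to an outlier factor threshold; an improved particle swarm optimization algorithm then searches the reduced data set for sparse subspaces, and the data objects in those subspaces are the outliers. Comprehensive experiments on simulated and real data sets verify the effectiveness and accuracy of the algorithm.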

6.
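A fuzzy classifier based on the neural network covering construction method   Cited by: 10 (self-citations: 1, citations by others: 10)
A geometric representation of the M-P neuron model is first introduced, together with how this representation turns neural network training into a point-set covering problem; on this basis, a geometric method for neural network training is analyzed. This method can construct very complex classification boundaries, but its time complexity is high. To address this, a method combining the neural network covering algorithm with fuzzy set ideas is proposed. The resulting classifier improves training speed and reduces the number of covering sphere neighborhoods, that is, the number of hidden nodes in the neural network. The fuzzification also makes it easy to provide multiple candidate results for large-scale pattern recognition problems. The method is tested on a large-scale pattern recognition problem built from the recognition of 700 classes of handwritten Chinese characters, and the experimental results show that it has considerable potential for large-scale pattern recognition.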

7.
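Jiang Jingjing, Wang Zhihai, Yuan Jidong. 《计算机科学》 (Computer Science), 2017, 44(7): 167-174, 202
Building classification models from patterns extracted from large-scale data is one of the important research problems in pattern mining. One feasible approach is to build a Bayesian classification model from a set of patterns. However, most existing pattern-based Bayesian classification models target static data sets and usually cannot adapt to data stream environments that change rapidly and are unbounded. To address this, a Bayesian classification learning model based on pattern discovery in data streams is proposed. It uses a semi-lazy learning strategy, building a local classification model for each instance to be classified over a continuously updated set of frequent itemsets. To speed up stream processing, a simpler hybrid tree structure is proposed, together with a pattern extraction mechanism restricted to given items that reduces the generation of candidate itemsets. Where pattern extraction from the stream is incomplete, smoothing is used to handle items that were not extracted. Extensive experiments show that the proposed model achieves higher classification accuracy than other data stream classifiers.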

8.
Mihoko M, Eguchi S. Neural Computation, 2002, 14(8): 1859-1886
Blind source separation aims to recover the original independent signals when only their linear mixtures are observed. Various methods for estimating a recovering matrix have been proposed and applied to data in many fields, such as biological signal processing, communication engineering, and financial market data analysis. One problem with these methods is that they are often too sensitive to outliers, and a few outliers might change the estimate drastically. In this article, we propose a robust method of blind source separation based on the beta divergence. Shift parameters are explicitly included in our model, in contrast to the conventional approach, which assumes that the original signals have zero mean. The estimator gives smaller weights to possible outliers so that their influence on the estimate is weakened. Simulation results show that the proposed estimator significantly improves on existing methods when outliers exist and performs equally well otherwise.

9.
This paper addresses the complexity reduction of the nearest feature plane classifier, so that it can also be applied to datasets whose training set contains many patterns. To classify a test pattern, this classifier considers the subspaces created by each combination of three training patterns. The main problem is that this method is infeasible for datasets of high cardinality. A genetic algorithm is used here to divide the training patterns into several clusters, whose centroids are used to build the feature planes that classify the test set. The performance improvement with respect to other nearest neighbor based classifiers is validated through experiments on several benchmark datasets.

10.
11.
A novel successive learning algorithm based on a Test Feature Classifier is proposed for efficient handling of sequentially provided training data. The fundamental characteristics of successive learning are considered. In this form of learning, after a set of unknown data has been recognized by a classifier, the data are fed back into the classifier to modify its performance. An efficient algorithm is proposed for the incremental definition of prime tests, which are irreducible combinations of features capable of classifying training patterns into the correct classes. Four strategies for adding training patterns are investigated with respect to their precision and performance using real pattern data. The proposed classifier has been applied to a real-world problem, the classification of defects on wafer images, and achieved excellent performance even with the efficient addition strategies.

12.
Instance selection aims at filtering out noisy data (or outliers) from a given training set, which not only reduces the need for storage space but can also ensure that the classifier trained on the reduced set performs similarly to or better than the baseline classifier trained on the original set. However, among the numerous instance selection algorithms, there is no clear winner that is best across datasets from different problem domains. In other words, instance selection performance depends on both the algorithm and the dataset. One main reason is that it is very hard to define what the outliers are across different datasets. Note that a specific instance selection algorithm may over-select by filtering out too many ‘good’ data samples, which leads to a classifier that performs worse than the baseline. In this paper, we introduce a dual classification (DuC) approach, which aims to deal with the potential drawback of over-selection. Specifically, after instance selection is performed over a given training set, two classifiers are trained on the ‘good’ and ‘noisy’ sets, respectively, identified by the instance selection algorithm. Then, a test sample is compared for similarity with the data in the good and noisy sets, and this comparison routes the test sample to one of the two classifiers. The experiments are conducted using 50 small-scale and 4 large-scale datasets, and the results demonstrate the superior performance of the proposed DuC approach over the baseline instance selection approach.

13.
Dealing with high-dimensional data has always been a major problem in pattern recognition and machine learning research, and linear discriminant analysis (LDA) is one of the most popular methods for dimensionality reduction. However, it is too sensitive to outliers. To solve this problem, fuzzy membership can be introduced to reduce the effects of outliers and so enhance the performance of the algorithm. In this paper, we analyze the existing fuzzy strategies and propose a new, effective one based on Markov random walks. The new fuzzy strategy maintains high consistency of local and global discriminative information and preserves the statistical properties of the dataset. Based on the proposed fuzzy strategy, we then derive an efficient fuzzy LDA algorithm by incorporating the fuzzy membership into learning. Theoretical analysis and extensive simulations show the effectiveness of our algorithm. The presented results demonstrate that the proposed algorithm achieves significantly improved results compared with other existing algorithms.

14.
This paper proposes a rapid adaptive pedestrian detection method based on a cascade classifier with a ternary pattern. The method relies on three new strategies: (1) a method for dynamically adjusting the key parameters of the trained cascade classifier to detect pedestrians in unseen scenes using only a small amount of labeled data from the new scenes; (2) an efficient optimization method, based on the cross-entropy method and a priori knowledge of the scenes, to solve the classifier parameter optimization problem; (3) a ternary detection pattern in each strong classifier of the cascade, to further speed up pedestrian detection in unseen scenes. In our experiments, two significantly different datasets, AHHF and NICTA, were used as the training set and testing set, respectively. The experimental results showed that the proposed method adapts a previously trained detector to pedestrian detection in various scenes more quickly than other existing methods.

15.
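A new pattern classification algorithm called the kernel weighted group sparse representation classifier (KWGSC) is proposed. By introducing group sparsity and locality preservation in the kernel feature space rather than the original input space, KWGSC obtains more effective discriminative reconstruction coefficients for classification. To obtain the optimal reconstruction coefficients, a new iterative update strategy is proposed for solving the model, with a corresponding convergence proof and complexity analysis. Compared with existing representation-based classification algorithms, KWGSC has the following advantages: 1) through an implicit mapping, it sidesteps the normalization problem inherent in classical linear representation algorithms; 2) by jointly introducing a distance-weighting constraint and a reconstruction redundancy constraint, it accurately derives the target class label of a query sample; 3) it introduces an l_{2,p} regularization term to adjust the sparsity of the collaborative mechanism, achieving better classification performance. Experiments on synthetic data show that classical linear representation algorithms cannot find the correct reconstruction samples without norm normalization, whereas KWGSC is unaffected. Experiments on real public databases verify that the proposed algorithm has robust discriminative power and that its overall performance is clearly better than that of existing algorithms.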

16.
The paper presents a new approach to dynamic classifier selection in an ensemble, which applies the classifier best suited to the particular testing sample. It is based on the area under the curve (AUC) of the receiver operating characteristic (ROC) of each classifier. To allow different types of classifiers to be used in an ensemble and to reduce the influence of outliers, the quantile representation of the signals is used. The quantiles divide the ordered data into essentially equal-sized subsets, giving each data point an approximately uniform distribution on the [0, 1] support. In this way the recognition problem is less sensitive to the outliers, scales, and noise contained in the input attributes. Numerical results for the chosen benchmark data-mining sets and for a data set of images representing melanoma and non-melanoma skin lesions show the high efficiency of the proposed approach and its superiority to existing methods.
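A rank-based transform is one simple way to obtain a quantile representation like the one the abstract above describes. The sketch below is an illustration under assumptions, not the paper's procedure: the function name `quantile_transform`, the use of average ranks for ties, and the division by n - 1 are all choices made here.

```python
def quantile_transform(values):
    """Map each value to its empirical quantile in [0, 1].

    Assumption: tied values share the average of their ranks, and ranks
    are scaled by n - 1 so the smallest value maps to 0 and the largest to 1.
    """
    n = len(values)
    if n == 1:
        return [0.5]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        # Extend `end` over the run of values equal to values[order[pos]].
        end = pos
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2
        for t in range(pos, end + 1):
            ranks[order[t]] = average_rank
        pos = end + 1
    return [r / (n - 1) for r in ranks]


# The extreme value 1000 only becomes "the largest", with quantile 1.0.
transformed = quantile_transform([10, 20, 30, 1000, 40])
```

Because only the ordering of the values matters, an extreme value cannot stretch the scale of the attribute, which is the sense in which such a representation dampens outliers and differences in scale.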

17.
A simple and fast multi-class piecewise linear classifier is proposed and implemented. For a pair of classes, the piecewise linear boundary is a collection of segments of hyperplanes created as perpendicular bisectors of line segments linking the centroids of the classes or of parts of classes. For a multi-class problem, a binary partition tree is first created, which represents a hierarchical division of the given pattern classes into groups, with each non-leaf node corresponding to some group. A piecewise linear boundary is then constructed for each non-leaf node of the partition tree as for a two-class problem. The resulting piecewise linear boundary is the set of boundaries corresponding to all non-leaf nodes of the tree. The basic data structures of the algorithms for synthesizing the piecewise linear classifier and for classifying unknown patterns are described. The proposed classifier is compared with a number of known pattern classifiers by benchmarking on real-world data sets.
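The basic building block named in the abstract above, a hyperplane that is the perpendicular bisector of the segment joining two centroids, can be sketched as follows. This covers only a single hyperplane for two classes; the splitting of classes into parts and the partition tree are not shown, and the function names `centroid`, `bisector`, and `classify` are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def bisector(class0, class1):
    """Return (w, bias) of the hyperplane w.x + bias = 0 that perpendicularly
    bisects the segment joining the two class centroids."""
    c0, c1 = centroid(class0), centroid(class1)
    w = [b - a for a, b in zip(c0, c1)]            # normal points from c0 to c1
    mid = [(a + b) / 2 for a, b in zip(c0, c1)]    # the hyperplane passes through here
    bias = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, bias


def classify(x, w, bias):
    """Return 1 if x lies on the class-1 side of the hyperplane, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else 0


class0 = [(0, 0), (0, 2), (2, 0), (2, 2)]   # centroid (1, 1)
class1 = [(4, 4), (4, 6), (6, 4), (6, 6)]   # centroid (5, 5)
w, bias = bisector(class0, class1)
```

Classifying with this hyperplane is equivalent to assigning a point to the nearer of the two centroids, which is why a single bisector is cheap to build and evaluate.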

18.
It is well known that the least absolute deviation (LAD) criterion, or L1-norm, used for parameter estimation is characterized by robustness, i.e., the estimated parameters are resistant (insensitive) to large changes in the sampled data. This is an extremely useful feature, especially when the sampled data are known to be contaminated by occasional outliers or by spiky noise. In our previous works, we proposed the least absolute deviation neural network (LADNN) to solve unconstrained LAD problems. Theoretical proofs and numerical simulations have shown that the LADNN is Lyapunov-stable and globally converges to the exact solution of a given unconstrained LAD problem. We have also demonstrated its excellent application value in time-delay estimation. More generally, a practical LAD application problem may contain some linear constraints, such as a set of equalities and/or inequalities; this is called a constrained LAD problem, and the unconstrained LAD problem can be considered a special form of it. In this paper, we present a new neural network called the constrained least absolute deviation neural network (CLADNN) to solve general constrained LAD problems. Theoretical proofs and numerical simulations demonstrate that the proposed CLADNN is Lyapunov-stable and globally converges to the exact solution of a given constrained LAD problem, independent of initial values. The numerical simulations also illustrate that the proposed CLADNN can be used to robustly estimate parameters for nonlinear curve fitting, which is used extensively in signal and image processing.
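The robustness of the LAD criterion mentioned in the abstract above is easiest to see in the simplest estimation problem, fitting a single constant to the data: the sum of absolute deviations is minimized by the median, while the sum of squared deviations is minimized by the mean. The sketch below only illustrates that fact with made-up numbers; it is unrelated to the neural network solver the paper proposes, and the helper name `sum_abs_dev` is hypothetical.

```python
import statistics


def sum_abs_dev(theta, data):
    """LAD objective for a constant model: sum of |x - theta|."""
    return sum(abs(x - theta) for x in data)


clean = [9.8, 10.1, 10.0, 9.9, 10.2]
spiky = clean[:-1] + [1000.0]   # the last sample is replaced by a spike

# LAD estimate of a constant is the median; least-squares estimate is the mean.
lad_clean, lad_spiky = statistics.median(clean), statistics.median(spiky)
ls_clean, ls_spiky = statistics.mean(clean), statistics.mean(spiky)
```

Here the LAD estimate stays at 10.0 with or without the spike, whereas the least-squares estimate is dragged far away from the bulk of the data by the single contaminated sample.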

19.
Unnatural patterns in control charts can be associated with a specific set of assignable causes of process variation. Hence pattern recognition is very useful in identifying process problems. A common difficulty in existing control chart pattern recognition approaches is discriminating between different types of patterns that share similar features. This paper proposes an artificial neural network based model that employs a pattern discrimination algorithm to recognise unnatural control chart patterns. The pattern discrimination algorithm is based on several special-purpose networks trained for specific recognition tasks. The performance of the proposed model was evaluated by simulation using two criteria: the percentage of correctly recognised patterns and the average run length (ARL). Numerical results show that the false recognition problem has been effectively addressed. Compared with previous control chart approaches, the proposed model achieves superior ARL performance while also identifying the type of the unnatural pattern accurately.

20.
Imbalanced classification using support vector machine ensemble   Cited by: 1 (self-citations: 0, citations by others: 1)
Imbalanced data sets often have detrimental effects on the performance of a conventional support vector machine (SVM). To solve this problem, we adopt both strategies: modifying the data distribution and adjusting the classifier. Both the minority and majority classes are resampled to increase the generalization ability. For the minority class, a one-class support vector machine model combined with the synthetic minority oversampling technique is used to oversample the support vector instances. For the majority class, we propose a new method that decomposes the class into clusters and removes two clusters using a distance measure to lessen the effect of outliers. The remaining clusters are used to build an SVM ensemble with the oversampled minority patterns; the SVM ensemble can achieve better performance by considering potentially suboptimal solutions. Experimental results on benchmark data sets illustrate the effectiveness of the proposed method.
