Similar articles
 Found 20 similar articles (search time: 21 ms)
1.
Automatic text classification is usually based on models constructed through learning from training examples. However, as the size of text document repositories grows rapidly, the storage requirements and computational cost of model learning are becoming ever higher. Instance selection is one solution to overcoming this limitation. The aim is to reduce the amount of data by filtering out noisy data from a given training dataset. A number of instance selection algorithms have been proposed in the literature, such as ENN, IB3, ICF, and DROP3. However, all of these methods have been developed for the k-nearest neighbor (k-NN) classifier. In addition, their performance has not been examined over the text classification domain, where the dimensionality of the dataset is usually very high. Support vector machines (SVMs) are core text classification techniques. In this study, a novel instance selection method, called Support Vector Oriented Instance Selection (SVOIS), is proposed. First of all, a regression plane in the original feature space is identified by utilizing a threshold distance between the given training instances and their class centers. Then, another threshold distance, between the identified data (forming the regression plane) and the regression plane, is used to decide on the support vectors for the selected instances. The experimental results based on the TechTC-100 dataset show the superior performance of SVOIS over other state-of-the-art algorithms. In particular, using SVOIS to select text documents allows the k-NN and SVM classifiers to perform better than without instance selection.

2.
Cluster ensembles in collaborative filtering recommendation   Cited by: 1 (self-citations: 0, citations by others: 1)
Recommender systems, which recommend items of information that are likely to be of interest to the users, and filter out less favored data items, have been developed. Collaborative filtering is a widely used recommendation technique. It is based on the assumption that people who share the same preferences on some items tend to share the same preferences on other items. Clustering techniques are commonly used for collaborative filtering recommendation. While cluster ensembles have been shown to outperform many single clustering techniques in the literature, the performance of cluster ensembles for recommendation has not been fully examined. Thus, the aim of this paper is to assess the applicability of cluster ensembles to collaborative filtering recommendation. In particular, two well-known clustering techniques (self-organizing maps (SOM) and k-means), and three ensemble methods (the cluster-based similarity partitioning algorithm (CSPA), hypergraph partitioning algorithm (HGPA), and majority voting) are used. The experimental results based on the Movielens dataset show that cluster ensembles can provide better recommendation performance than single clustering techniques in terms of recommendation accuracy and precision. In addition, there are no statistically significant differences between either the three SOM ensembles or the three k-means ensembles. Either the SOM or k-means ensembles could be considered in the future as the baseline collaborative filtering technique.

3.
Intrusion detection is a necessary step to identify unusual access or attacks to secure internal networks. In general, intrusion detection can be approached by machine learning techniques. In the literature, advanced techniques based on hybrid learning or ensemble methods have been considered, and related work has shown that they are superior to models using single machine learning techniques. This paper proposes a hybrid learning model based on the triangle area based nearest neighbors (TANN) approach in order to detect attacks more effectively. In TANN, k-means clustering is first used to obtain cluster centers corresponding to the attack classes. Then, the area of the triangle formed by two cluster centers and one data point from the given dataset is calculated, and these areas form a new feature signature for the data point. Finally, the k-NN classifier is used to classify similar attacks based on the new features represented by the triangle areas. Using KDD-Cup ’99 as the simulation dataset, the experimental results show that TANN can effectively detect intrusion attacks and provides higher accuracy and detection rates, and a lower false alarm rate, than three baseline models based on support vector machines, k-NN, and the hybrid centroid-based classification model that combines k-means and k-NN.
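
The triangle-area feature step described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes Euclidean distance, uses Heron's formula for the area, and builds one area per unordered pair of cluster centers. The function names and the example centers are hypothetical.

```python
import math
from itertools import combinations


def triangle_area(a, b, c):
    """Area of the triangle with vertices a, b, c, via Heron's formula
    on Euclidean side lengths (an assumption; the abstract names no metric)."""
    ab, bc, ca = math.dist(a, b), math.dist(b, c), math.dist(c, a)
    s = (ab + bc + ca) / 2.0
    # Clamp at zero to absorb rounding error for (near-)collinear points.
    return math.sqrt(max(s * (s - ab) * (s - bc) * (s - ca), 0.0))


def triangle_features(point, centers):
    """New feature vector: one triangle area per unordered pair of centers."""
    return [triangle_area(point, ci, cj) for ci, cj in combinations(centers, 2)]


# Hypothetical cluster centers, standing in for the k-means output.
centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
features = triangle_features((4.0, 3.0), centers)
```

With n cluster centers, each data point is mapped to n(n-1)/2 areas, and a k-NN classifier would then operate on these vectors instead of the raw features.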

4.
The k-nearest neighbors classifier is one of the most widely used methods of classification due to several interesting features, such as good generalization and easy implementation. Although simple, it is usually able to match, and even beat, more sophisticated and complex methods. However, no successful method has been reported so far to apply boosting to k-NN. As boosting methods have proved very effective in improving the generalization capabilities of many classification algorithms, proposing an appropriate application of boosting to k-nearest neighbors is of great interest. Ensemble methods rely on the instability of the classifiers to improve their performance; as k-NN is fairly stable with respect to resampling, these methods fail in their attempt to improve the performance of the k-NN classifier. On the other hand, k-NN is very sensitive to input selection. Consequently, ensembles based on subspace methods are able to improve the performance of single k-NN classifiers. In this paper we make use of the sensitivity of k-NN to the input space to develop two methods for boosting k-NN. The two approaches modify the view of the data that each classifier receives so that the accurate classification of difficult instances is favored. The two approaches are compared with the classifier alone and with bagging and random subspace methods, showing a marked and significant improvement in generalization error. The comparison is performed using a large test set of 45 problems from the UCI Machine Learning Repository. A further study on noise tolerance shows that the proposed methods are less affected by class label noise than the standard methods.

5.
In this paper, we consider two-way deterministic machines which have counters, each capable of storing any nonnegative integer, but without the ability to detect an empty counter, and which accept by final state. We call these machines two-way deterministic multi-weak-counter machines. Let 2NWC(k) and 2DWC(k) denote the classes of languages recognized by two-way nondeterministic and deterministic k-weak-counter machines, respectively. In particular, for k = 1, we denote the corresponding classes by 2NWC and 2DWC. The following results are shown: (1) 2DWC(k) = 2DWC for k ≥ 1. (2) A bounded language in 2DWC is a bounded semilinear language. (3) 2NWC ≠ 2DWC.

6.
We develop a learning-based automated assume-guarantee (AG) reasoning framework for verifying ω-regular properties of concurrent systems. We study the applicability of non-circular (AG-NC) and circular (AG-C) AG proof rules in the context of systems with infinite behaviors. In particular, we show that AG-NC is incomplete when assumptions are restricted to strictly infinite behaviors, while AG-C remains complete. We present a general formalization, called LAG, of the learning based automated AG paradigm. We show how existing approaches for automated AG reasoning are special instances of LAG. We develop two learning algorithms for a class of systems, called ∞-regular systems, that combine finite and infinite behaviors. We show that for ∞-regular systems, both AG-NC and AG-C are sound and complete. Finally, we show how to instantiate LAG to do automated AG reasoning for ∞-regular, and ω-regular, systems using both AG-NC and AG-C as proof rules.

7.
Semi-supervised model-based document clustering: A comparative study   Cited by: 4 (self-citations: 0, citations by others: 4)
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations when the set of labels available in labeled data is not complete, i.e., unlabeled data contain new classes that are not present in labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach is the best when available labels are complete whereas the feedback-based approach excels when available labels are incomplete. Editor: Andrew Moore

8.
This study developed a methodology for formulating water level models to forecast river stages during typhoons, comparing various models by using lazy and eager learning approaches. Two lazy learning models were introduced: the locally weighted regression (LWR) and the k-nearest neighbor (kNN) models. Their efficacy was compared with that of three eager learning models, namely, the artificial neural network (ANN), support vector regression (SVR), and linear regression (REG). These models were employed to analyze the Tanshui River Basin in Taiwan. The data collected comprised 50 historical typhoon events and relevant hourly hydrological data from the river basin during 1996–2007. The forecasting horizon ranged from 1 h to 4 h. Various statistical measures were calculated, including the correlation coefficient, mean absolute error, and root mean square error. Moreover, significance, computation efficiency, and the Akaike information criterion were evaluated. The results indicated the following. (a) Among the eager learning models, ANN and SVR yielded more favorable results than REG (based on statistical analyses and significance tests). Although ANN, SVR, and REG were categorized as eager learning models, their predictive abilities varied according to various global learning optimizers. (b) Regarding the lazy learning models, LWR performed more favorably than kNN. Although LWR and kNN were categorized as lazy learning models, their predictive abilities were based on diverse local learning optimizers. (c) A comparison of eager and lazy learning models indicated that neither category was uniformly more effective or yielded more favorable results, because the distinct approximators of the models within each category caused performance to depend on the individual model.
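
The three statistical measures named in this abstract (correlation coefficient, mean absolute error, root mean square error) can be sketched as below. This is an illustrative sketch using the standard textbook definitions; the abstract does not state which variant of the correlation coefficient was used, so Pearson's is assumed, and the function names and sample values are hypothetical.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def root_mean_square_error(observed, predicted):
    """Square root of the average squared difference."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def correlation_coefficient(observed, predicted):
    """Pearson correlation (assumed variant) between the two series."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


# Hypothetical hourly river stages (m) and forecasts with a constant 1 m bias.
stages = [1.0, 2.0, 3.0, 4.0]
forecast = [2.0, 3.0, 4.0, 5.0]
```

A constant bias, as in the sample values, leaves the correlation at its maximum while still producing nonzero error measures, which is one reason such studies report several measures together.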

9.
We consider stateless counter machines which mix the features of one-head counter machines and special two-head Watson–Crick automata (WK-automata). These biologically motivated machines have heads that read the input starting from the two extremes. The reading process is finished when the heads meet. The machine is realtime or non-realtime depending on whether the heads are required to advance at each move. A counter machine is k-reversal if each counter makes at most k alternations between increasing mode and decreasing mode on any computation, and reversal bounded if it is k-reversal for some k. In this paper we concentrate on the properties of deterministic stateless realtime WK-automata with counters that are reversal bounded. We give examples and establish hierarchies with respect to counters and reversals.
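
The k-reversal condition defined in this abstract can be checked on a single counter's history with a short sketch. This is an illustration under stated assumptions: the counter's behavior is given as a hypothetical list of per-move changes (+1, 0, or -1), a move that leaves the counter unchanged does not alter its mode, and the first nonzero move sets the initial mode without counting as an alternation. Some formal definitions fix the initial mode as increasing, which would change the count for histories that begin with a decrement.

```python
def count_reversals(deltas):
    """Count alternations between increasing and decreasing mode.

    deltas: per-move counter changes (+1, 0, -1); zeros keep the current mode.
    """
    reversals = 0
    mode = 0  # 0 means no mode has been established yet
    for delta in deltas:
        if delta == 0:
            continue
        direction = 1 if delta > 0 else -1
        if mode != 0 and direction != mode:
            reversals += 1
        mode = direction
    return reversals


def is_k_reversal(deltas, k):
    """True if this counter history makes at most k alternations."""
    return count_reversals(deltas) <= k


# Hypothetical history: up, up, idle, down, down, up gives two alternations.
history = [1, 1, 0, -1, -1, 1]
```

A machine is k-reversal only if the bound holds for every counter on every computation, so this check covers one counter on one run.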

10.
Vector quantization (VQ) can perform efficient feature extraction from electrocardiogram (ECG) signals, with the advantages of dimensionality reduction and increased accuracy. However, the existing dictionary learning algorithms for vector quantization are sensitive to dirty data, which compromises classification accuracy. To tackle this problem, we propose a novel dictionary learning algorithm that employs k-medoids clustering optimized by k-means++ and builds dictionaries by searching for and using representative samples, which avoids the interference of dirty data and thus boosts the classification performance of ECG systems based on vector quantization features. We apply our algorithm to vector quantization feature extraction for ECG beat classification, and compare it with popular features such as the sampling point feature, fast Fourier transform feature, and discrete wavelet transform feature, and with our previous beat vector quantization feature. The results show that the proposed method yields the highest accuracy and is capable of reducing the computational complexity of an ECG beat classification system. The proposed dictionary learning algorithm provides more efficient encoding for ECG beats, and can improve ECG classification systems based on encoded features.

11.
In the past several decades, classifier design has attracted much attention. Inspired by the locality preserving idea of manifold learning, we present a local linear regression (LLR) classifier. The proposed classifier consists of three steps: first, search for the k nearest neighbors of a given sample within each class; second, reconstruct the sample using its k nearest neighbors from each class; and third, classify the test sample according to the minimum reconstruction error. The experimental results on the ETH80 database, the CENPAMI handwritten number database and the FERET face image database demonstrate that LLR works well, leading to promising image classification performance.
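
The three steps listed in this abstract can be sketched in plain Python. This is a rough sketch and not the authors' formulation: it assumes Euclidean neighbor search, an unconstrained least-squares reconstruction solved through the normal equations, and a small ridge term (a hypothetical addition) to keep the system solvable when neighbors are linearly dependent. All names are illustrative.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def reconstruction_error(x, neighbors, ridge=1e-6):
    """Step 2: least-squares reconstruction of x from its neighbors."""
    k = len(neighbors)
    gram = [[dot(neighbors[i], neighbors[j]) + (ridge if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    weights = solve_linear_system(gram, [dot(nb, x) for nb in neighbors])
    recon = [sum(weights[i] * neighbors[i][d] for i in range(k))
             for d in range(len(x))]
    return math.dist(x, recon)


def llr_classify(x, training, k=2):
    """training maps each class label to a list of sample vectors."""
    errors = {}
    for label, samples in training.items():
        # Step 1: k nearest neighbors of x within this class.
        neighbors = sorted(samples, key=lambda s: math.dist(x, s))[:k]
        errors[label] = reconstruction_error(x, neighbors)
    # Step 3: the class with the smallest reconstruction error wins.
    return min(errors, key=errors.get)


# Hypothetical toy data: class "a" lies in the xy-plane, class "b" in the yz-plane.
training = {
    "a": [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (3.0, 0.0, 0.0)],
    "b": [(0.0, 0.0, 1.0), (0.0, 1.0, 1.0), (0.0, 0.0, 3.0)],
}
```

In the toy data, a query in the xy-plane is reconstructed almost exactly by class "a" neighbors and poorly by class "b" neighbors, so the minimum-error rule assigns it to "a".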

12.
Context: A number of approaches have been proposed for the general problem of software component evaluation and selection. Most approaches come from the field of Component-Based Software Development (CBSD), tackle the problem of Commercial-off-the-shelf component selection and use goal-oriented requirements modelling and multi-criteria decision making techniques. Evaluation of the suitability of components is carried out largely manually and partly relies on subjective judgement. However, in dynamic, distributed environments with high demands for transparent selection processes leading to trustworthy, auditable decisions, subjective judgements and vendor claims are not considered sufficient. Furthermore, continuous monitoring and re-evaluation of components after integration is sometimes needed. Objective: This paper describes how an evidence-based approach to component evaluation can improve repeatability and reproducibility of component selection under the following conditions: (1) Functional homogeneity of candidate components and (2) High number of components and selection problem instances. Method: Our evaluation and selection method and tool empirically evaluate candidate components in controlled experiments by applying automated measurements. By analysing the differences to system-development-oriented scenarios, the paper shows how the process of utility analysis can be tailored to fit the problem space, and describes a method geared towards automated evaluation in an empirical setting. We describe tool support and a framework for automated measurements. We further present a taxonomy of decision criteria for the described scenario and discuss the data collection means needed for each category of criteria. Results: To evaluate our approach, we discuss a series of case studies in the area of digital preservation. We analyse the criteria defined in these case studies, classify them according to the taxonomy, and discuss the quantitative coverage of automated measurements. Conclusion: The results of the analysis show that an automated measurement, evaluation and selection framework is necessary and feasible to ensure trusted and repeatable decisions.

13.
Ensemble methods aim at combining multiple learning machines to improve the efficacy in a learning task in terms of prediction accuracy, scalability, and other measures. These methods have been applied to evolutionary machine learning techniques including learning classifier systems (LCSs). In this article, we first propose a conceptual framework that allows us to appropriately categorize ensemble‐based methods for fair comparison and highlights the gaps in the corresponding literature. The framework is generic and consists of three sequential stages: a pre‐gate stage concerned with data preparation; the member stage to account for the types of learning machines used to build the ensemble; and a post‐gate stage concerned with the methods to combine ensemble output. A taxonomy of LCSs‐based ensembles is then presented using this framework. The article then focuses on comparing LCS ensembles that use feature selection in the pre‐gate stage. An evaluation methodology is proposed to systematically analyze the performance of these methods. Specifically, random feature sampling and rough set feature selection‐based LCS ensemble methods are compared. Experimental results show that the rough set‐based approach performs significantly better than the random subspace method in terms of classification accuracy in problems with high numbers of irrelevant features. The performance of the two approaches is comparable in problems with high numbers of redundant features.

14.
The ability to accurately predict business failure is a very important issue in financial decision-making. Incorrect decision-making in financial institutions is very likely to cause financial crises and distress. Bankruptcy prediction and credit scoring are two important problems facing financial decision support. Although many related studies develop financial distress models using machine learning techniques, more advanced machine learning techniques, such as classifier ensembles and hybrid classifiers, have not been fully assessed. The aim of this paper is to develop a novel hybrid financial distress model based on combining the clustering technique and classifier ensembles. In addition, single baseline classifiers, hybrid classifiers, and classifier ensembles are developed for comparisons. In particular, two clustering techniques, Self-Organizing Maps (SOMs) and k-means, and three classification techniques, logistic regression, multilayer-perceptron (MLP) neural network, and decision trees, are used to develop these four different types of bankruptcy prediction models. As a result, 21 different models are compared in terms of average prediction accuracy and Type I & II errors. Across five related datasets, combining Self-Organizing Maps (SOMs) with MLP classifier ensembles performs the best, providing higher prediction accuracy and lower Type I & II errors.

15.
An abdominal aortic aneurysm (AAA) is a localized abnormal enlargement of the abdominal aorta with fatal consequences if not treated in time. Endovascular aneurysm repair (EVAR) is a minimally invasive therapy that reduces recovery times and improves survival rates in AAA cases. Nevertheless, post-operative difficulties can appear, influencing the evolution of treatment. The objective of this work is to develop a pilot computer-supported diagnosis system for an automated characterization of EVAR progression from CTA images. The system is based on the extraction of texture features from post-EVAR thrombus aneurysm samples and on subsequent classification. Three conventional texture-analysis methods, namely the gray level co-occurrence matrix (GLCM), the gray level run length matrix (GLRLM), the gray level difference method (GLDM), and a new method proposed by the authors, the run length matrix of local co-occurrence matrices (RLMLCM), were applied to each sample. Several classification schemes were experimentally evaluated. The ensembles of a k-nearest neighbor (k-NN), a multilayer perceptron neural network (MLP-NN), and a support vector machine (SVM) classifier fed with a reduced version of texture features resulted in a better performance (Az = 94.35 ± 0.30), as compared to the classification performance of the other alternatives.

16.
The k-Nearest Neighbor (k-NN) technique has become extremely popular for a variety of forest inventory mapping and estimation applications. Much of this popularity may be attributed to the non-parametric, multivariate features of the technique, its intuitiveness, and its ease of use. When used with satellite imagery and forest inventory plot data, the technique has been shown to produce useful estimates of many forest attributes including forest/non-forest, volume, and basal area. However, variance estimators for quantifying the uncertainty of means or sums of k-NN pixel-level predictions for areas of interest (AOI) consisting of multiple pixels have not been reported. The primary objectives of the study were to derive variance estimators for AOI estimates obtained from k-NN predictions and to compare precision estimates resulting from different approaches to k-NN prediction and different interpretations of those predictions. The approaches were illustrated by estimating proportion forest area, tree volume per unit area, tree basal area per unit area, and tree density per unit area for 10-km AOIs. Estimates obtained using k-NN approaches and traditional inventory approaches were compared and found to be similar. Further, variance estimates based on different interpretations of k-NN predictions were similar. The results facilitate small area estimation and simultaneous and consistent mapping and estimation of multiple forest attributes.

17.
Manual inspection and evaluation of quality control data is a tedious task that requires the undistracted attention of specialized personnel. On the other hand, automated monitoring of a production process is necessary, not only for real time product quality assessment, but also for potential machinery malfunction diagnosis. For this reason, control chart pattern recognition (CCPR) methods have received a lot of attention over the last two decades. Current state-of-the-art control monitoring methodology includes K charts, which are based on support vector machines (SVM). Although K charts have some profound benefits, their performance deteriorates when the learning examples for the normal class greatly outnumber the ones for the abnormal class. Such problems are termed imbalanced and represent the vast majority of real-life control pattern classification problems. The original SVM demonstrates poor performance when applied directly to these problems. In this paper, we propose the use of weighted support vector machines (WSVM) for automated process monitoring and early fault diagnosis. We show the benefits of WSVM over traditional SVM and compare them under various fault scenarios. We evaluate the proposed algorithm in binary and multi-class environments for the most popular abnormal quality control patterns as well as a real application from the wafer manufacturing industry.

18.
Image annotation can be formulated as a classification problem. Recently, Adaboost learning with feature selection has been used for creating an accurate ensemble classifier. We propose dynamic Adaboost learning with feature selection based on a parallel genetic algorithm for image annotation in the MPEG-7 standard. In each iteration of Adaboost learning, a genetic algorithm (GA) is used to dynamically generate and optimize a set of feature subsets on which the weak classifiers are constructed, so that an ensemble member is selected. We investigate two methods of GA feature selection: a binary-coded chromosome GA feature selection method used to perform optimal feature subset selection, and a bi-coded chromosome GA feature selection method used to perform optimal-weighted feature subset selection, i.e. to simultaneously perform optimal feature subset selection and corresponding optimal weight subset selection. To improve the computational efficiency of our approach, a master-slave GA, a parallel implementation of GA, is used. A k-nearest neighbor classifier is used as the base classifier. The experiments are performed over 2000 classified Corel images to validate the performance of the approaches.

19.
With the advent of Big Data, data is being collected at an unprecedented fast pace, and it needs to be processed in a short time. To deal with data streams that flow continuously, classical batch learning algorithms cannot be applied and it is necessary to employ online approaches. Online learning consists of continuously revising and refining a model by incorporating new data as they arrive, and it allows important problems such as concept drift or management of extremely high-dimensional datasets to be solved. In this paper, we present a unified pipeline for online learning which covers online discretization, feature selection and classification. Three classical methods (the k-means discretizer, the χ² filter and a one-layer artificial neural network) have been reimplemented to be able to tackle online data, showing promising results on both synthetic and real datasets.

20.
Pipelines carrying energy products play vital roles in economic wealth and public safety, but incidents continue to occur. Condition assessment of pipelines is essential to identify anomalies in a timely manner. Advanced sensing technologies obtain informative data for condition assessment, while data analysis by humans has limited efficiency, accuracy, and reliability. Advances in machine learning offer exciting opportunities for automated condition assessment with minimum human intervention. This paper reviews machine learning approaches to detect, classify, locate, and quantify pipeline anomalies based on intelligent interpretation of routine operation data, nondestructive testing data, and computer vision data. Statistics and uncertainties of performance metrics of machine learning approaches are discussed. An analysis of strengths, weaknesses, opportunities, and threats (SWOT) is performed. Guides for practitioners to perform automated pipeline condition assessment are recommended. This review provides insights into the machine learning approaches for automated pipeline condition assessment. The SWOT analysis will support decision making in the pipeline industry.
