Similar Documents
20 similar documents found (search time: 78 ms)
1.
A New Algorithm for Improving the Efficiency of the K-Nearest-Neighbor Algorithm
The K-Nearest-Neighbor (KNN) algorithm is one of the most basic instance-based learning methods and is widely used in machine learning and data mining. Its learning process simply stores the known training data. When a new query instance arrives, a set of similar instances is retrieved from storage and used to classify it. One drawback of KNN is that the cost of classifying a new instance can be high, because almost all computation takes place at classification time rather than when training instances are first encountered. How to index training instances effectively, so as to reduce the computation needed at query time, is therefore an important practical problem. To address it, this paper proposes a new algorithm that moves part of the computation originally performed at the classification stage into the training stage. Experiments show that the algorithm improves KNN efficiency by more than 80%. Moreover, its idea can be applied to all variants of KNN.
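The lazy cost structure described above (all distance work deferred to query time) can be seen in a minimal plain-Python sketch; the data and function names here are illustrative, not from the paper:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training points.

    `train` is a list of (vector, label) pairs; all distance computation
    happens here, at query time -- the "lazy" cost the abstract refers to.
    """
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters.
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"),
         ((1.0, 1.0), "b"), ((0.9, 1.1), "b")]
print(knn_predict(train, (0.05, 0.1)))  # -> a
```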

2.
We develop a customized classification learning method QPL, which is based on query projections. Given an instance to be classified (query instance), QPL explores the projections of the query instance (QPs), which are essentially subsets of attribute values shared by the query and training instances. QPL investigates the associated training data distribution of a QP to decide whether it is useful. The final prediction for the query is made by combining some statistics of the selected useful QPs. Unlike existing instance-based learning, QPL does not need to compute a distance measure between instances. The utilization of QPs for learning can explore a richer hypothesis space and achieve a balance between precision and robustness. Another characteristic of QPL is that the target class may vary for different query instances in a given data set. We have evaluated our method with synthetic and benchmark data sets. The results demonstrate that QPL can achieve good performance and high reliability.

3.
The current combination of "lazy-learning classifiers" plus "dynamically updated training sets" is insufficient to handle the concept-drift problem in dynamic text classification. Inspired by the basic idea of lazy classification and drawing on the strengths of the k-NN algorithm, this paper proposes the concept of a "lazy feature selection" scheme targeted at concept drift, and a dynamic text classification algorithm based on it. Test results show that the new algorithm resolves the existing difficulties well and offers high reliability and practicality.

4.
In many machine learning settings, labeled examples are difficult to collect while unlabeled data are abundant. Also, for some binary classification problems, positive examples which are elements of the target concept are available. Can these additional data be used to improve accuracy of supervised learning algorithms? We investigate in this paper the design of learning algorithms from positive and unlabeled data only. Many machine learning and data mining algorithms, such as decision tree induction algorithms and naive Bayes algorithms, use examples only to evaluate statistical queries (SQ-like algorithms). Kearns designed the statistical query learning model in order to describe these algorithms. Here, we design an algorithm scheme which transforms any SQ-like algorithm into an algorithm based on positive statistical queries (estimate for probabilities over the set of positive instances) and instance statistical queries (estimate for probabilities over the instance space). We prove that any class learnable in the statistical query learning model is learnable from positive statistical queries and instance statistical queries only if a lower bound on the weight of any target concept f can be estimated in polynomial time. Then, we design a decision tree induction algorithm POSC4.5, based on C4.5, that uses only positive and unlabeled examples and we give experimental results for this algorithm. In the case of imbalanced classes in the sense that one of the two classes (say the positive class) is heavily underrepresented compared to the other class, the learning problem remains open. This problem is challenging because it is encountered in many real-world applications.

5.
In classification problems, active learning is often adopted to alleviate laborious human labeling efforts, by finding the most informative samples to query the labels. One of the most popular query strategies is selecting the most uncertain samples for the current classifier. The performance of such an active learning process heavily relies on the learned classifier before each query. Thus, stepwise classifier model/parameter selection is quite critical, which is, however, rarely studied in the literature. In this paper, we propose a novel active learning support vector machine algorithm with adaptive model selection. In this algorithm, before each new query, we trace the full solution path of the base classifier, and then perform efficient model selection using the unlabeled samples. This strategy significantly improves the active learning efficiency with comparatively inexpensive computational cost. Empirical results on both artificial and real-world benchmark data sets show the encouraging gains brought by the proposed algorithm in terms of both classification accuracy and computational cost.
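Uncertainty sampling, the query strategy named above, reduces to picking the pool sample with the smallest classifier margin. A toy sketch, where the 1-D scorer is a made-up stand-in for a real SVM decision function:

```python
def most_uncertain(pool, decision_fn):
    """Return the unlabeled sample the current classifier is least sure about.

    For a margin classifier, |decision_fn(x)| grows with distance from the
    decision boundary, so the smallest absolute score marks the most
    informative point to query next.
    """
    return min(pool, key=lambda x: abs(decision_fn(x)))

# Hypothetical 1-D linear scorer with its decision boundary at x = 0.5.
pool = [0.1, 0.45, 0.9]
print(most_uncertain(pool, lambda x: x - 0.5))  # -> 0.45
```

In a full active learning loop this selection step alternates with retraining the classifier on the newly labeled sample, which is exactly where the abstract's stepwise model selection would slot in.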

6.
Active learning (AL) has been shown to be a useful approach to improving the efficiency of the classification process for remote-sensing imagery. Current AL methods are essentially based on pixel-wise classification. In this paper, a new patch-based active learning (PTAL) framework is proposed for spectral-spatial classification of hyperspectral remote-sensing data. The method consists of two major steps. In the initialization stage, the original hyperspectral images are partitioned into overlapping patches. Then, for each patch, the spectral and spatial information as well as the label are extracted. A small set of patches is randomly selected from the data set for annotation, and a patch-based support vector machine (PTSVM) classifier is initially trained with these patches. In the second stage (a closed-loop stage of query and retraining), the trained PTSVM classifier is combined with one of three query methods, namely margin sampling (MS), entropy query-by-bagging (EQB), and multi-class level uncertainty (MCLU), and is employed to query the most informative samples from the candidate pool comprising the rest of the patches in the data set. The query selection cycle enables the PTSVM model to select the most informative queries for human annotation. These informative queries are then added to the training set. This process runs iteratively until a stopping criterion is met. Finally, the trained PTSVM is employed for patch classification. To enable comparison with pixel-based active learning (PXAL) models, the label predicted for a patch by PTSVM is mapped back to pixel-wise labels to obtain classification maps. Experimental results on three different hyperspectral data sets show that the proposed PTAL method outperforms PXAL methods in both classification accuracy and computational time.

7.
DeEPs: A New Instance-Based Lazy Discovery and Classification System
Distance is widely used in most lazy classification systems. Rather than using distance, we make use of the frequency of an instance's subsets of features and the frequency-change rate of the subsets among training classes to perform both knowledge discovery and classification. We name the system DeEPs. Whenever an instance is considered, DeEPs can efficiently discover those patterns contained in the instance which sharply differentiate the training classes from one to another. DeEPs can also predict a class label for the instance by compactly summarizing the frequencies of the discovered patterns based on a view to collectively maximize the discriminating power of the patterns. Many experimental results are used to evaluate the system, showing that the patterns are comprehensible and that DeEPs is accurate and scalable.

8.
In this contribution, we deal with active learning, which gives the learner the power to select training samples. We propose a novel query algorithm for local learning models, a class of learners that has not been considered in the context of active learning until now. Our query algorithm is based on the idea of selecting a query on the borderline of the actual classification. This is done by drawing on the geometrical properties of local models that typically induce a Voronoi tessellation on the input space, so that the Voronoi vertices of this tessellation offer themselves as prospective query points. The performance of the new query algorithm is tested on the two-spirals problem with promising results.

9.
A natural approach towards powerful machine learning systems is to enable options for additional machine/user interactions, for instance by allowing the system to ask queries about the concept to be learned. This motivates the development and analysis of adequate formal learning models. In the present paper, we investigate two different types of query learning models in the context of learning indexable classes of recursive languages: Angluin's original model and a relaxation thereof, called learning with extra queries. In the original model the learner is restricted to query languages belonging to the target class, while in the new model it is allowed to query other languages, too. As usual, the following standard types of queries are considered: superset, subset, equivalence, and membership queries. The learning capabilities of the resulting query learning models are compared to one another and to different versions of Gold-style language learning from only positive data and from positive and negative data (including finite learning, conservative inference, and learning in the limit). A complete picture of the relation of all these models has been elaborated. A couple of interesting differences and similarities between query learning and Gold-style learning have been observed. In particular, query learning with extra superset queries coincides with conservative inference from only positive data. This result documents the naturalness of the new query model.

10.
Record linkage is a process of identifying records that refer to the same real-world entity. Many existing approaches to record linkage apply supervised machine learning techniques to generate a classification model that classifies a pair of records as either match or non-match. The main requirement of such an approach is a labelled training dataset. In many real-world applications no labelled dataset is available, hence manual labelling is required to create a sufficiently sized training dataset for a supervised machine learning algorithm. Semi-supervised machine learning techniques, such as self-learning or active learning, which require only a small manually labelled training dataset, have been applied to record linkage. These techniques reduce the requirement on the manual labelling of the training dataset; however, they have yet to achieve a level of accuracy similar to that of supervised learning techniques. In this paper we propose a new approach to unsupervised record linkage based on a combination of ensemble learning and enhanced automatic self-learning. In the proposed approach an ensemble of automatic self-learning models is generated with different similarity measure schemes. In order to further improve the automatic self-learning process we incorporate field weighting into the automatic seed selection for each of the self-learning models. We propose an unsupervised diversity measure to ensure that there is high diversity among the selected self-learning models. Finally, we propose to use the contribution ratios of self-learning models to remove those with poor accuracy from the ensemble. We have evaluated our approach on four publicly available datasets that are commonly used in the record linkage community. Our experimental results show that our proposed approach has advantages over the state-of-the-art semi-supervised and unsupervised record linkage techniques; on three of the four datasets it also achieves results comparable to those of the supervised approaches.

11.
Shin H, Cho S. Neural Computation, 2007, 19(3): 816-855
The support vector machine (SVM) has been spotlighted in the machine learning community because of its theoretical soundness and practical performance. When applied to a large data set, however, it requires a large memory and a long time for training. To cope with the practical difficulty, we propose a pattern selection algorithm based on neighborhood properties. The idea is to select only the patterns that are likely to be located near the decision boundary. Those patterns are expected to be more informative than the randomly selected patterns. The experimental results provide promising evidence that it is possible to successfully employ the proposed algorithm ahead of SVM training.
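The selection idea (keep only patterns likely to sit near the decision boundary) can be approximated with a neighborhood purity check. This is a simplified stand-in for the paper's neighborhood-property algorithm, not a reproduction of it:

```python
import math

def near_boundary(data, k=2):
    """Keep only patterns whose k nearest neighbours are class-mixed.

    A point whose neighbourhood contains more than one label is likely
    close to the decision boundary and hence informative for SVM training;
    points in pure neighbourhoods are dropped before training.
    """
    selected = []
    for i, (x, y) in enumerate(data):
        others = sorted((p for j, p in enumerate(data) if j != i),
                        key=lambda p: math.dist(p[0], x))
        if len({label for _, label in others[:k]}) > 1:
            selected.append((x, y))
    return selected

# 1-D toy set: two pure clusters plus two borderline points in between.
data = [((0.0,), "a"), ((0.1,), "a"), ((0.5,), "a"),
        ((0.55,), "b"), ((1.0,), "b"), ((1.1,), "b")]
print(near_boundary(data))  # only the two borderline points survive
```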

12.
Cluster-based instance selection for machine classification
Instance selection in supervised machine learning, often referred to as data reduction, aims at deciding which instances from the training set should be retained for further use during the learning process. Instance selection can result in increased capabilities and generalization properties of the learning model, a shorter learning process, and it can help in scaling up to large data sources. The paper proposes a cluster-based instance selection approach, with the learning process executed by a team of agents, and discusses its four variants. The basic assumption is that instance selection is carried out after the training data have been grouped into clusters. To validate the proposed approach and to investigate the influence of the clustering method used on the quality of the classification, a computational experiment has been carried out.
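A minimal sketch of the cluster-then-select step: one representative instance is kept per cluster. The agent-team variants are not shown, and the cluster centers here are assumed given rather than learned:

```python
import math

def cluster_representatives(points, centers):
    """Reduce a training set to one representative instance per cluster.

    Each point joins the cluster of its nearest center; only the point
    closest to each center is retained for the subsequent learning process.
    """
    buckets = {c: [] for c in centers}
    for p in points:
        buckets[min(centers, key=lambda c: math.dist(p, c))].append(p)
    return [min(b, key=lambda p: math.dist(p, c))
            for c, b in buckets.items() if b]

reduced = cluster_representatives(
    [(0.0, 0.0), (0.2, 0.0), (1.0, 1.0), (1.2, 1.0)],
    centers=[(0.0, 0.0), (1.0, 1.0)])
print(reduced)  # -> [(0.0, 0.0), (1.0, 1.0)]
```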

13.
In supervised learning, a training set providing previously known information is used to classify new instances. Commonly, many instances are stored in the training set, but some of them are not useful for classification; it is therefore possible to obtain acceptable classification rates while ignoring the non-useful cases. This process is known as instance selection. Through instance selection the training set is reduced, which reduces runtimes in the classification and/or training stages of classifiers. This work presents a survey of the main instance selection methods reported in the literature.

14.
Classification is a key problem in machine learning/data mining. Algorithms for classification have the ability to predict the class of a new instance after having been trained on data representing past experience in classifying instances. However, the presence of a large number of features in training data can hurt the classification capacity of a machine learning algorithm. The feature selection problem involves discovering a subset of features such that a classifier built only with this subset would attain predictive accuracy no worse than a classifier built from the entire set of features. Several algorithms have been proposed to solve this problem. In this paper we discuss how parallelism can be used to improve the performance of feature selection algorithms. In particular, we present, discuss, and evaluate a coarse-grained parallel version of the feature selection algorithm FortalFS. This algorithm performs well compared with other solutions, and it has certain characteristics that make it a good candidate for parallelization. Our parallel design is based on the master-slave design pattern. Promising results show that this approach is able to achieve near-optimum speedups in the context of Amdahl's Law.
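The master-slave skeleton behind such a parallelization can be sketched with a thread pool. Here `score_subset` is a hypothetical stand-in for FortalFS's real subset evaluation, which would train and test a classifier:

```python
from concurrent.futures import ThreadPoolExecutor

def score_subset(subset):
    """Toy scoring function; real code would train/evaluate a classifier."""
    return sum(subset)

def parallel_feature_search(subsets, workers=4):
    """Master-slave pattern: the master farms independent subset
    evaluations out to worker threads, then keeps the best-scoring subset.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(score_subset, subsets))
    return subsets[max(range(len(subsets)), key=scores.__getitem__)]

# Feature subsets encoded as 0/1 masks; the toy score favours more features.
print(parallel_feature_search([(1, 0, 0), (1, 1, 0), (1, 1, 1)]))  # -> (1, 1, 1)
```

Because each subset evaluation is independent, this is embarrassingly parallel in the evaluation phase; the sequential scatter/gather around it is what bounds the speedup per Amdahl's Law.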

15.
Deep learning has shown significant improvements on various machine learning tasks by introducing a wide spectrum of neural network models. Yet, these neural network models require a tremendous amount of labeled training data, which is prohibitively expensive in reality. In this paper, we propose the OnLine Machine Learning (OLML) database, which stores trained models and reuses them in a new training task to achieve a better training effect with a small amount of training data. An efficient model reuse algorithm, AdaReuse, is developed in the OLML database. Specifically, AdaReuse first estimates the reuse potential of trained models from domain relatedness and model quality, through which a group of trained models with high reuse potential for the training task can be selected efficiently. The selected models are then trained iteratively to encourage diversity, so that a better training effect can be achieved by ensembling them. We evaluate AdaReuse on two types of natural language processing (NLP) tasks, and the results show that AdaReuse improves the training effect significantly compared with models trained from scratch when the training data is limited. Based on AdaReuse, we implement an OLML database prototype system that accepts a training task as an SQL-like query and automatically generates a training plan by selecting and reusing trained models. Usability studies illustrate that the OLML database can properly store trained models and reuse them efficiently in new training tasks.

16.
To address the problem that many multi-instance algorithms make assumptions about the instances contained in positive bags, a multi-instance ensemble algorithm combining fuzzy clustering (ISFC) is proposed. Combining fuzzy clustering with the characteristics of negative bags in multi-instance learning, the concept of a "positive score" is introduced to measure the likelihood that an instance's label is positive, reducing the label ambiguity of instances in multi-instance learning. Considering that misclassifying negative instances incurs a higher cost in multi-instance learning, a representative-instance selection strategy for bags is designed; the selected representative…

17.
Recently, addressing the few-shot learning problem within the meta-learning framework has achieved great success. Regularization is a powerful technique widely used to improve machine learning algorithms, yet little research has focused on designing appropriate meta-regularizations to further improve the generalization of meta-learning models in few-shot learning. In this paper, we propose a novel meta-contrastive loss that can be regarded as a regularization to fill this gap. The motivation of our method stems from the observation that the limited data in few-shot learning is just a small part of the data sampled from the whole distribution, and different sampled parts can lead to differently biased representations of the whole data. Thus, the models trained on the few training data (support set) and test data (query set) might be misaligned in the model space, so a model learned on the support set may not generalize well to the query data. The proposed meta-contrastive loss is designed to align the models of the support and query sets to overcome this problem, improving the performance of the meta-learning model in few-shot learning. Extensive experiments demonstrate that our method improves the performance of different gradient-based meta-learning models in various learning problems, e.g., few-shot regression and classification.

18.
Semi-supervised learning is a machine learning paradigm that can be applied to create pseudo labels from unlabeled data for learning a ranking model when only limited or no training examples are available. However, the effectiveness of semi-supervised learning in information retrieval (IR) can be hindered by low-quality pseudo labels, hence the need for training-query filtering that removes the low-quality queries. In this paper, we assume two application scenarios with respect to the availability of human labels. First, for applications without any labeled data available, a clustering-based approach is proposed to select high-quality training queries. This approach selects training queries following the empirical observation that the relevant documents of high-quality training queries are highly coherent. Second, for applications with limited labeled data available, a classification-based approach is proposed. This approach learns a weak classifier to predict the retrieval performance gain of a given training query by making use of query features. The queries with high performance gains are selected for the following transduction process to create the pseudo labels for learning-to-rank algorithms. Experimental results on the standard LETOR dataset show that our proposed approaches outperform the strong baselines.

19.
Recent machine learning challenges require the capability of learning in non-stationary environments. These challenges imply the development of new algorithms able to deal with changes in the underlying problem to be learnt. These changes can be gradual or trend changes, abrupt changes, and recurring contexts. As the dynamics of the changes can be very different, existing machine learning algorithms have difficulty coping with them. Several methods using, for instance, ensembles or variable-length windowing have been proposed to approach this task. In this work we propose a new method, for single-layer neural networks, that is based on the introduction of a forgetting function in an incremental online learning algorithm. This forgetting function gives a monotonically increasing importance to new data. Due to the combination of incremental learning and increasing importance assignment, the network forgets rapidly in the presence of changes while maintaining a stable behavior when the context is stationary. The performance of the method has been tested over several regression and classification problems and its results compared with those of previous works. The proposed algorithm has demonstrated high adaptation to changes while maintaining a low consumption of computational resources.
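The forgetting-function idea (monotonically increasing weight on new data) can be illustrated on the simplest possible online estimator. The decay constant `lam` and the mean-tracking task are illustrative choices, not the paper's single-layer network:

```python
def online_mean_with_forgetting(stream, lam=0.9):
    """Track a running mean that forgets old data.

    Each step discounts the accumulated past by `lam` while the newest
    sample enters with full weight, so the estimate adapts quickly after
    an abrupt change yet stays stable when the stream is stationary.
    """
    est, weight = 0.0, 0.0
    for x in stream:
        weight = lam * weight + 1.0   # old evidence decays, new counts fully
        est += (x - est) / weight     # incremental weighted-mean update
    return est

# After an abrupt change the estimate tracks the new level within a few steps.
print(round(online_mean_with_forgetting([0.0] * 50 + [10.0] * 50), 2))
```

On a stationary stream the estimate is exact; after an abrupt shift the effective memory of roughly 1 / (1 - lam) samples lets it converge to the new level quickly.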

20.
We present a fast video retrieval system with three novel characteristics. First, it exploits the methods of machine learning to construct automatically a hierarchy of small subsets of features that are progressively more useful for indexing. These subsets are induced by a new heuristic method called Sort-Merge feature selection, which exploits a novel combination of Fastmap for dimensionality reduction and Mahalanobis distance for likelihood determination. Second, because these induced feature sets form a hierarchy with increasing classification accuracy, video segments can be segmented and categorized simultaneously in a coarse-fine manner that efficiently and progressively detects and refines their temporal boundaries. Third, the feature set hierarchy enables an efficient implementation of query systems by the approach of lazy evaluation, in which new queries are used to refine the retrieval index in real-time. We analyze the performance of these methods, and demonstrate them in the domain of a 75-min instructional video and a 30-min baseball video.
