Similar Documents (20 results)
1.
The K-nearest neighbor (K-NN) algorithm is a classification algorithm widely used in machine learning, statistical pattern recognition, and data mining. The ordered weighted averaging (OWA) distance based CxK nearest neighbor algorithm is a variant of K-NN built on the OWA distance. The aim of this study is two-fold: i) to run the algorithm with two different fuzzy metric measures, namely the Diamond distance and a weighted dissimilarity measure composed of spread distances and center distances, and ii) to evaluate the effects of these different metrics. K neighbors are searched for in each class, and the OWA distance is used to aggregate the information. Depending on the weights used, the OWA distance can behave like the single, complete, or average linkage intercluster distances. The experimental study is performed on three well-known classification data sets (iris, glass, and wine), with N-fold cross-validation used to evaluate performance. The single linkage approach yields significantly different results under the two metric measures.
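
A minimal NumPy sketch of the class-wise (CxK) idea described above, using plain Euclidean distance in place of the paper's fuzzy metric measures; the function names and weight choices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def owa(values, weights):
    """OWA aggregation: sort the values (ascending) and take a weighted sum."""
    return float(np.sort(values) @ np.asarray(weights))

def cxk_nn_predict(X_train, y_train, x, k, owa_weights):
    """Search K neighbors separately in each class, aggregate their distances
    with OWA, and assign x to the class with the smallest aggregated distance."""
    scores = {}
    for c in np.unique(y_train):
        d = np.linalg.norm(X_train[y_train == c] - x, axis=1)  # Euclidean stand-in
        scores[c] = owa(np.sort(d)[:k], owa_weights)
    return min(scores, key=scores.get)

# With ascending sorting, weights [1, 0, 0], [1/3, 1/3, 1/3], and [0, 0, 1]
# make the aggregated distance behave like single, average, and complete linkage.
```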

2.
Electroencephalography signals are typically used for analyzing epileptic seizures. These signals are highly nonlinear and nonstationary, and the disease-specific patterns they contain are difficult to capture, which makes it hard to build an automatic epileptic seizure detection system. This paper uses the statistical mechanics of complex networks, which inherit the characteristic properties of the electroencephalography signals, for feature extraction via the horizontal visibility algorithm, reducing processing time and complexity. The algorithm transforms a time series into a complex network from which a reduced set of features is derived. Statistical mechanics of the network are then calculated to capture distinctions pertaining to particular diseases and to form a feature vector. The feature vector is classified via multiclass classification using a k-nearest neighbor classifier, a multilayer perceptron neural network, and a support vector machine under a 10-fold cross-validation criterion. In the performance evaluation on healthy, seizure-free-interval, and seizure signals, the input data length is first selected from practical signal samples by trading off accuracy against processing time; the proposed method then yields outstanding average classification accuracy on the 3-class problem, mainly for detecting seizure-free-interval and seizure signals, and acceptable results on the 2-class and 5-class problems compared with conventional methods. The proposed method is thus another tool for classifying signal patterns, as an alternative to time/frequency analyses.
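
A short sketch of the horizontal visibility construction referred to above (a quadratic brute-force version, not the paper's implementation): two samples are linked when every sample strictly between them is lower than both, and simple degree statistics of the resulting graph can populate a feature vector.

```python
import numpy as np

def horizontal_visibility_edges(x):
    """Edges (i, j) of the horizontal visibility graph of the series x:
    i < j are connected iff x[m] < min(x[i], x[j]) for all i < m < j."""
    n, edges = len(x), []
    for i in range(n - 1):
        for j in range(i + 1, n):
            if all(x[m] < min(x[i], x[j]) for m in range(i + 1, j)):
                edges.append((i, j))
    return edges

def degree_statistics(edges, n):
    """Simple network statistics (mean and maximum degree) of the kind that
    can feed the downstream classifiers."""
    deg = np.bincount(np.array(edges).ravel(), minlength=n)
    return deg.mean(), deg.max()
```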

3.
Although microfinance organizations play an important role in developing economies, decision support models for microfinance credit scoring have not been sufficiently covered in the literature, particularly for microcredit enterprises. The aim of this paper is to create a three-class model that can improve credit risk assessment in the microfinance context. The real-world microcredit data set used in this study includes data from retail, micro, and small enterprises. To the best of the authors' knowledge, existing research on microfinance credit scoring has been limited to regression and genetic algorithms, excluding more recent machine learning algorithms; this research aims to close that gap. The proposed models predict default events by analysing different ensemble classification methods that enhance the effect of the synthetic minority oversampling technique (SMOTE) applied in preprocessing the imbalanced microcredit data set. Initial results showed improved predictions for certain classes when the oversampling technique was combined with homogeneous and heterogeneous ensemble classifiers. A prediction improvement for all classes was achieved by applying SMOTE together with the Consolidated Trees Construction algorithm and Rotation Forest. To obtain a complete view of all aspects, an additional set of metrics is used in the performance evaluation.
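
A hedged sketch of the preprocessing-plus-ensemble pipeline on synthetic three-class data; it assumes the imbalanced-learn package and substitutes a random forest for the Consolidated Trees / Rotation Forest combination used in the paper.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Imbalanced 3-class data standing in for the retail / micro / small-enterprise set.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.8, 0.15, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample only the training fold, then fit an ensemble classifier on the balanced data.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)

# Per-class precision/recall/F1 gives the fuller picture the paper argues for.
print(classification_report(y_te, clf.predict(X_te)))
```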

4.
5.
This paper presents a new model developed by merging a non-parametric k-nearest-neighbor (kNN) preprocessor into an underlying support vector machine (SVM) to provide shelter for meaningful training examples, especially stray examples scattered among counterpart examples with different class labels. Motivated by the idea of adding a heavier penalty to stray examples to obtain a stricter loss function for optimization, the model acts to shelter such examples. The model consists of a filtering kNN emphasizer stage and a classical classification stage. First, the filtering kNN emphasizer stage collects information from the training examples and produces arbitrary weights for stray examples. Then, an underlying SVM with parameterized real-valued class labels carries those weights, representing the various emphasized levels of the examples, into the classification. The emphasized weights, given as heavier penalties, change the regularization in the quadratic programming of the SVM and raise the training accuracy of the resulting decision function. The novel idea of real-valued class labels for conveying the emphasized weights provides an effective way to pursue a classification solution informed by this additional information. The adoption of the kNN preprocessor as a filtering stage is effective because it is independent of the SVM in the classification stage. Owing to its local density estimation, the kNN method can distinguish stray examples from regular examples merely by considering their surroundings in the input space. Detailed experimental results and a simulated application are given to illustrate the corresponding properties, and the results show that the model meets its original expectations.
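
A minimal sketch of the two-stage idea under simplifying assumptions: an illustrative "strayness" weight (the fraction of disagreeing neighbours) stands in for the paper's kNN emphasizer and real-valued label mechanism, and the weights are passed to a standard soft-margin SVM as per-example penalties.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def knn_emphasis_weights(X, y, k=5, boost=2.0):
    """Heavier weights for 'stray' examples whose k nearest neighbours mostly
    carry a different label (an illustrative stand-in for the kNN emphasizer)."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    disagree = np.array([(y[nbrs[1:]] != y[i]).mean() for i, nbrs in enumerate(idx)])
    return 1.0 + boost * disagree

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
weights = knn_emphasis_weights(X, y)
# Per-example weights scale the SVM penalty, mimicking the stricter loss on stray points.
clf = SVC(kernel="rbf", C=1.0).fit(X, y, sample_weight=weights)
```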

6.
An investigation is conducted on two well-known similarity-based learning approaches to text categorization: the k-nearest neighbors (kNN) classifier and the Rocchio classifier. After identifying the weaknesses and strengths of each technique, a new classifier called the kNN model-based classifier (kNN Model) is proposed, combining the strengths of both kNN and Rocchio. A text categorization prototype, which implements kNN Model along with kNN and Rocchio, is described. An experimental evaluation of the different methods is carried out on two common document corpora: the 20-newsgroup collection and the ModApte version of the Reuters-21578 collection of news stories. The experimental results show that the proposed kNN model-based method outperforms the kNN and Rocchio classifiers and is therefore a good alternative to them in some application areas. This work was partly supported by the European Commission project ICONS, project no. IST-2001-32429.
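
A rough, hedged sketch of the hybrid flavour only, not the authors' exact kNN-Model construction: each class is summarised by a few local centroids (blending Rocchio's centroid view with kNN's local view), and a query is assigned to the nearest representative. All names and the use of k-means to form the representatives are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_local_representatives(X, y, reps_per_class=3):
    """Summarise each class by a few local centroids (a crude stand-in for
    the generalised representatives of a kNN-Model-style classifier)."""
    reps, labels = [], []
    for c in np.unique(y):
        km = KMeans(n_clusters=reps_per_class, n_init=10, random_state=0).fit(X[y == c])
        reps.append(km.cluster_centers_)
        labels.extend([c] * reps_per_class)
    return np.vstack(reps), np.array(labels)

def predict_nearest_representative(reps, rep_labels, x):
    """Rocchio-like decision over the reduced model: nearest representative wins."""
    return rep_labels[np.argmin(np.linalg.norm(reps - x, axis=1))]
```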

7.
Given a multidimensional point q, a reverse k nearest neighbor (RkNN) query retrieves all the data points that have q as one of their k nearest neighbors. Existing methods for processing such queries have at least one of the following deficiencies: they (i) do not support arbitrary values of k, (ii) cannot deal efficiently with database updates, (iii) are applicable only to 2D data but not to higher dimensionality, and (iv) retrieve only approximate results. Motivated by these shortcomings, we develop algorithms for exact RkNN processing with arbitrary values of k on dynamic, multidimensional datasets. Our methods utilize a conventional data-partitioning index on the dataset and do not require any pre-computation. As a second step, we extend the proposed techniques to continuous RkNN search, which returns the RkNN results for every point on a line segment. We evaluate the effectiveness of our algorithms with extensive experiments using both real and synthetic datasets.
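
For reference, the query semantics can be stated as a brute-force baseline, the kind of O(n²) computation the indexed algorithms above are designed to avoid; the function below is an illustration, not the paper's method.

```python
import numpy as np

def reverse_knn(data, q, k):
    """Return indices of all points p in data that have q among their k nearest
    neighbours, where p's neighbours are drawn from the other data points plus q."""
    hits = []
    for i, p in enumerate(data):
        candidates = np.vstack([np.delete(data, i, axis=0), q])
        kth_dist = np.sort(np.linalg.norm(candidates - p, axis=1))[k - 1]
        if np.linalg.norm(q - p) <= kth_dist:
            hits.append(i)
    return hits
```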

8.
This study developed a methodology for formulating water level models to forecast river stages during typhoons, comparing various models that use lazy and eager learning approaches. Two lazy learning models were introduced: the locally weighted regression (LWR) and the k-nearest neighbor (kNN) models. Their efficacy was compared with that of three eager learning models, namely, the artificial neural network (ANN), support vector regression (SVR), and linear regression (REG). These models were employed to analyze the Tanshui River Basin in Taiwan. The data collected comprised 50 historical typhoon events and the relevant hourly hydrological data from the river basin during 1996–2007. The forecasting horizon ranged from 1 h to 4 h. Various statistical measures were calculated, including the correlation coefficient, mean absolute error, and root mean square error; significance, computational efficiency, and the Akaike information criterion were also evaluated. The results indicated that (a) among the eager learning models, ANN and SVR yielded more favorable results than REG (based on statistical analyses and significance tests); although ANN, SVR, and REG all belong to the eager learning category, their predictive abilities varied with their global learning optimizers. (b) Among the lazy learning models, LWR performed more favorably than kNN; although both are lazy learners, their predictive abilities depended on their local learning optimizers. (c) The comparison between eager and lazy learning models showed that neither category was uniformly more effective, because the distinct approximators used by the individual models made performance depend on the specific model rather than on its category.
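
Of the models compared, locally weighted regression is the least standard; a compact sketch of its lazy, per-query fit is given below. The Gaussian kernel weights and the bandwidth parameter tau are assumptions for illustration, not necessarily the study's configuration.

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    """Locally weighted regression: fit a weighted least-squares model around
    the query point with Gaussian kernel weights. Nothing is fitted until a
    forecast is requested, which is what makes the learner 'lazy'."""
    Xb = np.hstack([np.ones((len(X), 1)), X])              # add intercept column
    xq = np.hstack([1.0, np.atleast_1d(x_query)])
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    beta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y    # weighted normal equations
    return xq @ beta
```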

9.
Because used-car recommendation data sets are imbalanced, used-car recommendation can be treated as an imbalanced classification problem and addressed with methods for imbalanced classification. This paper studies data set reconstruction for imbalanced classification. By analysing the characteristics and shortcomings of the Synthetic Minority Over-sampling Technique (SMOTE), it proposes the Synthetic Minority Over-sampling Technique Filter (SmoteFilter), which filters the samples synthesized by SMOTE to reduce noisy synthetic data and improve the "quality" of the training samples. A support vector machine is used to compare data synthesized by SMOTE with data produced by SmoteFilter; the results show that, compared with traditional SMOTE over-sampling, SmoteFilter improves the prediction accuracy of the minority class and the overall predictive performance of used-car recommendation.
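
A hedged sketch of the filtering idea described above, not the paper's exact rule: synthetic minority samples whose nearest original neighbours are mostly majority-class points are treated as noise and discarded. It assumes the imbalanced-learn SMOTE implementation, and the `purity` threshold is an illustrative parameter.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import NearestNeighbors

def smote_filter(X, y, minority_label, k=5, purity=0.5, random_state=0):
    """SMOTE followed by a simple noise filter: keep a synthetic sample only if
    at least a `purity` fraction of its k nearest original neighbours belong
    to the minority class."""
    X, y = np.asarray(X), np.asarray(y)
    X_res, y_res = SMOTE(random_state=random_state).fit_resample(X, y)
    X_syn, y_syn = X_res[len(X):], y_res[len(X):]   # imblearn appends synthetics after the originals
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors(X_syn)
    keep = np.array([(y[row] == minority_label).mean() >= purity for row in idx])
    return np.vstack([X, X_syn[keep]]), np.concatenate([y, y_syn[keep]])
```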

10.
Modified sequential k-means clustering addresses a k-means clustering problem in which the clustering machine additionally utilizes output similarity. Whereas conventional clustering methods commonly recognize similar instances only at the feature level, modified sequential clustering takes advantage of the response as well. The approach we pursue is to enhance the quality of clustering by using this additional information, which enables the clustering machine to detect more of the patterns and dependencies that may be relevant. This allows one to determine, for instance, which fashion products exhibit similar behaviour in terms of sales. Conventional clustering methods cannot tackle such cases, because they handle attributes solely at the feature level without considering any response. In this study, we introduce a novel approach based on minimum conditional entropy clustering and show its advantages for data analytics. In particular, we achieve this by modifying the conventional sequential k-means algorithm; the modified clustering approach reflects the response effect in a consistent manner. To verify the feasibility and performance of this approach, we conducted several experiments on real data from the apparel industry.
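
An illustrative take on the idea, not the paper's exact algorithm: a sequential (online) k-means whose assignment cost adds a penalty for response mismatch, so clusters stay coherent in the output (for example, sales behaviour) as well as in the features. The trade-off parameter `lam` is an assumption.

```python
import numpy as np

def sequential_kmeans_with_response(X, r, k, lam=1.0, seed=0):
    """Online k-means where each point is assigned by feature distance plus a
    response-mismatch penalty, and both the centroid and its mean response
    are updated incrementally."""
    X, r = np.asarray(X, dtype=float), np.asarray(r, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False)
    centers, resp, counts = X[idx].copy(), r[idx].copy(), np.ones(k)
    for x, y_out in zip(X, r):
        cost = np.linalg.norm(centers - x, axis=1) + lam * np.abs(resp - y_out)
        j = int(np.argmin(cost))
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]   # standard sequential centroid update
        resp[j] += (y_out - resp[j]) / counts[j]     # running mean of the cluster response
    return centers, resp
```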

11.
On Machine Learning Methods for Chinese Document Categorization
This paper reports a comparative evaluation of three machine learning methods, namely k Nearest Neighbor (kNN), Support Vector Machines (SVM), and Adaptive Resonance Associative Map (ARAM), for Chinese document categorization. Based on two Chinese corpora, a series of controlled experiments evaluated their learning capabilities and efficiency in mining text classification knowledge. Benchmark experiments showed that their predictive performance was roughly comparable, especially on clean and well-organized data sets. While kNN and ARAM yielded better performance than SVM on small and clean data sets, SVM and ARAM significantly outperformed kNN on noisy data. In terms of efficiency, kNN was notably more costly in time and memory than the other two methods. SVM is highly efficient in learning from well-organized samples of moderate size, while on relatively large and noisy data the efficiency of SVM and ARAM is comparable.

12.
Predicting the three-dimensional structure (fold) of a protein is a key problem in molecular biology and an interesting problem for statistical pattern recognition methods. In this paper a multi-class support vector machine (SVM) classifier is applied to a real-world data set. The SVM is a binary classifier, but protein fold recognition is a multi-class problem, so several new approaches to this issue are presented, including a modification of the well-known one-versus-one strategy. In that strategy, however, the number of binary classifiers that must be trained grows quickly with the number of classes; the methods proposed in this paper show how this problem can be overcome.
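
To make the growth concrete, the plain one-versus-one strategy trains k(k-1)/2 binary SVMs and lets them vote; a minimal sketch of that baseline is shown below. The paper's modification, which curbs this growth, is not reproduced here, and the helper names are illustrative.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def fit_one_vs_one(X, y):
    """Train one binary SVM per class pair: k classes yield k(k-1)/2 models."""
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = SVC(kernel="rbf").fit(X[mask], y[mask])
    return models

def predict_one_vs_one(models, x):
    """Majority vote over all pairwise classifiers."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models.values()]
    vals, counts = np.unique(votes, return_counts=True)
    return vals[np.argmax(counts)]
```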

13.
An important query for spatio-temporal databases is to find nearest trajectories of moving objects. Existing work on this topic focuses on the closest trajectories in the whole data space. In this paper, we introduce and solve constrained k-nearest neighbor (CkNN) queries and historical continuous CkNN (HCCkNN) queries on R-tree-like structures storing historical information about moving object trajectories. Given a trajectory set D, a query object (point or trajectory) q, a temporal extent T, and a constrained region CR, (i) a CkNN query over trajectories retrieves from D within T, the k (≥ 1) trajectories that lie closest to q and intersect (or are enclosed by) CR; and (ii) an HCCkNN query on trajectories retrieves the constrained k nearest neighbors (CkNNs) of q at any time instance of T. We propose a suite of algorithms for processing CkNN queries and HCCkNN queries respectively, with different properties and advantages. In particular, we thoroughly investigate two types of CkNN queries, i.e., CkNNP and CkNNT, which are defined with respect to stationary query points and moving query trajectories, respectively; and two types of HCCkNN queries, namely, HCCkNNP and HCCkNNT, which are continuous counterparts of CkNNP and CkNNT, respectively. Our methods utilize an existing data-partitioning index for trajectory data (i.e., TB-tree) to achieve low I/O and CPU cost. Extensive experiments with both real and synthetic datasets demonstrate the performance of the proposed algorithms in terms of efficiency and scalability.
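
Stripped of the trajectory and TB-tree machinery, the constrained-kNN semantics amounts to restricting candidates to the constrained region before ranking by distance; a toy point-based sketch (the axis-aligned region and function name are assumptions for illustration):

```python
import numpy as np

def constrained_knn(points, q, k, cr_min, cr_max):
    """Baseline CkNN over points: keep only candidates inside the axis-aligned
    constrained region [cr_min, cr_max], then return the k closest to q."""
    inside = points[np.all((points >= cr_min) & (points <= cr_max), axis=1)]
    d = np.linalg.norm(inside - q, axis=1)
    return inside[np.argsort(d)[:k]]
```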

14.
k Nearest Neighbor (kNN) search is one of the most important operations in spatial and spatio-temporal databases. Although it has received considerable attention in the database literature, there is little prior work on kNN retrieval for moving object trajectories. Motivated by this observation, this paper studies the problem of efficiently processing kNN (k ≥ 1) search on R-tree-like structures storing historical information about moving object trajectories. Two algorithms based on the best-first traversal paradigm are developed, called BFPkNN and BFTkNN, which handle kNN retrieval with respect to a static query point and a moving query trajectory, respectively. Both algorithms minimize the number of node accesses; that is, they perform a single access, and only to those qualifying nodes that may contain the final result. Several effective pruning heuristics are also presented to save main-memory consumption and further reduce CPU cost. Extensive experiments with synthetic and real datasets confirm that the proposed algorithms significantly outperform their competitors in both efficiency and scalability.
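
A generic sketch of the best-first traversal paradigm on a toy R-tree-like structure; nodes are plain dicts with an `mbr` (a pair of NumPy corner arrays) and either `children` or a `point`, which is an illustrative layout rather than the paper's data structure. Entries are popped in increasing MINDIST order, so only nodes that can contribute to the answer are accessed.

```python
import heapq
import numpy as np

def mindist(q, mbr):
    """Minimum Euclidean distance from point q to an axis-aligned box (lo, hi)."""
    lo, hi = mbr
    return float(np.linalg.norm(np.maximum(0.0, np.maximum(lo - q, q - hi))))

def best_first_knn(root, q, k):
    """Best-first kNN: keep a priority queue ordered by MINDIST; leaf entries
    popped from the queue are guaranteed to be the next nearest results."""
    heap, tie = [(mindist(q, root["mbr"]), 0, root)], 1
    results = []
    while heap and len(results) < k:
        dist, _, node = heapq.heappop(heap)
        if "point" in node:                    # leaf entry: emit as a result
            results.append((dist, node["point"]))
        else:                                  # inner node: enqueue its children
            for child in node["children"]:
                heapq.heappush(heap, (mindist(q, child["mbr"]), tie, child))
                tie += 1
    return results
```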

15.
This article deals with the model predictive control (MPC) of linear, time-invariant discrete-time polytopic (LTIDP) systems. The two-fold aim is to simplify the treatment of complex issues like stability and feasibility analysis of MPC in the presence of parametric uncertainty and to reduce the complexity of the associated optimization procedure. The new approach is based on a two-degrees-of-freedom (2DOF) control scheme, where the output r(k) of the feedforward input estimator (IE) is used as the input forcing the closed-loop system ∑f. ∑f is the feedback connection of an LTIDP plant ∑p with an LTI feedback controller ∑g. Both plants with measurable state and plants with unmeasurable state are considered. The task of ∑g is to guarantee the quadratic stability of ∑f, as well as the fulfillment of hard constraints on some physical variables for any input r(k) satisfying an "a priori" determined admissibility condition. The input r(k) is computed by the feedforward IE through the on-line minimization of a worst-case finite-horizon quadratic cost functional and is applied to ∑f according to the usual receding-horizon strategy. The on-line constrained optimization problem is simplified here by reducing the number of constraints and decision variables involved. This is obtained by modeling r(k) as a B-spline function, which is known to admit a parsimonious parametric representation. This allows the minimization of the worst-case cost functional to be reformulated as a box-constrained robust least-squares estimation problem, which can be efficiently solved using second-order cone programming.
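
The computational payoff of the B-spline parameterisation can be sketched in a few lines: the horizon-long input sequence is generated from a handful of control points, and a tracking cost with box constraints on those points becomes a small constrained least-squares problem. This is a hedged illustration with a nominal (not worst-case) cost and made-up numbers, not the paper's plant, cost, or second-order cone formulation.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import lsq_linear

horizon, n_ctrl, degree = 40, 6, 3
# Clamped knot vector: n_ctrl + degree + 1 knots, so 6 control points suffice for 40 samples.
knots = np.concatenate([[0.0] * degree, np.linspace(0, 1, n_ctrl - degree + 1), [1.0] * degree])
tau = np.linspace(0, 1, horizon)
# Basis matrix: column j is the j-th B-spline basis function sampled over the horizon.
basis = np.column_stack([BSpline(knots, np.eye(n_ctrl)[j], degree)(tau) for j in range(n_ctrl)])

target = np.sin(2 * np.pi * tau)                     # stand-in reference to track
sol = lsq_linear(basis, target, bounds=(-0.8, 0.8))  # box constraints on the control points
r = basis @ sol.x                                    # the resulting admissible input sequence
```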

16.
With the increasing frequency and destructiveness of product-harm events, research on enterprise crisis management has become essential, yet little of the literature thoroughly explores risk prediction methods for product-harm events. In this study, an initial index system for risk prediction was built based on an analysis of the key drivers of a product-harm event's evolution; ultimately, nine risk-forecasting indexes were obtained using rough set attribute reduction. With the four indexes of cumulative abnormal returns as input, fuzzy clustering was used to classify the risk level of a product-harm event into four grades. To control the uncertainty and instability of single classifiers in risk prediction, multiple classifier fusion was introduced and combined with self-organising data mining (SODM), leading to an SODM-based multiple classifier fusion (SB-MCF) model for risk prediction of product-harm events. The experimental results based on 165 Chinese listed companies indicated that the SB-MCF model improved the average predictive accuracy and simultaneously reduced its variation. The statistical analysis demonstrated that the SB-MCF model significantly outperformed six widely used single classification models (e.g. neural networks, support vector machine, and case-based reasoning) and six commonly used multiple classifier fusion methods (e.g. majority voting, the Bayesian method, and genetic algorithms).
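
For orientation, a minimal sketch of plain multiple-classifier fusion by soft voting on synthetic four-grade data; the SB-MCF model replaces this simple combiner with a self-organising data-mining layer, which is not reproduced here, and the base learners chosen below are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for four risk grades derived from fuzzy clustering.
X, y = make_classification(n_samples=600, n_classes=4, n_informative=6, random_state=0)

# Fuse heterogeneous single classifiers by averaging their class probabilities.
fused = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),
        ("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
).fit(X, y)
```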

17.
Maximum-margin clustering is an extension of the support vector machine (SVM) to clustering. It partitions a set of unlabeled data into multiple groups by finding hyperplanes with the largest margins. Although existing algorithms have shown promising results, there is no guarantee that they converge to global solutions, owing to the nonconvexity of the optimization problem. In this paper, we propose a simulated annealing-based algorithm that mitigates the issue of local minima in the maximum-margin clustering problem. The novelty of our algorithm is twofold: (i) it comprises a comprehensive cluster modification scheme based on simulated annealing, and (ii) it introduces a new approach based on the combination of k-means++ and SVM at each step of the annealing process. More precisely, k-means++ is initially applied to extract subsets of the data points; then an unsupervised SVM is applied to improve the clustering results. Experimental results on various benchmark data sets (of up to over a million points) give evidence that the proposed algorithm solves the clustering problem more effectively than a number of popular clustering algorithms.
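
A hedged sketch of one building block only: initialise labels with k-means++ and then alternate between fitting an SVM on the current pseudo-labels and relabelling points by its predictions. The paper wraps such moves in a simulated-annealing acceptance scheme, which is omitted here, and the two-cluster toy data is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, _ = make_blobs(n_samples=400, centers=2, cluster_std=2.0, random_state=0)
labels = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit_predict(X)

for _ in range(5):
    svm = SVC(kernel="linear", C=1.0).fit(X, labels)   # fit a separator to the current labels
    new_labels = svm.predict(X)                        # relabel points by the separator
    if len(np.unique(new_labels)) < 2 or np.array_equal(new_labels, labels):
        break                                          # stop if labels collapse or converge
    labels = new_labels
```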

18.
19.
The k-nearest neighbours (kNN) methods have been used successfully in many countries for the production of spatially comprehensive raster databases of forest attributes, made from the combination of National Forest Inventory (NFI) and satellite data. In Sweden, country-wide kNN estimates of forest variables have been produced to represent the forest condition in the years 2000 and 2005 by using a combination of Système Pour l'Observation de la Terre 5 (SPOT 5) satellite data and field data from the Swedish NFI. The resulting products are wall-to-wall raster maps with estimates of total stem volume, stem volume per tree species, tree height and stand age and a 25 × 25 m² pixel resolution. However, probability-based kNN stem volume estimates tend to have a suppressed variation range as large values are usually underestimated and small values are overestimated. One way to handle this problem is to calibrate the kNN stem volume estimates to the reference distribution of stem volume observations by histogram matching (HM) for a defined geographic area.

In this study, we have tested HM for the calibration of kNN total stem volume raster maps to the reference distribution captured by a forest inventory (FI) from 106 stands in Strömsjöliden, in the north of Sweden. The available field FI data set comprises 1084 circular plots, divided into a reference data set and an evaluation data set of total stem volume observations. The reference data set was used for the creation of a cumulative frequency histogram of total stem volume and the evaluation data set was used to assess the accuracy of volume estimates, before and after HM. The HM adjusted the cumulative distribution of the kNN data set to the distribution of the reference observations and resulted in a distribution of kNN estimates of total stem volume, which corresponded closely to that of the evaluation data set. The results show that the variation range of the kNN stem volume estimates can be extended by HM both on the pixel and stand levels. The extension of the range of estimates towards the range provided by the field observations allows improvement of kNN volume estimation for use in forest management planning based on stand-level analysis, given that the reference stem volume distribution can be estimated accurately, for example, using field data from NFI.
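
The calibration step itself is a small quantile-mapping operation; a sketch is given below, assuming both inputs are 1-D arrays of stem volumes (the function name is illustrative, not the study's software).

```python
import numpy as np

def histogram_match(knn_estimates, reference_volumes):
    """Map each kNN estimate onto the reference distribution by matching
    cumulative frequencies (quantile mapping), which stretches the compressed
    range of the estimates back towards the observed range."""
    knn_estimates = np.asarray(knn_estimates, dtype=float)
    ranks = np.argsort(np.argsort(knn_estimates))              # empirical ranks of the estimates
    quantiles = (ranks + 0.5) / len(knn_estimates)             # their positions on the CDF
    return np.quantile(np.asarray(reference_volumes, dtype=float), quantiles)  # invert the reference CDF
```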

20.
雷小锋 (Lei Xiaofeng), 陈皎 (Chen Jiao), 毛善君 (Mao Shanjun), 谢昆青 (Xie Kunqing). 《软件学报》 (Journal of Software), 2018, 29(12): 3764-3785
This paper establishes a general framework for batch edge-deletion clustering algorithms on adjacency graphs and proposes a batch edge-deletion decision criterion based on a Gaussian smoothing model. It defines the general properties an adjacency graph should have to be suitable for clustering, and proposes and proves that a random kNN graph, constructed by introducing a random factor into the kNN graph, strengthens the local connectivity between vertices, so that the clustering result no longer depends strongly on whether any particular edge or edges are kept or deleted. The resulting RkNNClus algorithm is simple and efficient, depends on few parameters, and does not require the number of clusters to be specified, as demonstrated by experiments on both simulated and real data.
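
A hedged sketch of the overall scheme, not the paper's RkNNClus algorithm: a kNN graph augmented with a few random neighbours per vertex (the "random factor" improving local connectivity), a crude mean-plus-standard-deviation edge-length cutoff standing in for the Gaussian-smoothing deletion criterion, and clusters read off the connected components, so no cluster count is needed. Parameter names are illustrative.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors

def random_knn_graph_cluster(X, k=8, extra=2, seed=0):
    """Cluster by deleting long edges from a randomized kNN graph and taking
    connected components of what remains."""
    rng = np.random.default_rng(seed)
    n = len(X)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    rows, cols = [], []
    for i in range(n):
        for j in idx[i][1:]:                 # the k nearest neighbours of vertex i
            rows.append(i)
            cols.append(int(j))
        for j in rng.choice(n, size=extra, replace=False):   # a few random extra neighbours
            if j != i:
                rows.append(i)
                cols.append(int(j))
    lengths = np.linalg.norm(X[rows] - X[cols], axis=1)
    keep = lengths <= lengths.mean() + lengths.std()         # simple stand-in deletion rule
    adj = coo_matrix((np.ones(keep.sum()),
                      (np.array(rows)[keep], np.array(cols)[keep])), shape=(n, n))
    _, labels = connected_components(adj, directed=False)
    return labels
```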
