Similar literature
 20 similar articles found (search time: 31 ms)
1.
The K-nearest neighbor (K-NN) algorithm is a classification algorithm widely used in machine learning, statistical pattern recognition, data mining, and related fields. The ordered weighted averaging (OWA) distance based CxK nearest neighbor algorithm is a K-NN variant built on the OWA distance. The aim of this study is two-fold: (i) to run the algorithm with two different fuzzy metric measures, namely the Diamond distance and a weighted dissimilarity measure composed of spread distances and center distances, and (ii) to evaluate the effects of the different metric measures. K neighbors are searched for each class, and the OWA distance is used to aggregate the information. With different weights, the OWA distance can behave like the single, complete, and average linkage approaches to intercluster distance. The experimental study is performed on three well-known classification data sets (iris, glass, and wine). N-fold cross-validation is used to evaluate performance. The single linkage approach yields significantly different results under the two metric measures.
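
The role of the weights in the OWA operator can be illustrated with a minimal sketch. It assumes plain numeric distances rather than the fuzzy metrics of the abstract, and the function name `owa` and the sample distances are hypothetical.

```python
def owa(values, weights):
    """Ordered weighted average: weights apply to the values sorted in descending order."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


# Hypothetical distances from a query point to the K = 3 neighbors of one class.
distances = [0.4, 1.5, 0.9]

single = owa(distances, [0.0, 0.0, 1.0])    # all weight on the smallest distance
complete = owa(distances, [1.0, 0.0, 0.0])  # all weight on the largest distance
average = owa(distances, [1 / 3] * 3)       # uniform weights give the mean
```

Putting all weight on the last position reproduces single linkage (minimum), all weight on the first position reproduces complete linkage (maximum), and uniform weights reproduce average linkage.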

2.
The moving k nearest neighbor (MkNN) query continuously finds the k nearest neighbors of a moving query point. MkNN queries can be efficiently processed through the use of safe regions. In general, a safe region is a region within which the query point can move without changing the query answer. This paper presents an incremental safe-region-based technique for answering MkNN queries, called the V*-Diagram, as well as analysis and evaluation of its associated algorithm, V*-kNN. Traditional safe-region approaches compute a safe region based on the data objects but independent of the query location. Our approach exploits the knowledge of the query location and the boundary of the search space in addition to the data objects. As a result, V*-kNN has much smaller I/O and computation costs than existing methods. We further provide cost models to estimate the number of data accesses for V*-kNN and a competing technique, RIS-kNN. The V*-Diagram and V*-kNN are also applicable to the domain of spatial networks, and we present algorithms to construct a spatial-network V*-Diagram. Our experimental results show that V*-kNN significantly outperforms the competing technique. The results also verify the accuracy of the cost models.

3.
This study developed a methodology for formulating water level models to forecast river stages during typhoons, comparing various models that use lazy and eager learning approaches. Two lazy learning models were introduced: the locally weighted regression (LWR) and the k-nearest neighbor (kNN) models. Their efficacy was compared with that of three eager learning models, namely, the artificial neural network (ANN), support vector regression (SVR), and linear regression (REG). These models were employed to analyze the Tanshui River Basin in Taiwan. The data collected comprised 50 historical typhoon events and relevant hourly hydrological data from the river basin during 1996–2007. The forecasting horizon ranged from 1 h to 4 h. Various statistical measures were calculated, including the correlation coefficient, mean absolute error, and root mean square error. Moreover, significance, computation efficiency, and the Akaike information criterion were evaluated. The results indicated the following. (a) Among the eager learning models, ANN and SVR yielded more favorable results than REG (based on statistical analyses and significance tests). Although ANN, SVR, and REG were all categorized as eager learning models, their predictive abilities varied according to their global learning optimizers. (b) Among the lazy learning models, LWR performed more favorably than kNN. Although LWR and kNN were both categorized as lazy learning models, their predictive abilities depended on their different local learning optimizers. (c) A comparison of eager and lazy learning models indicated that neither category was consistently more effective or yielded more favorable results, because the distinct approximators of the models within each category made performance dependent on the individual model.

4.
Electroencephalography signals are typically used for analyzing epileptic seizures. These signals are highly nonlinear and nonstationary, and certain disease types exhibit specific patterns, which makes it hard to develop an automatic epileptic seizure detection system. This paper discusses the statistical mechanics of complex networks, which inherit the characteristic properties of electroencephalography signals, for feature extraction via a horizontal visibility algorithm in order to reduce processing time and complexity. The algorithm transforms a time series signal into a complex network in which some features are abbreviated. Statistical-mechanics measures of the network are calculated to capture distinctions pertaining to certain diseases and to form a feature vector. The feature vector is classified by multiclass classification via a k-nearest neighbor classifier, a multilayer perceptron neural network, and a support vector machine under a 10-fold cross-validation criterion. In the performance evaluation of the proposed method with healthy, seizure-free interval, and seizure signals, the input data length is first chosen among practical signal sample lengths by trading off accuracy against processing time. The proposed method yields outstanding average classification accuracy for 3-class problems, mainly for the detection of seizure-free interval and seizure signals, and acceptable results for 2-class and 5-class problems compared with conventional methods. The proposed method is another tool that can be used for classifying signal patterns, as an alternative to time/frequency analyses.
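
The transformation of a time series into a network can be sketched with the standard horizontal visibility criterion: samples i and j are linked when every sample between them is lower than both. The function names and the use of the degree sequence as a feature are assumptions for illustration; the abstract does not say which network statistics the authors compute.

```python
def horizontal_visibility_edges(series):
    """Return the edges (i, j), i < j, of the horizontal visibility graph of a series."""
    edges = []
    n = len(series)
    for i in range(n - 1):
        highest_between = float("-inf")  # tallest sample strictly between i and j
        for j in range(i + 1, n):
            if highest_between < min(series[i], series[j]):
                edges.append((i, j))
            highest_between = max(highest_between, series[j])
            if highest_between >= series[i]:
                break  # nothing further to the right can be seen from i
    return edges


def degree_sequence(edges, n):
    """Node degrees, one simple network statistic usable as a feature (assumed here)."""
    degrees = [0] * n
    for i, j in edges:
        degrees[i] += 1
        degrees[j] += 1
    return degrees


signal = [3, 1, 2, 4]  # toy samples, not EEG data
edges = horizontal_visibility_edges(signal)
degrees = degree_sequence(edges, len(signal))
```

Neighboring samples are always linked, so the graph stays connected, and the early exit keeps the scan short whenever a tall sample blocks the view.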

5.
Although microfinance organizations play an important role in developing economies, decision support models for microfinance credit scoring have not been sufficiently covered in the literature, particularly for microcredit enterprises. The aim of this paper is to create a three-class model that can improve credit risk assessment in the microfinance context. The real-world microcredit data set used in this study includes data from retail, micro, and small enterprises. To the best of the authors' knowledge, existing research on microfinance credit scoring has been limited to regression and genetic algorithms, thereby excluding novel machine learning algorithms. The aim of this research is to close this gap. The proposed models predict default events by analysing different ensemble classification methods that strengthen the effects of the synthetic minority oversampling technique (SMOTE) used in the preprocessing of the imbalanced microcredit data set. Initial results have shown improvement in the prediction results for certain classes when the oversampling technique was applied with homogeneous and heterogeneous ensemble classifier methods. A prediction improvement for all classes was achieved by applying SMOTE and the Consolidated Trees Construction algorithm together with Rotation Forest. To obtain a complete view of all aspects, an additional set of metrics is used in the evaluation of performance.

6.
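To address the traffic imbalance caused by the flood of peer-to-peer (P2P) traffic in networks, this paper proposes applying the idea of imbalanced data classification to traffic identification. By introducing and improving the synthetic minority oversampling technique (SMOTE) algorithm, a mean SMOTE (M-SMOTE) algorithm is proposed to balance the traffic data. On this basis, three machine learning classifiers, random forest (RF), support vector machine (SVM), and back-propagation neural network (BPNN), are used to identify each class of traffic after balancing. Theoretical analysis and simulation results show that, without affecting the identification accuracy of P2P traffic, introducing the SMOTE algorithm raises the identification accuracy of non-P2P traffic by 16.5 percentage points on average and the overall identification rate of network traffic by 9.5 percentage points compared with the imbalanced state; compared with the SMOTE algorithm, the M-SMOTE algorithm further raises the identification accuracy of non-P2P traffic and the overall identification rate of network traffic by 3.2 and 2.6 percentage points, respectively. The experimental results show that the idea of imbalanced data classification can effectively solve the problem of low non-P2P identification rates caused by excessive P2P traffic, and that the proposed M-SMOTE algorithm achieves higher identification accuracy.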
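
For orientation, the basic SMOTE step can be sketched as follows: pick a minority sample, pick one of its k nearest minority neighbors, and create a synthetic point on the segment between them. This is plain SMOTE under simplifying assumptions, not the M-SMOTE variant, whose details the abstract does not give; the function name, parameters, and toy data are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolating between neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen sample (excluding itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy minority class (e.g. feature vectors of the under-represented traffic type).
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote(minority, n_new=5)
```

Because every synthetic point lies between two existing minority samples, the new data stay inside the region the minority class already occupies.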

7.
This paper presents a new model developed by merging a non-parametric k-nearest-neighbor (kNN) preprocessor into an underlying support vector machine (SVM) to provide shelters for meaningful training examples, especially for stray examples scattered among examples with different class labels. Motivated by the method of adding a heavier penalty to stray examples to attain a stricter loss function for optimization, the model acts to shelter stray examples. The model consists of a filtering kNN emphasizer stage and a classical classification stage. First, the filtering kNN emphasizer stage is employed to collect information from the training examples and to produce arbitrary weights for stray examples. Then, an underlying SVM with parameterized real-valued class labels is employed to carry those weights, which represent the emphasis levels of the examples, into the classification. The emphasized weights, given as heavier penalties, change the regularization in the quadratic programming of the SVM and lead to a decision function with higher training accuracy. The novel idea of real-valued class labels for conveying the emphasized weights provides an effective way to pursue a classification solution informed by the additional information. The adoption of the kNN preprocessor as a filtering stage is effective since it is independent of the SVM in the classification stage. Because it estimates density locally, the kNN method can distinguish stray examples from regular examples merely by considering their surroundings in the input space. In this paper, detailed experimental results and a simulated application are given to address the corresponding properties. The results show that the model is promising in terms of its original expectations.
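
The filtering stage can be pictured with a small sketch that scores each training example by how many of its k nearest neighbors carry a different label. The weighting rule used here (1 plus the disagreeing fraction) and the function name are assumptions for illustration only; the abstract does not state the authors' formula.

```python
import math


def stray_weights(points, labels, k=2):
    """Weight each example by the share of its k nearest neighbors with another label."""
    weights = []
    for i, p in enumerate(points):
        nearest = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        disagree = sum(labels[j] != labels[i] for j in nearest)
        weights.append(1.0 + disagree / len(nearest))  # assumed rule: 1.0 (regular) to 2.0 (stray)
    return weights


# Two toy clusters plus one class-"A" example stranded inside the "B" cluster.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (5.5, 5.5)]
labels = ["A", "A", "A", "B", "B", "B", "A"]
weights = stray_weights(points, labels)
```

The stranded example receives the largest weight, which is the kind of per-example emphasis the classification stage would then carry into training.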

8.
9.
An important query for spatio-temporal databases is to find the nearest trajectories of moving objects. Existing work on this topic focuses on the closest trajectories in the whole data space. In this paper, we introduce and solve constrained k-nearest neighbor (CkNN) queries and historical continuous CkNN (HCCkNN) queries on R-tree-like structures storing historical information about moving object trajectories. Given a trajectory set D, a query object (point or trajectory) q, a temporal extent T, and a constrained region CR, (i) a CkNN query over trajectories retrieves from D, within T, the k (≥ 1) trajectories that lie closest to q and intersect (or are enclosed by) CR; and (ii) an HCCkNN query on trajectories retrieves the constrained k nearest neighbors (CkNNs) of q at any time instance of T. We propose a suite of algorithms for processing CkNN queries and HCCkNN queries respectively, with different properties and advantages. In particular, we thoroughly investigate two types of CkNN queries, i.e., CkNNP and CkNNT, which are defined with respect to stationary query points and moving query trajectories, respectively; and two types of HCCkNN queries, namely, HCCkNNP and HCCkNNT, which are continuous counterparts of CkNNP and CkNNT, respectively. Our methods utilize an existing data-partitioning index for trajectory data (i.e., TB-tree) to achieve low I/O and CPU cost. Extensive experiments with both real and synthetic datasets demonstrate the performance of the proposed algorithms in terms of efficiency and scalability.

10.
An investigation is conducted on two well-known similarity-based learning approaches to text categorization: the k-nearest neighbors (kNN) classifier and the Rocchio classifier. After identifying the weaknesses and strengths of each technique, a new classifier called the kNN model-based classifier (kNN Model) is proposed. It combines the strengths of both kNN and Rocchio. A text categorization prototype, which implements kNN Model along with kNN and Rocchio, is described. An experimental evaluation of the different methods is carried out on two common document corpora: the 20-newsgroup collection and the ModApte version of the Reuters-21578 collection of news stories. The experimental results show that the proposed kNN model-based method outperforms the kNN and Rocchio classifiers, and is therefore a good alternative to kNN and Rocchio in some application areas. This work was partly supported by the European Commission project ICONS, project no. IST-2001-32429.

11.
The use of data-driven predictive systems is becoming widespread as innovations in machine learning techniques have allowed the training of increasingly sophisticated models from the available data. The light detection and ranging (LiDAR) remote sensing technique is being increasingly applied to obtain informative terrain maps, due to its ability to collect large amounts of data with satisfactory accuracy. This paper focuses on the application of machine-learning-based predictive systems to the extraction of biomass information from LiDAR data. Biomass information has immense ecological and economic value. We demonstrate the estimation of the Pinus radiata biomass in the Arratia-Nervión region (Spain). Biomass estimation is treated as a regression problem in which the ground truth for some specific sample sites is available. The promising results obtained in this study indicate that LiDAR data can be used to carry out detailed biomass mapping by extrapolating the models trained in this study.

12.
Modified sequential k-means clustering concerns a k-means clustering problem in which the clustering machine additionally utilizes output similarity. While conventional clustering methods commonly recognize similar instances at the feature level, modified sequential clustering also takes advantage of the response. To this end, the approach we pursue is to enhance the quality of clustering by using this additional information. The information enables the clustering machine to detect more patterns and dependencies that may be relevant. This allows one to determine, for instance, which fashion products exhibit similar behaviour in terms of sales. Unfortunately, conventional clustering methods cannot tackle such cases, because they handle attributes solely at the feature level without considering any response. In this study, we introduce a novel approach based on minimum conditional entropy clustering and show its advantages in terms of data analytics. In particular, we achieve this by modifying the conventional sequential k-means algorithm. This modified clustering approach is able to reflect the response effect in a consistent manner. To verify the feasibility and the performance of this approach, we conducted several experiments based on real data from the apparel industry.

13.
On Machine Learning Methods for Chinese Document Categorization   Total citations: 1 (self-citations: 0, citations by others: 1)
This paper reports our comparative evaluation of three machine learning methods, namely k Nearest Neighbor (kNN), Support Vector Machines (SVM), and Adaptive Resonance Associative Map (ARAM), for Chinese document categorization. Based on two Chinese corpora, a series of controlled experiments evaluated their learning capabilities and efficiency in mining text classification knowledge. Benchmark experiments showed that their predictive performances were roughly comparable, especially on clean and well organized data sets. While kNN and ARAM yielded better performance than SVM on small and clean data sets, SVM and ARAM significantly outperformed kNN on noisy data. In terms of efficiency, kNN was notably more costly in time and memory than the other two methods. SVM is highly efficient in learning from well organized samples of moderate size, although on relatively large and noisy data the efficiency of SVM and ARAM is comparable.

14.
Predicting student attrition is an intriguing yet challenging problem for any academic institution. Class-imbalanced data is common in the field of student retention, mainly because many students register but fewer students drop out. Classification techniques for imbalanced datasets can yield deceptively high prediction accuracy, where the overall predictive accuracy is usually driven by the majority class at the expense of very poor performance on the crucial minority class. In this study, we compared different data balancing techniques to improve the predictive accuracy in the minority class while maintaining satisfactory overall classification performance. Specifically, we tested three balancing techniques (over-sampling, under-sampling, and synthetic minority over-sampling (SMOTE)) along with four popular classification methods (logistic regression, decision trees, neural networks, and support vector machines). We used a large and feature-rich institutional student data set (covering the years 2005 to 2011) to assess the efficacy of both the balancing techniques and the prediction methods. The results indicated that the support vector machine combined with the SMOTE data-balancing technique achieved the best classification performance, with a 90.24% overall accuracy on the 10-fold holdout sample. All three data-balancing techniques improved the prediction accuracy for the minority class. Applying sensitivity analyses to the developed models, we also identified the most important variables for accurate prediction of student attrition. Application of these models has the potential to accurately predict at-risk students and help reduce student dropout rates.

15.
κ Nearest Neighbor (κNN) search is one of the most important operations in spatial and spatio-temporal databases. Although it has received considerable attention in the database literature, there is little prior work on κNN retrieval for moving object trajectories. Motivated by this observation, this paper studies the problem of efficiently processing κNN (κ ≥ 1) search on R-tree-like structures storing historical information about moving object trajectories. Two algorithms based on the best-first traversal paradigm are developed, called BFPκNN and BFTκNN, which handle κNN retrieval with respect to a static query point and a moving query trajectory, respectively. Both algorithms minimize the number of node accesses, that is, they perform a single access only to those qualifying nodes that may contain the final result. To save main-memory consumption and reduce CPU cost further, several effective pruning heuristics are also presented. Extensive experiments with synthetic and real datasets confirm that the proposed algorithms significantly outperform their competitors in both efficiency and scalability.

16.
This article deals with the model predictive control (MPC) of linear, time-invariant discrete-time polytopic (LTIDP) systems. The 2-fold aim is to simplify the treatment of complex issues like stability and feasibility analysis of MPC in the presence of parametric uncertainty, as well as to reduce the complexity of the associated optimization procedure. The new approach is based on a two degrees of freedom (2DOF) control scheme, where the output r(k) of the feedforward input estimator (IE) is used as the input forcing the closed-loop system ∑f. ∑f is the feedback connection of an LTIDP plant ∑p with an LTI feedback controller ∑g. Both cases of plants with measurable and unmeasurable state are considered. The task of ∑g is to guarantee the quadratic stability of ∑f, as well as the fulfillment of hard constraints on some physical variables for any input r(k) satisfying an "a priori" determined admissibility condition. The input r(k) is computed by the feedforward IE through the on-line minimization of a worst-case finite-horizon quadratic cost functional and is applied to ∑f according to the usual receding horizon strategy. The on-line constrained optimization problem is simplified here by reducing the number of constraints and decision variables involved. This is achieved by modeling r(k) as a B-spline function, which is known to admit a parsimonious parametric representation. This allows us to reformulate the minimization of the worst-case cost functional as a box-constrained robust least squares estimation problem, which can be efficiently solved using second-order cone programming.

17.
18.
Maximum-margin clustering is an extension of the support vector machine (SVM) to clustering. It partitions a set of unlabeled data into multiple groups by finding hyperplanes with the largest margins. Although existing algorithms have shown promising results, there is no guarantee that these algorithms converge to global solutions, due to the nonconvexity of the optimization problem. In this paper, we propose a simulated annealing-based algorithm that is able to mitigate the issue of local minima in the maximum-margin clustering problem. The novelty of our algorithm is twofold, i.e., (i) it comprises a comprehensive cluster modification scheme based on simulated annealing, and (ii) it introduces a new approach based on the combination of k-means++ and SVM at each step of the annealing process. More precisely, k-means++ is initially applied to extract subsets of the data points. Then, an unsupervised SVM is applied to improve the clustering results. Experimental results on various benchmark data sets (of up to over a million points) give evidence that the proposed algorithm is more effective at solving the clustering problem than a number of popular clustering algorithms.

19.
The k-nearest neighbours (kNN) methods have been used successfully in many countries for the production of spatially comprehensive raster databases of forest attributes, made from the combination of National Forest Inventory (NFI) and satellite data. In Sweden, country-wide kNN estimates of forest variables have been produced to represent the forest condition in the years 2000 and 2005 by using a combination of Système Pour l'Observation de la Terre 5 (SPOT 5) satellite data and field data from the Swedish NFI. The resulting products are wall-to-wall raster maps with estimates of total stem volume, stem volume per tree species, tree height, and stand age at a pixel resolution of 25 × 25 m². However, probability-based kNN stem volume estimates tend to have a suppressed variation range, as large values are usually underestimated and small values are overestimated. One way to handle this problem is to calibrate the kNN stem volume estimates to the reference distribution of stem volume observations by histogram matching (HM) for a defined geographic area.

In this study, we have tested HM for the calibration of kNN total stem volume raster maps to the reference distribution captured by a forest inventory (FI) from 106 stands in Strömsjöliden, in the north of Sweden. The available field FI data set comprises 1084 circular plots, divided into a reference data set and an evaluation data set of total stem volume observations. The reference data set was used for the creation of a cumulative frequency histogram of total stem volume and the evaluation data set was used to assess the accuracy of volume estimates, before and after HM. The HM adjusted the cumulative distribution of the kNN data set to the distribution of the reference observations and resulted in a distribution of kNN estimates of total stem volume, which corresponded closely to that of the evaluation data set. The results show that the variation range of the kNN stem volume estimates can be extended by HM both on the pixel and stand levels. The extension of the range of estimates towards the range provided by the field observations allows improvement of kNN volume estimation for use in forest management planning based on stand-level analysis, given that the reference stem volume distribution can be estimated accurately, for example, using field data from NFI.
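
The calibration idea can be sketched as empirical quantile mapping: each estimate is replaced by the reference value found at the same cumulative frequency. This is a minimal illustration with made-up numbers, assuming simple empirical distributions; the function name is hypothetical and the study's actual binning and interpolation choices are not given in the abstract.

```python
import math
from bisect import bisect_right


def histogram_match(estimates, reference):
    """Map each estimate to the reference value at the same cumulative frequency."""
    sorted_est = sorted(estimates)
    sorted_ref = sorted(reference)
    n_est, n_ref = len(sorted_est), len(sorted_ref)
    matched = []
    for value in estimates:
        # Empirical cumulative frequency of this estimate, in (0, 1].
        quantile = bisect_right(sorted_est, value) / n_est
        # Position of the same cumulative frequency in the reference distribution.
        idx = min(n_ref - 1, max(0, math.ceil(quantile * n_ref) - 1))
        matched.append(sorted_ref[idx])
    return matched


# Toy stem volumes (arbitrary units): the estimates span a narrower range than the reference.
knn_estimates = [100, 120, 140, 160]
field_reference = [20, 90, 200, 400]
calibrated = histogram_match(knn_estimates, field_reference)
```

The mapping keeps the rank order of the estimates while stretching their range towards that of the reference observations, which is the effect the abstract reports.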

20.
Given a multidimensional point q, a reverse k nearest neighbor (RkNN) query retrieves all the data points that have q as one of their k nearest neighbors. Existing methods for processing such queries have at least one of the following deficiencies: they (i) do not support arbitrary values of k, (ii) cannot deal efficiently with database updates, (iii) are applicable only to 2D data but not to higher dimensionality, and (iv) retrieve only approximate results. Motivated by these shortcomings, we develop algorithms for exact RkNN processing with arbitrary values of k on dynamic, multidimensional datasets. Our methods utilize a conventional data-partitioning index on the dataset and do not require any pre-computation. As a second step, we extend the proposed techniques to continuous RkNN search, which returns the RkNN results for every point on a line segment. We evaluate the effectiveness of our algorithms with extensive experiments using both real and synthetic datasets.
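
The query definition can be made concrete with a brute-force sketch: a data point p belongs to the result when fewer than k other data points are strictly closer to p than q is. This quadratic scan only illustrates the semantics and is not the index-based method of the paper; the function name and sample data are hypothetical.

```python
import math


def reverse_knn(data, q, k):
    """Return the data points that have q among their k nearest neighbors (brute force)."""
    result = []
    for p in data:
        dist_to_q = math.dist(p, q)
        # Count the other data points that lie strictly closer to p than q does.
        closer = sum(1 for o in data if o is not p and math.dist(p, o) < dist_to_q)
        if closer < k:
            result.append(p)
    return result


data = [(0, 0), (1, 0), (5, 0), (6, 0)]
q = (2, 0)
r1 = reverse_knn(data, q, k=1)
r2 = reverse_knn(data, q, k=2)
```

With k = 1 only (1, 0) qualifies, because every other point has a data point closer than q; with k = 2 all four points qualify.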

