Similar literature
1.
Fast agglomerative clustering using a k-nearest neighbor graph (cited by: 1; self-citations: 0; citations by others: 1)
We propose a fast agglomerative clustering method that uses an approximate nearest neighbor graph to reduce the number of distance calculations. The time complexity of the algorithm improves from O(τN²) to O(τN log N) at the cost of a slight increase in distortion; here, τ denotes the number of nearest neighbor updates required at each iteration. According to the experiments, a relatively small neighborhood size is sufficient to keep the quality close to that of the full search.

2.
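An improved fast k-nn classification algorithm (cited by: 1; self-citations: 0; citations by others: 1)
To address the shortcomings of the traditional k-nearest neighbor (k-nn) method, K-means from clustering and the k-nearest neighbor algorithm from classification are combined into an improved fast k-nn classification algorithm. Experiments show that the algorithm achieves fast classification with little effect on classification performance.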

3.
The nearest neighbor (NN) rule is one of the simplest and most important methods in pattern recognition. In this paper, we propose a kernel difference-weighted k-nearest neighbor (KDF-KNN) method for pattern classification. The proposed method defines the weighted KNN rule as a constrained optimization problem, and we then propose an efficient solution for computing the weights of the different nearest neighbors. Unlike traditional distance-weighted KNN, which assigns weights to the nearest neighbors according to their distance to the unclassified sample, difference-weighted KNN weights the nearest neighbors using the correlation of the differences between the unclassified sample and its nearest neighbors. To take effective nonlinear structure information into account, we further extend difference-weighted KNN to its kernel version, KDF-KNN. Our experimental results indicate that KDF-KNN is much better than the original KNN and the distance-weighted KNN methods, and is comparable to or better than several state-of-the-art methods in terms of classification accuracy.

4.
Learning from high-dimensional data is usually quite challenging, as captured by the well-known phrase "curse of dimensionality". Data analysis often involves measuring the similarity between different examples. This sometimes becomes a problem, as many widely used metrics tend to concentrate in high-dimensional feature spaces. The reduced contrast makes it more difficult to distinguish between close and distant points, which renders many traditional distance-based learning methods ineffective. Secondary distances based on shared neighbor similarities have recently been proposed as one possible solution to this problem. However, these initial metrics failed to take hubness into account. Hubness is a recently described aspect of the dimensionality curse, and it affects all sorts of $k$-nearest neighbor learning methods in severely negative ways. This paper is the first to discuss the impact of hubs on forming the shared neighbor similarity scores. We propose a novel, hubness-aware secondary similarity measure $simhub_s$, and an extensive experimental evaluation shows it to be much more appropriate for high-dimensional data classification than the standard $simcos_s$ measure. The proposed similarity changes the underlying $k$NN graph in such a way that it reduces the overall frequency of label mismatches in $k$-neighbor sets and increases the purity of occurrence profiles, which improves classifier performance. It is a hybrid measure, which takes into account both the supervised and the unsupervised hubness information. The analysis shows that both components are useful in their own ways and that the measure is therefore properly defined. This new similarity does not increase the overall computational cost, and the improvement is essentially 'free'.
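
To make the notion of a shared neighbor similarity concrete, here is a minimal sketch of the unweighted baseline: the overlap between two points' k-nearest-neighbor sets divided by k. This is an assumed reading of the standard shared-neighbor score, not the hubness-aware $simhub_s$ measure from the abstract; the function names and the toy data are hypothetical.

```python
import math

def knn_sets(points, k):
    """Return, for each point index, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbors.append(set(others[:k]))
    return neighbors

def shared_neighbor_similarity(neighbors, i, j, k):
    """Fraction of the k neighbors that points i and j have in common."""
    return len(neighbors[i] & neighbors[j]) / k

# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
nbrs = knn_sets(points, k=2)

same_group = shared_neighbor_similarity(nbrs, 0, 1, k=2)   # points share one neighbor
cross_group = shared_neighbor_similarity(nbrs, 0, 3, k=2)  # no neighbors in common
```

Points in the same group share part of their neighbor sets, while points in different groups share none, which is the contrast a secondary similarity is meant to recover when primary distances concentrate.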

5.
k nearest neighbor (kNN) is an effective and powerful lazy learning algorithm that is also easy to implement. However, its performance relies heavily on the quality of the training data. In many complex real-world applications, noise from various sources is often prevalent in large-scale databases. How to eliminate anomalies and improve data quality remains a challenge. To alleviate this problem, we propose a new anomaly removal and learning algorithm under the framework of kNN. The primary characteristic of our method is that the evidence for removing anomalies and predicting the class labels of unseen instances comes from mutual nearest neighbors rather than k nearest neighbors. The advantage is that pseudo nearest neighbors can be identified and excluded from the prediction process. Consequently, the final learning result is more credible. An extensive comparative experimental analysis carried out on UCI datasets provides empirical evidence of the effectiveness of the proposed method in enhancing the performance of the k-NN rule.
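
The idea of mutual nearest neighbors can be sketched as follows: point j counts as a mutual neighbor of point i only if each is among the other's k nearest neighbors. The code below is an illustrative brute-force version under that assumption; treating points with no mutual neighbors as anomaly candidates is our simplification, and the helper names and data are hypothetical rather than taken from the paper.

```python
import math

def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def mutual_neighbors(points, i, k):
    """Neighbors j of i such that i is also among the k nearest neighbors of j."""
    return [j for j in k_nearest(points, i, k) if i in k_nearest(points, j, k)]

# Toy data (hypothetical): three clustered points and one isolated point (index 3).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]

# Assumption: a point without any mutual neighbor is flagged as an anomaly candidate.
isolated = [i for i in range(len(points)) if not mutual_neighbors(points, i, k=2)]
```

The isolated point still has two nearest neighbors in the ordinary sense, but neither of them lists it among their own, so it ends up with no mutual neighbors.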

6.
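Because the k-nearest neighbor (kNN) method does not handle class-imbalance problems well, a new k-nearest neighbor classification algorithm for imbalanced classes is proposed. Unlike the traditional k-nearest neighbor method, in the learning phase the algorithm first uses a partitioning algorithm (such as K-Means) to divide the majority-class data set into several clusters, then merges each cluster with the minority-class data set to form a new training set used to train one k-nearest neighbor model; that is, the algorithm builds a classifier library containing multiple k-nearest neighbor models. In the prediction phase, a partitioning algorithm (such as K-Means) is used to select one model from the classifier library to predict the class of a sample. In this way, the proposed algorithm ensures that the k-nearest neighbor model both captures local features of the data and takes full account of the effect of class imbalance on classifier performance. The algorithm also improves the prediction efficiency of k-nearest neighbor. To further improve its performance, the synthetic minority over-sampling technique (SMOTE) is applied to the algorithm. Experimental results on KEEL data sets show that, even when the majority-class data set is divided by a random partitioning strategy, the proposed algorithm effectively improves the generalization performance of the k-nearest neighbor method on the evaluation metrics recall, g-mean, f-measure, and AUC; in addition, over-sampling further improves the algorithm's performance on imbalanced-class problems, making it clearly superior to other advanced methods for handling imbalanced classes.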
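
The two phases described above can be sketched roughly as follows. The sketch assumes the majority class has already been partitioned (by K-Means or any other scheme) and that the model is selected by nearest cluster centroid; both are our assumptions for illustration, and SMOTE is omitted. All names and data are hypothetical.

```python
import math
from collections import Counter

def build_library(majority_clusters, minority):
    """One training set per majority cluster: that cluster plus all minority samples."""
    library = []
    for cluster in majority_clusters:
        # Centroid of the majority cluster, used later to select a model.
        centroid = tuple(sum(c) / len(cluster) for c in zip(*(x for x, _ in cluster)))
        library.append((centroid, cluster + minority))
    return library

def predict(library, query, k):
    """Pick the model with the nearest centroid, then vote among its k nearest samples."""
    _, train = min(library, key=lambda entry: math.dist(entry[0], query))
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Majority class "neg" pre-split into two clusters; minority class "pos" (toy data).
clusters = [
    [((0.0, 0.0), "neg"), ((0.0, 1.0), "neg"), ((1.0, 0.0), "neg")],
    [((9.0, 9.0), "neg"), ((9.0, 8.0), "neg"), ((8.0, 9.0), "neg")],
]
minority = [((4.0, 4.0), "pos"), ((5.0, 5.0), "pos")]
library = build_library(clusters, minority)

label_mid = predict(library, (4.2, 4.2), k=3)   # near the minority samples
label_low = predict(library, (0.2, 0.2), k=3)   # inside the first majority cluster
```

Because each model sees only one majority cluster alongside the full minority class, every training set is smaller and less imbalanced than the original data, which is the effect the abstract attributes to the classifier library.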

7.
Traditional fast k-nearest neighbor search algorithms based on pyramid structures need either a large amount of extra memory or a long search time. This paper proposes a fast k-nearest neighbor search algorithm based on the wavelet transform, which exploits the information hidden in the transform coefficients to reduce computational complexity. The study indicates that the Haar wavelet transform yields two kinds of useful pyramids. Two elimination criteria derived from the transform coefficients are used to reject impossible candidates. Experimental results on texture classification verify the effectiveness of the proposed algorithm.

8.
9.
In evolutionary algorithms, one of the main issues is how to reduce the number of fitness evaluations required to obtain optimal solutions. Generally, a large number of evaluations is needed to find optimal solutions, which increases computation time, and each fitness evaluation may itself be expensive. Differential evolution (DE), which is widely used in many applications because of its simplicity and good performance, cannot escape this problem either. To address it, a fitness approximation model has been proposed that replaces the real fitness function for evaluation. For good performance, a fitness approximation must estimate accurate values with a compact structure. In this paper, we therefore propose an efficient differential evolution that uses a fitness estimator. We choose k-nearest neighbor (kNN) as the fitness estimator because it does not need a training period or complex computation. However, too many training samples in the estimator may make the computational cost exponentially high. Accordingly, two schemes addressing accuracy and efficiency are proposed to improve the estimator. The proposed algorithm is tested on various benchmark functions and is shown to find good solutions with fewer fitness evaluations and a more compact size than DE and DE-kNN.

10.
In many pattern classification problems, an estimate of the posterior probabilities (rather than only a classification) is required. This is usually the case when some confidence measure in the classification is needed. In this article, we propose a new posterior probability estimator. The proposed estimator considers the K-nearest neighbors. It attaches a weight to each neighbor that contributes in an additive fashion to the posterior probability estimate. The weights corresponding to the K-nearest-neighbors (which add to 1) are estimated from the data using a maximum likelihood approach. Simulation studies confirm the effectiveness of the proposed estimator.
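
The additive structure of such an estimator can be illustrated with a short sketch: each of the K nearest neighbors carries a rank-specific weight, and the posterior estimate for a class is the sum of the weights of the neighbors with that label. The weights below are fixed, hypothetical values chosen to sum to 1; the article instead estimates them by maximum likelihood, which this sketch does not attempt.

```python
import math

def knn_posterior(train, query, weights):
    """Posterior estimate: sum the rank weights of the neighbors in each class.

    train   -- list of (feature_vector, label) pairs
    weights -- one weight per neighbor rank, assumed to sum to 1
    """
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    posterior = {}
    # zip stops after len(weights) neighbors, so K is implied by the weight list.
    for w, (_, label) in zip(weights, ranked):
        posterior[label] = posterior.get(label, 0.0) + w
    return posterior

train = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "b"), ((5.0,), "b")]
# Hypothetical fixed weights for K = 3 (not fitted from data).
post = knn_posterior(train, query=(0.9,), weights=[0.5, 0.3, 0.2])
```

Here the two closest neighbors belong to class "a" and the third to class "b", so the estimate puts most of the probability mass on "a" while still reporting a nonzero value for "b".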

11.
Wu, Wei; Parampalli, Udaya; Liu, Jian; Xian, Ming. World Wide Web, 2019, 22(1): 101-123
To utilize the cost-saving advantages of the cloud computing paradigm, individuals and enterprises increasingly resort to outsource their databases and data operations to cloud...

12.
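The traditional k-nearest neighbor algorithm (kNN) is widely used in data classification, but it performs many redundant computations, so processing high-dimensional data takes considerable time. Meanwhile, in the classification algorithms based on landmark-point spectral clustering (LC-kNN and RC-kNN), some of the nearest neighbors of the current test point are missing, which lowers their accuracy. To address these problems, a clustering-based annular k-nearest neighbor algorithm is proposed. Building on a clustering algorithm, the proposed algorithm first groups highly similar data points in the training set into a cluster, then sets an annular filter centered on the current test point, and finally classifies the points in the filter with the kNN algorithm; the clustering algorithm can be chosen freely according to the situation at hand. The algorithm was tested on 6 public data sets from the UCI repository. The results show that, compared with kNN, the AkNN_E and AkNN_H algorithms reduce the amount of computation by 51% on average, and that their accuracy is 3% higher on average than that of LC-kNN and RC-kNN. In addition, the algorithm remains effective when the data has 10,000 dimensions.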

13.
Bax, Eric; Weng, Lingjie; Tian, Xu. Machine Learning, 2019, 108(12): 2087-2111
We introduce the speculate-correct method to derive error bounds for local classifiers. Using it, we show that k-nearest neighbor classifiers, in spite of their famously...

14.
As Geographic Information Systems (GIS) technologies have evolved, more and more GIS applications and geospatial data have become available on the web. Spatial objects in a given query range can be retrieved using a spatial range query, one of the most widely used query types in GIS and spatial databases. However, it can be challenging to retrieve these data from web applications where access to the data is only possible through restrictive web interfaces that support certain types of queries. A typical scenario is the existence of numerous business web sites that provide their branch locations through a limited “nearest location” web interface. For example, a chain restaurant’s web site such as McDonalds can be queried to find some of the branches closest to the user’s home address. However, even though the site has the location data of all restaurants in, for example, the state of California, it is difficult to retrieve the entire data set efficiently because of the restrictive web interface. Considering that k-Nearest Neighbor (k-NN) search is one of the most popular web interfaces for accessing spatial data on the web, this paper investigates the problem of retrieving geospatial data from the web for a given spatial range query using only k-NN searches. Based on a classification of k-NN interfaces on the web, we propose a set of range query algorithms that completely cover the rectangular query range (completeness) while keeping the number of k-NN searches as small as possible (efficiency). We evaluated the efficiency of the proposed algorithms through statistical analysis and empirical experiments using both synthetic and real data sets.

Wan D. Bae is currently an assistant professor in the Mathematics, Statistics and Computer Science Department at the University of Wisconsin-Stout. She received her Ph.D. in Computer Science from the University of Denver in 2007. Dr. Bae’s current research interests include online query processing, Geographic Information Systems, digital mapping, multidimensional data analysis and data mining in spatial and spatiotemporal databases.
Shayma Alkobaisi is currently an assistant professor at the College of Information Technology in the United Arab Emirates University. She received her Ph.D. in Computer Science from the University of Denver in 2008. Dr. Alkobaisi’s research interests include uncertainty management in spatiotemporal databases, online query processing in spatial databases, Geographic Information Systems and computational geometry.
Seon Ho Kim is currently an associate professor in the Computer Science & Information Technology Department at the University of District of Columbia. He received his Ph.D. in Computer Science from the University of Southern California in 1999. Dr. Kim’s primary research interests include design and implementation of multimedia storage systems, and databases, spatiotemporal databases, and GIS. He co-chaired the 2004 ACM Workshop on Next Generation Residential Broadband Challenges in conjunction with the ACM Multimedia Conference.
Sada Narayanappa is currently an advanced computing technologist at Jeppesen. He received his Ph.D. in Mathematics and Computer Science from the University of Denver in 2006. Dr. Narayanappa’s primary research interests include computational geometry, graph theory, algorithms, design and implementation of databases.
Cyrus Shahabi is currently an Associate Professor and the Director of the Information Laboratory (InfoLAB) at the Computer Science Department and also a Research Area Director at the NSF’s Integrated Media Systems Center (IMSC) at the University of Southern California. He received his Ph.D. degree in Computer Science from the University of Southern California in August 1996. Dr. Shahabi’s current research interests include Peer-to-Peer Systems, Streaming Architectures, Geospatial Data Integration and Multidimensional Data Analysis. He is currently on the editorial board of ACM Computers in Entertainment magazine. He is also serving on many conference program committees such as ICDE, SSTD, ACM SIGMOD, ACM GIS. Dr. Shahabi is the recipient of the 2002 National Science Foundation CAREER Award and 2003 Presidential Early Career Awards for Scientists and Engineers (PECASE). In 2001, he also received an award from the Okawa Foundation.

15.
In this paper, we propose a novel approach to optical character recognition (OCR) for real environments, such as gas meters and electricity meters, where the amount of noise is sometimes as large as the amount of useful signal. Our method combines two algorithms: an artificial neural network, with k-nearest neighbor as the confirmation algorithm. Unlike other OCR systems, our approach is based on the angles of the digits rather than on pixels. Advantages of the proposed system include insensitivity to possible rotations of the digits, the ability to work under different light and exposure conditions, and the ability to deduce and use heuristics for character recognition. The experimental results show that our method, with a moderate number of training epochs, achieves a high accuracy of 99.3% in recognizing the digits.

16.
17.
A partially specified nearest neighbor query is a nearest neighbor search in which only some of the possible keys are specified. An algorithm that uses k-d trees to perform such searches is described. The expected time complexity is O(N^(1-j/k)), where k is the total number of keys and j is the number of keys specified in the query. Experimental results, which are consistent with the theoretical predictions, are also presented.
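
To show what a partially specified query computes, the following sketch measures distance only over the keys that the query specifies and ignores the rest. It is a brute-force linear scan written for clarity, not the k-d tree algorithm the abstract describes; the function name, the dictionary-based query format, and the records are hypothetical.

```python
def partial_nearest(records, query):
    """Return the record closest to `query` on the specified keys only.

    records -- list of equal-length tuples (one value per key)
    query   -- dict mapping key index -> value; unspecified keys are ignored
    """
    def partial_dist_sq(record):
        # Squared Euclidean distance restricted to the specified key indices.
        return sum((record[i] - v) ** 2 for i, v in query.items())
    return min(records, key=partial_dist_sq)

# Three records with k = 3 keys; the query specifies j = 2 of them (keys 0 and 2).
records = [(1.0, 9.0, 3.0), (4.0, 2.0, 8.0), (6.0, 5.0, 1.0)]
best = partial_nearest(records, {0: 5.0, 2: 2.0})
```

The scan costs time linear in the number of records for every query, which is the baseline that a tree-based search with sublinear expected time is meant to improve on.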

18.
Efficient k-nearest neighbor search on moving object trajectories (cited by: 1; self-citations: 0; citations by others: 1)
With the growing number of mobile applications, data analysis on large sets of historical moving object trajectories becomes increasingly important. Nearest neighbor search is a fundamental problem in spatial and spatio-temporal databases. In this paper, we consider the following problem: given a set of moving object trajectories D and a query trajectory mq, find the k nearest neighbors to mq within D for any instant of time within the lifetime of mq. We assume D is indexed in a 3D R-tree and employ a filter-and-refine strategy. The filter step traverses the index and creates a stream of so-called units (linear pieces of a trajectory) as a superset of the units required to build the result of the query. The refinement step processes an ordered stream of units and determines the pieces of units forming the precise result. To support the filter step, a time-dependent coverage function C_p(t) is computed in preprocessing for each node p of the index; it gives the number of trajectories represented in p that are present at time t. Within the filter step, sophisticated data structures are used to keep track of the aggregated coverages of the nodes seen so far in the index traversal to enable pruning. Moreover, the R-tree index is built in a special way to obtain coverage functions that are effective for pruning. As a result, one obtains a highly efficient kNN algorithm for moving data and query points that outperforms the two competing algorithms by a wide margin. Implementations of the new algorithms and of the competing techniques are made available as well. The algorithms can be used in a system context including, for example, visualization and animation of results. The experiments of the paper can be easily checked or repeated, and new experiments can be performed.

19.
Recently, microarray technology has been widely used to study gene expression in cancer diagnosis. Its main distinguishing feature is that it can measure thousands of genes at the same time. In the past, researchers typically used parametric statistical methods to find the significant genes. However, microarray data often violate some of the assumptions of parametric statistical methods, or the type I error may be inflated. Therefore, our aim is to establish a gene selection method free of such assumptions to reduce the dimension of the data set. In our study, an adaptive genetic algorithm/k-nearest neighbor (AGA/KNN) method was used to evolve gene subsets. We find that AGA/KNN can reduce the dimension of the data set, and all test samples can be classified correctly. In addition, the accuracy of AGA/KNN is higher than that of GA/KNN, and it takes only half the CPU time of GA/KNN. With the proposed method, biologists can efficiently identify the relevant genes from the selected gene subset and classify the test samples correctly.

20.
The k-nearest neighbors (kNN) problem is to find the k nearest neighbors of a query point in a given data set. Among available methods, the principal axis search tree (PAT) algorithm performs well at finding the k nearest neighbors using the PAT structure and a node elimination criterion. In this paper, a novel kNN search algorithm is proposed. The proposed algorithm stores projection values for all data points in leaf nodes. If a leaf node in the PAT cannot be rejected by the node elimination criterion, the data points in the leaf node are further checked using their pre-stored projection values to reject more impossible data points. Experimental results show that the proposed method can effectively reduce the number of distance calculations and the computation time of the PAT algorithm, especially for data sets with a large dimension or for search trees with a large number of data points per leaf node.
