首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.

Recently, big data are widely noticed in many fields like machine learning, pattern recognition, medical, financial, and transportation fields. Data analysis is crucial to converting data into more specific information fed to the decision-making systems. With the diverse and complex types of datasets, knowledge discovery becomes more difficult. One solution is to use feature subset selection preprocessing that reduces this complexity, so the computation and analysis become convenient. Preprocessing produces a reliable and suitable source for any data-mining algorithm. The effective features’ selection can improve a model’s performance and help us understand the characteristics and underlying structure of complex data. This study introduces a novel hybrid feature selection cloud-based model for imbalanced data based on the k nearest neighbor algorithm. The proposed model showed good performance compared with the simple weighted nearest neighbor. The proposed model combines the firefly distance metric and the Euclidean distance used in the k nearest neighbor. The experimental results showed good insights in both time usage and feature weights compared with the weighted nearest neighbor. It also showed improvement in the classification accuracy by 12% compared with the weighted nearest neighbor algorithm. And using the cloud-distributed model reduced the processing time up to 30%, which is deliberated to be substantial compared with the recent state-of-the-art methods.

  相似文献   

2.
k nearest neighbor (kNN) is an effective and powerful lazy learning algorithm, notwithstanding its easy-to-implement. However, its performance heavily relies on the quality of training data. Due to many complex real-applications, noises coming from various possible sources are often prevalent in large scale databases. How to eliminate anomalies and improve the quality of data is still a challenge. To alleviate this problem, in this paper we propose a new anomaly removal and learning algorithm under the framework of kNN. The primary characteristic of our method is that the evidence of removing anomalies and predicting class labels of unseen instances is mutual nearest neighbors, rather than k nearest neighbors. The advantage is that pseudo nearest neighbors can be identified and will not be taken into account during the prediction process. Consequently, the final learning result is more creditable. An extensive comparative experimental analysis carried out on UCI datasets provided empirical evidence of the effectiveness of the proposed method for enhancing the performance of the k-NN rule.  相似文献   

3.
聚类是一种无监督的机器学习方法,其任务是发现数据中的自然簇。共享最近邻聚类算法(SNN)在处理大小不同、形状不同以及密度不同的数据集上具有很好的聚类效果,但该算法还存在以下不足:(1)时间复杂度为O(n2),不适合处理大规模数据集;(2)没有明确给出参数阈值的简单指导性操作方法;(3)只能处理数值型属性数据集。对共享最近邻算法进行改进,使其能够处理混合属性数据集,并给出参数阈值的简单选择方法,改进后算法运行时间与数据集大小成近似线性关系,适用于大规模高维数据集。在真实数据集和人造数据集上的实验结果表明,提出的改进算法是有效可行的。  相似文献   

4.
It is very expensive and time-consuming to annotate huge amounts of data. Active learning would be a suitable approach to minimize the effort of annotation. A novel active learning approach, coupled K nearest neighbor pseudo pruning (CKNNPP), is proposed in the paper, which is based on querying examples by KNNPP method. The KNNPP method applies k nearest neighbor technique to search for k neighbor samples from labeled samples of unlabeled samples. When k labeled samples are not belong to the same class, the corresponded unlabeled sample is queried and given its right label by supervisor, and then it is added to labeled training set. In contrast with the previous depiction, the unlabeled sample is not selected and pruned, that is the pseudo pruning. This definition is enlightened from the K nearest neighbor pruning preprocessing. These samples selected by KNNPP are considered to be near or on the optimal classification hyperplane that is crucial for active learning. Especially, in order to avoid the excursion of the optimal classification hyperplane after adding a queried sample, CKNNPP method is proposed finally that two samples with different class label (like a couple, annotated by supervisor) are queried by KNNPP and added in the training set simultaneously for updating training set in each iteration. The CKNNPP can provide a good performance, and especially it is simple, effective, and robust, and can solve the classification problem with unbalanced dataset compared with the existing methods. Then, the computational complexity of CKNNPP is analyzed. Additionally, a new stopping criterion is applied in the proposed method, and the classifier is implemented by Lagrangian Support Vector Machines in iterations of active learning. Finally, twelve UCI datasets, image datasets of aircrafts, and the dataset of radar high-resolution range profile are used to validate the feasibility and effectiveness of the proposed method. The results illuminate that CKNNPP gains superior performance compared with the other seven state-of-the-art active learning approaches.  相似文献   

5.
This article proposes an optimized instance-based learning approach for prediction of the compressive strength of high performance concrete based on mix data, such as water to binder ratio, water content, super-plasticizer content, fly ash content, etc. The base algorithm used in this study is the k nearest neighbor algorithm, which is an instance-based machine leaning algorithm. Five different models were developed and analyzed to investigate the effects of the number of neighbors, the distance function and the attribute weights on the performance of the models. For each model a modified version of the differential evolution algorithm was used to find the optimal model parameters. Moreover, two different models based on generalized regression neural network and stepwise regressions were also developed. The performances of the models were evaluated using a set of high strength concrete mix data. The results of this study indicate that the optimized models outperform those derived from the standard k nearest neighbor algorithm, and that the proposed models have a better performance in comparison to generalized regression neural network, stepwise regression and modular neural networks models.  相似文献   

6.
赵鹏  王友仁  崔江  罗慧 《信息与控制》2010,39(5):574-580
提出了一种基于免疫记忆网络理论与$k$近邻算法的模拟电路故障诊断方法。首先,利用免疫记忆网络寻找各故障空间的最佳记忆抗体。在免疫记忆网络中根据浓度来选择记忆抗体,以促进记忆抗体在各故障空间的均匀分布。利用克隆和超级变异机制来保证抗体多样性,再利用浓度和期望值对抗体进行促进和抑制,以避免早熟现象的产生;然后,根据所得到的各故障空间的最佳记忆抗体,使用改进的阈值k近邻算法对抗原进行故障分类;最后,以带通滤波器为诊断实例,利用实际电路测试数据和仿真数据作为测试样本进行故障诊断性能评估;实验结果证明该故障诊断方法具有较高的故障诊断率。  相似文献   

7.
This study investigates stock market indices prediction that is an interesting and important research in the areas of investment and applications, as it can get more profits and returns at lower risk rate with effective exchange strategies. To realize accurate prediction, various methods have been tried, among which the machine learning methods have drawn attention and been developed. In this paper, we propose a basic hybridized framework of the feature weighted support vector machine as well as feature weighted K-nearest neighbor to effectively predict stock market indices. We first establish a detailed theory of feature weighted SVM for the data classification assigning different weights for different features with respect to the classification importance. Then, to get the weights, we estimate the importance of each feature by computing the information gain. Lastly, we use feature weighted K-nearest neighbor to predict future stock market indices by computing k weighted nearest neighbors from the historical dataset. Experiment results on two well known Chinese stock market indices like Shanghai and Shenzhen stock exchange indices are finally presented to test the performance of our established model. With our proposed model, it can achieve a better prediction capability to Shanghai Stock Exchange Composite Index and Shenzhen Stock Exchange Component Index in the short, medium and long term respectively. The proposed algorithm can also be adapted to other stock market indices prediction.  相似文献   

8.
针对现存的基于EM (Expectation maximization)迭代的无指导词义消歧方法收敛缓慢、计算量大的问题, 利用互信息和Z-测试结合的方法选取特征, 并通过一种 统计学习算法估算初始参数值. 实验结果表明改进方法有效地提高了汉语词义消歧的准确率, 具有良好的扩展性和实用性.  相似文献   

9.
A probabilistic active support vector learning algorithm   总被引:3,自引:0,他引:3  
The paper describes a probabilistic active learning strategy for support vector machine (SVM) design in large data applications. The learning strategy is motivated by the statistical query model. While most existing methods of active SVM learning query for points based on their proximity to the current separating hyperplane, the proposed method queries for a set of points according to a distribution as determined by the current separating hyperplane and a newly defined concept of an adaptive confidence factor. This enables the algorithm to have more robust and efficient learning capabilities. The confidence factor is estimated from local information using the k nearest neighbor principle. The effectiveness of the method is demonstrated on real-life data sets both in terms of generalization performance, query complexity, and training time.  相似文献   

10.
In this paper, a Bayesian robust linear dynamic system approach is proposed for process modeling. Traditional linear dynamic system (LDS) constructed with Kalman filter is designed by Gaussian assumption which can be easily violated in non-Gaussian modeling situations, especially those with outliers. To deal with this issue, the conventional Gaussian-based Kalman filter is modified with heavy tailed Student's t-distribution so as to deal with the non-Gaussian noise and modeling outliers. Then, a variational Bayesian expectation maximization (VBEM) algorithm is developed for learning parameters of the robust linear dynamic system. For process monitoring, traditional monitoring scheme are discussed and the residual space monitoring mechanism has been improved. To explore the feasibility and effectiveness, the proposed method is applied for fault detection, with detailed comparative studies with several other methods through the Tennessee Eastman benchmark.  相似文献   

11.
The k nearest neighbor is a lazy learning algorithm that is inefficient in the classification phase because it needs to compare the query sample with all training samples. A template reduction method is recently proposed that uses only samples near the decision boundary for classification and removes those far from the decision boundary. However, when class distributions overlap, more border samples are retrained and it leads to inefficient performance in the classification phase. Because the number of reduced samples are limited, using an appropriate feature reduction method seems a logical choice to improve classification time. This paper proposes a new prototype reduction method for the k nearest neighbor algorithm, and it is based on template reduction and ViSOM. The potential property of ViSOM is displaying the topology of data on a two-dimensional feature map, it provides an intuitive way for users to observe and analyze data. An efficient classification framework is then presented, which combines the feature reduction method and the prototype selection algorithm. It needs a very small data size for classification while keeping recognition rate. In the experiments, both of synthetic and real datasets are used to evaluate the performance. Experimental results demonstrate that the proposed method obtains above 70 % speedup ratio and 90 % compression ratio while maintaining similar performance to kNN.  相似文献   

12.
在数据挖掘以及机器学习等领域,都需要涉及一个数据预处理过程,以消除数据中所包含的错误、噪声、不一致数据或缺失值。其中,缺失值的填充是一个非常具有挑战性的任务,因为填充效果的好坏会极大的影响学习算法及挖掘算法的后续处理过程。目前已有的一些填充算法,如基于粗糙集的和基于最近邻法的算法等,在一定程度上能够处理缺失值问题。与以上方法不同,提出了一种扩展的基于信息增益的缺失值填充算法,它充分利用数据集中各属性之间隐含的关系对缺失的数据进行填充。大量的实验表明,提出的扩展的基于信息增益的缺失值填充算法是有效的。  相似文献   

13.
赵海峰  余强  曹俞旦 《计算机科学》2014,41(12):160-163
多标签学习用于处理一个样本同时拥有多个标签的问题。已有的多标签懒惰学习算法IMLLA未充分考虑样本分布的特点,即在构建样本的近邻点集时,近邻点个数取固定值,这可能会将相似度高的点排除在近邻集之外,或者将相似度低的点包括在近邻集内,影响分类方法的性能。针对IMLLA的缺陷,将粒计算的思想加入近邻集的构建,提出一种基于粒计算的多标签懒惰学习算法(GMLLA)。该方法通过粒度控制,确定样本近邻点集,使得近邻集内的样本具有高相似度。实验结果表明,本算法的性能优于IMLLA。  相似文献   

14.
为解决传统核极限学习机算法参数优化困难的问题,提高分类准确度,提出一种改进贝叶斯优化的核极限学习机算法.用樽海鞘群设计贝叶斯优化框架中获取函数的下置信界策略,提高算法的局部搜索能力和寻优能力;用这种改进的贝叶斯优化算法对核极限学习机的参数进行寻优,用最优参数构造核极限学习机分类器.在UCI真实数据集上进行仿真实验,实验...  相似文献   

15.
This paper reports a comparative study of two machine learning methods on Arabic text categorization. Based on a collection of news articles as a training set, and another set of news articles as a testing set, we evaluated K nearest neighbor (KNN) algorithm, and support vector machines (SVM) algorithm. We used the full word features and considered the tf.idf as the weighting method for feature selection, and CHI statistics as a ranking metric. Experiments showed that both methods were of superior performance on the test corpus while SVM showed a better micro average F1 and prediction time.  相似文献   

16.
曹路 《计算机科学》2016,43(12):97-100
传统的支持向量机在处理不平衡数据时效果不佳。为了提高少类样本的识别精度,提出了一种基于支持向量的上采样方法。首先根据K近邻的思想清除原始数据集中的噪声;然后用支持向量机对训练集进行学习以获得支持向量,进一步对少类样本的每一个支持向量添加服从一定规律的噪声,增加少数类样本的数目以获得相对平衡的数据集;最后将获得的新数据集用支持向量机学习。实验结果显示,该方法在人工数据集和UCI标准数据集上均是有效的。  相似文献   

17.
现实世界中存在着大量无类标的数据,如各种医疗图像数据、网页数据等。在大数据时代,这种情况更加突出。标注这些无类标的数据需要付出巨大的代价。主动学习是解决这一问题的有效手段,也是近几年机器学习和数据挖掘领域中的一个研究热点。提出了一种基于在线序列极限学习机的主动学习算法,该算法利用在线序列极限学习机增量学习的特点,可显著提高学习系统的效率。另外,该算法用样例熵作为启发式度量无类标样例的重要性,用K-近邻分类器作为Oracle标注选出的无类标样例的类别。实验结果显示,提出的算法具有学习速度快、标注准确的特点。  相似文献   

18.
基于改进SEM算法的基因调控网络构建方法*   总被引:1,自引:0,他引:1  
动态贝叶斯网络(DBN)是基因调控网络的一种有力建模工具。贝叶斯结构期望最大算法(SEM)能较好地处理构建基因调控网络中数据缺失的情况,但SEM算法学习的结果对初始参数设置依赖性强。针对此问题,提出一种改进的SEM算法,通过随机生成一些候选初始值,在经过一次迭代后得到的参数中选择一个最好的初始值作为模型的初始参数值,然后执行基本的SEM算法。利用啤酒酵母细胞周期微阵列表达数据,构建其基因调控网络并与现有文献比较,结果显示该算法进一步提高了调控网络构建的精度。  相似文献   

19.
Supervised clustering is a new research area that aims to improve unsupervised clustering algorithms exploiting supervised information. Today, there are several clustering algorithms, but the effective supervised cluster adjustment method which is able to adjust the resulting clusters, regardless of applied clustering algorithm has not been presented yet. In this paper, we propose a new supervised cluster adjustment method which can be applied to any clustering algorithm. Since the adjustment method is based on finding the nearest neighbors, a novel exact nearest neighbor search algorithm is also introduced which is significantly faster than the classic one. Several datasets and clustering evaluation metrics are employed to examine the effectiveness of the proposed cluster adjustment method and the proposed fast exact nearest neighbor algorithm comprehensively. The experimental results show that the proposed algorithms are significantly effective in improving clusters and accelerating nearest neighbor searches.  相似文献   

20.
《国际计算机数学杂志》2012,89(12):2423-2440
ABSTRACT

Bayesian network is an effective representation tool to describe the uncertainty of the knowledge in artificial intelligence. One important method to learning Bayesian network from data is to employ a search procedure to explore the space of networks and a scoring metric to evaluate each candidate structure. In this paper, a novel discrete particle swarm optimization algorithm has been designed to solve the problem of Bayesian network structures learning. The proposed algorithm not only maintains the search advantages of the classical particle swarm optimization but also matches the characteristics of Bayesian networks. Meanwhile, mutation and neighbor searching operators have been used to overcome the drawback of premature convergence and balance the exploration and exploitation abilities of the particle swarm optimization. The experimental results on benchmark networks illustrate the feasibility and effectiveness of the proposed algorithm, and the comparative experiments indicate that our algorithm is highly competitive compared to other algorithms.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号