1.

Toward feature selection in big data preprocessing based on hybrid cloud-based model

Shehab Noha Badawy Mahmoud Ali H Arafat 《The Journal of supercomputing》2022,78(3):3226-3265

Big data have recently attracted wide attention in many fields, such as machine learning, pattern recognition, medicine, finance, and transportation. Data analysis is crucial to converting data into more specific information that is fed to decision-making systems. With diverse and complex types of datasets, knowledge discovery becomes more difficult. One solution is to use feature subset selection as a preprocessing step that reduces this complexity, so that computation and analysis become more tractable. Preprocessing produces a reliable and suitable source for any data-mining algorithm. Effective feature selection can improve a model's performance and help us understand the characteristics and underlying structure of complex data. This study introduces a novel hybrid cloud-based feature selection model for imbalanced data, based on the k nearest neighbor algorithm. The proposed model performed well compared with the simple weighted nearest neighbor. It combines the firefly distance metric with the Euclidean distance used in the k nearest neighbor algorithm. The experimental results compared favorably with the weighted nearest neighbor in both time usage and feature weights. The model also improved classification accuracy by 12% compared with the weighted nearest neighbor algorithm. In addition, using the cloud-distributed model reduced the processing time by up to 30%, which is considered substantial compared with recent state-of-the-art methods.

2.

Noisy data elimination using mutual k-nearest neighbor for classification mining

Huawen Liu Shichao Zhang 《Journal of Systems and Software》2012,85(5):1067-1074

k nearest neighbor (kNN) is an effective and powerful lazy learning algorithm that is also easy to implement. However, its performance relies heavily on the quality of the training data. In many complex real-world applications, noise from various possible sources is often prevalent in large-scale databases. How to eliminate anomalies and improve the quality of data is still a challenge. To alleviate this problem, in this paper we propose a new anomaly removal and learning algorithm under the framework of kNN. The primary characteristic of our method is that the evidence for removing anomalies and predicting the class labels of unseen instances comes from mutual nearest neighbors, rather than k nearest neighbors. The advantage is that pseudo nearest neighbors can be identified and will not be taken into account during the prediction process. Consequently, the final learning result is more credible. An extensive comparative experimental analysis carried out on UCI datasets provided empirical evidence of the effectiveness of the proposed method for enhancing the performance of the k-NN rule.
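
As a rough illustration of the mutual nearest neighbor idea in this abstract, the Python sketch below keeps a training point only if at least one of its k nearest neighbors also counts it among its own k nearest neighbors. The function names, the Euclidean distance, and the rule of dropping points with no mutual neighbor are assumptions made for illustration; the abstract does not give the paper's exact procedure.

```python
import math


def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i] (Euclidean), excluding i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])


def mutual_neighbors(points, k):
    """For each point, keep only neighbors that also list it among their own k nearest."""
    knn = [knn_indices(points, i, k) for i in range(len(points))]
    return [{j for j in knn[i] if i in knn[j]} for i in range(len(points))]


def remove_isolated(points, labels, k):
    """Drop points with no mutual nearest neighbor (assumed here to be anomalies)."""
    mutual = mutual_neighbors(points, k)
    keep = [i for i in range(len(points)) if mutual[i]]
    return [points[i] for i in keep], [labels[i] for i in keep]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10)]
    lbl = ["a", "a", "a", "b"]
    print(remove_isolated(pts, lbl, k=2))
```

In the small usage example, the far-away point (10, 10) has nearest neighbors, but none of them lists it in return, so it is the one that gets filtered out.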

3.

Improved shared nearest neighbor clustering algorithm

李霞 蒋盛益 《计算机工程与应用》2011,47(8):138-142

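Clustering is an unsupervised machine learning method whose task is to discover the natural clusters in data. The shared nearest neighbor (SNN) clustering algorithm clusters well on datasets with different sizes, shapes, and densities, but it has the following shortcomings: (1) its time complexity is O(n²), which makes it unsuitable for large-scale datasets; (2) it gives no simple, practical guidance for choosing parameter thresholds; (3) it can only handle datasets with numerical attributes. This paper improves the shared nearest neighbor algorithm so that it can handle mixed-attribute datasets, and gives a simple method for selecting parameter thresholds. The running time of the improved algorithm is approximately linear in the dataset size, which makes it suitable for large-scale, high-dimensional datasets. Experimental results on real and synthetic datasets show that the proposed improved algorithm is effective and feasible.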

4.

Active learning based on coupled KNN pseudo pruning

Lin Xiong L. C. Jiao Shasha Mao Li Zhang 《Neural computing & applications》2012,21(7):1669-1686

It is very expensive and time-consuming to annotate huge amounts of data. Active learning is a suitable approach to minimizing the annotation effort. This paper proposes a novel active learning approach, coupled K nearest neighbor pseudo pruning (CKNNPP), which queries examples using the KNNPP method. The KNNPP method applies the k nearest neighbor technique to find, for each unlabeled sample, its k nearest neighbors among the labeled samples. When these k labeled samples do not all belong to the same class, the corresponding unlabeled sample is queried, given its correct label by the supervisor, and then added to the labeled training set. Otherwise, the unlabeled sample is not selected and is pruned; this is the pseudo pruning. This definition is inspired by K nearest neighbor pruning preprocessing. The samples selected by KNNPP are considered to lie near or on the optimal classification hyperplane, which is crucial for active learning. In particular, to avoid shifting the optimal classification hyperplane after a queried sample is added, the CKNNPP method is finally proposed, in which two samples with different class labels (like a couple, annotated by the supervisor) are queried by KNNPP and added to the training set simultaneously to update the training set in each iteration. CKNNPP performs well; in particular, it is simple, effective, and robust, and, compared with existing methods, it can solve classification problems with unbalanced datasets. The computational complexity of CKNNPP is then analyzed. Additionally, a new stopping criterion is applied in the proposed method, and the classifier is implemented by Lagrangian Support Vector Machines in the iterations of active learning. Finally, twelve UCI datasets, image datasets of aircraft, and a dataset of radar high-resolution range profiles are used to validate the feasibility and effectiveness of the proposed method. The results show that CKNNPP achieves superior performance compared with seven other state-of-the-art active learning approaches.
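
The selection rule described in this abstract can be sketched in a few lines of Python: an unlabeled sample is chosen for querying when its k nearest labeled neighbors disagree on the class. This is only a minimal sketch of that single rule under assumed names (`k_nearest_labels`, `select_queries`) and an assumed Euclidean distance; the coupling step, the stopping criterion, and the Lagrangian SVM classifier are not shown.

```python
import math


def k_nearest_labels(x, labeled, k):
    """Labels of the k labeled samples closest to x; labeled is a list of (point, label)."""
    ranked = sorted(labeled, key=lambda item: math.dist(x, item[0]))
    return [label for _, label in ranked[:k]]


def select_queries(unlabeled, labeled, k):
    """Return the unlabeled samples whose k nearest labeled neighbors are not all one class."""
    return [x for x in unlabeled
            if len(set(k_nearest_labels(x, labeled, k))) > 1]


if __name__ == "__main__":
    labeled = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b")]
    unlabeled = [(0, 0.5), (2.5, 3), (5, 5.5)]
    print(select_queries(unlabeled, labeled, k=2))
```

Samples that sit deep inside one class have neighbors that agree and are skipped, while a sample between the two groups is flagged for labeling by the supervisor.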

5.

An optimized instance based learning algorithm for estimation of compressive strength of concrete

Behrouz Ahmadi-Nedushan 《Engineering Applications of Artificial Intelligence》2012,25(5):1073-1081

This article proposes an optimized instance-based learning approach for predicting the compressive strength of high performance concrete from mix data, such as water to binder ratio, water content, super-plasticizer content, and fly ash content. The base algorithm used in this study is the k nearest neighbor algorithm, which is an instance-based machine learning algorithm. Five different models were developed and analyzed to investigate the effects of the number of neighbors, the distance function, and the attribute weights on the performance of the models. For each model, a modified version of the differential evolution algorithm was used to find the optimal model parameters. Moreover, two further models, based on a generalized regression neural network and stepwise regression, were also developed. The performance of the models was evaluated using a set of high strength concrete mix data. The results of this study indicate that the optimized models outperform those derived from the standard k nearest neighbor algorithm, and that the proposed models perform better than the generalized regression neural network, stepwise regression, and modular neural network models.

6.

A new fault diagnosis method for analog circuits based on an immune memory network

赵鹏 王友仁 崔江 罗慧 《信息与控制》2010,39(5):574-580

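A fault diagnosis method for analog circuits is proposed, based on immune memory network theory and the k nearest neighbor algorithm. First, the immune memory network is used to find the best memory antibodies for each fault space. In the immune memory network, memory antibodies are selected according to concentration, which promotes a uniform distribution of memory antibodies across the fault spaces. Cloning and hypermutation mechanisms are used to ensure antibody diversity, and concentration and expected value are then used to promote and suppress antibodies so as to avoid premature convergence. Next, based on the best memory antibodies obtained for each fault space, an improved threshold k nearest neighbor algorithm is used to classify antigens by fault. Finally, taking a band-pass filter as the diagnosis example, measured data from a real circuit and simulation data are used as test samples to evaluate fault diagnosis performance. The experimental results show that the method achieves a high fault diagnosis rate.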

7.

A feature weighted support vector machine and K-nearest neighbor algorithm for stock market indices prediction

《Expert systems with applications》2017

This study investigates the prediction of stock market indices, an interesting and important research topic in investment and its applications, because effective exchange strategies can yield higher profits and returns at a lower risk rate. To achieve accurate prediction, various methods have been tried, among which machine learning methods have drawn attention and been further developed. In this paper, we propose a basic hybridized framework of a feature weighted support vector machine and a feature weighted K-nearest neighbor to predict stock market indices effectively. We first establish a detailed theory of the feature weighted SVM for data classification, which assigns different weights to different features according to their importance for classification. Then, to obtain the weights, we estimate the importance of each feature by computing its information gain. Lastly, we use the feature weighted K-nearest neighbor to predict future stock market indices by computing the k weighted nearest neighbors from the historical dataset. Experimental results on two well-known Chinese stock market indices, the Shanghai and Shenzhen stock exchange indices, are finally presented to test the performance of the established model. The proposed model achieves a better prediction capability for the Shanghai Stock Exchange Composite Index and the Shenzhen Stock Exchange Component Index in the short, medium, and long term. The proposed algorithm can also be adapted to the prediction of other stock market indices.
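
The weighting step described in this abstract can be illustrated with a small Python sketch that computes the information gain of each feature and uses the gains as weights in a Euclidean distance. It assumes features that are already discretized, and the helper names (`entropy`, `information_gain`, `weighted_distance`) are hypothetical; the abstract does not say how the paper handles continuous values or how the weights enter the SVM.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def weighted_distance(a, b, weights):
    """Euclidean distance in which each squared difference is scaled by a feature weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


if __name__ == "__main__":
    labels = ["up", "up", "down", "down"]
    informative = [1, 1, 0, 0]    # separates the classes perfectly
    uninformative = [0, 1, 0, 1]  # carries no class information
    weights = [information_gain(informative, labels),
               information_gain(uninformative, labels)]
    print(weights, weighted_distance((0, 0), (3, 4), weights))
```

With these weights, differences along the uninformative feature no longer influence which historical samples count as nearest neighbors.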

8.

An improved unsupervised word sense disambiguation method for Chinese full text

李旭 刘国华 张东明 《自动化学报》2010,36(1):184-187

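To address the slow convergence and heavy computation of existing unsupervised word sense disambiguation methods based on EM (expectation maximization) iteration, features are selected by combining mutual information with the Z-test, and initial parameter values are estimated with a statistical learning algorithm. Experimental results show that the improved method effectively raises the accuracy of Chinese word sense disambiguation and offers good scalability and practicality.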

9.

A probabilistic active support vector learning algorithm (total citations: 3; self-citations: 0; citations by others: 3)

Mitra P Murthy CA Pal SK 《IEEE transactions on pattern analysis and machine intelligence》2004,26(3):413-418

The paper describes a probabilistic active learning strategy for support vector machine (SVM) design in large data applications. The learning strategy is motivated by the statistical query model. While most existing methods of active SVM learning query for points based on their proximity to the current separating hyperplane, the proposed method queries for a set of points according to a distribution determined by the current separating hyperplane and a newly defined concept of an adaptive confidence factor. This gives the algorithm more robust and efficient learning capabilities. The confidence factor is estimated from local information using the k nearest neighbor principle. The effectiveness of the method is demonstrated on real-life data sets in terms of generalization performance, query complexity, and training time.

10.

Bayesian robust linear dynamic system approach for dynamic process monitoring

《Journal of Process Control》2016

In this paper, a Bayesian robust linear dynamic system approach is proposed for process modeling. The traditional linear dynamic system (LDS), constructed with a Kalman filter, is designed under a Gaussian assumption, which can easily be violated in non-Gaussian modeling situations, especially those with outliers. To deal with this issue, the conventional Gaussian-based Kalman filter is modified with the heavy-tailed Student's t-distribution so that it can handle non-Gaussian noise and modeling outliers. Then, a variational Bayesian expectation maximization (VBEM) algorithm is developed for learning the parameters of the robust linear dynamic system. For process monitoring, traditional monitoring schemes are discussed and the residual space monitoring mechanism is improved. To explore its feasibility and effectiveness, the proposed method is applied to fault detection, with detailed comparative studies against several other methods on the Tennessee Eastman benchmark.

11.

A fast prototype reduction method based on template reduction and visualization-induced self-organizing map for nearest neighbor algorithm

I-Jing Li Jia-Chian Chen Jiunn-Lin Wu 《Applied Intelligence》2013,39(3):564-582

The k nearest neighbor is a lazy learning algorithm that is inefficient in the classification phase because it needs to compare the query sample with all training samples. A template reduction method was recently proposed that uses only samples near the decision boundary for classification and removes those far from it. However, when class distributions overlap, more border samples are retained, which leads to inefficient performance in the classification phase. Because the number of samples that can be removed is limited, using an appropriate feature reduction method seems a logical choice for improving classification time. This paper proposes a new prototype reduction method for the k nearest neighbor algorithm, based on template reduction and ViSOM. A useful property of ViSOM is that it displays the topology of the data on a two-dimensional feature map, which gives users an intuitive way to observe and analyze data. An efficient classification framework is then presented that combines the feature reduction method and the prototype selection algorithm. It needs only a very small amount of data for classification while maintaining the recognition rate. In the experiments, both synthetic and real datasets are used to evaluate the performance. Experimental results demonstrate that the proposed method obtains a speedup ratio above 70% and a compression ratio above 90% while maintaining performance similar to kNN.

12.

Missing value imputation: an information gain based method

张红霞《计算机工程与设计》2006,27(24):4810-4812

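Fields such as data mining and machine learning all involve a data preprocessing stage that removes errors, noise, inconsistent data, or missing values from the data. Imputing missing values is a particularly challenging task, because the quality of the imputation strongly affects the subsequent processing of learning and mining algorithms. Some existing imputation algorithms, such as those based on rough sets and on the nearest neighbor method, can handle missing values to some extent. Unlike these methods, an extended missing value imputation algorithm based on information gain is proposed, which makes full use of the implicit relationships among the attributes of a dataset to fill in the missing data. Extensive experiments show that the proposed algorithm is effective.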

13.

Multi-label lazy learning algorithm based on granular computing

赵海峰 余强 曹俞旦 《计算机科学》2014,41(12):160-163

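Multi-label learning deals with problems in which a single sample has multiple labels at the same time. The existing multi-label lazy learning algorithm IMLLA does not fully consider the distribution of samples: when it builds the neighbor set of a sample, the number of neighbors is fixed, which may exclude highly similar points from the neighbor set or include points of low similarity, degrading classification performance. To address this defect of IMLLA, the idea of granular computing is introduced into the construction of the neighbor set, and a granular computing based multi-label lazy learning algorithm (GMLLA) is proposed. The method determines the neighbor set of a sample through granularity control, so that the samples in the neighbor set are highly similar. Experimental results show that the algorithm outperforms IMLLA.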

14.

Kernel extreme learning machine classifier with evolutionary Bayesian optimization

张梦蝶 覃华 苏一丹 《计算机工程与设计》2022,43(2):399-405

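To address the difficulty of parameter optimization in the traditional kernel extreme learning machine algorithm and to improve classification accuracy, a kernel extreme learning machine algorithm with improved Bayesian optimization is proposed. A salp swarm is used to design the lower confidence bound strategy of the acquisition function in the Bayesian optimization framework, improving the algorithm's local search and optimization ability. This improved Bayesian optimization algorithm is then used to optimize the parameters of the kernel extreme learning machine, and the optimal parameters are used to construct the kernel extreme learning machine classifier. Simulation experiments were conducted on real UCI datasets; the experiments...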

15.

Performance of KNN and SVM classifiers on full word Arabic articles

Ismail Hmeidi Eyas El-Qawasmeh 《Advanced Engineering Informatics》2008,22(1):106-111

This paper reports a comparative study of two machine learning methods for Arabic text categorization. Using one collection of news articles as a training set and another as a testing set, we evaluated the K nearest neighbor (KNN) algorithm and the support vector machine (SVM) algorithm. We used full-word features, with tf.idf as the weighting method for feature selection and CHI statistics as a ranking metric. Experiments showed that both methods performed very well on the test corpus, while SVM showed a better micro-average F1 and prediction time.
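
For readers unfamiliar with the tf.idf weighting mentioned in this abstract, the following Python sketch computes one common variant: term frequency normalized by document length, multiplied by the natural logarithm of the inverse document frequency. The exact variant used in the paper is not stated, so this formula and the function name `tf_idf` are assumptions for illustration only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document; documents are lists of tokens."""
    n_docs = len(documents)
    # Document frequency: the number of documents in which each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    docs = [["market", "news"], ["market", "sport"]]
    for row in tf_idf(docs):
        print(row)
```

A term that occurs in every document, such as "market" in the usage example, receives a weight of zero, which is why tf.idf can double as a crude feature filter.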

16.

Imbalanced data classification method based on support vector over-sampling

曹路《计算机科学》2016,43(12):97-100

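Traditional support vector machines perform poorly on imbalanced data. To improve the recognition accuracy for minority-class samples, an over-sampling method based on support vectors is proposed. First, noise in the original dataset is removed using the K nearest neighbor idea. Then a support vector machine is trained on the training set to obtain the support vectors, and noise following a certain distribution is added to each support vector of the minority class, increasing the number of minority-class samples to obtain a relatively balanced dataset. Finally, the new dataset is learned with a support vector machine. Experimental results show that the method is effective on both artificial datasets and UCI benchmark datasets.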

17.

Online sequential active learning method

翟俊海 臧立光 张素芳 《计算机科学》2017,44(1):37-41

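The real world contains a large amount of unlabeled data, such as various medical image data and web page data. In the era of big data this situation is even more pronounced. Labeling such data is very costly. Active learning is an effective means of addressing this problem, and it has been a research hotspot in machine learning and data mining in recent years. An active learning algorithm based on the online sequential extreme learning machine is proposed. By exploiting the incremental learning of the online sequential extreme learning machine, the algorithm can significantly improve the efficiency of the learning system. In addition, the algorithm uses sample entropy as a heuristic to measure the importance of unlabeled samples, and uses a K nearest neighbor classifier as the oracle that labels the selected unlabeled samples. Experimental results show that the proposed algorithm learns quickly and labels accurately.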

18.

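Gene regulatory network construction method based on an improved SEM algorithm (total citations: 1; self-citations: 0; citations by others: 1)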

葛玲玲 王浩 姚宏亮 《计算机应用研究》2010,27(2):450-452

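Dynamic Bayesian networks (DBN) are a powerful modeling tool for gene regulatory networks. The Bayesian structural expectation maximization (SEM) algorithm handles missing data well when constructing gene regulatory networks, but the results it learns depend heavily on the initial parameter settings. To address this problem, an improved SEM algorithm is proposed: it randomly generates several candidate initial values, selects the best of the parameters obtained after one iteration as the model's initial parameter values, and then runs the basic SEM algorithm. A gene regulatory network is constructed from brewer's yeast cell-cycle microarray expression data and compared with the existing literature; the results show that the algorithm further improves the accuracy of regulatory network construction.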

19.

A novel supervised cluster adjustment method using a fast exact nearest neighbor search algorithm

Ali Zaghian Fakhroddin Noorbehbahani 《Pattern Analysis & Applications》2017,20(3):701-715

Supervised clustering is a new research area that aims to improve unsupervised clustering algorithms by exploiting supervised information. There are several clustering algorithms today, but an effective supervised cluster adjustment method that can adjust the resulting clusters regardless of the clustering algorithm applied has not yet been presented. In this paper, we propose a new supervised cluster adjustment method that can be applied to any clustering algorithm. Since the adjustment method is based on finding the nearest neighbors, a novel exact nearest neighbor search algorithm is also introduced, which is significantly faster than the classic one. Several datasets and clustering evaluation metrics are employed to examine comprehensively the effectiveness of the proposed cluster adjustment method and the proposed fast exact nearest neighbor algorithm. The experimental results show that the proposed algorithms are significantly effective in improving clusters and accelerating nearest neighbor searches.

20.

A novel discrete particle swarm optimization algorithm for solving bayesian network structures learning problem

《International Journal of Computer Mathematics》2012,89(12):2423-2440

The Bayesian network is an effective representation tool for describing the uncertainty of knowledge in artificial intelligence. One important method for learning a Bayesian network from data is to employ a search procedure to explore the space of networks and a scoring metric to evaluate each candidate structure. In this paper, a novel discrete particle swarm optimization algorithm is designed to solve the problem of Bayesian network structure learning. The proposed algorithm not only maintains the search advantages of classical particle swarm optimization but also matches the characteristics of Bayesian networks. Meanwhile, mutation and neighbor searching operators are used to overcome the drawback of premature convergence and to balance the exploration and exploitation abilities of particle swarm optimization. The experimental results on benchmark networks illustrate the feasibility and effectiveness of the proposed algorithm, and the comparative experiments indicate that it is highly competitive with other algorithms.