Found 20 similar documents (search time: 10 ms)
1.
Neural Computing and Applications - Naive Bayes makes an assumption regarding conditional independence, but this assumption rarely holds true in real-world applications, so numerous attempts have...
2.
The value difference metric (VDM) is one of the best-known and most widely used distance functions for nominal attributes. This work applies the instance-weighting technique to improve VDM. An instance weighted value difference metric (IWVDM) is proposed here. Different from prior work, IWVDM uses naive Bayes (NB) to find weights for training instances. Because early work has shown that there is a close relationship between VDM and NB, some work on NB can be applied to VDM. The weight of a training instance x that belongs to the class c is assigned according to the difference between the conditional probability P̂(c|x) estimated by NB and the true conditional probability P(c|x), and the weight is adjusted iteratively. Compared with previous work, IWVDM has the advantage of reducing the time complexity of the process of finding weights while simultaneously improving the performance of VDM. Experimental results on 36 UCI datasets validate the effectiveness of IWVDM.
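For orientation, the plain (unweighted) VDM that IWVDM builds on can be sketched as below. This is a minimal illustration of the commonly cited definition, in which the distance between two values of a nominal attribute is the sum over classes of |P(c|a) - P(c|b)|^q. The function names, the default q=1, and the toy data are assumptions made for illustration; the NB-based instance weighting described in the abstract is not reproduced here.

```python
from collections import Counter, defaultdict

def fit_vdm(X, y):
    """Estimate P(class | attribute j = value) from nominal training data."""
    classes = sorted(set(y))
    n_attrs = len(X[0])
    # counts[j][value][label] = number of training rows with that value and label
    counts = [defaultdict(Counter) for _ in range(n_attrs)]
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            counts[j][value][label] += 1
    probs = []
    for j in range(n_attrs):
        table = {}
        for value, ctr in counts[j].items():
            total = sum(ctr.values())
            table[value] = {c: ctr[c] / total for c in classes}
        probs.append(table)
    return classes, probs

def vdm_distance(a, b, classes, probs, q=1):
    """Sum over attributes and classes of |P(c|a_j) - P(c|b_j)|**q.

    Values never seen in training raise KeyError in this sketch.
    """
    total = 0.0
    for j, (va, vb) in enumerate(zip(a, b)):
        pa, pb = probs[j][va], probs[j][vb]
        total += sum(abs(pa[c] - pb[c]) ** q for c in classes)
    return total

# Hypothetical toy data: colour determines the class, size does not.
X = [("red", "s"), ("red", "l"), ("blue", "s"), ("blue", "l")]
y = ["A", "A", "B", "B"]
classes, probs = fit_vdm(X, y)
print(vdm_distance(("red", "s"), ("blue", "l"), classes, probs))
```

In this toy data only the first attribute separates the classes, so it contributes the whole distance and the second attribute contributes nothing.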
3.
Due to its simplicity, efficiency and efficacy, naive Bayes (NB) continues to be one of the top 10 data mining algorithms. Many improved approaches to NB have been proposed to weaken its conditional independence assumption. However, there has been little work so far on instance weighting filter approaches to NB. In this paper, we propose a simple, efficient, and effective instance weighting filter approach to NB. We call it attribute (feature) value frequency-based instance weighting and denote the resulting improved model as attribute value frequency weighted naive Bayes (AVFWNB). In AVFWNB, the weight of each training instance is defined as the inner product of its attribute value frequency vector and the attribute value number vector. The experimental results on 36 widely used classification problems show that AVFWNB significantly outperforms NB while maintaining the computational simplicity that characterizes NB.
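The inner-product weight described in the abstract above can be sketched as follows. This is one possible reading, not the authors' implementation: it assumes the "attribute value frequency vector" of an instance holds the relative frequency of each of its attribute values in the training set, and the "attribute value number vector" holds the number of distinct values of each attribute. The function name and toy data are hypothetical.

```python
from collections import Counter

def avf_instance_weights(X):
    """Weight each instance by <frequency vector, value-number vector>.

    Assumed reading of the abstract:
      frequency vector[j]    = share of training rows with the same value of attribute j
      value-number vector[j] = number of distinct values of attribute j
    """
    n = len(X)
    n_attrs = len(X[0])
    value_counts = [Counter(row[j] for row in X) for j in range(n_attrs)]
    n_values = [len(vc) for vc in value_counts]
    weights = []
    for row in X:
        freqs = [value_counts[j][row[j]] / n for j in range(n_attrs)]
        weights.append(sum(f * k for f, k in zip(freqs, n_values)))
    return weights

# Hypothetical nominal training data with two attributes.
X = [("a", "x"), ("a", "y"), ("b", "x"), ("a", "x")]
print(avf_instance_weights(X))
```

Under this reading, instances made up of frequent attribute values receive larger weights, and the weights could then be used as instance counts when estimating the NB probabilities.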
4.
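Multidimensional scaling (MDS) usually measures the dissimilarity (similarity) between objects by the distance between points in Euclidean space. When objects have nominal attributes such as gender or color, the usual practice is to quantify them and then apply the Euclidean distance, which is clearly not well justified. This work introduces the heterogeneous value difference metric (HVDM) into the computation of distances between objects with nominal attributes, in order to make MDS more reasonable for nominal attributes. Experiments on the UCI Abalone dataset show that the method outperforms the traditional quantification approach in both reconstruction capability and reconstruction accuracy.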
5.
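In outlier detection, traditional distance-based methods cannot handle symbolic attributes effectively and classical rough set methods cannot handle numerical attributes effectively. To address this, an improved neighborhood value difference metric (NVDM) is proposed for outlier detection, exploiting the granulation property of neighborhood rough sets. First, attribute values are normalized, and a neighborhood information system (NIS) is built using the heterogeneous Euclidean-overlap metric (HEOM) and an adaptive neighborhood radius. Second, the neighborhood outlier factor (NOF) of each object is constructed with NVDM. Finally, an NVDM-based outlier detection (NVDMOD) algorithm is designed and implemented; in computing the single attribute neighborhood cover (SANC), it uses ordered bisection and nearest neighbor search to improve on the traditional unordered one-by-one computation. The algorithm was compared experimentally on UCI benchmark datasets with existing outlier detection algorithms: the neighborhood outlier detection (NED) algorithm, the distance-based outlier detection (DIS) algorithm, and the K-nearest neighbor (KNN) algorithm. The results show that NVDMOD has better adaptability and effectiveness, offering a more effective new approach to outlier detection in datasets with mixed attributes.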
6.
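To improve the classification accuracy and generalization ability of the naive Bayes classifier, a weighted Bayesian ensemble method based on attribute correlation (WEBNC) is proposed. Each condition attribute is assigned a weight according to its degree of correlation with the decision attribute, and AdaBoost is then used to train the attribute-weighted BNC. The method was tested on 16 UCI benchmark datasets and compared with BNC, Bayesian networks, and BNC trained with AdaBoost; the experimental results show that the classifier achieves higher classification accuracy and better generalization ability.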
7.
The naive Bayes classifier continues to be a popular learning algorithm for data mining applications due to its simplicity and linear run-time. Many enhancements to the basic algorithm have been proposed to help mitigate its primary weakness: the assumption that attributes are independent given the class. All of them improve the performance of naive Bayes at the expense (to a greater or lesser degree) of execution time and/or simplicity of the final model. In this paper we present a simple filter method for setting attribute weights for use with naive Bayes. Experimental results show that naive Bayes with attribute weights rarely degrades the quality of the model compared to standard naive Bayes and, in many cases, improves it dramatically. The main advantages of this method compared to other approaches for improving naive Bayes are its run-time complexity and the fact that it maintains the simplicity of the final model.
8.
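In existing neighborhood rough set models all attributes have the same weight, so there is no guarantee that key attributes are retained during attribute reduction. To address this, an attribute reduction algorithm based on information-entropy weighting is proposed. First, between-class entropy and within-class entropy strategies are adopted, and attributes are weighted on the principle of maximizing between-class entropy and minimizing within-class entropy. Second, a weighted neighborhood rough set model based on a weighted neighborhood relation is constructed. Finally, the importance of attribute subsets is evaluated based on dependency, thereby achieving attribute reduction. Comparative experiments against three other attribute reduction algorithms on UCI datasets show that the algorithm effectively removes redundancy and improves classification accuracy.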
9.
Due to data sparseness and attribute redundancy in high-dimensional data, clusters of objects often exist in subspaces rather than in the entire space. To effectively address this issue, this paper presents a new optimization algorithm for clustering high-dimensional categorical data, which is an extension of the k-modes clustering algorithm. In the proposed algorithm, a novel weighting technique for categorical data is developed to calculate two weights for each attribute (or dimension) in each cluster and use the weight values to identify the subsets of important attributes that categorize different clusters. The convergence of the algorithm under an optimization framework is proved. The performance and scalability of the algorithm are evaluated experimentally on both synthetic and real data sets. The experimental studies show that the proposed algorithm is effective in clustering categorical data sets and also scalable to large data sets owing to its linear time complexity with respect to the number of data objects, attributes or clusters.
10.
In this study, risk factors for traffic accidents related to environmental (climatological) conditions on the Konya-Afyonkarahisar highway were identified with the aid of Geographical Information Systems (GIS), using a combination of K-means clustering (KMC)-based attribute weighting (KMCAW) and classifier algorithms including an artificial neural network (ANN) and an adaptive network-based fuzzy inference system (ANFIS). The locations of the motor vehicle accidents were identified by the dynamic segmentation process in ArcGIS9.0 from the traffic accident reports recorded by the District Traffic Agency. The attributes obtained from this system are the day, temperature, humidity, weather conditions, and month in which the traffic accidents occurred. The traffic accident dataset comprises these five attributes and 358 observations, 179 without an accident and 179 with an accident. The proposed method comprises two stages. In the first stage, all attributes of the dataset are weighted using the KMCAW method. The aims of this weighting are both to increase the classification performance of the classifier used and to transform the linearly non-separable traffic accident dataset into a linearly separable one. In the second stage, after weighting, the ANN and ANFIS classifiers are used separately to classify each case as with or without an accident. To evaluate the performance of the proposed method, classification accuracy, sensitivity, specificity, and area under the ROC (Receiver Operating Characteristic) curve (AUC) were used.
While the ANN and ANFIS classifiers alone obtained overall prediction accuracies of 53.93% and 38.76%, respectively, the combination of KMCAW with ANN and the combination of KMCAW with ANFIS achieved overall prediction accuracies of 74.15% and 55.06%, respectively, in predicting traffic accidents. The experimental results demonstrate that the proposed attribute weighting method, KMCAW, is a robust and effective data pre-processing method for predicting traffic accidents on the Konya-Afyonkarahisar highway in Turkey.
11.
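Applying ABAC (attribute-based access control) in new large-scale computing environments faces several problems: attributes are numerous, come from complex sources, vary in quality, are hard to correct manually, and are hard to apply directly to access control. To address the optimization of nominal attribute values, an attribute value optimization algorithm based on permission clustering is designed. Entities are represented as their corresponding permission sets and clustered with a density-based method, each entity is assigned the class label corresponding to its permissions, and attribute values are then reduced and corrected based on rough set theory. Finally, the algorithm was validated on public UCI datasets, showing that after applying it ABAC policy mining improves substantially in both true positive rate and F1 score.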
13.
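To improve the accuracy of behavior prediction for loan customers, and to address the inadequate handling of non-numerical factors by the traditional K-nearest neighbor (KNN) algorithm in data analysis, an improved KNN algorithm is proposed that uses the value difference metric (VDM) distance and iteratively optimizes the clustering results. The collected data are first clustered with the VDM-distance-based KNN algorithm, the clustering results are then analyzed iteratively, and finally the prediction accuracy is improved through joint training. A comparison on customer data collected by a Portuguese retail bank from 2008 to 2013 shows that the improved KNN algorithm has better performance and stability than existing algorithms such as traditional KNN, the improved KNN based on attribute-value correlation distance (FCD-KNN), Gaussian Bayes, and Gradient Boosting, and is of considerable practical value for predicting customer behavior from bank data.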
14.
This paper represents another step in overcoming a drawback of K-Means, its lack of defense against noisy features, using feature weights in the criterion. The Weighted K-Means method by Huang et al. (2008, 2004, 2005) [5], [6], [7] is extended to the corresponding Minkowski metric for measuring distances. Under Minkowski metric the feature weights become intuitively appealing feature rescaling factors in a conventional K-Means criterion. To see how this can be used in addressing another issue of K-Means, the initial setting, a method to initialize K-Means with anomalous clusters is adapted. The Minkowski metric based method is experimentally validated on datasets from the UCI Machine Learning Repository and generated sets of Gaussian clusters, both as they are and with additional uniform random noise features, and appears to be competitive in comparison with other K-Means based feature weighting algorithms.
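The remark that feature weights become rescaling factors under the Minkowski metric can be checked numerically with a short sketch. It assumes, as one reading of the abstract, that the weights enter the distance raised to the same power p as the Minkowski exponent and that they are non-negative; the function names and numbers below are hypothetical.

```python
def weighted_minkowski_p(x, c, w, p):
    """Weighted Minkowski distance to the power p: sum_v w_v**p * |x_v - c_v|**p."""
    return sum((wv ** p) * abs(xv - cv) ** p for xv, cv, wv in zip(x, c, w))

def rescaled_minkowski_p(x, c, w, p):
    """Plain Minkowski distance (to the power p) after rescaling each feature by w_v."""
    return sum(abs(wv * xv - wv * cv) ** p for xv, cv, wv in zip(x, c, w))

# Hypothetical point, centroid, and non-negative feature weights.
x, c, w, p = [1.0, 2.0], [0.0, 4.0], [0.5, 2.0], 2
print(weighted_minkowski_p(x, c, w, p), rescaled_minkowski_p(x, c, w, p))
```

Because the two quantities coincide for non-negative weights, a weighted criterion of this form can be evaluated by rescaling the features and then using an unweighted Minkowski distance.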
15.
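Applying existing k-anonymity methods directly to the publication of data with multiple sensitive attributes causes substantial privacy leakage. To address this, a privacy protection algorithm for joint sensitive attributes based on semantic similarity and multi-dimensional weighting is proposed. Through semantic-similarity-based anti-clustering and flexible setting of weights for the values of multiple sensitive attributes, the algorithm achieves privacy protection with grouping that is diverse in both joint sensitive attribute values and semantics, and provides different strengths of privacy protection according to application needs. Experimental results show that the method effectively protects data privacy and improves the security and utility of data publishing.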
16.
Artificial Intelligence Review - Attribute weighting is a task of paramount relevance in multi-attribute decision-making (MADM). Over the years, different approaches have been developed to face...
17.
In the fields of pattern recognition and machine learning, the use of data preprocessing algorithms has been increasing in recent years to achieve high classification performance. In particular, data preprocessing prior to classification has become indispensable when classifying medical datasets with nonlinear and imbalanced data distributions. In this study, a new data preprocessing method is proposed for the classification of the Parkinson, hepatitis, Pima Indians, single photon emission computed tomography (SPECT) heart, and thoracic surgery medical datasets, which have nonlinear and imbalanced data distributions. These datasets were taken from the UCI machine learning repository. The proposed data preprocessing method consists of three steps. In the first step, the cluster centers of each attribute are calculated using the k-means, fuzzy c-means, and mean shift clustering algorithms. In the second step, the absolute differences between the data in each attribute and the cluster centers are calculated, and then the average of these differences is calculated for each attribute. In the final step, the weighting coefficients are calculated by dividing the mean value of the difference to the cluster centers, and weighting is then performed by multiplying the attribute values in the dataset by the obtained weight coefficients. Three different attribute weighting methods are proposed: (1) similarity-based attribute weighting in k-means clustering, (2) similarity-based attribute weighting in fuzzy c-means clustering, and (3) similarity-based attribute weighting in mean shift clustering. In this paper, we aim to gather the data in each class together using the proposed attribute weighting methods and to reduce the within-class variance.
Thus, by reducing the variance within each class, we bring the data in each class together and at the same time further increase the discrimination between the classes. For comparison with other methods in the literature, random subsampling was used to handle imbalanced dataset classification. After the attribute weighting process, four classification algorithms (linear discriminant analysis, k-nearest neighbor classifier, support vector machine, and random forest classifier) were used to classify the imbalanced medical datasets. To evaluate the performance of the proposed models, classification accuracy, precision, recall, area under the ROC curve, κ value, and F-measure were used. For training and testing the classifier models, three different methods were used: the 50–50% train–test holdout, the 60–40% train–test holdout, and tenfold cross-validation. The experimental results show that the proposed attribute weighting methods achieve higher classification performance than the random subsampling method in classifying the imbalanced medical datasets.
18.
Quantifier-guided aggregation is used for aggregating multiple-criteria input. The selection of appropriate quantifiers is therefore crucial in multicriteria aggregation, since the weights for the aggregation are generated from the selected quantifier. Since Yager proposed a method for obtaining the ordered weighted averaging (OWA) vector via the three relative quantifiers used for quantifier-guided aggregation, limited efforts have been devoted to developing new quantifiers that are suitable for use in multicriteria aggregation. In this correspondence, we propose some new quantifier functions that are based on weighting functions characterized by a constant value of orness independent of the number of criteria aggregated. The proposed regular increasing monotone and regular decreasing monotone quantifiers produce the same orness as the weighting functions from which each quantifier function originates. Further, the quantifier orness rapidly converges to the value of orness of the weighting functions having a constant value of orness. This result indicates that a quantifier-guided OWA aggregation will result in a similar aggregate provided the number of criteria is not too small.
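How OWA weights are generated from a quantifier can be illustrated with a small sketch. It uses Yager's standard construction for a regular increasing monotone quantifier Q, w_i = Q(i/n) - Q((i-1)/n), and the usual orness measure of an OWA weight vector. The power quantifier Q(r) = r^alpha is only a stand-in chosen for illustration, not one of the quantifiers proposed in the abstract, and all function names are hypothetical.

```python
def rim_owa_weights(n, quantifier):
    """OWA weights from a regular increasing monotone quantifier Q:
    w_i = Q(i/n) - Q((i-1)/n) for i = 1..n."""
    return [quantifier(i / n) - quantifier((i - 1) / n) for i in range(1, n + 1)]

def orness(weights):
    """Orness of an OWA vector: (1/(n-1)) * sum_i (n - i) * w_i, for n >= 2."""
    n = len(weights)
    return sum((n - i) * w for i, w in enumerate(weights, start=1)) / (n - 1)

def owa(values, weights):
    """OWA aggregate: weights are applied to the values sorted in descending order."""
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))

def power_quantifier(alpha):
    """Illustrative quantifier Q(r) = r**alpha (an assumption, not from the abstract)."""
    return lambda r: r ** alpha

q = power_quantifier(2)
for n in (2, 10, 1000):
    # The orness of the generated weights approaches the integral of Q over [0, 1], here 1/3.
    print(n, orness(rim_owa_weights(n, q)))
print(owa([10, 20], rim_owa_weights(2, q)))
```

For this stand-in quantifier the orness of the generated weights changes with the number of criteria n and only approaches 1/3 as n grows, which is the kind of dependence on n that the abstract discusses.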
19.
A novel weighting method is proposed for multimodel predictive control of nonlinear systems with multiple scheduling variables (MIMO nonlinear systems), in which the gap metric is employed to formulate weighting functions for combining local controllers. Compared to existing weighting functions, the proposed weighting method has two major advantages. First, there is only one tuning parameter, which makes it simpler. Second, the weights depend only on the scheduling vector and can be calculated off-line and stored in a look-up table. Therefore, the computational load can be reduced, especially for nonlinear systems with multiple scheduling variables. A MIMO CSTR system is studied to demonstrate the effectiveness of the proposed weighting method.