Similar Literature
20 similar documents found.
1.
Missing data poses a challenge to clustering algorithms. Traditional approaches typically fill in incomplete data with mean or regression imputation and then cluster the filled data. To address the degradation in imputation accuracy and clustering quality that mean and regression imputation suffer as the missing-data ratio grows, this paper proposes a new similarity computation method for incomplete data. Attributes are ordered by expected mutual information, fully exploiting position-dependent characteristics of the attribute values, and missing values are filled from elements of the dataset itself through similarity computation on the ordered incomplete dataset; a clustering algorithm based on local density is then applied. The imputation-and-clustering algorithm is validated on datasets from the UCI machine learning repository. Experimental results show that as the number of missing values in a dataset grows, the algorithm tolerates missing values well, recovers missing elements effectively, and performs well in both imputation accuracy and final clustering quality. Because the proposed similarity computation considers every attribute value in the dataset to fill missing values one by one, it is relatively time-consuming.
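A minimal Python sketch of the pipeline this abstract describes; the binning, the Euclidean similarity, and the DBSCAN settings are illustrative assumptions, not the authors' algorithm:

```python
# Sketch: rank attributes by average mutual information, fill each missing
# entry from the most similar complete row, then density-based clustering.
# Assumes at least one fully complete row exists.
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.cluster import DBSCAN

def impute_and_cluster(X):
    X = np.asarray(X, dtype=float)
    n, d = X.shape

    def avg_mi(j):
        # average MI of attribute j with every other attribute, computed on
        # rows where both are observed (values discretised into 5 bins)
        scores = []
        for k in range(d):
            if k == j:
                continue
            m = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
            if m.sum() > 1:
                scores.append(mutual_info_score(
                    np.digitize(X[m, j], np.histogram_bin_edges(X[m, j], 5)),
                    np.digitize(X[m, k], np.histogram_bin_edges(X[m, k], 5))))
        return np.mean(scores) if scores else 0.0

    order = sorted(range(d), key=avg_mi, reverse=True)
    complete = X[~np.isnan(X).any(axis=1)]
    Xf = X.copy()
    for i in range(n):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        obs = [j for j in order if not miss[j]]   # observed attrs, MI order
        # donor: the most similar complete row on the observed attributes
        donor = complete[np.argmin(np.linalg.norm(
            complete[:, obs] - X[i, obs], axis=1))]
        Xf[i, miss] = donor[miss]
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(Xf)
    return Xf, labels
```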

2.
This paper proposes a new approach based on missing value pattern discovery for classifying incomplete data. This approach is particularly designed for classification of datasets with a small number of samples and a high percentage of missing values, where available missing value treatment approaches do not usually work well. Based on the pattern of the missing values, the proposed approach finds subsets of samples for which most of the features are available and trains a classifier for each subset. Then, it combines the outputs of the classifiers. Subset selection is translated into a clustering problem, allowing derivation of a mathematical framework for it. A trade-off is established between the computational complexity (number of subsets) and the accuracy of the overall classifier. To deal with this trade-off, a numerical criterion is proposed for predicting the overall performance. The proposed method is applied to seven datasets from the popular University of California, Irvine data mining archive and an epilepsy dataset from Henry Ford Hospital, Detroit, Michigan (eight datasets in total). Experimental results show that the classification accuracy of the proposed method is superior to that of the widely used multiple imputation method and four other methods. They also show that the level of superiority depends on the pattern and percentage of missing values.
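The pattern-based scheme lends itself to a short sketch. The following Python outline is illustrative only: samples are grouped by their exact missingness pattern (a coarse stand-in for the paper's clustering-based subset selection), one classifier is trained per pattern, and predictions are combined by majority vote:

```python
# Sketch: one classifier per missing-value pattern, majority-vote combination.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_pattern_ensemble(X, y):
    patterns = {}
    for i, row in enumerate(X):
        key = tuple(np.isnan(row))            # the row's missing-value pattern
        patterns.setdefault(key, []).append(i)
    models = []
    for key, idx in patterns.items():
        feats = [j for j, missing in enumerate(key) if not missing]
        if feats:                             # skip all-missing patterns
            clf = DecisionTreeClassifier().fit(X[np.ix_(idx, feats)], y[idx])
            models.append((feats, clf))
    return models

def predict(models, x):
    # only classifiers whose features are all observed in x get a vote
    votes = [clf.predict(x[feats].reshape(1, -1))[0]
             for feats, clf in models
             if not np.isnan(x[feats]).any()]
    return Counter(votes).most_common(1)[0][0] if votes else None
```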

3.
Data preparation is an important step in mining incomplete data. To deal with this problem, this paper introduces a new imputation approach called SN (Shell Neighbors) imputation, or simply SNI. The SNI fills in an incomplete instance (one with missing values) in a given dataset by using only its left and right nearest neighbors with respect to each factor (attribute), referred to as its Shell Neighbors. The left and right nearest neighbors are selected from a set of nearest neighbors of the incomplete instance, whose size is determined by cross-validation. The SNI is then generalized to deal with missing data in datasets with mixed attributes, for example, continuous and categorical attributes. Experiments conducted to evaluate the proposed approach demonstrate that the generalized SNI method outperforms the kNN imputation method in both imputation accuracy and classification accuracy.
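A rough Python sketch of the Shell Neighbors idea under assumed details (Euclidean similarity, averaging the shell donors); the paper's cross-validated neighborhood size and mixed-attribute handling are omitted:

```python
# Sketch: among the k nearest complete neighbours of an incomplete row, keep
# the closest neighbour on each side (left/right) along each observed
# attribute, then average these "shell" donors for the missing attributes.
import numpy as np

def sn_impute(X, k=10):
    X = np.asarray(X, dtype=float)
    complete = X[~np.isnan(X).any(axis=1)]
    Xf = X.copy()
    for i, row in enumerate(X):
        miss = np.where(np.isnan(row))[0]
        if miss.size == 0:
            continue
        obs = np.where(~np.isnan(row))[0]
        d = np.linalg.norm(complete[:, obs] - row[obs], axis=1)
        nn = complete[np.argsort(d)[:k]]      # k nearest complete neighbours
        shell = []
        for j in obs:                          # per-attribute shell neighbours
            left = nn[nn[:, j] <= row[j]]
            right = nn[nn[:, j] > row[j]]
            if len(left):                      # nearest from the left
                shell.append(left[np.argmax(left[:, j])])
            if len(right):                     # nearest from the right
                shell.append(right[np.argmin(right[:, j])])
        donors = np.array(shell) if shell else nn
        Xf[i, miss] = donors[:, miss].mean(axis=0)
    return Xf
```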

4.
In recent years, both industry and academia have faced a severe missing-data problem: missing values greatly reduce data usability. Existing missing-value imputation techniques carry a large time overhead and can hardly meet the real-time requirements of big-data queries. This work therefore studies efficient aggregate query processing in the presence of missing values, combining sampling-based approximate aggregate query processing with missing-value imputation to quickly return aggregate results that satisfy user requirements. A block-level sampling strategy is adopted; missing values are imputed on the collected sample, and an unbiased estimate of the aggregate result is reconstructed from the imputed sample. Experimental results on real and synthetic datasets show that, at the same accuracy, the proposed method greatly improves query efficiency over the current best method.
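A toy Python sketch of the sampling-plus-imputation idea for a SUM query; the block size, sampling fraction, and mean-imputation rule are assumptions, and the Horvitz-Thompson-style scale-up is unbiased under uniform block sampling only if imputation error is ignored:

```python
# Sketch: sample whole blocks, impute missing values inside each sampled
# block, sum, and scale up to estimate the full-table SUM.
import numpy as np

def approx_sum(values, block_size=100, sample_frac=0.1, rng=None):
    rng = rng or np.random.default_rng()
    blocks = [values[i:i + block_size]
              for i in range(0, len(values), block_size)]
    n_sample = max(1, int(len(blocks) * sample_frac))
    sampled = rng.choice(len(blocks), size=n_sample, replace=False)
    total = 0.0
    for b in sampled:
        blk = np.asarray(blocks[b], dtype=float)
        # impute inside the sample: block mean (0.0 if the block is all-NaN)
        fill = np.nanmean(blk) if not np.isnan(blk).all() else 0.0
        blk = np.where(np.isnan(blk), fill, blk)
        total += blk.sum()
    # each block is included with probability n_sample / len(blocks),
    # so scaling the sampled total gives an unbiased SUM estimate
    return total * len(blocks) / n_sample
```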

5.
Intelligent handling of abnormal information is a challenging frontier of information processing. Targeting the case where acquired information is missing due to noise and other interference, an intelligent classification algorithm for incomplete (missing) data is proposed. For an incomplete sample, the method first derives one or more estimated versions of the sample from the class information of its nearest neighbors, preserving imputation accuracy while effectively representing the imprecision caused by missingness; a classifier then classifies the samples carrying the estimated values. Finally, a new belief-based classification method is proposed within the evidential reasoning framework: samples that are hard to assign to a single class are assigned to a corresponding compound class, describing the class uncertainty caused by missing values while reducing the risk of misclassification. Real datasets from the UCI repository are used to verify the algorithm, and the experimental results show that it handles the incomplete-data classification problem effectively.
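A simplified Python sketch of the multi-version idea: one estimated version of the incomplete sample per neighboring class, with a compound class returned when the versions disagree. The class-mean fill, `clf`, and `k` are illustrative assumptions, and the paper's evidential-reasoning combination is not reproduced here:

```python
# Sketch: build one imputed version per class found among the k nearest
# neighbours, classify each version, and fall back to a compound class
# (a frozenset of labels) when the versions disagree.
import numpy as np

def classify_incomplete(x, X_train, y_train, clf, k=5):
    obs = ~np.isnan(x)
    d = np.linalg.norm(X_train[:, obs] - x[obs], axis=1)
    nn_classes = set(y_train[np.argsort(d)[:k]])
    preds = set()
    for c in nn_classes:                     # one estimated version per class
        version = x.copy()
        version[~obs] = X_train[y_train == c][:, ~obs].mean(axis=0)
        preds.add(clf.predict(version.reshape(1, -1))[0])
    # a singleton is a definite class; several labels form a compound class
    return preds.pop() if len(preds) == 1 else frozenset(preds)
```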

6.
In this paper, we present a new method of data decomposition that avoids the need to reason from data with missing attribute values. We first define a general binary relation on the original incomplete dataset. This binary relation generates data subsets without missing values, which are used to generate a topological base relation that decomposes datasets. We investigate a new approach to finding the missing values in incomplete datasets. New pre-topological approximations are introduced and some of their properties are proved. Pre-topological measures are also defined and studied. Finally, the reducts and the core of the incomplete information system are determined.

7.
《Artificial Intelligence》2001,125(1-2):209-226
Naive Bayes classifiers provide an efficient and scalable approach to supervised classification problems. When some entries in the training set are missing, methods exist to learn these classifiers under some assumptions about the pattern of missing data. Unfortunately, reliable information about the pattern of missing data may not be readily available, and recent experimental results show that enforcing an incorrect assumption about the pattern of missing data produces a dramatic decrease in the accuracy of the classifier. This paper introduces a Robust Bayes Classifier (rbc) able to handle incomplete databases with no assumption about the pattern of missing data. To avoid assumptions, the rbc bounds all possible probability estimates within intervals using a specialized estimation method. These intervals are then used to classify new cases by computing intervals on the posterior probability distributions over the classes given a new case and by ranking the intervals according to some criteria. We provide two scoring methods to rank intervals and a decision-theoretic approach to trade off the risk of an erroneous classification against the choice of not classifying a case unequivocally. This decision-theoretic approach can also be used to assess the opportunity of adopting assumptions about the pattern of missing data. The proposed approach is evaluated on twenty publicly available databases.
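The interval construction can be illustrated with a small worked sketch: when some records have the attribute missing, a conditional probability estimate is only known up to the two extremes where none or all of those records carry the value of interest. The variable names below are assumptions, not the paper's notation:

```python
# Sketch: bounds on P(attr = v | class) when n_missing records are unobserved.
def probability_interval(n_matching, n_observed, n_missing):
    """n_matching: observed records carrying the value of interest
    n_observed: records where the attribute is observed
    n_missing:  records where the attribute is missing"""
    total = n_observed + n_missing
    lower = n_matching / total                  # none of the missing match
    upper = (n_matching + n_missing) / total    # all of the missing match
    return lower, upper

# e.g. 30 of 80 observed records match and 20 records are missing:
# probability_interval(30, 80, 20) -> (0.30, 0.50)
```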

8.
Data-driven extended belief rule base expert systems can handle uncertain problems involving quantitative data or qualitative knowledge. The approach has been widely studied and applied, but research on problems with incomplete data is still lacking. For problems on incomplete datasets, this paper therefore proposes a new inference method for extended belief rule base expert systems. An extended rule structure based on the disjunctive normal form is first proposed, and experiments examine how, under the new rule structure, the number of referential values of the antecedent attributes of belief rules affects the inference...

9.
Most existing imputation methods for incomplete data are limited to a single type of missing variable and are relatively weak on large-scale data. To handle the missingness of mixed-type variables in real big data, this paper proposes a new model, SXGBI (Spark-based eXtreme Gradient Boosting Imputation), which imputes incomplete data containing both continuous and categorical missing variables and generalizes to fast processing of big data. By improving the ensemble learning method XGBoost, the approach combines multiple completion algorithms into an ensemble learner; a parallel design on the Spark distributed computing framework allows it to run well on Spark clusters. Experiments show that as the missing ratio grows, SXGBI achieves better imputation results than the other imputation methods tested on the RMSE, PFC, and F1 metrics, and that it can be applied effectively to large-scale datasets.
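A much-simplified, single-machine Python sketch of boosted-tree imputation; the paper's SXGBI modifies XGBoost and parallelizes it on Spark, whereas here scikit-learn's gradient boosting stands in and all details (initial fill, column order) are illustrative:

```python
# Sketch: impute each incomplete column with a boosted-tree model trained on
# the other columns; a classifier is used for categorical targets.
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              GradientBoostingRegressor)

def boosted_impute(X, categorical):
    """X: 2-D float array with NaN; categorical: set of column indices."""
    # rough initial fill so predictor columns are complete
    Xf = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        if not miss.any():
            continue
        model = (GradientBoostingClassifier() if j in categorical
                 else GradientBoostingRegressor())
        others = [c for c in range(X.shape[1]) if c != j]
        model.fit(Xf[~miss][:, others], X[~miss, j])
        Xf[miss, j] = model.predict(Xf[miss][:, others])
    return Xf
```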

10.
The article presents an experimental study of multiclass Support Vector Machine (SVM) methods on a cardiac arrhythmia dataset with missing attribute values for an electrocardiogram (ECG) diagnostic application. The presence of an incomplete dataset and high data dimensionality can affect the performance of classifiers. Imputation of missing data and discriminant analysis are commonly used as preprocessing techniques in such large datasets. The article proposes experiments to evaluate the performance of the One-Against-All (OAA) and One-Against-One (OAO) approaches in kernel multiclass SVM for a heartbeat classification problem with imputation and dimension reduction techniques. The results indicate that the OAA approach is superior to OAO in multiclass SVM for ECG data analysis with missing values.

11.
Completeness is an important dimension of data quality. The inherent uncertainty of data and the randomness and inaccuracy of data collection produce, in real applications, many datasets with two characteristics: (1) the data volume is huge, and (2) the data are often incomplete and inaccurate. Partitioning a large dataset into data windows is therefore an important processing technique, but most research on missing-value estimation ignores these dataset characteristics and the use of windows, and a fixed-size data window makes an algorithm's accuracy and performance sensitive to the window size and to the distribution of values inside the window. Assuming the data satisfy certain domain-specific constraints, this paper first proposes a new time-based, dynamically adaptive data-window detection algorithm and then, on top of this window, an improved fuzzy k-means clustering algorithm for estimating the missing values of incomplete data. Experiments show that, compared with other algorithms, the method adapts better to the characteristics of the dataset, performs well, and maintains accuracy.

12.
This paper proposes to utilize the information within incomplete instances (instances with missing values) when estimating missing values. Accordingly, a simple and efficient nonparametric iterative imputation algorithm, called the NIIA method, is designed for iteratively imputing missing target values. The NIIA method imputes each missing value several times until the algorithm converges. In the first iteration, all the complete instances are used to estimate missing values; the information within incomplete instances is utilized from the second iteration onward. We conduct experiments to evaluate its efficiency and demonstrate that: (1) utilizing the information within incomplete instances helps to capture the distribution of a dataset; and (2) the NIIA method outperforms the existing methods in accuracy, an advantage that is clearly highlighted when datasets have a high missing ratio.
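An illustrative Python sketch of the iterative scheme: the first pass uses only information from complete data (here, column means), and later passes re-impute each missing value from all rows, including previously imputed ones, until convergence. The kNN estimator, `k`, and the tolerance are assumptions:

```python
# Sketch: iterative nonparametric imputation; incomplete rows contribute
# to the neighbour pool from the second pass onward.
import numpy as np

def iterative_impute(X, k=5, max_iter=20, tol=1e-4):
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    Xf = X.copy()
    # first pass: column means, i.e. complete information only
    # (assumes every column has at least one observed value)
    col_means = np.nanmean(X, axis=0)
    Xf[mask] = np.take(col_means, np.where(mask)[1])
    for _ in range(max_iter):
        prev = Xf.copy()
        for i in np.where(mask.any(axis=1))[0]:
            others = np.delete(np.arange(len(Xf)), i)
            d = np.linalg.norm(Xf[others] - Xf[i], axis=1)
            nn = others[np.argsort(d)[:k]]    # neighbours may be imputed rows
            Xf[i, mask[i]] = Xf[nn][:, mask[i]].mean(axis=0)
        if np.abs(Xf - prev).max() < tol:     # converged
            break
    return Xf
```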

13.
Self-organising maps (SOM) have become a commonly used cluster analysis technique in data mining. However, SOM cannot process incomplete data. To extend the data mining capability of SOM, this study proposes an SOM-based fuzzy map model for data mining with incomplete datasets. Using this model, incomplete data are translated into fuzzy data and used to generate fuzzy observations. These fuzzy observations, along with observations without missing values, are then used to train the SOM to generate fuzzy maps. Compared with the standard SOM approach, the fuzzy maps generated by the proposed method can provide more information for knowledge discovery.

14.
Skyline queries are extensively incorporated in various real-life applications for filtering uninteresting data objects. A skyline query may return too many results because it cannot control the retrieval conditions, especially for high-dimensional datasets. As an extension of the skyline query, the k-dominant skyline query relaxes dominance over all dimensions by controlling the value of the parameter k, thereby reducing the number of retrieved objects. In addition, with the continuous spread of Big Data applications, acquired data may lack some of the intended content for practical reasons such as delivery failure, battery exhaustion, or accidental loss, so the data may be incomplete, with missing values in some attributes. Existing k-dominant skyline query algorithms for incomplete data depend to some degree on user-supplied definitions, and their results cannot be shared; they are also unsuitable for direct use on incomplete big data. Given these situations, this paper studies the k-dominant skyline query problem over incomplete datasets and combines it with a distributed structure, the MapReduce environment. First, we propose an index structure over incomplete data, named the incomplete data index based on dominant hierarchical tree (ID-DHT). Applying a bucket strategy, the incomplete data are divided into buckets according to the dimensions of their missing attributes. Second, we put forward a query algorithm for incomplete data in the MapReduce environment, named the MapReduce incomplete data dominant hierarchical tree algorithm (MR-ID-DHTA). The Map function allocates the data in each bucket to subspaces according to the dominance condition, and the Reduce function organizes the data by key and returns the k-dominant skyline query result. Experiments demonstrate the validity and usability of our index structure and algorithm.
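The k-dominance test at the core of such queries is easy to sketch. The following illustrative Python version assumes smaller-is-better on every dimension and compares only the dimensions observed in both rows; the paper's exact semantics over missing values may differ:

```python
# Sketch: p k-dominates q if, on the dimensions both rows have observed,
# p is no worse on at least k of them and strictly better on at least one.
import math

def k_dominates(p, q, k):
    common = [i for i in range(len(p))
              if not math.isnan(p[i]) and not math.isnan(q[i])]
    no_worse = sum(1 for i in common if p[i] <= q[i])
    better = any(p[i] < q[i] for i in common)
    return no_worse >= k and better

# e.g. with k = 2: (1, nan, 3) vs (2, 5, 3) compares on dims 0 and 2
# -> no_worse = 2, strictly better on dim 0 -> True
```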

15.
A Missing-Value Imputation Algorithm Based on Rough Set Theory   (total citations: 2, self: 1, other: 1)
This paper analyzes how to impute data effectively in datasets containing missing values so as to reflect the intrinsic relationships implicit in the data more objectively. Drawing on ideas and methods from rough set theory, an efficient equivalence-class partitioning method is proposed, and on this basis a rough-set-based missing-value imputation algorithm is given that improves both the efficiency and the accuracy of imputation. Data experiments demonstrate the effectiveness and feasibility of the method.
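A minimal Python sketch of the equivalence-class idea (illustrative assumptions, not the paper's algorithm): rows indiscernible from an incomplete row on its observed attributes form its class, and missing entries are filled with the most common value in that class:

```python
# Sketch: rough-set-style fill from equivalence classes over observed attrs.
# Rows are lists of hashable values, with None marking a missing entry.
from collections import Counter

def rough_set_fill(rows):
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        miss = [j for j, v in enumerate(row) if v is None]
        if not miss:
            continue
        obs = [j for j, v in enumerate(row) if v is not None]
        # equivalence class: rows indiscernible from `row` on observed attrs
        eq = [r for r in rows if all(r[j] == row[j] for j in obs)]
        for j in miss:
            vals = [r[j] for r in eq if r[j] is not None]
            if vals:
                filled[i][j] = Counter(vals).most_common(1)[0][0]
    return filled
```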

16.
After an earthquake, every damaged building needs to be properly evaluated in order to determine its capacity to withstand aftershocks and to assess whether it is safe for occupants to return. These evaluations are time-sensitive: the quicker they are completed, the less costly the disaster will be in terms of lives and dollars lost. There is often not sufficient time or resources to acquire all the information about a structure needed for a high-level structural analysis. The post-earthquake damage survey data may be incomplete and contain missing values, which delays the analytical procedure or even makes structural evaluation impossible. This paper proposes a novel multiple imputation (MI) approach that addresses the missing data problem by filling in each missing value with multiple realistic, valid candidates, accounting for the uncertainty of missing data. The proposed method, called sequential regression-based predictive mean matching (SRB-PMM), utilizes Bayesian parameter estimation to sequentially infer the model parameters for variables with missing values, conditioned on the fully observed and imputed variables. Given the model parameters, a hybrid approach integrating PMM with a cross-validation algorithm is developed to obtain the most plausible imputed dataset. Two examples based on a database of 262 reinforced concrete (RC) column specimens subjected to earthquake loads are carried out to validate the usefulness of the SRB-PMM approach. The results from both examples suggest that the proposed approach is an effective means of handling the missing data problems prominent in post-earthquake structural evaluations.
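The core PMM step can be sketched compactly. The Python fragment below handles a single variable: regress the variable on its predictors, find observed "donor" cases whose predicted means are closest to that of each missing case, and draw a real observed value from them. The paper's Bayesian parameter draws and sequential, cross-validated loop are omitted, and the donor count is an assumption:

```python
# Sketch: core predictive-mean-matching step for one variable.
import numpy as np
from sklearn.linear_model import LinearRegression

def pmm_impute(X_obs, y_obs, X_mis, n_donors=5, rng=None):
    rng = rng or np.random.default_rng()
    model = LinearRegression().fit(X_obs, y_obs)
    pred_obs = model.predict(X_obs)            # predicted means for donors
    pred_mis = model.predict(X_mis)            # predicted means, missing cases
    filled = []
    for p in pred_mis:
        # donors: observed cases whose predicted mean is closest to p
        donors = np.argsort(np.abs(pred_obs - p))[:n_donors]
        filled.append(y_obs[rng.choice(donors)])  # draw a real observed value
    return np.array(filled)
```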

17.
An Ensemble Classification Algorithm for Imbalanced Datasets   (total citations: 1, self: 0, other: 1)
吴广潮  陈奇刚 《计算机工程与设计》2007,28(23):5687-5689,5761
To improve classification performance on the minority class, an ensemble classifier algorithm based on data preprocessing is studied. The dataset is first preprocessed with Tomek links; the majority-class samples in the new dataset are then split, according to the imbalance ratio, into several subsets, each of which is merged with the minority-class samples to form a new subset. A least-squares support vector machine is trained on each new subset, and the trained sub-classifiers are combined into a classification system in which the class of a new test sample is decided by voting. Experimental results show that, in classification performance on both the majority and the minority class, the algorithm outperforms both the least-squares SVM oversampling method and the undersampling method.
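An illustrative Python sketch of the scheme, with assumed details: Tomek links are removed, the majority class is split into balanced parts according to the imbalance ratio, one classifier is trained per part (a linear SVC stands in for the least-squares SVM), and test samples are classified by majority vote:

```python
# Sketch: Tomek-link cleaning + balanced majority-class splits + voting.
# X is a 2-D numpy array, y a 1-D numpy array of labels.
import numpy as np
from sklearn.svm import LinearSVC

def remove_tomek_links(X, y, majority):
    nn = []
    for i in range(len(X)):                   # each point's nearest neighbour
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        nn.append(int(np.argmin(d)))
    # Tomek link: mutual nearest neighbours with different labels;
    # drop the majority-class member of each link
    drop = {i for i in range(len(X))
            if nn[nn[i]] == i and y[i] != y[nn[i]] and y[i] == majority}
    keep = [i for i in range(len(X)) if i not in drop]
    return X[keep], y[keep]

def train_ensemble(X, y, majority):
    X, y = remove_tomek_links(X, y, majority)
    maj_X = X[y == majority]
    min_X, min_y = X[y != majority], y[y != majority]
    n_parts = max(1, len(maj_X) // max(1, len(min_X)))  # imbalance ratio
    models = []
    for part in np.array_split(maj_X, n_parts):  # one balanced subset each
        Xs = np.vstack([part, min_X])
        ys = np.concatenate([np.full(len(part), majority), min_y])
        models.append(LinearSVC().fit(Xs, ys))
    return models

def vote(models, x):
    preds = [m.predict(x.reshape(1, -1))[0] for m in models]
    return max(set(preds), key=preds.count)
```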

18.
A Knowledge Discovery Method for Environments with Partially Missing Data   (total citations: 12, self: 0, other: 0)
王清毅  蔡智  邹翔  蔡庆生 《软件学报》2001,12(10):1516-1524
This paper reviews current research on knowledge discovery in incomplete-data environments and presents, in two parts, a knowledge discovery method for incomplete databases. It first discusses in detail how to guess the missing data, giving a definition of distance-based association rules and a method for mining them. On this basis, it then describes in detail a knowledge discovery algorithm for incomplete databases, analyzes the algorithm's complexity, and gives corresponding experimental results. Finally, the proposed method is compared with other related methods.

19.
Data for classification are often incomplete. The multiple-values construction method (MVCM) can be used to include data with missing values in classification. In this study, the MVCM is implemented using fuzzy set theory in the context of classification with discrete data. With the fuzzy-set-based MVCM, data with missing values can add value to classification, but they can also introduce excessive uncertainty. Furthermore, the computational cost of using incomplete data can be prohibitive if the scale of missing values is large. This paper discusses the association between classification performance and the use of incomplete data, and proposes an algorithm for near-optimal use of incomplete classification data. An experiment with real-world data demonstrates the usefulness of the algorithm.

20.
Phasor Measurement Units (PMUs) produce synchronized phasor measurements with high resolution, making real-time monitoring and control of the power system in smart grids possible. PMUs transmit data to Phasor Data Concentrators (PDCs) placed in control centers for monitoring purposes. A primary concern of system operators in control centers is maintaining safe and efficient operation of the power grid. This can be achieved by continuous monitoring of the PMU data, which contains both normal and abnormal data: normal data indicate the normal behavior of the grid, whereas abnormal data indicate fault or abnormal conditions in the power grid. As a result, detecting anomalies and abnormal conditions in the fast-flowing PMU data that reflects the status of the power system is critical. A novel methodology for detecting and categorizing abnormalities in streaming PMU data is presented in this paper. The proposed method consists of three modules: an offline Gaussian Mixture Model (GMM), an online GMM for identifying anomalies, and a clustering ensemble model for classifying the anomalies. The significant features of the proposed method are detecting anomalies while taking into account the multivariate nature of the PMU dataset, adapting to concept drift in the flowing PMU data without unnecessarily retraining the existing model, and classifying the anomalies. The proposed model is implemented in Python, and the test results show that it is well suited for detection and classification of anomalies on the fly.
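A small Python sketch of the offline stage and the per-sample anomaly test: fit a Gaussian mixture to historical (mostly normal) PMU measurements and flag streaming samples whose log-likelihood falls below a threshold. The component count, the percentile threshold, and the omission of the online-update and clustering-ensemble modules are all simplifying assumptions:

```python
# Sketch: offline GMM fit plus a likelihood-threshold anomaly test.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_offline(X_train, n_components=4, pct=1.0):
    gmm = GaussianMixture(n_components=n_components).fit(X_train)
    # threshold: the 1st percentile of training log-likelihoods
    threshold = np.percentile(gmm.score_samples(X_train), pct)
    return gmm, threshold

def is_anomaly(gmm, threshold, x):
    # flag a streaming sample whose log-likelihood is below the threshold
    return gmm.score_samples(x.reshape(1, -1))[0] < threshold
```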
