Similar literature
 20 similar documents found; search took 15 ms
1.
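Missing data degrade data quality and can make analysis results inaccurate and models less reliable; imputing missing values reduces bias and eases downstream analysis. Most imputation algorithms assume that multiple missing values are weakly correlated or uncorrelated, and few consider the correlation among missing values or the order in which they are filled. In the sales domain, imputing missing values independently uses less of the available information and substantially lowers imputation accuracy. To address these problems, this paper takes the sales domain as its target. Based on the multidimensional features of sales behavior, and using the spatial distribution characteristics of the outputs of different models, it explores an update mechanism for imputing multiple missing values and studies an incremental imputation method for multiple missing values in sales data: features with missing values are ranked by feature correlation, and already-imputed data are fused in as information for incrementally imputing the later missing values. The algorithm takes into account both model generalization and the information correlation among missing data and, combined with multi-model fusion, imputes multiple missing values effectively. Finally, extensive comparative experiments on a real chain-pharmacy sales dataset verify the effectiveness of the proposed algorithm.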
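The ordering idea in this abstract (rank the incomplete features by correlation, then let already-imputed features inform later ones) can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it swaps in a one-predictor least-squares fit for the multi-model fusion the abstract mentions, and it assumes that at least one column is complete and that every column has at least one observed value. The column names and numbers are hypothetical.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation; 0.0 when either side has no variance."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = 0.0 if sxx == 0 else sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def incremental_impute(data):
    """data maps column name -> list of floats, with None marking a missing value.

    Repeatedly picks the incomplete column that correlates most strongly with
    any complete column, fills it, and then treats it as complete.
    """
    filled = {col: list(vals) for col, vals in data.items()}
    complete = [c for c in filled if None not in filled[c]]
    pending = [c for c in filled if c not in complete]
    order = []
    while pending:
        best = None  # (|r|, target column, predictor column, observed row indices)
        for target in pending:
            rows = [i for i, v in enumerate(filled[target]) if v is not None]
            for pred in complete:
                r = abs(pearson([filled[pred][i] for i in rows],
                                [filled[target][i] for i in rows]))
                if best is None or r > best[0]:
                    best = (r, target, pred, rows)
        _, target, pred, rows = best
        slope, intercept = fit_line([filled[pred][i] for i in rows],
                                    [filled[target][i] for i in rows])
        filled[target] = [intercept + slope * filled[pred][i] if v is None else v
                          for i, v in enumerate(filled[target])]
        pending.remove(target)
        complete.append(target)  # later columns may now use this one as a predictor
        order.append(target)
    return filled, order

# Hypothetical sales-like columns, for illustration only.
data = {
    "units":    [1.0, 2.0, 3.0, 4.0, 5.0],
    "revenue":  [10.0, 20.0, None, 40.0, 50.0],
    "discount": [None, 7.0, 9.0, 8.0, 3.0],
}
filled, order = incremental_impute(data)
```

In this toy data `revenue` is filled first because it correlates perfectly with `units`; once filled, it joins the pool of candidate predictors for `discount`.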

2.
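The analysis and imputation of incomplete data have long been a hot research topic in big-data processing. Traditional analysis methods cannot cluster incomplete data directly; most first impute the missing values and then cluster the data. These methods generally use the whole dataset to impute the missing data, which leaves the imputed values vulnerable to noise, makes the imputation imprecise, and in turn yields very low clustering accuracy. This paper proposes a clustering algorithm for incomplete data: it redefines the similarity formula for incomplete information systems and gives a similarity measure between incomplete data objects, so that incomplete data can be clustered directly. Based on the clustering result, objects of the same class are assigned to the same cluster, and missing values are imputed from the attribute values of objects in the same class, which avoids interference from noise and improves imputation accuracy. Experimental results show that the proposed method can cluster incomplete data and effectively improves the accuracy of missing-data imputation.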
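The cluster-first, fill-second workflow described here can be sketched roughly as below. The abstract does not give the redefined similarity formula, so the measure used here (inverse of the mean absolute difference over jointly observed attributes) and the greedy single-pass clustering are stand-in assumptions, and the data are hypothetical.

```python
from statistics import mean

def similarity(a, b):
    """Similarity over jointly observed attributes; None marks a missing value."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return 0.0  # no attribute observed in both objects
    dist = sum(abs(x - y) for x, y in shared) / len(shared)
    return 1.0 / (1.0 + dist)

def cluster(rows, threshold):
    """Greedy single pass: join the first cluster whose seed row is similar enough."""
    clusters = []  # each cluster is a list of row indices; the first index is the seed
    for i, row in enumerate(rows):
        for members in clusters:
            if similarity(rows[members[0]], row) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def fill_within_clusters(rows, clusters):
    """Replace each missing value by the mean of the values observed in its cluster."""
    filled = [list(r) for r in rows]
    for members in clusters:
        for j in range(len(rows[0])):
            observed = [rows[i][j] for i in members if rows[i][j] is not None]
            if not observed:
                continue  # nothing to borrow from in this cluster; leave the gap
            for i in members:
                if filled[i][j] is None:
                    filled[i][j] = mean(observed)
    return filled

rows = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [9.0, None, 8.0],
    [9.0, 7.0, 8.0],
]
clusters = cluster(rows, threshold=0.5)
filled = fill_within_clusters(rows, clusters)
```

Because each gap is filled only from members of its own cluster, values from the distant cluster never contribute, which is the noise-avoidance argument the abstract makes.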

3.
Fault diagnosis on bottle filling plant using genetic-based neural network   Total citations: 1, self-citations: 0, citations by others: 1
Timely detection of pneumatic system problems is important in industry, and many techniques have been employed to solve this problem. In this paper, a genetic algorithm (GA) based optimal configuration of neural networks is proposed for fault diagnosis of bottle filling systems. The neural networks are trained with back-propagation and have six inputs and one output. A fitness function was designed to minimize the execution time of the ANN model by keeping the number of hidden layers and nodes as low as possible while minimizing the mean square error of the estimated output. The designed GA–ANN combination and the graphical user interface (GUI) eliminate the trial-and-error process of selecting the fastest and most accurate configuration. The performance of the proposed system was evaluated using experimental data collected at a pneumatic work cell that attaches caps to bottles. The sensor data were collected under normal operating conditions and under a series of imposed faults: a missing bottle, nonworking bottle caps attached at two different cylinders, two air pressure problems (insufficient and low air), and water not being filled. The study demonstrated the convenience, accuracy and speed of the proposed GA–NN environment. It may also be used for training in the selection of ANN configurations in various applications.

4.
Landsat 7 enhanced thematic mapper plus (ETM+) satellite imagery is an important data source for many applications. However, the scan line corrector (SLC) failed on 31 May 2003. As a result of the SLC failure, about 22% of the image data is missing in each scene; this is especially pronounced away from nadir. In this article, a local regression method called geographically weighted regression (GWR) is introduced for filling the gaps in Landsat ETM+ imagery, and it is compared with kriging/cokriging for this purpose. The case studies show that the GWR approach is an effective technique for filling gaps in Landsat ETM+ imagery, although the image restoration is still not perfect. GWR performed marginally better than the complex cokriging method, which has also proven to be effective but is computationally intensive. Although there are visible seam lines at the edges of the filled wide gaps in some bands, the validation results – including RMSE values, error distribution maps, and classification results for the case studies – demonstrate that the DN values estimated by GWR are in fact closer to those of the original image than the corresponding values estimated by kriging/cokriging.
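As a rough illustration of what a geographically weighted regression estimate looks like at one gap pixel, the sketch below fits a distance-weighted linear regression around the target location. The abstract does not state the predictor or the kernel; a single gap-free reference value per pixel and a Gaussian kernel are assumptions made here, and the function name, sample layout and numbers are hypothetical.

```python
import math

def gwr_predict(samples, target_xy, target_ref, bandwidth):
    """Estimate the value at target_xy from a locally weighted linear fit.

    samples: list of ((x, y), ref_value, observed_value) for valid pixels.
    Each sample is weighted by a Gaussian kernel of its distance to target_xy,
    so nearby pixels dominate the locally fitted slope and intercept.
    """
    sw = swx = swy = swxx = swxy = 0.0
    for (x, y), ref, obs in samples:
        d2 = (x - target_xy[0]) ** 2 + (y - target_xy[1]) ** 2
        w = math.exp(-d2 / (2.0 * bandwidth ** 2))
        sw += w
        swx += w * ref
        swy += w * obs
        swxx += w * ref * ref
        swxy += w * ref * obs
    denom = sw * swxx - swx ** 2
    if abs(denom) < 1e-12:
        return swy / sw  # predictor has no local variation: use the weighted mean
    slope = (sw * swxy - swx * swy) / denom
    intercept = (swy - slope * swx) / sw
    return intercept + slope * target_ref

# Hypothetical valid pixels whose observed DN equals 2 * reference + 1.
samples = [
    ((0, 0), 1.0, 3.0),
    ((1, 0), 2.0, 5.0),
    ((0, 1), 4.0, 9.0),
    ((2, 2), 3.0, 7.0),
]
estimate = gwr_predict(samples, target_xy=(1, 1), target_ref=5.0, bandwidth=1.5)
```

The bandwidth controls how local the fit is: a small value lets only the nearest pixels shape the coefficients, while a large value approaches a single global regression.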

5.
Recorded time series of relative humidity (RH) are modeled using genetic expression programming (GEP) and artificial neural network (ANN) models. The data are noisy and contain missing datapoints. RH is modeled as a function of three meteorological variables: temperature, wind speed, and pressure. Various structures of both models are investigated with the aim of testing the robustness of the predicted values in the presence of noise and missing data. Because of the noise, a sophisticated treatment of missing data was not justifiable; the strategy adopted was therefore simply to carry the datapoints backward, although this may induce bias in the time dimension and contaminate the predicted results. The results of this study indicate that, through a careful selection of model structures, both GEP and ANN can produce adequately reliable predictions of RH values 1 year into the future. The paper provides evidence that this model structure is feasible when the dependent variables include both the present and past values.
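The simple gap treatment mentioned above can be sketched in a few lines. This assumes that "carrying datapoints backward" means filling each gap with the next observed value; the abstract does not spell out the exact rule, and the humidity readings are hypothetical.

```python
def carry_backward(series):
    """Fill each None with the next observed value in time order."""
    filled = list(series)
    next_seen = None
    for i in range(len(filled) - 1, -1, -1):  # walk from the end toward the start
        if filled[i] is None:
            filled[i] = next_seen  # stays None when no later observation exists
        else:
            next_seen = filled[i]
    return filled

rh = [61.0, None, None, 58.5, None, 64.0, None]
filled = carry_backward(rh)
```

A trailing gap has no later observation to borrow from, so it is left empty here rather than guessed.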

6.
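岳根霞, 刘金花, 刘峰. 《计算机仿真》 (Computer Simulation), 2021, (1): 451-454, 459
Starting from the basic characteristics of big data and the current state of medical big-data research, the paper analyzes the problems that arise in processing and proposes a decision-tree-based method for imputing and classifying medical big data. It analyzes the association rules of medical data and mines the data with the Apriori association-analysis algorithm and the frequent-pattern growth (FP-Growth) algorithm. The missing data are filled in on the basis of the mined data, in accordance with the medical data's ...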

7.
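Handling the missing values that arise while building a decision tree is a major difficulty for decision-tree algorithms in rule extraction. After discussing the basic principles of decision-tree classification, the paper analyzes incomplete data in datasets and gives concrete ways to handle missing values. It also proposes an algorithm for filling in missing values during decision-tree construction.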

8.
"Missing is useful": missing values in cost-sensitive decision trees   Total citations: 3, self-citations: 0, citations by others: 3
Many real-world data sets for machine learning and data mining contain missing values, and much previous research regards them as a problem and attempts to impute them before training and testing. In this paper, we study this issue in cost-sensitive learning that considers both test costs and misclassification costs. If the values of some attributes (tests) are too expensive to obtain, it is more cost-effective to leave them missing, much as expensive and risky tests are skipped (missing values) in patient diagnosis (classification). That is, "missing is useful": missing values actually reduce the total cost of tests and misclassifications, so it is not meaningful to impute them. We discuss and compare several strategies that utilize only known values and exploit the fact that "missing is useful" for cost reduction in cost-sensitive decision tree learning.
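The cost argument can be made concrete with a small calculation. The sketch below assumes that total cost is the sum of the costs of the tests actually taken plus the expected misclassification cost (error rate times cost per error); the test names, prices and error rates are invented for illustration and do not come from the paper.

```python
def total_cost(tests_taken, test_costs, error_rate, misclassification_cost):
    """Cost of the tests performed plus the expected cost of misclassifying."""
    testing = sum(test_costs[name] for name in tests_taken)
    return testing + error_rate * misclassification_cost

# Hypothetical prices and error rates.
test_costs = {"blood_panel": 20.0, "mri": 400.0}

# Taking the expensive test lowers the error rate but adds its price.
with_mri = total_cost(["blood_panel", "mri"], test_costs,
                      error_rate=0.05, misclassification_cost=1000.0)

# Leaving the expensive attribute missing raises the error rate.
without_mri = total_cost(["blood_panel"], test_costs,
                         error_rate=0.20, misclassification_cost=1000.0)

skip_is_cheaper = without_mri < with_mri
```

With these made-up numbers the skipped test saves more than the extra misclassifications cost, so leaving the value missing is the cheaper choice; with a higher misclassification cost the comparison would flip.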

9.
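Missing data pose a challenge to clustering algorithms. Traditional methods usually fill in incomplete data by mean or regression imputation and then cluster the filled data. To address the degradation in imputation accuracy and clustering quality that mean and regression imputation suffer as the missing ratio grows, a new similarity computation method for incomplete data is proposed. The attributes of the dataset are ranked by expected mutual information, position-related attribute-value features in the dataset are fully taken into account, and the dataset's own elements serve as the source for filling missing values. Similarity-based imputation is then carried out on the ranked incomplete dataset, and finally a local-density-based clustering algorithm is applied. The imputation-and-clustering algorithm is validated on datasets from the UCI Machine Learning Repository. The experimental results show that, as the number of missing values grows, the algorithm tolerates missing values well, recovers missing elements effectively, and performs well in both imputation accuracy and final clustering results. Because the similarity-based imputation considers every attribute value in the dataset and fills missing values one by one, it is relatively time-consuming.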

10.
Computers & Structures, 2007, 85(3-4): 179-192
The application of artificial neural networks (ANNs) to wind engineering problems has received increasing interest in recent years. This paper is concerned with developing two ANN approaches (a backpropagation neural network [BPNN] and a fuzzy neural network [FNN]) for the prediction of mean and root-mean-square (rms) pressure coefficients and of time series of wind-induced pressures on a large gymnasium roof. In this study, simultaneous pressure measurements are made on a large gymnasium roof model in a boundary layer wind tunnel, and parts of the model test data are used as training sets for developing two ANN models that recognize the input–output patterns. The predictions of the two ANN approaches are compared with the wind tunnel test results to examine the performance of the two models. The comparison demonstrates that both approaches can successfully predict the pressures on the entire surface of the large roof on the basis of wind tunnel pressure measurements from a certain number of pressure taps. Moreover, the FNN approach is found to be superior to the BPNN approach. This study shows that the developed ANN approaches can serve as an effective tool for the design and analysis of wind effects on large roof structures.

11.
In this work, an attempt was made to derive wind speeds from the wave parameters recorded by a high-frequency (HF) radar by using an artificial neural network (ANN) and a model tree (MT) and by treating the task as an inverse-modelling problem. The time series of significant wave height, average wave period, wave direction and wind direction collected over the years 2007 and 2008 by the Bodega Marine Laboratory (BML) at Bodega Bay, California, were used along with the corresponding wind speeds measured by a floating buoy in the vicinity. The ANN and MT models were trained and tested using alternative data splits to assess their performance over varying sample sizes. Both methods worked very well in this application, with the ANN showing better flexibility in model fitting. This study thus indicates that data-driven methods can be effectively used to derive unobserved wind speed values in HF radar measurements.

12.
In this paper the assessment of the wave energy potential in nearshore coastal areas is investigated by means of artificial neural networks (ANNs). The performance of the ANNs is compared with in situ measurements and spectral numerical modelling (the conventional tool for wave energy assessment). For this purpose, 13 years of hourly records from two buoys, one offshore and one inshore, are used to develop an ANN model for predicting the nearshore wave power. The best-suited architecture was selected after assessing the performance of 480 ANN models involving twelve different architectures. The results predicted by the ANN model were compared with the measured data and with those obtained by means of the SWAN (Simulating Waves Nearshore) spectral model. The quality of the ANN model's predictions shows that this type of artificial intelligence model constitutes a powerful tool for forecasting the wave energy potential at a particular coastal site with great accuracy, and one that overcomes some of the disadvantages of the conventional tools for nearshore wave power prediction.

13.
A multisensor fusion approach to improve LAI time series   Total citations: 2, self-citations: 0, citations by others: 2
High-quality and gap-free satellite time series are required for reliable terrestrial monitoring. Moderate resolution sensors provide continuous observations at global scale for monitoring spatial and temporal variations of land surface characteristics. However, the full potential of remote sensing systems is often hampered by poor-quality or missing data caused by clouds, aerosols, snow cover, and algorithm and instrumentation problems. A multisensor fusion approach is proposed here to improve the spatio-temporal continuity, consistency and accuracy of current satellite products. It is based on neural networks, gap filling and temporal smoothing techniques, and it is applicable to any optical sensor and satellite product. In this study, the potential of the technique was demonstrated for the leaf area index (LAI) product based on MODIS and VEGETATION reflectance data. The FUSION product showed overall good agreement with the original MODIS LAI product but reduced the number of missing LAI values by 90%, with improved monitoring of vegetation dynamics, better temporal smoothness, and better agreement with ground measurements.

14.
Formerly, tree height was more difficult to measure accurately in the field than tree diameter at breast height. As a consequence, models to predict height from diameter measurements have been widely developed in the forestry literature. With airborne laser scanning technology (e.g., LiDAR), tree variables such as height and crown diameter can be measured accurately, a development that has spawned the need for models to predict diameter from airborne laser-derived measurements. Although some work has been done on fitting such models, none has incorporated spatial information to improve the accuracy of the predicted diameters. Using a simple linear model for predicting tree diameter from laser-derived tree height and crown diameter measurements, we compared the performance of ordinary least squares (OLS), generalized least squares with a non-null correlation structure (GLS), a linear mixed-effects model (LME), and geographically weighted regression (GWR). Our data were obtained from 36 sample plots established in Norway. This is the first study to examine the use of spatial statistical models for tree-level LiDAR data. Root mean square prediction errors in tree diameter are 3.5% with LME, 10% with GWR, and 17% with OLS and GLS. LME also exhibited low variability in predictive performance across all the validation classes (based on laser-derived height). Given the difficulties of using parametric statistical inference (such as maximum likelihood-based indices) for GWR, we used permutation tests to detect statistical differences. LME was significantly better than the other models, and GWR was significantly better than OLS and GLS. Our results indicate that the LME model produced the best predictions of tree diameter from LiDAR-based variables, to a degree that has not previously been possible.

15.
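刘莹, 景波, 黄兵. 《计算机工程》 (Computer Engineering), 2008, 34(13): 56-57,6
Association-rule research can already establish associations among data containing missing values, but the completeness of missing-value imputation remains insufficient. This paper uses a rule-recycling technique: association rules that were previously discarded during mining are recovered and reused by recombining them, which yields more association rules. The combined association rules obtained by recycling not only raise the fill rate and accuracy of missing-value imputation but also improve the association-rule mining method, reducing the time and space complexity of mining.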

16.
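A missing-value imputation algorithm based on rough set theory   Total citations: 2, self-citations: 1, citations by others: 1
The paper analyzes how to impute data effectively in datasets containing missing values so as to reflect more objectively the intrinsic relationships implied by the data. Drawing on the ideas and methods of rough set theory, it proposes an efficient equivalence-class partitioning method and, on that basis, a missing-value imputation algorithm based on rough set theory, which improves the efficiency and accuracy of imputation. Experiments on data demonstrate the effectiveness and feasibility of the method.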
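To make the equivalence-class idea concrete, the sketch below partitions objects by their values on chosen attributes (the rough-set indiscernibility relation) and fills a gap only when every complete object that agrees on the known attributes also agrees on the missing one. The abstract does not describe the paper's partitioning method or fill rule, so both are assumptions, and the table is hypothetical.

```python
from collections import defaultdict

def equivalence_classes(table, attrs):
    """Group row indices that are indiscernible on the given attributes."""
    classes = defaultdict(list)
    for idx, row in enumerate(table):
        classes[tuple(row[a] for a in attrs)].append(idx)
    return list(classes.values())

def fill_missing(table):
    """Fill a None only when all matching complete rows agree on a single value."""
    complete = [row for row in table if None not in row.values()]
    filled = []
    for row in table:
        new_row = dict(row)
        known = [a for a, v in row.items() if v is not None]
        for attr, value in row.items():
            if value is not None:
                continue
            candidates = {c[attr] for c in complete
                          if all(c[k] == row[k] for k in known)}
            if len(candidates) == 1:  # unambiguous within the matching objects
                new_row[attr] = candidates.pop()
        filled.append(new_row)
    return filled

# Hypothetical decision table; None marks a missing value.
table = [
    {"fever": "yes", "cough": "yes", "flu": "yes"},
    {"fever": "yes", "cough": "yes", "flu": None},
    {"fever": "no",  "cough": "yes", "flu": "no"},
    {"fever": "no",  "cough": None,  "flu": "no"},
]
classes_by_fever = equivalence_classes(table, ["fever"])
filled = fill_missing(table)
```

Gaps with conflicting or absent candidates are left untouched, which keeps the sketch from inventing values the data do not support.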

17.
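Study of element electronegativity using artificial neural networks   Total citations: 5, self-citations: 0, citations by others: 5
An artificial neural network was used with each element's first ionization energy, principal quantum number of the outermost electrons, number of outermost electrons, effective nuclear charge, and atomic radius as input parameters, and with the known Pauling and Mulliken electronegativities, respectively, as target outputs. The electronegativities of 86 elements were successfully predicted, filling in the values missing from the Pauling and Mulliken scales. The results are reasonable and reliable.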

18.
Fuzzy rule-based classification systems (FRBCSs) are known for their ability to deal with low-quality data and to obtain good results in such scenarios. However, they are rarely applied to problems with missing data, even though real-life data are frequently incomplete because attributes contain missing values. Several schemes have been studied to overcome the drawbacks produced by missing values in data mining tasks; one of the best known is based on preprocessing, known as imputation. In this work, we focus on FRBCSs, presenting and analyzing 14 different approaches to the treatment of missing attribute values. The analysis involves three different methods, in which we distinguish between Mamdani and TSK models. The results support the use of imputation methods for FRBCSs with missing values. The analysis suggests that each type behaves differently, while the use of particular missing-value imputation methods could improve the accuracy obtained by these methods. Thus, the imputation method should be chosen according to the type of FRBCS.

19.
Since the Scanning Imaging Absorption Spectrometer for Atmospheric Cartography (SCIAMACHY) instrument on the Environmental Satellite (ENVISAT) was launched in 2002, satellite CH4 measurements at regional and global scales have been available. However, the maps of atmospheric CH4 column concentrations retrieved from SCIAMACHY/ENVISAT contain many gaps of missing data. Moreover, the gridded CH4 map at 50 × 50 km is rather coarse for local interpretation. In this study, two geostatistical methods, ordinary kriging (OK) and ordinary cokriging (OCK) associated with 5 km normalized difference vegetation index (NDVI) images, were examined for filling in missing data and for downscaling the spatial resolution of CH4 images. The 50 km CH4 images interpolated by the two methods presented spatial patterns similar to those of the original 50 km CH4 image and provided good results for the missing data. According to the statistical results, the OCK method performed better than OK in filling gaps of missing data. In further downscaling the CH4 image from 50 to 5 km, the OCK method recovered a significant amount of spatial detail, and the statistical results also showed that OCK performed better than OK.

20.
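Learning the structure of decomposable Markov networks with missing data   Total citations: 14, self-citations: 0, citations by others: 14
王双成, 苑森淼. 《计算机学报》 (Chinese Journal of Computers), 2004, 27(9): 1221-1228
Learning the structure of decomposable Markov networks from data with missing values is an important and difficult research topic: missing data scramble the dependency relationships among variables, so reliable structure learning cannot be carried out directly. This paper combines the maximum likelihood tree with Gibbs sampling. Randomly initialized missing data and the maximum likelihood tree are refined through iterative correction and adjustment to obtain a repaired, complete dataset. On this basis, the structure of the decomposable Markov network is learned from the basic dependency relationships among variables using dependency analysis, which avoids the low efficiency and low reliability of existing missing-data handling methods and of existing structure learning methods for decomposable Markov networks. Experimental results show that the method can effectively learn the structure of decomposable Markov networks with missing data.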
