Recent trends in the management of water supply have increased the need for modelling techniques that can provide reliable, efficient, and accurate representation of the complex, non-linear dynamics of water quality within water distribution systems. Statistical models based on artificial neural networks (ANNs) have been found to be highly suited to this application, and offer distinct advantages over more conventional modelling techniques. However, many practitioners use somewhat heuristic or ad hoc methods for input variable selection (IVS) during ANN development. This paper describes the application of a newly proposed non-linear IVS algorithm to the development of ANN models to forecast water quality within two water distribution systems. The intention is to reduce the need for arbitrary judgement and extensive trial-and-error during model development. The algorithm uses the concept of partial mutual information (PMI) to select inputs based on the analysis of relationship strength between inputs and outputs, and between redundant inputs. In comparison with an existing approach, the ANN models developed using the IVS algorithm are found to provide optimal prediction with significantly greater parsimony. Furthermore, the results obtained from the IVS procedure are useful for developing additional insight into the important relationships that exist between water distribution system variables.
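
The PMI idea in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: it assumes Gaussian kernels, the Gaussian reference bandwidth, Nadaraya-Watson regression for the conditional expectations, and synthetic data, and every function name is hypothetical. PMI is taken here as the mutual information between the residuals of a candidate input and of the output after the effect of an already selected input has been removed, so a redundant candidate should score much lower on PMI than on plain MI.

```python
import math
import random


def standardize(v):
    """Rescale a sample to zero mean and unit standard deviation."""
    m = sum(v) / len(v)
    s = math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))
    return [(a - m) / s for a in v]


def kernel_mi(x, y):
    """KDE-based mutual information estimate (in nats) between two 1-D samples."""
    x, y = standardize(x), standardize(y)
    n = len(x)
    h1 = 1.06 * n ** (-1 / 5)  # Gaussian reference bandwidth, 1-D (assumption)
    h2 = n ** (-1 / 6)         # Gaussian reference bandwidth, 2-D (assumption)
    total = 0.0
    for i in range(n):
        fx = fy = fxy = 0.0
        for j in range(n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            fx += math.exp(-0.5 * (dx / h1) ** 2)
            fy += math.exp(-0.5 * (dy / h1) ** 2)
            fxy += math.exp(-0.5 * (dx * dx + dy * dy) / h2 ** 2)
        fx /= n * h1 * math.sqrt(2 * math.pi)
        fy /= n * h1 * math.sqrt(2 * math.pi)
        fxy /= n * h2 ** 2 * 2 * math.pi
        total += math.log(fxy / (fx * fy))
    return total / n


def kernel_residuals(target, given):
    """Residuals of `target` after Nadaraya-Watson regression on `given`."""
    z = standardize(given)
    n = len(z)
    h = 1.06 * n ** (-1 / 5)
    residuals = []
    for i in range(n):
        w = [math.exp(-0.5 * ((z[i] - z[j]) / h) ** 2) for j in range(n)]
        fitted = sum(wj * tj for wj, tj in zip(w, target)) / sum(w)
        residuals.append(target[i] - fitted)
    return residuals


def partial_mi(candidate, output, selected):
    """MI between candidate and output once `selected` is accounted for."""
    return kernel_mi(kernel_residuals(candidate, selected),
                     kernel_residuals(output, selected))


# Synthetic data (assumption): x2 is a noisy copy of x1, y depends on x1 only.
rng = random.Random(0)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + 0.3 * rng.gauss(0, 1) for a in x1]
y = [math.tanh(2 * a) + 0.2 * rng.gauss(0, 1) for a in x1]

mi_x2 = kernel_mi(x2, y)         # x2 looks informative on its own
pmi_x2 = partial_mi(x2, y, x1)   # but adds little once x1 is selected
print(f"MI(x2; y) = {mi_x2:.3f}, PMI(x2; y | x1) = {pmi_x2:.3f}")
```

With several selected inputs the conditioning step would need multivariate kernel regression; the single-input case is kept here for brevity.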

Input variable selection (IVS) is one of the most important steps in the development of artificial neural network and other data driven environmental and water resources models. Partial mutual information (PMI) is one of the most promising approaches to IVS, but has the disadvantage of requiring kernel density estimates (KDEs) of the data to be obtained, which can become problematic when the data are non-normally distributed, as is often the case for environmental and water resources problems. In order to overcome this issue, preliminary guidelines for the selection of the most appropriate methods for obtaining the required KDEs are determined based on the results of 3780 trials using synthetic data with distributions of varying degrees of non-normality and six different KDE techniques. The validity of the guidelines is confirmed for two semi-real case studies developed based on the forecasting of river salinity and rainfall-runoff modelling problems.

Input variable selection (IVS) is vital in the development of data-driven models. Among different IVS methods, partial mutual information (PMI) has shown significant promise, although its performance has been found to deteriorate for non-Gaussian and non-linear data. In this paper, the effectiveness of different approaches to improving PMI performance is investigated, focussing on boundary issues associated with bandwidth estimation. Boundary issues, associated with kernel-based density and residual computations within PMI, arise from the extension of symmetrical kernels beyond the feasible bounds of potential inputs, and result in an underestimation of kernel-based marginal and joint probability distribution functions in the PMI. In total, the effectiveness of 16 different approaches is tested on synthetically generated data and the results are used to develop preliminary guidelines for PMI IVS. By using the proposed guidelines, the correct inputs can be identified in 100% of trials, even if the data are highly non-linear or non-Gaussian.
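
The boundary effect described in the abstract above can be reproduced in a few lines. The sketch below is only an illustration under stated assumptions: it uses a Gaussian kernel, a fixed bandwidth chosen by hand, and reflection of the sample about the bounds as one well-known correction. The abstract does not say which of its 16 approaches use reflection, and the function names are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, sample, h):
    """Plain Gaussian kernel density estimate at x with bandwidth h."""
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (len(sample) * h)


def kde_reflected(x, sample, h, lower, upper):
    """Gaussian KDE with the sample mirrored about both bounds (assumed correction)."""
    if x < lower or x > upper:
        return 0.0
    total = 0.0
    for xi in sample:
        total += gaussian_kernel((x - xi) / h)
        total += gaussian_kernel((x - (2 * lower - xi)) / h)  # mirror at lower bound
        total += gaussian_kernel((x - (2 * upper - xi)) / h)  # mirror at upper bound
    return total / (len(sample) * h)


# Evenly spaced points on [0, 1] stand in for a uniform sample (true density = 1).
sample = [(i + 0.5) / 200 for i in range(200)]
h = 0.05  # bandwidth picked by hand for the illustration

plain_at_bound = kde(0.0, sample, h)
reflected_at_bound = kde_reflected(0.0, sample, h, 0.0, 1.0)
print(f"plain KDE at 0: {plain_at_bound:.3f}, reflected KDE at 0: {reflected_at_bound:.3f}")
```

At the lower bound about half of each symmetric kernel lies outside the feasible range, so the plain estimate comes out near half the true density, while the reflected estimate recovers a value close to 1.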

Input Variable Selection (IVS) is an essential step in the development of data-driven models and is particularly relevant in environmental modelling. While new methods for identifying important model inputs continue to emerge, each has its own advantages and limitations and no single method is best suited to all datasets and modelling purposes. Rigorous evaluation of new and existing input variable selection methods would allow the effectiveness of these algorithms to be properly identified in various circumstances. However, such evaluations are largely neglected due to the lack of guidelines or precedent to facilitate consistent and standardised assessment. In this paper, a new framework is proposed for the evaluation and inter-comparison of IVS methods which takes into account: (1) a wide range of dataset properties that are relevant to real world environmental data, (2) assessment criteria selected to highlight algorithm suitability in different situations of interest, and (3) a website for sharing data, algorithms and results (http://ivs4em.deib.polimi.it/). The framework is demonstrated on four IVS algorithms commonly used in environmental modelling studies and twenty-six datasets exhibiting different typical properties of environmental data. The main aim at this stage is to demonstrate the application of the proposed evaluation framework, rather than provide a definitive answer as to which of these algorithms has the best overall performance. Nevertheless, the results indicate interesting differences in the algorithms' performance that have not been identified previously.

An algorithm is proposed for calculating correlation measures based on entropy. The proposed algorithm allows exhaustive exploration of variable subsets on real data. Its time efficiency is demonstrated by comparison against three other variable selection methods based on entropy using 8 data sets from various domains as well as simulated data. The method is applicable to discrete data with a limited number of values making it suitable for medical diagnostic support, DNA sequence analysis, psychometry and other domains.
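
To show why exhaustive exploration of variable subsets can matter for entropy-based measures on discrete data, the sketch below scores every subset of up to two variables by its mutual information with the target. This is a naive brute-force illustration, not the efficient algorithm of the abstract; the XOR data and all names are assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations, product
import math


def entropy(rows):
    """Shannon entropy (bits) of a list of hashable outcomes."""
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in Counter(rows).values())


def subset_mi(columns, target):
    """I(X_subset; Y) = H(X_subset) + H(Y) - H(X_subset, Y) for discrete data."""
    joint_x = list(zip(*columns))
    joint_xy = list(zip(joint_x, target))
    return entropy(joint_x) + entropy(target) - entropy(joint_xy)


# Toy data (assumption): all combinations of three bits, target = x0 XOR x1.
rows = list(product([0, 1], repeat=3))
columns = [[r[k] for r in rows] for k in range(3)]
target = [r[0] ^ r[1] for r in rows]

# Score every subset of one or two variables.
scores = {}
for size in (1, 2):
    for subset in combinations(range(3), size):
        scores[subset] = subset_mi([columns[k] for k in subset], target)

best_subset = max(scores, key=scores.get)
print(best_subset, scores[best_subset])
```

Each variable taken alone carries no information about the XOR target, so a ranking of single variables would miss the pair that determines it completely; only the subset-level score reveals it.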

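To address the problems of variable selection and network structure design in regression, a training algorithm for the extreme learning machine (ELM) based on mutual information is proposed, which simultaneously selects the input variables and optimises the structure of the hidden layer. The algorithm embeds mutual-information-based input variable selection in the learning process of the ELM network, uses the learning performance of the network as the indicator of whether an input variable is relevant to the output variable, and determines the number of hidden-layer nodes incrementally. Simulation results on the Lorenz and Gas Furnace data and on 10 benchmark data sets demonstrate the effectiveness of the proposed algorithm. The algorithm not only simplifies the network structure but also improves the generalisation performance of the network.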


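In the analysis of high-dimensional data such as image, gene and text data, redundant features in the samples greatly increase the complexity of the analysis, so removing redundant features before the analysis is particularly important. Feature selection methods based on mutual information (MI) can effectively reduce the dimensionality of the data and improve the accuracy of the results. However, existing methods use a single criterion to judge whether a feature is redundant, cannot reasonably exclude redundant features, and ultimately degrade the analysis results. To address this, a feature selection method based on maximum joint conditional mutual information (MCJMI) is proposed. When selecting features, MCJMI considers two factors, the overall joint mutual information and the conditional mutual information, and combining the two strengthens the constraints on feature selection. In terms of average prediction accuracy, MCJMI improves on information gain (IG) and minimum redundancy maximum relevance (mRMR) feature selection by 6 percentage points, on joint mutual information (JMI) and joint mutual information maximisation (JMIM) by 2 percentage points, and on the LW forward search method (SFS-LW) by 1 percentage point. In terms of stability, MCJMI reaches 0.92, which is better than JMI, JMIM and SFS-LW. The experimental results show that MCJMI can effectively improve the accuracy and stability of feature selection.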

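Feature selection models for neighbourhood information systems suffer from the problem that the value of the neighbourhood parameter has to be set manually. To address this, the distances from each sample to its nearest sample of the same class and to its nearest sample of a different class are computed and used to define the nearest neighbourhood of the sample, which determines the size of the information granule. The nearest-neighbour concept is then extended to information theory, and nearest-neighbour mutual information is proposed. On this basis, a feature selection algorithm based on nearest-neighbour mutual information is constructed using a forward greedy search strategy. Experiments are carried out with two different base classifiers on eight UCI data sets. The results show that, compared with several currently popular algorithms, the model achieves higher classification performance with fewer features.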

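Principal component analysis (PCA) is a commonly used feature selection algorithm. The classical method computes the correlations between features, but correlation cannot assess non-linear relationships between variables. Mutual information can measure the strength of the dependence between two variables and is not limited to linear correlation. In view of this, a PCA feature selection algorithm based on mutual information is proposed. The algorithm computes the mutual information between features, uses the eigenvalues of the mutual information matrix as the criterion for determining the number of principal components, and evaluates the effect of PCA feature selection. The proposed method is compared with traditional PCA through examples, and the classification performance is analysed using a neural network as the classifier.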

This paper developed a new variable selection method for soft sensor applications using the nonnegative garrote (NNG) and an artificial neural network (ANN). The proposed method first trains the ANN to obtain a well-trained network, and then uses the NNG to shrink the input weights of the ANN. The Bayesian information criterion was used as the model evaluation criterion, and the optimal garrote parameter s was determined by v-fold cross-validation. The performance of the proposed algorithm was compared with that of existing state-of-the-art variable selection methods. Two artificial data sets and a real industrial application to an air separation process were used to demonstrate the performance of the methods. The experimental results showed that the proposed method achieved better model accuracy with fewer selected variables than the other state-of-the-art methods.
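
For orientation, the nonnegative garrote in its original linear-regression form can be written as below. This is the standard formulation, not the ANN variant of the abstract above, whose exact objective is not given here. The symbols are assumptions for the illustration: $\hat{\beta}_j$ are initial least-squares coefficients, $x_{ij}$ is the value of input $j$ for sample $i$, $c_j$ are the shrinkage factors, and $s$ is the garrote parameter.

```latex
\begin{aligned}
\hat{c}(s) = \arg\min_{c}\; & \sum_{i=1}^{n}\Bigl(y_i - \sum_{j=1}^{p} c_j\,\hat{\beta}_j\,x_{ij}\Bigr)^{2} \\
\text{subject to}\; & c_j \ge 0 \;\; (j = 1,\dots,p), \qquad \sum_{j=1}^{p} c_j \le s .
\end{aligned}
```

As $s$ decreases, more of the factors $c_j$ are driven to exactly zero, which removes the corresponding inputs. In the method of the abstract the analogous shrinkage is applied to the input weights of the trained ANN.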

