Similar Literature
20 similar documents found
1.
In machine learning, K-fold cross-validation is widely used for model evaluation and selection by splitting the data into multiple training and test sets, yet how to choose the number of folds K remains an open question. A premise of this data partitioning is that the training and test sets follow the same distribution, but in practice this assumption often fails. The number of folds K can therefore be chosen by measuring the distributional consistency between the training and test sets. Intuitively, the KL (Kullback-Leibler) divergence is a suitable measure, since it quantifies the discrepancy between two distributions. However, when K is selected directly from the KL divergence, experiments on multiple datasets show that the KL divergence keeps growing as K increases, which is clearly unsuitable. This paper therefore proposes a selection criterion for the number of folds K based on a regularized KL divergence: the appropriate K is the one that minimizes the regularized divergence. Experiments on several real datasets verify the effectiveness and soundness of the proposed criterion.
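A minimal Python sketch of the idea, assuming univariate Gaussian density estimates of the train/test folds; the abstract does not give the exact form of the regularizer, so the lam * log(K) correction below is purely illustrative:

```python
import numpy as np

def gaussian_kl(p_mean, p_var, q_mean, q_var):
    """Closed-form KL divergence between two univariate Gaussians."""
    return 0.5 * (np.log(q_var / p_var)
                  + (p_var + (p_mean - q_mean) ** 2) / q_var - 1.0)

def regularized_kl_for_k(x, k, lam=0.1):
    """Average train/test KL over the K folds, minus a hypothetical
    correction that offsets the inflation of the raw KL estimate on
    small test folds (the paper's actual regularizer may differ)."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    kls = []
    for i in range(k):
        test = x[folds[i]]
        train = x[np.concatenate([folds[j] for j in range(k) if j != i])]
        kls.append(gaussian_kl(train.mean(), train.var() + 1e-12,
                               test.mean(), test.var() + 1e-12))
    return np.mean(kls) - lam * np.log(k)

x = np.random.default_rng(1).normal(size=500)
best_k = min(range(2, 11), key=lambda k: regularized_kl_for_k(x, k))
print("selected K =", best_k)
```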

2.
Hyperparameter tuning is a key problem in neural network modeling. To address the shortcomings of traditional tuning methods, this paper proposes a hyperparameter tuning method based on m×2 regularized cross-validation, aiming at a robust, low-cost tuning procedure for complex models and large datasets. The idea is fourfold: tune on a small subset of the full dataset, avoiding the prohibitive cost of tuning on large data; add a regularization condition on top of m×2 cross-validation that balances the distributional difference between training and validation sets, reducing the performance fluctuation caused by distribution mismatch; use the signal-to-noise ratio as the tuning objective, so that both the mean and the variance of the performance metric are taken into account; and use orthogonal design to pick hyperparameter combinations with low correlation, improving tuning efficiency. Experiments on named entity recognition with the CoNLL 2003 dataset show that the proposed method selects hyperparameter combinations whose performance is not significantly different from grid search, while cutting tuning time by about 66%.
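A minimal sketch of the signal-to-noise-ratio objective over m×2 cross-validation scores; the distribution-balancing regularization condition and the orthogonal-design step are omitted, and `fit_eval` is a placeholder for training and scoring an actual model:

```python
import numpy as np

def snr_score(scores):
    """Signal-to-noise ratio of CV performance: mean / std.
    Jointly rewards high average performance and low variance."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean() / (scores.std(ddof=1) + 1e-12)

def m_by_2_cv_scores(fit_eval, x, y, m=3, seed=0):
    """Run m repetitions of 2-fold CV and collect the 2m validation scores.
    fit_eval(train_idx, val_idx) -> scalar performance value."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(m):
        idx = rng.permutation(len(x))
        half = len(x) // 2
        for tr, va in [(idx[:half], idx[half:]), (idx[half:], idx[:half])]:
            scores.append(fit_eval(tr, va))
    return scores

# Toy demonstration: the lambda stands in for "train model, return score".
rng = np.random.default_rng(42)
dummy = lambda tr, va: 0.9 + 0.01 * rng.standard_normal()
print("SNR:", snr_score(m_by_2_cv_scores(dummy, np.zeros(100), np.zeros(100))))
# Tuning would evaluate each candidate hyperparameter setting this way
# and keep the one with the highest SNR.
```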

3.
In traditional machine learning, model selection is usually based directly on the point estimate of some performance metric, ignoring the variance of that estimate; this omission can easily lead to selecting the wrong model. This paper therefore incorporates variance information into classification model selection to improve the generalization of the selected model: the variance estimate of the blocked 3×2 cross-validation estimate of the generalization error is added to the traditional selection criterion as a regularization term, yielding a new variance-regularized criterion for classification model selection. Experiments on simulated and real data verify that, compared with traditional methods, the proposed criterion selects the correct classification model with higher probability, confirming both the importance of variance in model selection and the effectiveness of the criterion. Furthermore, the criterion is proved to be selection-consistent for model selection in binary classification.
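The criterion itself is simple to state: penalize the mean CV error estimate by its variance. A small illustrative sketch, with made-up error values standing in for the six blocked 3×2 CV estimates:

```python
import numpy as np

def variance_regularized_criterion(fold_errors, lam=1.0):
    """criterion = mean(errors) + lam * var(errors): the variance of the
    CV estimate enters as a regularization term."""
    e = np.asarray(fold_errors, dtype=float)
    return e.mean() + lam * e.var(ddof=1)

# Two candidate models' blocked 3x2 CV errors (illustrative numbers only).
model_a = [0.12, 0.11, 0.13, 0.12, 0.11, 0.12]   # similar mean, low variance
model_b = [0.08, 0.18, 0.06, 0.20, 0.07, 0.11]   # lower mean, high variance
best = min(("A", model_a), ("B", model_b),
           key=lambda kv: variance_regularized_criterion(kv[1]))
print("selected model:", best[0])   # -> A, despite B's lower mean error
```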

4.
Model combination is an important way to improve the generalization of support vector machines, but it suffers from low computational efficiency. This paper proposes an SVM model combination method based on Bayesian model averaging along the regularization path, which improves generalization while remaining computationally efficient. An initial model set is built with the regularization path algorithm, and a probabilistic interpretation of the SVM is introduced. The model prior is treated as a Gaussian process, the posterior probability of each model is obtained via Bayes' rule, and the models are combined by Bayesian model averaging. On standard datasets, experiments compare the proposed combination method with cross-validation and generalized approximate cross-validation (GACV), verifying its effectiveness.
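A hedged sketch of the final averaging step only, assuming per-model log-likelihoods from a probabilistic (e.g. sigmoid-style) reading of the SVM outputs and a uniform model prior; the regularization-path construction of the model set is not shown:

```python
import numpy as np

def bayesian_model_average(decision_values, log_likelihoods):
    """Combine candidate models by posterior-weighted averaging.
    decision_values: (n_models, n_samples) SVM outputs on test points.
    log_likelihoods: per-model data log-likelihood; with a uniform prior,
    posterior weights are a softmax over these values."""
    ll = np.asarray(log_likelihoods, dtype=float)
    w = np.exp(ll - ll.max())     # subtract max for numerical stability
    w /= w.sum()
    return w @ np.asarray(decision_values)

# Illustrative combination of three hypothetical models on four test points.
f = [[0.9, -0.2, 0.4, -0.8], [0.7, -0.1, 0.6, -0.9], [0.8, -0.3, 0.2, -0.7]]
print(bayesian_model_average(f, log_likelihoods=[-10.0, -12.5, -11.0]))
```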

5.
Fact verification is a challenging task that aims to verify a claim with multiple evidence sentences drawn from a trustworthy corpus. To foster research, several fact verification datasets have been released, greatly accelerating progress on the task. However, existing datasets are usually constructed by crowdsourcing, which inevitably introduces bias. Prior debiasing work for fact verification falls roughly into data-augmentation-based methods, which are inflexible, and weight-regularization-based methods, which depend on uncertain outputs during training. Unlike this prior work, the paper starts from causality and proposes a debiasing method for fact verification based on counterfactual inference. It first designs a causal graph for fact verification, modeling the causal relationships among the claim, the evidence, their interaction, and the prediction. Based on this graph, it then removes the bias introduced by the claim via the total indirect effect. The model is trained with multi-task learning to capture the influence of each factor, and its performance is evaluated on both biased and unbiased test sets. Experimental results show consistent improvements over the baseline methods.

6.
The fuzzy least-squares twin support vector machine combines a fuzzy membership function with the least-squares twin SVM to cope with outlier noise in the training data and with low computational efficiency. Following the structural risk minimization principle of statistical learning theory, the regression model is first improved with L2-norm regularization. To address the training efficiency problem on large-scale datasets, an L1-norm regularized variant of the original model is also derived. Exploiting incremental learning, training proceeds by incremental selection and accumulation of the data to speed it up. The advantages of the improved algorithms are verified on UCI datasets.

7.
In software defect prediction (SDP), machine learning classifiers are usually trained on static software feature datasets such as C&K. However, most of these datasets contain few defective instances, so class imbalance is severe and the learned SDP models predict poorly. This paper uses a generative adversarial network (GAN) to generate positive samples, filters them by FID score to augment the positive class, and then, within the blocked regularized m×2 cross-validation (m×2BCV) framework, aggregates the outputs of multiple sub-models by majority voting to form the final SDP model. With 20 datasets from the PROMISE repository as experimental data and random forests as the base learner, the results show that, compared with random oversampling, SMOTE, and random undersampling, the proposed aggregated SDP model improves the average F1 by 10.2%, 5.7%, and 3.4%, respectively, with a corresponding gain in F1 stability; it achieves the highest F1 on 17 of the 20 datasets. In terms of AUC, there is no clear difference between the proposed method and the traditional sampling methods.
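The aggregation step is plain majority (mode) voting over the sub-models trained under m×2BCV; a minimal sketch with illustrative predictions:

```python
import numpy as np

def majority_vote(predictions):
    """Aggregate binary {0,1} predictions of the CV sub-models by taking
    the per-sample mode (ties resolved toward the positive class here)."""
    p = np.asarray(predictions)            # shape: (n_submodels, n_samples)
    return (p.mean(axis=0) >= 0.5).astype(int)

# Six sub-model prediction vectors (illustrative values, not from the paper).
subs = np.array([[1, 0, 1, 1],
                 [1, 0, 0, 1],
                 [0, 0, 1, 1],
                 [1, 1, 1, 0],
                 [1, 0, 1, 1],
                 [0, 0, 1, 1]])
print(majority_vote(subs))   # -> [1 0 1 1]
```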

8.
To address the limitations of single-sensitive-attribute anonymization and the threat of linkage attacks, this paper proposes a greedy-algorithm-based (αij, k, m)-anonymity model. The model protects data with multiple sensitive attributes: it assigns levels to the sensitive values of each sensitive attribute, so that m sensitive attributes yield m level tables, and a specific αij is set for each level. A greedy (αij, k, m)-anonymization algorithm with a locally optimal strategy implements the model, raising the degree of privacy protection of the data. Four models are compared in terms of information loss, execution time, and equivalence-class sensitivity distance. Experimental results show that although the proposed model takes slightly longer to run, it incurs less information loss, offers stronger privacy protection, resists linkage attacks, and protects data with multiple sensitive attributes.

9.
Three-step Bayesian combination of SVMs on the regularization path
Model combination aims to integrate multiple models from the hypothesis space to improve the stability and generalization of a learning system. Since most SVM model combination methods build candidate model sets by sampling the data, this paper studies SVM model combination based on the regularization path. It first proves the Lh-risk consistency of SVM model combination, giving a sample-based justification. It then proposes a three-step Bayesian combination method for SVMs on the regularization path. The piecewise-linear property of the SVM regularization path is used to build the initial model set, and an average generalized approximate cross-validation (GACV) pruning strategy yields the candidate model set. At test or prediction time, a nearest-neighbor rule determines the input-sensitive final combination set, and prediction is carried out by Bayesian combination. Unlike sample-based methods, the three-step approach builds the model set on the whole sample via the regularization path, so training is easy to implement and computationally efficient. The pruning strategy shrinks the model set, improving both efficiency and predictive performance. Experimental results verify the effectiveness of the three-step SVM combination on the regularization path.

10.
Text matching is one of the key technologies in retrieval systems. To address the inaccurate capture of semantic differences between texts by existing matching models, this paper proposes a text matching method based on fine-grained difference features. First, a pre-trained model serves as the base model to extract semantics and perform initial matching. Then, following the idea of adversarial learning, virtual adversarial examples are constructed at the encoding stage during training to improve the model's learning and generalization ability. Finally, fine-grained difference features are introduced to correct the initial matching prediction, effectively improving the model's ability to capture fine-grained differences and, in turn, its matching performance. Experiments on two datasets validate the method; on LCQMC it reaches 88.96% on the ACC metric, outperforming the best known models.

11.
Multitask Bregman clustering
Traditional clustering methods deal with a single clustering task on a single data set. In some newly emerging applications, multiple similar clustering tasks are involved simultaneously. In this case, we not only desire a partition for each task, but also want to discover the relationship among clusters of different tasks; utilizing the relationship among tasks is also expected to improve the individual performance of each task. In this paper, we propose general approaches to extend a wide family of traditional clustering models/algorithms to multitask settings. We first formulate multitask clustering as minimizing a loss function composed of a within-task loss and a task regularization. Then, based on general Bregman divergences, the within-task loss is defined as the average Bregman divergence from a data sample to its cluster centroid, and two types of task regularization are proposed to encourage coherence among the clustering results of the tasks. We further provide a probabilistic interpretation of the proposed formulations from the viewpoint of joint density estimation. Finally, we propose alternating procedures to solve the induced optimization problems: the clustering models and the relationships among clusters of different tasks are updated alternately, and the two phases boost each other. Empirical results on several real data sets validate the effectiveness of the proposed approaches.
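A minimal sketch of the objective, using squared Euclidean distance as the Bregman divergence (a standard special case) and a simple index-matched centroid penalty standing in for the paper's task regularizers; both choices are assumptions of the sketch:

```python
import numpy as np

def within_task_loss(X, centers, assign):
    """Average Bregman divergence (here squared Euclidean distance)
    from each sample to its assigned cluster centroid."""
    return np.mean(np.sum((X - centers[assign]) ** 2, axis=1))

def task_regularizer(centers_a, centers_b):
    """Coherence penalty between two tasks: squared distance between
    index-matched centroids (a hypothetical, simplified regularizer)."""
    return np.sum((centers_a - centers_b) ** 2)

def multitask_objective(tasks, lam=0.1):
    """tasks: list of (X, centers, assign) triples, one per clustering task."""
    loss = sum(within_task_loss(*t) for t in tasks)
    reg = sum(task_regularizer(tasks[i][1], tasks[j][1])
              for i in range(len(tasks)) for j in range(i + 1, len(tasks)))
    return loss + lam * reg

# Toy evaluation on two small synthetic tasks sharing similar centroids.
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(20, 2)), rng.normal(size=(20, 2))
C = np.array([[0.0, 0.0], [1.0, 1.0]])
a1, a2 = rng.integers(0, 2, 20), rng.integers(0, 2, 20)
print(multitask_objective([(X1, C, a1), (X2, C + 0.1, a2)]))
```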

12.
The performance of a fuzzy k-NN rule depends on the number k and a fuzzy membership array W[l, mR], where l and mR denote the number of classes and the number of elements in the reference set XR, respectively. The proposed learning procedure consists in iteratively finding the k and W that minimize the error rate estimated by the 'leaving one out' method.
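A sketch of the leave-one-out search over k; the fuzzy membership array W that the rule also learns is omitted, so this selects k for a crisp k-NN rule only:

```python
import numpy as np

def loo_error_rate(X, y, k):
    """Leave-one-out error of a crisp k-NN rule on reference set (X, y)."""
    n, errors = len(X), 0
    for i in range(n):
        d = np.sum((X - X[i]) ** 2, axis=1)
        d[i] = np.inf                       # leave sample i out
        nn = np.argsort(d)[:k]
        vals, counts = np.unique(y[nn], return_counts=True)
        errors += vals[np.argmax(counts)] != y[i]
    return errors / n

# Two synthetic Gaussian classes; pick the odd k with the lowest LOO error.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
best_k = min(range(1, 16, 2), key=lambda k: loo_error_rate(X, y, k))
print("selected k =", best_k)
```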

13.
The process of assessing the prediction ability of a computational model is called model validation. For models predicting a categorical response, the prediction ability is usually quantified by prediction measures such as sensitivity, specificity, and accuracy. This paper presents a software tool, Model Validation using Repeated Partitioning (MVREP), that implements a computer-intensive, nonparametric approach to model validation, which we call the re-partitioning method. MVREP, developed in the SAS Macro language, repeats the process of randomly partitioning a dataset and subsequently performing standard model validation procedures, such as cross-validation, a large number of times, and generates the empirical sampling distributions of the prediction measures. The means of these sampling distributions serve as point estimates of the model's prediction measures, while their variances provide a direct assessment of the variability of those point estimates. An example using a mouse developmental toxicity chemical dataset illustrates how the software can be used to assess structure-activity relationship models.
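MVREP itself is a SAS macro; the Python sketch below only mirrors the re-partitioning idea, repeating random splits to build an empirical sampling distribution of accuracy:

```python
import numpy as np

def repartition_distribution(fit_predict, X, y, n_repeats=200,
                             test_frac=0.3, seed=0):
    """Repeatedly re-partition the data and validate; return the mean
    (point estimate) and variance (variability assessment) of accuracy."""
    rng = np.random.default_rng(seed)
    n_test = int(len(X) * test_frac)
    accs = []
    for _ in range(n_repeats):
        idx = rng.permutation(len(X))
        test, train = idx[:n_test], idx[n_test:]
        y_hat = fit_predict(X[train], y[train], X[test])
        accs.append(np.mean(y_hat == y[test]))
    accs = np.asarray(accs)
    return accs.mean(), accs.var(ddof=1)

# Toy classifier: always predict the training majority class (illustration).
majority = lambda Xtr, ytr, Xte: np.full(len(Xte), np.bincount(ytr).argmax())
X, y = np.zeros((100, 2)), np.array([0] * 60 + [1] * 40)
print(repartition_distribution(majority, X, y))
```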

14.
How the weights of a combination forecasting model are determined is crucial to its accuracy. To study whether regularization and cross-validation can improve combination forecasts, this paper applies both to a least-squares-based combination forecasting model. L1- and L2-norm regularization terms are added to the optimization that solves for the combination weights, and leave-one-out cross-validation is applied to the datasets. The findings are that both L1 and L2 regularization improve the forecasting accuracy of the combined model, that L1 regularization helps more than L2 regularization, and that the more individual forecasting models participate in the combination, the larger the improvement from regularization; the improvement from cross-validation is positively correlated with the amount of experimental data available.
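A sketch of the weight-estimation step using scikit-learn's Lasso (L1) and Ridge (L2) on synthetic forecasts; any constraints the paper may place on the weights (non-negativity, summing to one) are omitted here:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# F: each column holds one individual model's forecasts; y: actual values.
rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 6, 80))
F = np.column_stack([y + rng.normal(0, s, 80) for s in (0.1, 0.2, 0.3, 0.5)])

# Combination weights from regularized least squares.
w_l1 = Lasso(alpha=0.01, fit_intercept=False).fit(F, y).coef_
w_l2 = Ridge(alpha=1.0, fit_intercept=False).fit(F, y).coef_
print("L1 weights:", np.round(w_l1, 3))   # sparse: weak models dropped
print("L2 weights:", np.round(w_l2, 3))   # shrunk but dense
```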

15.
The problem of estimating the properties of smooth, continuous contours from discrete, noisy samples is used as a vehicle to demonstrate the robustness of cross-validated regularization applied to a vision problem. A method for estimating contour properties based on smoothing spline approximations is presented. Generalized cross-validation is used to devise an automatic algorithm for finding the optimal value of the smoothing (regularization) parameter from the data. The cross-validated smoothing splines are then used to obtain optimal estimates of the derivatives of quantized contours. Experimental results demonstrate the robustness of the method applied to the estimation of curvature of quantized contours under variable scale, rotation, and partial occlusion. These results suggest the application of generalized cross-validation to other computer-vision algorithms involving regularization.
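A minimal sketch of GCV for a discrete second-difference smoother, a simple stand-in for the paper's smoothing splines; the selected lambda minimizes GCV(lam) = n * RSS / (n - trace(S))^2, where S is the smoother (hat) matrix:

```python
import numpy as np

def gcv_select(y, lambdas):
    """Pick the smoothing parameter by generalized cross-validation for
    the linear smoother S = (I + lam * D'D)^{-1}, D = 2nd differences."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)      # second-difference operator
    best = None
    for lam in lambdas:
        S = np.linalg.inv(np.eye(n) + lam * D.T @ D)
        resid = y - S @ y
        gcv = n * np.sum(resid ** 2) / (n - np.trace(S)) ** 2
        if best is None or gcv < best[0]:
            best = (gcv, lam)
    return best[1]

# Noisy samples of a smooth contour property (a sine, for illustration).
t = np.linspace(0, 2 * np.pi, 100)
y = np.sin(t) + np.random.default_rng(0).normal(0, 0.1, 100)
print("GCV-selected lambda:", gcv_select(y, [0.1, 1, 10, 100]))
```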

16.
Metric-Based Methods for Adaptive Model Selection and Regularization
We present a general approach to model selection and regularization that exploits unlabeled data to adaptively control hypothesis complexity in supervised learning tasks. The idea is to impose a metric structure on hypotheses by determining the discrepancy between their predictions across the distribution of unlabeled data. We show how this metric can be used to detect untrustworthy training error estimates, and devise novel model selection strategies that exhibit theoretical guarantees against over-fitting (while still avoiding under-fitting). We then extend the approach to derive a general training criterion for supervised learning—yielding an adaptive regularization method that uses unlabeled data to automatically set regularization parameters. This new criterion adjusts its regularization level to the specific set of training data received, and performs well on a variety of regression and conditional density estimation tasks. The only proviso for these methods is that sufficient unlabeled training data be available.

17.
This paper presents a text block extraction algorithm that takes as its input a set of text lines of a given document and partitions the text lines into a set of text blocks, where each text block is associated with a set of homogeneous formatting attributes, e.g. text alignment and indentation. The text block extraction algorithm described in this paper is probability-based. We adopt an engineering approach to systematically characterising the text block structures based on a large document image database, and develop statistical methods to extract the text block structures from the image. All the probabilities are estimated from an extensive training set of various kinds of measurements among the text lines, and among the text blocks, in the training data set. The off-line probabilities estimated in training then drive all decisions in the on-line text block extraction. An iterative, relaxation-like method is used to find the partitioning solution that maximizes the joint probability. To evaluate the performance of our text block extraction algorithm, we used a three-fold validation method and developed a quantitative performance measure. The algorithm was evaluated on the UW-III database of some 1600 scanned document image pages, and it identifies and segments 91% of text blocks correctly.

18.
Statistical outlier detection using direct density ratio estimation
We propose a new statistical approach to the problem of inlier-based outlier detection, i.e., finding outliers in the test set based on the training set consisting only of inliers. Our key idea is to use the ratio of training and test data densities as an outlier score. This approach is expected to have better performance even in high-dimensional problems since methods for directly estimating the density ratio without going through density estimation are available. Among various density ratio estimation methods, we employ the method called unconstrained least-squares importance fitting (uLSIF) since it is equipped with natural cross-validation procedures, allowing us to objectively optimize the value of tuning parameters such as the regularization parameter and the kernel width. Furthermore, uLSIF offers a closed-form solution as well as a closed-form formula for the leave-one-out error, so it is computationally very efficient and is scalable to massive datasets.  Simulations with benchmark and real-world datasets illustrate the usefulness of the proposed approach.
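A compact sketch of uLSIF's closed-form fit of the ratio w(x) = p_train(x) / p_test(x) with Gaussian kernels; the kernel width and regularization parameter are fixed here rather than tuned by uLSIF's built-in cross-validation, and the choice of kernel centers is an assumption of the sketch:

```python
import numpy as np

def ulsif_outlier_scores(X_tr, X_te, sigma=1.0, lam=0.1, n_centers=50):
    """Fit w(x) = p_train(x)/p_test(x) by unconstrained least-squares
    importance fitting; a small w on a test point flags it as an outlier."""
    rng = np.random.default_rng(0)
    C = X_te[rng.choice(len(X_te), min(n_centers, len(X_te)), replace=False)]
    K = lambda A: np.exp(-((A[:, None, :] - C[None, :, :]) ** 2).sum(-1)
                         / (2 * sigma ** 2))
    Phi_te, Phi_tr = K(X_te), K(X_tr)
    H = Phi_te.T @ Phi_te / len(X_te)      # denominator (test) moment matrix
    h = Phi_tr.mean(axis=0)                # numerator (train) mean vector
    alpha = np.linalg.solve(H + lam * np.eye(len(C)), h)   # closed form
    return Phi_te @ alpha                  # low score -> likely outlier

X_tr = np.random.default_rng(1).normal(0, 1, (200, 2))             # inliers
X_te = np.vstack([np.random.default_rng(2).normal(0, 1, (95, 2)),
                  np.random.default_rng(3).normal(5, 1, (5, 2))])   # outliers
scores = ulsif_outlier_scores(X_tr, X_te)
print("lowest-score indices:", np.argsort(scores)[:5])   # expect 95..99
```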

19.
Manifold regularization (MR) based semi-supervised learning can explore structural relationships in both labeled and unlabeled data. However, model selection seriously affects the predictive performance of MR because of the additional geometry regularizer built from the labeled and unlabeled data. In this paper, two continuous and two inherently discrete hyperparameters are taken as optimization variables, and a leave-one-out cross-validation (LOOCV) based Predicted REsidual Sum of Squares (PRESS) criterion is first presented for model selection of MR, choosing appropriate regularization coefficients and kernel parameters. Given the inherent discontinuity of the two discrete hyperparameters, the minimization is carried out with an improved Nelder-Mead simplex algorithm over the hybrid set of discrete and continuous variables. The manifold regularization and model selection algorithms are applied to six synthetic and real-life benchmark datasets. The proposed approach, which effectively exploits the embedded intrinsic geometric manifolds and the unbiased LOOCV estimate, outperforms the original MR and supervised learning approaches in the empirical study.
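For any linear-in-parameters smoother with hat matrix H, the LOOCV residuals, and hence PRESS, have a closed form; a sketch with plain kernel ridge regression standing in for the manifold-regularized solution (which would add a graph Laplacian term to the system):

```python
import numpy as np

def press_criterion(K, y, gamma):
    """Closed-form LOOCV PRESS for the smoother H = K (K + gamma I)^{-1}:
    PRESS = sum_i ((y_i - yhat_i) / (1 - H_ii))^2."""
    n = len(y)
    H = K @ np.linalg.inv(K + gamma * np.eye(n))
    resid = y - H @ y
    return np.sum((resid / (1.0 - np.diag(H))) ** 2)

# Select the regularization coefficient on a small synthetic problem.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 60)
y = np.sin(x) + rng.normal(0, 0.1, 60)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)     # RBF Gram matrix
gammas = [1e-3, 1e-2, 1e-1, 1.0]
print("PRESS-selected gamma:",
      min(gammas, key=lambda g: press_criterion(K, y, g)))
```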

20.
On the complexity of fixed-priority scheduling of periodic, real-time tasks
We consider the complexity of determining whether a set of periodic, real-time tasks can be scheduled on m ≥ 1 identical processors with respect to fixed-priority scheduling. It is shown that the problem is NP-hard in all but one special case. The complexity of optimal fixed-priority scheduling algorithms is also discussed.
