1.
Statistical inference for network-linked data has become a hot topic in statistics in recent years. The independence assumption among samples in traditional models often fails to meet the needs of analyzing modern network-linked data. This paper studies the individual effect of each node in network-linked data and, using the idea of fusion penalties, encourages the individual effects of connected nodes to converge. Borrowing the knockoff idea, which mimics the dependence structure of the original variables to construct features unrelated to the response, we propose a knockoff-based variable selection method for network-linked data (NLKF). We prove theoretically that NLKF controls the false discovery rate (FDR) of variable selection at the target level. When the covariance of the original data is unknown, using an estimated covariance matrix retains these good statistical properties. A comparison with the classical Lasso demonstrates the reliability of the proposed method. Finally, an application is given using 200 factors of 4,000 stocks in the Chinese A-share market from January to December 2022, together with the network constructed from each stock's Shenwan first-level industry classification.
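The knockoff filtering step that a method such as NLKF builds on can be sketched in a few lines. This is a generic illustration, not the authors' NLKF code: the statistics `W` are assumed to be precomputed (for example, as differences in Lasso coefficient magnitude between each original variable and its knockoff copy), and `knockoff_select` applies the standard knockoff+ threshold that controls the FDR at level `q`:

```python
import numpy as np

def knockoff_select(W, q=0.1):
    """Select variables via the knockoff+ threshold.

    W : array of knockoff statistics, one per variable (large positive
        values suggest a true signal; sign-symmetric under the null).
    q : target false discovery rate.
    """
    ts = np.sort(np.abs(W[W != 0]))
    for t in ts:
        # estimated false discovery proportion at threshold t
        fdp = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp <= q:
            return np.flatnonzero(W >= t)   # indices of selected variables
    return np.array([], dtype=int)          # no threshold achieves level q
```

The `1 +` in the numerator is what distinguishes knockoff+ (exact FDR control) from the plain knockoff filter.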
2.
Many gene biomarker selection methods cannot be applied directly to clinical diagnosis because the available samples are few. Researchers have therefore proposed integrating different gene expression datasets while preserving the integrity of the biological information. However, because of batch effects, directly merging different gene expression datasets may introduce new systematic errors. To address this, we propose an analysis framework combining self-paced learning with SCAD-Net regularization. On the one hand, self-paced learning first learns a base model from low-noise samples and then refines it with high-noise samples, making the model more robust and mitigating batch effects; on the other hand, SCAD-Net regularization fuses gene expression data with gene-gene interaction information, achieving better feature selection. Results on simulated data under various settings and on a breast cancer cell line dataset show that the regression model based on self-paced learning and SCAD-Net regularization predicts better on high-dimensional complex network datasets.
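For reference, the SCAD penalty that the SCAD-Net regularizer extends has a simple closed form. The sketch below is the plain penalty of Fan and Li; the network-fused variant in the paper adds gene-interaction terms that are not shown here:

```python
import numpy as np

def scad_penalty(theta, lam=1.0, a=3.7):
    """SCAD penalty, applied elementwise.

    Linear near zero (like the Lasso), quadratic taper in the middle,
    and constant for large coefficients, which reduces estimation bias.
    a = 3.7 is the conventional default.
    """
    t = np.abs(np.asarray(theta, dtype=float))
    small = lam * t
    mid = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    large = np.full_like(t, lam**2 * (a + 1) / 2)
    return np.where(t <= lam, small, np.where(t <= a * lam, mid, large))
```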
3.
Detecting the P300 potential from EEG signals is the key to implementing a P300 brain-computer interface. Because EEG signals differ considerably across individuals, existing deep-learning-based P300 detection methods all require large amounts of EEG data to train the model. For patients with only small samples, no satisfactory solution exists to date. This paper proposes an improved prototypical network method suited to small-sample P300 EEG detection. The model uses a convolutional neural network to ex...
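The basic prototypical-network classification rule that such a method builds on is easy to sketch. This is a generic illustration operating on precomputed embeddings, not the paper's CNN model; `emb` stands in for support-set embeddings produced by some encoder:

```python
import numpy as np

def prototypes(emb, labels):
    """Class prototypes: the mean embedding of each class's support samples."""
    classes = np.unique(labels)
    return classes, np.stack([emb[labels == c].mean(axis=0) for c in classes])

def classify(query_emb, classes, protos):
    """Assign each query to the class of its nearest (Euclidean) prototype."""
    d = np.linalg.norm(query_emb[:, None, :] - protos[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]
```

Because only class means are estimated, the rule needs very few labeled samples per class, which is what makes the approach attractive for small-sample P300 data.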
4.
《自动化仪表》2017,(9)
To address problems in enterprise information systems, the false positives and false negatives in current network security are analyzed, and issues such as the high cost of analyzing massive information, the inability to anticipate security policy content, and the security of the system itself are discussed. An intrusion detection system model based on an improved multi-label learning algorithm is designed and built. The model comprises modules for data acquisition, data preprocessing, algorithmic detection, and response handling. After the intrusion detection system was designed, it was deployed into the existing system to detect intrusion data, with detection results shown on a monitoring interface. For unprocessed network connection records, the system administrator can manually judge whether they constitute attacks, while anomalous data are added as attack instances to the sample library. At appropriate times, the algorithm refines the classifier with the new sample library. The model realizes intrusion detection without altering the operation of the existing information system.
5.
To address the low accuracy of specific emitter identification in practical settings where few labeled signal samples are available, an improved consistency-regularization semi-supervised method is proposed: the pseudo-label idea is introduced into consistency regularization, and a pseudo-label regularization term is added to each of three consistency-regularization models. The experiments design an Inception deep network suited to real captured signal data and explore the effect of experimental parameter changes on the results; the experimental results...
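The pseudo-label idea added to the consistency-regularization models can be sketched as a confidence filter over model predictions. This is a generic illustration; the threshold value and how the resulting pseudo-labels enter the regularization term are assumptions, not the paper's exact scheme:

```python
import numpy as np

def pseudo_labels(probs, threshold=0.95):
    """Keep only confident unlabeled predictions as pseudo-labels.

    probs : (n, k) predicted class probabilities for unlabeled samples.
    Returns (indices of kept samples, pseudo-label for each kept sample).
    """
    conf = probs.max(axis=1)
    keep = np.flatnonzero(conf >= threshold)
    return keep, probs[keep].argmax(axis=1)
```

The kept pairs would then be treated as labeled data in an additional loss term, alongside the consistency penalty on perturbed inputs.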
6.
7.
8.
9.
宋丽丽 《计算机工程与应用》2019,55(20):170-176
Person re-identification is a challenging task in computer vision. The task studies patterns of variation in individual appearance, where features change drastically and the small-sample problem arises. A transfer-learning-based metric learning model is proposed that constrains the difference between sample distributions of different datasets, enabling the metric model to transfer across datasets. The algorithm not only enriches the diversity of the metric model's training samples and improves its discriminative power, but also improves sample adaptability. Finally, pre-training the metric model on the iLIDS dataset and transferring it to the VIPeR and CUHK01 datasets verifies the effectiveness and accuracy of the algorithm.
10.
11.
12.
Air-quality prediction models built from traditional machine learning methods are widely used, but such models remain weak at selecting effective data, especially spatio-temporally correlated data. Focusing on the effectiveness of deep learning input data, a spatial-temporal similarity LSTM model (STS-LSTM) is proposed to select more effective data at both the temporal and spatial levels. STS-LSTM consists of three modules. The front module selects inputs by spatio-temporal similarity, using a proposed Granger causal index weighted dynamic time warping (GCWDTW) algorithm to select data with higher spatio-temporal similarity; the middle module trains an LSTM as the deep learning network; the back module selects different output combinations for ensembling according to the target station's characteristics. The full STS-LSTM model improves on existing algorithms by about 8% in air-quality prediction error, and data screened for effectiveness improve model accuracy by up to 21%. The results show that the algorithm is markedly effective at selecting useful data, and that making data input/output selection part of an applied deep learning pipeline can improve the network's final performance.
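The dynamic-time-warping core that GCWDTW weights with Granger causal indices can be sketched in its plain, unweighted form; the Granger weighting itself is specific to the paper and omitted here:

```python
import numpy as np

def dtw(x, y):
    """Dynamic time warping distance between two 1-D series.

    Classic O(n*m) dynamic program: D[i, j] is the cheapest alignment
    cost of x[:i] and y[:j], allowing stretches in either series.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Unlike Euclidean distance, DTW tolerates temporal shifts, which is why it suits comparing pollution series across monitoring stations whose peaks arrive at different times.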
13.
The vigorous expansion of wind energy power generation over the last decade has also entailed innovative improvements to surface roughness prediction models applied to high-torque milling operations. Artificial neural networks are the most widely used soft computing technique for the development of these prediction models. In this paper, we concentrate on the initial data transformation and its effect on the prediction of surface roughness in high-torque face milling operations. An extensive data set is generated from experiments performed under industrial conditions. The data set includes a very broad set of different parameters that influence surface roughness: cutting tool properties, machining parameters and cutting phenomena. Some of these parameters may potentially be related to the others or may only have a minor influence on the prediction model. Moreover, depending on the number of available records, the machine learning models may or may not be capable of modelling some of the underlying dependencies. Hence the need to select an appropriate number of input signals and their matching prediction model configuration. A hybrid algorithm that combines a genetic algorithm with neural networks is proposed in this paper, in order to address the selection of relevant parameters and their appropriate transformation. The algorithm has been tested in a number of experiments performed under workshop conditions with data sets of different sizes to investigate the impact of available data on the selection of the corresponding data transformation. Data set size has a direct influence on the accuracy of the prediction models for roughness modelling, but also on the use of individual parameters and transformed features. The results of the tests show significant improvements in the quality of prediction models constructed in this way. These improvements are evident when these models are compared with standard multilayer perceptrons trained with all the parameters and with data reduced through standard Principal Component Analysis practice.
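A minimal version of such a genetic search over input-feature subsets might look like this. It is a sketch under simplifying assumptions: fitness is the hold-out error of an ordinary least-squares model rather than a neural network, and the GA uses plain truncation selection, uniform crossover, and bit-flip mutation:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Smaller hold-out error -> higher fitness; empty masks are invalid."""
    if not mask.any():
        return -np.inf
    half = len(y) // 2
    Xtr, Xva = X[:half, mask], X[half:, mask]
    coef, *_ = np.linalg.lstsq(Xtr, y[:half], rcond=None)
    return -np.mean((Xva @ coef - y[half:]) ** 2)

def ga_select(X, y, pop=20, gens=30, p_mut=0.1):
    """Evolve binary feature masks: keep the top half of the population,
    breed children by uniform crossover, and mutate by bit flips."""
    d = X.shape[1]
    masks = rng.random((pop, d)) < 0.5
    for _ in range(gens):
        scores = np.array([fitness(m, X, y) for m in masks])
        parents = masks[np.argsort(scores)[::-1][: pop // 2]]
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            child = np.where(rng.random(d) < 0.5, a, b)  # uniform crossover
            child ^= rng.random(d) < p_mut               # bit-flip mutation
            children.append(child)
        masks = np.vstack([parents, children])
    scores = np.array([fitness(m, X, y) for m in masks])
    return masks[np.argmax(scores)]
```

In the paper's setting, the fitness evaluation would train a neural network per mask, which makes the search far more expensive but follows the same loop.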
14.
Variable selection and dimension reduction are two commonly adopted approaches for high-dimensional data analysis, but have traditionally been treated separately. Here we propose an integrated approach, called sparse gradient learning (SGL), for variable selection and dimension reduction via learning the gradients of the prediction function directly from samples. By imposing a sparsity constraint on the gradients, variable selection is achieved by selecting variables corresponding to non-zero partial derivatives, and effective dimensions are extracted based on the eigenvectors of the derived sparse empirical gradient covariance matrix. An error analysis is given for the convergence of the estimated gradients to the true ones in both the Euclidean and the manifold setting. We also develop an efficient forward-backward splitting algorithm to solve the SGL problem, making the framework practically scalable for medium or large datasets. The utility of SGL for variable selection and feature extraction is explicitly given and illustrated on artificial data as well as real-world examples. The main advantages of our method include variable selection for both linear and nonlinear predictions, effective dimension reduction with sparse loadings, and an efficient algorithm for large p, small n problems.
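For an l1-type sparsity term, forward-backward splitting reduces to alternating a gradient step with soft-thresholding. The sketch below illustrates the scheme on a plain lasso objective; the SGL problem itself operates on gradients of the prediction function, which is not reproduced here:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal (backward) step for the l1 penalty: shrink toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def forward_backward(X, y, lam=0.1, iters=500):
    """Minimise 0.5/n * ||Xw - y||^2 + lam * ||w||_1 by splitting:
    forward gradient step on the smooth loss, backward proximal step
    on the non-smooth sparsity term."""
    n, d = X.shape
    step = n / np.linalg.norm(X, 2) ** 2   # 1/L, L = Lipschitz constant
    w = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n                      # forward step
        w = soft_threshold(w - step * grad, step * lam)   # backward step
    return w
```

The same two-step structure carries over to SGL, with the gradient step taken on its data-fit term and the proximal step enforcing group sparsity across partial derivatives.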
15.
The goal of stable learning is to construct, from a single training set, a robust prediction model that accurately classifies any test data whose distribution is similar to that of the training data. To achieve accurate prediction on test data of unknown distribution, existing stable learning algorithms strive to remove spurious correlations between features and class labels. However, these algorithms can only weaken some of these spurious correlations, not eliminate them entirely; moreover, they may overfit when building the prediction model. This paper proposes a stable learning algorithm based on instance weighting and dual classifiers, which learns a robust prediction model by jointly optimizing instance weights and two classifiers. Specifically, the algorithm weights instances to balance confounders from a global perspective, removing spurious correlations between features and class labels so that each feature's contribution to classification can be better assessed. To fully eliminate the spurious correlations between some irrelevant features and class labels, and to weaken the interference of irrelevant features with the instance weighting process, the algorithm first performs feature selection to filter out some irrelevant features before weighting. To further improve generalization, the algorithm trains two classifiers and learns a better classification boundary by minimizing the difference between their parameters. Experimental results on synthetic and real datasets demonstrate the effectiveness of the proposed method.
16.
In this paper, a new approach for centralised and distributed learning from spatial heterogeneous databases is proposed. The centralised algorithm consists of a spatial clustering followed by local regression aimed at learning relationships between driving attributes and the target variable inside each region identified through clustering. For distributed learning, similar regions in multiple databases are first discovered by applying a spatial clustering algorithm independently on all sites, and then identifying corresponding clusters on participating sites. Local regression models are built on identified clusters and transferred among the sites for combining the models responsible for identified regions. Extensive experiments on spatial data sets with missing and irrelevant attributes, and with different levels of noise, resulted in a higher prediction accuracy of both centralised and distributed methods, as compared to using global models. In addition, experiments performed indicate that both methods are computationally more efficient than the global approach, due to the smaller data sets used for learning. Furthermore, the accuracy of the distributed method was comparable to the centralised approach, thus providing a viable alternative to moving all data to a central location.
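The centralised variant, spatial clustering followed by local regression per region, can be sketched as follows. This is a generic illustration using Lloyd's k-means with farthest-point initialisation and per-cluster least squares; the paper's specific clustering and regression methods may differ:

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min(((X[:, None] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.stack([X[labels == j].mean(0) for j in range(k)])
    return centers, labels

def local_models(coords, X, y, k=2):
    """Cluster on spatial coordinates, then fit one least-squares
    regression (with intercept) per discovered region."""
    centers, labels = kmeans(coords, k)
    models = {}
    for j in range(k):
        idx = labels == j
        A = np.column_stack([X[idx], np.ones(idx.sum())])
        models[j] = np.linalg.lstsq(A, y[idx], rcond=None)[0]
    return centers, models

def predict(coords, X, centers, models):
    """Route each point to its nearest region's local model."""
    labels = np.argmin(((coords[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    A = np.column_stack([X, np.ones(len(X))])
    return np.array([A[i] @ models[l] for i, l in enumerate(labels)])
```

The distributed variant would run the clustering per site, match corresponding clusters across sites, and exchange only the fitted local models rather than the raw data.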
17.
Kohei Hayashi Takashi Takenouchi Tomohiro Shibata Yuki Kamiya Daishi Kato Kazuo Kunieda Keiji Yamada Kazushi Ikeda 《Knowledge and Information Systems》2011,33(1):57-88
In this paper, we propose a new probabilistic model of heterogeneously attributed multi-dimensional arrays. The model can manage heterogeneity by employing individual exponential family distributions for each attribute of the tensor array. Entries of the tensor are connected by latent variables and share information across the different attributes through those latent variables. The assumption of heterogeneity makes exact Bayesian inference intractable, so we derive an EM algorithm approximated via the Laplace method and a Gaussian process. We also extend the proposed algorithm to online learning. We apply our method to missing-value prediction and anomaly detection problems and show that it outperforms conventional approaches that do not consider heterogeneity.
18.
Kwon D Landi MT Vannucci M Issaq HJ Prieto D Pfeiffer RM 《Computational statistics & data analysis》2011,55(10):2807-2818
We present a Bayesian variable selection method for the setting in which the number of independent variables or predictors in a particular dataset is much larger than the available sample size. While most existing methods allow some degree of correlation among predictors but do not exploit it for variable selection, our method accounts for correlations among the predictors during selection. Our correlation-based stochastic search (CBS) method, the hybrid-CBS algorithm, extends a popular search algorithm for high-dimensional data, the stochastic search variable selection (SSVS) method. Similar to SSVS, we search the space of all possible models using variable addition, deletion or swap moves. However, our moves through the model space are designed to accommodate correlations among the variables. We describe our approach for continuous, binary, ordinal, and count outcome data. The impact of choices of prior distributions and hyperparameters is assessed in simulation studies. We also examined the performance of variable selection and prediction as the correlation structure of the predictors varies. We found that the hybrid-CBS resulted in lower prediction errors and identified the true outcome-associated predictors better than SSVS when predictors were moderately to highly correlated. We illustrate the method on data from a proteomic profiling study of melanoma, a type of skin cancer.
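The add/delete/swap moves that SSVS-style stochastic search relies on can be sketched as below. These are plain uniform moves; the hybrid-CBS algorithm additionally biases them using predictor correlations, which is not shown:

```python
import numpy as np

def propose(gamma, rng):
    """One stochastic-search move on the binary inclusion vector gamma:
    add an excluded variable, delete an included one, or swap a pair."""
    gamma = gamma.copy()
    inc, exc = np.flatnonzero(gamma), np.flatnonzero(~gamma)
    moves = [m for m, ok in [("add", len(exc) > 0),
                             ("delete", len(inc) > 0),
                             ("swap", len(inc) > 0 and len(exc) > 0)] if ok]
    move = moves[rng.integers(len(moves))]
    if move == "add":
        gamma[rng.choice(exc)] = True
    elif move == "delete":
        gamma[rng.choice(inc)] = False
    else:  # swap: delete one included variable, add one excluded variable
        gamma[rng.choice(inc)] = False
        gamma[rng.choice(exc)] = True
    return gamma
```

In the full sampler, each proposal is accepted or rejected by a Metropolis-Hastings step against the model's marginal posterior, which is omitted here.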
19.
This paper studies a data-driven scheduling framework for semiconductor production lines. Based on samples of scheduling optimization data, machine learning algorithms are applied to obtain a dynamic scheduling model through which, given the production line's current state, a near-optimal scheduling policy can be determined in real time. On this basis, a method for generating dynamic scheduling models using feature selection and classification is proposed, and a concrete scheduling model with hybrid feature selection and classification is implemented: a filter feature selection method first screens the production attributes, and then wrapper feature selection and classification generate the model, improving the efficiency of model generation. Finally, on a real semiconductor production line, dynamic scheduling models implemented with six different algorithms under the proposed framework are tested, and algorithm performance data and production line performance data are compared and analyzed. The results show that data-driven dynamic scheduling outperforms single dispatching rules while meeting the real-time requirements of production line scheduling. When many data samples are available, the method proposed here is recommended.
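The hybrid filter-then-wrapper idea can be sketched as a cheap correlation screen followed by greedy forward selection. This is a generic illustration; the paper's filter criterion and wrapped classifier are not specified in the abstract, so a least-squares model stands in for the wrapped learner:

```python
import numpy as np

def filter_then_wrapper(X, y, keep=10, final=3):
    """Hybrid selection: a correlation filter screens attributes cheaply,
    then a greedy forward (wrapper) pass refines them with a model."""
    # Filter stage: rank attributes by absolute correlation with the target.
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    screened = np.argsort(corr)[::-1][:keep]
    # Wrapper stage: greedily add the attribute that most reduces model error.
    chosen = []
    while len(chosen) < final:
        best_j, best_err = None, np.inf
        for j in screened:
            if j in chosen:
                continue
            cols = chosen + [j]
            A = np.column_stack([X[:, cols], np.ones(len(y))])
            coef = np.linalg.lstsq(A, y, rcond=None)[0]
            err = np.mean((A @ coef - y) ** 2)
            if err < best_err:
                best_j, best_err = j, err
        chosen.append(best_j)
    return chosen
```

The filter stage keeps the wrapper's expensive model-fitting loop small, which is exactly the efficiency argument made in the abstract.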
20.
Federated learning is a system that obtains a machine learning model without centralized training data: source data never leave the local device, reducing the risk of privacy leakage, while each client still obtains an optimized trained model. However, differences in identity, behavior, and environment across nodes lead to imbalanced data distributions, which can cause large deviations in model performance across devices, i.e., the data heterogeneity problem. To address this, a node-optimized model-parameter clustering algorithm with data sharing is proposed, applying clustering and data sharing jointly in the federated learning system; the method both effectively reduces the impact of data heterogeneity on federated learning and accelerates the convergence of local models. A method for evaluating the convergence of the global shared model is also designed, used to decide when to cluster nodes. Finally, experiments and performance analysis on the EMNIST and CIFAR-10 datasets verify the effect of the sharing ratio on each node's convergence speed and accuracy, and further analyze node accuracy before and after applying clustering and data sharing together in federated learning. The results show that introducing data sharing improves each node's convergence speed and accuracy, and that when clustering and data sharing are both introduced into federated training, accuracy improves by 10%-15% over the FedAvg algorithm, demonstrating that the method handles the data heterogeneity problem in federated learning well.
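For reference, the FedAvg aggregation step that the proposed method is compared against is simply a sample-size-weighted average of client parameters. A minimal sketch, treating each client model as a flat parameter array:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg aggregation: average client model parameters weighted by
    each client's local sample count."""
    sizes = np.asarray(client_sizes, dtype=float)
    frac = sizes / sizes.sum()
    return sum(f * w for f, w in zip(frac, client_weights))
```

Under heterogeneous (non-IID) client data, this plain average can drift away from any single client's optimum, which is the failure mode the clustering-plus-data-sharing scheme above targets.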