Association rules have been widely used in many application areas to extract new and useful information from raw data and express it in a form that decision makers can understand. However, raw data may not always be available, or it may be distributed across multiple datasets, in which case the resulting number of association rules to be inspected is overwhelming. In light of these observations, we propose meta-association rules, a new framework for mining association rules over previously discovered rules in multiple databases. Meta-association rules are a new tool that conveys new information from the patterns extracted from multiple datasets and gives a “summarized” representation of the most frequent patterns. We propose and compare two algorithms, based on crisp rules and fuzzy rules respectively, and conclude that fuzzy meta-association rules are suitable for incorporating into the meta-mining procedure the quality assessment provided by the rules obtained in the first step of the process, although the fuzzy approach takes more time than the crisp one. In addition, fuzzy meta-rules give a more manageable set of rules for subsequent analysis, and they allow the use of fuzzy items to express additional knowledge about the original databases. The proposed framework is illustrated with real-life data about crime incidents in the city of Chicago. Issues such as the differences from traditional approaches are discussed using synthetic data.

An algorithm for silhouette extraction from volumetric data is presented. Trivariate tensor product B-spline functions are used to represent the data. An offline phase that arranges the data in a lookup table is employed to improve the computation time during an interactive session. A subdivision scheme is employed to extract the silhouette curves from an implicit trivariate B-spline function. The produced results are smooth, high-quality silhouette curves when compared to voxel-based silhouette extraction schemes.

This paper presents a study of the characteristics of extra-tropical oceanic Rossby waves from datasets of Sea Surface Height (SSH), Sea Surface Temperature (SST) and ocean colour. The main focus is on the propagation speed of the waves and a comparison is made between the observational results and the speeds predicted by the classical theory and by the most recent extended theory of Rossby waves. There is also discussion, with an example, of the additional information that can be derived by a comparison of the wave signatures in the different datasets.

We propose a Bayesian model for clustered outliers in multiple regression. In the literature, outliers are frequently modeled as coming from a subgroup where the variance of the errors is much larger than in the rest of the data. By contrast, when a cluster of outliers exists, we show that it can be more informative to model them as coming from a subgroup where different regression coefficients hold. We can explicitly model the clustering phenomenon by assuming that the probability of an outlier is a function of the explanatory variables. Fitting proceeds via the Gibbs sampler, using the Metropolis-Hastings algorithm to produce variates from the more unusual distributions. Initialization uses a least median of squares fit, and in some ways this method can be viewed as a Bayesian version of the many algorithms that use this fit as a start to some more efficient estimator. This method works very well in a variety of test data sets. We illustrate its use in a data set of sailboat prices, where it yields information both on the identity of the outliers and on their location, spread, and the regression coefficients inside the minority subgroup.

With the falling price of memory, an increasing number of multimedia servers and proxies are now equipped with a large memory space. Caching media objects in the memory of a proxy helps to reduce the network traffic, the disk I/O bandwidth requirement, and the data delivery latency. The running buffer approach and its alternatives are representative techniques for caching streaming data in memory. The existing techniques have two limitations. First, although multiple running buffers for the same media object co-exist in a given processing period, data sharing among these buffers is not considered. Second, user access patterns are not adequately considered in buffer management. In this paper, we propose two techniques based on shared running buffers in the proxy to address these limitations. Considering user access patterns and characteristics of the requested media objects, our techniques adaptively allocate memory buffers to fully utilize the currently buffered data of streaming sessions, with the aim of reducing both the server load and the network traffic. In experimental comparisons with several existing techniques, we show that the proposed techniques achieve significant performance improvement by effectively using the shared running buffers.

Simultaneously estimating multiple conditional quantiles is often regarded as a more appropriate regression tool than the usual conditional mean regression for exploring the stochastic relationship between the response and covariates. When multiple quantile regressions are considered, it is of great importance to share strength among them. In this paper, we propose a novel regularization method that explores the similarity among multiple quantile regressions by selecting a common subset of covariates to model multiple conditional quantiles simultaneously. The penalty we employ is a matrix norm that encourages sparsity in a column-wise fashion. We demonstrate the effectiveness of the proposed method using both simulations and an application of gene expression data analysis.
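
As a rough illustration of the loss that quantile regression minimizes, the sketch below implements the standard check (pinball) loss for a single quantile level. It does not reproduce the paper's column-wise matrix-norm penalty, whose exact form is not given in the abstract. The function names `check_loss`, `quantile_objective`, and `best_constant` are hypothetical and chosen only for this example.

```python
def check_loss(residual, tau):
    """Check (pinball) loss for a quantile level tau in (0, 1)."""
    # Positive residuals are weighted by tau, negative ones by (1 - tau).
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def quantile_objective(y, fitted, tau):
    """Total check loss of the fitted values against the responses y."""
    return sum(check_loss(yi - fi, tau) for yi, fi in zip(y, fitted))


def best_constant(y, tau):
    """Observed value that minimizes the check loss as a constant fit."""
    # Assumption for illustration: candidates are restricted to the data points.
    return min(y, key=lambda c: quantile_objective(y, [c] * len(y), tau))
```

With `tau = 0.5` the loss is half the absolute error, so the minimizing constant for a small sample is its median. A joint method of the kind described would sum this objective over several values of `tau` and add a penalty that ties the coefficient vectors together.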

Dr. R. Dutter 《Computing》1977,18(2):167-176
Several iterative procedures have been proposed and developed to solve numerically the problem of robust regression, in particular, of robust linear regression. The algorithms described here are modified versions of the “sophisticated method” given by Huber (1973, [8]), which sometimes fails to converge. In this paper, the new algorithms are formulated and convergence proofs are given. The behavior of the procedures is illustrated by a numerical example and is compared to another (“simple”) algorithm.

Within the regression context, this method begins with the set of exactly fitted coefficients determined from each p-dimensional subset of the sample. Outlying points in this p-dimensional coefficient space correspond to outliers in the original n-dimensional data space. Resampled values are used to detect anomalous data points through a proposed detection rule that avoids masking and swamping and allows multiple outliers to be identified.
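
The idea of exactly fitted coefficients can be sketched for the simplest case, a straight line with p = 2, where every pair of points determines one (intercept, slope) pair. The abstract does not state the resampling scheme or the detection rule, so the sketch only builds the coefficient space. The median slope shown afterwards is the Theil-Sen estimator, used here as a stand-in summary, not the paper's rule. The name `elemental_fits` is hypothetical.

```python
from itertools import combinations
from statistics import median


def elemental_fits(points):
    """Exact (intercept, slope) fits through every pair of (x, y) points."""
    fits = []
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical pair does not define a slope
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        fits.append((intercept, slope))
    return fits


# Four points on y = 2x + 1 plus one gross outlier at x = 4.
points = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 30)]
fits = elemental_fits(points)
median_slope = median(slope for _, slope in fits)
```

In this example the six pairs that avoid the outlier all give slope 2, while the four pairs that include it land far away in coefficient space. This is the separation that the method described above exploits.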

The paper considers methods for evaluating regression parameters under indefinite a priori information of two types: fuzzy and stochastic. Fuzzy a priori information is assumed to be formulated on the basis of fuzzy notions of the model designer. Stochastic a priori information consists of systems of equations that are linear in the regression parameters and whose right-hand sides are random variables. Regression parameters may either be constant or vary in time. A classification of the evaluation methods using indefinite a priori information is proposed and used to generalize well-known methods. An evaluation method is developed that combines the fuzzy and stochastic a priori information about regression parameters.

Niche construction is a process whereby organisms, through their metabolism, activities, and choices, modify their own and/or each other’s niches. Our purpose is to clarify the interactions between evolution and niche construction by focusing on non-linear interactions between genetic and environmental factors shared by interacting species. We constructed a new fitness landscape model termed the NKES model by introducing environmental factors and their interactions with genetic factors into Kauffman’s NKCS model. The evolutionary experiments were conducted using hill-climbing and niche-constructing processes on this landscape. The results show that the average fitness among species strongly depends on the ruggedness of the fitness landscape (K) and the degree of the effect of niche construction on genetic factors (E). In particular, we observed two different roles of niche construction: moderate perturbations of hill-climbing processes on rugged landscapes, and a strong constraint that yields convergence to a stable state. By comparing the evolutionary dynamics of the NKES model with those of the NKCS model, we also show that differences in the structure of (direct or indirect) interactions among species drastically change the coevolutionary process of the whole ecosystem.

We introduce a class of Gaussian mixture models (GMMs) in which the covariances or the precisions (inverse covariances) are restricted to lie in subspaces spanned by rank-one symmetric matrices. The rank-one basis matrices are shared between the Gaussians according to a sharing structure. We describe an algorithm for estimating the parameters of the GMM in a maximum likelihood framework given a sharing structure. We employ these models for modeling the observations in the hidden states of a hidden Markov model based speech recognition system. We show that this class of models provides improvements in accuracy and computational efficiency over well-known covariance modeling techniques such as classical factor analysis, shared factor analysis, and maximum likelihood linear transformation based models, all of which are special instances of this class. We also investigate different sharing mechanisms. We show that, for the same number of parameters, modeling precisions leads to better performance than modeling covariances. Modeling precisions also gives a distinct advantage in computational and memory requirements.

The proportion of explained variation (R²) is frequently used in the general linear model, but in logistic regression no standard definition of R² exists. We present a SAS macro which calculates two R²-measures based on Pearson and on deviance residuals for logistic regression. Also, adjusted versions for both measures are given, which should prevent the inflation of R² in small samples.
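
The abstract does not give the macro's formulas, so the following Python sketch is not a port of it. It assumes two commonly used definitions: a sum-of-squares R² that compares squared residuals with those of the intercept-only model, and a deviance-based R² that compares binomial deviances. The adjusted versions are omitted because their form is not stated. The names `binomial_deviance` and `logistic_r2` are hypothetical.

```python
import math


def binomial_deviance(y, p):
    """Deviance of 0/1 outcomes y under fitted probabilities p in (0, 1)."""
    return -2.0 * sum(
        yi * math.log(pi) + (1 - yi) * math.log(1 - pi) for yi, pi in zip(y, p)
    )


def logistic_r2(y, p):
    """Return (sum-of-squares R2, deviance R2) relative to the null model.

    Assumes y contains both classes, so the null probability is in (0, 1).
    """
    n = len(y)
    ybar = sum(y) / n  # fitted probability of the intercept-only model
    ss_model = sum((yi - pi) ** 2 for yi, pi in zip(y, p))
    ss_null = sum((yi - ybar) ** 2 for yi in y)
    r2_ss = 1.0 - ss_model / ss_null
    r2_dev = 1.0 - binomial_deviance(y, p) / binomial_deviance(y, [ybar] * n)
    return r2_ss, r2_dev


r2_ss, r2_dev = logistic_r2([0, 0, 1, 1], [0.2, 0.3, 0.7, 0.8])
```

Both measures are 0 when the fitted probabilities equal the overall event rate and approach 1 as the fitted probabilities approach the observed outcomes. They generally differ for the same fit, which is one reason no single definition has become standard.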

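Application of particle swarm optimization to improved multiple linear regression   Cited by: 2 (self-citations: 1, citations by others: 1)
Reference [1] used constrained nonlinear programming to unify various improved multiple linear regression methods (principal component regression, ridge regression, robust regression, and constrained regression) in a single nonlinear programming model. Particle swarm optimization (PSO) is applied to solve this model. A practical example shows that the method is not only feasible but also yields results that agree with the actual data better than those of other methods and of reference [3].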

One of the primary issues confronting XML message brokers is the difficulty associated with processing a large set of continuous XPath queries over incoming XML streams. This paper proposes a novel system designed to present an effective solution to this problem. The proposed system transforms multiple XPath queries before their run-time into a new data structure, called an XP-table, by sharing their common constraints. An XP-table is matched with a stream relation (SR) transformed from a target XML stream by a SAX parser. This arrangement is intended to minimize the run-time workload of continuous query processing. In addition, an early-query-termination strategy is proposed as an improved alternative to the basic approach. It optimizes query processing by arranging the evaluation sequence of the member-lists (m-lists) of an XP-table adaptively and offers increased efficiency, especially in cases of low selectivity. System performance is estimated and verified through a variety of experiments, including comparisons with previous approaches such as YFilter and LazyDFA. The proposed system is practically linear-scalable and stable for evaluating a set of XPath queries in a continuous and timely fashion.

Data-driven modelling is used to develop two alternative types of predictive environmental model: a simulator, a model of a real-world process developed from a conceptual understanding of physical relations and/or from measured records, and an emulator, an imitator of some other model developed on predicted outputs calculated by that source model. A simple four-way typology called Emulation Simulation Typology (EST) is proposed that distinguishes between (i) model type and (ii) different uses of model development period and model test period datasets. To address the question of the extent to which simulator and emulator solutions might be considered interchangeable, i.e. provide similar levels of output accuracy when tested on data different from that used in their development, a pair of counterpart pan evaporation models was created using symbolic regression. Each model type delivered levels of predictive skill similar to those of other published solutions. Input-output sensitivity analysis of the two different model types likewise confirmed two very similar underlying response functions. This study demonstrates that the type and quality of data on which a model is tested has a greater influence on model accuracy assessment than the type and quality of data on which a model is developed, provided that the development record is sufficiently representative of the conceptual underpinnings of the system being examined. Thus, previously reported substantial disparities in goodness-of-fit statistics for pan evaporation models are most likely explained by the use of either measured or calculated data to test particular models, where lower scores do not necessarily represent major deficiencies in the solution itself.

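To reduce the adverse effect of noise on feature extraction (variable selection), a new robust and effective feature extraction (variable selection) method is proposed, based on median regression analysis and a dimension-reduction technique for variable selection (regularized estimation). An estimation algorithm is given in detail, and it is fast to compute. Experimental results show that the new method can effectively perform estimation and variable selection on high-dimensional datasets with high accuracy, and that it still performs well even when the signal-to-noise ratio of the data is very low. The method therefore provides a robust and effective approach to feature extraction in high-dimensional data mining.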

The traditional least squares estimators used in the multiple linear regression model are very sensitive to design anomalies. To rectify the situation, we propose a reparametrization of the model. We derive modified maximum likelihood estimators and show that they are robust and considerably more efficient than the least squares estimators, besides being insensitive to moderate design anomalies.

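Objective: Handwritten text line extraction is an important basic step in document image processing. In unconstrained handwritten text images, text lines are skewed, curved, crossing, and touching to varying degrees. Traditional geometric segmentation or clustering methods often cannot guarantee accurate segmentation of text line edges. To address these problems, a handwritten text line extraction method based on a joint text line regression-clustering framework is proposed. Method: First, a bank of anisotropic Gaussian filters is used to analyze the image at multiple scales and orientations; the smearing effect is used to detect ridge structures and extract the main body regions of text lines, which are then skeletonized to obtain a text line regression model. Next, a superpixel representation is built with connected components as the basic image units. To cluster the superpixels, a pixel-superpixel-text line associative hierarchical random field model is established, and energy function optimization is used to cluster the superpixels and label the text line to which each belongs. On this basis, all character blocks that touch across lines are detected, and a regression-line-based k-means clustering algorithm, guided by the regression model, clusters the pixels of the touching characters, thereby segmenting the touching characters and labeling their text lines. Finally, text line label switches are used to control the display of text line pixels and to extract them selectively, so that geometric segmentation is no longer needed. Result: In text line extraction tests on the HIT-MW offline handwritten Chinese document dataset, the detection rate (DR) is 99.83% and the recognition accuracy (RA) is 99.92%. Conclusion: Experiments show that, compared with traditional methods such as piecewise projection analysis, minimum spanning tree clustering, and Seam Carving, the proposed joint text line regression-clustering analysis framework improves the controllability and segmentation accuracy of text line edges. It extracts handwritten text lines efficiently while avoiding interference from adjacent text lines as far as possible, and it has high accuracy and robustness.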

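Factor analysis is used to select the 12 physicochemical indicators that most affect wine quality; these serve as the independent variables of a multiple linear regression and as the input-layer neurons of a BP network. Multiple linear regression and an improved BP neural network are each used to model the relationship between the main physicochemical indicators of wine and wine grapes and wine quality. The generalization ability of the two models is compared: the mean relative error of the multiple linear regression model on new samples is 1.93%, whereas that of the BP neural network model is 0.37%. Simulation experiments show that the generalization ability and stability of the BP neural network are clearly better than those of the multiple regression model.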
