Similar Articles
 20 similar articles found (search time: 10 ms)
1.
Exploring process data with the use of robust outlier detection algorithms   (Cited: 3; self-citations: 0; citations by others: 3)
To implement on-line process monitoring techniques such as principal component analysis (PCA) or partial least squares (PLS), it is necessary to extract data associated with normal operating conditions from the plant historical database for calibrating the models. One way to do this is to use robust outlier detection algorithms such as resampling by half-means (RHM), smallest half volume (SHV), or ellipsoidal multivariate trimming (MVT) in the off-line model building phase. While RHM and SHV are conceptually clear and statistically sound, their computational requirements are heavy. Closest distance to center (CDC) is proposed in this paper as an alternative for outlier detection. The use of the Mahalanobis distance in the initial step of MVT for detecting outliers is known to be ineffective; to improve MVT, CDC is therefore incorporated into it. Performance was evaluated relative to the goal of finding the best half of a data set, with data sets derived from the Tennessee Eastman process (TEP) simulator. Comparable results were obtained for RHM, SHV, and CDC, and better performance was obtained when CDC was incorporated into MVT than when CDC or MVT was used alone. All robust outlier detection algorithms outperformed the standard PCA algorithm. The effects of auto scaling, robust scaling, and a new scaling approach called modified scaling were also investigated. In the presence of multiple outliers, auto scaling was found to degrade the performance of all the robust techniques, whereas reasonable results were obtained with robust scaling and modified scaling.
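The closest-distance-to-center idea, keeping the half of the observations nearest to a location estimate and treating the rest as candidate outliers, can be sketched in a few lines. The snippet below is a minimal illustration under that reading, not the authors' exact algorithm; the coordinate-wise median as the center and the MAD-scaled Euclidean distance are assumptions of this sketch.

```python
import numpy as np

def cdc_best_half(X):
    """Return indices of the half of the rows of X closest to a robust center.

    Sketch of a closest-distance-to-center rule: scale each column robustly,
    take the coordinate-wise median as the center, and keep the 50% of
    observations with the smallest Euclidean distance to it.
    """
    X = np.asarray(X, dtype=float)
    center = np.median(X, axis=0)
    spread = np.median(np.abs(X - center), axis=0) + 1e-12  # MAD-style scale
    d = np.linalg.norm((X - center) / spread, axis=1)
    keep = np.argsort(d)[: X.shape[0] // 2]
    return np.sort(keep)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    normal = rng.normal(size=(90, 4))
    outliers = rng.normal(loc=8.0, size=(10, 4))
    X = np.vstack([normal, outliers])
    good = cdc_best_half(X)
    flagged = set(range(100)) - set(good.tolist())
    print("injected outliers among flagged rows:",
          len(flagged & set(range(90, 100))), "of 10")
```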

2.
Exploratory factor analysis is commonly used in IS research to detect multivariate data structures. Frequently, the method is blindly applied without checking whether the data fulfill its requirements. We investigated the influence of sample size, data transformation, factor extraction method, rotation, and number of factors on the outcome. We compared classical exploratory factor analysis with a robust counterpart, which is less influenced by data outliers and data heterogeneities. Our analyses revealed that robust exploratory factor analysis is more stable than the classical method.
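As a rough illustration of the classical-versus-robust contrast, one can compare loadings obtained from the ordinary sample covariance with those obtained from a robust covariance estimate such as the minimum covariance determinant (MCD). This is only a sketch of the general idea of extracting factors from a robust covariance matrix, not the specific robust procedure evaluated in the paper; the simulated data are made up.

```python
import numpy as np
from sklearn.covariance import MinCovDet

def loadings_from_cov(cov, n_factors):
    """Principal-axis style loadings: top eigenvectors scaled by sqrt(eigenvalues)."""
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1][:n_factors]
    return vecs[:, order] * np.sqrt(vals[order])

rng = np.random.default_rng(1)
scores = rng.normal(size=(200, 2))                 # two latent factors
A = rng.normal(size=(2, 6))                        # true loading pattern
X = scores @ A + 0.3 * rng.normal(size=(200, 6))
X[:10] += 10.0                                     # a block of outlying cases

classical = loadings_from_cov(np.cov(X, rowvar=False), n_factors=2)
robust = loadings_from_cov(MinCovDet(random_state=0).fit(X).covariance_, n_factors=2)
print("classical loadings:\n", np.round(classical, 2))
print("robust (MCD) loadings:\n", np.round(robust, 2))
```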

3.
In the context of randomization tests, this paper discusses the roles of exploratory data analysis (EDA) and confirmatory data analysis (CDA) in geoscience research. It shows: (1) how the classical methods of statistical inference can be used in EDA with nonrandom samples of data, and (2) how much of the knowledge in the geosciences is derived from EDA. The paper gives a FORTRAN IV computer program, CLASSTEST, that performs a randomization test for a multivariate analysis of variance (MANOVA) design. CLASSTEST will be useful in geoscience research apart from its use in illustrating EDA and CDA.
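The core of a randomization test is easy to illustrate: compute a test statistic on the observed grouping, recompute it under many random permutations of the group labels, and report the proportion of permuted statistics at least as extreme. The sketch below uses a simple two-group difference of multivariate means as the statistic; it is a generic illustration, not the CLASSTEST MANOVA program described here.

```python
import numpy as np

def randomization_test(X, labels, n_perm=4999, seed=0):
    """Permutation p-value for a two-group difference of multivariate means."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    def stat(lab):
        a, b = X[lab == 0], X[lab == 1]
        return np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2)

    observed = stat(labels)
    count = sum(stat(rng.permutation(labels)) >= observed for _ in range(n_perm))
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (30, 3)), rng.normal(0.8, 1.0, (30, 3))])
labels = np.repeat([0, 1], 30)
print("permutation p-value:", randomization_test(X, labels))
```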

4.
A new definition of depth for functional observations is introduced based on the notion of "half-region" determined by a curve. The half-region depth provides a simple and natural criterion to measure the centrality of a function within a sample of curves. It has computational advantages relative to other concepts of depth previously proposed in the literature, which makes it applicable to the analysis of high-dimensional data. Based on this depth, a sample of curves can be ordered from the center outward and order statistics can be defined. The properties of the half-region depth, such as consistency and uniform convergence, are established. A simulation study shows the robustness of this new definition of depth when the curves are contaminated. Finally, real data examples are analyzed.
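Under one natural reading of the half-region idea, the depth of a curve is the smaller of the proportion of sample curves lying entirely at or above it and the proportion lying entirely at or below it; a direct computation on discretized curves is then straightforward. The snippet is a sketch under that reading, with curves stored row-wise on a common grid; it is not the authors' implementation.

```python
import numpy as np

def half_region_depth(curves):
    """Half-region style depth of each curve in a sample.

    curves: (n_curves, n_gridpoints) array of functions on a common grid.
    Depth of curve x = min( fraction of curves entirely <= x,
                            fraction of curves entirely >= x ).
    """
    curves = np.asarray(curves, dtype=float)
    depths = np.empty(curves.shape[0])
    for i, x in enumerate(curves):
        below = np.mean(np.all(curves <= x, axis=1))  # curves in the hypograph of x
        above = np.mean(np.all(curves >= x, axis=1))  # curves in the epigraph of x
        depths[i] = min(below, above)
    return depths

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 50)
sample = np.sin(2 * np.pi * t) + rng.normal(size=(40, 1)) + 0.01 * rng.normal(size=(40, 50))
sample[0] += 3.0                                      # one contaminated curve
d = half_region_depth(sample)
print("deepest curve index:", int(np.argmax(d)), " shallowest:", int(np.argmin(d)))
```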

5.
Two exploratory data analysis techniques, the comap and the quad plot, are shown to have both strengths and shortcomings when analysing spatial multivariate datasets. A hybrid of the two techniques is proposed: the quad map, which is shown to overcome the outlined shortcomings when applied to a dataset containing weather information for disaggregate incidents of urban fires. Like the quad plot, the quad map uses Polya models to articulate the underlying assumptions behind histograms. The Polya model formalises the situation in which past fire incident counts are computed and displayed in (multidimensional) histograms as appropriate assessments of conditional probability, providing valuable diagnostics such as posterior variance, i.e. sensitivity to new information. Finally, we discuss how new technology, in particular Online Analytical Processing (OLAP) and Geographical Information Systems (GIS), offers potential for automating exploratory spatial data analysis techniques such as the quad map.
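The role assigned to Polya models here, turning raw histogram counts into assessed conditional probabilities together with a measure of sensitivity to new information, can be illustrated with a Dirichlet-multinomial (Polya) update. The sketch below is a generic illustration of that idea, not the quad map itself; the uniform prior weights and the fire-incident counts by weather category are made up.

```python
import numpy as np

def polya_histogram(counts, prior_weight=1.0):
    """Dirichlet-multinomial (Polya) assessment of histogram cell probabilities.

    Returns the posterior mean and posterior variance for each cell; the
    variance shrinks as counts accumulate, i.e. it measures sensitivity
    to new information.
    """
    counts = np.asarray(counts, dtype=float)
    alpha = prior_weight + counts            # posterior Dirichlet parameters
    total = alpha.sum()
    mean = alpha / total
    var = mean * (1.0 - mean) / (total + 1.0)
    return mean, var

# Hypothetical fire-incident counts by weather category (dry, humid, rain, storm).
counts = np.array([120, 45, 30, 5])
mean, var = polya_histogram(counts)
for name, m, v in zip(["dry", "humid", "rain", "storm"], mean, var):
    print(f"{name:6s} P = {m:.3f}  posterior sd = {np.sqrt(v):.3f}")
```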

6.
The Box Car process data compression algorithm is widely used in fieldbus control systems, and its compression performance depends on the recording limit and the compression interval. Based on extensive computations on typical simulated data, this paper analyses how the recording limit and compression interval of the Box Car algorithm affect the compression ratio, computation time, and compression coefficient for process data with a stationary trend. It also analyses how the trend and fluctuation characteristics of process data affect the compression ratio and approximation coefficient of the Box Car algorithm. The results provide guidance for tuning the parameters of the Box Car compression algorithm in practice according to the trend and noise characteristics of the process data, so as to obtain the desired compression performance.
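A minimal form of the boxcar test records a new sample only when it differs from the last recorded value by more than the recording limit, or when the time since the last recorded point exceeds the maximum compression interval. The sketch below implements that basic rule for illustration; the parameter values and the exact reconstruction convention are assumptions, not the configurations studied in the paper.

```python
def boxcar_compress(times, values, recording_limit, max_interval):
    """Basic boxcar compression: keep a point when it deviates from the last
    stored value by more than `recording_limit`, or when `max_interval`
    has elapsed since the last stored point. Returns the kept (t, v) pairs."""
    kept = [(times[0], values[0])]
    for t, v in zip(times[1:], values[1:]):
        last_t, last_v = kept[-1]
        if abs(v - last_v) > recording_limit or (t - last_t) >= max_interval:
            kept.append((t, v))
    if kept[-1][0] != times[-1]:
        kept.append((times[-1], values[-1]))  # always keep the final sample
    return kept

if __name__ == "__main__":
    import random
    random.seed(0)
    times = list(range(1000))
    # slow trend plus measurement noise, as a stand-in for a process variable
    values = [20 + 0.002 * t + 0.05 * random.gauss(0, 1) for t in times]
    kept = boxcar_compress(times, values, recording_limit=0.2, max_interval=120)
    print(f"compression ratio: {len(times) / len(kept):.1f} : 1")
```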

7.
The problems arising when there are outliers in a data set that follows a circular distribution are considered. A robust estimation of the unknown parameters is obtained using the methods of weighted likelihood and minimum disparity, each of which is defined for a general parametric family of circular data. The class of power divergences and the related residual adjustment function are investigated in order to improve the performance of the two methods, which are studied for the von Mises (circular normal) and wrapped normal distributions. The techniques are illustrated via two examples based on a real data set and a Monte Carlo study, which also enables the discussion of various computational aspects.

8.
The main aim of data analysis in biochemical metrology is the extraction of relevant information from biochemical measurements. A system of extended exploratory data analysis (EDA) based on graphical tools for sample data summarization and exploration is proposed, and the original EDA algorithm in S-Plus is available on the Internet at http://www.trilobyte.cz/EDA. Checking the basic assumptions about biochemical and medical data means examining the independence of sample elements, sample normality, and sample homogeneity. An exact assessment of the mean value and variance of steroid levels in controls is necessary for the correct assessment of samples from patients. The data examination procedures are illustrated by a determination of the mean value of 17-hydroxypregnenolone in the umbilical blood of newborns. For an asymmetric, strongly skewed sample distribution corrupted with outliers, the best estimate of location appears to be the median. The Box–Cox transformation improves sample symmetry. The proposed procedure gives reliable estimates of the mean value for an asymmetric distribution of 17-hydroxypregnenolone when the arithmetic mean cannot be used.
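For a strongly skewed sample with outliers, the workflow sketched here, comparing mean and median and then applying a Box–Cox transformation to improve symmetry, can be reproduced with standard SciPy tools. This is a generic illustration of the described procedure on simulated data, not the 17-hydroxypregnenolone measurements.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sample = rng.lognormal(mean=1.0, sigma=0.6, size=200)  # skewed positive data
sample[:5] *= 8.0                                      # a few gross outliers

print("arithmetic mean:", round(sample.mean(), 2), " median:", round(np.median(sample), 2))
print("skewness before:", round(stats.skew(sample), 2))

transformed, lam = stats.boxcox(sample)                # Box-Cox with ML-estimated lambda
print("estimated lambda:", round(lam, 2), " skewness after:", round(stats.skew(transformed), 2))

# A location estimate on the original scale: back-transform the mean of the
# transformed data (or simply report the sample median).
back = (transformed.mean() * lam + 1) ** (1 / lam) if lam != 0 else np.exp(transformed.mean())
print("back-transformed location estimate:", round(back, 2))
```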

9.
A FORTRAN program is presented which generates a statistical model of broadscale, spatially coherent data and, from that model, identifies and removes outlying data values. The algorithm also interpolates missing data values by making use of this model, together with the assumption of broadscale coherence. Examples of the application of this technique to geomagnetic data are presented. A significant improvement in the statistical efficiency and consistency of subsequent estimators is obtained by preprocessing data with this method.
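The underlying recipe, fit a smooth broad-scale model, flag points whose residuals are too large, and fill gaps from neighbouring good data, is easy to sketch in a modern language. The following is a simple stand-in (a moving-median model with a robust residual threshold and linear interpolation), not a translation of the FORTRAN program described; the window and threshold are assumptions.

```python
import numpy as np

def clean_series(t, y, window=11, k=4.0):
    """Flag outliers against a moving-median model and interpolate them away.

    t, y : 1-D arrays (y may contain np.nan for missing values).
    Returns the cleaned series and a boolean mask of values that were replaced.
    """
    y = np.asarray(y, dtype=float).copy()
    half = window // 2
    smooth = np.array([np.nanmedian(y[max(0, i - half): i + half + 1]) for i in range(len(y))])
    resid = y - smooth
    mad = np.nanmedian(np.abs(resid - np.nanmedian(resid))) + 1e-12
    bad = np.isnan(y) | (np.abs(resid) > k * 1.4826 * mad)
    y[bad] = np.interp(t[bad], t[~bad], y[~bad])   # fill from neighbouring good data
    return y, bad

t = np.arange(200.0)
y = np.sin(2 * np.pi * t / 50) + 0.05 * np.random.default_rng(5).normal(size=200)
y[[20, 90, 150]] += 5.0                            # spikes
y[[60, 61, 62]] = np.nan                           # missing values
cleaned, bad = clean_series(t, y)
print("replaced", int(bad.sum()), "values at indices", np.where(bad)[0])
```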

10.
Traditional hypothesis-driven research domains such as molecular biology are undergoing a paradigm shift in becoming progressively data-driven, enabling rapid acquisition of new knowledge. The purpose of this article is to promote an analogous development in business research. Specifically, we focus on network analysis: given the key constructs in a business research domain, we introduce a data-driven protocol applicable to business survey data to (a) discover the web of influence directionalities among the key constructs and thereby identify the critical constructs, and (b) determine the relative contributions of the constructs in predicting the levels of the critical constructs. In (a), we build a directed connectivity graph by (i) using a state-of-the-art statistical technique to perform variable selection, (ii) integrating the variable selection results to form the directed connectivity graph, and (iii) employing graph-theoretical concepts and a graph clustering technique to interpret the resulting network topology in a multi-resolution manner. In (b), based on the directed connectivity graph, multiple linear regression is performed to quantify the relations between the critical and other constructs. As a case study, the protocol is applied to analyze opinion-leading and opinion-seeking behaviors in online market communication environments. The directed connectivity relations revealed provide new ways of visualizing the web of influence directionalities among the constructs of interest, suggest new research directions to pursue, and aid decision making in marketing management. The proposed method provides a data-driven alternative to traditional confirmatory methods for analyzing relations among given constructs, and its flexibility enables business researchers to broaden the scope of research they can fruitfully engage in.
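One way to read the protocol's first stage is: regress each construct on all the others with a sparsity-inducing method and draw an edge into a construct from every predictor the method retains. The sketch below does this with LassoCV as a stand-in for the paper's variable-selection technique; the construct names and survey data are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def directed_connectivity(data, names, tol=1e-6):
    """Regress each variable on all others with the lasso and keep nonzero
    predictors as incoming edges. Returns {target: [sources]}."""
    graph = {}
    for j, target in enumerate(names):
        X = np.delete(data, j, axis=1)
        y = data[:, j]
        coefs = LassoCV(cv=5).fit(X, y).coef_
        graph[target] = [n for n, c in zip(np.delete(names, j), coefs) if abs(c) > tol]
    return graph

# Hypothetical survey constructs with a simple dependence structure.
rng = np.random.default_rng(6)
trust = rng.normal(size=300)
expertise = rng.normal(size=300)
opinion_leading = 0.8 * trust + 0.5 * expertise + 0.3 * rng.normal(size=300)
opinion_seeking = 0.6 * opinion_leading + 0.3 * rng.normal(size=300)
data = np.column_stack([trust, expertise, opinion_leading, opinion_seeking])
names = np.array(["trust", "expertise", "opinion_leading", "opinion_seeking"])
for target, sources in directed_connectivity(data, names).items():
    print(f"{target:16s} <- {sources}")
```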

11.
In this paper, we address exploratory analysis of textual data streams and propose a bootstrapping process based on a combination of keyword similarity and clustering techniques to: (i) classify documents into fine-grained similarity clusters based on keyword commonalities; (ii) aggregate similar clusters into larger document collections sharing a richer, more user-prominent keyword set that we call a topic; and (iii) assimilate newly extracted topics of the current bootstrapping cycle with existing topics from previous bootstrapping cycles by linking similar topics of different time periods, if any, to highlight topic trends and evolution. An analysis framework is also defined, enabling topic-based exploration of the underlying textual data stream from both a thematic and a temporal perspective. The bootstrapping process is evaluated on a real data stream of about 330,000 newspaper articles about politics published by the New York Times from Jan 1st 1900 to Dec 31st 2015.
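The first two steps of such a process, grouping documents by keyword commonalities and summarising each group by its most prominent keywords as a "topic", can be approximated with a TF-IDF representation and k-means. The snippet is a toy illustration of that pipeline on a handful of invented sentences, not the authors' full framework (which also links topics across time periods).

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "senate passes budget bill after long debate",
    "president vetoes budget bill over spending",
    "new climate policy announced by the government",
    "government climate targets criticized by scientists",
    "election campaign focuses on budget and taxes",
    "scientists warn climate change is accelerating",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = vectorizer.get_feature_names_out()
for c in range(km.n_clusters):
    # The "topic" of a cluster = its highest-weight terms in the centroid.
    top = km.cluster_centers_[c].argsort()[::-1][:4]
    print(f"topic {c}: {[terms[i] for i in top]}")
```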

12.
Crisp input and output data are fundamentally indispensable in traditional data envelopment analysis (DEA). However, the input and output data in real-world problems are often imprecise or ambiguous. Some researchers have proposed interval DEA (IDEA) and fuzzy DEA (FDEA) to deal with imprecise and ambiguous data in DEA. Nevertheless, many real-life problems use linguistic data that cannot be used as interval data, and a large number of input variables in fuzzy logic can require a significant number of rules to specify a dynamic model. In this paper, we propose an adaptation of standard DEA under conditions of uncertainty. The proposed approach is based on a robust optimization model in which the input and output parameters are constrained to be within an uncertainty set, with additional constraints based on the worst-case solution with respect to the uncertainty set. Our robust DEA (RDEA) model seeks to maximize efficiency (similar to standard DEA) but under the assumption of a worst-case efficiency defined by the uncertainty set and its supporting constraints. A Monte Carlo simulation is used to compute the conformity of the rankings in the RDEA model. The contribution of this paper is fourfold: (1) we consider ambiguous, uncertain and imprecise input and output data in DEA; (2) we address the gap in the imprecise DEA literature for problems not suitable or difficult to model with interval or fuzzy representations; (3) we propose a robust optimization model in which the input and output parameters are constrained to be within an uncertainty set with additional constraints based on the worst-case solution with respect to the uncertainty set; and (4) we use Monte Carlo simulation to specify a range of Gamma in which the rankings of the DMUs occur with high probability.
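For context, the standard (non-robust) input-oriented CCR model that RDEA extends can be solved DMU by DMU as a small linear program: minimise theta subject to a composite peer using no more than theta times the inputs of the DMU under evaluation while producing at least its outputs. The sketch below solves that baseline model with SciPy; it does not include the uncertainty set or worst-case constraints of the robust formulation, and the data are made up.

```python
import numpy as np
from scipy.optimize import linprog

def ccr_efficiency(inputs, outputs):
    """Input-oriented CCR efficiency score for each DMU.

    inputs : (n_dmu, n_in) array, outputs : (n_dmu, n_out) array.
    Decision variables per LP: [theta, lambda_1..lambda_n].
    """
    X, Y = np.asarray(inputs, float), np.asarray(outputs, float)
    n, m, s = X.shape[0], X.shape[1], Y.shape[1]
    scores = []
    for o in range(n):
        c = np.r_[1.0, np.zeros(n)]                 # minimise theta
        A_in = np.c_[-X[o].reshape(-1, 1), X.T]     # sum_j l_j x_ij - theta x_io <= 0
        A_out = np.c_[np.zeros((s, 1)), -Y.T]       # -sum_j l_j y_rj <= -y_ro
        A_ub = np.vstack([A_in, A_out])
        b_ub = np.r_[np.zeros(m), -Y[o]]
        bounds = [(None, None)] + [(0, None)] * n   # theta free, lambdas >= 0
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        scores.append(res.x[0])
    return np.array(scores)

inputs = np.array([[4.0, 3.0], [7.0, 3.0], [8.0, 1.0], [4.0, 2.0], [2.0, 4.0]])
outputs = np.array([[1.0], [1.0], [1.0], [1.0], [1.0]])
print(np.round(ccr_efficiency(inputs, outputs), 3))
```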

13.
The current computational power and some recently developed algorithms allow a new automatic spectral analysis method for randomly missing data. Accurate spectra and autocorrelation functions are computed from the estimated parameters of time series models, without user interaction. If only a few data points are missing, the accuracy is almost the same as when all observations are available. For larger missing fractions, low-order time series models can still be estimated with good accuracy if the total observation time is long enough. Autoregressive models are best estimated with the maximum likelihood method if data are missing. Maximum likelihood estimates of moving average and of autoregressive moving average models are not very useful with missing data; those models are found most accurately if they are derived from the estimated parameters of an intermediate autoregressive model. With statistical criteria for the selection of model order and model type, a completely automatic and numerically reliable algorithm is developed that estimates the spectrum and the autocorrelation function in randomly missing data problems. The accuracy is better than what can be obtained with other methods, including the well-known expectation-maximization (EM) algorithm.
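The final step of such a method, turning estimated autoregressive parameters into a spectrum, is simple to illustrate. The snippet below fits an AR model with the Yule-Walker equations on fully observed data and evaluates the parametric spectrum; it illustrates only the "spectrum from time-series parameters" idea, not the missing-data maximum-likelihood estimation or the automatic order and type selection described in the abstract.

```python
import numpy as np

def yule_walker(x, order):
    """Estimate AR(order) coefficients and innovation variance via Yule-Walker."""
    x = np.asarray(x, float) - np.mean(x)
    n = len(x)
    r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    phi = np.linalg.solve(R, r[1: order + 1])
    sigma2 = r[0] - phi @ r[1: order + 1]
    return phi, sigma2

def ar_spectrum(phi, sigma2, freqs):
    """Parametric AR spectrum S(f) = sigma2 / |1 - sum_k phi_k exp(-i 2 pi f k)|^2."""
    k = np.arange(1, len(phi) + 1)
    denom = np.abs(1 - np.exp(-2j * np.pi * np.outer(freqs, k)) @ phi) ** 2
    return sigma2 / denom

rng = np.random.default_rng(7)
x = np.zeros(2000)
for t in range(2, 2000):                          # simulate an AR(2) process
    x[t] = 1.5 * x[t - 1] - 0.75 * x[t - 2] + rng.normal()
phi, sigma2 = yule_walker(x, order=2)
freqs = np.linspace(0.0, 0.5, 6)
print("estimated AR coefficients:", np.round(phi, 2))
print("spectrum at", freqs, ":", np.round(ar_spectrum(phi, sigma2, freqs), 2))
```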

14.
Application of outlier mining to the analysis of university science and technology statistics   (Cited: 1; self-citations: 0; citations by others: 1)
Outlier mining is a valuable and important form of knowledge discovery: studying the anomalous behaviour of outliers can reveal valuable information hidden in data. After introducing outliers and outlier mining algorithms, this paper discusses a distance-sum-based outlier mining algorithm and applies it, in a novel way, to the analysis of university science and technology statistics. The results show that the algorithm can effectively uncover anomalies in such statistics and plays an important role in verifying the authenticity of the data.
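A distance-sum rule of the kind described, scoring each record by the sum of its distances to all other records and flagging the largest scores, takes only a few lines of NumPy. This is a generic sketch of that rule on standardised data, not the paper's exact implementation; the per-institution figures and the number of flagged records are arbitrary illustrations.

```python
import numpy as np

def distance_sum_outliers(X, n_outliers=3):
    """Rank records by the sum of Euclidean distances to all other records
    (computed on standardised columns) and return the top-scoring indices."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    diff = Z[:, None, :] - Z[None, :, :]
    dist_sum = np.sqrt((diff ** 2).sum(axis=2)).sum(axis=1)
    return np.argsort(dist_sum)[::-1][:n_outliers]

# Hypothetical per-institution statistics: [funding, projects, papers, patents]
rng = np.random.default_rng(8)
stats = rng.normal(loc=[100, 50, 200, 10], scale=[20, 10, 40, 3], size=(60, 4))
stats[7] = [400, 5, 900, 0]                        # an implausible record
print("suspected anomalous records:", distance_sum_outliers(stats))
```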

15.
The response process of problem-solving items contains rich information about respondents' behaviours and cognitive processes in digital tasks, but extracting this information is a big challenge. The aim of the study is to use a data-driven approach to explore the latent states and state transitions underlying the problem-solving process in order to reflect test-takers' behavioural patterns, and to investigate how these states and state transitions are associated with test-takers' performance. We employed the Hidden Markov Modelling approach to identify test-takers' hidden states during the problem-solving process and compared the frequency of states and/or state transitions between different performance groups. We conducted comparable studies on two problem-solving items, with a focus on the US sample collected in PIAAC 2012, and examined the correlation between those frequencies across the two items. Latent states and the transitions between them underlying the problem-solving process were identified and found to differ significantly by performance group. The groups with correct responses on both items were more engaged in the tasks and more often used efficient tools to solve the problems, while the group with incorrect responses was more likely to use shorter action sequences and exhibit hesitant behaviours. Consistent behavioural patterns were identified across items. This study demonstrates the value of the data-driven HMM approach for better understanding respondents' behavioural patterns and cognitive transitions underneath the observable action sequences in complex problem-solving tasks.
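The modelling step can be illustrated with the decoding half of an HMM: given transition, emission, and initial probabilities, the Viterbi algorithm recovers the most likely sequence of hidden states behind an observed action sequence, and the recovered state paths can then be compared across performance groups. The snippet below is a self-contained toy example with made-up parameters and action codes, not the fitted PIAAC models.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observed sequence (log-space Viterbi)."""
    n_states, T = len(start_p), len(obs)
    logd = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)
    logd[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            cand = logd[t - 1] + np.log(trans_p[:, s])
            back[t, s] = int(np.argmax(cand))
            logd[t, s] = cand[back[t, s]] + np.log(emit_p[s, obs[t]])
    path = [int(np.argmax(logd[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

# Toy states: 0 = "exploring", 1 = "executing"; actions: 0 = open menu, 1 = use tool, 2 = submit.
start = np.array([0.7, 0.3])
trans = np.array([[0.8, 0.2], [0.3, 0.7]])
emit = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])
actions = [0, 0, 1, 1, 1, 2]
print("decoded hidden states:", viterbi(actions, start, trans, emit))
```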

16.
17.
The threat of cyber attacks motivates the need to monitor Internet traffic data for potentially abnormal behavior. Due to the enormous volumes of such data, statistical process monitoring tools, such as those traditionally used on data in the product manufacturing arena, are inadequate. "Exotic" data may indicate a potential attack; detecting such data requires a characterization of "typical" data. We devise some new graphical displays, including a "skyline plot," that permit ready visual identification of unusual Internet traffic patterns in "streaming" data, and use appropriate statistical measures to help identify potential cyber attacks. These methods are illustrated on a moderate-sized data set (135,605 records) collected at George Mason University.

18.
Data warehousing is a technology that has emerged and developed rapidly in recent years; it makes full use of the information already stored in a data warehouse to support decision makers. After introducing the definition of a data warehouse and the principles of data-warehouse-based decision support systems, this paper discusses data extraction and transformation methods in detail, explores the implementation of a data warehouse through a worked example with online analytical processing, and finally applies data mining methods in practice. A commercial data warehouse is designed with Microsoft SQL Server 2000, allowing the reader to clearly understand how a data warehouse decision support system is implemented.

19.
To address the safety analysis of chemical processes, a new safety analysis solution for chemical processes is proposed by drawing on data dependence techniques from computer science. Taking a two-tank liquid level control system as an example, the process flow and the relationships among variables are analysed; nine states, ten transitions, and the conditions, events, and execution procedures of the transitions are extracted to build an extended finite state machine model. By examining variable L2 in transition T8, analysing its data dependence paths, and determining the positive and negative data dependence relations, a new data-dependence-based safety analysis method for chemical processes is realised; its effectiveness is further verified by analysing variable L2 in transition T4. The extended finite state machine data dependence technique thus becomes a new and effective way to carry out safety analysis of chemical processes through automated computer reasoning.
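An extended finite state machine of the kind described can be represented directly as states plus transitions carrying a guard condition, an event, and the variables each transition reads and writes, so that data dependence paths can be traced. The sketch below shows such a representation for a made-up fragment of a two-tank level control system; the state and variable names are illustrative, not the nine states and ten transitions of the paper's model.

```python
from dataclasses import dataclass, field

@dataclass
class Transition:
    source: str
    target: str
    event: str
    guard: str                                  # textual condition, e.g. "L2 > high_limit"
    reads: set = field(default_factory=set)     # variables the guard/action reads
    writes: set = field(default_factory=set)    # variables the action writes

def data_dependent_transitions(transitions, variable):
    """Transitions whose behaviour depends (directly) on `variable`:
    they read it, or they leave a state reached by a transition that writes it."""
    direct = [t for t in transitions if variable in t.reads]
    writers = {t.target for t in transitions if variable in t.writes}
    downstream = [t for t in transitions if t.source in writers]
    return direct, downstream

# Illustrative fragment of a two-tank level control EFSM (names are made up).
efsm = [
    Transition("Filling", "Steady", "level_ok", "L1 >= setpoint", reads={"L1"}, writes={"valve1"}),
    Transition("Steady", "Draining", "high_alarm", "L2 > high_limit", reads={"L2"}, writes={"valve2"}),
    Transition("Draining", "Steady", "level_restored", "L2 <= high_limit", reads={"L2"}, writes={"L2"}),
]
direct, downstream = data_dependent_transitions(efsm, "L2")
print("read L2:", [(t.source, t.target) for t in direct])
print("follow a write of L2:", [(t.source, t.target) for t in downstream])
```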

20.
Two approaches are presented for performing principal component analysis (PCA) on data which contain both outlying cases and missing elements. First, an eigendecomposition of a covariance matrix which can deal with such data is proposed, but this approach is not suited to data where the number of variables exceeds the number of cases. Alternatively, an expectation robust (ER) algorithm is proposed to adapt the existing methodology for robust PCA to data containing missing elements. According to an extensive simulation study, the ER approach performs well for all data sizes concerned. Using simulations and an example, it is shown that, by virtue of the ER algorithm, the properties of the existing methods for robust PCA carry through to data with missing elements.
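The general flavour of an expectation-robust style scheme, alternating between filling in missing entries from the current low-rank fit and re-estimating the fit while downweighting rows with large residuals, can be sketched in a few lines. This is only a rough illustration of that kind of iteration, not the ER algorithm proposed in the paper; the weighting rule and stopping criterion are assumptions.

```python
import numpy as np

def robust_pca_missing(X, n_components=2, n_iter=50):
    """Iterative PCA for data with missing entries (NaN) and outlying rows.

    E-step: replace missing entries by the current rank-k reconstruction.
    R-step: recompute a weighted PCA, downweighting rows with large residuals.
    A rough sketch of an expectation-robust style iteration, not the ER algorithm.
    """
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    Xf = np.where(miss, np.nanmedian(X, axis=0), X)   # initial fill: column medians
    w = np.ones(X.shape[0])
    for _ in range(n_iter):
        mu = np.average(Xf, axis=0, weights=w)
        U, s, Vt = np.linalg.svd((Xf - mu) * np.sqrt(w)[:, None], full_matrices=False)
        V = Vt[:n_components].T                       # principal axes
        scores = (Xf - mu) @ V
        recon = mu + scores @ V.T
        Xf[miss] = recon[miss]                        # E-step: refill missing cells
        resid = np.sqrt(((Xf - recon) ** 2).sum(axis=1))
        cutoff = np.median(resid) + 3 * (np.median(np.abs(resid - np.median(resid))) + 1e-12)
        w = np.where(resid <= cutoff, 1.0, 0.01)      # R-step: downweight outlying rows
    return scores, V, miss

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(100, 6))
X[:5] += 15.0                                         # outlying rows
X[rng.random(X.shape) < 0.1] = np.nan                 # ~10% missing cells
scores, axes, miss = robust_pca_missing(X)
print("recovered", scores.shape[1], "components;", int(miss.sum()), "missing cells imputed")
```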
