Similar Documents
20 similar documents found.
1.
In this study, a simple screening algorithm was developed to prevent the occurrence of Type II errors, that is, samples with high prediction error that are not detected as outliers. The method is used to distinguish “good” from “bad” spectra and to prevent a false negative condition in which poorly predicted samples appear to be within the calibration space, yet have inordinately large residual or prediction errors. The detection and elimination of this type of sample, which is a true outlier but not easily detected, is extremely important in medical decisions, since such erroneous data can lead to considerable mistakes in clinical analysis and medical diagnosis. The algorithm is based on a cross-correlation comparison between sample spectra measured over the region of 4160–4880 cm⁻¹. The correlation values are converted using Fisher's z-transform, and a z-test of the transformed values is performed to screen out outlier spectra. This approach provides a tuning parameter that can be used to decrease the percentage of samples with high analytical (residual) errors. The algorithm was tested using a dataset with known reference values to determine the number of false negative and false positive samples. The cross-correlation algorithm's performance was tested on several hundred blood samples prepared at different hematocrit (24 to 48%) and glucose (30 to 500 mg/dL) levels using blood component materials from thirteen healthy human volunteers. Experimental results illustrate the effectiveness of the proposed algorithm in finding and screening out Type II outliers in terms of sensitivity and specificity, and its ability to predict or estimate future or validation datasets with a lower error of prediction. To our knowledge, this is the first paper to introduce a statistically useful screening method based on spectral cross-correlation to detect the occurrence of Type II outliers (false negative samples) for routine analysis in a clinically relevant application for medical diagnosis.
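The screening step can be sketched as follows. This is a minimal illustration, assuming a calibration ("reference") set of good spectra and a one-sided cutoff; the function and parameter names are illustrative, not the paper's, and the spectral window and tuning parameter would be set as described in the study.

```python
import numpy as np
from scipy.stats import norm

def screen_spectra(candidate_spectra, reference_spectra, alpha=0.01):
    """Flag candidate spectra whose correlation with the mean calibration
    spectrum is atypically low, using Fisher's z-transform and a z-test.
    `alpha` plays the role of the tuning parameter mentioned above."""
    ref_mean = reference_spectra.mean(axis=0)

    # Pearson correlation of each spectrum with the mean calibration spectrum
    def corr_with_ref(spectra):
        return np.array([np.corrcoef(s, ref_mean)[0, 1] for s in spectra])

    # Fisher's z-transform stabilises the variance of the correlation values
    z_cand = np.arctanh(corr_with_ref(candidate_spectra))
    z_ref = np.arctanh(corr_with_ref(reference_spectra))

    # z-test against the calibration ("good") spectra; unusually low
    # correlation marks a potential Type II outlier to be screened out
    z_score = (z_cand - z_ref.mean()) / z_ref.std(ddof=1)
    return z_score < norm.ppf(alpha)   # True = screen out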

2.
A multivariate dataset consists of n cases in d dimensions, and is often stored in an n by d data matrix. It is well-known that real data may contain outliers. Depending on the situation, outliers may be (a) undesirable errors, which can adversely affect the data analysis, or (b) valuable nuggets of unexpected information. In statistics and data analysis, the word outlier usually refers to a row of the data matrix, and the methods to detect such outliers only work when at least half the rows are clean. But often many rows have a few contaminated cell values, which may not be visible by looking at each variable (column) separately. We propose the first method to detect deviating data cells in a multivariate sample which takes the correlations between the variables into account. It has no restriction on the number of clean rows, and can deal with high dimensions. Other advantages are that it provides predicted values of the outlying cells, while imputing missing values at the same time. We illustrate the method on several real datasets, where it uncovers more structure than found by purely columnwise methods or purely rowwise methods. The proposed method can help to diagnose why a certain row is outlying, for example, in process control. It also serves as an initial step for estimating multivariate location and scatter matrices.
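A very rough sketch of cellwise flagging that uses the correlations between variables is given below. It only illustrates the idea and is not the authors' full algorithm (which uses robust correlations, tolerates contaminated predictors, and imputes the flagged cells); the cutoff and the plain least-squares step are illustrative assumptions.

```python
import numpy as np

def flag_deviating_cells(X, cutoff=3.0):
    """Robustly standardise each column, predict every cell from the other
    columns by least squares, and flag cells with large standardised
    residuals."""
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - med), axis=0) + 1e-12
    Z = (X - med) / mad                       # columnwise robust z-scores

    n, d = Z.shape
    flags = np.zeros_like(Z, dtype=bool)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        # Predict column j from the other (correlated) columns
        b, *_ = np.linalg.lstsq(Z[:, others], Z[:, j], rcond=None)
        resid = Z[:, j] - Z[:, others] @ b
        scale = 1.4826 * np.median(np.abs(resid - np.median(resid))) + 1e-12
        flags[:, j] = np.abs(resid) > cutoff * scale
    return flags                               # True marks a deviating cell
```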

3.
An effectively designed product platform is vital to the final product family derived from it. Product platform design consists of platform configuration, i.e., deciding which variables to make common across the product family, and of determining the optimal values of the platform and scaling variables for all product variants. Many existing product family design methods assume a given platform configuration, i.e., the platform variables are specified a priori by designers. However, selecting the right combination of common and scaling variables is not trivial. Most approaches are single-platform methods, in which design variables are either shared across all product variants or not at all. In multiple-platform design, by contrast, a platform variable can take a value that is common to only a subset of product variants within the product family, offering opportunities for superior overall designs. This paper proposes a quantitative method for scale-based multiple-platform design using clustering analysis and Shannon's entropy. Optimization methods are used to design the product family by holding the values of the platform variables constant and finding the best values of the scaling variables. An information-theoretic approach, based on a clustering analysis of the individually designed products, is used to help select the platform variables. Validity analysis is performed to determine the optimal settings for the platform variables. Local clustering is further performed on each platform variable to establish subsets of variants such that variants within a subset are more similar to each other than they are to variants in other subsets; a common value is then used to represent the values of the variants in each subset. A case study illustrates the proposed method, and the design solutions are compared with those found by methods in the previous literature. The comparison verifies that multiple-platform design can lead to superior product family solutions.
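The entropy-based screening of candidate platform variables can be sketched as follows; the cluster count, the use of k-means, and the ranking rule are illustrative assumptions rather than the paper's exact procedure, and `design_vars` in the usage comment is a hypothetical mapping of variable names to their values across variants.

```python
import numpy as np
from sklearn.cluster import KMeans

def platform_entropy(values, n_clusters=3):
    """Shannon entropy of the cluster membership of one design variable
    across the individually optimised variants. Low entropy means the
    variants already concentrate on a few values, so the variable is a
    promising (possibly multiple-platform) commonality candidate."""
    values = np.asarray(values, dtype=float).reshape(-1, 1)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(values)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Rank candidate platform variables, smaller entropy first, e.g.
# ranking = sorted(design_vars, key=lambda name: platform_entropy(design_vars[name]))
```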

4.
Statistical process control consists of tools and techniques that are useful for improving a process or ensuring that a process is in a stable and satisfactory state. In many modern industrial applications, it is critically important to simultaneously monitor two or more correlated process quality variables, thus making multivariate statistical process control (MSPC) an important area of research for the new century. Nevertheless, most existing MSPC research is based on the assumption that the process data follow a multivariate normal distribution or another known distribution. However, it is well recognized that in many applications the underlying process distribution is unknown. In practice, among a set of correlated variables to be monitored, there is often a subset of variables that are easy and/or inexpensive to measure, whereas the remaining variables are difficult and/or expensive to measure but contain information that may help detect a shift in the process mean more quickly. We are therefore motivated to develop a Phase II control chart to monitor the variable-dimension (VD) mean vector of unknown multivariate processes. The proposed chart is based on the exponentially weighted moving average (EWMA) of a depth-based statistic. The proposed chart is shown to detect mean shifts faster than the existing VD T² and VD EWMA T² charts studied by Aparisi et al. and Epprecht et al., respectively.
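The chart statistic can be illustrated with a generic EWMA of a depth value. The sketch below is a simplification that fixes the dimension and uses a Mahalanobis-type depth as a stand-in for the depth-based statistic of the paper; the smoothing constant and the control limit (which would normally be calibrated to a target in-control ARL) are illustrative.

```python
import numpy as np

def mahalanobis_depth(x, mean, cov_inv):
    """Mahalanobis depth: 1 / (1 + squared Mahalanobis distance)."""
    d = x - mean
    return 1.0 / (1.0 + d @ cov_inv @ d)

def ewma_depth_chart(phase2, reference, lam=0.1, lcl=0.0):
    """EWMA of a depth statistic for Phase II monitoring. The EWMA starts
    at the in-control mean depth; a small EWMA value (below `lcl`) signals
    a possible shift in the process mean."""
    mean = reference.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(reference, rowvar=False))
    in_control = np.mean([mahalanobis_depth(x, mean, cov_inv) for x in reference])

    z, signals = in_control, []
    for x in phase2:
        z = lam * mahalanobis_depth(x, mean, cov_inv) + (1 - lam) * z
        signals.append(z < lcl)
    return np.array(signals)
```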

5.
The standardized difference in estimated Bayes risk between two subsets of groups of allocation variables is proposed as a test statistic for additional classification accuracy. This test is used in a minimal-best-subset algorithm that aims to select the optimal subset for the data at hand—that is, the smallest subset retaining most of the classification accuracy. A multivariate normal example confirms that all-possible-subsets and minimal-best discrimination procedures based on Wilks's lambda and Rao's test usually do not identify the best subsets according to estimated Bayes risk. The minimal-best discrimination subset was suboptimal in all of 100 bootstrapped samples: It contained too many groups in every case. In contrast, the minimal-best classification selected an optimal subset for 82 out of 100 bootstrap examples; appending a test of additional accuracy of the minimal-best subset versus the overall-best subset led to an optimal subset in the other 18 cases by suggesting the addition of more groups.

6.
This paper reports on the transfer of calibration models between Fourier transform near-infrared (FT-NIR) instruments from four different manufacturers. The piecewise direct standardization (PDS) method is compared with the new hybrid calibration method known as prediction augmented classical least squares/partial least squares (PACLS/PLS). The success of a calibration transfer experiment is judged by prediction error and by the number of samples that are flagged as outliers that would not have been flagged as such if a complete recalibration were performed. Prediction results must be acceptable and the outlier diagnostics capabilities must be preserved for the transfer to be deemed successful. Previous studies have measured the success of a calibration transfer method by comparing only the prediction performance (e.g., the root mean square error of prediction, RMSEP). However, our study emphasizes the need to consider outlier detection performance as well. As our study illustrates, the RMSEP values for a calibration transfer can be within acceptable range; however, statistical analysis of the spectral residuals can show that differences in outlier performance can vary significantly between competing transfer methods. There was no statistically significant difference in the prediction error between the PDS and PACLS/PLS methods when the same subset sample selection method was used for both methods. However, the PACLS/PLS method was better at preserving the outlier detection capabilities and therefore was judged to have performed better than the PDS algorithm when transferring calibrations with the use of a subset of samples to define the transfer function. The method of sample subset selection was found to make a significant difference in the calibration transfer results using the PDS algorithm, while the transfer results were less sensitive to subset selection when the PACLS/PLS method was used.
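As background to the comparison above, piecewise direct standardization builds a banded transfer matrix from spectra of the same transfer samples measured on both instruments. The sketch below uses ordinary least squares within each window for brevity, whereas PDS implementations typically use PLS or PCR locally; the window size and names are illustrative.

```python
import numpy as np

def pds_transfer_matrix(master, slave, window=5):
    """Piecewise direct standardization: for every master wavelength j,
    regress the master responses of the transfer samples on a small window
    of slave wavelengths, and put the coefficients on the corresponding
    band of the transfer matrix F (slave wavelengths x master wavelengths)."""
    n, p = master.shape
    F = np.zeros((slave.shape[1], p))
    half = window // 2
    for j in range(p):
        lo, hi = max(0, j - half), min(slave.shape[1], j + half + 1)
        coef, *_ = np.linalg.lstsq(slave[:, lo:hi], master[:, j], rcond=None)
        F[lo:hi, j] = coef
    return F

# Map spectra measured on the secondary instrument onto the master scale:
# corrected = new_slave_spectra @ F
```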

7.
The transfer of a calibration model for determining fiber content in flax stem was accomplished between two near-infrared spectrometers that are of the same brand but still require standardization. In this paper, three factors were investigated to obtain the best standardization result: the transfer sample set, the spectral type, and the standardization method. Twelve standardization files were produced from two transfer sample sets (sealed reference standards and a subset of the prediction set), two types of transfer sample spectra (raw and preprocessed), and three standardization methods (direct standardization (DS), piecewise direct standardization (PDS), and double window piecewise direct standardization (DWPDS)). The efficacy of the model transfer was evaluated using the root mean square error of prediction, calculated on the independent prediction samples. Results indicated that standardization using the sealed reference standards was unacceptable, whereas standardization using the prediction subset was adequate. Using the preprocessed spectra of the transfer samples led to successful calibration transfers, especially for the PDS and DWPDS corrections. Finally, standardization using the prediction subset and its preprocessed spectra with the DWPDS correction proved to be the best method for transferring the model.

8.
Technometrics, 2012, 54(4): 445-458
There has been extensive work on data depth-based methods for robust multivariate data analysis. Recent developments have moved to infinite-dimensional objects, such as functional data. In this work, we propose a notion of depth for functional data, the total variation depth, which has many desirable features and is well suited for outlier detection. The proposed depth takes the form of an integral of a univariate depth function. We show that this novel formulation of the total variation depth leads to a useful decomposition associated with the shape and magnitude outlyingness of functional data. Compared to magnitude outliers, shape outliers are often masked among the rest of the samples and are more difficult to identify. We further develop an effective procedure and visualization tools for detecting both types of outliers, while naturally accounting for the correlation in functional data. The outlier detection performance is investigated through simulations under various outlier models. Finally, the proposed methodology is demonstrated using real datasets of curves, images, and video frames.

9.
Perforations or polymeric membranes are not capable of simultaneously providing optimum O2 and CO2 levels for many fruits and vegetables contained in modified atmosphere packaging. However, combining these two gas transfer devices, either in series or in parallel, can provide the required gas selectivities to create optimal modified atmosphere conditions. A methodology for determining the perforation and membrane surface areas for individual and combined systems is described. Gradient diagrams are used to calculate the optimum system selectivity, ΔpO2 and ΔpCO2. These values can be used to select the appropriate gas exchange devices and to determine the appropriate perforation and membrane surface area.

10.
Selecting a suitable equation to represent a set of multifactor data that was collected for other purposes in a plant, pilot-plant, or laboratory can be troublesome. If there are k independent variables, there are 2^k possible linear equations to be examined; one equation using none of the variables, k using one variable, k(k − 1)/2 using two variables, etc. Often there are several equally good candidates. Selection depends on whether one needs a simple interpolation formula or estimates of the effects of individual independent variables. Fractional factorial designs for sampling the 2^k possibilities and a new statistic proposed by C. Mallows simplify the search for the best candidate. With the new statistic, regression equations can be compared graphically with respect to both bias and random error.
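The statistic in question is Mallows's C_p = SSE_p / s² − n + 2p, where SSE_p is the residual sum of squares of a subset model with p fitted parameters and s² is the error variance estimate from the full model; good subsets have C_p small and close to p. A brute-force sketch follows (exhaustive over all 2^k subsets rather than the fractional sampling discussed above); names are illustrative.

```python
import numpy as np
from itertools import combinations

def all_subsets_cp(X, y):
    """Mallows's C_p for every subset of the k candidate regressors
    (p counts the fitted parameters, intercept included; s^2 comes from
    the full model). Returns a list of (subset, p, C_p)."""
    n, k = X.shape
    full = np.column_stack([np.ones(n), X])
    resid_full = y - full @ np.linalg.lstsq(full, y, rcond=None)[0]
    s2 = resid_full @ resid_full / (n - k - 1)

    out = []
    for size in range(k + 1):
        for subset in combinations(range(k), size):
            Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            resid = y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]
            p = len(subset) + 1
            out.append((subset, p, resid @ resid / s2 - n + 2 * p))
    return out
```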

11.
Outliers are one of the main concerns in statistics. Parameter estimates obtained by ordinary least squares are sensitive to outliers. Many robust estimators have been proposed to overcome this problem, but existing methods still have drawbacks. In this paper, a novel probabilistic method is proposed for robust parametric identification and outlier detection in linear regression problems. The crux of the method is to calculate an outlier probability, which quantifies how probable it is that a given data point is an outlier. The proposed method has several appealing features. First, not only the optimal values of the parameters and residuals but also the associated uncertainties are taken into account for outlier detection. Second, the size of the dataset is incorporated, because it is one of the key variables determining the probability of obtaining a large-residual data point. Third, the method requires no information about the outlier distribution model. Fourth, it directly provides the probability of being an outlier. In the illustrative examples, the proposed method is compared with three well-known methods. It turns out to be substantially superior and capable of robust parametric identification and outlier detection even in very challenging situations.

12.
One of the often-stated goals of principal component analysis is to reduce into a low-dimensional space most of the essential information contained in a high-dimensional space. According to several reasonable criteria, principal components do this optimally. From a practical point of view, however, principal components suffer from the disadvantage that each component is a linear combination of all of the original variables. Thus interpretation of the results and possible subsequent data collection and analysis still involve all of the variables. An alternative approach is to select a subset of variables that contain, in some sense, as much information as possible. Methods for selecting such “principal variables” are presented and illustrated with examples.
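One common way to pick such "principal variables" is a greedy search that, at each step, adds the variable that most reduces the conditional covariance of the variables not yet selected. The sketch below illustrates this idea under that assumption; it is not necessarily the exact procedure of the article, and the function name is illustrative.

```python
import numpy as np

def principal_variables(X, m):
    """Greedy forward selection of m variables: at each step add the
    variable that most reduces the trace of the covariance of the
    remaining variables conditional on those already selected."""
    S = np.cov(X, rowvar=False)
    d = S.shape[0]
    selected = []
    for _ in range(m):
        best_j, best_trace = None, np.inf
        for j in range(d):
            if j in selected:
                continue
            chosen = selected + [j]
            rest = [i for i in range(d) if i not in chosen]
            if not rest:                      # every variable selected
                best_j, best_trace = j, 0.0
                break
            S_cc = S[np.ix_(chosen, chosen)]
            S_rc = S[np.ix_(rest, chosen)]
            cond = S[np.ix_(rest, rest)] - S_rc @ np.linalg.solve(S_cc, S_rc.T)
            if np.trace(cond) < best_trace:
                best_j, best_trace = j, np.trace(cond)
        selected.append(best_j)
    return selected                           # column indices of the "principal variables"
```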

13.
This article provides methods for constructing simultaneous prediction intervals to contain the means of the dependent variable in a regression model for each of k future samples at k sets of values of the independent variables, some or all of which may be different. The methods are compared with previously proposed approximate procedures. The construction of simultaneous confidence intervals to contain the true regression at all of k vectors of independent variables is also presented.
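A conservative Bonferroni version of such simultaneous intervals for the mean response at k new predictor vectors can be sketched as follows; it is a simple stand-in for the procedures developed in the article, and it assumes the design matrices already contain an intercept column.

```python
import numpy as np
from scipy import stats

def simultaneous_mean_intervals(X, y, X_new, alpha=0.05):
    """Bonferroni-adjusted simultaneous confidence intervals for the mean
    response at the k rows of X_new. X and X_new are design matrices that
    already include the intercept column."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)
    XtX_inv = np.linalg.inv(X.T @ X)

    k = X_new.shape[0]
    tcrit = stats.t.ppf(1 - alpha / (2 * k), df=n - p)     # Bonferroni split of alpha
    fit = X_new @ beta
    half = tcrit * np.sqrt(s2 * np.einsum('ij,jk,ik->i', X_new, XtX_inv, X_new))
    return np.column_stack([fit - half, fit + half])
```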

14.
In order to eliminate the influence of unavoidable outliers in the training sample on a model's performance, a novel least squares support vector machine regression that combines an outlier detection approach with adaptive weighting of the training sample is proposed, named adaptive weighted least squares support vector machine regression (AWLS-SVM). First, the robust 3σ principle is used to detect marked outliers in the training sample. Second, based on the training sample without the marked outliers, least squares support vector machine regression is used to develop the model, and the fitting error of each sample is obtained. Third, an initial weight is calculated from the fitting error of each sample: the larger the fitting error, the smaller the weight. Thus, potential outliers that are not detected by the robust 3σ principle but have larger fitting errors receive smaller weights, which reduces their influence on the model. LS-SVM is then applied to the weighted sample to develop the model again. Finally, the proposed iterative weighting procedure drives the weights of the training sample to convergence, and a model with good predictive performance is obtained. To illustrate the performance of AWLS-SVM, a simulation experiment was designed to produce a training sample with a marked outlier and some unmarked outliers. AWLS-SVM, AWLS-SVM without the robust 3σ principle, LS-SVM with the robust 3σ principle, LS-SVM, and a radial basis function network were applied to develop models on the designed sample. The results show that AWLS-SVM eliminates the influence of both marked and unmarked outliers on the model's performance and that AWLS-SVM has the best predictive performance. Furthermore, the AWLS-SVM method was applied to develop a quantitative structure–activity relationship (QSAR) model of HIV-1 protease inhibitors, and satisfactory results were obtained.
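The reweighting loop can be sketched as below. scikit-learn's KernelRidge is used as a stand-in for LS-SVM (the two are closely related ridge-type kernel regressions), the robust 3σ screen is applied to the response as a simplification, and the weight formula and iteration count are illustrative rather than the paper's exact scheme.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def adaptive_weighted_fit(X, y, n_iter=5, gamma=0.5):
    """Drop marked outliers with a robust 3-sigma (median/MAD) screen, fit,
    then iteratively down-weight samples with large fitting errors and
    refit until the weights stabilise."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)

    # Robust 3-sigma screen (applied here to the response, as a simplification)
    med = np.median(y)
    mad = 1.4826 * np.median(np.abs(y - med)) + 1e-12
    keep = np.abs(y - med) <= 3 * mad
    X, y = X[keep], y[keep]

    w = np.ones(len(y))
    model = KernelRidge(kernel="rbf", alpha=1.0, gamma=gamma)
    for _ in range(n_iter):
        model.fit(X, y, sample_weight=w)
        err = np.abs(y - model.predict(X))
        scale = np.median(err) + 1e-12
        w = 1.0 / (1.0 + (err / scale) ** 2)   # larger error -> smaller weight
    return model
```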

15.
The work summarised in this paper presents the second part of a two-paper series on quantitative whole-spectrum analysis of skimmed milk with MALDI-TOF MS. In Part I, experiments were carried out to search for optimal sample preparation and instrumental settings in terms of signal-to-noise ratios and repeatability. Those results were utilised in the present study to predict the concentrations of cow, goat and ewe milk in mixed milk samples. Partial least squares regression was combined with suitable pre- and post-processing of the spectra and concentration responses. A plotting method was used in which predictions are visualised as a mixture design. The objective was to show that MALDI-TOF MS has potential for use in quantitative analysis without involving peak comparison or other types of expert-guided analysis. Predictions on a validation data set gave promising results, with the best RMSEP values ranging from 5.4% (w/w) to 6.5% (w/w) for the different milk types used, and corresponding R²pred values ranging from 94.5% to 96.2%. This indicates that MALDI-TOF is sufficiently accurate and repeatable to be used in practical applications of quantitative analysis. Three variable selection strategies based on visual inspection and regression modelling were also evaluated. All of them were outperformed, with regard to prediction error, by the use of whole spectra and multivariate regression. The results indicate that multivariate regression on whole spectra can be far more effective than using a few selected variables.
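The whole-spectrum calibration and its validation error can be sketched with scikit-learn's PLS implementation; the number of latent variables and any spectral pre-processing are illustrative choices, not the settings tuned in the paper.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def pls_whole_spectrum(X_cal, Y_cal, X_val, Y_val, n_components=10):
    """Fit a whole-spectrum PLS calibration and report the validation-set
    RMSEP per response column (here, the fraction of each milk type)."""
    pls = PLSRegression(n_components=n_components)
    pls.fit(X_cal, Y_cal)
    pred = pls.predict(X_val)
    Y_val = np.asarray(Y_val, dtype=float).reshape(pred.shape)
    rmsep = np.sqrt(np.mean((Y_val - pred) ** 2, axis=0))
    return pls, rmsep
```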

16.
This article presents a new augmentation method to eliminate multicollinearity in observational datasets that contain several correlated variables. The purpose is to eliminate the correlations in order to facilitate the application of the least squares regression method. The procedure is based on adding new observations up to the point at which an appropriate linear regression model can be constructed. The original data may be observational, but the new information is obtained through designed experiments. The proposed method uses the R3 algorithm to perform the augmentations and the VIF statistic to determine the point at which the correlations have been significantly reduced.
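The stopping criterion can be checked with the variance inflation factor, as sketched below using statsmodels; the threshold of 10 and the augmentation helper in the comment are illustrative assumptions, and the R3 augmentation step itself is not reproduced here.

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

def max_vif(X):
    """Largest variance inflation factor among the columns of the regressor
    matrix X (append a constant column first if the model has an intercept,
    and skip it when interpreting the VIFs)."""
    X = np.asarray(X, dtype=float)
    return max(variance_inflation_factor(X, i) for i in range(X.shape[1]))

# Augmentation loop (add_designed_run is a hypothetical helper that adds a designed experiment):
# while max_vif(np.vstack([X_observed, X_designed])) > 10:
#     X_designed = add_designed_run(X_designed)
```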

17.
This paper deals with some multiple decision (ranking and selection) problems. Some relevant distribution theory is given, and the associated confidence bounds are derived for the differences (ratios) between the parameters. The selection procedures select a non-empty, small, best subset such that the probability that the best population is included in the subset is at least equal to a specified value P*. General results are given for both the unknown location and scale parameters of the k populations. Some desirable properties of these procedures are studied and proved. Selection of a subset to contain all populations better than a standard is also discussed. Performance characteristics of some procedures for the normal means problem are studied, and tables are given for the probabilities of selecting the ith ranked population and for the expected proportion and expected average rank in the selected subset. A brief review of work by other authors on selection and ranking and on related problems is given.

18.
Supervised machine learning approaches are effective in text mining, but their success relies heavily on manually annotated corpora. However, only a limited number of annotated biomedical event corpora are available, and these datasets contain too few examples for training classifiers; the common cure is to seek large amounts of training samples from unlabeled data, but such data often contain many mislabeled samples, which degrade classifier performance. Therefore, this study proposes a novel error data detection approach suitable for reducing noise in unlabeled biomedical event data. First, we construct the mislabeled dataset through error data analysis of the development dataset. Vector representations of sample pairs are then obtained by means of sequence patterns and a joint model of a convolutional neural network and a long short-term memory recurrent neural network. Following this, a sample identification strategy is proposed that applies error detection based on the pair representations to the unlabeled data. The selected samples are then added to enrich the training dataset and improve classification performance. On the BioNLP GENIA Shared Task, the experimental results indicate that the proposed approach is competent at extracting biomedical events from the biomedical literature. Our approach can effectively filter out noisy examples and build a satisfactory prediction model.

19.
Research on the application of the PUGH concept selection method in product design   Cited: 1 in total (self-citations: 0, citations by others: 1)
李永斌, 陈婷, 梁权攀. Packaging Engineering (《包装工程》), 2018, 39(10): 167-172
Objective: In the product development process, the involvement of many designers introduces subjective, emotional judgments into the selection of a design scheme, making the choice blind and conformist, so that designers cannot identify the optimal scheme among many candidates. To address this problem, the PUGH concept selection method is proposed as an effective evaluation tool for choosing among product design schemes, allowing designers to obtain the optimal design from a large set of candidates. Methods: The PUGH concept selection method is applied to the product design process. Maslow's hierarchy of needs is combined with user experience to establish the judgment criteria and the datum (baseline) scheme of the PUGH matrix. An evaluation team compares the candidate schemes against the criteria and the datum scheme, ranks the candidates, and computes an overall ranking, from which the optimal scheme is determined. The approach is applied to the design of an integrated kitchen cabinet as a concrete example. Conclusion: The application to the integrated cabinet design verifies the effectiveness and feasibility of the PUGH concept selection method in product design.
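The scoring and ranking step of a Pugh matrix can be sketched as follows; the data layout, the optional weighting, and the example criteria in the usage comment (loosely inspired by the abstract's Maslow/user-experience framing) are illustrative assumptions.

```python
def pugh_ranking(alternatives, criteria, datum, weights=None):
    """Score each alternative concept against the datum concept as +1
    (better), 0 (same) or -1 (worse) on every criterion, then rank the
    concepts by their weighted totals; the datum scores zero by definition."""
    weights = weights or {c: 1.0 for c in criteria}
    totals = {name: sum(weights[c] * scores.get(c, 0) for c in criteria)
              for name, scores in alternatives.items()}
    totals[datum] = 0.0
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative use:
# criteria = ["safety", "ease of use", "aesthetics", "cost"]
# ranking = pugh_ranking({"concept B": {"ease of use": 1, "cost": -1},
#                         "concept C": {"aesthetics": 1}}, criteria, "concept A")
```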

20.
This paper describes the use of Computational Fluid Dynamics (CFD) and mathematical optimization to determine the optimum operating conditions and geometry of a continuous quenching process. The pump power as well as the quench rate of the steel plate in this process is influenced by many parameters, including the nozzle and header geometry, plate speed, water flow rate, etc. Since an experimental approach is time consuming and costly, use is made of numerical techniques. Furthermore, it is sometimes impossible to measure certain values in this manufacturing process (e.g. the quench rate at a certain depth in the plate); these quantities can be obtained by CFD techniques. Using CFD on a trial-and-error basis without optimization, however, does not guarantee optimal solutions. A better approach, which has until recently been too expensive, is to combine CFD with mathematical optimization techniques, thereby incorporating the influence of the design variables automatically. The current study investigates a simplified two-dimensional continuous quenching process. The first part of the study investigates the operating conditions required to quench a plate at a specific quench rate. The second part minimizes the pump power needed to quench a plate at a specific quench rate. The CFD simulation uses the STAR-CD code to solve the Reynolds-averaged Navier-Stokes equations with the k-ε turbulence model. The optimization is carried out by means of Snyman's DYNAMIC-Q method, which is specifically designed to handle constrained problems in which the objective or constraint functions are expensive to evaluate. The paper illustrates how this optimization technique can be used to obtain the operating conditions needed for a manufacturing process with complex flow and heat transfer phenomena, and how these techniques can be used in the design phase of such a process to determine the optimum geometry and process parameters.
