Similar Literature
20 similar documents found (search time: 31 ms)
1.
The forward search provides data-driven flexible trimming of a Cp statistic for the choice of regression models that reveals the effect of outliers on model selection. An informed robust model choice follows. Even in small samples, the statistic has a null distribution indistinguishable from an F distribution. Limits on acceptable values of the Cp statistic follow. Two examples of widely differing size are discussed. A powerful graphical tool is the generalized candlestick plot, which summarizes the information on all forward searches and on the choice of models. A comparison is made with the use of M-estimation in robust model choice.

2.
This study focuses on an alignment-free sequence comparison method: the number of words of length k shared between two sequences, also known as the D2 statistic. The advantages of the use of this statistic over alignment-based methods are firstly that it does not assume that homologous segments are contiguous, and secondly that the algorithm is computationally extremely fast, the runtime being proportional to the size of the sequence under scrutiny. Existing applications of the D2 statistic include the clustering of related sequences in large EST databases such as the STACK database. Such applications have typically relied on heuristics without any statistical basis. Rigorous statistical characterisations of the distribution of D2 have subsequently been undertaken, but have focussed on the distribution's asymptotic behaviour, leaving the distribution of D2 uncharacterised for most practical cases. The work presented here bridges these two worlds to give usable approximations of the distribution of D2 for ranges of parameters most frequently encountered in the study of biological sequences.
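The word-count definition above can be illustrated with a minimal Python sketch. It assumes the common formulation in which every pair of matching positions is counted, so that D2 is the sum over all k-words of the product of their counts in the two sequences. The function names `kmer_counts` and `d2_statistic` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistic(a, b, k):
    """Number of k-word matches between a and b (assumed definition):
    the sum over words w of count_a(w) * count_b(w)."""
    counts_a = kmer_counts(a, k)
    counts_b = kmer_counts(b, k)
    # A Counter returns 0 for words it has never seen.
    return sum(n * counts_b[w] for w, n in counts_a.items())


if __name__ == "__main__":
    print(d2_statistic("ACGTACGT", "CGTA", 2))
```

Both counting passes are linear in the sequence lengths, which is consistent with the runtime claim in the abstract.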

3.
The test statistics Ih, Ic, and In are derived by decomposing the numerator of the Moran's I test for high-value clustering, low-value clustering, and negative autocorrelation, respectively. Formulae to compute the means and variances of these test statistics are derived under a random permutation test scheme, and the p-values of the test statistics are computed by asymptotic normality. A set of simulations shows that test statistic Ih is likely to be significant only for high-value clustering, test statistic Ic is likely to be significant only for low-value clustering, and test statistic In is likely to be significant only for negatively correlated spatial structures. These test statistics were used to reexamine spatial distributions of sudden infant death syndrome in North Carolina and the pH values of streams in the Great Smoky Mountains. In both analyses, low-value clustering and high-value clustering were shown to exist simultaneously.

4.
Not only different databases but also two classes of data within a database can have different data structures. SVM and LS-SVM typically minimize the empirical ϕ-risk; regularized versions subject to a fixed penalty (L2 or L1 penalty) are non-adaptive since their penalty forms are pre-determined. They often perform well only in certain types of situations. For example, LS-SVM with an L2 penalty is not preferred if the underlying model is sparse. This paper proposes an adaptive penalty learning procedure called the evolution strategies (ES) based adaptive Lp least squares support vector machine (ES-based Lp LS-SVM) to address the above issue. By introducing multiple kernels, an Lp penalty based nonlinear objective function is derived. The iterative re-weighted minimal solver (IRMS) algorithm is used to solve the nonlinear function. Then evolution strategies (ES) is used to solve the multi-parameter optimization problem. The penalty parameter p, the kernel parameters and the regularization parameters are adaptively selected by the proposed ES-based algorithm in the process of training the data, which makes it easier to achieve the optimal solution. Numerical experiments are conducted on two artificial data sets and six real-world data sets. The experimental results show that the proposed procedure offers better generalization performance than the standard SVM, the LS-SVM and other improved algorithms.

5.
In this contribution, variance properties of L2 model reduction are studied. That is, given an estimated model of high order we study the resulting variance of an L2 reduced approximation. The main result of the paper is showing that estimating a low-order output error (OE) model via L2 model reduction of a high-order model gives a smaller variance compared to estimating a low-order model directly from data in case of undermodeling. This has previously been shown to hold for Finite Impulse Response models, but is in this paper extended to general linear OE models.

6.
In this contribution we examine certain variance properties of model reduction. The focus is on L2 model reduction, but some general results are also presented. These general results can be used to analyze various other model reduction schemes. The models we study are finite impulse response (FIR) and output error (OE) models. We compare the variance of two estimated models. The first one is estimated directly from data and the other one is computed by reducing a high order model, by L2 model reduction. In the FIR case we show that it is never better to estimate the model directly from data, compared to estimating it via L2 model reduction of a high order FIR model. For OE models we show that the reduced model has the same variance as the directly estimated one if the reduced model class used contains the true system.

7.
Pearson's chi-squared statistic (X²) does not in general follow a chi-square distribution when it is used for goodness-of-fit testing for a multinomial distribution based on sparse contingency table data. We explore properties of the D² statistic of Zelterman [Zelterman, D., 1987. Goodness-of-fit tests for large sparse multinomial distributions. J. Amer. Statist. Assoc. 82 (398), 624-629] and compare them with those of X². We also compare the power of goodness-of-fit tests using D², X², and the statistic (Lr) proposed by Maydeu-Olivares and Joe [Maydeu-Olivares, A., Joe, H., 2005. Limited- and full-information estimation and goodness-of-fit testing in 2ⁿ contingency tables: A unified framework. J. Amer. Statist. Assoc. 100 (471), 1009-1020] when the given contingency table is very sparse. We show that the variance of D² is not larger than the variance of X² under null hypotheses where all the cell probabilities are positive, that the distribution of D² becomes more skewed as the multinomial distribution becomes more asymmetric and sparse, and that, as for the Lr statistic, the power of the goodness-of-fit testing depends on the models which are selected for the testing. A simulation experiment strongly supports using both D² and Lr for goodness-of-fit testing with large sparse contingency table data.
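For reference, the baseline Pearson chi-squared statistic discussed above can be computed as in the following minimal Python sketch. It uses the textbook formula, the sum over cells of (observed − expected)² / expected with expected = n·p for each cell. The function name `pearson_chi_squared` is hypothetical, and the statistics of Zelterman and of Maydeu-Olivares and Joe are deliberately not reproduced here.

```python
def pearson_chi_squared(counts, probs):
    """Pearson's chi-squared statistic for multinomial counts.

    counts: observed cell counts
    probs:  hypothesised cell probabilities (all positive, summing to 1)
    """
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    n = sum(counts)
    # Each cell contributes (observed - expected)^2 / expected.
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))


if __name__ == "__main__":
    print(pearson_chi_squared([5, 15, 20], [0.25, 0.25, 0.5]))
```

In a sparse table many expected counts n·p are small, and that is the regime in which the abstract notes that the chi-square approximation breaks down.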

8.
The Lepage-type statistic LM has been recently proposed. This is a combination of the Baumgartner statistic and the Ansari-Bradley statistic. The LM statistic is found to be more powerful than the Lepage statistic. A modified LM statistic is used for two-sample location and scale parameters. Furthermore, a modified Baumgartner statistic and the Mood statistic replace the Baumgartner and Ansari-Bradley statistics. Simulations are used to investigate the power of the Lepage-type statistics.

9.
Predicting labels of structured data such as sequences or images is a very important problem in statistical machine learning and data mining. The conditional random field (CRF) is perhaps one of the most successful approaches for structured label prediction via conditional probabilistic modeling. In such models, it is traditionally assumed that each label is a random variable from a nominal category set (e.g., class categories) where all categories are symmetric and unrelated to one another. In this paper we consider a different situation of ordinal-valued labels where each label category bears a particular meaning of preference or order. This setup fits many interesting problems/datasets for which one is interested in predicting labels that represent certain degrees of intensity or relevance. We propose a fairly intuitive and principled CRF-like model that can effectively deal with the ordinal-scale labels within an underlying correlation structure. Unlike standard log-linear CRFs, learning the proposed model incurs non-convex optimization. However, the new model can be learned accurately using efficient gradient search. We demonstrate the improved prediction performance achieved by the proposed model on several intriguing sequence/image label prediction tasks.

10.
11.
We consider testing for an exponential distribution with unspecified rate parameter when it is only possible to observe the counts in groups with boundaries specified before sighting the data. On the basis of a size and power study we recommend that tests of fit for the exponential distribution be based on the Anderson-Darling statistic and the SW2 statistic recommended by Gulati and Neus [(2001). Goodness-of-fit statistics for the exponential distribution when the data are grouped. In: Huber-Carol, C., Balakrishnan, N., Nikulin, M.S., Mesbah, M., (Eds.), Goodness-of-Fit Tests and Validity of Models. Birkhäuser, Boston, pp. 113-123 (Chapter 9)]. We also suggest that inference based on one of these be complemented by examination of the components of the X² statistic. We illustrate how to use these components to give improved models.

12.
Sequence segmentation is a well-studied problem, where given a sequence of elements, an integer K, and some measure of homogeneity, the task is to split the sequence into K contiguous segments that are maximally homogeneous. A classic approach to finding the optimal solution is to use a dynamic program. Unfortunately, the execution time of this program is quadratic with respect to the length of the input sequence. This makes the algorithm slow for a sequence of non-trivial length. In this paper we study segmentations whose measure of goodness is based on log-linear models, a rich family that contains many of the standard distributions. We present a theoretical result allowing us to prune many suboptimal segmentations. Using this result, we modify the standard dynamic program for 1D log-linear models, and by doing so reduce the computational time. We demonstrate empirically that this approach can significantly reduce the computational burden of finding the optimal segmentation.
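The classic quadratic dynamic program mentioned above can be sketched as follows. This is a minimal illustration of the unpruned baseline only, not the pruned algorithm of the paper. It assumes a squared-error homogeneity measure (the within-segment sum of squared deviations from the segment mean), and the function name `segment` is hypothetical.

```python
from itertools import accumulate


def segment(xs, k):
    """Split xs into k contiguous segments minimising total squared error.

    Returns (total_cost, [(start, end), ...]) with half-open index ranges.
    Runs in O(k * n^2) time, the quadratic baseline.
    """
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(xs)")
    # Prefix sums make the cost of any segment an O(1) lookup.
    s1 = [0.0] + list(accumulate(xs))
    s2 = [0.0] + list(accumulate(x * x for x in xs))

    def cost(i, j):
        """Squared error of xs[i:j] around its own mean (requires j > i)."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimal cost of splitting xs[:j] into m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    back[m][j] = i
    # Walk the back-pointers from the end to recover the boundaries.
    bounds, j = [], n
    for m in range(k, 0, -1):
        i = back[m][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return best[k][n], bounds


if __name__ == "__main__":
    print(segment([1, 1, 1, 5, 5, 5], 2))
```

The innermost loop over candidate split points `i` is what makes the running time quadratic in the sequence length, and it is this search that a pruning result would shorten.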

13.
14.
15.
In cancer classification based on gene expression data, it would be desirable to defer a decision for observations that are difficult to classify. For instance, an observation for which the conditional probability of being cancer is around 1/2 would preferably require more advanced tests rather than an immediate decision. This motivates the use of a classifier with a reject option that reports a warning in cases of observations that are difficult to classify. In this paper, we consider a problem of gene selection with a reject option. Typically, gene expression data comprise expression levels of several thousands of candidate genes. In such cases, an effective gene selection procedure is necessary to provide a better understanding of the underlying biological system that generates data and to improve prediction performance. We propose a machine learning approach in which we apply the l1 penalty to the SVM with a reject option. This method is referred to as the l1 SVM with a reject option. We develop a novel optimization algorithm for this SVM, which is sufficiently fast and stable to analyze gene expression data. The proposed algorithm realizes an entire solution path with respect to the regularization parameter. Results of numerical studies show that, in comparison with the standard l1 SVM, the proposed method efficiently reduces prediction errors without hampering gene selectivity.
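The idea of a reject option can be illustrated with a minimal Python sketch. It assumes an estimate of the conditional probability of cancer is already available and applies a simple plug-in threshold rule around 1/2. This is not the l1 SVM with a reject option proposed in the paper, and the function name `predict_with_reject` and the half-width parameter `delta` are hypothetical.

```python
def predict_with_reject(p_cancer, delta=0.2):
    """Plug-in decision rule with a reject option.

    p_cancer: estimated conditional probability that the sample is cancer
    delta:    half-width of the reject band around 1/2 (an assumed knob)
    """
    if not 0.0 <= delta <= 0.5:
        raise ValueError("delta must lie in [0, 0.5]")
    # Defer the decision when the probability is too close to 1/2.
    if abs(p_cancer - 0.5) < delta:
        return "reject"
    return "cancer" if p_cancer > 0.5 else "normal"


if __name__ == "__main__":
    for p in (0.1, 0.55, 0.9):
        print(p, predict_with_reject(p))
```

A wider band (larger `delta`) defers more samples to further testing in exchange for fewer outright misclassifications.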

16.
Speaker discrimination is a vital aspect of speaker recognition applications such as speaker identification, verification, clustering, indexing and change-point detection. These tasks are usually performed using distance-based approaches to compare speaker models or features from homogeneous speaker segments in order to determine whether or not they belong to the same speaker. Several distance measures and features have been examined for all the different applications; however, no single distance or feature has been reported to perform optimally for all applications in all conditions. In this paper, a thorough analysis is made to determine the behavior of some frequently used distance measures, as well as features, in distinguishing speakers for different data lengths. Measures studied include the Mahalanobis distance, Kullback-Leibler (KL) distance, T² statistic, Hellinger distance, Bhattacharyya distance, Generalized Likelihood Ratio (GLR), Levenne distance, and the L2 and L∞ distances. The Mel-Scale Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC), Line Spectral Pairs (LSP) and the Log Area Ratios (LAR) comprise the features investigated. The usefulness of these measures is studied for different data lengths. Generally, a larger data size for each speaker results in better speaker differentiating capability, as more information can be taken into account. However, in some applications such as segmentation of telephone data, speakers change frequently, making it impossible to obtain large speaker-consistent utterances (especially when speaker change-points are unknown). A metric is defined for determining the probability of speaker discrimination error obtainable for each distance measure using each feature set, and the effect of data size on this probability is observed. Furthermore, simple distance-based speaker identification and clustering systems are developed, and the performance of each distance and feature for various data sizes is evaluated on these systems in order to illustrate the importance of choosing the appropriate distance and feature for each application. Results show that for tasks which do not involve any limitation of data length, such as speaker identification, the Kullback-Leibler distance with the MFCCs yields the highest speaker differentiation performance, which is comparable to results obtained using more complex state-of-the-art speaker identification systems. Results also indicate that the Hellinger and Bhattacharyya distances with the LSPs yield the best performance for small data sizes.

17.
K.B. Datta, A. RaiChaudhuri. Automatica, 2002, 38(10): 1791-1797
The design of a mixed H2/H∞ linear state variable feedback suboptimal controller for a discrete-time singularly perturbed system using reduced order slow and fast subsystems is described. It is shown that the designed controller based on reduced order models and the corresponding performance index both are O(ε) close to those synthesized using the full order system.

18.
SPheno is a program that accurately calculates the supersymmetric particle spectrum within a high scale theory, such as minimal supergravity, gauge mediated supersymmetry breaking, anomaly mediated supersymmetry breaking, or string effective field theories. An interface exists for an easy implementation of other models. The program solves the renormalization group equations numerically to two-loop order with user-specified boundary conditions. The complete one-loop formulas for the masses are used, supplemented by two-loop contributions in the case of the neutral Higgs bosons and the μ parameter. The obtained masses and mixing matrices are used to calculate decay widths and branching ratios of supersymmetric particles as well as of Higgs bosons, b → sγ, Δρ and (g−2)μ. Moreover, the production cross sections of all supersymmetric particles as well as Higgs bosons at e+e− colliders can be calculated including initial state radiation and longitudinal polarization of the incoming electrons/positrons. The program is structured such that it can easily be extended to include non-minimal models and/or complex parameters.

19.
Glucose-insulin system models are commonly used for identifying insulin sensitivity. With physiological, 2-compartment insulin kinetics models, accurate kinetic parameter values are required for reliable estimates of insulin sensitivity. This study uses data from 6 published microdialysis studies to determine the most appropriate parameter values for the transcapillary diffusion rate (nI) and cellular insulin clearance rate (nC).

20.
In this paper, a test statistic is constructed to test polynomial relationships in randomly right censored regression models based on the local polynomial smoothing technique. Two bootstrap procedures, namely the residual-based bootstrap and the naive bootstrap procedures, are suggested to derive the p-value of the test. Some simulations are conducted to empirically assess the performance of the two bootstrap procedures. The results demonstrate that the residual-based bootstrap performs much better than the naive bootstrap and the test method with the residual-based bootstrap to derive the p-value works satisfactorily. Although the limiting distribution of the test statistic and the consistency of the bootstrap approximations remain to be investigated, simulation results indicate that the proposed test method may be of some practical use. As a real example, the proposed test is applied to the Stanford heart transplant data.
