期刊界 All Journals

1.

Robust model selection with flexible trimming

Marco Riani 《Computational statistics & data analysis》2010,54(12):3300-3312

The forward search provides data-driven flexible trimming of a C_p statistic for the choice of regression models that reveals the effect of outliers on model selection. An informed robust model choice follows. Even in small samples, the statistic has a null distribution indistinguishable from an F distribution. Limits on acceptable values of the C_p statistic follow. Two examples of widely differing size are discussed. A powerful graphical tool is the generalized candlestick plot, which summarizes the information on all forward searches and on the choice of models. A comparison is made with the use of M-estimation in robust model choice.
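
As a point of reference for the abstract above, the classical (untrimmed) Mallows C_p can be sketched in a few lines. This is not the forward-search version studied in the paper; the function name `mallows_cp` and the numbers in the example are assumptions made for illustration only.

```python
def mallows_cp(rss_p: float, s2_full: float, n: int, p: int) -> float:
    """Classical Mallows C_p for a submodel with p parameters.

    rss_p   : residual sum of squares of the submodel
    s2_full : error-variance estimate taken from the full model
    n       : number of observations
    """
    if s2_full <= 0:
        raise ValueError("s2_full must be positive")
    return rss_p / s2_full - n + 2 * p


# Hypothetical numbers: a 3-parameter submodel fitted to 20 observations.
cp = mallows_cp(rss_p=18.0, s2_full=1.0, n=20, p=3)
print(cp)  # 4.0
```

A submodel with little bias has C_p close to p, so values far above p suggest that useful terms are missing.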

2.

Empirical distribution of k-word matches in biological sequences

Sylvain Forêt Conrad J. Burden 《Pattern recognition》2009,42(4):539-12

This study focuses on an alignment-free sequence comparison method: the number of words of length k shared between two sequences, also known as the D₂ statistic. The advantages of the use of this statistic over alignment-based methods are firstly that it does not assume that homologous segments are contiguous, and secondly that the algorithm is computationally extremely fast, the runtime being proportional to the size of the sequence under scrutiny. Existing applications of the D₂ statistic include the clustering of related sequences in large EST databases such as the STACK database. Such applications have typically relied on heuristics without any statistical basis. Rigorous statistical characterisations of the distribution of D₂ have subsequently been undertaken, but have focussed on the distribution's asymptotic behaviour, leaving the distribution of D₂ uncharacterised for most practical cases. The work presented here bridges these two worlds to give usable approximations of the distribution of D₂ for ranges of parameters most frequently encountered in the study of biological sequences.
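
The D₂ statistic described above counts every pair of positions, one in each sequence, at which the same k-word starts. A minimal sketch, assuming exact word matches and plain Python strings (the helper names `kmer_counts` and `d2` are illustrative, not taken from the paper):

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a: str, seq_b: str, k: int) -> int:
    """Number of k-word matches between two sequences.

    Equals the sum over words w of count_a(w) * count_b(w), i.e. the
    number of position pairs (i, j) whose k-words are identical.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for words that are absent from seq_b.
    return sum(c * counts_b[w] for w, c in counts_a.items())


print(d2("ACGTACGT", "CGTA", 3))  # 3
```

Both passes are linear in the sequence lengths for fixed k, which matches the runtime claim in the abstract.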

3.

A decomposition of Moran's I for clustering detection

Tonglin Zhang Ge Lin 《Computational statistics & data analysis》2007,51(12):6123-6137

The test statistics I_h, I_c, and I_n are derived by decomposing the numerator of the Moran's I test for high-value clustering, low-value clustering, and negative autocorrelation, respectively. Formulae to compute the means and variances of these test statistics are derived under a random permutation test scheme, and the p-values of the test statistics are computed by asymptotic normality. A set of simulations shows that test statistic I_h is likely to be significant only for high-value clustering, test statistic I_c is likely to be significant only for low-value clustering, and test statistic I_n is likely to be significant only for negatively correlated spatial structures. These test statistics were used to reexamine spatial distributions of sudden infant death syndrome in North Carolina and the pH values of streams in the Great Smoky Mountains. In both analyses, low-value clustering and high-value clustering were shown to exist simultaneously.
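
For orientation, the global Moran's I whose numerator the paper decomposes can be computed as below. This sketch covers only the standard statistic, not the I_h, I_c, I_n decomposition, whose exact form is not given in the abstract. The function name and the chain-shaped weight matrix are assumptions chosen for the example.

```python
def morans_i(values, weights):
    """Global Moran's I for observations `values` and spatial weights `weights`.

    weights[i][j] is the spatial weight between locations i and j
    (zero on the diagonal by convention).
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                 # deviations from the mean
    s0 = sum(sum(row) for row in weights)          # sum of all weights
    numerator = sum(weights[i][j] * z[i] * z[j]
                    for i in range(n) for j in range(n))
    denominator = sum(d * d for d in z)
    return (n / s0) * (numerator / denominator)


# Hypothetical layout: four locations on a line, each adjacent to its neighbours.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print(morans_i([1, 2, 3, 4], chain))    # positive: similar values are adjacent
print(morans_i([1, -1, 1, -1], chain))  # negative: values alternate
```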

4.

Evolution strategies based adaptive L_p LS-SVM

Liwei Wei 《Information Sciences》2011,181(14):3000-65

Not only different databases, but also two classes of data within a single database, can have different data structures. SVM and LS-SVM typically minimize the empirical ?-risk; regularized versions subject to a fixed penalty (L₂ or L₁ penalty) are non-adaptive since their penalty forms are pre-determined. They often perform well only in certain types of situations. For example, LS-SVM with an L₂ penalty is not preferred if the underlying model is sparse. This paper proposes an adaptive penalty learning procedure called evolution strategies (ES) based adaptive L_p least squares support vector machine (ES-based L_p LS-SVM) to address the above issue. By introducing multiple kernels, an L_p penalty based nonlinear objective function is derived. The iterative re-weighted minimal solver (IRMS) algorithm is used to solve the nonlinear function. Evolution strategies (ES) are then used to solve the multi-parameter optimization problem. The penalty parameter p and the kernel and regularization parameters are adaptively selected by the proposed ES-based algorithm during training, which makes it easier to achieve the optimal solution. Numerical experiments are conducted on two artificial data sets and six real-world data sets. The experimental results show that the proposed procedure offers better generalization performance than the standard SVM, the LS-SVM and other improved algorithms.

5.

Variance analysis of L₂ model reduction when undermodeling—the output error case

Fredrik Tjärnström 《Automatica》2003,39(10):1809-1815

In this contribution, variance properties of L₂ model reduction are studied. That is, given an estimated model of high order we study the resulting variance of an L₂ reduced approximation. The main result of the paper is showing that estimating a low-order output error (OE) model via L₂ model reduction of a high-order model gives a smaller variance compared to estimating a low-order model directly from data in the case of undermodeling. This has previously been shown to hold for Finite Impulse Response models, but is in this paper extended to general linear OE models.

6.

L₂ Model reduction and variance reduction

F. Tjärnström L. Ljung 《Automatica》2002,38(9):1517-1530

In this contribution we examine certain variance properties of model reduction. The focus is on L₂ model reduction, but some general results are also presented. These general results can be used to analyze various other model reduction schemes. The models we study are finite impulse response (FIR) and output error (OE) models. We compare the variance of two estimated models. The first one is estimated directly from data and the other one is computed by reducing a high order model, by L₂ model reduction. In the FIR case we show that it is never better to estimate the model directly from data, compared to estimating it via L₂ model reduction of a high order FIR model. For OE models we show that the reduced model has the same variance as the directly estimated one if the reduced model class used contains the true system.

7.

Estimate-based goodness-of-fit test for large sparse multinomial distributions

Sung-Ho Kim Hyemi Choi 《Computational statistics & data analysis》2009,53(4):1122-1131

Pearson’s chi-squared statistic (X²) does not in general follow a chi-square distribution when it is used for goodness-of-fit testing for a multinomial distribution based on sparse contingency table data. We explore properties of Zelterman's D² statistic [Zelterman, D., 1987. Goodness-of-fit tests for large sparse multinomial distributions. J. Amer. Statist. Assoc. 82 (398), 624-629] and compare them with those of X². We also compare the power of goodness-of-fit tests based on D², X², and the statistic (L_r) proposed by [Maydeu-Olivares, A., Joe, H., 2005. Limited- and full-information estimation and goodness-of-fit testing in 2ⁿ contingency tables: A unified framework. J. Amer. Statist. Assoc. 100 (471), 1009-1020] when the given contingency table is very sparse. We show that the variance of D² is not larger than the variance of X² under null hypotheses where all the cell probabilities are positive, that the distribution of D² becomes more skewed as the multinomial distribution becomes more asymmetric and sparse, and that, as for the L_r statistic, the power of the goodness-of-fit testing depends on the models which are selected for the testing. A simulation experiment strongly supports using both D² and L_r for goodness-of-fit testing with large sparse contingency table data.
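
To make the starting point of the abstract concrete, Pearson's X² can be computed as below on a deliberately sparse table. Only X² is shown; the D² and L_r statistics are left out because the abstract does not give their formulas. The function name and the counts are assumptions for illustration.

```python
def pearson_chi2(observed, expected):
    """Pearson's X^2 = sum over cells of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical sparse case: 4 observations spread over 8 equally likely cells,
# so every expected count is 0.5 and most observed counts are zero.
observed = [1, 0, 0, 2, 0, 1, 0, 0]
expected = [0.5] * 8
print(pearson_chi2(observed, expected))  # 8.0
```

With expected counts this small, the usual chi-square reference distribution is unreliable, which is the situation the paper addresses.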

8.

Lepage type statistic based on the modified Baumgartner statistic

Hidetoshi Murakami 《Computational statistics & data analysis》2007,51(10):5061-5067

The Lepage-type statistic L_M has been recently proposed. This is a combination of the Baumgartner statistic and the Ansari-Bradley statistic. The L_M statistic is found to be more powerful than the Lepage statistic. A modified L_M statistic is used for two-sample location and scale parameters. Furthermore, a modified Baumgartner statistic and the Mood statistic replace the Baumgartner and Ansari-Bradley statistics. Simulations are used to investigate the power of the Lepage-type statistics.

9.

Conditional ordinal random fields for structured ordinal-valued label prediction

Minyoung Kim 《Data mining and knowledge discovery》2014,28(2):378-401

Predicting labels of structured data such as sequences or images is a very important problem in statistical machine learning and data mining. The conditional random field (CRF) is perhaps one of the most successful approaches for structured label prediction via conditional probabilistic modeling. In such models, it is traditionally assumed that each label is a random variable from a nominal category set (e.g., class categories) where all categories are symmetric and unrelated to one another. In this paper we consider a different situation of ordinal-valued labels where each label category bears a particular meaning of preference or order. This setup fits many interesting problems/datasets for which one is interested in predicting labels that represent certain degrees of intensity or relevance. We propose a fairly intuitive and principled CRF-like model that can effectively deal with the ordinal-scale labels within an underlying correlation structure. Unlike standard log-linear CRFs, learning the proposed model incurs non-convex optimization. However, the new model can be learned accurately using efficient gradient search. We demonstrate the improved prediction performance achieved by the proposed model on several intriguing sequence/image label prediction tasks.

10.

SeaUV and SeaUV_C: Algorithms for the retrieval of UV/Visible diffuse attenuation coefficients from ocean color

Cédric G. Fichot Shubha Sathyendranath William L. Miller 《Remote sensing of environment》2008,112(4):1584-1602


11.

Chi-squared components for tests of fit and improved models for the grouped exponential distribution

D.J. Best J.C.W. Rayner 《Computational statistics & data analysis》2007,51(8):3946-3954

We consider testing for an exponential distribution with unspecified rate parameter when it is only possible to observe the counts in groups with boundaries specified before sighting the data. On the basis of a size and power study we recommend that tests of fit for the exponential distribution be based on the Anderson-Darling statistic and the SW2 statistic recommended by Gulati and Neus [(2001). Goodness-of-fit statistics for the exponential distribution when the data are grouped. In: Huber-Carol, C., Balakrishnan, N., Nikulin, M.S., Mesbah, M., (Eds.), Goodness-of-Fit Tests and Validity of Models. Birkhauser, Boston, pp. 113-123 (Chapter 9)]. We also suggest that inference based on one of these be complemented by examination of the components of the X² statistic. We illustrate how to use these components to give improved models.

12.

Fast sequence segmentation using log-linear models

Nikolaj Tatti 《Data mining and knowledge discovery》2013,27(3):421-441

Sequence segmentation is a well-studied problem, where given a sequence of elements, an integer K, and some measure of homogeneity, the task is to split the sequence into K contiguous segments that are maximally homogeneous. A classic approach to find the optimal solution is by using a dynamic program. Unfortunately, the execution time of this program is quadratic with respect to the length of the input sequence. This makes the algorithm slow for a sequence of non-trivial length. In this paper we study segmentations whose measure of goodness is based on log-linear models, a rich family that contains many of the standard distributions. We present a theoretical result allowing us to prune many suboptimal segmentations. Using this result, we modify the standard dynamic program for 1D log-linear models, and by doing so reduce the computational time. We demonstrate empirically that this approach can significantly reduce the computational burden of finding the optimal segmentation.
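
The classic quadratic dynamic program mentioned above can be sketched as follows. This is the unpruned baseline, not the paper's accelerated algorithm. Squared error around the segment mean is assumed as the homogeneity measure purely for illustration, and the function name `segment` is hypothetical.

```python
def segment(seq, k):
    """Split seq into k contiguous segments minimising total squared error.

    Returns (best_cost, boundaries), where boundaries are the k - 1 start
    indices of segments 2..k. Runs in O(k * n^2) time.
    """
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(seq)")

    # Prefix sums of values and squares give each segment cost in O(1).
    s1, s2 = [0.0], [0.0]
    for x in seq:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i, j):
        """Squared error of seq[i:j] around its own mean."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]  # best[m][j]: m segments cover seq[:j]
    back = [[0] * (n + 1) for _ in range(k + 1)]    # start index of the last segment
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    back[m][j] = i

    # Walk the back-pointers from the end to recover segment starts.
    starts, j = [], n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    starts.reverse()  # starts[0] is always 0
    return best[k][n], starts[1:]


print(segment([1, 1, 1, 5, 5, 5], 2))  # (0.0, [3])
```

The inner loop over `i` is what makes the run time quadratic in the sequence length; the pruning result in the paper targets exactly that loop.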

13.

A novel model to predict a slab avalanche configuration using m:n-CA cellular automata

Pau Fonseca Màxim Colls Josep Casanovas 《Computers, Environment and Urban Systems》2011,35(1):12-24


14.

Unified generalized iterative scaling and its applications

Wei Gao Ning-Zhong Shi Lianyan Fu 《Computational statistics & data analysis》2010,54(4):1066-1078


15.

Gene selection and prediction for cancer classification using support vector machines with a reject option

Hosik Choi Sunghoon Kwon Yongdai Kim 《Computational statistics & data analysis》2011,55(5):1897-1908

In cancer classification based on gene expression data, it would be desirable to defer a decision for observations that are difficult to classify. For instance, an observation for which the conditional probability of being cancer is around 1/2 would preferably require more advanced tests rather than an immediate decision. This motivates the use of a classifier with a reject option that reports a warning in cases of observations that are difficult to classify. In this paper, we consider a problem of gene selection with a reject option. Typically, gene expression data consist of expression levels of several thousand candidate genes. In such cases, an effective gene selection procedure is necessary to provide a better understanding of the underlying biological system that generates data and to improve prediction performance. We propose a machine learning approach in which we apply the l₁ penalty to the SVM with a reject option. This method is referred to as the l₁ SVM with a reject option. We develop a novel optimization algorithm for this SVM, which is sufficiently fast and stable to analyze gene expression data. The proposed algorithm realizes an entire solution path with respect to the regularization parameter. Results of numerical studies show that, in comparison with the standard l₁ SVM, the proposed method efficiently reduces prediction errors without hampering gene selectivity.
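
The motivation given above, deferring a decision when the conditional probability is close to 1/2, corresponds to a simple thresholding rule, sketched here. This is not the l₁ SVM with a reject option proposed in the paper. The function name, the label strings, and the half-width `d` of the reject band are all assumptions for illustration.

```python
def classify_with_reject(p_cancer: float, d: float) -> str:
    """Return 'cancer', 'normal', or 'reject' from a conditional probability.

    p_cancer : estimated probability that the observation is cancer
    d        : half-width of the reject band around 1/2 (0 <= d <= 0.5)
    """
    if not 0.0 <= p_cancer <= 1.0:
        raise ValueError("p_cancer must lie in [0, 1]")
    if not 0.0 <= d <= 0.5:
        raise ValueError("d must lie in [0, 0.5]")
    if abs(p_cancer - 0.5) < d:
        return "reject"  # too close to 1/2: defer to more advanced tests
    return "cancer" if p_cancer > 0.5 else "normal"


for p in (0.10, 0.55, 0.90):
    print(p, classify_with_reject(p, d=0.1))
```

A wider band rejects more observations and lowers the error rate among those that are classified, at the cost of more follow-up tests.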

16.

Speaker distinguishing distances: a comparative study

Ananth N. Iyer Uchechukwu O. Ofoegbu Robert E. Yantorno Brett Y. Smolenski 《International Journal of Speech Technology》2007,10(2-3):95-107

Speaker discrimination is a vital aspect of speaker recognition applications such as speaker identification, verification, clustering, indexing and change-point detection. These tasks are usually performed using distance-based approaches to compare speaker models or features from homogeneous speaker segments in order to determine whether or not they belong to the same speaker. Several distance measures and features have been examined for all the different applications; however, no single distance or feature has been reported to perform optimally for all applications in all conditions. In this paper, a thorough analysis is made to determine the behavior of some frequently used distance measures, as well as features, in distinguishing speakers for different data lengths. Measures studied include the Mahalanobis distance, Kullback-Leibler (KL) distance, T² statistic, Hellinger distance, Bhattacharyya distance, Generalized Likelihood Ratio (GLR), Levenne distance, and the L₂ and L_∞ distances. The Mel-Scale Frequency Cepstral Coefficient (MFCC), Linear Predictive Cepstral Coefficients (LPCC), Line Spectral Pairs (LSP) and the Log Area Ratios (LAR) comprise the features investigated. The usefulness of these measures is studied for different data lengths. Generally, a larger data size for each speaker results in better speaker differentiating capability, as more information can be taken into account. However, in some applications such as segmentation of telephone data, speakers change frequently, making it impossible to obtain large speaker-consistent utterances (especially when speaker change-points are unknown). A metric is defined for determining the probability of speaker discrimination error obtainable for each distance measure using each feature set, and the effect of data size on this probability is observed.
Furthermore, simple distance-based speaker identification and clustering systems are developed, and the performance of each distance and feature for various data sizes is evaluated on these systems in order to illustrate the importance of choosing the appropriate distance and feature for each application. Results show that for tasks which do not involve any limitation of data length, such as speaker identification, the Kullback-Leibler distance with the MFCCs yields the highest speaker differentiation performance, which is comparable to results obtained using more complex state-of-the-art speaker identification systems. Results also indicate that the Hellinger and Bhattacharyya distances with the LSPs yield the best performance for small data sizes.
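
Several of the measures named above have short closed forms when applied to discrete probability vectors. The sketch below assumes that simplified setting; the paper compares speaker models and feature sets, which is not reproduced here, and all function names are illustrative.

```python
import math


def bhattacharyya_coefficient(p, q):
    """Overlap between two discrete distributions, in [0, 1]."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def bhattacharyya_distance(p, q):
    bc = bhattacharyya_coefficient(p, q)
    return math.inf if bc == 0 else -math.log(bc)


def hellinger_distance(p, q):
    # max() guards against a tiny negative value caused by rounding.
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))


def kl_divergence(p, q):
    """KL(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def l2_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def linf_distance(p, q):
    return max(abs(a - b) for a, b in zip(p, q))


# Hypothetical distributions over two outcomes.
p, q = [0.5, 0.5], [0.25, 0.75]
for f in (bhattacharyya_distance, hellinger_distance,
          kl_divergence, l2_distance, linf_distance):
    print(f.__name__, f(p, q))
```

Note that the KL divergence is not symmetric in its arguments, whereas the other measures here are.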

17.

H₂/H_∞ Control of discrete singularly perturbed systems: the state feedback case

K.B. Datta A. RaiChaudhuri 《Automatica》2002,38(10):1791-1797

The design of a mixed H₂/H_∞ linear state variable feedback suboptimal controller for a discrete-time singularly perturbed system using reduced order slow and fast subsystems is described. It is shown that the designed controller based on reduced order models and the corresponding performance index are both O(ε) close to those synthesized using the full order system.

18.

SPheno, a program for calculating supersymmetric spectra, SUSY particle decays and SUSY particle production at e⁺e⁻ colliders

W. Porod 《Computer Physics Communications》2003,153(2):275-315

SPheno is a program that accurately calculates the supersymmetric particle spectrum within a high scale theory, such as minimal supergravity, gauge mediated supersymmetry breaking, anomaly mediated supersymmetry breaking, or string effective field theories. An interface exists for an easy implementation of other models. The program solves the renormalization group equations numerically to two-loop order with user-specified boundary conditions. The complete one-loop formulas for the masses are used, supplemented by two-loop contributions in the case of the neutral Higgs bosons and the μ parameter. The obtained masses and mixing matrices are used to calculate decay widths and branching ratios of supersymmetric particles as well as of Higgs bosons, b→sγ, Δρ and (g−2)_μ. Moreover, the production cross sections of all supersymmetric particles as well as Higgs bosons at e⁺e⁻ colliders can be calculated including initial state radiation and longitudinal polarization of the incoming electrons/positrons. The program is structured such that it can easily be extended to include non-minimal models and/or complex parameters.

19.

Interstitial insulin kinetic parameters for a 2-compartment insulin model with saturable clearance

Christopher G. Pretty Aaron Le Compte Sophie Penning Liam Fisk Geoffrey M. Shaw Thomas Desaive J. Geoffrey Chase 《Computer methods and programs in biomedicine》2014

Glucose-insulin system models are commonly used for identifying insulin sensitivity. With physiological, 2-compartment insulin kinetics models, accurate kinetic parameter values are required for reliable estimates of insulin sensitivity. This study uses data from 6 published microdialysis studies to determine the most appropriate parameter values for the transcapillary diffusion rate (n_I) and cellular insulin clearance rate (n_C).

20.

An empirical study of a test for polynomial relationships in randomly right censored regression models

Chun-Xia Zhang Chang-Lin Mei Jiang-She Zhang 《Computational statistics & data analysis》2007,51(12):6543-6556

In this paper, a test statistic is constructed to test polynomial relationships in randomly right censored regression models based on the local polynomial smoothing technique. Two bootstrap procedures, namely the residual-based bootstrap and the naive bootstrap procedures, are suggested to derive the p-value of the test. Some simulations are conducted to empirically assess the performance of the two bootstrap procedures. The results demonstrate that the residual-based bootstrap performs much better than the naive bootstrap and the test method with the residual-based bootstrap to derive the p-value works satisfactorily. Although the limiting distribution of the test statistic and the consistency of the bootstrap approximations remain to be investigated, simulation results indicate that the proposed test method may be of some practical use. As a real example, the proposed test is applied to the Stanford heart transplant data.