Similar Documents
20 similar documents retrieved.
1.
A two-class imbalanced big data classification method based on MapReduce and oversampling is proposed. The method consists of five steps: (1) for each positive example, use MapReduce to find its nearest neighbor from the opposite class; (2) generate several new positive examples on the line segment between the two points; (3) randomly partition the negative examples into subsets, using the size of the enlarged positive set as the reference; (4) combine each negative subset with the positive set to construct balanced data subsets; (5) train a classifier on each balanced subset and ensemble the trained classifiers. Experiments on five two-class imbalanced big datasets compare the proposed method with three related methods, and the results show that it outperforms all three.
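As a rough illustration of steps (1)-(5), here is a minimal single-machine sketch in Python (no MapReduce; the interpolation factor `alpha`, the number of synthetic points `n_new`, and the decision-tree base learner are my own illustrative choices, not taken from the paper):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oversample_and_ensemble(X_pos, X_neg, n_new=3, alpha=0.3, rng=None):
    """Single-machine sketch of the five steps: interpolate new positives
    toward each positive's nearest opposite-class neighbour, then train
    one classifier per balanced (positive set, negative subset) pair."""
    rng = np.random.default_rng(rng)
    # Step 1: nearest negative neighbour of every positive example.
    d = np.linalg.norm(X_pos[:, None, :] - X_neg[None, :, :], axis=2)
    nearest_neg = X_neg[d.argmin(axis=1)]
    # Step 2: generate synthetic positives on the connecting segments.
    synth = [X_pos + rng.uniform(0, alpha, (len(X_pos), 1)) * (nearest_neg - X_pos)
             for _ in range(n_new)]
    X_pos_new = np.vstack([X_pos] + synth)
    # Steps 3-4: split the negatives into chunks of roughly the positive-set size.
    idx = rng.permutation(len(X_neg))
    chunks = np.array_split(idx, max(1, len(X_neg) // len(X_pos_new)))
    # Step 5: train one base classifier per balanced subset.
    models = []
    for c in chunks:
        X = np.vstack([X_pos_new, X_neg[c]])
        y = np.r_[np.ones(len(X_pos_new)), np.zeros(len(c))]
        models.append(DecisionTreeClassifier().fit(X, y))
    return models
```

Predictions would then be made by majority vote over the returned models, mirroring the ensemble in step (5).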

2.
When sampling minimal subsets for robust parameter estimation, it is commonly known that obtaining an all-inlier minimal subset is not sufficient; the points therein should also have a large spatial extent. This paper investigates a theoretical basis behind this principle, based on a little-known result which expresses the least squares regression as a weighted linear combination of all possible minimal subset estimates. It turns out that the weight of a minimal subset estimate is directly related to the span of the associated points. We then derive an analogous result for total least squares which, unlike ordinary least squares, corrects for errors in both dependent and independent variables. We establish the relevance of our result to computer vision by relating total least squares to geometric estimation techniques. As practical contributions, we elaborate on why naive distance-based sampling fails as a strategy to maximise the span of all-inlier minimal subsets produced. In addition we propose a novel method which, unlike previous methods, can consciously target all-inlier minimal subsets with large spans.
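For reference, the ordinary least squares form of the weighted-combination result the abstract relies on is commonly stated as follows (notation mine, not taken from the paper): for a linear model with full-rank design matrix $X \in \mathbb{R}^{n \times p}$ and response $y$,

$$\hat{\beta}_{\mathrm{LS}} \;=\; \frac{\sum_{|S|=p} \det(X_S)^2\, \hat{\beta}_S}{\sum_{|S|=p} \det(X_S)^2}, \qquad \hat{\beta}_S = X_S^{-1} y_S,$$

where $S$ ranges over all minimal subsets of $p$ observations and $X_S$, $y_S$ are the corresponding rows. Because $\det(X_S)^2$ grows with the spatial extent of the points in $S$, minimal subsets with larger spans carry proportionally more weight, which is the principle the paper builds on.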

3.
Crude simulation for estimating the reliability of a stochastic network often requires a large sample size to obtain statistically significant results. In this paper, we propose a simple recursive importance and stratified sampling estimator which is shown to be unbiased and to achieve smaller variance. Preallocating a sampling effort of size two to each undetermined subnetwork at each stage makes it possible to estimate the variance of the proposed estimator and significantly enhances the effectiveness of the variance reduction from stratification by deferring the termination of the recursive stratification. Empirical results show that the proposed estimator achieves significant variance reduction, especially for highly reliable networks.

4.
Demographic and socio-economic information provided by the American Community Survey (ACS) has been increasingly relied upon in many planning and decision-making contexts because of its timely and current estimates. However, ACS estimates are well known to be subject to larger sampling errors, since the sample size is much smaller than that of the decennial census. To support the assessment of the reliability of ACS estimates, the US Census Bureau publishes a margin of error at the 90% confidence level alongside each estimate. While data error or uncertainty in ACS estimates has been widely acknowledged, little has been done to devise methods accounting for such error or uncertainty. This article focuses on addressing ACS data uncertainty issues in choropleth mapping, one of the most widely used methods to visually explore spatial distributions of demographic and socio-economic data. A new classification method is developed to explicitly integrate errors of estimation into the assessment of within-class variation and the associated groupings. The proposed method is applied to mapping the 2009–2013 ACS estimates of median household income at various scales. Results are compared with those generated using existing classification methods to demonstrate the effectiveness of the new classification scheme.

5.
Multidimensional projection-based visualization methods typically rely on clustering and attribute selection mechanisms to enable visual analysis of multidimensional data. Clustering is often employed to group similar instances according to their distance in the visual space. However, considering only distances in the visual space may be misleading, due to projection errors as well as the lack of guarantees that distinct clusters contain instances with different content. Identifying clusters made up of only a few elements is also an issue for most clustering methods. In this work we propose a novel multidimensional projection-based visualization technique that relies on representative instances to define clusters in the visual space. Representative instances are selected by a deterministic sampling scheme derived from matrix decomposition, which is sensitive to the variability of the data while still being able to handle classes with a small number of instances. Moreover, the sampling mechanism can easily be adapted to select relevant attributes from each cluster. Therefore, our methodology unifies sampling, clustering, and feature selection in a simple framework. A comprehensive set of experiments validates our methodology, showing that it outperforms most existing sampling and feature selection techniques. A case study shows the effectiveness of the proposed methodology as a visual data analysis tool.
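One concrete way such a deterministic, matrix-decomposition-based selection of representatives could look (a sketch under my own choice of factorization, column-pivoted QR, since the abstract does not name one):

```python
import numpy as np
from scipy.linalg import qr

def representative_instances(X, k):
    """Pick k representative rows of X via column-pivoted QR on X^T,
    so the chosen rows span the directions of data variability
    (a stand-in for the unspecified decomposition in the abstract)."""
    # Pivoting orders columns of X^T (i.e. rows of X) by decreasing
    # contribution to the column space.
    _, _, piv = qr(X.T, pivoting=True, mode='economic')
    return piv[:k]          # indices of the k representative instances

# usage sketch
X = np.random.rand(500, 10)
reps = representative_instances(X, 5)
```

The pivot order ranks instances by how much new directional variation they add, which gives a deterministic, variability-aware ordering.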

6.

Preprocessing of data is ubiquitous, and choosing significant attributes is one of the important steps in this prior processing. Feature selection is used to create a subset of relevant features for effective classification of data. When classifying high-dimensional data, the classifier usually depends on the feature subset used for classification. The Relief algorithm is a popular heuristic approach for selecting significant feature subsets: it scores features individually and selects the top-scored features for subset generation. Many extensions of the Relief algorithm have been developed. However, an important defect in Relief-based algorithms has been ignored for years: because of the uncertainty and noise of the instances used for measuring the feature scores, the results vacillate with the chosen instances, which leads to poor classification accuracy. To fix this problem, a novel feature selection algorithm based on a Chebyshev-distance outlier detection model is proposed, called noisy feature removal ReliefF (NFR-ReliefF for short). To demonstrate the performance of NFR-ReliefF, an extensive experiment, including classification tests, has been carried out on nine benchmark high-dimensional datasets by combining the proposed model with standard classifiers, including naïve Bayes, C4.5 and KNN. The results show that NFR-ReliefF outperforms the other models on most of the tested datasets.
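For orientation, a bare-bones version of the underlying Relief scoring loop (binary classes, numeric features; this is the standard baseline, not the NFR-ReliefF extension proposed in the paper):

```python
import numpy as np

def relief_scores(X, y, n_iter=100, rng=None):
    """Basic binary Relief: reward features that separate an instance
    from its nearest miss and penalise features that separate it
    from its nearest hit."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12   # per-feature normalisation
    w = np.zeros(d)
    for i in rng.integers(0, n, n_iter):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                            # exclude the instance itself
        same, diff = (y == y[i]), (y != y[i])
        hit = np.argmin(np.where(same, dist, np.inf))
        miss = np.argmin(np.where(diff, dist, np.inf))
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / span
    return w / n_iter   # higher score = more relevant feature
```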


7.
This paper addresses the problem of reducing training samples for synthesizing diagnostic models. A method for reducing the dimensionality of the training sample based on association rules is proposed. It comprises stages that reduce instances, features, and superfluous terms, and it uses information about the extracted association rules to evaluate the informativeness of features. The proposed method creates a partition of the feature space with fewer instances than the original sample, which in turn makes it possible to synthesize diagnostic models that are simpler and easier to interpret. The method has been implemented in software and applied to the practical problem of reducing the training sample for the synthesis of a diagnostic model of confectionery product quality.

8.
Sampling is a fundamental method for generating data subsets. As many data analysis methods are developed based on probability distributions, maintaining those distributions when sampling can help to ensure good data analysis performance. However, sampling a minimum subset while maintaining the probability distributions is still an open problem. In this paper, we decompose a joint probability distribution into a product of conditional probabilities based on Bayesian networks and use the chi-square test to formulate a sampling problem that requires the sampled subset to pass a distribution test. Furthermore, a heuristic sampling algorithm is proposed to generate the required subset by designing two scoring functions: one based on the chi-square test and the other based on likelihood functions. Experiments on four types of datasets with a size of 60,000 show that when the significance level α is set to 0.05, the algorithm can exclude 99.9%, 99.0%, 93.1% and 96.7% of the samples based on their respective Bayesian networks (ASIA, ALARM, HEPAR2, and ANDES). When subsets of the same size are sampled, the subset generated by our algorithm passes all the distribution tests and the average distribution difference is approximately 0.03; by contrast, the subsets generated by random sampling pass only 83.8% of the tests, and the average distribution difference is approximately 0.24.
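The distribution test itself can be sketched with `scipy.stats.chisquare`; the example below checks a single categorical marginal, whereas the paper factorizes the joint distribution with a Bayesian network and tests the resulting conditionals (function and variable names are mine):

```python
import numpy as np
from scipy.stats import chisquare

def subset_preserves_marginal(full_counts, subset_counts, alpha=0.05):
    """Chi-square goodness-of-fit test of the subset's category counts
    against the frequencies expected from the full dataset."""
    expected = full_counts / full_counts.sum() * subset_counts.sum()
    stat, p = chisquare(f_obs=subset_counts, f_exp=expected)
    return p > alpha        # True: the subset passes at significance level alpha

# usage sketch: one 3-category attribute
full = np.array([6000, 3000, 1000])
sub = np.array([58, 33, 9])
print(subset_preserves_marginal(full, sub))
```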

9.
Active learning reduces the sample complexity of a learning algorithm by actively selecting which examples to label. In contrast to the version-space-halving strategy commonly adopted by current active learning algorithms, this paper proposes a strategy that reduces the version space by more than half, which avoids the strong assumptions required by the halving strategy. Based on this strategy, a heuristic active learning algorithm (CBMPMS) is implemented that selects as training examples those most likely to be misclassified. The algorithm computes the entropy of the difference between the class probabilities predicted for an example by a committee of hypotheses randomly drawn from the version space and by the current learner, and uses this as the selection criterion. Experiments on UCI datasets show that the algorithm achieves better performance than related work on most datasets.

10.
The major challenge in constructing a statistical shape model for a structure is shape correspondence, which identifies a set of corresponded landmarks across a population of shape instances to accurately estimate the underlying shape variation. Both global and pairwise shape-correspondence methods have been developed to automatically identify the corresponded landmarks. In global methods, landmarks are found by optimizing a comprehensive objective function that considers the entire population of shape instances. While global methods can produce very accurate shape correspondence, they tend to be very inefficient when the population size is large. In pairwise methods, all shape instances are corresponded to a given template independently, so pairwise methods are usually very efficient. However, if the population exhibits a large amount of shape variation, pairwise methods may produce very poor shape correspondence. In this paper, we develop a new method that attempts to address the limitations of both global and pairwise methods. In particular, we first construct a shape tree to globally organize the population of shape instances by identifying similar shape-instance pairs. We then perform pairwise shape correspondence between such similar shape instances with high accuracy. Finally, we combine these pairwise correspondences to achieve a unified correspondence for the entire population of shape instances. We evaluate the proposed method by comparing its performance to five available shape correspondence methods, and show that it achieves the accuracy of a global method with the efficiency of a pairwise method.

11.
Agile methods for software development promote iterative design and implementation. Most of them divide a project into functionalities, called user stories; at each iteration, often called a sprint, a subset of user stories is developed. The sprint planning phase is critical to ensure the project's success, but it is also a difficult problem because several factors affect the optimality of a sprint plan, e.g., the estimated complexity, business value, and affinity of the user stories to be included in each sprint. In this paper we present an approach to sprint planning based on an integer linear programming model. Given the estimates made by the project team and a set of development constraints, the optimal solution of the model is a sprint plan that maximizes the business value perceived by users. Solving the model to optimality with a general-purpose MIP solver, such as IBM ILOG CPLEX, takes time, and for some instances even finding a feasible solution requires computing times too large for operational use. For this reason we propose an effective Lagrangian heuristic based on a relaxation of the proposed model and some greedy and exchange algorithms. Computational results on both real and synthetic projects show the effectiveness of the proposed approach.
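A stripped-down version of such a sprint-planning model (my simplification, omitting the affinity and precedence considerations the abstract alludes to) can be written as an assignment-style ILP: with $x_{ij} = 1$ if user story $i$ is scheduled in sprint $j$, business value $v_i$, complexity estimate $c_i$, and sprint capacity $C_j$,

$$\max \sum_i \sum_j v_i\, x_{ij} \quad \text{s.t.} \quad \sum_j x_{ij} \le 1 \;\; \forall i, \qquad \sum_i c_i\, x_{ij} \le C_j \;\; \forall j, \qquad x_{ij} \in \{0, 1\}.$$

A Lagrangian heuristic of the kind the abstract describes would relax coupling constraints of this form to obtain bounds and feasible plans quickly.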

12.
A method for calculating the tangent direction for a digital curve on the basis of the Hough transform is proposed. The initial data for the Hough transform are taken from tables of angular intervals obtained from a catalog of digital curves. Such a catalog is constructed for a chosen size of the local window on the basis of a specially developed indexation of curves. When determining the intervals of tangent directions for the catalog of curves, several methods of parameterization are used: straight line, elliptic or sinusoidal arc, and superposition of harmonics. The curve selection criterion is the maximum curvature of the corresponding smooth parametric curves. This makes it possible to adapt the method to a class of initial curves. Comparison with earlier-developed methods has demonstrated a significant increase in accuracy, especially with an increase in the curvature.

13.
Many data mining algorithms are sensitive to the size of the dataset, which may result in large memory requirements and very slow response times on large datasets. Thus, data reduction is an important issue in the field of data mining. This paper proposes a novel method for spatial point-data reduction. The main idea is to search for a small subset of instances, composed of border instances, from the original training set by using a modified pulse coupled neural network (PCNN) model. Original training instances are mapped onto pulse coupled neurons, and a firing algorithm is presented for determining which instances lie in border regions and for filtering noisy instances. The reduced set maintains the main characteristics of the original dataset and avoids the influence of noise, so it can keep or even improve the quality of data mining results. The proposed method is a general data reduction algorithm which can be used to improve classification, regression and clustering algorithms. The method achieves approximately linear time complexity and can be used to process large spatial datasets. Experiments demonstrate that the proposed method is efficient and effective.

14.
In a general command tracking and disturbance rejection problem, it is known that a sampled-data controller using a zero-order hold may only guarantee asymptotic tracking at the sampling instants, but cannot smooth out the ripples between them. In this paper, a sampled-data robust servomechanism controller using an exponential hold is developed to guarantee asymptotic tracking not only at, but also between, the sampling instants. In this development, a so-called “internally-reducible” condition characterizing a class of robust servomechanism controllers is derived first; the proposed controller is then shown to be contained in this class. Generally speaking, a sampled-data structure using an exponential hold provides more design freedom, so it tends to simplify the construction of a robust servomechanism controller and to facilitate implementation on digital computers. An example of DC motor control is presented to illustrate the advantages of this approach.

15.
An increasing awareness of the need for high speed parallel processing systems for image analysis has stimulated a great deal of interest in the design and development of such systems. Efficient processing schemes for several specific problems have been developed, providing some insight into the general problems encountered in designing efficient image processing algorithms for parallel architectures. However, it is still not clear what architecture or architectures are best suited for image processing in general, or how one may go about determining those which are. An approach that would allow application requirements to specify architectural features would be useful in this context. Working towards this goal, general principles are outlined for formulating parallel image processing tasks by exploiting parallelism in the algorithms and data structures employed. A synchronous parallel processing model is proposed which governs the communication and interaction between these tasks. This model presents a uniform framework for comparing and contrasting different formulation strategies. In addition, techniques are developed for analyzing instances of this model to determine a high level specification of a parallel architecture that best ‘matches’ the requirements of the corresponding application. It is also possible to derive initial estimates of the component capabilities that are required to achieve predefined performance levels. Such analysis tools are useful in the design stage, in the selection of a specific parallel architecture, and in efficiently utilizing an existing one. In addition, the architecture-independent specification of application requirements makes it a useful tool for benchmarking applications.

16.
This paper deals with the classical one-dimensional integer cutting stock problem, which consists of cutting a set of available stock lengths in order to produce smaller ordered items, so as to optimize a given objective function (e.g., minimizing waste). Our study deals with a case in which several stock lengths are available in limited quantities; moreover, we focus on problems with low demand. Several heuristic methods are proposed for obtaining an integer solution and are compared with existing ones. The heuristic methods are empirically analyzed by solving a set of randomly generated instances and a set of instances from the literature. For the latter, most of the optimal solutions are known, so it was possible to compare the solutions. The proposed methods presented very small objective function value gaps.
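For comparison with such heuristics, a simple first-fit-decreasing baseline for the setting described (several stock lengths in limited quantities) might look like this; it is an illustrative construction, not one of the methods proposed in the paper:

```python
def first_fit_decreasing(item_lengths, stock):
    """Place items (longest first) into the first stock object with
    enough residual length. `stock` is a list of (length, quantity) pairs."""
    objects = [L for L, q in stock for _ in range(q)]   # expand available stock
    remaining = {k: L for k, L in enumerate(objects)}
    plan = {k: [] for k in remaining}
    for item in sorted(item_lengths, reverse=True):
        for k in sorted(remaining):
            if remaining[k] >= item:
                remaining[k] -= item
                plan[k].append(item)
                break
        else:
            raise ValueError(f"item of length {item} cannot be placed")
    # waste = leftover length on the stock objects actually used
    waste = sum(remaining[k] for k in remaining if plan[k])
    return plan, waste

# usage sketch
plan, waste = first_fit_decreasing([45, 36, 31, 14, 14], [(100, 2), (80, 1)])
```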

17.
Due to its NP-hard nature, it is still difficult to find an optimal solution for instances of the binary knapsack problem with as few as 100 variables. In this paper, we develop a three-level hyper-heuristic framework to generate algorithms for the problem. Algorithms are generated from elementary components and multiple sets of problem instances. The best algorithms are selected to go through a second-step process, where they are evaluated on problem instances that differ in size and difficulty. The problem instances are generated according to methods found in the literature. On all of the larger problem instances, the generated algorithms have less than 1% error with respect to the optimal solution. Additionally, the generated algorithms are efficient, taking on average a fraction of a second to find a solution for any instance, with a standard deviation of 1 s. In terms of structure, the hyper-heuristic algorithms are compact compared with those in the literature, allowing an in-depth analysis of their structure and their presentation to the scientific community.
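As a reference point for the reported sub-1% errors, the exact pseudo-polynomial dynamic program for the 0/1 knapsack problem (a standard textbook method, feasible when capacities are moderate) is:

```python
def knapsack_dp(values, weights, capacity):
    """Exact 0/1 knapsack by dynamic programming, O(n * capacity)."""
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        # Traverse capacities downwards so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

# usage sketch
print(knapsack_dp([60, 100, 120], [10, 20, 30], 50))   # expected 220
```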

18.
In gene-disease association studies, the cost of genotyping makes it economical to use a two-stage design in which only a subset of the cohort is genotyped. At the first stage, the follow-up data along with some risk factors or non-genetic covariates are collected for the cohort, and a subset of the cohort is then selected for genotyping at the second stage. Intuitively, the selection of the subset for the second stage can be carried out efficiently if the data collected at the first stage are utilized. The information contained in the conditional probability of the genotype given the first-stage data and the initial estimates of the parameters of interest is maximized for efficient selection of the subset. The proposed selection method is illustrated using logistic regression and Cox's proportional hazards model, and algorithms that can find optimal or nearly optimal designs in a discrete design space are presented. Simulation comparisons between D-optimal design, extreme selection and case-cohort design suggest that the D-optimal design is the most efficient in terms of the variance of the estimated parameters, but extreme selection may be a good alternative for practical study design.

19.
An active learning algorithm that selects the examples most likely to be mispredicted
By selecting the most informative examples and submitting them to an expert for labeling, active learning can effectively reduce the burden of labeling a large number of unlabeled examples. Sampling is a key factor affecting the performance of an active learning algorithm. Mainstream sampling algorithms usually try to select examples that bisect the version space as evenly as possible. However, this approach assumes that every hypothesis in the version space is equally likely to be the target function, which cannot hold in real-world problems. This paper analyzes the limitations of the version-space-bisecting strategy and then proposes a heuristic sampling algorithm, MPWPS (the most possibly wrong-predicted sampling), that aims to shrink the version space as much as possible: at each sampling step it selects the example the current classifier is most likely to predict incorrectly, thereby eliminating more than half of the hypotheses in the version space. As a result, the classifier reaches the same classification accuracy with fewer sampled examples than mainstream active learning algorithms that bisect the version space. Experiments show that on most datasets, MPWPS requires fewer sampling rounds than traditional sampling algorithms to reach the same target accuracy.
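The selection rule described, querying the example the current classifier is most likely to predict incorrectly, can be approximated with a least-confidence criterion; the sketch below uses logistic regression as the base learner (my choice; the paper's exact scoring rule may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def query_most_possibly_wrong(X_labeled, y_labeled, X_pool):
    """Return the index of the pool example with the lowest confidence
    of the current classifier, i.e. the one most likely to be mispredicted."""
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    proba = clf.predict_proba(X_pool)
    confidence = proba.max(axis=1)          # confidence of the predicted class
    return int(np.argmin(confidence))       # example to send to the expert next
```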

20.
The nearest neighbor (NN) algorithm is a simple and practical supervised classification algorithm. However, to classify an unlabeled example, NN must store the entire training set and compute the distance between that example and every training example, so its computational complexity is very high. To overcome this drawback, P. Hart proposed the condensed nearest neighbor (CNN) rule, which finds a consistent subset of the original training set (a consistent subset is one that correctly classifies the remaining examples in the training set). Its computational complexity is still rather high; in particular, for large databases, finding a consistent subset is very time-consuming. To address this problem, a condensed nearest neighbor rule based on rough set techniques is proposed. The algorithm has three steps: first, rough-set-based attribute reduction (feature selection) removes redundant attributes; second, examples close to the decision boundary are selected, removing redundant examples; finally, a consistent subset is computed from the selected examples. The algorithm thus reduces the data in both the vertical and the horizontal direction. Experimental results show that the proposed method is effective.
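For reference, Hart's original CNN rule that the proposed method builds on can be sketched as follows (the rough-set attribute reduction and border-instance selection steps of the proposed three-step algorithm are not included):

```python
import numpy as np

def condensed_nearest_neighbor(X, y, rng=None):
    """Grow a consistent subset: repeatedly add every instance that the
    current subset misclassifies with 1-NN, until nothing changes."""
    rng = np.random.default_rng(rng)
    order = rng.permutation(len(X))
    keep = [order[0]]                       # seed the subset with one instance
    changed = True
    while changed:
        changed = False
        for i in order:
            # 1-NN prediction using only the instances kept so far.
            nearest = keep[np.argmin(np.linalg.norm(X[keep] - X[i], axis=1))]
            if y[nearest] != y[i]:
                keep.append(i)
                changed = True
    return np.array(keep)                   # indices of the condensed subset
```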
