Similar Articles
20 similar articles found.
1.
This paper introduces a robust variant of AdaBoost, cw-AdaBoost, that uses weight perturbation to reduce variance error and is particularly effective when dealing with datasets, such as microarray data, that have large numbers of features and small numbers of instances. The algorithm is compared with AdaBoost, Arcing, and MultiBoost on twelve gene expression datasets using 10-fold cross-validation. The new algorithm consistently achieves higher classification accuracy across all these datasets. In contrast to other AdaBoost variants, the algorithm is not susceptible to problems when a zero-error base classifier is encountered.
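The abstract does not spell out the exact cw-AdaBoost update, so the sketch below only illustrates the general idea it describes: a standard AdaBoost loop whose per-round sample weights are multiplicatively perturbed before training the base learner, with the weighted error clipped so that a zero-error base classifier does not break the weight update. The noise range, base learner, and clipping rule are assumptions, not the published algorithm.

```python
# Minimal AdaBoost-style loop with perturbed sample weights (illustrative only;
# the published cw-AdaBoost update rule may differ from this sketch).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def perturbed_adaboost(X, y, n_rounds=50, noise=0.1, seed=0):
    """X, y are NumPy arrays; y must be coded as -1/+1. Returns (classifiers, alphas)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)
    clfs, alphas = [], []
    for _ in range(n_rounds):
        # Perturb weights with multiplicative noise before training (assumed scheme).
        w_pert = w * rng.uniform(1.0 - noise, 1.0 + noise, size=n)
        w_pert /= w_pert.sum()
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w_pert)
        pred = stump.predict(X)
        err = np.sum(w_pert * (pred != y))
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoids the zero-error breakdown
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        clfs.append(stump)
        alphas.append(alpha)
    return clfs, alphas

def boosted_predict(clfs, alphas, X):
    score = sum(a * c.predict(X) for c, a in zip(clfs, alphas))
    return np.sign(score)
```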

2.
Four 1 km global land cover products are currently available to the scientific community: the University of Maryland (UMD) global land cover product, the International Geosphere-Biosphere Programme Data and Information System Cover (IGBP-DISCover), the MODerate resolution Imaging Spectrometer (MODIS) global land cover product, and Global Land Cover 2000 (GLC2000). Because of differences in data sources, temporal scales, classification systems, and methodologies, it is important to compare and validate these global maps before using them for studies at regional to global scales. This study aimed to validate and compare the four global land cover datasets and to examine the suitability and accuracy of different coarse spatial resolution datasets for mapping and monitoring cropland across China. To meet this objective, we compared the four global land cover products with the National Land Cover Dataset 2000 (NLCD-2000) at three scales to evaluate the accuracy of estimates of aggregated cropland areas in China. This was followed by a spatial comparison to assess the accuracy of the four products in estimating the spatial distribution of cropland across China. The comparative analysis showed that there are varying levels of discrepancy between these four global land cover datasets in estimating the cropland of China, and that both area totals and spatial (dis)agreement between them vary from region to region. Among the four, the MODIS dataset fits best in depicting China's croplands. The coarse spatial resolution and the per-pixel classification approach, as well as landscape heterogeneity, are the main reasons for the large discrepancies between the global land cover datasets tested and the reference data.
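As a concrete illustration of the kind of per-class area and spatial-agreement comparison described above, the following sketch compares two co-registered categorical land cover arrays; the class code and pixel area are hypothetical placeholders, and this is not the validation protocol used in the study.

```python
# Toy comparison of two co-registered categorical land cover rasters
# (class code and pixel area are hypothetical; not the study's protocol).
import numpy as np

def compare_cropland(map_a, map_b, cropland_code=1, pixel_area_km2=1.0):
    """map_a, map_b: 2-D integer arrays of land cover class codes on the same grid."""
    a_crop = (map_a == cropland_code)
    b_crop = (map_b == cropland_code)
    area_a = a_crop.sum() * pixel_area_km2          # cropland area total in map A
    area_b = b_crop.sum() * pixel_area_km2          # cropland area total in map B
    agree = np.logical_and(a_crop, b_crop).sum() * pixel_area_km2
    union = np.logical_or(a_crop, b_crop).sum() * pixel_area_km2
    return {"area_a": area_a,
            "area_b": area_b,
            "spatial_agreement": agree / union if union else 1.0}
```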

3.
4.
Anomaly detection in multivariate time series is a challenging problem: a model must learn informative representations from complex temporal dynamics and derive a discriminative criterion that can identify a small number of anomalous points among a large number of normal time points. However, the complex temporal correlations and high dimensionality of multivariate time series still lead to poor anomaly detection performance. To address these problems, this paper proposes UMTS-Mixer, a model based on the MLP (multi-layer perceptron) architecture; because the linear structure of the MLP is order-sensitive, it is used to capture both temporal correlations and cross-channel correlations. Extensive experiments show that UMTS-Mixer detects time series anomalies effectively and performs better on four benchmark datasets, achieving the highest F1 scores on the MSL and PSM datasets of 91.35% and 92.93%, respectively.
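UMTS-Mixer itself is not specified beyond MLP-based mixing of temporal and cross-channel correlations, so the sketch below is only a generic MLP reconstruction baseline for multivariate time-series anomaly scoring: sliding windows are reconstructed by an MLP, and windows whose reconstruction error exceeds a quantile threshold are flagged. The window length, layer sizes, and threshold rule are assumptions, not the published architecture.

```python
# Generic MLP reconstruction baseline for multivariate time-series anomaly scoring
# (not the UMTS-Mixer architecture; sizes and threshold rule are assumptions).
import numpy as np
from sklearn.neural_network import MLPRegressor

def window(X, w):
    """Slice a (T, C) series into flattened sliding windows of length w."""
    return np.stack([X[i:i + w].ravel() for i in range(len(X) - w + 1)])

def fit_scorer(train_series, w=16):
    W = window(train_series, w)
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500,
                         random_state=0).fit(W, W)          # reconstruct the window itself
    train_err = np.mean((model.predict(W) - W) ** 2, axis=1)
    threshold = np.quantile(train_err, 0.99)                 # assumed threshold rule
    return model, threshold, w

def score(model, threshold, w, series):
    W = window(series, w)
    err = np.mean((model.predict(W) - W) ** 2, axis=1)
    return err, err > threshold                              # per-window score and flag
```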

5.
To cleanse mislabeled examples from a training dataset for efficient and effective induction, most existing approaches adopt a major set oriented scheme: the training dataset is separated into two parts (a major set and a minor set), and classifiers learned from the major set are used to identify noise in the minor set. The obvious drawbacks of such a scheme are twofold: (1) when the underlying data volume keeps growing, it would be either physically impossible or time consuming to load the major set into memory for inductive learning; and (2) for multiple or distributed datasets, it can be either technically infeasible or factitiously forbidden to download data from other sites (for security or privacy reasons). Therefore, these approaches have severe limitations in conducting effective global data cleansing from large, distributed datasets. In this paper, we propose a solution that bridges local and global analysis for noise cleansing. More specifically, the proposed effort tries to identify and eliminate mislabeled data items from large or distributed datasets through local analysis and global incorporation. For this purpose, we make use of distributed datasets or partition a large dataset into subsets, each of which is regarded as a local subset and is small enough to be processed by an induction algorithm at one time to construct a local model for noise identification. We construct good rules from each subset and use the good rules to evaluate the whole dataset. For a given instance I_k, two error count variables are used to count the number of times it has been identified as noise by all data subsets. Instances with higher error values have a higher probability of being mislabeled examples. Two threshold schemes, majority and non-objection, are used to identify and eliminate the noisy examples. Experimental results and comparative studies on both real-world and synthetic datasets are reported to evaluate the effectiveness and efficiency of the proposed approach. A preliminary version of this paper was published in the Proceedings of the 20th International Conference on Machine Learning, Washington D.C., USA, 2003, pp. 920-927.
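A minimal sketch of the partition-and-vote scheme described above, using a decision tree learned on each subset as a stand-in for its "good rules": every subset model evaluates the whole dataset, an error counter records how often each instance is contradicted, and either the majority or the non-objection threshold flags it as mislabeled. The base learner, subset count, and tree depth are assumptions.

```python
# Local-model voting for mislabeled-example identification (a simplified sketch;
# decision trees stand in for the paper's "good rules").
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def flag_mislabeled(X, y, n_subsets=5, scheme="majority", seed=0):
    """X, y are NumPy arrays. Returns a boolean mask of suspected noisy instances."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(y)), n_subsets)
    error_count = np.zeros(len(y), dtype=int)
    for part in parts:
        # Each local model is built from one small subset and applied globally.
        model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X[part], y[part])
        error_count += (model.predict(X) != y)
    if scheme == "majority":          # flagged by more than half of the local models
        return error_count > n_subsets // 2
    if scheme == "non-objection":     # flagged only if every local model agrees
        return error_count == n_subsets
    raise ValueError(f"unknown scheme: {scheme}")
```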

6.
The handling of missing values is a topic of growing interest in the software quality modeling domain. Data values may be absent from a dataset for numerous reasons, for example, the inability to measure certain attributes. As software engineering datasets are sometimes small in size, discarding observations (or program modules) with incomplete data is usually not desirable. Deleting data from a dataset can result in a significant loss of potentially valuable information. This is especially true when the missing data is located in an attribute that measures the quality of the program module, such as the number of faults observed in the program module during testing and after release. We present a comprehensive experimental analysis of five commonly used imputation techniques. This work also considers three different mechanisms governing the distribution of missing values in a dataset, and examines the impact of noise on the imputation process. To our knowledge, this is the first study to thoroughly evaluate the relationship between data quality and imputation. Further, our work is unique in that it employs a software engineering expert to oversee the evaluation of all of the procedures and to ensure that the results are not inadvertently influenced by poor quality data. Based on a comprehensive set of carefully controlled experiments, we conclude that Bayesian multiple imputation and regression imputation are the most effective techniques, while mean imputation performs extremely poorly. Although a preliminary evaluation has been conducted using Bayesian multiple imputation in the empirical software engineering domain, this is the first work to provide a thorough and detailed analysis of this technique. Our studies also demonstrate conclusively that the presence of noisy data has a dramatic impact on the effectiveness of imputation techniques.
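To make the comparison concrete, here is a small sketch contrasting mean imputation with a regression-style (iterative) imputer from scikit-learn on data with known ground truth. Bayesian multiple imputation is not included, and the RMSE-on-masked-entries evaluation is an assumption rather than the paper's protocol.

```python
# Mean vs regression-style imputation, scored by RMSE on the masked entries
# (an illustrative harness, not the paper's experimental design).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def compare_imputers(X_missing, X_true):
    """X_missing contains NaNs; X_true holds the original complete values."""
    mask = np.isnan(X_missing)
    results = {}
    for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                          ("regression", IterativeImputer(max_iter=10, random_state=0))]:
        X_hat = imputer.fit_transform(X_missing)
        results[name] = np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2))
    return results   # lower RMSE means better reconstruction of the missing values
```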

7.
Intrusion detection systems (IDSs) generate a large number of alarms, most of which are false positives. Fortunately, alarms are triggered by identifiable root causes, and most of these root causes are not attacks. In this paper, a new data mining technique has been developed to group alarms and produce clusters; each cluster is then abstracted as a generalized alarm. The generalized alarms related to root causes are converted to filters to reduce the future alarm load. The proposed algorithm makes use of nearest-neighboring and generalization concepts to cluster alarms. As a clustering algorithm, it uses a new measure to compute distances between alarm feature values. This measure depends on background knowledge of the monitored network, making it robust and meaningful. The new data mining technique was verified with many datasets, and the average reduction ratio was about 82% of the total alarms. Applying the new technique to alarm logs greatly helps the security analyst in identifying root causes and then reduces the alarm load in the future.
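The paper's distance measure relies on background knowledge of the monitored network; the sketch below illustrates one plausible form of such a measure: each alarm attribute has a generalization hierarchy (a child-to-parent map, e.g. IP → subnet → "any"), the per-attribute distance is the number of generalization steps to a common ancestor, and alarms within a total-distance threshold are grouped around a representative. The hierarchies, attributes, and threshold are hypothetical, not the published measure.

```python
# Illustrative alarm grouping with a generalization-hierarchy distance
# (hierarchies, attributes, and threshold are hypothetical).
def generalization_distance(a, b, parent):
    """Steps needed to generalize values a and b to a common ancestor."""
    def ancestors(v):
        chain = [v]
        while v in parent:
            v = parent[v]
            chain.append(v)
        return chain
    chain_a, chain_b = ancestors(a), ancestors(b)
    for i, v in enumerate(chain_a):
        if v in chain_b:
            return i + chain_b.index(v)
    return len(chain_a) + len(chain_b)          # no common ancestor

def cluster_alarms(alarms, hierarchies, threshold=2):
    """alarms: list of dicts; hierarchies: attribute -> child-to-parent map."""
    clusters = []                               # each cluster keeps its first alarm as representative
    for alarm in alarms:
        for rep, members in clusters:
            d = sum(generalization_distance(alarm[k], rep[k], hierarchies.get(k, {}))
                    for k in rep)
            if d <= threshold:
                members.append(alarm)
                break
        else:
            clusters.append((alarm, [alarm]))   # start a new cluster / generalized alarm
    return clusters
```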

8.
We present a multilevel technique for the compression and reduction of univariate data and give an optimal complexity algorithm for its implementation. A hierarchical scheme offers the flexibility to produce multiple levels of partial decompression of the data so that each user can work with a reduced representation that requires minimal storage whilst achieving the required level of tolerance. The algorithm is applied to the case of turbulence modelling, in which the datasets are traditionally not only extremely large but inherently non-smooth and, as such, rather resistant to compression. We decompress the data for a range of relative errors, carry out the usual analysis procedures for turbulent data, and compare the results of the analysis on the reduced datasets to the results that would be obtained on the full dataset. The results obtained demonstrate the promise of multilevel compression techniques for the reduction of data arising from large scale simulations of complex phenomena such as turbulence modelling.
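The abstract does not describe the hierarchy itself, so the following is only a toy Haar-style multilevel decomposition of a 1-D array, included to show how coarse averages plus per-level details support partial decompression at several resolutions; the thresholding rule is an assumption, and this is not the paper's optimal-complexity algorithm.

```python
# Toy Haar-style multilevel decomposition with crude detail thresholding
# (illustrates partial decompression; not the paper's algorithm).
import numpy as np

def multilevel_decompose(x, levels, tol=0.0):
    x = np.asarray(x, dtype=float)
    details = []                            # details[0] is the finest level
    for _ in range(levels):
        if len(x) % 2:                      # pad odd lengths by repeating the last sample
            x = np.append(x, x[-1])
        avg = (x[0::2] + x[1::2]) / 2.0
        det = x[0::2] - avg                 # enough to rebuild both halves exactly
        details.append(np.where(np.abs(det) < tol, 0.0, det))  # drop small details
        x = avg
    return x, details                       # coarsest averages + detail stack

def reconstruct(coarse, details, use_levels=None):
    """Partial decompression: replay only the coarsest `use_levels` detail levels."""
    replay = details[::-1]                  # coarsest level first
    if use_levels is not None:
        replay = replay[:use_levels]
    x = coarse
    for d in replay:
        up = np.empty(2 * len(x))
        up[0::2] = x + d
        up[1::2] = x - d
        x = up
    return x
```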

9.
Granularity of time is an important issue for understanding how actions performed at coarse levels of time interact with others working at finer levels. However, it has not received much attention in most AI work on temporal logic. In simpler domains of application we may not need to consider it a problem, but it becomes important in more complex domains, such as ecological modelling. In this domain, aggregation of processes working at different time granularities (and sometimes cyclically) is very difficult to achieve reliably. We have proposed a new time granularity theory based on modular temporal classes, and have developed a temporal reasoning system to specify cyclical processes of simulation models in ecology at many levels of time.

10.
As the basis of data management and analysis, data quality issues have increasingly become a research hotspot in related fields and contribute to the optimization of big data and artificial intelligence technology. Generally, physical failures or technical defects in data collectors and recorders cause anomalies in the collected data. These anomalies strongly impact subsequent data analysis and artificial intelligence processes; thus, data should be processed and cleaned accordingly before application. Existing repairing methods based on smoothing cause a large number of originally correct data points to be over-repaired into wrong values, while constraint-based methods such as sequential dependency and SCREEN cannot accurately repair data under complex conditions because their constraints are relatively simple. This paper proposes a time series data repairing method under multi-speed constraints based on the principle of minimum repair, and uses dynamic programming to compute the optimal repair. Specifically, multiple speed intervals are set to constrain the time series data, a series of candidate repairing points is formed for each data point according to the speed constraints, and the optimal repair is then selected from these candidates by dynamic programming. To study the feasibility of the method, an artificial dataset, two real datasets, and another real dataset with real anomalies are used in experiments with different anomaly rates and data sizes. Experimental results demonstrate that, compared with existing methods based on smoothing or constraints, the proposed method performs better in terms of RMS error and time cost. In addition, an investigation of clustering and classification accuracy on several datasets reveals the impact of data quality on subsequent data analysis and artificial intelligence. The proposed method can improve the quality of data analysis and artificial intelligence results.
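The following sketch illustrates the minimum-change, dynamic-programming idea with a single speed interval [smin, smax]: each previous candidate generates a few feasible repair values for the current point (the observation clipped into the reachable range, plus the range endpoints), and a small beam of cheapest states is kept. The candidate construction and beam width are simplifications of the paper's multi-interval method.

```python
# Simplified repair under a single speed interval [smin, smax] via dynamic
# programming with a small beam (a sketch of the minimum-change idea only).
import numpy as np

def repair_series(t, x, smin, smax, beam=8):
    t, x = np.asarray(t, float), np.asarray(x, float)
    layers = [[(x[0], 0.0, None)]]              # states: (value, cumulative cost, backpointer)
    for i in range(1, len(x)):
        dt = t[i] - t[i - 1]
        layer = []
        for j, (v_prev, cost, _) in enumerate(layers[-1]):
            lo, hi = v_prev + smin * dt, v_prev + smax * dt
            feasible = min(max(x[i], lo), hi)   # observation clipped into the reachable range
            for cand in {feasible, lo, hi}:     # a few candidate repairs per previous state
                layer.append((cand, cost + abs(cand - x[i]), j))
        layer.sort(key=lambda s: s[1])          # keep only the cheapest states (beam search)
        layers.append(layer[:beam])
    repaired = np.empty(len(x))                 # trace back from the cheapest final state
    j = 0
    for i in range(len(x) - 1, -1, -1):
        value, _, back = layers[i][j]
        repaired[i] = value
        j = back if back is not None else 0
    return repaired
```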

11.
Clustering is a data analysis technique that is particularly useful when there are many dimensions and little prior information about the data. Partitional clustering algorithms are efficient but suffer from sensitivity to the initial partition and to noise. We propose here k-attractors, a partitional clustering algorithm tailored to numeric data analysis. As a preprocessing (initialization) step, it uses maximal frequent itemset discovery and partitioning to define the number of clusters k and the initial cluster "attractors." During its main phase the algorithm uses a distance measure that is adapted with high precision to the way the initial attractors are determined. We applied k-attractors as well as the k-means, EM, and FarthestFirst clustering algorithms to several datasets and compared the results. The comparison favored k-attractors in terms of convergence speed and cluster formation quality in most cases, as it outperforms these three algorithms except in cases of datasets with very small cardinality containing only a few frequent itemsets. On the downside, its initialization phase adds an overhead that can be deemed acceptable only when it contributes significantly to the algorithm's accuracy.
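A rough approximation of the initialization idea, under clearly stated assumptions: numeric columns are discretized, the k most frequent discretized row patterns stand in for the maximal frequent itemsets, and their centroids seed a k-means run. The paper's adapted distance measure and true maximal-itemset mining are not reproduced here.

```python
# Frequent-pattern seeding for k-means as a stand-in for the k-attractors
# initialization (the adapted distance measure is not reproduced).
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def frequent_pattern_init(X, k, bins=5):
    """Use the centroids of the k most frequent discretized rows as initial attractors."""
    edges = [np.quantile(X[:, j], np.linspace(0, 1, bins + 1)[1:-1])
             for j in range(X.shape[1])]
    codes = np.column_stack([np.digitize(X[:, j], edges[j]) for j in range(X.shape[1])])
    top = [pattern for pattern, _ in Counter(map(tuple, codes)).most_common(k)]
    return np.array([X[(codes == np.array(p)).all(axis=1)].mean(axis=0) for p in top])

def k_attractors_like(X, k, bins=5):
    centers = frequent_pattern_init(X, k, bins)
    # If fewer than k frequent patterns exist, fall back to that smaller k.
    return KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
```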

12.
The problem of anomaly detection in time series has received a lot of attention in the past two decades. However, existing techniques cannot locate where the anomalies are within anomalous time series, or they require users to provide the length of potential anomalies. To address these limitations, we propose a self-learning online anomaly detection algorithm that automatically identifies anomalous time series, as well as the exact locations where the anomalies occur in the detected time series. In addition, it is difficult to detect anomalies in multivariate time series due to the following challenges. First, anomalies may occur in only a subset of dimensions (variables). Second, the locations and lengths of anomalous subsequences may differ across dimensions. Third, some anomalies may look normal in each individual dimension but anomalous in combinations of dimensions. To mitigate these problems, we introduce a multivariate anomaly detection algorithm which detects anomalies and identifies the dimensions and locations of the anomalous subsequences. We evaluate our approaches on several real-world datasets, including two CPU manufacturing datasets from Intel. We demonstrate that our approach can successfully detect the correct anomalies without requiring any prior knowledge about the data.

13.
吕品  董武世 《计算机工程与应用》2006,42(24):179-180,186
As a tool for data analysis, data mining can reveal, without reservation, important information from large databases, and some of this information must not be disclosed for various reasons. One remedy is therefore to construct a synthetic dataset with the same frequent itemset characteristics as the original data and release it in place of the frequent itemset mining results. This paper presents an approximate inverse frequent itemset mining method, analyzes its computational complexity, concludes that approximate inverse frequent itemset mining is an NP-complete problem, and identifies the key directions for further research on approximate inverse frequent itemset mining.

14.
Lasso (least absolute shrinkage and selection operator) is a widely used sparse feature selection algorithm. The classic Lasso reduces computational cost to some extent by performing feature selection on high-dimensional data; however, solving the Lasso problem still faces many difficulties and challenges. For example, when the number of features and samples is very large, the data matrix may not even fit into main memory. To meet this challenge, screening acceleration techniques have become a research hotspot in recent years. Before the optimization is solved, screening filters out and removes the inactive features whose coefficients must be zero in the sparse solution, thereby greatly reducing the data dimensionality and accelerating the solution of the sparse optimization problem without any loss of accuracy. This paper first derives the dual problem of the Lasso and, from the properties of the dual, obtains a screening technique based on dual polytope projection; the screening technique is then incorporated into the Lasso feature selection algorithm. Experiments on several high-dimensional datasets verify the good performance of screening for the Lasso in terms of speedup ratio, recognition rate, and running time.
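The dual-polytope-projection rule itself is not reproduced below; as a stand-in, the sketch applies the simpler strong-rule-style test |x_j^T y| < 2*lambda - lambda_max to discard features before fitting scikit-learn's Lasso. Unlike a safe screening rule, this heuristic can occasionally discard an active feature, so it is only an illustration of the screening workflow, not the paper's method.

```python
# Strong-rule-style screening before fitting the Lasso (heuristic stand-in for
# dual-polytope-projection safe screening; assumes centered y, standardized columns).
import numpy as np
from sklearn.linear_model import Lasso

def screen_and_fit(X, y, alpha):
    n = len(y)
    corr = np.abs(X.T @ y) / n                # matches sklearn's 1/(2n) squared-loss scaling
    alpha_max = corr.max()                    # smallest alpha giving an all-zero solution
    keep = corr >= 2 * alpha - alpha_max      # basic strong rule: discard the rest
    coef = np.zeros(X.shape[1])
    if keep.any():
        model = Lasso(alpha=alpha).fit(X[:, keep], y)
        coef[keep] = model.coef_              # scatter back into the full coefficient vector
    return coef, keep
```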

15.
A novel pruning approach using expert knowledge for data-specific pruning
Classification is an important data mining task that discovers hidden knowledge from labeled datasets. Most approaches to pruning assume that all datasets are equally uniform and equally important, so they apply equal pruning to all datasets. However, in real-world classification problems the datasets are not all equal, and applying an equal pruning rate tends to generate decision trees with large size and high misclassification rates. We approach the problem by first investigating the properties of each dataset and then deriving a data-specific pruning value using expert knowledge, which is used to design pruning techniques that prune decision trees close to perfection. An efficient pruning algorithm dubbed EKBP is proposed; it is very general, as any learning algorithm can be used as the base classifier. We have implemented our proposed solution and experimentally verified its effectiveness with forty real-world benchmark datasets from the UCI machine learning repository. In all these experiments, the proposed approach dramatically reduces tree size while enhancing or retaining the level of accuracy.
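EKBP's expert-knowledge pruning value is not specified in the abstract; as a hedged stand-in for "data-specific pruning", the sketch below uses scikit-learn's cost-complexity pruning path plus cross-validation to pick a pruning strength per dataset rather than one global setting.

```python
# Dataset-specific pruning strength via cost-complexity pruning and cross-validation
# (a proxy for EKBP's expert-derived pruning value, not the published rule).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def dataset_specific_prune(X, y, cv=5):
    """Choose a pruning alpha for this dataset and return the pruned tree."""
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    alphas = np.unique(path.ccp_alphas)               # candidate pruning strengths
    scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                              X, y, cv=cv).mean() for a in alphas]
    best_alpha = alphas[int(np.argmax(scores))]       # data-specific choice
    return DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```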

16.
Most biomedical signals are non-stationary. The knowledge of their frequency content and temporal distribution is then useful in a clinical context. The wavelet analysis is appropriate to achieve this task. The present paper uses this method to reveal hidden characteristics and anomalies of the human a-wave, an important component of the electroretinogram since it is a measure of the functional integrity of the photoreceptors. We here analyse the time–frequency features of the a-wave both in normal subjects and in patients affected by Achromatopsia, a pathology disturbing the functionality of the cones. The results indicate the presence of two or three stable frequencies that, in the pathological case, shift toward lower values and change their times of occurrence. The present findings are a first step toward a deeper understanding of the features of the a-wave and possible applications to diagnostic procedures in order to recognise incipient photoreceptoral pathologies.
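A small sketch of the kind of time-frequency analysis described, using the PyWavelets continuous wavelet transform to obtain a scalogram and the dominant frequency over time for a 1-D recording. The PyWavelets dependency, wavelet choice, scale range, and sampling rate are assumptions and are not claimed to match the study's settings.

```python
# Continuous wavelet transform of a 1-D recording (e.g., an a-wave trace);
# wavelet, scales, and sampling rate are illustrative assumptions.
import numpy as np
import pywt  # PyWavelets

def time_frequency_map(signal, fs, wavelet="morl", n_scales=64):
    scales = np.arange(1, n_scales + 1)
    coeffs, freqs = pywt.cwt(signal, scales, wavelet, sampling_period=1.0 / fs)
    power = np.abs(coeffs) ** 2                       # rows: frequencies, columns: time samples
    dominant_freq_per_t = freqs[np.argmax(power, axis=0)]
    return power, freqs, dominant_freq_per_t
```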

17.
Data clustering methods are used extensively in the data mining literature to detect important patterns in large datasets in the form of densely populated regions in a multi-dimensional Euclidean space. Due to the complexity of the problem and the size of the dataset, obtaining quality solutions within reasonable CPU time and memory requirements becomes the central challenge. In this paper, we solve the clustering problem as a large scale p-median model, using a new approach based on the variable neighborhood search (VNS) metaheuristic. Using a highly efficient data structure and local updating procedure taken from the OR literature, our VNS procedure is able to tackle large datasets directly without the need for data reduction or sampling as employed in certain popular methods. Computational results demonstrate that our VNS heuristic outperforms other local search based methods such as CLARA and CLARANS even after upgrading these procedures with the same efficient data structures and local search. We also obtain a bound on the quality of the solutions by solving heuristically a dual relaxation of the problem, thus introducing an important capability to the solution process.
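A compact VNS sketch for the p-median problem on a precomputed distance matrix: shaking swaps k random medians for non-medians, local search applies best single-swap moves, and the neighborhood size k grows when no improvement is found. The paper's fast interchange data structures and the dual bound are not reproduced, and the naive cost evaluation here is far slower than the published procedure.

```python
# Variable neighborhood search sketch for the p-median problem
# (naive cost evaluation; the paper's fast data structures are not reproduced).
import numpy as np

def pmedian_cost(D, medians):
    """Total distance from every point to its nearest median (O(n*p))."""
    return D[:, medians].min(axis=1).sum()

def vns_pmedian(D, p, max_iter=30, seed=0):
    """D: (n, n) distance matrix. Returns (medians, cost)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medians = list(rng.choice(n, size=p, replace=False))
    best_cost = pmedian_cost(D, medians)
    k = 1
    for _ in range(max_iter):
        # Shaking: replace k random medians with k random non-medians.
        outside = [i for i in range(n) if i not in medians]
        kk = min(k, len(outside))
        cand = list(medians)
        for pos, new in zip(rng.choice(p, size=kk, replace=False),
                            rng.choice(len(outside), size=kk, replace=False)):
            cand[pos] = outside[new]
        # Local search: best single-swap improvement until no further gain.
        cost, improved = pmedian_cost(D, cand), True
        while improved:
            improved = False
            for i in range(p):
                for j in range(n):
                    if j in cand:
                        continue
                    trial = list(cand)
                    trial[i] = j
                    c = pmedian_cost(D, trial)
                    if c < cost:
                        cand, cost, improved = trial, c, True
        if cost < best_cost:
            medians, best_cost, k = cand, cost, 1   # move: accept and reset neighborhood size
        else:
            k = min(k + 1, p)                       # enlarge the neighborhood
    return medians, best_cost
```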

18.
Finding the rare instances or the outliers is important in many KDD (knowledge discovery and data-mining) applications, such as detecting credit card fraud or finding irregularities in gene expressions. Signal-processing techniques have been introduced to transform images for enhancement, filtering, restoration, analysis, and reconstruction. In this paper, we present a new method in which we apply signal-processing techniques to solve important problems in data mining. In particular, we introduce a novel deviation (or outlier) detection approach, termed FindOut, based on wavelet transform. The main idea in FindOut is to remove the clusters from the original data and then identify the outliers. Although previous research showed that such techniques may not be effective because of the nature of the clustering, FindOut can successfully identify outliers from large datasets. Experimental results on very large datasets are presented which show the efficiency and effectiveness of the proposed approach.

19.
The explosion of the Internet provides us with a tremendous resource of images shared online. It also confronts vision researchers with the problem of finding effective methods to navigate this vast amount of visual information. Semantic image understanding plays a vital role in solving this problem. One important task in image understanding is object recognition, in particular, generic object categorization. Critical to this problem are the issues of learning and datasets. Abundant data helps to train a robust recognition system, while a good object classifier can help to collect a large number of images. This paper presents a novel object recognition algorithm that performs automatic dataset collection and incremental model learning simultaneously. The goal of this work is to use the tremendous resources of the web to learn robust object category models for detecting and searching for objects in real-world cluttered scenes. Humans continuously update their knowledge of objects when new examples are observed. Our framework emulates this human learning process by iteratively accumulating model knowledge and image examples. We adapt a non-parametric latent topic model and propose an incremental learning framework. Our algorithm is capable of automatically collecting much larger object category datasets for 22 randomly selected classes from the Caltech 101 dataset. Furthermore, our system offers not only more images in each object category but also a robust object category model and meaningful image annotation. Our experiments show that OPTIMOL is capable of collecting image datasets that are superior to the well-known manually collected object datasets Caltech 101 and LabelMe.

20.
Image analysis plays an important role both in medical diagnostics and in biology. The main reasons that prevent the creation of objective and reliable methods of analysis of biomedical images are the high variability and heterogeneity of the biological material, distortion introduced by the experimental procedures, and the large size of the images. This paper presents preliminary results on creating a system called Ter-aPro, which combines a platform for image processing (ProStack) and a raster data storage system (rasdaman). This integrated system can be used in a cloud environment, providing access to methods of visualization, analysis, and processing of a large number of images through the Internet. Such an approach increases the speed and quality of image understanding and softens the limitations imposed by other systems. The system allows users to view uploaded images in the browser without having to install additional software on any device connected to the Internet, such as tablet computers and smartphones. This paper presents the preliminary results of processing biomedical images.
