Similar literature
 Found 20 similar documents (search time: 15 ms)
1.
A novel pruning approach using expert knowledge for data-specific pruning   Total citations: 1 (self-citations: 0, citations by others: 1)
Classification is an important data mining task that discovers hidden knowledge from labeled datasets. Most pruning approaches assume that all datasets are equally uniform and equally important, so they apply the same pruning to every dataset. However, in real-world classification problems datasets are not all alike, and applying an equal pruning rate tends to generate a decision tree with a large size and a high misclassification rate. We approach the problem by first investigating the properties of each dataset and then deriving a data-specific pruning value using expert knowledge, which is used to design pruning techniques that prune decision trees close to perfection. An efficient pruning algorithm dubbed EKBP is proposed; it is very general, as any learning algorithm can be used as the base classifier. We have implemented the proposed solution and experimentally verified its effectiveness on forty real-world benchmark datasets from the UCI machine learning repository. In all these experiments, the proposed approach dramatically reduces the tree size while enhancing or retaining the level of accuracy.

2.
Existing data analysis techniques have difficulty in handling multidimensional data. Multidimensional data has been a challenge for data analysis because of the inherent sparsity of the points. In this paper, we first present a novel data preprocessing technique called shrinking, which optimizes the inherent characteristics of the data distribution. This data reorganization concept can be applied in many fields such as pattern recognition, data clustering, and signal processing. Then, as an important application of the data shrinking preprocessing, we propose a shrinking-based approach for multidimensional data analysis which consists of three steps: data shrinking, cluster detection, and cluster evaluation and selection. The process of data shrinking moves data points along the direction of the density gradient, thus generating condensed, widely-separated clusters. Following data shrinking, clusters are detected by finding the connected components of dense cells (and evaluated by their compactness). The data-shrinking and cluster-detection steps are conducted on a sequence of grids with different cell sizes. The clusters detected at these scales are compared by a cluster-wise evaluation measurement, and the best clusters are selected as the final result. The experimental results show that this approach can effectively and efficiently detect clusters in both low- and high-dimensional spaces.
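
The cluster-detection step described above (connected components of dense grid cells) can be illustrated with a minimal sketch. It covers only that one step on 2-D points and omits the shrinking and multi-scale evaluation steps. The function name `dense_cell_clusters`, the `min_points` density threshold, the 8-neighbour connectivity, and the noise label -1 are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter, deque

def dense_cell_clusters(points, cell_size, min_points):
    """Label 2-D points by the connected component of dense grid cells they fall in."""
    # Map each point to the integer coordinates of the grid cell containing it.
    cell_of = [(int(x // cell_size), int(y // cell_size)) for x, y in points]
    counts = Counter(cell_of)
    # A cell is "dense" if it holds at least min_points points (assumed criterion).
    dense = {cell for cell, n in counts.items() if n >= min_points}

    # Breadth-first search over dense cells, using 8-neighbour adjacency.
    label = {}
    next_label = 0
    for start in sorted(dense):
        if start in label:
            continue
        label[start] = next_label
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbour = (cx + dx, cy + dy)
                    if neighbour in dense and neighbour not in label:
                        label[neighbour] = next_label
                        queue.append(neighbour)
        next_label += 1

    # Points in sparse cells are reported as noise (-1) in this sketch.
    return [label.get(cell, -1) for cell in cell_of]
```

Running the function on several grids with different `cell_size` values would give the candidate clusterings that the abstract says are then compared and selected.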

3.
Crisp input and output data are fundamentally indispensable in traditional data envelopment analysis (DEA). However, the input and output data in real-world problems are often imprecise or ambiguous. Some researchers have proposed interval DEA (IDEA) and fuzzy DEA (FDEA) to deal with imprecise and ambiguous data in DEA. Nevertheless, many real-life problems use linguistic data that cannot be used as interval data, and a large number of input variables in fuzzy logic could result in a significant number of rules that are needed to specify a dynamic model. In this paper, we propose an adaptation of the standard DEA under conditions of uncertainty. The proposed approach is based on a robust optimization model in which the input and output parameters are constrained to be within an uncertainty set, with additional constraints based on the worst-case solution with respect to the uncertainty set. Our robust DEA (RDEA) model seeks to maximize efficiency (similar to standard DEA) but under the assumption of a worst-case efficiency defined by the uncertainty set and its supporting constraint. A Monte-Carlo simulation is used to compute the conformity of the rankings in the RDEA model. The contribution of this paper is fourfold: (1) we consider ambiguous, uncertain and imprecise input and output data in DEA; (2) we address the gap in the imprecise DEA literature for problems not suitable or difficult to model with interval or fuzzy representations; (3) we propose a robust optimization model in which the input and output parameters are constrained to be within an uncertainty set, with additional constraints based on the worst-case solution with respect to the uncertainty set; and (4) we use Monte-Carlo simulation to specify a range of Gamma in which the rankings of the DMUs occur with high probability.

4.
Data reconciliation plays a significant role in rectifying process data so that they satisfy the conservation laws of industrial processes. In practice, measurements are easily contaminated by gross errors. Thus, it is essential to build robust data reconciliation methods that alleviate the impact of gross errors and provide accurate data. In this paper, a novel robust estimator, based on a new robust estimation function, is proposed to improve the robustness of data reconciliation. First, the main robust properties of the proposed estimator are analyzed through its objective and influence functions. Then, the effectiveness of the new robust data reconciliation method is demonstrated on a linear numerical case and a nonlinear example. Moreover, it is applied to a practical industrial evaporation production process, which also demonstrates that the process data can be better reconciled with the proposed robust estimator.
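
For orientation, the classical (non-robust) weighted least-squares form of data reconciliation can be sketched for a single linear balance. This is not the robust estimator the abstract proposes; it is the textbook baseline that robust methods modify. The function name, the restriction to one constraint of the form a·x = 0, and the example numbers are assumptions made for illustration.

```python
def reconcile_single_balance(measured, variances, coeffs):
    """Weighted least-squares reconciliation under one linear balance sum(a_i * x_i) = 0.

    Minimises sum((x_i - y_i)**2 / var_i) subject to the balance, which gives
    x_i = y_i - var_i * a_i * r / s, where r is the balance residual of the
    measurements and s = sum(a_i**2 * var_i).
    """
    residual = sum(a * y for a, y in zip(coeffs, measured))
    scale = sum(a * a * v for a, v in zip(coeffs, variances))
    return [y - v * a * residual / scale
            for y, v, a in zip(measured, variances, coeffs)]

# Hypothetical mixer: two inlet flows and one outlet flow, so x1 + x2 - x3 = 0.
flows = reconcile_single_balance([10.0, 5.2, 14.6], [1.0, 1.0, 1.0], [1.0, 1.0, -1.0])
```

Because every measurement is weighted only by its variance, a single gross error distorts all reconciled values, which is the weakness that robust estimators such as the one in this abstract aim to reduce.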

5.
Data clustering is the process of extracting groups of similar items from underlying data whose labels are hidden. This paper describes different approaches to solving the data clustering problem. Particle swarm optimization (PSO) has recently been used to address the clustering task. An overview of PSO-based clustering approaches is presented in this paper. These approaches mimic the behavior of biological swarms seeking food located in different places. The best locations for finding food are in dense areas and in regions far enough from others. PSO-based clustering approaches are evaluated using different data sets. Experimental results indicate that these approaches outperform the K-means, K-harmonic means, and fuzzy c-means clustering algorithms.

6.
A novel multiseed nonhierarchical data clustering technique   Total citations: 5 (self-citations: 0, citations by others: 5)
Clustering techniques such as K-means and Forgy, as well as their improved version ISODATA, group data around one seed point for each cluster. It is well known that these methods do not work well if the shape of the cluster is elongated or nonconvex. We argue that for an elongated or nonconvex cluster, more than one seed is needed. In this paper, a multiseed clustering algorithm is proposed. A density-based representative point selection algorithm is used to choose the initial seed points. To assign several seed points to one cluster, a novel technique guided by a minimal spanning tree is proposed. Also, a border point detection algorithm is proposed for detecting the shape of the cluster. This border in turn signifies whether the cluster is elongated or not. Experimental results show the efficiency of this clustering technique.

7.
Due to data sparseness and attribute redundancy in high-dimensional data, clusters of objects often exist in subspaces rather than in the entire space. To effectively address this issue, this paper presents a new optimization algorithm for clustering high-dimensional categorical data, which is an extension of the k-modes clustering algorithm. In the proposed algorithm, a novel weighting technique for categorical data is developed to calculate two weights for each attribute (or dimension) in each cluster and to use the weight values to identify the subsets of important attributes that categorize different clusters. The convergence of the algorithm under an optimization framework is proved. The performance and scalability of the algorithm are evaluated experimentally on both synthetic and real data sets. The experimental studies show that the proposed algorithm is effective in clustering categorical data sets and also scalable to large data sets owing to its linear time complexity with respect to the number of data objects, attributes or clusters.
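
As background for the extension described above, the plain k-modes algorithm it builds on can be sketched as follows. This sketch does not include the per-attribute weights that the paper introduces; it shows only the baseline with the simple matching dissimilarity and attribute-wise modes. The function names and the choice to keep the old mode for an empty cluster are assumptions made for illustration.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def mode_of(records):
    """Attribute-wise most frequent value of a non-empty list of records."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

def k_modes(records, initial_modes, max_iter=20):
    """Plain k-modes: alternate assignment and mode update until assignments stop changing."""
    modes = list(initial_modes)
    assignment = None
    for _ in range(max_iter):
        # Assign every record to the mode it differs from on the fewest attributes.
        new_assignment = [
            min(range(len(modes)), key=lambda j: matching_dissimilarity(r, modes[j]))
            for r in records
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each cluster's mode; an empty cluster keeps its previous mode.
        for j in range(len(modes)):
            members = [r for r, a in zip(records, assignment) if a == j]
            if members:
                modes[j] = mode_of(members)
    return assignment, modes
```

Each pass touches every record, attribute and cluster once, which is consistent with the linear time complexity per iteration that the abstract mentions for the extended algorithm.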

8.
Data clustering usually requires extensive computations of similarity measures between dataset members and cluster centers, especially for large datasets. Image clustering can be an intermediate process in image retrieval or segmentation, where a fast process is critically required for large image databases. This paper introduces a new multi-agent approach for fuzzy image clustering (MAFIC) to reduce the time cost of the sequential fuzzy \(c\)-means algorithm (FCM). The distinguishing feature of the approach is that it distributes the computation of cluster centers and membership functions among several parallel agents, where each agent works independently on a different sub-image of an image. Based on the Java Agent Development Framework platform, an implementation of MAFIC is tested on large 24-bit images. The experimental results show that MAFIC is at least four times faster than the sequential FCM algorithm, and thus reduces the time needed for the clustering process.
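
The two computations that the abstract says are distributed among agents, the membership update and the cluster-center update, can be sketched in their standard sequential FCM form. The sketch works on scalar values (for example grayscale intensities) and runs in a single process; the multi-agent distribution over sub-images is not modelled. Function names, the fuzzifier default m = 2, and the stopping tolerance are assumptions made for illustration.

```python
def fcm_memberships(values, centers, m=2.0):
    """Standard FCM membership update for scalar data; each row sums to 1."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in values:
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            # A value lying exactly on a center belongs fully to that center.
            zeros = dists.count(0.0)
            row = [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
        else:
            row = [1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists]
        memberships.append(row)
    return memberships

def fcm_centers(values, memberships, m=2.0):
    """Standard FCM center update: mean of the data weighted by membership**m."""
    centers = []
    for i in range(len(memberships[0])):
        weights = [row[i] ** m for row in memberships]
        centers.append(sum(w * x for w, x in zip(weights, values)) / sum(weights))
    return centers

def fuzzy_c_means(values, centers, m=2.0, tol=1e-6, max_iter=100):
    """Alternate the two updates until the centers move by less than tol."""
    for _ in range(max_iter):
        memberships = fcm_memberships(values, centers, m)
        new_centers = fcm_centers(values, memberships, m)
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, fcm_memberships(values, centers, m)
```

Both updates are sums over pixels, so partial sums computed on separate sub-images can in principle be combined, which is one way to read the parallelization idea in the abstract.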

9.
The growing usage of embedded devices and sensors in our daily lives has been profoundly reshaping the way we interact with our environment and our peers. As more and more sensors pervade our future cities, increasingly efficient infrastructures will be required to collect, process and store massive amounts of data streams from a wide variety of sources. Despite their different application-specific features and hardware platforms, sensor network applications share a common goal: to periodically sample data collected from different sensors and store it in a common persistent memory. In this article, we present a clustering approach for rapidly and efficiently computing the best sampling rate, which minimizes the Sum of Square Error for each particular sensor in a network. In order to evaluate the efficiency of the proposed approach, we carried out experiments on real electric power consumption data streams provided by EDF (Électricité de France).

10.
Clustering divides data into meaningful or useful groups (clusters) without any prior knowledge. It is a key technique in data mining and has become an important issue in many fields. This article presents a new clustering algorithm based on the mechanism analysis of chaotic ant swarm (CAS). It is an optimization methodology for the clustering problem which aims to obtain a globally optimal assignment by minimizing the objective function. The proposed algorithm combines three advantages: it finds a globally optimal solution to the objective function, it is not sensitive to clusters of different size and density, and it is suitable for multi-dimensional data sets. The quality of this approach is evaluated on several well-known benchmark data sets. Compared with the popular k-means algorithm and the PSO-based clustering technique, experimental results show that our algorithm is an effective clustering technique and can be used to handle data sets with complex cluster sizes, densities and multiple dimensions.

11.
This paper proposes a multi-step procedure that integrates robust methods, clustering analysis and data envelopment analysis (DEA) to identify bank branch managerial clusters and to study efficiency performance. By applying robust techniques based on principal component analysis, we look for (1) the detection of influential branches, i.e., those exhibiting extreme operating behaviors, and (2) the clustering of branches based on operating characteristics. Our premise is that influential branches affect both the clustering and the determination of efficiency performance. The application of the procedure yields various aggregate influential-based branch profiles along with cluster profiles. These aggregate profiles provide valuable insights into the determinants of branch efficiency performance and operating patterns. Using the profiles as contextual information, DEA input-oriented slack-based models are applied to study branch efficiency performance from meta-frontier and cluster-frontier perspectives. Branch performance is characterized in terms of influential-based and cluster profiles, and efficiency designations. This allows for the understanding of how efficiency and peer selection are affected by influential branches, and how the profiles can be used to inform design decisions.

12.
Multimedia Tools and Applications - This paper presents a novel and fast method for k-means clustering based object tracking for coloured frames, based on the histogram back-projection method. The...

13.
Zhu Yun, Wang Weiye, Yu Gaohang, Wang Jun, Tang Lei. Multimedia Tools and Applications, 2022, 81(23): 33171-33184

Missing data is ubiquitous in real transportation systems and causes data-driven intelligent transportation systems to respond incorrectly. We propose a Bayesian robust Candecomp/Parafac (CP) tensor decomposition (BRCP) approach that handles missing data and outliers by integrating a general form of transportation-system domain knowledge. Specifically, a low-rank tensor captures the global information while an added sparse tensor captures the local information, which allows the distribution of missing entries to be predicted robustly; under a fully Bayesian treatment, effective variational inference prevents overfitting. Real and reliable traffic data sets are used to evaluate the performance of the model in two missing-data scenarios. The experimental results show that the proposed BRCP model achieves the best imputation accuracy and outperforms the state-of-the-art baselines (Bayesian Gaussian CP decomposition (BGCP), high accuracy low-rank tensor completion (HaLRTC) and SVD-combined tensor decomposition (STD)); even at a high missing rate, the model retains the best performance and robustness.


14.
Based on molecular kinetic theory, a molecular dynamics-like data clustering approach is proposed in this paper. Clusters are extracted after data points fuse in the iterating space through a dynamical mechanism similar to the interaction between molecules through molecular forces. The approach finds possible natural clusters without pre-specifying the number of clusters. In the experiments, it found better clusters than three other clustering methods (trimmed k-means, the JP algorithm and another method based on a gravitational model).

15.
A self-organizing map (SOM) is a nonlinear, unsupervised neural network model that can be used for data clustering and visualization. One of the major shortcomings of the SOM algorithm is the difficulty for non-expert users to interpret the information contained in a trained SOM. In this paper, this problem is tackled by introducing an enhanced version of the proposed visualization method which consists of three major steps: (1) calculating the single-linkage inter-neuron distance, (2) calculating the number of data points in each neuron, and (3) finding the cluster boundary. The experimental results show that the proposed approach effectively displays the data distribution, inter-neuron distances, and cluster boundary, and that its visualization effects are better than those of other visualization methods. Furthermore, the proposed visualization scheme not only makes the clustering results intuitively easy to understand but also gives good visualization effects on unlabeled data sets.

16.
Kernel-based clustering is one of the most popular methods for partitioning nonlinearly separable datasets. However, exhaustive search for the global optimum is NP-hard. Iterative procedures such as k-means can be used to seek one of the local minima. Unfortunately, such a procedure is easily trapped in degenerate local minima when the prototypes of clusters are ill-initialized. In this paper, we restate the optimization problem of kernel-based clustering in an online learning framework, whereby a conscience mechanism is easily integrated to tackle the ill-initialization problem and a faster convergence rate is achieved. Thus, we propose a novel approach termed conscience online learning (COLL). For each randomly taken data point, our method selects the winning prototype based on the conscience mechanism, which biases the ill-initialized prototypes to avoid degenerate local minima, and efficiently updates the winner by the online learning rule. Therefore, it can more efficiently obtain a smaller distortion error than k-means with the same initialization. The rationale of the proposed COLL method is experimentally analyzed. Then, we apply the COLL method to digit clustering and video clustering. The experimental results demonstrate a significant improvement over existing kernel-based clustering methods.

17.
Over the last couple of years there has been a substantial increase in malicious attacks that use the Internet as an infection vector. One solution to counter this problem is to implement a filter at the network connection level. Due to the large amount of data that has to be filtered in real time, any practical approach has to consider both memory usage and performance limitations in order to deliver a fast response time. This paper presents a cloud-based mechanism that can be used to filter large amounts of network traffic with respect to both memory and response time limitations. The algorithms have been tested on data flows of more than 750 million URLs per day. We address different practical problems, such as storage, computation time and large data flow clustering. Finally, we present different statistical results that we obtained over a period of 2 months.

18.
A robust information clustering algorithm   Total citations: 1 (self-citations: 0, citations by others: 1)
Song Q. Neural Computation, 2005, 17(12): 2672-2698
We focus on the scenario of robust information clustering (RIC) based on the minimax optimization of mutual information (MI). The minimization of MI leads to the standard mass-constrained deterministic annealing clustering, which is an empirical risk-minimization algorithm. The maximization of MI works out an upper bound of the empirical risk via the identification of outliers (noisy data points). Furthermore, we estimate the real risk VC-bound and determine an optimal cluster number of the RIC based on the structural risk-minimization principle. One of the main advantages of the minimax optimization of MI is that it is a nonparametric approach, which identifies the outliers through the robust density estimate and forms a simple data clustering algorithm based on the square error of the Euclidean distance.

19.
A similarity-based robust clustering method   Total citations: 6 (self-citations: 0, citations by others: 6)
This paper presents an alternating optimization clustering procedure called a similarity-based clustering method (SCM). It is an effective and robust approach to clustering on the basis of a total similarity objective function related to the approximate density shape estimation. We show that the data points in SCM can self-organize a locally optimal cluster number and volumes without using cluster validity functions or a variance-covariance matrix. The proposed clustering method is also robust to noise and outliers, based on the influence function and gross error sensitivity analysis. Therefore, SCM exhibits three robust clustering characteristics: 1) robust to the initialization (cluster number and initial guesses), 2) robust to cluster volumes (ability to detect different volumes of clusters), and 3) robust to noise and outliers. Several numerical data sets and actual data are used with SCM to show these good aspects. The computational complexity of SCM is also analyzed. Experimental comparisons of the proposed SCM with existing methods show the superiority of SCM.

20.
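Most existing clustering algorithms consider only a single objective, which leads to weak performance on data sets of certain shapes. To address this, a robust clustering algorithm for unlabeled data based on improved particle swarm optimization is proposed. Optimization stage: first, the classical form of multi-objective particle swarm optimization is used to generate a set of clustering solutions; then, the K-means algorithm is used to generate a randomly distributed initial population, which is assigned randomly initialized velocities; finally, the MaxiMin strategy is used to determine the Pareto-optimal solutions. Decision stage: the distance between each solution in the Pareto set and the ideal solution is measured, and the Pareto solution with the shortest distance is taken as the final clustering solution. Comparative experimental results show that the algorithm obtains a good number of clusters for data sets of different shapes and is robust to the complexity of the target problem.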
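
The decision stage in this abstract (choosing the Pareto solution closest to the ideal solution) can be sketched as follows. The abstract does not state how the distance is measured, so the min-max scaling of each objective, the Euclidean distance, and the assumption that all objectives are minimised are illustrative choices. The non-dominated filter below is a plain pairwise comparison, not the MaxiMin strategy named in the abstract, and the function names are hypothetical.

```python
import math

def pareto_front(solutions):
    """Return the non-dominated objective vectors, assuming every objective is minimised."""
    def dominates(a, b):
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    return [s for s in solutions if not any(dominates(t, s) for t in solutions)]

def closest_to_ideal(front):
    """Pick the front member nearest to the ideal point after min-max scaling."""
    lows = [min(column) for column in zip(*front)]
    highs = [max(column) for column in zip(*front)]

    def scaled_distance(solution):
        # After scaling, the ideal point (best value of every objective) is the origin.
        scaled = [(v - lo) / (hi - lo) if hi > lo else 0.0
                  for v, lo, hi in zip(solution, lows, highs)]
        return math.sqrt(sum(v * v for v in scaled))

    return min(front, key=scaled_distance)
```

In a clustering setting each objective vector would hold, for example, two cluster-quality measures for one candidate clustering; which measures the paper uses is not stated in the abstract.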
