Similar Literature
1.
Almost all subspace clustering algorithms proposed so far are designed for numeric datasets. In this paper, we present a k-means type clustering algorithm that finds clusters in data subspaces of mixed numeric and categorical datasets. The method computes each attribute's contribution to the different clusters, and we propose a new cost function for a k-means type algorithm. One advantage of this algorithm is that its complexity is linear in the number of data points; it is also useful for describing cluster formation in terms of the attributes' contributions to different clusters. The algorithm is tested on various synthetic and real datasets to show its effectiveness. The clustering results are explained using the attribute weights in the clusters and are compared with published results.
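The k-means-type scheme for mixed data that this abstract describes can be illustrated with a generic sketch. This is not the authors' algorithm: the learned per-cluster attribute weights are omitted, and a plain mismatch distance with a hypothetical weight `gamma` stands in for the paper's cost function.

```python
import random

def mixed_distance(x, center, numeric_idx, gamma=1.0):
    # Squared difference on numeric attributes, simple mismatch on categorical ones.
    d = 0.0
    for j, (xv, cv) in enumerate(zip(x, center)):
        if j in numeric_idx:
            d += (xv - cv) ** 2
        else:
            d += gamma * (xv != cv)
    return d

def update_center(points, numeric_idx):
    # Mean for numeric attributes, mode for categorical ones.
    center = []
    for j in range(len(points[0])):
        col = [p[j] for p in points]
        if j in numeric_idx:
            center.append(sum(col) / len(col))
        else:
            center.append(max(set(col), key=col.count))
    return center

def k_prototypes(data, k, numeric_idx, iters=20, seed=0):
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(data, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            best = min(range(k),
                       key=lambda i: mixed_distance(x, centers[i], numeric_idx))
            clusters[best].append(x)
        centers = [update_center(c, numeric_idx) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters
```

Each pass over the data is one assignment plus one center update, which is what keeps the complexity linear in the number of data points, as the abstract notes.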

2.
3.
Clustering is a widely used technique in data mining applications for discovering patterns in underlying data. Most traditional clustering algorithms are limited to handling datasets that contain either numeric or categorical attributes. However, datasets with mixed types of attributes are common in real life data mining applications. In this article, we present two algorithms that extend the Squeezer algorithm to domains with mixed numeric and categorical attributes. The performance of the two algorithms has been studied on real and artificially generated datasets. Comparisons with other clustering algorithms illustrate the superiority of our approaches. © 2005 Wiley Periodicals, Inc. Int J Int Syst 20: 1077–1089, 2005.

4.
A Discretization Method for Continuous Attributes in Rough Set Theory
苗夺谦 《自动化学报》2001,27(3):296-302
Rough Set (RS) theory is a new mathematical tool for handling imprecise, incomplete, and inconsistent knowledge. Traditional RS theory can only process discrete attributes, whereas the vast majority of real-world databases contain both discrete and continuous attributes. To remedy this limitation, this paper uses feedback from the consistency of the decision table to propose a domain-independent discretization algorithm for continuous attributes based on dynamic hierarchical clustering. The method gives RS theory a unified framework for handling discrete and continuous attributes, greatly broadening its range of application. Comparisons of the algorithm with existing methods on several examples yield encouraging results.

5.
Large data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with subgroup discovery (SD) and its generalisation, exceptional model mining. To address this, we introduce subgroup set discovery: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these (generic) subgroup selection methods in a beam search, the aim is to improve the balance between exploration and exploitation. The proposed algorithm, dubbed DSSD for diverse subgroup set discovery, is experimentally evaluated and compared to existing approaches. For this, a variety of target types with corresponding datasets and quality measures are used. The subgroup sets that are discovered by the competing methods are evaluated primarily on the following three criteria: (1) diversity in the subgroup covers (exploration), (2) the maximum quality found (exploitation), and (3) runtime. The results show that DSSD outperforms each traditional SD method on all or a (non-empty) subset of these criteria, depending on the specific setting. The more complex the task, the larger the benefit of using our diverse heuristic search turns out to be.

6.
Inductive learning systems can be effectively used to acquire classification knowledge from examples. Many existing symbolic learning algorithms can be applied in domains with continuous attributes when integrated with a discretization algorithm to transform the continuous attributes into ordered discrete ones. In this paper, a new information-theoretic discretization method optimized for supervised learning is proposed and described. The approach seeks to maximize the mutual dependence, as measured by the interdependence redundancy between the discrete intervals and the class labels, and can automatically determine the most preferred number of intervals for an inductive learning application. The method has been tested on a number of inductive learning examples to show that the class-dependent discretizer can significantly improve the classification performance of many existing learning algorithms in domains containing numeric attributes.
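A minimal single-cut sketch of class-aware discretization along these lines, using plain information gain rather than the paper's interdependence-redundancy measure (the actual criterion and the automatic choice of the number of intervals are not reproduced here):

```python
from math import log2
from collections import Counter

def entropy(labels):
    # Shannon entropy of the class labels, in bits.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    # Choose the threshold on a numeric attribute that maximizes
    # information gain about the class labels.
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(pairs)
        if gain > best_gain:
            best_gain = gain
            best = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint cut
    return best, best_gain
```

Applied recursively to the resulting intervals, this yields a class-dependent discretization; the paper's method instead scores intervals by interdependence redundancy, which also decides when to stop splitting.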

7.
Building on a review of existing splitting methods for numeric attributes, this paper introduces the concept of a pure interval and proposes a splitting method for numeric attributes based on pure-interval reduction. The method partitions the attribute's value range into multiple intervals with an equal-width histogram and handles pure and impure intervals separately. Theoretical analysis and experimental results show that the method reduces the search space while preserving splitting accuracy.

8.
Clustering is one of the most popular techniques in data mining; its goal is to identify distinct groups in a dataset. Many clustering algorithms have been published so far, but most are limited to numeric or categorical data, whereas most real-world data are mixed. In this paper, we propose a clustering algorithm, CAVE, which is based on variance and entropy and is capable of mining mixed data. Variance is used to measure the similarity of the numeric part of the data. To express the similarity between categorical values, a distance hierarchy is proposed, and the similarity of the categorical part is measured by entropy weighted by the distances in the hierarchies. A new validity index for evaluating clustering results is also proposed. The effectiveness of CAVE is demonstrated by a series of experiments on synthetic and real datasets in comparison with several traditional clustering algorithms. An application to mining a mixed dataset for customer segmentation and catalog marketing is also presented.

9.
K-means type clustering algorithms for mixed data consisting of numeric and categorical attributes suffer from the cluster center initialization problem: the final clustering results depend upon the initial cluster centers. Random cluster center initialization is a popular technique, but clustering results are not consistent across different initializations. The K-Harmonic means clustering algorithm overcomes this problem for pure numeric data. In this paper, we extend K-Harmonic means to mixed datasets. We propose a definition for a cluster center and a distance measure, which are used with the cost function of K-Harmonic means in the proposed algorithm. Experiments were carried out with pure categorical and mixed datasets. The results suggest that the proposed clustering algorithm is quite insensitive to the cluster center initialization problem, and comparative studies with other clustering algorithms show that it produces better clustering results.
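The K-Harmonic means update that makes the method insensitive to initialization can be sketched for plain 1-D numeric data. The paper's mixed-data center and distance definitions are not reproduced here; the membership and weight formulas below follow the standard KHM recipe with p = 2.

```python
def khm_step(data, centers, p=2, eps=1e-9):
    # One K-Harmonic means update: soft memberships m_ij and point
    # weights w_i derived from distances, then a weighted center update.
    k = len(centers)
    num = [0.0] * k
    den = [0.0] * k
    for x in data:
        d = [max(abs(x - c), eps) for c in centers]   # guard zero distance
        inv_pp2 = [v ** -(p + 2) for v in d]
        inv_p = [v ** -p for v in d]
        w = sum(inv_pp2) / sum(inv_p) ** 2            # point weight
        s = sum(inv_pp2)
        for j in range(k):
            m = inv_pp2[j] / s                        # soft membership
            num[j] += m * w * x
            den[j] += m * w
    return [num[j] / den[j] for j in range(k)]

def khm(data, centers, iters=30):
    for _ in range(iters):
        centers = khm_step(data, centers)
    return centers
```

Because every point contributes to every center with a harmonic-mean-style weight, a poor initial placement is pulled toward the data rather than trapping the algorithm, which is the property the abstract's mixed-data extension inherits.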

10.
The problem of identifying meaningful patterns in a database lies at the very heart of data mining. A core objective of data mining processes is the recognition of inter-attribute correlations. Not only are correlations necessary for predictions and classifications – since rules would fail in the absence of pattern – but the identification of groups of mutually correlated attributes also expedites the selection of a representative subset of attributes, from which existing mappings allow the others to be derived. In this paper, we describe a scalable, effective algorithm to identify groups of correlated attributes. The algorithm can handle non-linear correlations between attributes and is not restricted to a specific family of mapping functions, such as the set of polynomials. We show the results of evaluating the algorithm on synthetic and real-world datasets and demonstrate that it is able to spot the correlated attributes. Moreover, the execution time of the proposed technique is linear in the number of elements and of correlations in the dataset.
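As an illustration only (not the paper's algorithm), monotone non-linear correlation between attribute pairs can be detected with Spearman rank correlation, and attributes above a threshold grouped with union-find:

```python
def rank(values):
    # Rank positions of each value (ties broken by first-seen order;
    # good enough for a sketch, not a full tie-corrected Spearman).
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for pos, idx in enumerate(order):
        r[idx] = pos
    return r

def spearman(x, y):
    # Pearson correlation of the ranks; both rank vectors are
    # permutations of 0..n-1, so their variances are equal.
    rx, ry = rank(x), rank(y)
    n = len(x)
    m = (n - 1) / 2
    cov = sum((a - m) * (b - m) for a, b in zip(rx, ry))
    var = sum((a - m) ** 2 for a in rx)
    return cov / var

def correlated_groups(columns, threshold=0.9):
    # Union-find over attributes whose |rho| exceeds the threshold.
    n = len(columns)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if abs(spearman(columns[i], columns[j])) >= threshold:
                parent[find(j)] = find(i)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Rank correlation captures any monotone mapping (e.g. a polynomial y = x²) without assuming its functional family, which is the flexibility the abstract emphasizes; the paper's method goes further and is not limited to monotone relationships.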

11.
Extended Naive Bayes classifier for mixed data
The Naive Bayes induction algorithm is very popular in the classification field. The traditional way of dealing with numeric data is to discretize the numeric attributes into symbols, but the choice among distinct discretization criteria has a significant effect on performance. Moreover, several recent studies have employed the normal distribution to handle numeric data, although using only one value to estimate the population easily leads to incorrect estimates. As a result, research on classifying mixed data with Naive Bayes classifiers has not been very successful. In this paper, we propose a classification method, Extended Naive Bayes (ENB), which is capable of handling mixed data. The experimental results demonstrate the efficiency of our algorithm in comparison with other classification algorithms such as CART, decision trees (DT), and multilayer perceptrons (MLP).
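A generic mixed-data naive Bayes sketch along these lines (not the authors' ENB): Gaussian densities for numeric attributes, frequency estimates with a crude Laplace-style smoothing constant for categorical ones.

```python
from math import exp, log, pi, sqrt
from collections import Counter

def train(X, y, numeric_idx):
    # Per class: prior, (mean, std) for numeric attributes,
    # value counts for categorical ones.
    model = {}
    for cls in set(y):
        rows = [x for x, label in zip(X, y) if label == cls]
        stats = {}
        for j in range(len(X[0])):
            col = [r[j] for r in rows]
            if j in numeric_idx:
                mean = sum(col) / len(col)
                var = sum((v - mean) ** 2 for v in col) / len(col) or 1e-9
                stats[j] = ('num', mean, sqrt(var))
            else:
                stats[j] = ('cat', Counter(col), len(col))
        model[cls] = (len(rows) / len(X), stats)
    return model

def predict(model, x):
    best_cls, best_score = None, float('-inf')
    for cls, (prior, stats) in model.items():
        score = log(prior)
        for j, v in enumerate(x):
            if stats[j][0] == 'num':
                _, mean, std = stats[j]
                dens = exp(-(v - mean) ** 2 / (2 * std ** 2)) / (std * sqrt(2 * pi))
                score += log(dens + 1e-12)
            else:
                _, counts, n = stats[j]
                score += log((counts[v] + 1.0) / (n + 2.0))  # crude smoothing
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls
```

Fitting a single Gaussian per class is exactly the estimation weakness the abstract criticizes; ENB's contribution lies in estimating the numeric part more carefully than this baseline does.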

12.
Data mixing numeric and categorical attributes is common in the real world, and k-prototypes is one of the main algorithms for clustering such data. To address the shortcomings of existing mixed-attribute clustering algorithms, this paper proposes an improved k-prototypes algorithm based on a distributed centroid and a new dissimilarity measure. The new algorithm first introduces a distributed centroid to represent the cluster center of the categorical attributes, then combines the mean with the distributed centroid to represent the cluster center of mixed attributes, and proposes a new dissimilarity measure for computing the distance between a data object and a cluster center that accounts for the varying importance of different attributes during clustering. Simulation experiments on three real datasets show that the proposed algorithm achieves higher clustering accuracy than traditional clustering algorithms, verifying its effectiveness.

13.
Existing pedestrian attribute recognition methods tend to ignore the correlations among attributes and their spatial information, which leads to low recognition performance. Treating the task as a spatio-temporal sequence multi-label image classification problem, this paper proposes a model based on a convolutional neural network (CNN) and convolutional long short-term memory (ConvLSTM), fused with a channel attention mechanism. The CNN and channel attention extract salient, correlated visual features of pedestrian attributes; ConvLSTM further extracts the spatial information and attribute correlations of the visual features; and pedestrian attributes are predicted in an optimized sequence. Extensive experiments on two widely used pedestrian attribute datasets, PETA and RAP, achieve the best performance, demonstrating the superiority and effectiveness of the method.

14.
《Knowledge》2007,20(4):419-425
Many classification algorithms require that training examples contain only discrete values. In order to use these algorithms when some attributes have continuous numeric values, the numeric attributes must be converted into discrete ones. This paper describes a new way of discretizing numeric values using information theory. Our method is context-sensitive in the sense that it takes the value of the target attribute into account. The amount of information each interval gives about the target attribute is measured using the Hellinger divergence, and the interval boundaries are chosen so that the intervals contain as nearly equal amounts of information as possible. To compare our discretization method with some current discretization methods, several popular classification datasets are selected for discretization. We use the naive Bayesian classifier and C4.5 as classification tools to compare the accuracy of our discretization method with that of the other methods.
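The Hellinger divergence the abstract relies on is straightforward to compute. A sketch comparing an interval's class distribution against the overall one (the boundary-search loop that equalizes information across intervals is omitted):

```python
from math import sqrt

def hellinger(p, q):
    # Hellinger divergence between two discrete distributions
    # over the same ordered set of classes.
    return sqrt(0.5 * sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)))

def class_dist(labels, classes):
    # Empirical class distribution of a set of labels.
    n = len(labels)
    return [labels.count(c) / n for c in classes]
```

An interval whose class distribution diverges strongly from the overall distribution tells us a lot about the target attribute; placing boundaries so each interval carries a roughly equal share of that information is the scheme the paper describes.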

15.
16.
Thanks to recent advances in data collection and processing, some organizations have begun publishing data for scientific and commercial purposes. Published data should be anonymized so that it stays useful while the privacy of the data respondents is preserved. Microaggregation is a popular mechanism for data anonymization, but it naturally operates on numerical datasets, whereas real-world data is usually mixed, i.e., it contains both numeric and categorical attributes. In this paper, we propose a novel transformation-based method for microaggregation of mixed data, called TBM. The method uses multidimensional scaling to generate a numeric equivalent of the mixed dataset; the partitioning step of microaggregation is performed on the equivalent dataset, but the aggregation step on the original data. TBM can microaggregate large mixed datasets in a short time with low information loss. Experimental results show that the proposed method attains a better trade-off between data utility and privacy in a shorter time than the traditional methods.
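A toy version of the transformation idea, with one-hot encoding standing in for the paper's multidimensional-scaling step: the partition is computed on the numeric equivalent, but the aggregation is applied to the original records. Tail groups of fewer than k records are left as-is here, which a real microaggregation method would merge.

```python
from collections import Counter

def numeric_equivalent(data, numeric_idx):
    # Crude stand-in for multidimensional scaling: one-hot encode the
    # categorical attributes so every record becomes fully numeric.
    cat_idx = [j for j in range(len(data[0])) if j not in numeric_idx]
    values = {j: sorted({r[j] for r in data}) for j in cat_idx}
    out = []
    for r in data:
        row = [float(r[j]) for j in numeric_idx]
        for j in cat_idx:
            row.extend(1.0 if r[j] == v else 0.0 for v in values[j])
        out.append(row)
    return out

def microaggregate(data, k, numeric_idx):
    # Partition on the numeric equivalent (here: simple sort order),
    # then aggregate the ORIGINAL records group by group.
    equiv = numeric_equivalent(data, numeric_idx)
    order = sorted(range(len(data)), key=lambda i: equiv[i])
    anonymized = [None] * len(data)
    for start in range(0, len(order), k):
        group = order[start:start + k]
        agg = []
        for j in range(len(data[0])):
            col = [data[i][j] for i in group]
            if j in numeric_idx:
                agg.append(sum(col) / len(col))   # mean for numeric
            else:
                agg.append(Counter(col).most_common(1)[0][0])  # mode
        for i in group:
            anonymized[i] = tuple(agg)
    return anonymized
```

Every published record is identical to at least k − 1 others within its group, which is the privacy guarantee microaggregation targets; the quality of the partition (here a naive sort, in the paper an MDS-based one) determines the information loss.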

17.
18.
Existing knowledge-discovery models for hybrid information systems mostly cover symbolic and numeric condition attributes and symbolic decision attributes, and most of them focus on attribute reduction or feature selection; research on rule extraction is comparatively scarce. This paper builds a dynamic rule-extraction model for hybrid information systems covering more data types. First, the existing formula for the distance between attribute values is revised, and a definition is given for the distance between cross-level attribute values, yielding a new hybrid distance. Second, three methods are proposed for inducing decision classes from numeric decision attributes. A generalized neighborhood rough set model is then constructed; upper and lower approximations and a rule-extraction algorithm under dynamic granularity are proposed; and a dynamic rule-extraction model based on neighborhood granulation is built. The model can extract rules from information systems with the following characteristics: (1) the condition attribute set may include single-level symbolic, cross-level symbolic, numeric, interval-valued, set-valued, and unknown attributes; (2) the decision attribute set may include symbolic and numeric attributes. Comparative experiments on datasets from the UCI repository show classification accuracies that confirm the effectiveness of the rule-extraction algorithm.

19.
Partitional clustering of categorical data is normally performed with the K-modes clustering algorithm, which works well for large datasets. Even though K-modes is simple and efficient to design and implement, it randomly chooses the initial cluster centers on every new execution, which may lead to non-repeatable clustering results. This paper addresses the randomized center initialization problem of K-modes by proposing a cluster center initialization algorithm. The proposed algorithm performs multiple clusterings of the data based on the values in different attributes and yields deterministic modes that are used as initial cluster centers. We propose a new method for selecting the most relevant attributes, namely Prominent attributes, compare it with an existing method for finding Significant attributes in unsupervised learning, and perform multiple clusterings of the data to find initial cluster centers. The proposed algorithm ensures fixed initial cluster centers and thus repeatable clustering results. Its worst-case time complexity is log-linear in the number of data objects. We evaluate the algorithm on several categorical datasets against random initialization and two other initialization methods, and show that it performs better in terms of accuracy and time complexity. The initial cluster centers computed by our approach are close to the actual cluster centers of the data we tested, which leads to faster convergence of K-modes in conjunction with better clustering results.

20.
In this paper, we propose an efficient rule discovery algorithm, called FD_Mine, for mining functional dependencies from data. By exploiting Armstrong’s Axioms for functional dependencies, we identify equivalences among attributes, which can be used to reduce both the size of the dataset and the number of functional dependencies to be checked. We first describe four effective pruning rules that reduce the size of the search space. In particular, the number of functional dependencies to be checked is reduced by skipping the search for FDs that are logically implied by already discovered FDs. Then, we present the FD_Mine algorithm, which incorporates the four pruning rules into the mining process. We prove the correctness of FD_Mine, that is, we show that the pruning does not lead to the loss of useful information. We report the results of a series of experiments. These experiments show that the proposed algorithm is effective on 15 UCI datasets and synthetic data.
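Checking whether a candidate functional dependency holds, and the attribute-equivalence test that FD_Mine-style pruning builds on, can be sketched as follows (a generic illustration, not the paper's implementation):

```python
def holds(rows, lhs, rhs):
    # A functional dependency lhs -> rhs holds iff no two rows agree
    # on the lhs columns but disagree on the rhs columns.
    # lhs and rhs are tuples of column indices.
    seen = {}
    for row in rows:
        key = tuple(row[j] for j in lhs)
        val = tuple(row[j] for j in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

def equivalent(rows, a, b):
    # Attributes a and b are equivalent when a -> b and b -> a both hold;
    # such equivalences let a miner drop one attribute and skip every
    # candidate FD that mentions it, shrinking the search space.
    return holds(rows, (a,), (b,)) and holds(rows, (b,), (a,))
```

Each check is a single hash-based pass over the relation; the expensive part of FD discovery is the exponential number of candidate left-hand sides, which is exactly what the paper's four pruning rules attack.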
