Similar Literature
20 similar records found (search time: 15 ms)
1.
《Knowledge》2007,20(5):485-494
Any attribute set in an information system may evolve over time as new information arrives, so the rough-set approximations of a concept need updating for data mining and related tasks. Methods for incrementally updating the approximations of a concept based on the tolerance relation and the similarity relation have been studied previously in the literature; the characteristic-relation-based rough set approach provides more informative results than either. In this paper, attribute generalization and its relation to feature selection and feature extraction are first discussed. A new approach for incrementally updating the approximations of a concept is then presented under characteristic-relation-based rough sets. Finally, direct computation of rough set approximations and the proposed dynamic maintenance of rough set approximations are compared. An extensive experimental evaluation on a large soybean database from MLC shows that the proposed approach effectively handles dynamic attribute generalization in data mining.
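As background for the approximation updating discussed above, the classical rough-set lower and upper approximations of a concept can be sketched as follows. This is a minimal illustrative example over an equivalence-class partition, not the characteristic-relation method of the paper; all names are ours.

```python
# Classical rough-set approximations of a concept X:
# lower = union of blocks fully inside X; upper = union of blocks overlapping X.

def approximations(blocks, concept):
    """Return (lower, upper) approximations of `concept`.

    blocks: partition of the universe into equivalence classes (sets);
    concept: the target subset of the universe.
    """
    lower, upper = set(), set()
    for block in blocks:
        if block <= concept:      # block entirely inside the concept
            lower |= block
        if block & concept:       # block overlaps the concept
            upper |= block
    return lower, upper

blocks = [{1, 2}, {3, 4}, {5, 6}]
X = {1, 2, 3}
low, up = approximations(blocks, X)
print(low)  # {1, 2}
print(up)   # {1, 2, 3, 4}
```

When attributes are generalized or data arrive, these sets must be recomputed or, as the paper proposes, maintained incrementally.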

2.
Research and Application of Medical Data Mining Based on Rough Sets
Medical data mining can automatically analyze the data in existing medical record databases and provide valuable medical knowledge. To address the problem that clinical record databases contain large numbers of duplicate samples and redundant attributes, which degrade both the accuracy and the speed of medical diagnosis, this paper establishes a relationship between an information-theoretic rough set model and the SQL language, and proposes an SQL-based attribute reduction algorithm using conditional information entropy. Data cleaning, core computation, and attribute reduction are all implemented with database query statements. Experimental results show that the algorithm is simple to implement and efficient, providing a practical way to apply rough set theory more widely to real medical data mining.
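The conditional information entropy underlying such reduction algorithms can be computed with the same grouping that a SQL `GROUP BY` would perform. The sketch below is illustrative only (function names and the table layout are ours, not from the paper):

```python
# H(D|C): conditional information entropy of decision attribute D given
# condition attributes C, for a decision table given as a list of tuples.
from collections import Counter
from math import log2

def conditional_entropy(rows, cond_idx, dec_idx):
    """rows: list of tuples; cond_idx: indices of condition attributes;
    dec_idx: index of the decision attribute."""
    groups = Counter(tuple(r[i] for i in cond_idx) for r in rows)
    joint = Counter((tuple(r[i] for i in cond_idx), r[dec_idx]) for r in rows)
    n = len(rows)
    h = 0.0
    for (c, _), cnt in joint.items():
        p_cd = cnt / n                 # joint probability P(c, d)
        p_d_given_c = cnt / groups[c]  # conditional probability P(d | c)
        h -= p_cd * log2(p_d_given_c)
    return h
```

An attribute is redundant for the reduct when dropping it leaves H(D|C) unchanged; in the paper this test is pushed down into SQL queries.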

3.
Pawlak's classical rough set theory mainly targets discrete values and cannot effectively handle the interval values found in the complex real world. For interval-valued information systems, a neighborhood relation over interval values is proposed based on grey lattice operations and the Hausdorff distance. On top of this neighborhood relation, three grey rough set models are developed in turn, based respectively on the neighborhood relation itself, on maximal consistent classes, and on neighborhood systems, improving the precision of the approximation space. The upper and lower approximation spaces of the three grey rough set models are also discussed, and the analysis is verified with examples.

4.
Many real data sets in databases vary dynamically. With such data sets, one has to run a knowledge acquisition algorithm repeatedly to acquire new knowledge, which is very time-consuming. To overcome this deficiency, several approaches have been developed for dynamic databases; they mainly address knowledge updating from three aspects: the expansion of data, the increase in the number of attributes, and the variation of data values. This paper focuses on attribute reduction for data sets with dynamically varying data values. Information entropy is a common measure of uncertainty and has been widely used to construct attribute reduction algorithms. Based on three representative entropies, this paper develops an attribute reduction algorithm for data sets with dynamically varying data values. When part of the data in a given data set is replaced by new data, the developed algorithm finds a new reduct in a much shorter time than the classic reduction algorithms based on the three entropies. Experiments on six data sets downloaded from UCI show that the algorithm is effective and efficient.
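The core idea behind such incremental algorithms, updating an entropy when a single data value is replaced instead of rescanning the whole column, can be sketched as follows. This is a generic illustration under our own assumptions, not the paper's three-entropy algorithm:

```python
# Maintain value counts per attribute column and recompute Shannon
# entropy from the counts after a single-value replacement, avoiding
# a full rescan of the column.
from collections import Counter
from math import log2

def entropy(counts, n):
    """Shannon entropy of a distribution given as a Counter over n items."""
    return -sum(c / n * log2(c / n) for c in counts.values() if c)

class Column:
    def __init__(self, values):
        self.values = list(values)
        self.counts = Counter(self.values)

    def replace(self, i, new):
        """Replace the value at position i, updating counts in O(1)."""
        old = self.values[i]
        self.counts[old] -= 1
        if not self.counts[old]:
            del self.counts[old]
        self.counts[new] += 1
        self.values[i] = new

    def entropy(self):
        return entropy(self.counts, len(self.values))
```

The same bookkeeping extends to joint and conditional entropies, which is what makes reduct maintenance cheaper than recomputation.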

5.
This paper introduces a data mining method that combines rough set theory with genetic algorithms. Starting from the large volume of operating-state data collected from key equipment in enterprises, redundant attributes are first removed by rough set attribute reduction; the reduced data are then used as the training set for a classification model built with an improved genetic algorithm. The resulting classification model reveals the underlying patterns of faulty equipment operation and can quickly classify equipment with unknown faults, providing decision support for fault diagnosis and fault prediction.

6.
A Minimal Decision Rule Mining Algorithm Based on Rough Set Theory
钱进  孟祥萍  刘大有  叶飞跃 《控制与决策》2007,22(12):1368-1372
This paper studies the discernibility matrix of rough set theory, extends the class characteristic matrix, and proposes a minimal decision rule mining algorithm based on rough set theory. The algorithm partitions the original decision table into several equivalent sub-tables according to the decision attribute, and mines minimal decision rules from each class characteristic matrix with the help of core attributes and an attribute frequency function. Compared with the discernibility matrix, the class characteristic matrix effectively reduces storage space and time complexity and strengthens the generalization ability of the rules. Experimental results show that the rules obtained by the proposed algorithm are more concise and efficient.

7.
In this article, we discuss the need to evaluate the performance of data mining procedures and argue that tests done with real data sets cannot provide all the information needed for a thorough assessment of their performance characteristics; artificial data sets are therefore essential. After a discussion of the desirable characteristics of such artificial data, we describe two pseudo-random generators. The first is based on the multivariate normal distribution and gives the investigator full control of the degree of correlation between the variables in the artificial data sets. The second is inspired by fractal techniques for synthesizing artificial landscapes and can produce data whose classification complexity is controlled by a single parameter. We conclude with a discussion of the additional work needed to achieve the ultimate goal: a method for matching data sets to the most appropriate data mining technique.
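The first kind of generator described above, multivariate normal data with a user-controlled degree of correlation, can be sketched in a few lines. This is our own minimal version (equal pairwise correlation `rho`, zero means), not the authors' implementation:

```python
# Generate multivariate normal samples whose variables share a chosen
# pairwise correlation rho, by building an equicorrelation covariance
# matrix and sampling from it.
import numpy as np

def correlated_normal(n_samples, n_vars, rho, seed=0):
    """Draw n_samples points in n_vars dimensions with pairwise correlation rho."""
    cov = np.full((n_vars, n_vars), rho)   # off-diagonal entries = rho
    np.fill_diagonal(cov, 1.0)             # unit variances on the diagonal
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(n_vars), cov, size=n_samples)

data = correlated_normal(10000, 3, 0.8)
print(np.corrcoef(data.T).round(1))
```

The empirical correlation matrix of the sample should be close to the requested `rho` in every off-diagonal entry.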

8.
Self-organising maps (SOMs) have become a commonly used cluster analysis technique in data mining; however, they cannot process incomplete data. To extend the data mining capability of SOMs, this study proposes an SOM-based fuzzy map model for mining incomplete data sets. In this model, incomplete data are translated into fuzzy data and used to generate fuzzy observations. These fuzzy observations, along with the observations without missing values, are then used to train the SOM and generate fuzzy maps. Compared with the standard SOM approach, the fuzzy maps generated by the proposed method provide more information for knowledge discovery.

9.
Based on dynamic rough set theory, an improved dynamic rough set K-means algorithm is proposed. In the improved algorithm, each data object is assigned to the expanded upper or lower approximation of a cluster according to its migration coefficient; when computing a cluster centroid, the mean migration coefficient of the objects in the cluster is used as the weight measuring their contribution to the centroid. Experiments on original and noisy data sets from the UCI machine learning repository show that the improved algorithm increases clustering accuracy and reduces the number of iterations.

10.
Classical rough set theory characterizes a target concept with a static granularity analysis, which is ill-suited to modeling the dynamic cognitive process of human problem solving. Previous work has characterized target concepts and target decisions with forward and backward approximations, respectively, and applied them successfully to hierarchical clustering and rule extraction. Based on the principle of dynamic granularity, this paper proposes the concept of bidirectional approximation, derives some of its important properties, and applies it to the acquisition of decision rules from decision tables.

11.
This paper describes problems, challenges, and opportunities for intelligent simulation of physical systems. Prototype intelligent simulation tools have been constructed for interpreting massive data sets from physical fields and for designing engineering systems. We identify the characteristics of intelligent simulation and describe several concrete application examples. These applications, which include weather data interpretation, distributed control optimization, and spatio-temporal diffusion-reaction pattern analysis, demonstrate that intelligent simulation tools are indispensable for the rapid prototyping of application programs in many challenging scientific and engineering domains.

12.
Most data-mining algorithms assume static behavior of the incoming data. In the real world, the situation is different: most continuously collected data streams are generated by dynamic processes, which may change over time, in some cases drastically. A change in the underlying concept, known as concept drift, causes a data-mining model generated from past examples to become less accurate and relevant for classifying the current data. Most online learning algorithms deal with concept drift by generating a new model every time a drift is detected. On one hand, this ensures accurate and relevant models at all times, implying an increase in classification accuracy; on the other hand, it suffers from a major drawback: the high computational cost of generating new models. The problem worsens as concept drift is detected more frequently, so a compromise between computational effort and accuracy is needed. This work describes a series of incremental algorithms that are shown empirically to produce more accurate classification models than batch algorithms in the presence of concept drift, while being computationally cheaper than existing incremental methods. The proposed incremental algorithms are based on an advanced decision-tree learning methodology called "Info-Fuzzy Network" (IFN), which is capable of inducing compact and accurate classification models. The algorithms are evaluated on real-world streams of traffic and intrusion-detection data.
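The rebuild-on-drift scheme criticized above can be sketched with a simple error-rate monitor: the model is retrained whenever the recent error rate crosses a threshold. This is a generic illustration of drift detection under our own assumptions, not the IFN-based method of the paper:

```python
# Track the classifier's error rate over a sliding window and signal
# that the model should be rebuilt when the rate exceeds a threshold.
from collections import deque

class DriftMonitor:
    def __init__(self, window=100, threshold=0.3):
        self.errors = deque(maxlen=window)  # recent 0/1 error flags
        self.threshold = threshold

    def update(self, predicted, actual):
        """Record one prediction; return True when drift is signalled."""
        self.errors.append(predicted != actual)
        rate = sum(self.errors) / len(self.errors)
        return rate > self.threshold  # True => retrain the model
```

The incremental algorithms in the paper aim to make the "retrain" step itself cheap, rather than avoiding it.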

13.
Outlier mining in large high-dimensional data sets
A new definition of distance-based outlier and an algorithm, called HilOut, designed to efficiently detect the top n outliers of a large high-dimensional data set are proposed. Given an integer k, the weight of a point is defined as the sum of the distances separating it from its k nearest neighbors; outliers are the points with the largest weights. The HilOut algorithm uses the notion of a space-filling curve to linearize the data set and consists of two phases. The first phase provides an approximate solution, within a rough factor, after at most d + 1 sorts and scans of the data set, with temporal cost quadratic in d and linear in N and in k, where d is the number of dimensions and N is the number of points in the data set. During this phase, the algorithm isolates candidate outlier points and reduces this set at each iteration; if the size of the set reaches n, the algorithm stops and reports the exact solution. The second phase calculates the exact solution with a final scan that further examines the candidate outliers remaining after the first phase. Experimental results show that the algorithm always stops, reporting the exact solution, during the first phase after far fewer than d + 1 steps. We present both in-memory and disk-based implementations of the HilOut algorithm and a thorough scaling analysis for real and synthetic data sets, showing that the algorithm scales well in both cases.
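The weight definition above is easy to state directly. The sketch below is a brute-force O(N²) illustration of the definition only, not the HilOut space-filling-curve algorithm; all function names are ours:

```python
# Weight of a point = sum of distances to its k nearest neighbors;
# the top-n weights mark the outliers (brute-force version).
from math import dist

def knn_weight(points, i, k):
    """Sum of Euclidean distances from points[i] to its k nearest neighbors."""
    d = sorted(dist(points[i], p) for j, p in enumerate(points) if j != i)
    return sum(d[:k])

def top_n_outliers(points, n, k):
    """Indices of the n points with the largest k-NN weights."""
    weights = [(knn_weight(points, i, k), i) for i in range(len(points))]
    return [i for _, i in sorted(weights, reverse=True)[:n]]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(top_n_outliers(pts, 1, 2))  # [4] — the far point
```

HilOut's contribution is computing this ranking without the quadratic all-pairs distance comparison.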

14.
Research on Dynamic Data Mining
A new form of data mining, dynamic data mining (DDM), is proposed to discover applicable knowledge from continuously updated dynamic data. The architecture of dynamic data mining is given and its realization process is analyzed: sliding windows and dynamic data windows are used to collect and process newly arriving data dynamically, subsequent data are used to evaluate the mining results, and a K-labeling method allows the dynamic target data set to be mined smoothly. A test algorithm for dynamic data mining is derived.

15.
Fuzzy set theory, soft set theory and rough set theory are closely related mathematical tools for dealing with uncertainty. Feng et al. introduced the notions of rough soft set, soft rough set and soft rough fuzzy set by combining fuzzy sets, rough sets and soft sets. This paper is devoted to further discussion of the combinations of fuzzy sets, rough sets and soft sets. A new soft rough set model is proposed and its properties are derived. Furthermore, fuzzy soft sets are employed to granulate the universe of discourse, and a more general model called the soft fuzzy rough set is established. The lower and upper approximation operators are presented and their related properties are surveyed.

16.
P-AutoClass: scalable parallel clustering for mining large data sets
Data clustering is an important task in data mining. Clustering is the unsupervised classification of data items into homogeneous groups called clusters: clustering methods partition a set of data items so that items in the same cluster are more similar to each other than to items in different clusters, according to some defined criteria. Clustering algorithms are computationally intensive, particularly when used to analyze large amounts of data, and a possible approach to reducing the processing time is to implement them on scalable parallel computers. This paper describes the design and implementation of P-AutoClass, a parallel version of the AutoClass system, which is based on a Bayesian model for determining optimal classes in large data sets. The P-AutoClass implementation divides the clustering task among the processors of a multicomputer so that each processor works on its own partition and exchanges intermediate results with the other processors. The system architecture, its implementation, and experimental performance results on different numbers of processors and data sets are presented and compared with theoretical performance. In particular, the experimental and predicted scalability and efficiency of P-AutoClass versus the sequential AutoClass system are evaluated and compared.

17.
This paper introduces hierarchical clustering, a data discretization method based on statistical analysis. Using plywood defect inspection data as the application object, discretization based on hierarchical clustering is studied and compared with other discretization methods. The comparison shows that when data discretized by hierarchical clustering are subsequently reduced with rough sets, more redundant attributes and records are removed, which lowers model complexity, speeds up knowledge acquisition, and improves classification accuracy. Engineering practice shows that hierarchical clustering is an effective discretization method for data preprocessing, and that combining it with rough set algorithms yields satisfactory data mining results.

18.
19.
Visual data mining in large geospatial point sets
Visual data-mining techniques have proven valuable in exploratory data analysis, and they have strong potential in the exploration of large databases. Detecting interesting local patterns in large data sets is a key research challenge, and finding and deploying efficient, scalable visualization strategies for exploring large geospatial data sets is particularly challenging today. One way forward is to combine ideas from the statistics and machine-learning disciplines with ideas and methods from the information and geo-visualization disciplines. PixelMaps in the Waldo system demonstrate how data mining can be successfully integrated with interactive visualization. The increasing scale and complexity of data analysis problems require tighter integration of interactive geospatial data visualization with statistical data-mining algorithms.

20.
A new method for mining regression classes in large data sets
Extracting patterns and models of interest from large databases is attracting much attention in a variety of disciplines. Knowledge discovery in databases (KDD) and data mining (DM) are areas of common interest to researchers in machine learning, pattern recognition, statistics, artificial intelligence, and high-performance computing. An effective and robust method, the regression class mixture decomposition (RCMD) method, is proposed for mining regression classes in large data sets, especially those contaminated by noise. A "regression class", defined as a subset of the data set that is subject to a regression model, is proposed as the basic building block on which the mining process is based. A large data set is treated as a mixture population containing many such regression classes along with data not accounted for by the regression models. Iterative and genetic-based algorithms for optimizing the objective function of the RCMD method are also constructed. It is demonstrated that the RCMD method can resist a very large proportion of noisy data, identify each regression class, assign an inlier set of data points supporting each identified regression class, and determine the a priori unknown number of statistically valid models in the data set. Although the models are extracted sequentially, the final result is almost independent of the extraction order, thanks to a dynamic classification strategy employed in handling overlapping regression classes. The effectiveness and robustness of the RCMD method are substantiated by a set of simulation experiments and a real-life application showing how it can be used to fit mixed data to linear regression classes and nonlinear structures in various situations.
