Similar documents
 Found 20 similar documents; search took 31 ms
1.
EDM: A general framework for Data Mining based on Evidence Theory   Total citations: 16, self-citations: 0, citations by others: 16
Data Mining or Knowledge Discovery in Databases [1, 15, 23] is currently one of the most exciting and challenging areas where database techniques are coupled with techniques from Artificial Intelligence and mathematical sub-disciplines to great potential advantage. It has been defined as the non-trivial extraction of implicit, previously unknown and potentially useful information from data. A lot of research effort is being directed towards building tools for discovering interesting patterns which are hidden below the surface in databases. However, most of the work being done in this field has been problem-specific and no general framework has yet been proposed for Data Mining. In this paper we seek to remedy this by proposing EDM (Evidence-based Data Mining), a general framework for Data Mining based on Evidence Theory.

Having a general framework for Data Mining offers a number of advantages. It provides a common method for representing knowledge, which allows prior knowledge from the user or knowledge discovered by another discovery process to be incorporated into the discovery process. A common knowledge representation also supports the discovery of meta-knowledge from knowledge discovered by different Data Mining techniques. Furthermore, a general framework can provide facilities that are common to most discovery processes, e.g. incorporating domain knowledge and dealing with missing values.

The framework presented in this paper has the following additional advantages. The framework is inherently parallel. Thus, algorithms developed within this framework will also be parallel and will therefore be expected to be efficient for large data sets — a necessity as most commercial data sets, relational or otherwise, are very large. This is compounded by the fact that the algorithms are complex. Also, the parallelism within the framework allows its use in parallel, distributed and heterogeneous databases. The framework is easily updated and new discovery methods can be readily incorporated within the framework, making it ‘general’ in the functional sense in addition to the representational sense considered above. The framework provides an intuitive way of dealing with missing data during the discovery process using the concept of Ignorance borrowed from Evidence Theory.

The framework consists of a method for representing data and knowledge, and methods for data manipulation or knowledge discovery. We suggest an extension of the conventional definition of mass functions in Evidence Theory for use in Data Mining, as a means to represent evidence of the existence of rules in the database. The discovery process within EDM consists of a series of operations on the mass functions. Each operation is carried out by an EDM operator. We provide a classification for the EDM operators based on the discovery functions performed by them and discuss aspects of the induction, domain and combination operator classes.

The application of EDM to two separate Data Mining tasks is also addressed, highlighting the advantages of using a general framework for Data Mining in general and, in particular, using one that is based on Evidence Theory.
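
The abstract above describes discovery as a series of operations on mass functions, including a combination operator class, and uses Ignorance to handle missing data. As a rough illustration only, the sketch below implements the conventional Dempster rule of combination from Evidence Theory. It is not the extended mass-function definition or the EDM operators proposed in the paper, which the abstract does not spell out. The frame of discernment and the mass values are hypothetical.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to a mass in [0, 1],
    and the masses of one function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Agreeing evidence: the product of masses goes to the intersection.
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Contradictory evidence: accumulate the conflict mass.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalise so the remaining masses sum to 1.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


# Hypothetical frame: does a candidate rule hold in the database or not?
FRAME = frozenset({"rule_holds", "rule_fails"})

# Mass placed on the whole frame represents ignorance, e.g. records whose
# relevant attribute value is missing (illustrative numbers).
m_a = {frozenset({"rule_holds"}): 0.6, FRAME: 0.4}
m_b = {
    frozenset({"rule_holds"}): 0.5,
    frozenset({"rule_fails"}): 0.2,
    FRAME: 0.3,
}

m_ab = combine(m_a, m_b)
for subset, mass in m_ab.items():
    print(sorted(subset), round(mass, 4))
```

Putting mass on the whole frame rather than on a specific hypothesis is how ignorance is expressed in this representation: it commits to nothing until other evidence narrows it down.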


2.
Data mining has proven to be very useful for extracting information from data in many different contexts. However, due to the complexity of data mining techniques, the know-how of an expert in this field is required to select and use them. Indeed, applying data mining adequately is out of the reach of novice users who have expertise in their area of work but lack the skills to employ these techniques. In this paper, we use both model-driven engineering and scientific workflow standards and tools to develop a framework named S3Mining, which supports novice users in selecting the data mining classification algorithm that best fits their data and goal. To this end, the selection process uses the past experiences of expert data miners in applying classification techniques to their own datasets. The contributions of our S3Mining framework are as follows: (i) an approach to create a knowledge base which stores the past experiences of expert users, (ii) a process that provides the expert users with utilities for constructing classifier recommenders based on the existing knowledge base, (iii) a system that allows novice data miners to use these recommenders to discover the classifiers that best fit their problem at hand, and (iv) a public implementation of the framework’s workflows. Finally, an experimental evaluation has been conducted to show the feasibility of our framework.

3.
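With the rapid development and widespread application of computer hardware and software, communication technology, and information processing technology, modern data management technology is also advancing at an accelerating pace. This paper starts from the new problems and main challenges currently facing database technology, and then reviews the research status and development trends of modern data management technology, with varying emphasis, from several perspectives: object-relational databases, XML and its application in data management, applications in the Web and VC, and the Semantic Web.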

4.
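Application of data mining technology in intrusion detection systems   Total citations: 9, self-citations: 0, citations by others: 9
温智宇 (Wen Zhiyu), 唐红 (Tang Hong), 吴渝 (Wu Yu). 《计算机工程与应用》 (Computer Engineering and Applications), 2003, 39(17): 153-156, 160
Intrusion detection systems are a network security technology that has emerged in recent years. This paper first introduces the technologies related to intrusion detection systems and their evaluation metrics, then focuses on applying data mining technology to intrusion detection systems. On this basis, it designs a structural block diagram of an intrusion detection system and proposes an adaptive generation model for intrusion detection systems based on data mining technology, thereby showing that applying data mining technology to intrusion detection is effective.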

5.
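The CCDM 2014 data mining competition, based on medical diagnosis data, posed a multi-label problem and a multi-class classification problem, both of which arise widely in real life. To address the class imbalance and the small number of training samples in both problems, and to better complete the data mining tasks, a new learning framework, secondary ensemble learning, is proposed, drawing on the ideas of secondary learning and ensemble learning. The framework obtains a number of high-confidence samples through a first round of ensemble learning, adds them to the original training set, and performs a second round of learning on the new training set, yielding a classifier with better generalization performance. The competition results show that, compared with commonly used ensemble learning, secondary ensemble learning achieved very good results on both problems.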

6.
This paper develops, tests, and validates a model for the antecedents of open source software (OSS) defects, using Data and Text Mining. The public archives of OSS projects are used to access historical data on over 5,000 active and mature OSS projects. Using domain knowledge and exploratory analysis, a wide range of variables is identified from the process, product, resource, and end-user characteristics of a project to ensure that the model is robust and considers all aspects of the system. Multiple Data Mining techniques are used to refine the model, and the data is enriched by the use of Text Mining for knowledge discovery from qualitative information. The study demonstrates the suitability of Data Mining and Text Mining for model building. Results indicate that project type, end-user activity, process quality, team size and project popularity have a significant impact on the defect density of operational OSS projects. Since many organizations, both for profit and not for profit, are beginning to use Open Source Software as an economic alternative to commercial software, these results can be used in the process of deciding what software can be reasonably maintained by an organization.

7.
Visual data mining in large geospatial point sets   Total citations: 2, self-citations: 0, citations by others: 2
Visual data-mining techniques have proven valuable in exploratory data analysis, and they have strong potential in the exploration of large databases. Detecting interesting local patterns in large data sets is a key research challenge. Particularly challenging today is finding and deploying efficient and scalable visualization strategies for exploring large geospatial data sets. One way is to share ideas from the statistics and machine-learning disciplines with ideas and methods from the information and geo-visualization disciplines. PixelMaps in the Waldo system demonstrates how data mining can be successfully integrated with interactive visualization. The increasing scale and complexity of data analysis problems require tighter integration of interactive geospatial data visualization with statistical data-mining algorithms.

8.
Classification with imbalanced data-sets has become one of the most challenging problems in Data Mining. When one class is much more represented than the other, this produces undesirable effects in both the learning and classification processes, mainly regarding the minority class. Tackling such a problem requires accurate tools; lately, ensembles of classifiers have emerged as a possible solution. Among ensemble proposals, the combination of Bagging and Boosting with preprocessing techniques has proved its ability to enhance the classification of the minority class. In this paper, we develop a new ensemble construction algorithm (EUSBoost) based on RUSBoost, one of the simplest and most accurate ensembles, which combines random undersampling with the Boosting algorithm. Our methodology aims to improve the existing proposals by enhancing the performance of the base classifiers through the use of the evolutionary undersampling approach. Besides, we promote diversity by favoring the use of different subsets of majority class instances to train each base classifier. Centered on two-class highly imbalanced problems, we will prove, supported by the proper statistical analysis, that EUSBoost is able to outperform the state-of-the-art methods based on ensembles. We will also analyze its advantages using kappa-error diagrams, which we adapt to the imbalanced scenario.
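
The abstract above builds on random undersampling, with each base classifier trained on a different subset of majority class instances. The sketch below illustrates only that sampling step, using plain random undersampling as in RUSBoost. It does not implement the evolutionary undersampling of EUSBoost or the Boosting weight updates, and the function name and toy data are hypothetical.

```python
import random


def balanced_subsets(majority, minority, n_classifiers, seed=0):
    """Build one balanced training subset per base classifier.

    Each subset keeps every minority instance and draws, without
    replacement, an equally sized random sample of majority instances.
    Assumes len(minority) <= len(majority).
    """
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_classifiers):
        sampled = rng.sample(majority, len(minority))
        subsets.append(sampled + list(minority))
    return subsets


# Hypothetical two-class data: (instance id, label).
majority = [(i, "neg") for i in range(20)]
minority = [(100 + i, "pos") for i in range(4)]

subsets = balanced_subsets(majority, minority, n_classifiers=3)
for subset in subsets:
    print(sorted(instance_id for instance_id, _ in subset))
```

Because each draw is independent, the base classifiers generally see different majority instances, which is the source of diversity the abstract refers to.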

9.
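Exploring the application of data mining technology in bioinformatics   Total citations: 1, self-citations: 0, citations by others: 1
The analysis of biological information has become one of the most important topics for computer researchers. As a key analysis technique in this area, data mining has good prospects for research and application in bioinformatics. Data mining research in bioinformatics is still at an early stage, and many problems remain to be solved. Drawing on the research background of data mining technology and bioinformatics, this paper studies the state of application of data mining technology in bioinformatics.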

10.
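As an emerging technology, data mining has been applied successfully in many fields. This paper gives an overview of data mining technology from three aspects: the definition of data mining, the classification of data mining techniques, and the development of data mining technology and mining tools. It also discusses the application of data mining technology in process monitoring.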

11.
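The analysis of biological information has become one of the most important topics for computer researchers. As a key analysis technique in this area, data mining has good prospects for research and application in bioinformatics. Data mining research in bioinformatics is still at an early stage, and many problems remain to be solved. Drawing on the research background of data mining technology and bioinformatics, this paper studies the state of application of data mining technology in bioinformatics.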

12.
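OLAM: a new architecture combining OLAP and DM   Total citations: 1, self-citations: 0, citations by others: 1
OLAM (On-Line Analytical Mining) is a new architecture formed by combining OLAP (On-Line Analytical Processing) with DM (Data Mining). From an application perspective, the paper explains the differences and connections between OLAP and DM, and summarizes and discusses in detail the current research hotspots in the OLAM field, its applications, and its open problems.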

13.
The human visual sense has a unique status, offering a very broadband channel for information flow. Visual approaches to analysis and mining attempt to take advantage of our abilities to perceive pattern and structure in visual form and to make sense of, or interpret, what we see. Visual Data Mining techniques have proven to be of high value in exploratory data analysis and they also have a high potential for mining large databases. In this work, we seek to investigate and expand the area of visual data mining by proposing new visual data mining techniques for the visualization of mining outcomes.

14.
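王荣 (Wang Rong), 陈纯 (Chen Chun). 《计算机应用与软件》 (Computer Applications and Software), 2007, 24(11): 98-99, 113
Data mining is an advanced process of extracting information that is implicit in massive data and relevant to particular users. Attribute selection is a very important research direction in data mining, and the quality of attribute selection has a great impact on mining performance and results. A new attribute selection algorithm based on information gain and the chi-square test is proposed; it was applied in a customer churn prediction model and achieved quite good results.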
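
The abstract above names information gain and the chi-square test as the basis of its attribute selection algorithm. It does not say how the two scores are combined, so the sketch below only computes each score for a categorical attribute against a class label, using the standard textbook definitions. The attribute names and the churn data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy after splitting on a categorical attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def chi_square(values, labels):
    """Pearson chi-square statistic of the attribute/label contingency table."""
    total = len(labels)
    observed = Counter(zip(values, labels))
    value_counts = Counter(values)
    label_counts = Counter(labels)
    stat = 0.0
    for v, nv in value_counts.items():
        for y, ny in label_counts.items():
            expected = nv * ny / total  # count expected under independence
            stat += (observed[(v, y)] - expected) ** 2 / expected
    return stat


# Hypothetical churn data: two candidate attributes and the class label.
plan = ["prepaid"] * 4 + ["contract"] * 4
region = ["north", "south"] * 4
churn = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]

for name, column in [("plan", plan), ("region", region)]:
    print(name, round(information_gain(column, churn), 4), round(chi_square(column, churn), 4))
```

On this toy data both scores rank `plan` above `region`, since `region` is statistically independent of the label here.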

15.
This paper surveys a selection of personal research projects which addressed problems related to Software Engineering, and whose solution was inspired by ideas from the field of Knowledge Representation and Reasoning. Surprisingly often, the research was also related to problems in Databases. We discuss, in part, to what extent the KR ideas provided ready-made solutions to SE and DB problems, and how frequently we had to invent new KR techniques.

16.
Any crossover operator has both beneficial and detrimental effects: it can bring building blocks together or it can tear them apart. In this paper, we provide evidence that the recombination can be biased towards its more beneficial aspects by modifying both the parent selection process and the number of children created by each pair of parents. We exclude both high rank and low rank individuals from being selected as parents. The new idea is that the worst individuals do not have valuable building blocks to contribute, and it is too risky to subject the best individuals to crossover and have their building blocks separated. In a further refinement, we allow the number of children per family to be correlated to the diversity of their parents, and thus increase the pressure of sibling rivalry (competition). These ideas are tested on well-known test functions such as the hierarchical if-and-only-if, royal road, concatenated trap functions and the one dimensional Ising spin model. Four different parent selection schemes are compared and simulations are shown for both two children (fixed) and many children (variable) families. The results indicate that these changes are beneficial for a wide class of problems.

17.
This paper presents a review of the use of intelligent data analysis techniques in Hydrocarbon Exploration. The term “intelligent” is used in its broadest sense. The process of hydrocarbon exploration exploits data which have been collected from different sources. Different dimensions of data are analyzed by using Statistical Analysis, Data Mining, Artificial Neural Networks and Artificial Intelligence. This review is meant not only to describe the evolution of intelligent data analysis techniques used in different phases of hydrocarbon exploration but also to signify the growing use of Data Mining in various application domains. We avoided a general review of Data Mining and other intelligent data analysis techniques in this paper, as the volume of general literature might affect the precision of our view regarding the application of these techniques in hydrocarbon exploration. The review reveals the suitability of existing techniques to data collected from diverse sources in addition to the use of analytical techniques for the process of hydrocarbon exploration.

18.
Data mining techniques can be used for discovering interesting patterns in complicated manufacturing processes. These patterns are used to improve manufacturing quality. Classical representations of quality data mining problems usually refer to the operations settings and not to their sequence. This paper examines the effect of the operation sequence on the quality of the product using data mining techniques. For this purpose a novel decision tree framework for extracting sequence patterns is developed. The proposed method is capable of mining sequence patterns of any length with operations that are not necessarily immediate precedents. The core induction algorithmic framework consists of four main steps. In the first step, all manufacturing sequences are represented as strings of tokens. In the second step, a large set of regular expression-based patterns is induced by employing sequence patterns. In the third step, we use feature selection methods to filter the initial set and leave only the most useful patterns. In the fourth step, we transform the quality problem into a classification problem and employ a decision tree induction algorithm. A comparative study performed on benchmark databases illustrates the capabilities of the proposed framework.
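
The first two steps described in the abstract above, encoding sequences as token strings and expressing patterns as regular expressions over operations that need not be adjacent, can be sketched as follows. This is an assumption-laden illustration, not the paper's induction procedure: the patterns are written by hand rather than induced, and the operation names and helper functions are hypothetical. The resulting binary features are the kind of input that the later feature selection and decision tree steps would consume.

```python
import re


def encode(sequence):
    """Represent a manufacturing sequence as a space-separated token string."""
    return " ".join(sequence)


def gap_pattern(first, second):
    """Regex matching `first` followed later by `second`, with any gap between."""
    return re.compile(rf"\b{re.escape(first)}\b.*\b{re.escape(second)}\b")


def to_features(sequences, patterns):
    """One binary feature per pattern: 1 if it matches the sequence, else 0."""
    encoded = [encode(s) for s in sequences]
    return [[int(bool(p.search(text))) for p in patterns] for text in encoded]


# Hypothetical operation sequences for three manufactured items.
sequences = [
    ["cut", "drill", "polish", "coat"],
    ["cut", "coat", "drill"],
    ["drill", "cut", "polish"],
]

# Hand-written candidate patterns; the paper induces a large set automatically.
patterns = [gap_pattern("cut", "coat"), gap_pattern("drill", "polish")]

features = to_features(sequences, patterns)
for seq, row in zip(sequences, features):
    print(encode(seq), row)
```

Each row can then be paired with a quality label (for example, pass or fail) to turn the quality problem into an ordinary classification task.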

19.
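With the development of information technology, more and more data has been accumulated. Data mining technology provides a powerful tool for processing this massive data. This paper first introduces the concept of data mining technology, then analyzes the composition of a data mining system and the data mining process, and finally analyzes in detail the commonly used data mining methods.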

20.
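From knowledge graph to data middle platform (数据中台): the Huapu system (华谱系统)   Total citations: 1, self-citations: 0, citations by others: 1
For the fragmented genealogy data of various surnames, the Huapu system builds a data middle platform on a genealogy knowledge graph, which can resolve problems such as data silos and chimney-style (siloed) development. The "data middle platform" is a recent technical concept that originated in China; in building the Huapu system, we formally defined this concept through the construction and application of a genealogy knowledge graph. Based on this definition and the 7 corresponding core functions, this paper proposes Huapu-CP (the Huapu system), a data middle platform architecture for genealogy data analysis, uses this architecture to describe in detail the core technologies of a data middle platform for the genealogy domain, and analyzes the key issues in building a data middle platform.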
