首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Data mining, through its capacity to discover knowledge embedded in large databases to improve organizational decision-making, has the potential to contribute to efficiencies and cost savings in the increasingly costly healthcare industry. One important aspect of the methods of mining medical databases includes reducing dimensionality through feature selection. Traditionally feature selection is accomplished through stepwise regression, which tends to produce an unnecessarily high number of "significant" variables. This paper applies a filter-based feature selection method using inconsistency rate measure and discretization, to a medical claims database to predict the adequacy of duration of antidepressant medication utilization. Compared to traditional stepwise logistic regression, which selected seven variables from a total of nine potential explanatory variables to characterize patients with inadequate antidepressant medication utilization, the filter-based method selected two variables (age and number of claims) to achieve a similar prediction accuracy. This comparison suggests it may be feasible and efficient to apply the filter-based feature selection method to reduce the dimensionality of healthcare databases.  相似文献   

2.
《Information Systems》2005,30(1):71-88
Many large organizations have multiple databases distributed in different branches, and therefore multi-database mining is an important task for data mining. To reduce the search cost in the data from all databases, we need to identify which databases are most likely relevant to a data mining application. This is referred to as database selection. For real-world applications, database selection has to be carried out multiple times to identify relevant databases that meet different applications. In particular, a mining task may be without reference to any specific application. In this paper, we present an efficient approach for classifying multiple databases based on their similarity between each other. Our approach is application-independent.  相似文献   

3.
Web使用挖掘研究及实现   总被引:4,自引:2,他引:4  
Web使用挖掘并不是简单地把数据挖掘算法应用在Web日志上,由于WWW体系结构的特殊性(包括Web站点上物理路径和逻辑路径的不一致),必须采用一种新的框架来处理挖掘过程。整个挖掘过程可以分为两大部分:ECLF日志预处理和在预处理后的数据集上进行挖掘。文中从应用的角度出发,在分析了这两个过程的具体流程后,给出了一个完整的Web使用模式挖掘解决方案和从Web日志中挖掘关联规则的系统原型。  相似文献   

4.
Mining of periodic patterns in time-series databases is an interesting data mining problem. It can be envisioned as a tool for forecasting and prediction of the future behavior of time-series data. Incremental mining refers to the issue of maintaining the discovered patterns over time in the presence of more items being added into the database. Because of the mostly append only nature of updating time-series data, incremental mining would be very effective and efficient. Several algorithms for incremental mining of partial periodic patterns in time-series databases are proposed and are analyzed empirically. The new algorithms allow for online adaptation of the thresholds in order to produce interactive mining of partial periodic patterns. The storage overhead of the incremental online mining algorithms is analyzed. Results show that the storage overhead for storing the intermediate data structures pays off as the incremental online mining of partial periodic patterns proves to be significantly more efficient than the nonincremental nononline versions. Moreover, a new problem, termed merge mining, is introduced as a generalization of incremental mining. Merge mining can be defined as merging the discovered patterns of two or more databases that are mined independently of each other. An algorithm for merge mining of partial periodic patterns in time-series databases is proposed and analyzed.  相似文献   

5.
增量式K-Medoids聚类算法   总被引:3,自引:0,他引:3  
高小梅  冯志  冯兴杰 《计算机工程》2005,31(Z1):181-183
聚类是一种非常有用的数据挖掘方法,可用于发现隐藏在数据背后的分组和数据分布信息。目前已经提出了许多聚类算法及其变种,但在增量式聚类算法研究方面所做的工作较少。当数据集因更新而发生变化时,数据挖掘的结果也应该进行相应的更新。由于数据量大,在更新后的数据集上重新执行聚类算法以更新挖掘结果显然比较低效,因此亟待研究增量式聚类算法。该文通过对K-Medoids聚类算法的改进,提出一种增量式K-Medoids聚类算法。它能够很好地解决传统聚类算法在伸缩性、数据定期更新时所面临的问题。  相似文献   

6.
数据挖掘语言标准化的研究是开发新一代数据挖掘系统的关键。DMX(Data Mining Extensions,数据挖掘扩展)是OLE DBFor DM规范支持的数据挖掘查询语言,支持数据挖掘系统直接对关系数据库进行挖掘,是数据挖掘原语标准化发展中的一个突破。该文介绍了OLE DB For DM规范下数据挖掘的主要步骤,给出了Microsoft SQL Server Analysis Services中基于DMX的实现方法。  相似文献   

7.
关联规则挖掘是数据挖掘研究的重要分支。发现频繁项目序列集又是关联规则挖掘中的一个关键阶段。十几年来,许多发现频繁项目集的算法已经被提出。近几年来,人们更关注于在大型数据集中高效发现频繁项目集的算法研究,特别是在减少数据库的扫描次数、提高内存利用率等方面。该文提出一个称为DFISP的算法,它是基于数据分段扫描策略的,并且只需两次数据库扫描即可完成频繁项目序列集的生成。实验表明,DFISP算法是稳定而高效的。  相似文献   

8.
分布式环境下约束性关联规则的快速挖掘   总被引:2,自引:0,他引:2  
研究人员针对单机环境提出了约束性关联规则的挖掘算法,但它们不适用于分布式环境.为此本文讨论分布式环境下约束性关联规则的快速挖掘技术,提出一种基于分布式环境的约束性关联规则快速挖掘算法DCAR,其中包括局部约束性频繁项目集挖掘算法MLFC和全局约束性频繁项目集挖掘算法MGFC.该算法根据布尔约束条件产生向导集,采用一种新的候选项集生成函数Reorder-gen,该函数通过向导集高效地产生分布式环境中满足约束条件的、数量较少且完备的候选项集,并且求解全局约束性频繁项集过程中,传送局部候选项集支持数的通信量为O(n),从而提高了算法的挖掘效率.将本文提出的算法加以实现,实验结果表明DCAR算法高效可行,其效率大约是DMA-IC算法的2-3倍.  相似文献   

9.
传统的操作型数据库存在数据分散、数据规范不统一、数据可分析能力较差等缺点,所以论文引入数据仓库技术,在分析其系统体系结构的基础上,提出基于数据仓库的智能电力信息系统。该系统利用数据挖掘技术和专家系统的推理模式,设计提出了电力生产决策支持系统的体系结构,将为电力生产管理的信息化提供强大的决策支持。  相似文献   

10.
快速挖掘全局最大频繁项目集   总被引:18,自引:1,他引:18  
挖掘最大频繁项目集是多种数据挖掘应用中的关键问题.现行可用的最大频繁项目集挖掘算法大多基于单机环境,针对分布式环境下的全局最大频繁项目集挖掘尚不多见.若将基于单机环境的最大频繁项目集挖掘算法运用于分布式环境,或运用分布式环境下的全局频繁项目集挖掘算法来挖掘全局最大频繁项目集,均会产生大量的候选频繁项目集,且网络通信代价高.为此,提出了快速挖掘全局最大频繁项目集算法FMGMFI(fast mining global maximum frequent itemsets),该算法采用FP-tree存储结构,可方便地从各局部FP-tree的相关路径中得到项目集的频度,同时采用自顶向下和自底向上的双向搜索策略,可有效地降低网络通信代价.实验结果表明,FMGMF算法是有效、可行的.  相似文献   

11.
序贯模式是时间相关数据库中存在的一种十分有用的知识模式,其发掘方法的研究有着十分重要的意义,本文给出了一种挖掘数据库中序贯模式的算法,通过认真地研究了挖掘过程中的中间及结果数据的存储结构,从而大大地减少了对数据库的扫描遍数,提高了算法的效率。  相似文献   

12.
Frequent itemset mining allows us to find hidden, important information from large databases. Moreover, processing incremental databases in the itemset mining area has become more essential because a huge amount of data has been accumulated continually in a variety of application fields and users want to obtain mining results from such incremental data in more efficient ways. One of the major problems in incremental itemset mining is that the corresponding mining results can be very large-scale according to threshold settings and data volumes. In addition, it is considerably hard to analyze all of them and find meaningful information. Furthermore, not all of the mining results become actually important information. In this paper, to solve these problems, we propose an algorithm for mining weighted maximal frequent itemsets from incremental databases. By scanning a given incremental database only once, the proposed algorithm can not only conduct its mining operations suitable for the incremental environment but also extract a smaller number of important itemsets compared to previous approaches. The proposed method also has an effect on expert and intelligent systems since it can automatically provide more meaningful pattern results reflecting characteristics of given incremental databases and threshold settings, which can help users analyze the given data more easily. Our comprehensive experimental results show that the proposed algorithm is more efficient and scalable than previous state-of-the-art algorithms.  相似文献   

13.
Database mining must capture the dynamics of data in the real-world. In other words, mining models would be able to maintain all changed rules when a database is updated. This paper presents a new model of aggregating association rules aimed at not only maintaining association rules in dynamical databases, but also aggregating association rules from different data sources. Indeed, this aggregation of association rules is useful for making decisions. Our experiments show that this model is efficient to aggregate rules from both of dynamical databases and different data sources.  相似文献   

14.
Toward multidatabase mining: identifying relevant databases   总被引:2,自引:0,他引:2  
Various tools and systems for knowledge discovery and data mining have been developed and are available for applications. However, when there are many databases, an immediate question is where one should start mining. It is not true that data mining is better the more databases there are. It is only true when the databases involved are relevant to the task at hand. By breaking away from the conventional data mining assumption that many databases should be joined into one, we argue that the first step for multidatabase mining is to identify databases that are most relevant to an application; without doing so, the mining process can be lengthy, aimless, and ineffective. A measure of relevance is thus proposed for mining tasks with the objective of finding patterns or regularities of certain attributes. An efficient algorithm for identifying relevant databases is described. Experiments are conducted to verify the measure's performance and to exemplify its application  相似文献   

15.
Incorporating constraints into frequent itemset mining not only improves data mining efficiency, but also leads to concise and meaningful results. In this paper, a framework for closed constrained gradient itemset mining in retail databases is proposed by introducing the concept of gradient constraint into closed itemset mining. A tailored version of CLOSET+, LCLOSET, is first briefly introduced, which is designed for efficient closed itemset mining from sparse databases. Then, a newly proposed weaker but antimonotone measure, top-X average measure, is proposed and can be adopted to prune search space effectively. Experiments show that a combination of LCLOSET and the top-X average pruning provides an efficient approach to mining frequent closed gradient itemsets.  相似文献   

16.
This paper presents a simple, efficient computer-based method for discovering causal relationships from databases that contain observational data. Observational data is passively observed, as contrasted with experimental data. Most of the databases available for data mining are observational. There is great potential for mining such databases to discover causal relationships. We illustrate how observational data can constrain the causal relationships among measured variables, sometimes to the point that we can conclude that one variable is causing another variable. The presentation here is based on a constraint-based approach to causal discovery. A primary purpose of this paper is to present the constraint-based causal discovery method in the simplest possible fashion in order to (1) readily convey the basic ideas that underlie more complex constraint-based causal discovery techniques, and (2) permit interested readers to rapidly program and apply the method to their own databases, as a start toward using more elaborate causal discovery algorithms.  相似文献   

17.
Discovering Frequent Closed Partial Orders from Strings   总被引:2,自引:0,他引:2  
Mining knowledge about ordering from sequence data is an important problem with many applications, such as bioinformatics, Web mining, network management, and intrusion detection. For example, if many customers follow a partial order in their purchases of a series of products, the partial order can be used to predict other related customers' future purchases and develop marketing campaigns. Moreover, some biological sequences (e.g., microarray data) can be clustered based on the partial orders shared by the sequences. Given a set of items, a total order of a subset of items can be represented as a string. A string database is a multiset of strings. In this paper, we identify a novel problem of mining frequent closed partial orders from strings. Frequent closed partial orders capture the nonredundant and interesting ordering information from string databases. Importantly, mining frequent closed partial orders can discover meaningful knowledge that cannot be disclosed by previous data mining techniques. However, the problem of mining frequent closed partial orders is challenging. To tackle the problem, we develop Frecpo (for frequent closed partial order), a practically efficient algorithm for mining the complete set of frequent closed partial orders from large string databases. Several interesting pruning techniques are devised to speed up the search. We report an extensive performance study on both real data sets and synthetic data sets to illustrate the effectiveness and the efficiency of our approach  相似文献   

18.
Mining sequential patterns by pattern-growth: the PrefixSpan approach   总被引:12,自引:0,他引:12  
Sequential pattern mining is an important data mining problem with broad applications. However, it is also a difficult problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Most of the previously developed sequential pattern mining methods, such as GSP, explore a candidate generation-and-test approach [R. Agrawal et al. (1994)] to reduce the number of candidates to be examined. However, this approach may not be efficient in mining large sequence databases having numerous patterns and/or long patterns. In this paper, we propose a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns. In this approach, a sequence database is recursively projected into a set of smaller projected databases, and sequential patterns are grown in each projected database by exploring only locally frequent fragments. Based on an initial study of the pattern growth-based sequential pattern mining, FreeSpan [J. Han et al. (2000)], we propose a more efficient method, called PSP, which offers ordered growth and reduced projected databases. To further improve the performance, a pseudoprojection technique is developed in PrefixSpan. A comprehensive performance study shows that PrefixSpan, in most cases, outperforms the a priori-based algorithm GSP, FreeSpan, and SPADE [M. Zaki, (2001)] (a sequential pattern mining algorithm that adopts vertical data format), and PrefixSpan integrated with pseudoprojection is the fastest among all the tested algorithms. Furthermore, this mining methodology can be extended to mining sequential patterns with user-specified constraints. The high promise of the pattern-growth approach may lead to its further extension toward efficient mining of other kinds of frequent patterns, such as frequent substructures.  相似文献   

19.
When extracting knowledge (or patterns) from multiple databases, the data from different databases might be too large in volume to be merged into one database for centralized mining on one computer, the local information sources might be hidden from a global decision maker due to privacy concerns, and different local databases may have different contribution to the global pattern. Dealing with multiple databases is essentially different from mining from a single database. In multi-database mining, the global patterns must be obtained by carefully analyzing the local patterns from individual databases. In this paper, we propose a nonlinear method, named KEMGP (kernel estimation for mining global patterns), to tackle this problem, which adopts kernel estimation to synthesizing local patterns for global patterns. We also adopt a method to divide all the data in different databases according to attribute dimensionality, which reduces the total space complexity. We test our algorithm on a customer management system, where the application is to obtain all globally interesting patterns by analyzing the individual databases. The experimental results show that our method is efficient.  相似文献   

20.
谢东  杨路明  蒲保兴  刘波 《计算机工程》2007,33(22):66-67,8
结合概率数据库技术,以元组匹配所产生的聚类为基础,提出了一种新的基于聚类的非一致性数据的概率方法。基于可信聚类,给出了基本的查询重写技术,在有聚集的查询中,考虑了合适的元组概率、区间值、期望值。在不进行程序预处理的情况下,“重写”能被商业数据库系统有效地优化和执行,采用不一致性数据的区分度和数据库大小去理解其适应性,并使用了TPC-H基准的数据和查询。实验显示了该方法的有效性。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号