首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 296 毫秒
1.
提出了基于超结构的分布式系统的关联规则挖掘的分布式算法 (HSDM) ,该算法与现有的相关分布式挖掘算法相比 ,具有明显的优点 .该算法不需要产生候选项集 ,只需两次扫描各站点局部数据库 ,挖掘速度快 .该算法还采用自底向上的挖掘方式 ,能够对其超结构进行有效剪枝 ,从而大大减少了各站点之间的数据交换 ,提高了算法的效率  相似文献   

2.
针对Apriori算法的不足,提出基于项数布尔矩阵的改进算法MPIN_Apriori。改进算法运用分治思想将数据集分段处理,使用事务项数进行矩阵压缩并利用向量交运算和先验剪枝直接生成局部频繁k-项集,最终合并为全局频繁k-项集。该算法从根本上改进了Apriori算法频繁迭代的流程,避免了连接运算而且极大减轻了内存负担。实验结果表明在进行大型数据库频繁项集挖掘时其效率明显高于Apriori算法,而且对分布式数据挖掘有参考价值。  相似文献   

3.
分布式环境下约束性关联规则的快速挖掘   总被引:2,自引:0,他引:2  
研究人员针对单机环境提出了约束性关联规则的挖掘算法,但它们不适用于分布式环境.为此本文讨论分布式环境下约束性关联规则的快速挖掘技术,提出一种基于分布式环境的约束性关联规则快速挖掘算法DCAR,其中包括局部约束性频繁项目集挖掘算法MLFC和全局约束性频繁项目集挖掘算法MGFC.该算法根据布尔约束条件产生向导集,采用一种新的候选项集生成函数Reorder-gen,该函数通过向导集高效地产生分布式环境中满足约束条件的、数量较少且完备的候选项集,并且求解全局约束性频繁项集过程中,传送局部候选项集支持数的通信量为O(n),从而提高了算法的挖掘效率.将本文提出的算法加以实现,实验结果表明DCAR算法高效可行,其效率大约是DMA-IC算法的2-3倍.  相似文献   

4.
分布式不确定数据上的概率Skyline计算   总被引:2,自引:1,他引:1       下载免费PDF全文
提出了分布式不确定数据上概率skyline的低通信开销算法。首先给出了一种间接的对象分布信息——剪枝空间,分布节点通过共享全局剪枝空间,能够减少通信开销。为了降低传输剪枝空间带来的额外通信开销,对表示剪枝空间的虚拟对象集合进行基于距离的压缩。与基本算法相比,100个分布节点时,在真实数据集上节省了69%的通信开销;在均匀、正相关、反相关三种标准模拟数据上分别节省60.5%、41.8%、24.5%的通信开销。  相似文献   

5.
提出一种基于FP—tree的最大频繁项目挖掘算法DMFIA—D,该算法运用双向搜索策略。根据FP—tree构造特征自顶向下选取最大频繁候选项集,自底向上对候选项集进行计数、剪枝最终确定最大频繁项目集。由于减少了最大频繁候选集,并对候选集进行有效剪枝,从而缩短算法的挖掘时间,提高挖掘效率。  相似文献   

6.
Tip分解作为图数据管理领域的热点研究问题,已被广泛应用于文档聚类和垃圾邮件组检测等实际场景中.随着图数据规模的爆炸式增长,单机内存已无法满足其存储需求,亟需研究分布式环境下Tip分解技术.现有分布式图计算系统的通信模式无法适用于二部图,为此,首先提出一种基于中继的通信模式,以实现分布式环境下处理二部图时消息的有效传递;其次,提出分布式butterfly计数算法(DBC)和tip分解算法(DTD),特别地,为解决处理大规模二部图时DBC面临的内存溢出问题,提出了一种可控的并行顶点激活策略;最后,引入基于顶点优先级的消息剪枝策略和消息有效性剪枝策略,通过减少冗余通信和计算开销,进一步提高算法效率.实验平台部署于国家超算中心高性能分布式集群上,在多个真实数据集上的实验结果验证了所提算法的有效性和高效性.  相似文献   

7.
为了克服Skyline查询的不足即结果集大小无法控制,提出了Skyline代表点查询,返回k个可描述全局Skyline轮廓的Skyline代表点。研究了分布式环境下的Skyline代表点查询,提出了Naive算法和FDRA。Naive算法首先转移每个子节点上满足条件的两个局部代表点,再通过比较传来的局部代表点间的评价函数值大小决定子节点是否需要传送余下的局部点,以实现剪枝非代表点;与之相比,FDRA的改进在于过滤元组的选择,运用反馈方法,将每次动态更新最大评价函数值的点作为过滤元组,大大降低了计算代价,中心服务器每次只发送过滤元组到分布节点,这样可以尽早且最大限度地剪枝不可能成为代表的Skyline点。提出的算法降低了服务器间的通信开销,返回了正确的结果集,实验论证了算法的有效性与高效性。  相似文献   

8.
在传统剪枝策略中,具有相同事务集的父子结点搜索空间没有充分剪枝,效率较低.为此,提出父子等价的剪枝策略.采用深度优先搜索集合枚举树,对于父子结点中具有相同事务集的搜索空间进行剪枝,有效地缩小搜索空间,减少频繁项计算的次数,给出基于该剪枝策略的最大频繁项集挖掘算法.实验结果表明,该算法可缩短同一支持度下的最大频繁项集挖掘时间.  相似文献   

9.
在传统LIPI数据挖掘算法中,需要反复扫描投影数据库寻找局部频繁项并重复构造大量重复投影,造成数据挖掘耗时,效率低下的不足.为了提高算法的计算速度,提出改进的LIPI数据挖掘算法.算法借助连接2-序列位置信息表(LIPI)找到序列模式的下一项,完成K-1序列位置信息与2-序列位置信息的连接,实现序列模式放缩式增长,得出K-序列与K-序列相应的位置信息数据,避免对投影数据库反复扫描;引入了BIDE算法的前后向剪枝策略,检查相同末项序列位置信息表进行前向剪枝,消除大量重复投影的构建,提高挖掘算法的效率.实验结果表明,改进后的算法能快速的寻找到局部频繁项,有效提高了数据挖掘的效率.  相似文献   

10.
针对Apriori算法从数据中挖掘频繁项集的计算时间效率较低和空间内存占用较高的问题提出一种ATSAHT-Apriori(Adjacency Table Storage and Hash Table-Apriori)算法。该算法利用哈希表来存储数据,极大地提高了项集支持度频数的计算效率,结合图存储的思想利用邻接表来存储候选项集,极大地优化了内存空间占用,同时将候选项集构建大根堆,通过堆排序的思想与动态剪枝算法思想优化了频繁项集的计算速度和候选项集存储的内存空间,有效地优化了传统Apriori算法的计算时间效率和内存空间占用方面的不足。一系列对比实验表明,ATSAHT-Apriori算法在时间效率和空间效率都有一定的提高。  相似文献   

11.
Efficient mining of association rules in distributed databases   总被引:14,自引:0,他引:14  
Many sequential algorithms have been proposed for the mining of association rules. However, very little work has been done in mining association rules in distributed databases. A direct application of sequential algorithms to distributed databases is not effective, because it requires a large amount of communication overhead. In this study, an efficient algorithm called DMA (Distributed Mining of Association rules), is proposed. It generates a small number of candidate sets and requires only O(n) messages for support-count exchange for each candidate set, where n is the number of sites in a distributed database. The algorithm has been implemented on an experimental testbed, and its performance is studied. The results show that DMA has superior performance, when compared with the direct application of a popular sequential algorithm, in distributed databases  相似文献   

12.
基于DDMINER分布式数据库系统中频繁项目集的更新   总被引:13,自引:0,他引:13  
吉根林  杨明  赵斌  孙志挥 《计算机学报》2003,26(10):1387-1392
给出了一种分布式数据挖掘系统的体系结构DDMINER,对分布式数据库系统中频繁项目集的更新问题进行探讨,既考虑了数据库中事务增加的情况,又考虑了事务删除的情况;提出了一种基于DDMINER的局部频繁项目集的更新算法ULF和全局频繁项目集的更新算法UGF.该算法能够产生较少数量的候选频繁项目集,在求解全局频繁项目集过程中,传送候选局部频繁项目集支持数的通信量为O(n);将文章提出的算法用Java语言加以实现,并对算法性能进行了研究;实验结果表明这些算法是正确、可行的,并且具有较高的效率.  相似文献   

13.
Classification based on decision trees is one of the important problems in data mining and has applications in many fields. In recent years, database systems have become highly distributed, and distributed system paradigms, such as federated and peer-to-peer databases, are being adopted. In this paper, we consider the problem of inducing decision trees in a large distributed network of genomic databases. Our work is motivated by the existence of distributed databases in healthcare and in bioinformatics, and by the emergence of systems which automatically analyze these databases, and by the expectancy that these databases will soon contain large amounts of highly dimensional genomic data. Current decision tree algorithms require high communication bandwidth when executed on such data, which are large-scale distributed systems. We present an algorithm that sharply reduces the communication overhead by sending just a fraction of the statistical data. A fraction which is nevertheless sufficient to derive the exact same decision tree learned by a sequential learner on all the data-in the network. Extensive experiments using standard synthetic SNP data show that the algorithm utilizes the high dependency among attributes, typical to genomic data, to reduce communication overhead by up to 99 percent. Scalability tests show that the algorithm scales well with both the size of the data set, the dimensionality of the data, and the size of the distributed system.  相似文献   

14.
Pervasive computing deployments are increasingly using sensor networks to build instrumented environments that provide local data to immersed mobile applications. These applications demand opportunistic and unpredictable interactions with local devices. While this direct communication has the potential to reduce both overhead and latency, it deviates significantly from existing uses of sensor networks that funnel information to a static central collection point. This pervasive computing driven perspective demands new communication abstractions that enable the required direct communication among mobile applications and sensors. This paper presents the scene abstraction, which allows immersed applications to create dynamic distributed data structures over the immersive sensor network. A scene is created based on application requirements, properties of the underlying network, and properties of the physical environment. This paper details our work on defining scenes, providing an abstract model, an implementation, and an evaluation.  相似文献   

15.
数据复制是提高数据库系统性能和可用性的重要技术。近年来出现的基于组通信技术的数据库复制协议较之传统的数据库复制协议因其实现简单灵活、性能较优,在构建实用复制数据库系统时得到广泛应用,维持各节点副本一致性是数据复制技术研究的核心问题。本文讨论了当故障节点恢复后重新加入系统或增加新节点后如何恢复、维护系统节节点副本间的一致性的问题。本文提出了一个针对分布复制数据库系统的恢复协议,该协议结合基于组通信技术的复制协议,可在不影响系统正常事务处理的情况下,实现故障恢复后节点或全新节点重新加入系统时的系统正确恢复,并给出了理论证明。同时还证明,该恢复协议对于分布在不可靠网络上的复制数据库的恢复问题同样适用。  相似文献   

16.
Decomposition of knowledge for concurrent processing   总被引:1,自引:0,他引:1  
In some environments, it is more difficult for distributed systems to cooperate. In fact, some distributed systems are highly heterogeneous and might not readily cooperate. In order to alleviate these problems, we have developed an environment that preserves the autonomy of the local systems, while enabling distributed processing. This is achieved by: modeling the different application systems into a central knowledge base (called a Metadatabase); providing each application system with a local knowledge processor; and distributing the knowledge within these local shells. This paper is concerned with describing the knowledge decomposition process used for its distribution. The decomposition process is used to minimize the needed cooperation among the local knowledge processors, and is accomplished by “serializing” the rule execution process. A rule is decomposed into an ordered set of subrules, each of which is executed in sequence and located in a specific local knowledge processor. The goals of the decomposition algorithm are to minimize the number of subrules produced, hence reducing the time spent in communication, and to assure that the sequential execution of the subrules is “equivalent” to the execution of the original rule  相似文献   

17.
This paper addresses the problem of query optimization fordynamic databases in distributed environments where data frequently changetheir values. An adaptive query optimization algorithm is proposed toevaluate queries. Rather than constructing a full plan for an access path andexecuting it, the algorithm constructs a partial plan, executes it, updatesthe statistics, and constructs a new partial plan. Since a partial plan isconstructed based on the latest statistics, the algorithm is adaptive to data modifications and errors from the statistics. The algorithm extends the SDD-1algorithm by considering local processing cost as well as communication cost.Whereas the SDD-1 algorithm only uses semi-joins to reduce communication cost,the algorithm reduces it with joins as well. It is proved that the adaptivealgorithm is more efficient than the SDD-1 algorithm.  相似文献   

18.
快速挖掘全局最大频繁项目集   总被引:18,自引:1,他引:18       下载免费PDF全文
挖掘最大频繁项目集是多种数据挖掘应用中的关键问题.现行可用的最大频繁项目集挖掘算法大多基于单机环境,针对分布式环境下的全局最大频繁项目集挖掘尚不多见.若将基于单机环境的最大频繁项目集挖掘算法运用于分布式环境,或运用分布式环境下的全局频繁项目集挖掘算法来挖掘全局最大频繁项目集,均会产生大量的候选频繁项目集,且网络通信代价高.为此,提出了快速挖掘全局最大频繁项目集算法FMGMFI(fast mining global maximum frequent itemsets),该算法采用FP-tree存储结构,可方便地从各局部FP-tree的相关路径中得到项目集的频度,同时采用自顶向下和自底向上的双向搜索策略,可有效地降低网络通信代价.实验结果表明,FMGMF算法是有效、可行的.  相似文献   

19.
Sequential Pattern Mining in Multi-Databases via Multiple Alignment   总被引:2,自引:0,他引:2  
To efficiently find global patterns from a multi-database, information in each local database must first be mined and summarized at the local level. Then only the summarized information is forwarded to the global mining process. However, conventional sequential pattern mining methods based on support cannot summarize the local information and is ineffective for global pattern mining from multiple data sources. In this paper, we present an alternative local mining approach for finding sequential patterns in the local databases of a multi-database. We propose the theme of approximate sequential pattern mining roughly defined as identifying patterns approximately shared by many sequences. Approximate sequential patterns can effectively summerize and represent the local databases by identifying the underlying trends in the data. We present a novel algorithm, ApproxMAP, to mine approximate sequential patterns, called consensus patterns, from large sequence databases in two steps. First, sequences are clustered by similarity. Then, consensus patterns are mined directly from each cluster through multiple alignment. We conduct an extensive and systematic performance study over synthetic and real data. The results demonstrate that ApproxMAP is effective and scalable in mining large sequences databases with long patterns. Hence, ApproxMAP can efficiently summarize a local database and reduce the cost for global mining. Furthremore, we present an elegant and uniform model to identify both high vote sequential patterns and exceptional sequential patterns from the collection of these consensus patterns from each local databases.  相似文献   

20.
OSAF-tree--可迭代的移动序列模式挖掘及增量更新方法   总被引:1,自引:0,他引:1  
移动通信技术和无限定位技术的发展积累了海量的、动态增长的时空数据.利用数据挖掘技术从移动用户的时空行为轨迹当中挖掘用户移动序列模式,在移动通信、交通管理、基于位置服务等领域有着广泛的应用前景.由于移动环境网络资源珍贵、数据量大的特点,传统的序列模式挖掘方法在效率上很难满足需求.OSAF-tree算法基于投影的概念,只需要对数据库进行一遍扫描,就可以很好地处理移动序列模式的挖掘及其增量更新和迭代挖掘问题,这是一个非常高效的算法.与已有的方法相比,OSAF-tree算法在性能和I/O代价等方面都具有明显的优势.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号