Similar literature
 19 similar documents found.
1.
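Wang Xikui. Silicon Valley (《硅谷》), 2011, (10): 191-192, 157
Mining all frequent itemsets from dense datasets with the Apriori and FP-growth algorithms is expensive. To address this problem, an association rule mining algorithm based on an array of linked lists is proposed. The method uses the linked-list array to build a transaction list for each item, so the database needs to be scanned only once to obtain the support of every candidate itemset quickly, and frequent itemsets are thus discovered efficiently. A comparison with the classical algorithms shows that the proposed algorithm mines faster.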
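The idea of per-item transaction lists in the abstract above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names are hypothetical, plain Python lists stand in for the linked-list array, and candidates are enumerated by brute force rather than by whatever candidate generation the paper uses.

```python
from collections import defaultdict
from itertools import combinations


def build_tid_lists(transactions):
    """Single pass over the database: map each item to the ids of the transactions containing it."""
    tid_lists = defaultdict(list)
    for tid, items in enumerate(transactions):
        for item in set(items):
            tid_lists[item].append(tid)
    return tid_lists


def support(itemset, tid_lists):
    """Support of a candidate is the size of the intersection of its items' transaction lists."""
    tids = None
    for item in itemset:
        current = set(tid_lists.get(item, ()))
        tids = current if tids is None else tids & current
    return len(tids) if tids is not None else 0


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration (an assumption of this sketch); no rescan of the database is needed."""
    tid_lists = build_tid_lists(transactions)
    items = sorted(i for i in tid_lists if len(tid_lists[i]) >= min_support)
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, k):
            s = support(candidate, tid_lists)
            if s >= min_support:
                result[candidate] = s
                found = True
        if not found:  # no frequent k-itemset means no frequent (k+1)-itemset
            break
    return result


transactions = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "c"]]
result = frequent_itemsets(transactions, min_support=2)
```

After the single scan in `build_tid_lists`, every support count is answered from memory, which is the property the abstract emphasizes.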

2.
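Apriori is one of the most classical and widely used association rule mining algorithms. However, it must scan the database repeatedly, which incurs heavy I/O cost; it generates a huge set of candidate 2-itemsets while deriving the frequent 2-itemsets; and when filtering for frequent k-itemsets it does not exclude elements that should not take part in the combination. These three factors make the algorithm inefficient. To address them, an algorithm is proposed that obtains frequent itemsets by multiplying compressed transaction matrices: the database is scanned only once, a transaction matrix is produced after compression, and frequent itemsets are obtained through matrix operations, which effectively improves the efficiency of association rule mining.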
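A rough sketch of how a matrix product can yield 2-itemset supports follows. It assumes a plain 0/1 transaction matrix M (rows are transactions, columns are items) and computes the product of M transposed with M; the compression step mentioned in the abstract is not modeled, and all names are hypothetical.

```python
def transaction_matrix(transactions, items):
    """0/1 matrix built in one scan: entry [t][i] is 1 when transaction t contains items[i]."""
    return [[1 if item in t else 0 for item in items] for t in map(set, transactions)]


def cooccurrence(matrix):
    """Compute M^T * M: entry [i][j] counts the transactions containing both item i and item j."""
    n = len(matrix[0])
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n)] for i in range(n)]


def frequent_pairs(transactions, items, min_support):
    """Read frequent 2-itemsets off the upper triangle of the co-occurrence matrix."""
    counts = cooccurrence(transaction_matrix(transactions, items))
    return {
        (items[i], items[j]): counts[i][j]
        for i in range(len(items))
        for j in range(i + 1, len(items))
        if counts[i][j] >= min_support
    }


items = ["a", "b", "c", "d"]
transactions = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "c"]]
pairs = frequent_pairs(transactions, items, min_support=2)
```

The diagonal of the product holds the single-item supports, so one matrix operation covers both the frequent 1-itemsets and the frequent 2-itemsets without generating candidate pairs explicitly.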

3.
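Dun Yijie. Silicon Valley (《硅谷》), 2010, (5): 62-62, 121
The main research goal of association rule mining is to discover hidden, interesting regularities among attributes and relationships among data in large datasets; association rule mining algorithms aim to extract meaningful associations among the items of a transaction dataset. Apriori is the most classical of these algorithms, yet the candidate itemsets it generates are still enormous. Through an in-depth study of the support counts of candidate itemsets in Apriori, five rules are summarized and applied to the Apriori algorithm.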

4.
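Ding Bangxu. Silicon Valley (《硅谷》), 2012, (5): 152-153
The nature of data streams requires that a mining algorithm obtain its results in a single scan and with low space complexity. Taking these characteristics into account, a new sliding-window-based algorithm for mining frequent itemsets in data streams, MFIM, is proposed. The algorithm represents the transaction sequence in the sliding window as a matrix of binary vectors and uses this new structure to record the dynamic changes of frequent itemsets, thereby mining frequent itemsets in the data stream effectively. Theoretical analysis and experimental results show that the algorithm achieves good time and space complexity.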
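One way to picture a binary-vector representation of a sliding window is sketched below. This is not MFIM itself: the class name and its methods are hypothetical, and each item's row of the matrix is stored as a Python integer whose bit 0 marks the newest transaction.

```python
class SlidingWindowBits:
    """Per-item bit vectors over the last `window_size` transactions of a stream."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.mask = (1 << window_size) - 1
        self.bits = {}  # item -> int; bit 0 is the newest transaction

    def add(self, transaction):
        """Slide the window by one: shift every vector, drop expired bits, mark the new items."""
        for item in list(self.bits):
            shifted = (self.bits[item] << 1) & self.mask
            if shifted:
                self.bits[item] = shifted
            else:
                del self.bits[item]  # item no longer occurs anywhere in the window
        for item in set(transaction):
            self.bits[item] = self.bits.get(item, 0) | 1

    def support(self, itemset):
        """Support in the current window: count the 1-bits of the AND of the item vectors."""
        vec = self.mask
        for item in itemset:
            vec &= self.bits.get(item, 0)
        return bin(vec).count("1")


window = SlidingWindowBits(window_size=3)
for t in (["a", "b"], ["a"], ["b"], ["a", "b"]):
    window.add(t)
```

Each arriving transaction is processed once, and the oldest transaction falls out of the window through the mask, so memory stays bounded by the window size times the number of live items.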

5.
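To improve the performance of frequent itemset mining algorithms based on the vertical representation of a database, an approach that uses index arrays to improve computational performance is presented. The concept of an index array and a method for computing it are proposed, together with a new, efficient frequent itemset mining algorithm, Index-FIMiner. The algorithm greatly reduces unnecessary tidset intersections and the corresponding frequency checks. It is also shown that a representative item can be joined directly with every combination of the itemsets in its containment index, and that the support of each resulting itemset equals the support of the representative item, which lowers the cost of processing these frequent itemsets and improves performance. Experimental results show that Index-FIMiner achieves high mining efficiency.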

6.
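Hu Jun. Silicon Valley (《硅谷》), 2010, (21): 175-175
With the development of database technology, efficient data mining algorithms help people re-examine and understand data. FP-growth, an association rule mining algorithm based on the FP-tree, is currently one of the most widely used algorithms for mining frequent itemsets. This paper briefly describes several main directions in which the algorithm has developed.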

7.
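Active data collection technology for usage mining in online advertising   Total citations: 1, self-citations: 0, citations by others: 1
This work focuses on two problems in Web usage mining: the large amount of work required before pattern mining, and insufficiently accurate transaction identification. A new solution, active data collection, is proposed and then studied in terms of the logical structure of online advertising, the data structures required for pattern mining, the key algorithms, and the software architecture. The results show that active data collection provides Web usage mining with more complete and accurate data than server logs.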

8.
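Wang Lei. Silicon Valley (《硅谷》), 2011, (24): 69-70
The Security Operation Center (SOC) is the core platform of a security management system, and the correlation analysis engine is the core of the SOC. A prototype correlation analysis engine is designed and implemented, which innovatively applies the Carma algorithm from sequential pattern mining in the engine. The algorithm is also improved: in the first step of the original algorithm, the length of the current transaction subset v is bounded and the condition for inserting v into the set V is relaxed, which significantly reduces the number of patterns generated by frequent itemset mining and speeds up mining. The engine can effectively eliminate or reduce duplicate and redundant alerts, logically correlate the alert events of multi-step attacks in massive data to uncover hidden attack strategies, and automatically generate association rules, and it provides an early-warning function.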

9.
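Production practice generates large amounts of time series data, and mining these data can guide production. Time series data are usually high-dimensional. To preserve the shape of the original series, earlier studies proposed representing a time series by its important points, but after the important points are selected the series is still affected by noise points. For this reason, polynomial smoothing filtering (Savitzky-Golay) of the time series in the preprocessing stage is proposed for the first time, after which important points are selected from the smoothed series. Important points are selected with an algorithm that takes the values of three consecutive points, and, to reduce the length of the series further, new constraints are added to the selection of extreme points. Time series similarity is measured with the fast dynamic time warping algorithm (FastDTW). Experiments show that the proposed algorithm is feasible and effective.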
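The smoothing and point-selection steps described above can be sketched as follows, under stated assumptions: the filter uses the standard 5-point quadratic Savitzky-Golay coefficients (-3, 12, 17, 12, -3)/35, the two samples at each end are left unsmoothed, and important points are taken to be the endpoints plus strict local extrema among three consecutive values. The paper's window size, polynomial order, and extra extremum constraints are not given in the abstract, and the function names are hypothetical.

```python
SG5 = (-3, 12, 17, 12, -3)  # 5-point quadratic Savitzky-Golay weights, divided by 35 below


def savgol5(series):
    """Smooth interior samples with the 5-point filter; the two samples at each end are kept as-is."""
    out = list(series)
    for i in range(2, len(series) - 2):
        out[i] = sum(c * series[i + k - 2] for k, c in enumerate(SG5)) / 35
    return out


def important_points(series):
    """Indices of the endpoints plus every strict local extremum of three consecutive values."""
    n = len(series)
    if n < 3:
        return list(range(n))
    idx = [0]
    for i in range(1, n - 1):
        prev, cur, nxt = series[i - 1], series[i], series[i + 1]
        if (cur > prev and cur > nxt) or (cur < prev and cur < nxt):
            idx.append(i)
    idx.append(n - 1)
    return idx


noisy = [0.0, 1.0, 0.8, 2.0, 3.5, 3.0, 2.0, 2.2, 1.0, 0.0]
smoothed = savgol5(noisy)
points = important_points(smoothed)
```

Smoothing first means that small noise-induced wiggles are less likely to register as extrema, so the reduced series keeps the shape of the original with fewer points.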

10.
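Peng Zhaiming, Cheng Longsheng, Yao Qifeng. Journal of Vibration and Shock (《振动与冲击》), 2022, (13): 239-245+251
Degradation pattern mining is of great significance for predicting the remaining useful life of complex systems. To understand the operating state of a system and grasp its degradation behavior, a degradation pattern mining method based on time series clustering is proposed. First, an improved Mahalanobis-Taguchi system is used to select and fuse features from multi-sensor data, and a health index is constructed to characterize the degradation trend of the system. Then, the cumulative sum (CUSUM) algorithm is used to segment the health curve and obtain degradation curves, and a hierarchical clustering algorithm based on the dynamic time warping distance is used to group the degradation patterns. Finally, the degradation patterns of the system are identified using similarity and degradation time as criteria. A study on aero-engines shows that the method can effectively mine and identify degradation patterns, providing a basis for predicting the remaining useful life of complex systems.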
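The dynamic time warping distance that drives the clustering step can be sketched with the textbook dynamic program below. It assumes one-dimensional series and absolute difference as the local cost; the paper's exact cost function and any window constraint are not stated in the abstract, and `dtw_distance` is a hypothetical name.

```python
def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) dynamic time warping distance between two 1-D series."""
    inf = float("inf")
    n, m = len(x), len(y)
    # cost[i][j] is the cheapest alignment of x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # x advances, y[j-1] is reused
                cost[i][j - 1],      # y advances, x[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
offset = dtw_distance([0, 0], [1, 1])
```

Because the alignment may stretch either series, two degradation curves with the same shape but different durations get a small distance, which is why the measure suits curves of unequal length.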

11.
Correct identification of a peptide sequence from MS/MS data is still a challenging research problem, particularly in proteomic analyses of higher eukaryotes where protein databases are large. The scoring methods of search programs often generate cases where incorrect peptide sequences score higher than correct peptide sequences (referred to as distraction). Because smaller databases yield less distraction and better discrimination between correct and incorrect assignments, we developed a method for editing a peptide-centric database (PC-DB) to remove unlikely sequences and strategies for enabling search programs to utilize this peptide database. Rules for unlikely missed cleavage and nontryptic proteolysis products were identified by data mining 11 849 high-confidence peptide assignments. We also evaluated ion exchange chromatographic behavior as an editing criterion to generate subset databases. When used to search a well-annotated test data set of MS/MS spectra, we found no loss of critical information using PC-DBs, validating the methods for generating and searching against the databases. On the other hand, improved confidence in peptide assignments was achieved for tryptic peptides, measured by changes in DeltaCN and RSP. Decreased distraction was also achieved, consistent with the 3- to 9-fold decrease in database size. Data mining identified a major class of common nonspecific proteolytic products corresponding to leucine aminopeptidase (LAP) cleavages. Large improvements in identifying LAP products were achieved using the PC-DB approach when compared with conventional searches against protein databases. These results demonstrate that peptide properties can be used to reduce database size, yielding improved accuracy and information capture due to reduced distraction, but with little loss of information compared to conventional protein database searches.

12.
Sequential pattern mining is a prominent method for discovering knowledge in large databases. Common sequential pattern mining algorithms handle static databases. In practice, however, the database grows exponentially, which makes it necessary to design mining algorithms for this setting. Once the database is updated, the previous mining result becomes incorrect, and the entire mining process must be restarted on the updated sequential database. Incremental mining of sequential patterns avoids rescanning the entire database. Previous approaches and techniques are a priori-based frameworks. We propose an algorithm called STISPM for incremental mining of sequential patterns using the sequence tree space structure. STISPM uses a depth-first approach along with backward tracking and a dynamic lookahead pruning strategy that removes infrequent patterns. The path from the root node to any leaf node depicts a sequential pattern in the database. The structural characteristics of the sequence tree make it well suited to incremental sequential pattern mining. The sequence tree also stores all the sequential patterns with their counts, so whenever the support is changed, our algorithm, using the frequent sequence tree as the storage structure, can find all the sequential patterns without mining the database again.

13.
Protein sequence comparison is the most powerful tool for the inference of novel protein structure and function. This type of inference is commonly based on the similar sequence-similar structure-similar function paradigm, and derived by sequence similarity searching on databases of protein sequences. As entire genomes are being determined at a rapid rate, computational methods for comparing protein sequences will be more essential for probing the complexity of molecular machines. In this paper we introduce a pattern-comparison algorithm, which is based on the mathematical concepts of linear predictive coding (LPC) and LPC cepstral distortion measure, for computing similarities/dissimilarities between protein sequences. Experimental results on a real data set of functionally related and functionally unrelated protein sequences have shown the effectiveness of the proposed approach in both accuracy and computational efficiency.

14.
We consider the economic lot scheduling problem with returns by assuming that each item is returned at a constant rate of demand. The goal is to find production frequencies, production sequences, production times, as well as idle times for several items subject to returns at a single facility. We propose a heuristic algorithm based on a time-varying (TV) lot sizes approach. The problem is decomposed into two distinct portions: in the first, we find a combinatorial part (production frequencies and sequences) and in the second, we determine a continuous part (production and idle times) in a specific production sequence. We report computational results that show that, in many cases, the proposed TV lot sizes approach with consideration of returns yields a relatively minor error.

15.
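To address data and pattern redundancy in frequent itemset mining, algorithms for mining maximal frequent itemsets in data streams are studied. DSM-MFI, a typical current algorithm for mining maximal frequent patterns in data streams, consumes a large amount of storage and executes inefficiently. To address these problems, MMFI-DS, an algorithm for mining maximal frequent itemsets within the landmark window of a data stream, is proposed. The algorithm first uses an SEFI-tree to store the important information about the relevant maximal frequent itemsets contained in the continuously growing data stream, while deleting the many infrequent items from the SEFI-tree; it then mines the series of maximal frequent itemsets in the landmark window with a bidirectional search strategy that combines top-down and bottom-up search. Theoretical analysis and experiments show that the algorithm is more efficient than DSM-MFI and saves storage space.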
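For readers unfamiliar with the term, the snippet below illustrates only the definition of a maximal frequent itemset: a frequent itemset with no frequent proper superset. It says nothing about the SEFI-tree or the bidirectional search of MMFI-DS, and `maximal_itemsets` is a hypothetical helper that filters an already computed collection.

```python
def maximal_itemsets(frequent):
    """Keep only the itemsets that are not a proper subset of another frequent itemset."""
    sets = [frozenset(s) for s in frequent]
    return [s for s in sets if not any(s < other for other in sets)]


frequent = [{"a"}, {"b"}, {"c"}, {"a", "c"}, {"b", "c"}]
maximal = maximal_itemsets(frequent)
```

Here five frequent itemsets collapse to two maximal ones, and every frequent itemset is a subset of one of them, which is the redundancy reduction the abstract refers to.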

16.
Existing studies have challenged the current definition of named bacterial species, especially in the case of highly recombinogenic bacteria. This has led to considering the use of computational procedures to examine potential bacterial clusters that are not identified by species naming. This paper describes the use of sequence data obtained from MLST databases as input for a k-means algorithm extended to deal with housekeeping gene sequences as a metric of similarity for the clustering process. An implementation of the k-means algorithm has been developed based on an existing source code implementation, and it has been evaluated against MLST data. Results point to potential bacterial clusters that are close to more than one different named species and thus may become candidates for alternative classifications accounting for genotypic information. The use of hierarchical clustering with sequence comparison as similarity metric has the potential to find clusters different from named species by using a more informed cluster formation strategy than a conventional nominal variant of the algorithm.

17.
In this study, we propose a simple and novel data structure using hyper-links, H-struct, and a new mining algorithm, H-mine, which takes advantage of this data structure and dynamically adjusts links in the mining process. A distinct feature of this method is that it has a very limited and precisely predictable main memory cost and runs very quickly in memory-based settings. Moreover, it can be scaled up to very large databases using database partitioning. When the data set becomes dense, (conditional) FP-trees can be constructed dynamically as part of the mining process. Our study shows that H-mine performs excellently on various kinds of data, outperforms currently available algorithms in different settings, and is highly scalable to mining large databases. This study also proposes a new data mining methodology, space-preserving mining, which may have a major impact on the future development of efficient and scalable data mining methods.

18.
This paper introduces a new optimization algorithm for the minimization of the time sidelobes of the correlation function of a pseudonoise (PN) sequence by applying dynamic weighting to the sequence. The resulting optimized time sidelobe level sequences are to be used in direct sequence spread spectrum (DS-SS) systems with digital modulations such as BPSK, DPSK, QPSK, etc. The new optimization algorithm starts with a PN sequence. It first optimizes the correlation time sidelobes for the case where the consecutive data bits are identical (11 or 00). It then optimizes the correlation time sidelobes for the case of alternating consecutive data bits (10 or 01). The suppressed time sidelobe level sequences are derived by iterating these algorithms alternately starting from the initial PN sequence. The derived suppressed time sidelobe sequences show excellent correlation characteristics when compared to conventional PN sequences such as maximal length sequences, Gold sequences and Barker codes. Surface acoustic wave (SAW) devices were used to implement the optimized time sidelobe level sequences in a matched filter pair. The design of the apodized SAW-matched filters and their predicted second order effects are presented. The experimental results for the SAW-matched filters for the optimized time sidelobe level sequences derived from a Barker code were found to be in good agreement with the theoretical predictions from this new algorithm.
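As background for the two data-bit cases named above, the following sketch computes correlation time sidelobes for the unweighted 13-chip Barker code. It assumes an idealized matched filter, in which identical consecutive bits produce the even (periodic) correlation R(k) + R(N-k) and alternating bits produce the odd correlation R(k) - R(N-k), where R is the aperiodic autocorrelation. The paper's dynamic weighting and optimization loop are not reproduced, and the function names are hypothetical.

```python
BARKER13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]


def aperiodic_autocorr(seq):
    """R(k) for k = 0..N-1: correlation of the sequence with itself shifted by k chips."""
    n = len(seq)
    return [sum(seq[i] * seq[i + k] for i in range(n - k)) for k in range(n)]


def bit_pair_sidelobes(seq):
    """Sidelobes at shifts 1..N-1 for identical (11/00) and alternating (10/01) data bits."""
    r = aperiodic_autocorr(seq)
    n = len(seq)
    identical = [r[k] + r[n - k] for k in range(1, n)]    # even (periodic) correlation
    alternating = [r[k] - r[n - k] for k in range(1, n)]  # odd correlation
    return identical, alternating


r = aperiodic_autocorr(BARKER13)
identical, alternating = bit_pair_sidelobes(BARKER13)
peak_sidelobe = max(abs(v) for v in r[1:])
```

For this code the main peak is 13 and no sidelobe exceeds magnitude 1 in either bit case, which gives a sense of the baseline that an optimized weighted sequence is compared against.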

19.
The characterization of proteomes by mass spectrometry is largely limited to organisms with sequenced genomes. To identify proteins from organisms with unsequenced genomes, database sequences from related species must be employed for sequence-similarity protein identifications. Peptide sequence tags (Mann, 1994) have been used successfully for the identification of proteins in sequence databases using partially interpreted tandem mass spectra of tryptic peptides. We have extended the ability of sequence tag searching to the identification of proteins whose sequences are yet unknown but are homologous to known database entries. The MultiTag method presented here assigns statistical significance to matches of multiple error-tolerant sequence tags to a database entry and ranks alignments by their significance. The MultiTag approach has the distinct advantage over other sequence-similarity approaches of being able to perform sequence-similarity identifications using only very short (2-4) amino acid residue stretches of peptide sequences, rather than complete peptide sequences deduced by de novo interpretation of tandem mass spectra. This feature facilitates the identification of low abundance proteins, since noisy and low-intensity tandem mass spectra can be utilized.
