Similar Documents
20 similar documents found.
1.
武优西  刘茜  闫文杰  郭磊  吴信东 《软件学报》2021,32(11):3331-3350
Nonoverlapping sequential pattern mining is a form of gap-constrained sequential pattern mining. Compared with similar mining methods, it discovers valuable frequent patterns more easily. Its core problem is to compute the support (number of occurrences) of a given pattern in a sequence and then decide whether the pattern is frequent; computing the support is essentially nonoverlapping pattern matching. Existing studies compute the support by iteratively searching for a nonoverlapping occurrence and then pruning useless nodes, with a time complexity of O(m×m×n×W), where m, n, and W are the pattern length, the sequence length, and the maximum gap, respectively. To further speed up nonoverlapping pattern matching, and thus reduce the time of nonoverlapping sequential pattern mining, an efficient algorithm is proposed. The algorithm transforms the pattern matching problem into a Nettree; starting from the minimum root node of the Nettree, it iteratively searches for the leftmost child with a backtracking strategy to compute a nonoverlapping minimal occurrence, and after pruning this occurrence from the Nettree the problem can be solved without further searching for and pruning invalid nodes. The completeness of the algorithm is proved theoretically, and its time complexity is reduced to O(m×n×W). On this basis, three further similar solution strategies are pointed out: iteratively searching for the leftmost parent starting from the leftmost leaf, iteratively searching for the rightmost child starting from the rightmost root, and iteratively searching for the rightmost parent starting from the rightmost leaf. Experimental results verify the performance of the algorithm; in particular, in sequential pattern mining, mining algorithms that apply this method take less mining time.
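The counting problem at the core of this abstract can be illustrated with a small brute-force sketch (not the Nettree algorithm itself): occurrences are searched greedily from the left under a gap constraint, and the positions of each reported occurrence are consumed so that later occurrences cannot reuse them. All names and the example gap bounds are illustrative.

```python
# Illustrative sketch (not the Nettree algorithm from the paper): count
# nonoverlapping occurrences of a pattern in a sequence under a gap
# constraint [min_gap, max_gap] between adjacent pattern characters.
# "Nonoverlapping" here means two occurrences never reuse the same
# sequence position for the same pattern position.

def find_leftmost(seq, pattern, min_gap, max_gap, used):
    """Backtracking search for the leftmost occurrence that avoids
    positions already consumed (per pattern position) in `used`."""
    def extend(j, prev_pos, occ):
        if j == len(pattern):
            return occ
        lo = 0 if prev_pos is None else prev_pos + min_gap + 1
        hi = len(seq) - 1 if prev_pos is None else prev_pos + max_gap + 1
        for pos in range(lo, min(hi, len(seq) - 1) + 1):
            if seq[pos] == pattern[j] and pos not in used[j]:
                result = extend(j + 1, pos, occ + [pos])
                if result is not None:
                    return result
        return None
    return extend(0, None, [])

def nonoverlapping_support(seq, pattern, min_gap=0, max_gap=3):
    used = [set() for _ in pattern]          # positions consumed per pattern index
    count = 0
    while True:
        occ = find_leftmost(seq, pattern, min_gap, max_gap, used)
        if occ is None:
            return count
        for j, pos in enumerate(occ):        # consume the occurrence
            used[j].add(pos)
        count += 1

if __name__ == "__main__":
    print(nonoverlapping_support("aabbabab", "ab", min_gap=0, max_gap=2))  # 4
```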

2.
Pattern matching with wildcard gap constraints has important applications in research areas such as information retrieval, computational biology, and sequential pattern mining. A more general pattern matching problem is proposed: strict pattern matching with general gaps and length constraints (SPANGLO). The problem has the following four characteristics: it is a strict and exact form of pattern matching; a character at any position of the sequence may be used multiple times; the pattern may contain multiple general gaps; and the overall length of an occurrence is constrained. In the worst case, one SPANGLO instance is transformed into an exponential number of strict pattern matching instances with non-negative gaps. To solve the problem effectively, the Subnettree and its related concepts and properties are proposed. On this basis, the solving algorithm Subnettree Spanglo (SETS) is presented, together with proofs of its correctness and completeness; its space and time complexities are O(m×MaxLen×W) and O(MaxLen×W×m²×n), respectively, where m, n, MaxLen, and W are the pattern length, the sequence length, the maximum length constraint of an occurrence, and the maximum gap of the pattern. Experimental results verify both the correctness of the SPANGLO transformation method and the correctness and effectiveness of the algorithm.

3.
Mining minimal distinguishing subsequence patterns with gap constraints
Discovering contrasts between collections of data is an important task in data mining. In this paper, we introduce a new type of contrast pattern, called a Minimal Distinguishing Subsequence (MDS). An MDS is a minimal subsequence that occurs frequently in one class of sequences and infrequently in sequences of another class. It is a natural way of representing strong and succinct contrast information between two sequential datasets and can be useful in applications such as protein comparison, document comparison and building sequential classification models. Mining MDS patterns is a challenging task and is significantly different from mining contrasts between relational/transactional data. One particularly important type of constraint that can be integrated into the mining process is the gap constraint. We present an efficient algorithm called ConSGapMiner (Contrast Sequences with Gap Miner) to mine all MDSs satisfying a minimum and maximum gap constraint, plus a maximum length constraint. It employs highly efficient bitset and boolean operations for powerful gap-based pruning within a prefix-growth framework. A performance evaluation with both sparse and dense datasets demonstrates the scalability of ConSGapMiner and shows its ability to mine patterns from high-dimensional datasets at low supports.
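The bitset idea mentioned above can be sketched as follows. This is a simplified illustration of gap-constrained pattern extension with shifts and ANDs, not the ConSGapMiner implementation, and the frequency measure used here (distinct end positions in one sequence) is only one simple choice.

```python
# Illustrative sketch of bitset-based gap extension in the spirit of
# ConSGapMiner (not the authors' implementation): the end positions of a
# pattern's occurrences are kept as bits of an integer, and extending the
# pattern by one symbol under a gap constraint reduces to shifts and ANDs.

def symbol_bitset(seq, symbol):
    """Bit i is set iff seq[i] == symbol."""
    bits = 0
    for i, c in enumerate(seq):
        if c == symbol:
            bits |= 1 << i
    return bits

def extend_with_gap(end_bits, next_bits, min_gap, max_gap):
    """Given end positions of the current pattern (end_bits), return the end
    positions after appending a symbol whose positions are next_bits, with
    min_gap..max_gap wildcards allowed in between."""
    reachable = 0
    for g in range(min_gap, max_gap + 1):
        reachable |= end_bits << (g + 1)     # positions g+1 steps to the right
    return reachable & next_bits

def support(seq, pattern, min_gap=0, max_gap=2):
    """Number of distinct end positions of pattern occurrences (one simple
    frequency measure; the paper combines such counts across classes)."""
    end_bits = symbol_bitset(seq, pattern[0])
    for symbol in pattern[1:]:
        end_bits = extend_with_gap(end_bits, symbol_bitset(seq, symbol),
                                   min_gap, max_gap)
    return bin(end_bits).count("1")

if __name__ == "__main__":
    print(support("acagtacgt", "ag", min_gap=0, max_gap=1))  # 2
```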

4.
Mining frequent itemsets is a popular method for finding associated items in databases. For this method, support, the co-occurrence frequency of the items which form an association, is used as the primary indicator of an association's significance. A single, user-specified support threshold is used to decide whether associations should be further investigated. Support has some known problems with rare items, favors shorter itemsets, and sometimes produces misleading associations. In this paper we develop a novel model-based frequency constraint as an alternative to a single, user-specified minimum support. The constraint utilizes knowledge of the process generating transaction data by applying a simple stochastic mixture model (the NB model), which accounts for the typically highly skewed item frequency distribution of transaction data. A user-specified precision threshold is used together with the model to find local frequency thresholds for groups of itemsets. Based on the constraint we develop the notion of NB-frequent itemsets and adapt a mining algorithm to find all NB-frequent itemsets in a database. In experiments with publicly available transaction databases we show that the new constraint provides improvements over a single minimum support threshold and that the precision threshold is more robust and easier for the user to set and interpret.

5.
Given a database of graphs, structure mining algorithms search for all substructures that satisfy constraints such as minimum frequency, minimum confidence, minimum interest and maximum frequency. In order to make frequent subgraph mining more efficient, we propose to search with steps of increasing complexity. We present the GrAph/Sequence/Tree extractiON (Gaston) tool that implements this idea by searching first for frequent paths, then frequent free trees and finally cyclic graphs. We give results on large molecular databases.  相似文献   

6.
Sequential pattern mining (SPM) is an important data mining problem with broad applications. SPM is a hard problem due to the huge number of intermediate subsequences to be considered. State-of-the-art approaches for SPM (e.g., PrefixSpan, Pei et al. 2001) are largely based on the pattern-growth approach, where for each frequent prefix subsequence, only its related suffix subsequences need to be considered, and the database is recursively projected into smaller ones. Many authors have promoted the use of constraints to focus on the most promising patterns according to the interests of the end user. The top-k SPM problem is also used to cope with the difficulty of thresholding and to control the number of solutions. State-of-the-art methods developed for SPM and top-k SPM, though efficient, are locked into a rather rigid search strategy and suffer from a lack of declarativity and flexibility. Indeed, adding new constraints usually amounts to changing the data structures used in the core of the algorithm, and combining these new constraints often requires new developments. Recent works (e.g., Kemmar et al. 2014; Négrevergne and Guns 2015) have investigated the use of Constraint Programming (CP) for SPM. However, despite their nice declarative aspects, all these modelings have scaling problems due to the huge size of their constraint networks. To address this issue, we propose the Prefix-Projection global constraint, which encapsulates both the subsequence relation and the frequency constraint. Its filtering algorithm relies on the principle of projected databases, which allows keeping in the variables' domains only those values leading to a frequent pattern in the database. The Prefix-Projection filtering algorithm enforces domain consistency on the variable succeeding the current frequent prefix in polynomial time. This global constraint also allows for a straightforward implementation of additional constraints such as size, item membership, regular expressions, and any combination of them. Experimental results show that our approach clearly outperforms existing CP approaches and competes well with the state-of-the-art methods on large datasets for mining frequent sequential patterns, sequential patterns under various constraints, and top-k sequential patterns. Unlike existing CP methods, our approach achieves better scalability.
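A minimal sketch of the projected-database principle that the Prefix-Projection constraint builds on (PrefixSpan-style, not the CP propagator itself); the function names and the toy database are illustrative.

```python
# Minimal sketch of the projected-database idea: after fixing a frequent
# prefix, each sequence is replaced by its suffix following the first
# occurrence of the prefix, and only items frequent in those suffixes
# remain candidates for the next position of the pattern.

from collections import Counter

def project(database, prefix):
    """Return the suffixes of the sequences that contain `prefix` as a
    subsequence, starting right after its leftmost embedding."""
    projected = []
    for seq in database:
        pos = 0
        for item in prefix:
            while pos < len(seq) and seq[pos] != item:
                pos += 1
            if pos == len(seq):
                break
            pos += 1                       # move past the matched item
        else:
            projected.append(seq[pos:])
    return projected

def frequent_extensions(database, prefix, min_support):
    """Items that can extend `prefix` into a frequent pattern."""
    suffixes = project(database, prefix)
    counts = Counter()
    for suffix in suffixes:
        counts.update(set(suffix))         # count each item once per sequence
    return {item for item, c in counts.items() if c >= min_support}

if __name__ == "__main__":
    db = ["abcb", "acbc", "babc", "abbca"]
    print(frequent_extensions(db, "ab", min_support=3))   # {'c'}
```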

7.
In this paper, we present ParaMiner, a generic and parallel algorithm for closed pattern mining. ParaMiner is built on the principles of pattern enumeration in strongly accessible set systems. Its efficiency is due to a novel dataset reduction technique (which we call EL-reduction), combined with a novel technique for performing dataset reduction during parallel execution on a multi-core architecture. We illustrate ParaMiner's genericity by using this algorithm to solve three different pattern mining problems: the frequent itemset mining problem, the frequent connected relational graph mining problem, and the gradual itemset mining problem. In this paper, we prove the soundness and the completeness of ParaMiner. Furthermore, our experiments show that despite being a generic algorithm, ParaMiner can compete with specialized state-of-the-art algorithms designed for the pattern mining problems mentioned above. Moreover, for the particular problem of gradual itemset mining, ParaMiner outperforms the state-of-the-art algorithm by two orders of magnitude.

8.
The problem of enumerating minimal unsatisfiable subsets (MUSes) of an infeasible constraint system is challenging due first to the complexity of computing even a single MUS and second to the potentially intractable number of MUSes an instance may contain. In the face of the latter issue, when complete enumeration is not feasible, a partial enumeration of MUSes can be valuable, ideally with a time cost for each MUS output no greater than that needed to extract a single MUS. Recently, two papers independently presented a new MUS enumeration algorithm well suited to partial MUS enumeration (Liffiton and Malik, 2013, Previti and Marques-Silva, 2013). The algorithm exhibits good anytime performance, steadily producing MUSes throughout its execution; it is constraint agnostic, applying equally well to any type of constraint system; and its flexible structure allows it to incorporate advances in single MUS extraction algorithms and eases the creation of further improvements and modifications. This paper unifies and expands upon the earlier work, presenting a detailed explanation of the algorithm’s operation in a framework that also enables clearer comparisons to previous approaches, and we present a new optimization of the algorithm as well. Expanded experimental results illustrate the algorithm’s improvement over past approaches and newly explore some of its variants.  相似文献   
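A toy sketch of the seed-shrink-block loop described above (often referred to as a MARCO-style scheme); a real implementation drives the map of unexplored subsets with a SAT solver, whereas this illustration enumerates subsets explicitly and takes a caller-supplied satisfiability test, so it is only practical for very small constraint systems. All names and the example constraints are illustrative.

```python
# Illustrative sketch of partial MUS enumeration via seed / shrink / grow /
# block: unsat seeds are shrunk to MUSes (and their supersets blocked),
# sat seeds are grown to MSSes (and their subsets blocked).

from itertools import combinations

def shrink(constraints, sat):
    """Reduce an unsatisfiable set to a minimal unsatisfiable subset (MUS)."""
    mus = list(constraints)
    for c in list(mus):
        trial = [x for x in mus if x != c]
        if not sat(trial):          # still unsatisfiable without c -> drop c
            mus = trial
    return frozenset(mus)

def grow(subset, universe, sat):
    """Extend a satisfiable subset to a maximal satisfiable subset (MSS)."""
    mss = set(subset)
    for c in universe:
        if c not in mss and sat(list(mss) + [c]):
            mss.add(c)
    return frozenset(mss)

def enumerate_muses(universe, sat):
    """Yield MUSes one by one; blocked supersets/subsets form the 'map'."""
    blocked_up = []      # supersets of these (MUSes) are explored
    blocked_down = []    # subsets of these (MSSes) are explored
    def unexplored():
        for r in range(len(universe), -1, -1):       # prefer large seeds
            for seed in map(frozenset, combinations(universe, r)):
                if any(seed >= m for m in blocked_up):
                    continue
                if any(seed <= m for m in blocked_down):
                    continue
                return seed
        return None
    while (seed := unexplored()) is not None:
        if sat(list(seed)):
            blocked_down.append(grow(seed, universe, sat))
        else:
            mus = shrink(seed, sat)
            blocked_up.append(mus)
            yield mus

if __name__ == "__main__":
    # Toy constraint system over an integer x: each constraint is a predicate.
    constraints = {"x>5": lambda x: x > 5, "x<3": lambda x: x < 3,
                   "x>8": lambda x: x > 8, "x<20": lambda x: x < 20}
    names = sorted(constraints)
    sat = lambda subset: any(all(constraints[n](x) for n in subset)
                             for x in range(-50, 50))
    for mus in enumerate_muses(names, sat):
        print(sorted(mus))   # e.g. ['x<3', 'x>8'] and ['x<3', 'x>5']
```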

9.
An instance of the maximum constraint satisfaction problem (Max CSP) is a finite collection of constraints on a set of variables, and the goal is to assign values to the variables so as to maximise the number of satisfied constraints. Max CSP captures many well-known problems (such as Max k-SAT and Max Cut) and is consequently NP-hard. Thus, it is natural to study how restrictions on the allowed constraint types (or constraint language) affect the complexity and approximability of Max CSP. The PCP theorem is equivalent to the existence of a constraint language for which Max CSP has a hard gap at location 1; i.e., it is NP-hard to distinguish between satisfiable instances and instances where at most some constant fraction of the constraints are satisfiable. All constraint languages for which the CSP problem (i.e., the problem of deciding whether all constraints can be satisfied) is currently known to be NP-hard have a certain algebraic property. We prove that any constraint language with this algebraic property makes Max CSP have a hard gap at location 1, which, in particular, implies that such problems cannot have a PTAS unless P = NP. We then apply this result to Max CSP restricted to a single constraint type; this class of problems contains, for instance, Max Cut and Max DiCut. Assuming P ≠ NP, we show that such problems do not admit a PTAS except in some trivial cases. Our results hold even if the number of occurrences of each variable is bounded by a constant. Finally, we give some applications of our results.

10.
In the UK alone there are currently over 4.2 million operational CCTV cameras, that is, roughly one camera for every 14 people, and this figure is increasing at a fast rate throughout the world (especially after the tragic events of 9/11 and 7/7) (Norris, McCahill, & Wood, 2004). Security concerns are not the only factor driving the rapid growth of CCTV cameras. Another important reason is access to hidden knowledge extracted from CCTV footage for effective business decision making, such as store design, customer service, product marketing, reducing store shrinkage, etc. Events occurring in observed scenes are one of the most important semantic entities that can be extracted from videos (Anwar & Naftel, 2008). Most of the work presented in the past is based upon finding frequent event patterns or deals with discovering already known abnormal events. In contrast, in this paper we present a framework to discover unknown anomalous events associated with a frequent sequence of events (AEASP), that is, to discover events which are unlikely to follow a frequent sequence of events. This information can be very useful for discovering unknown abnormal events and can provide early actionable intelligence to redeploy resources to specific areas of view (such as a PTZ camera or the attention of a CCTV user). Discovery of anomalous events against a sequential pattern can also provide business intelligence for store management in the retail sector. The proposed event mining framework is an extension of our previous research work presented in Anwar et al. (2010) and also takes the temporal aspect of anomalous events into consideration, that is, it discovers anomalous events which hold for a specific time interval only and might not be anomalous against the frequent sequence of events over the whole time spectrum, and vice versa. To confront the memory-expensive process of searching for all instances of multiple sequential patterns in each data sequence, an efficient dynamic sequential pattern search mechanism is introduced. Several experiments are conducted to evaluate the accuracy and performance of the proposed mining algorithm for anomalous events against frequent sequences of events.

11.
The mobile wireless market has been attracting many customers. Technically, the paradigm of anytime-anywhere connectivity raises previously unthinkable challenges, including the management of millions of mobile customers, their profiles, profile-based selective information dissemination, and server-side computing infrastructure design issues to support such a large pool of users automatically and intelligently. In this paper, we propose a data mining technique for discovering frequent behavioral patterns from a collection of trajectories gathered by the Global Positioning System. Although the search space for spatiotemporal knowledge is extremely large, imposing spatial and temporal constraints on spatiotemporal sequences makes the computation feasible. Specifically, the mined patterns incorporate synthetic constraints, namely a spatiotemporal sequence length restriction, minimum and maximum timing gaps between events, a time window for the occurrence of the whole pattern, inclusion or exclusion event constraints, and frequent movement patterns predictive of one or more classes. The algorithm for mining all frequent constrained patterns is named cAllMOP. Moreover, a clustering algorithm is exploited to control the density of pattern regions. The proposed method is efficient and scalable, and its efficiency is better than that of the previous algorithms AllMOP and GSP with respect to the compactness of discovered knowledge, execution time, and memory requirements.

12.
Mining frequent tree patterns has many applications in different areas such as XML data, bioinformatics and World Wide Web. The crucial step in frequent pattern mining is frequency counting, which involves a matching operator to find occurrences (instances) of a tree pattern in a given collection of trees. A widely used matching operator for tree-structured data is subtree homeomorphism, where an edge in the tree pattern is mapped onto an ancestor-descendant relationship in the given tree. Tree patterns that are frequent under subtree homeomorphism are usually called embedded patterns. In this paper, we present an efficient algorithm for subtree homeomorphism with application to frequent pattern mining. We propose a compact data-structure, called occ, which stores only information about the rightmost paths of occurrences and hence can encode and represent several occurrences of a tree pattern. We then define efficient join operations on the occ data-structure, which help us count occurrences of tree patterns according to occurrences of their proper subtrees. Based on the proposed subtree homeomorphism method, we develop an effective pattern mining algorithm, called TPMiner. We evaluate the efficiency of TPMiner on several real-world and synthetic datasets. Our extensive experiments confirm that TPMiner always outperforms well-known existing algorithms, and in several cases the improvement with respect to existing algorithms is significant.  相似文献   

13.
This paper describes the educational game, TopOpt Game, which invites the player to solve various optimization challenges. The main purpose of gamifying topology optimization is to create a supplemental educational tool which can be used to introduce concepts of topology optimization to newcomers as well as to train human intuition of topology optimization. The players are challenged to solve the standard minimum compliance problem in 2D by distributing material in a design domain given a number of loads and supports with a material constraint. A statistical analysis of the gameplay data shows that players achieve higher scores the more they play the game. The game is freely available for the iOS platform at Apple’s App Store and at http://www.topopt.dtu.dk/?q=node/909 for Windows and OSX.  相似文献   

14.
Frequent subgraph mining has been extensively studied on certain graph data. However, uncertainty is intrinsic in graph data in practice, but there is very little work on mining uncertain graph data. This paper focuses on mining frequent subgraphs over uncertain graph data under the probabilistic semantics. Specifically, a measure called the $\varphi$-frequent probability is introduced to evaluate the degree of recurrence of subgraphs. Given a set of uncertain graphs and two real numbers $0 < \varphi, \tau < 1$, the goal is to quickly find all subgraphs with $\varphi$-frequent probability at least $\tau$. Due to the NP-hardness of the problem and to the #P-hardness of computing the $\varphi$-frequent probability of a subgraph, an approximate mining algorithm is proposed to produce an $(\varepsilon, \delta)$-approximate set $\Pi$ of "frequent subgraphs", where $0 < \varepsilon < \tau$ is the error tolerance and $0 < \delta < 1$ is a confidence bound. The algorithm guarantees that (1) any frequent subgraph $S$ is contained in $\Pi$ with probability at least $((1-\delta)/2)^s$, where $s$ is the number of edges in $S$; (2) any infrequent subgraph with $\varphi$-frequent probability less than $\tau - \varepsilon$ is contained in $\Pi$ with probability at most $\delta/2$. The theoretical analysis shows that to obtain any frequent subgraph with probability at least $1 - \Delta$, the input parameter $\delta$ of the algorithm must be set to at most $1 - 2(1-\Delta)^{1/\ell_{\max}}$, where $0 < \Delta < 1$ and $\ell_{\max}$ is the maximum number of edges in frequent subgraphs. Extensive experiments on real uncertain graph data verify that the proposed algorithm is practically efficient and has very high approximation quality. Moreover, the difference between the probabilistic semantics and the expected semantics on mining frequent subgraphs over uncertain graph data is discussed in this paper for the first time.
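The bound on $\delta$ quoted above follows directly from guarantee (1) by requiring it to hold for the largest pattern size; a short derivation in the abstract's notation (no assumptions beyond the stated guarantee) is

$$\Bigl(\tfrac{1-\delta}{2}\Bigr)^{\ell_{\max}} \ge 1-\Delta \;\Longleftrightarrow\; \tfrac{1-\delta}{2} \ge (1-\Delta)^{1/\ell_{\max}} \;\Longleftrightarrow\; \delta \le 1 - 2\,(1-\Delta)^{1/\ell_{\max}}.$$

Since $(1-\delta)/2 < 1$, the quantity $((1-\delta)/2)^s$ is smallest at $s=\ell_{\max}$, so enforcing the bound for $\ell_{\max}$ covers every frequent subgraph with at most $\ell_{\max}$ edges.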

15.
The field of data mining has become accustomed to specifying constraints on patterns of interest. A large number of systems and techniques have been developed for solving such constraint-based mining problems, especially for mining itemsets. The approach taken in the field of data mining contrasts with the constraint programming principles developed within the artificial intelligence community. While most data mining research focuses on algorithmic issues and aims at developing highly optimized and scalable implementations that are tailored towards specific tasks, constraint programming employs a more declarative approach. The emphasis lies on developing high-level modeling languages and general solvers that specify what the problem is, rather than outlining how a solution should be computed, yet are powerful enough to be used across a wide variety of applications and application domains. This paper contributes a declarative constraint programming approach to data mining. More specifically, we show that it is possible to employ off-the-shelf constraint programming techniques for modeling and solving a wide variety of constraint-based itemset mining tasks, such as frequent, closed, discriminative, and cost-based itemset mining. In particular, we develop a basic constraint programming model for specifying frequent itemsets and show that this model can easily be extended to realize the other settings. This contrasts with typical procedural data mining systems where the underlying procedures need to be modified in order to accommodate new types of constraint, or novel combinations thereof. Even though state-of-the-art data mining systems outperform the constraint programming approach on some standard tasks, we also show that there exist problems where the constraint programming approach leads to significant performance improvements over state-of-the-art data mining methods, as well as to new insights into the underlying data mining problems. Many such insights can be obtained by relating the underlying search algorithms of data mining and constraint programming systems to one another. We discuss a number of interesting new research questions and challenges raised by the declarative constraint programming approach to data mining.

16.
Pattern matching (or string matching) is an essential task in computer science, especially in sequential pattern mining, since pattern matching methods can be used to calculate the support (or the number of occurrences) of a pattern and thus to determine whether the pattern is frequent or not. A state-of-the-art approach to sequential pattern mining with gap constraints (or flexible wildcards) uses the number of non-overlapping occurrences to denote the frequency of a pattern. Non-overlapping means that any two occurrences cannot use the same character of the sequence at the same position of the pattern. In this paper, we investigate strict pattern matching under the non-overlapping condition. We first show that the problem is in P. We then propose an algorithm, called NETLAP-Best, which uses the Nettree structure. NETLAP-Best transforms the pattern matching problem into a Nettree and iteratively finds the rightmost root-leaf path, pruning the useless nodes in the Nettree after removing that path. We show that NETLAP-Best is a complete algorithm and analyse its time and space complexities. Extensive experimental results demonstrate the correctness and efficiency of NETLAP-Best.

17.
Closing the approximability gap between 3/2 and 2 for the minimum makespan problem on unrelated machines is one of the most important open questions in scheduling. Almost all known approximation algorithms for the problem are based on linear programs (LPs). In this paper, we identify a surprisingly simple class of instances which constitute the core difficulty for LPs: the so far hardly studied unrelated graph balancing case, in which each job can be assigned to at most two machines. We prove that already for this basic setting the strongest LP-relaxation studied so far, the configuration-LP, has an integrality gap of 2, matching the best known approximation factor for the general case. This points toward an interesting direction of future research. For the objective of maximizing the minimum machine load in the unrelated graph balancing setting, we present an elegant, purely combinatorial 2-approximation algorithm with only quadratic running time. Our algorithm uses a novel preprocessing routine that estimates the optimal value as well as the configuration-LP does. This improves on the computationally costly LP-based algorithm by Chakrabarty et al. (Proceedings of the 50th Annual Symposium on Foundations of Computer Science (FOCS 2009), pp 107-116, 2009) that achieves the same approximation guarantee.

18.
In graph mining, a frequency measure for graphs is anti-monotonic if the frequency of a pattern never exceeds the frequency of a subpattern. The efficiency and correctness of most graph pattern miners rely critically on this property. We study the case where frequent subgraphs have to be found in one graph. Vanetik et al. (Data Min Knowl Disc 13(2):243-260, 2006) already gave sufficient and necessary conditions for anti-monotonicity of graph measures depending only on the edge-overlaps between the instances of the pattern in a labeled graph. We extend these results to homomorphisms, isomorphisms and homeomorphisms on both labeled and unlabeled, directed and undirected graphs, for vertex- and edge-overlap. We show a set of reductions between the different morphisms that preserve overlap. As a secondary contribution, we prove that the popular maximum independent set measure assigns the minimal possible normalized frequency, and we introduce a new measure based on the minimum clique partition that assigns the maximum possible normalized frequency. In that way, we obtain that all normalized anti-monotonic overlap graph measures are bounded from above and below. We also introduce a new measure sandwiched between the former two, based on the polynomial-time computable Lovász ϑ-function.
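A small brute-force sketch of the maximum-independent-set measure mentioned above: the instances of a pattern form the nodes of an overlap graph, and the measure is the size of the largest set of pairwise non-overlapping instances. The embeddings, the edge-overlap test and the exhaustive search are illustrative only; real miners use far more efficient machinery.

```python
# Illustrative sketch of overlap-graph-based support in single-graph mining:
# pattern instances are nodes, overlapping instances are joined by an edge,
# and the maximum-independent-set (MIS) measure counts the largest set of
# pairwise non-overlapping instances. Brute force, for small instance sets.

from itertools import combinations

def mis_support(instances, overlaps):
    """instances: list of embeddings (e.g., frozensets of edge ids).
    overlaps(a, b): True if two embeddings overlap. Returns the MIS size."""
    for r in range(len(instances), 0, -1):
        for subset in combinations(instances, r):
            if all(not overlaps(a, b) for a, b in combinations(subset, 2)):
                return r                 # first independent set found at size r
    return 0

if __name__ == "__main__":
    # Embeddings described by the sets of graph edges they use;
    # edge-overlap means sharing at least one edge.
    embeddings = [frozenset({1, 2}), frozenset({2, 3}),
                  frozenset({3, 4}), frozenset({5, 6})]
    edge_overlap = lambda a, b: bool(a & b)
    print(mis_support(embeddings, edge_overlap))   # 3: {1,2}, {3,4}, {5,6}
```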

19.
吴信东  谢飞  黄咏明  胡学钢  高隽 《软件学报》2013,24(8):1804-1815
Many application domains produce large amounts of sequence data, and mining valuable patterns from such data has become the main task of sequential pattern mining research. This work studies the following problem: given a sequence S, a support threshold, and gap constraints, mine from S all frequent sequential patterns whose number of occurrences is no less than the given support threshold, where the positions in the sequence of any two adjacent elements of a pattern must satisfy the user-defined gap constraints. An effective mining algorithm with wildcards, One-Off Mining, is designed; the occurrences of a pattern in the sequence satisfy the one-off condition, that is, any two occurrences of a pattern do not share a character at the same position of the sequence. Experimental results on biological DNA sequences show that One-Off Mining has better time performance and completeness than related sequential pattern mining algorithms.

20.
Discovering patterns with great significance is an important problem in the data mining discipline. An episode is defined to be a partially ordered set of events occurring in consecutive, fixed-length time intervals of a sequence. Most previous studies on episodes consider only frequent episodes in a sequence of events (called a simple sequence). In the real world, we may find a set of events at each time slot in terms of various intervals (hours, days, weeks, etc.). We refer to such sequences as complex sequences. Mining frequent episodes in complex sequences has more extensive applications than mining them in simple sequences. In this paper, we discuss the problem of mining frequent episodes in a complex sequence. We extend the previous algorithm MINEPI to MINEPI+ for episode mining from complex sequences. Furthermore, a memory-anchored algorithm called EMMA is introduced for the mining task. Experimental evaluation on both real-world and synthetic data sets shows that EMMA is more efficient than MINEPI+.

