Similar documents
Found 20 similar documents (search time: 975 ms).
1.
Mining frequent tree patterns has many applications in different areas, such as XML data, bioinformatics, and the World Wide Web. The crucial step in frequent pattern mining is frequency counting, which involves a matching operator to find occurrences (instances) of a tree pattern in a given collection of trees. A widely used matching operator for tree-structured data is subtree homeomorphism, where an edge in the tree pattern is mapped onto an ancestor-descendant relationship in the given tree. Tree patterns that are frequent under subtree homeomorphism are usually called embedded patterns. In this paper, we present an efficient algorithm for subtree homeomorphism with application to frequent pattern mining. We propose a compact data structure, called occ, which stores only information about the rightmost paths of occurrences and hence can encode and represent several occurrences of a tree pattern. We then define efficient join operations on the occ data structure, which help us count occurrences of tree patterns according to occurrences of their proper subtrees. Based on the proposed subtree homeomorphism method, we develop an effective pattern mining algorithm, called TPMiner. We evaluate the efficiency of TPMiner on several real-world and synthetic datasets. Our extensive experiments confirm that TPMiner always outperforms well-known existing algorithms, and in several cases the improvement is significant.
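To make the matching operator concrete, the sketch below counts occurrences of a single-edge pattern under subtree homeomorphism, where the pattern edge (A, B) may map onto any ancestor-descendant pair. This is only a minimal illustration of the occurrence notion; it does not reproduce TPMiner's occ structure or its join operations, and the function and variable names are hypothetical.

# Counting occurrences of a single-edge pattern (A, B) under subtree
# homeomorphism: an occurrence is a pair (u, v) where u is labeled A,
# v is labeled B, and u is a proper ancestor of v in the data tree.

def count_edge_occurrences(tree, labels, root, a, b):
    """tree: dict node -> list of children; labels: dict node -> label."""
    total = 0

    def dfs(node, open_a):
        nonlocal total
        if labels[node] == b:
            total += open_a              # every A-ancestor yields an occurrence
        open_a += labels[node] == a      # this node is an ancestor of the subtree below
        for child in tree.get(node, []):
            dfs(child, open_a)

    dfs(root, 0)
    return total

# Tiny example: A -> (B, C -> B) contains two (A, B) occurrences.
tree = {0: [1, 2], 2: [3]}
labels = {0: "A", 1: "B", 2: "C", 3: "B"}
print(count_edge_occurrences(tree, labels, 0, "A", "B"))  # 2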

2.
In this paper we consider the integration of SMT solvers with filtering algorithms for the finite-domain alldifferent constraint. Such integration makes SMT solvers suitable for solving constraint satisfaction problems that involve the alldifferent constraint. First, we present a novel algorithm for explaining inconsistencies and propagations in the alldifferent constraint. We compare it to Katsirelos' algorithm and to the flow-based algorithms commonly used for that purpose. Then we describe our DPLL(T)-compliant SMT theory solver for constraint satisfaction problems that include alldifferent constraints. We also provide an experimental evaluation of our approach.
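As a concrete reference point for the flow-based algorithms mentioned above, the following sketch decides consistency of an alldifferent constraint by searching for a matching that covers every variable in the variable-value bipartite graph (Hall's theorem). It is a textbook augmenting-path check, not the paper's explanation algorithm or its DPLL(T) integration.

# alldifferent is satisfiable iff the variable-value bipartite graph has a
# matching covering every variable; Kuhn's augmenting-path search tests this.

def alldifferent_consistent(domains):
    """domains: list of sets of values, one per variable."""
    match = {}  # value -> variable currently assigned to it

    def try_assign(var, seen):
        for val in domains[var]:
            if val in seen:
                continue
            seen.add(val)
            # value is free, or its current variable can be re-matched elsewhere
            if val not in match or try_assign(match[val], seen):
                match[val] = var
                return True
        return False

    return all(try_assign(v, set()) for v in range(len(domains)))

print(alldifferent_consistent([{1, 2}, {1, 2}, {2, 3}]))  # True
print(alldifferent_consistent([{1, 2}, {1, 2}, {1, 2}]))  # False: 3 vars, 2 values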

3.
Suffix array is a powerful data structure, used mainly for pattern detection in strings. The main disadvantage of a full suffix array is its quadratic \(O(n^2)\) space requirement when the actual suffixes are needed. In our previous work [39], we introduced the All Repeated Patterns Detection (ARPaD) algorithm and the Moving Longest Expected Repeated Pattern (MLERP) process. The former detects all repeated patterns in a string using a partition of the full suffix array, and the latter can analyze large strings regardless of their size. Furthermore, the notion of the Longest Expected Repeated Pattern (LERP), also introduced by the authors in previous work, reduces the space needed for the full suffix array to linear \(O(n)\). So far, however, the LERP value has had to be specified in an ad hoc manner based on experimental or empirical values. To overcome this problem, the Probabilistic Existence of LERP theorem is proven in this paper, and a formula for an accurate upper-bound estimate of the LERP value is introduced that uses only the length of the string and the size of the alphabet used to construct it. The importance of this method is that it optimally bounds the LERP value without any preprocessing or prior knowledge of the string's characteristics. Moreover, a new data structure, the LERP Reduced Suffix Array, is defined; it is a variation of the suffix array that allows classification and parallelism to be implemented directly on the data structure. Alternative methodologies all face the common problem of fitting a data structure into a computer's memory or disk in order to apply time-efficient pattern detection methods. The proposed methodology recasts this problem so that smaller subproblems can be distributed across different systems, to which state-of-the-art techniques such as parallelism and cloud computing can be applied using advanced DBMSs capable of storing and analyzing big data. The methodology is implemented by invoking our ARPaD algorithm. Extensive experiments have been conducted on small, comparable strings of the Champernowne constant and DNA, as well as on extremely large strings of π with length up to 68 billion digits. Furthermore, the methodology has also been tested on a real-life application: a Distributed Denial of Service (DDoS) attack early warning system.
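The paper's exact upper-bound formula is not reproduced here, but a standard union-bound estimate of the same flavor shows how an upper bound on the LERP can depend only on the string length n and the alphabet size m: a uniformly random string has about n²/2 substring pairs, each pair of length-l substrings matches with probability m^(-l), so the expected number of length-l repeats falls below δ once l ≥ log_m(n²/(2δ)). The sketch below computes that estimate; it is a hedged stand-in, not the paper's theorem.

import math

# Birthday/union-bound style estimate of the longest repeated pattern
# expected in a uniformly random string of length n over an alphabet of
# size m; NOT the paper's exact Probabilistic Existence of LERP formula.

def lerp_upper_bound(n, m, delta=0.01):
    return math.ceil(math.log(n * n / (2 * delta), m))

print(lerp_upper_bound(10**9, 4))        # DNA-sized alphabet, 1G symbols
print(lerp_upper_bound(68 * 10**9, 10))  # digits of pi, 68G symbols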

4.
Model-based testing has mainly focused on models where concurrency is interpreted as interleaving (as in the ioco theory for labeled transition systems), which may be too coarse when one wants concurrency to be preserved in the implementation. In order to test such concurrent systems, we use Petri nets as specifications and define a concurrent conformance relation named co-ioco. We present a test generation algorithm based on Petri net unfolding that builds a complete test suite w.r.t. our co-ioco conformance relation. In addition, we propose several coverage criteria for selecting finite prefixes of an unfolding in order to build manageable test suites.

5.
6.
Hoare logic [1] is a logic used to specify the semantics of programming languages; it has been extended to separation logic to reason about mutable heap structures [2]. In a model M of Hoare logic, each program α induces an M-computable function \(f_\alpha^M\) on the universe of M, and the M-recursive functions are defined on M. It is proved that the class of all M-computable functions \(f_\alpha^M\) induced by programs is equal to the class of all M-recursive functions. Moreover, each M-recursive function is \(\Sigma_1^{N^M}\)-definable in M, where the universal quantifier is a number quantifier ranging over the standard part of a nonstandard model M.
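For reference, the classical rules of Hoare logic that give programs their semantics read as follows (standard textbook presentation, independent of the paper's nonstandard models):

\[
\frac{}{\{P[e/x]\}\; x := e\; \{P\}}
\qquad
\frac{\{P\}\,\alpha\,\{Q\}\quad \{Q\}\,\beta\,\{R\}}{\{P\}\,\alpha;\beta\,\{R\}}
\qquad
\frac{\{P \land B\}\,\alpha\,\{P\}}{\{P\}\,\mathbf{while}\ B\ \mathbf{do}\ \alpha\,\{P \land \lnot B\}}
\]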

7.
This paper presents a novel online learning algorithm for eight popular nonlinear (i.e., kernel) classifiers, based on classic stochastic gradient descent in the primal domain. In particular, the online learning algorithm is derived for the following classifiers: L1 and L2 support vector machines with both the quadratic regularizer \(w^Tw\) and the \(\ell_1\) regularizer \(\|w\|_1\); regularized huberized hinge loss; regularized kernel logistic regression; regularized exponential loss with the \(\ell_1\) regularizer \(\|w\|_1\); and least-squares support vector machines. The online learning algorithm is aimed primarily at designing classifiers for large datasets. The novel learning model is accurate, fast, and extremely simple (i.e., comprising only a few lines of code). Comparisons of the performance of the proposed algorithm with state-of-the-art support vector machine algorithms on several real datasets are shown.
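As a hedged illustration of such primal-domain online learning, the sketch below runs kernelized stochastic gradient descent on the L2-regularized hinge loss (a Pegasos-style instantiation, not the paper's derivation; the Gaussian kernel, hyperparameters, and toy data are assumptions).

import numpy as np

def rbf(x, z, gamma=3.0):
    # Gaussian (RBF) kernel; gamma is an assumed hyperparameter.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_pegasos(X, y, lam=0.01, epochs=50):
    """Kernelized SGD for the L2-regularized hinge loss (Pegasos-style)."""
    n = len(X)
    alpha = np.zeros(n)               # alpha[i]: margin violations at example i
    t = 0
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            # f(x_i) = (1 / (lam * t)) * sum_j alpha_j y_j K(x_j, x_i)
            f = sum(alpha[j] * y[j] * rbf(X[j], X[i])
                    for j in range(n) if alpha[j] > 0) / (lam * t)
            if y[i] * f < 1:          # hinge subgradient step on example i
                alpha[i] += 1
    return alpha

# Toy usage on XOR-like data, which no linear classifier separates.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
alpha = kernel_pegasos(X, y)
predict = lambda z: np.sign(sum(alpha[j] * y[j] * rbf(X[j], z) for j in range(len(X))))
print([predict(x) for x in X])        # expected: [-1.0, 1.0, 1.0, -1.0]

The learner really is a few lines: its entire state is one coefficient per seen example, incremented whenever that example violates the margin.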

8.
What do the k-core structures of real-world graphs look like? What are the common patterns and the anomalies? How can we exploit them for applications? A k-core is the maximal subgraph in which all vertices have degree at least k. This concept has been applied to areas as diverse as hierarchical structure analysis, graph visualization, and graph clustering. Here, we explore pervasive patterns related to k-cores that emerge in graphs from diverse domains. Our discoveries are: (1) Mirror Pattern: coreness (i.e., the maximum k such that a vertex belongs to the k-core) is strongly correlated with degree. (2) Core-Triangle Pattern: degeneracy (i.e., the maximum k such that the k-core exists) obeys a 3-to-1 power law with respect to the count of triangles. (3) Structured Core Pattern: degeneracy-cores are not cliques but have non-trivial structures such as core-periphery and communities. Our algorithmic contributions show the usefulness of these patterns. (1) Core-A, which measures the deviation from the Mirror Pattern, successfully spots anomalies in real-world graphs. (2) Core-D, a single-pass streaming algorithm based on the Core-Triangle Pattern, accurately estimates degeneracy up to 12\(\times \) faster than its competitor. (3) Core-S, inspired by the Structured Core Pattern, identifies influential spreaders up to 17\(\times \) faster than its competitors with comparable accuracy.
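Coreness and degeneracy, the two quantities behind these patterns, can be computed by the standard peeling procedure: repeatedly delete a vertex of minimum remaining degree. The sketch below is the simple quadratic version of this textbook algorithm (a bucket queue makes it near-linear); it is background for the patterns, not the Core-A/Core-D/Core-S algorithms themselves.

# Peeling: remove a minimum-degree vertex at each step; the running maximum
# of the removed degrees is the coreness of each removed vertex.

def coreness(adj):
    """adj: dict vertex -> set of neighbors (undirected graph)."""
    deg = {v: len(ns) for v, ns in adj.items()}
    alive = set(adj)
    core, k = {}, 0
    while alive:
        v = min(alive, key=deg.get)   # vertex of minimum remaining degree
        k = max(k, deg[v])            # k only increases along the peeling order
        core[v] = k
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
    return core

# A triangle plus a pendant vertex: the triangle has coreness 2, the pendant 1.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(coreness(adj))  # {3: 1, 0: 2, 1: 2, 2: 2}

The degeneracy of the graph is then max(core.values()); per the abstract, Core-D estimates this quantity in a single pass from the triangle count instead of computing it exactly.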

9.
Summarization is an important intermediate step for expediting knowledge discovery tasks such as anomaly detection. In the context of anomaly detection from data streams, the summary needs to represent both anomalous and normal data. But streaming data has distinct characteristics, such as the one-pass constraint, that make data mining operations difficult. Existing stream summarization techniques are unable to create summaries that represent both normal and anomalous instances. To address this problem, in this paper a number of hybrid summarization techniques are designed and developed using the concept of a reservoir, for anomaly detection from network traffic. Experimental results on thirteen benchmark data streams show that the summaries produced from a stream using the pairwise distance (PSSR) and template matching (TMSSR) techniques can retain more anomalies than existing stream summarization techniques, and that the anomaly detection technique can identify the anomalies with a high true positive and low false positive rate.
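The reservoir concept referenced above is easiest to see in its textbook form: one-pass uniform sampling into a fixed-size buffer. The sketch below is only this baseline (Algorithm R); the paper's PSSR and TMSSR summaries are hybrids built on top of the reservoir idea.

import random

# Algorithm R: after i items have streamed past, each of them is in the
# k-slot reservoir with equal probability k/i, using one pass and O(k) space.

def reservoir_summary(stream, k, seed=0):
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)      # keep the new item with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_summary(range(10**5), 5))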

10.
11.
A novel algorithm for color constancy

12.
Existing definitions of the relativizations of NC\(^1\), L and NL do not preserve the inclusions \({{\bf NC}^1 \subseteq {\bf L}, {\bf NL}\subseteq {\bf AC}^1}\). We start by giving the first definitions that preserve them. Here, for L and NL we define their relativizations using Wilson's stack oracle model, but limit the height of the stack to a constant (instead of log(n)). We show that the collapse of any two classes in \({\{{\bf AC}^0 (m), {\bf TC}^0, {\bf NC}^1, {\bf L}, {\bf NL}\}}\) implies the collapse of their relativizations. Next we exhibit an oracle α that makes AC\(^k\)(α) a proper hierarchy. This strengthens and clarifies the separations of the relativized theories in Takeuti (1995). The idea is that a circuit whose nesting depth of oracle gates is bounded by k cannot correctly compute the (k + 1)-fold compositions of every oracle function. Finally, we develop theories that characterize the relativizations of subclasses of P by modifying theories previously defined by the last two authors. A function is provably total in a theory iff it is in the corresponding relativized class, and hence the oracle separations imply separations for the relativized theories.

13.
Sequential pattern mining (SPM) is an important data mining problem with broad applications. SPM is hard due to the huge number of intermediate subsequences to be considered. State-of-the-art approaches for SPM (e.g., PrefixSpan, Pei et al. 2001) are largely based on the pattern-growth approach, where for each frequent prefix subsequence only its related suffix subsequences need to be considered, and the database is recursively projected into smaller ones. Many authors have promoted the use of constraints to focus on the most promising patterns according to the interests of the end user. The top-k SPM problem is also used to cope with the difficulty of thresholding and to control the number of solutions. Methods developed for SPM and top-k SPM, though efficient, are locked into a rather rigid search strategy and suffer from a lack of declarativity and flexibility. Indeed, adding new constraints usually amounts to changing the data structures used in the core of the algorithm, and combining new constraints often requires new development. Recent works (e.g., Kemmar et al. 2014; Négrevergne and Guns 2015) have investigated the use of Constraint Programming (CP) for SPM. However, despite their nice declarative aspects, all these models have scaling problems due to the huge size of their constraint networks. To address this issue, we propose the Prefix-Projection global constraint, which encapsulates both the subsequence relation and the frequency constraint. Its filtering algorithm relies on the principle of projected databases, which keeps in the variables' domains only those values leading to a frequent pattern in the database. The Prefix-Projection filtering algorithm enforces domain consistency on the variable succeeding the current frequent prefix in polynomial time. This global constraint also allows a straightforward implementation of additional constraints such as size, item membership, regular expressions, and any combination of them. Experimental results show that our approach clearly outperforms existing CP approaches and competes well with state-of-the-art methods on large datasets for mining frequent sequential patterns, sequential patterns under various constraints, and top-k sequential patterns. Unlike existing CP methods, our approach achieves better scalability.
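The projected-database principle at the heart of Prefix-Projection can be sketched in a few lines: after fixing a prefix item, each sequence is replaced by the suffix following the item's first occurrence, and only items frequent in that projection may extend the prefix. The toy database and helper names below are illustrative, not the paper's implementation.

# Project the database on a prefix item, then count candidate extensions.

def project(database, item):
    """For each sequence containing item, keep the suffix after its first
    occurrence (the projected database of the one-item prefix)."""
    projected = []
    for seq in database:
        if item in seq:
            projected.append(seq[seq.index(item) + 1:])
    return projected

def frequent_extensions(database, minsup):
    counts = {}
    for seq in database:
        for item in set(seq):            # count each item once per sequence
            counts[item] = counts.get(item, 0) + 1
    return {i for i, c in counts.items() if c >= minsup}

db = [list("abcb"), list("acbc"), list("abbc")]
p_a = project(db, "a")                     # suffixes after the prefix <a>
print(frequent_extensions(p_a, minsup=3))  # items that frequently extend <a>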

14.
Vertices with high betweenness and closeness centrality represent influential entities in a network. An important problem for time-varying networks is to know a priori, using minimal computation, whether the influential vertices of the current time step will retain their high centrality in future time steps as the network evolves. In this paper, based on empirical evidence from several large real-world time-varying networks, we discover a class of networks in which the highly central vertices are part of the innermost core of the network, and this property is maintained over time. As a key contribution of this work, we propose novel heuristics to identify these networks optimally, and we develop a two-step algorithm for predicting high-centrality vertices. Consequently, we show for the first time that for such networks, expensive shortest-path computations at each time step as the network changes can be completely avoided; instead, we can use time-series models (e.g., ARIMA, as used here) to predict the overlap between the high-centrality vertices of the current time step and those of future time steps. Moreover, once the new network is available, we can find the high-centrality vertices in the top core simply from their high degree. To measure the effectiveness of our framework, we perform the prediction task on a large set of diverse time-varying networks. We obtain F1-scores as high as 0.81 and 0.72 in predicting the top m closeness and betweenness centrality vertices, respectively, for real networks where the highly central vertices mostly reside in the innermost core. For synthetic networks that conform to this property, we achieve F1-scores of 0.94 and 0.92 for closeness and betweenness, respectively. We validate our results by showing that the practical effects of our predicted vertices match those of the actual high-centrality vertices. Finally, we provide a formal sketch demonstrating why our method works.
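A minimal stand-in for the forecasting step: the paper fits ARIMA models to the series of overlaps between successive high-centrality sets; the least-squares AR(1) below illustrates the same prediction idea on a hypothetical overlap history.

import numpy as np

# Fit y_t ≈ a*y_{t-1} + b by least squares and forecast one step ahead.
# This is a toy AR(1) substitute for the ARIMA models used in the paper.

def ar1_forecast(series):
    x, y = np.asarray(series[:-1]), np.asarray(series[1:])
    A = np.vstack([x, np.ones_like(x)]).T
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a * series[-1] + b

overlaps = [0.70, 0.74, 0.73, 0.77, 0.76, 0.79]   # hypothetical overlap history
print(round(ar1_forecast(overlaps), 3))           # next-step overlap estimate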

15.
The problem of kNN (k nearest neighbor) queries has received considerable attention in the database and information retrieval communities. Given a dataset D and a kNN query q, the k nearest neighbor algorithm finds the k data points closest to q. The applications of kNN queries are broad, not only in spatio-temporal databases but also in many other areas. For example, they can be used in multimedia databases, data mining, scientific databases, and video retrieval. Past studies of kNN query processing did not consider the case where the server receives multiple kNN queries at one time; their algorithms process queries independently. Thus, the server is kept busy continuously re-accessing the database to obtain data that have already been acquired, which wastes I/O and degrades the performance of the whole system. In this paper, we focus on this problem and propose an algorithm named COrrelated kNN query Evaluation (COKE). The main idea of COKE is an "information sharing" strategy whereby the server reuses the query results of previously executed queries to process subsequent queries efficiently. We conduct a comprehensive set of experiments to analyze the performance of COKE and compare it with the Best-First Search (BFS) algorithm. Empirical studies indicate that COKE outperforms BFS, achieving lower I/O costs and less running time.
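A hedged sketch of the information-sharing idea (not COKE's actual I/O-aware algorithm): the answer to a nearby earlier query immediately bounds the new query's k-th neighbor distance, so a scan can reject most points with a single comparison. All names and data here are illustrative.

import heapq, math

def knn_with_seed(data, q, k, seed_points=()):
    """Brute-force kNN, except that the neighbors of a previous, nearby
    query (seed_points) provide an initial upper bound on the k-th distance."""
    bound = math.inf
    if len(seed_points) >= k:
        bound = sorted(math.dist(p, q) for p in seed_points)[k - 1]
    heap = []                                  # max-heap of (-distance, point)
    for p in data:
        d = math.dist(p, q)
        if d <= bound:                         # prune points beyond the bound
            heapq.heappush(heap, (-d, p))
            if len(heap) > k:
                heapq.heappop(heap)
                bound = -heap[0][0]            # tighten the bound as we go
    return [p for _, p in sorted((-nd, p) for nd, p in heap)]

# A grid of points; the second query reuses the first query's answer.
data = [(i / 10.0, j / 10.0) for i in range(100) for j in range(100)]
first = knn_with_seed(data, (5.0, 5.0), k=5)
second = knn_with_seed(data, (5.05, 5.0), k=5, seed_points=first)
print(second)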

16.
This paper presents the raSAT SMT solver, which aims to handle polynomial constraints over both reals and integers with simple, unified methodologies. Its three main features are (1) a raSAT loop for inequalities, which adds testing to interval constraint propagation to accelerate SAT detection, (2) non-constructive reasoning for equations over the reals based on the generalized intermediate value theorem, and (3) soundness of floating-point arithmetic, guaranteed by (a) rounding up/down over-approximations of intervals, and (b) confirmation of a satisfying instance detected by testing using the iRRAM package, which guarantees error bounds.
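A hedged sketch of feature (1), the interval-plus-testing loop: interval arithmetic over-approximates a polynomial's range on a box, refuting "f ≤ 0" when the whole interval is positive, while random testing inside the box tries to find a SAT witness quickly. The polynomial and box are hypothetical, and raSAT's outward rounding and iRRAM confirmation (features (3a) and (3b)) are omitted.

import random

def imul(a, b):
    # Interval product: the range of x*y over x in a, y in b.
    prods = [a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1]]
    return (min(prods), max(prods))

def iadd(a, b):
    return (a[0] + b[0], a[1] + b[1])

# f(x, y) = x*y - 1, as intervals and pointwise:
def f_interval(x, y):
    return iadd(imul(x, y), (-1.0, -1.0))

def f_point(x, y):
    return x * y - 1.0

def check_box(x, y, trials=1000, seed=0):
    lo, hi = f_interval(x, y)
    if lo > 0:
        return "UNSAT on this box"        # f <= 0 is impossible anywhere in it
    rng = random.Random(seed)
    for _ in range(trials):
        px, py = rng.uniform(*x), rng.uniform(*y)
        if f_point(px, py) <= 0:
            return f"SAT witness: ({px:.3f}, {py:.3f})"
    return "unknown (a full solver would split the box and recurse)"

print(check_box((2.0, 3.0), (1.0, 2.0)))  # x*y >= 2 here, so f > 0: UNSAT
print(check_box((0.0, 2.0), (0.0, 2.0)))  # testing finds a point with x*y <= 1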

17.
In this paper we present a novel methodology for sequence classification based on sequential pattern mining and optimization algorithms. The proposed methodology automatically generates a sequence classification model through a two-stage process. In the first stage, a sequential pattern mining algorithm is applied to a set of sequences and the sequential patterns are extracted. Then, the score of every pattern with respect to each sequence is calculated using a scoring function, and the score of each class under consideration is estimated by summing the corresponding pattern scores. Each score is then multiplied by a weight, and the output of the first stage is the classification confusion matrix of the sequences. In the second stage, an optimization technique aims to find a set of weights that minimize an objective function defined using the classification confusion matrix. The set of extracted sequential patterns and the optimal class weights comprise the sequence classification model. The methodology was evaluated extensively in the protein classification domain, varying the number of training and test sequences, the number of patterns, and the number of classes, and it is compared with other similar sequence classification approaches. The proposed methodology exhibits several advantages, such as automated weight assignment to classes using optimization techniques and knowledge discovery in the application domain.
Corresponding author: Dimitrios I. Fotiadis
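A hedged sketch of the second stage under simplifying assumptions: two classes, a precomputed pattern-score matrix (stage one's output, with made-up numbers), and a hinge surrogate objective optimized with SciPy's derivative-free Nelder-Mead search standing in for the paper's optimization technique and confusion-matrix objective.

import numpy as np
from scipy.optimize import minimize

# scores[i, c]: summed pattern scores of sequence i for class c (hypothetical).
scores = np.array([[3.0, 2.5], [1.0, 1.8], [2.2, 2.0], [0.5, 1.4]])
labels = np.array([0, 1, 0, 1])

def objective(w):
    s = scores * w                                  # weighted class scores
    idx = np.arange(len(labels))
    margin = s[idx, labels] - s[idx, 1 - labels]    # correct minus wrong class
    return np.sum(np.maximum(0.0, 1.0 - margin))    # hinge surrogate of errors

result = minimize(objective, x0=np.array([1.0, 1.0]), method="Nelder-Mead")
w_opt = result.x
predictions = np.argmax(scores * w_opt, axis=1)
print(w_opt, (predictions != labels).sum())         # weights and error count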

18.
One in a million: picking the right patterns
Constrained pattern mining extracts patterns based on their individual merit. Usually this results in far more patterns than a human expert or a machine learning technique could make use of. Often, different patterns or combinations of patterns cover a similar subset of the examples, making them redundant and carrying no new information. To remove the redundant information contained in such pattern sets, we propose two general heuristic algorithms, Bouncer and Picker, for selecting a small subset of patterns. We identify several selection techniques for use in this general algorithm and evaluate them on several datasets. The results show that both techniques succeed in severely reducing the number of patterns, while at the same time apparently retaining much of the original information. Additionally, the experiments show that reducing the pattern set indeed improves the quality of classification results. Both results show that the developed solutions are well suited to our goals.
Corresponding author: Albrecht Zimmermann
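The selection problem lends itself to a simple greedy baseline in the same spirit: repeatedly keep the pattern covering the most still-uncovered examples and stop when nothing new is covered. This sketch is a generic coverage heuristic with made-up data, not Bouncer or Picker themselves.

# Greedy pattern-set reduction: a pattern is redundant if everything it
# covers is already covered by the patterns selected so far.

def pick_patterns(coverage):
    """coverage: dict pattern -> set of example ids it matches."""
    selected, covered = [], set()
    while True:
        best = max(coverage, key=lambda p: len(coverage[p] - covered))
        gain = len(coverage[best] - covered)
        if gain == 0:
            break
        selected.append(best)
        covered |= coverage[best]
    return selected

coverage = {
    "p1": {1, 2, 3, 4},
    "p2": {3, 4, 5},
    "p3": {1, 2},      # redundant given p1
    "p4": {5, 6},
}
print(pick_patterns(coverage))  # ['p1', 'p4'] covers everything; p3 is dropped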

19.
In this paper, we deal with mining sequential patterns in multiple time sequences. Building on PrefixSpan, a state-of-the-art sequential pattern mining algorithm for transaction databases, we propose MILE (MIning in muLtiple sEquences), an efficient algorithm that facilitates the mining process. MILE recursively exploits knowledge of existing patterns to avoid redundant data scanning, and can therefore effectively speed up the discovery of new patterns. Another unique feature of MILE is that it can incorporate prior knowledge of the data distribution in the time sequences into the mining process to further improve performance. Extensive empirical results show that MILE is significantly faster than PrefixSpan. As MILE consumes more memory than PrefixSpan, we also present a solution for trading time efficiency in memory-constrained environments.
Corresponding author: Xingquan Zhu

20.
Many real-world knowledge-based systems must deal with information coming from different sources, which invariably leads to incompleteness, overspecification, or inherently uncertain content. The presence of these varying levels of uncertainty does not mean that the information is worthless; rather, these are hurdles the knowledge engineer must learn to work with. In this paper, we continue work on an argumentation-based framework that extends the well-known Defeasible Logic Programming (DeLP) language with probabilistic uncertainty, giving rise to the Defeasible Logic Programming with Presumptions and Probabilistic Environments (DeLP3E) model. Our prior work focused on the problem of belief revision in DeLP3E, where we proposed a non-prioritized class of revision operators called AFO (Annotation Function-based Operators). In this paper, we study this class further and argue that in some cases it may be desirable to define revision operators that take quantitative aspects into account, such as how the probabilities of certain literals or formulas of interest change after the revision takes place. To the best of our knowledge, this problem has not been addressed in the argumentation literature to date. We propose the QAFO (Quantitative Annotation Function-based Operators) class of operators, a subclass of AFO, and then study the complexity of several problems related to their specification and application in revising knowledge bases. Finally, we present an algorithm for computing the probability that a literal is warranted in a DeLP3E knowledge base, and discuss how it could be applied to implementing QAFO-style operators that compute approximations rather than exact operations.
