Similar Documents
20 similar documents found
1.
In this paper, we focus on mining surprising periodic patterns in a sequence of events. In many applications, e.g., computational biology, an infrequent pattern is still considered very significant if its actual occurrence frequency exceeds the prior expectation by a large margin. A traditional metric such as support is not necessarily the ideal model for measuring this kind of surprising pattern, because it treats all patterns equally: every occurrence carries the same weight towards the assessment of a pattern's significance, regardless of its probability of occurrence. A more suitable measurement, information, is introduced to naturally value the degree of surprise of each occurrence of a pattern as a continuous and monotonically decreasing function of its probability of occurrence. This allows patterns with vastly different occurrence probabilities to be handled seamlessly. As the accumulated degree of surprise of all repetitions of a pattern, the concept of information gain is proposed to measure the overall degree of surprise of the pattern within a data sequence. Because the information gain measure violates the downward closure property, a bounded information gain property is identified to tackle this predicament, which in turn provides an efficient solution to the problem. Furthermore, the user can either specify a minimum information gain threshold or choose the number of surprising patterns wanted. Empirical tests demonstrate the efficiency and usefulness of the proposed model.
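As a concrete illustration of the information-gain idea, here is a minimal Python sketch that scores each occurrence of a pattern by -log2 of its probability and sums the scores. The independence model used for the pattern's probability, and all function names, are assumptions of this sketch, not taken from the paper:

```python
import math
from collections import Counter

def information_gain(sequence, pattern):
    """Accumulated surprise of a pattern: each occurrence contributes
    -log2(p), where p is the pattern's probability under an assumed
    independence model over single-event frequencies."""
    n = len(sequence)
    freq = Counter(sequence)
    p = 1.0
    for event in pattern:
        p *= freq[event] / n          # probability of each event
    if p == 0.0:
        return 0.0                    # pattern contains an unseen event
    surprise = -math.log2(p)          # information of one occurrence
    m, count, i = len(pattern), 0, 0
    while i <= n - m:                 # count non-overlapping occurrences
        if sequence[i:i + m] == pattern:
            count += 1
            i += m
        else:
            i += 1
    return count * surprise

events = list("abcabcxxabc")
print(information_gain(events, list("abc")))  # 3 occurrences, each weighted
```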

2.
Practical methods for constructing suffix trees
Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not well characterized. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n²) worst-case complexity outperforms popular linear-time algorithms like Ukkonen's and McCreight's, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we describe two approaches. First, we present a buffer management strategy for the O(n²) algorithm. The resulting new algorithm, which we call "Top Down Disk-based" (TDD), scales to sizes much larger than have been previously described in the literature. This approach far outperforms the best known disk-based construction methods. Second, we present a new disk-based suffix tree construction algorithm based on a sort-merge paradigm, and show that for constructing very large suffix trees with very little resources, this algorithm is more efficient than TDD.
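For contrast with the engineered algorithms above, a naive quadratic construction can be sketched in a few lines. This toy suffix trie (names hypothetical) compresses no edges and manages no buffers, but shows the basic structure being built and the substring queries it answers:

```python
def build_suffix_trie(text):
    """Naive O(n^2) suffix trie: insert every suffix of text + '$'.
    Real construction algorithms (McCreight, Ukkonen, TDD) compress
    edges and manage memory carefully; this toy version does neither."""
    text += "$"
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """A pattern occurs in the text iff it labels a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("mississippi")
print(contains(trie, "issi"), contains(trie, "issp"))  # True False
```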

3.
The goal of analyzing a time series database is to find whether, and how frequently, a periodic pattern is repeated within the series. Periodic pattern mining is the problem concerned with this kind of temporal regularity. However, most existing algorithms have a major limitation in mining patterns of interest to users: they can mine only patterns of a specific length, with all events occurring sequentially at exact positions within the pattern. Yet in many scenarios a pattern can be flexible: it may still be interesting, and can be mined, by neglecting any number of unimportant events between important events and by allowing the pattern length to vary. Moreover, existing algorithms can detect only specific types of periodicity in time series databases and require user interaction to determine the periodicity. In this paper, we propose an algorithm for periodic pattern mining in time series databases that does not rely on the user for the period value or period type and can detect all types of periodic patterns at the same time; these flexibilities are missing in existing algorithms. The proposed algorithm lets the user generate different kinds of patterns by skipping intermediate events in a time series database and determines the periodicity of the patterns within the database. It improves on suffix-tree-based pattern generation, since suffix-tree-based algorithms are weak in this particular kind of pattern generation. Compared with existing algorithms, the proposed algorithm is better at generating different kinds of interesting patterns and detects whether a generated pattern is periodic. We have tested the performance of our algorithm on both synthetic and real-life data from different domains, found a large number of interesting event sequences that existing algorithms miss, and observed that the proposed algorithm is efficient in generating flexible patterns and detecting their periodicity on both types of data.

4.
Traditional information systems return answers only after a user submits a complete query. Users often feel "left in the dark" when they have limited knowledge about the underlying data and must use a try-and-see approach to find information. The recent trend of supporting autocomplete in these systems is a first step toward solving this problem. In this paper, we study a new information-access paradigm, called "type-ahead search," in which the system searches the underlying data "on the fly" as the user types in query keywords. It extends autocomplete interfaces by allowing keywords to appear at different places in the underlying data. This framework allows users to explore data as they type, even in the presence of minor errors. We study the research challenges this framework poses for large amounts of data. Since each keystroke of the user could invoke a query on the backend, we need efficient algorithms to process each query within milliseconds. We develop various incremental-search algorithms for both single-keyword and multi-keyword queries, using previously computed and cached results in order to achieve a high interactive speed. We develop novel techniques to support fuzzy search by allowing mismatches between query keywords and answers. We have deployed several real prototypes using these techniques. One of them supports type-ahead search on the UC Irvine people directory; it has been used regularly and is well received by users due to its friendly interface and high efficiency.
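A minimal sketch of the incremental-caching idea, assuming exact word-prefix matching only (the actual system also supports fuzzy matching and keywords at arbitrary positions); the class and method names are invented for illustration:

```python
class TypeAheadIndex:
    """Each keystroke narrows the candidate set cached for the previous
    prefix, so per-keystroke cost depends on the previous result size
    rather than the whole dataset. Exact word-prefix matching only."""

    def __init__(self, records):
        self.records = records
        self.cache = {"": list(range(len(records)))}

    def query(self, prefix):
        if prefix in self.cache:
            return [self.records[i] for i in self.cache[prefix]]
        # Reuse the cached result of the next-shorter prefix if present.
        base = prefix[:-1] if prefix[:-1] in self.cache else ""
        hits = [i for i in self.cache[base]
                if any(word.startswith(prefix)
                       for word in self.records[i].lower().split())]
        self.cache[prefix] = hits
        return [self.records[i] for i in hits]

idx = TypeAheadIndex(["Jane Smith", "John Doe", "Johan Jensen"])
for typed in ("j", "jo", "joh"):
    print(typed, "->", idx.query(typed))
```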

5.
Some results on the Collatz problem
The paper concerns the Collatz conjecture. In the first part, we present some equivalent forms of the conjecture and a slight generalization of a former result from [1]. Then we introduce the notion of "chain subtrees" in the Collatz tree, followed by a characterization theorem and some subclasses of numbers that label chain subtrees. Next, we define the notion of "fixed points" and, using it, state another conjecture similar to the Collatz conjecture. Some new infinite sets of numbers for which the Collatz conjecture holds are given. Finally, we present some interesting results on the number of "even" and "odd" branches in the Collatz tree.
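For readers new to the problem, a small Python sketch of the underlying iteration (the paper's chain subtrees and fixed points are not modeled here); the conjecture asserts the loop below never prints anything:

```python
def collatz_reaches_one(n, max_steps=10_000):
    """Follow the Collatz map T(n) = n/2 for even n, 3n+1 for odd n,
    and report whether the trajectory reaches 1 within max_steps."""
    steps = 0
    while n != 1 and steps < max_steps:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return n == 1

# The Collatz conjecture asserts this prints nothing.
for k in range(1, 100_000):
    if not collatz_reaches_one(k):
        print("counterexample candidate:", k)
```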

6.
This article presents a new interestingness measure for association rules called confidence gain (CG). The focus is on extracting human associations rather than associations between market products. There are two main differences between the two. The first is the strong asymmetry of human associations (e.g., the association "shampoo" → "hair" is much stronger than "hair" → "shampoo"), whereas for market products asymmetry is less intuitive and less evident. The second is the background knowledge humans employ when presented with a stimulus (input phrase). CG calculates the local confidence of a given term compared to its average confidence throughout a given database. CG is found to outperform several association measures, since it captures the asymmetric notion of an association (as in the confidence measure) while adding a comparison to an expected confidence (as in the lift measure). The use of average confidence introduces the "background knowledge" notion into the CG measure. Various experiments have shown that CG and local confidence gain (a low-complexity version of CG) successfully generate association rules when compared to human free associations. The experiments include a large-scale "free association Turing test" in which human free associations were compared to associations generated by CG and other association measures. Rules discovered by CG were found to be significantly better than those discovered by other measures. CG can be used for many purposes, such as personalization, sense disambiguation, query expansion, and improving classification performance of small item sets within large databases. Although CG was found to be useful for Internet data retrieval, the results can easily be applied to any type of database.
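One plausible reading of CG, sketched in Python: the confidence of x → y normalized by the average confidence with which other co-occurring terms imply y, so the average acts as the "background knowledge" baseline. The exact normalization used in the paper may differ; this is an assumption of the sketch:

```python
from collections import Counter
from itertools import permutations

def confidence_gain(transactions, x, y):
    """conf(x -> y) divided by the average confidence with which other
    co-occurring terms imply y (hypothetical normalization)."""
    item_count, pair_count = Counter(), Counter()
    for t in transactions:
        items = set(t)
        item_count.update(items)
        pair_count.update(permutations(items, 2))   # ordered pairs

    def conf(a, b):
        return pair_count[(a, b)] / item_count[a] if item_count[a] else 0.0

    others = [z for z in item_count if z != y and pair_count[(z, y)] > 0]
    if not others:
        return 0.0
    avg = sum(conf(z, y) for z in others) / len(others)
    return conf(x, y) / avg if avg > 0 else 0.0

data = [{"shampoo", "hair"}, {"shampoo", "hair"}, {"hair", "brush"},
        {"hair"}, {"shampoo", "soap"}]
print(confidence_gain(data, "shampoo", "hair"))   # stronger direction
print(confidence_gain(data, "hair", "shampoo"))   # weaker direction
```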

7.
We address the problem of detecting irregularities in visual data, e.g., detecting suspicious behaviors in video sequences or identifying salient patterns in images. The term "irregular" depends on the context in which the "regular" or "valid" is defined. Yet it is not realistic to expect an explicit definition of all possible valid configurations for a given context. We pose the problem of determining the validity of visual data as a process of constructing a puzzle: we try to compose a new observed image region or a new video segment ("the query") using chunks of data ("pieces of puzzle") extracted from previous visual examples ("the database"). Regions in the observed data that can be composed from large contiguous chunks of database data are considered very likely, whereas regions that cannot be composed from the database (or can be composed only from small fragmented pieces) are regarded as unlikely/suspicious. The problem is posed as an inference process in a probabilistic graphical model. We show applications of this approach to identifying saliency in images and video, detecting suspicious behaviors, and automatic visual inspection for quality assurance.

8.
Suffix trees and suffix arrays are fundamental full-text index data structures for solving problems that occur in string processing. Since suffix trees and suffix arrays have different capabilities, some problems are solved more efficiently using suffix trees and others using suffix arrays. We consider efficient index data structures that have the capabilities of both suffix trees and suffix arrays without requiring much space. When the alphabet is small, enhanced suffix arrays are such index data structures. However, when the alphabet is large, enhanced suffix arrays lose the power of suffix trees: pattern searching in an enhanced suffix array takes O(m|Σ|) time, while pattern searching in a suffix tree takes O(m log |Σ|) time, where m is the length of the pattern and Σ is the alphabet. In this paper, we present linearized suffix trees, which are efficient index data structures with the capabilities of both suffix trees and suffix arrays even when the alphabet is large. A linearized suffix tree has all the functionality of the enhanced suffix array and supports pattern search in O(m log |Σ|) time. From a different point of view, it can be considered a practical implementation of the suffix tree supporting O(m log |Σ|)-time pattern search. In addition, we present two efficient algorithms for computing suffix links on the enhanced suffix array and the linearized suffix tree. These are the first such algorithms that run in O(n) time without using range minima queries. Our experimental results show that our algorithms are faster than the previous algorithms.
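As a baseline for the structures discussed above, pattern search on a plain suffix array can be sketched with ordinary binary search, giving the classic O(m log n) bound rather than the O(m log |Σ|) of the linearized suffix tree; all names here are illustrative:

```python
import bisect

def build_suffix_array(text):
    """Toy construction: sort all suffixes outright (O(n^2 log n));
    the index structures in the paper are built far more efficiently."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Classic binary search over the suffix array. For clarity the
    length-m suffix prefixes are materialized up front; a real
    implementation compares against the text in place, O(m log n)."""
    heads = [text[i:i + len(pattern)] for i in sa]
    lo = bisect.bisect_left(heads, pattern)
    hi = bisect.bisect_right(heads, pattern)
    return sorted(sa[lo:hi])        # all occurrence positions

text = "mississippi"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "issi"))   # [1, 4]
```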

9.
Restoring subsampled color images
In some capturing devices, such as digital cameras, there is only one color sensor at each pixel. Typically, 50% of the pixels have only a green sensor, 25% only a red sensor, and 25% only a blue sensor. The problem is then to restore the two missing colors at each pixel; this is called "demosaicing" because the original samples are usually arranged in a mosaic pattern. In this short paper, a few demosaicing algorithms are developed and compared. They all incorporate a notion of "smoothness in chroma space" by imposing conditions not only on the behavior of each color channel separately, but also on the correlation between the three channels.
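A baseline per-channel bilinear demosaicer, sketched with NumPy/SciPy on an assumed RGGB layout; the paper's algorithms go further by exploiting the chroma-smoothness constraint across channels:

```python
import numpy as np
from scipy.signal import convolve2d

def bayer_mosaic(rgb):
    """Simulate an RGGB Bayer sensor: keep one color sample per pixel."""
    h, w, _ = rgb.shape
    mosaic = np.zeros((h, w))
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]   # red sites
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]   # green sites
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]   # green sites
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]   # blue sites
    return mosaic

def demosaic_bilinear(mosaic):
    """Per-channel bilinear interpolation: every missing sample becomes
    a weighted average of its known neighbors (normalized convolution).
    No inter-channel correlation is used, unlike the paper's methods."""
    h, w = mosaic.shape
    masks = np.zeros((3, h, w))
    masks[0, 0::2, 0::2] = 1
    masks[1, 0::2, 1::2] = 1
    masks[1, 1::2, 0::2] = 1
    masks[2, 1::2, 1::2] = 1
    kernel = np.array([[0.25, 0.5, 0.25],
                       [0.5,  1.0, 0.5],
                       [0.25, 0.5, 0.25]])
    out = np.empty((h, w, 3))
    for c in range(3):
        num = convolve2d(mosaic * masks[c], kernel, mode="same")
        den = convolve2d(masks[c], kernel, mode="same")
        est = num / np.maximum(den, 1e-12)
        out[..., c] = np.where(masks[c] == 1, mosaic, est)  # keep known samples
    return out

rgb = np.random.rand(8, 8, 3)
print(np.abs(demosaic_bilinear(bayer_mosaic(rgb)) - rgb).mean())
```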

10.
We formulate the problem of constructing a tree that is on average nearest to a given set of trees. The notion of "nearest" is based on a set of events whose counts make it possible to distinguish each of the given trees from the desired one. These events are called divergence, duplication, loss, and transfer; other lists of events can also be considered. We propose an algorithm that solves this problem in cubic time with respect to the input size, prove its correctness, and establish the cubic bound on its complexity.

11.
A framework for "improvisational" social acts and communication is introduced by reference to the idea of "relationalism" found in natural farming, permaculture, and deep ecology. Based on this conception, the notion of the Existential Graph by C. S. Peirce is introduced. The notion of the extended self in deep ecology is substantiated based on the Roy Adaptation Model in nursing theory and narrative approaches. By focusing on the Leibnizian notions of space and time and by introducing Petri nets, a spatio-temporal model of improvisation is constructed. This model is expected to substantiate the interesting notion of "Ba" proposed by H. Shimizu, which reflects Japanese culture.

12.
A single-celled amoeboid organism, the true slime mold Physarum polycephalum, exhibits rich spatiotemporal oscillatory behavior and sophisticated computational capabilities. The authors previously created a biocomputer that incorporates the organism as a computing substrate to search for solutions to combinatorial optimization problems. With the assistance of optical feedback implementing a recurrent neural network model, the organism changes its shape by alternately growing and withdrawing its photosensitive branches so that its body area is maximized and the risk of being illuminated is minimized. In this way, the organism succeeded in finding the optimal solution to the four-city traveling salesman problem with high probability. However, it remains unclear how the organism collects, stores, and compares information on light stimuli using the oscillatory dynamics. To study these points, we formulate an ordinary differential equation model of the amoeba-based neurocomputer, considering the organism as a network of oscillators that compete for a fixed amount of intracellular resource. The model, called the "Resource-Competing Oscillator Network (RCON) model," reproduces the organism's experimentally observed behavior well, as it generates a number of spatiotemporal oscillation modes while keeping the total sum of the resource constant. With a properly designed feedback rule, the RCON model faces the problem of optimizing the allocation of the resource to its nodes. In the problem-solving process, "greedy" nodes with the highest competitiveness are expected to take resource away from other nodes. However, the resource allocation pattern attained by the greedy nodes cannot always achieve a "socially optimal" state in terms of the public cost. We prepare four test problems, including a tricky one in which the greedy pattern becomes "socially unfavorable," and investigate how the RCON model copes with these problems. Comparing the problem-solving performances of the oscillation modes, we show that some modes often attain socially favorable patterns without being trapped in the greedy one.

13.
Several years ago, Yang and Anderson presented an N-process algorithm for mutual exclusion under read/write atomicity that has Θ(log N) time complexity, where "time" is measured by counting remote memory references. In this algorithm, instances of a two-process mutual exclusion algorithm are embedded within a binary arbitration tree. In the two-process algorithm that was used, all busy-waiting is done by "local spinning." Performance studies presented by Yang and Anderson showed that their N-process algorithm exhibits scalable performance under heavy contention. One drawback of using an arbitration tree, however, is that each process is required to perform remote memory operations even when there is no contention. To remedy this problem, Yang and Anderson presented a variant of their algorithm that includes a "fast-path" mechanism allowing the arbitration tree to be bypassed in the absence of contention. This algorithm has the desirable property that contention-free time complexity is O(1). Unfortunately, the fast-path mechanism that was used caused the time complexity under contention to rise to Θ(N) in the worst case. To this day, the problem of designing a read/write mutual exclusion algorithm with O(1) time complexity in the absence of contention and O(log N) time complexity under contention has remained open. In this paper, we close this problem by presenting a fast-path mechanism that achieves these time complexity bounds when used in conjunction with Yang and Anderson's arbitration-tree algorithm.

14.
Process mining includes the automated discovery of processes from event logs. Based on observed events (e.g., activities being executed or messages being exchanged) a process model is constructed. One of the essential problems in process mining is that one cannot assume to have seen all possible behavior. At best, one has seen a representative subset. Therefore, classical synthesis techniques are not suitable as they aim at finding a model that is able to exactly reproduce the log. Existing process mining techniques try to avoid such "overfitting" by generalizing the model to allow for more behavior. This generalization is often driven by the representation language and very crude assumptions about completeness. As a result, parts of the model are "overfitting" (allow only for what has actually been observed) while other parts may be "underfitting" (allow for much more behavior without strong support for it). None of the existing techniques enables the user to control the balance between "overfitting" and "underfitting". To address this, we propose a two-step approach. First, using a configurable approach, a transition system is constructed. Then, using the "theory of regions", the model is synthesized. The approach has been implemented in the context of ProM and overcomes many of the limitations of traditional approaches.
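The first step, deriving a transition system from a log under a configurable state abstraction, can be sketched as follows; the abstraction function is the configuration knob that shifts the balance between overfitting and underfitting, and the region-based synthesis of step two is not shown. All names are hypothetical:

```python
from collections import defaultdict

def build_transition_system(log, state_of):
    """Derive a transition system from an event log. The abstraction
    state_of maps an execution history to a state: coarser abstractions
    generalize more (risking underfitting), finer ones reproduce the
    log exactly (risking overfitting)."""
    transitions = defaultdict(set)
    for trace in log:
        history = []
        for activity in trace:
            source = state_of(history)
            history.append(activity)
            transitions[(source, activity)].add(state_of(history))
    return transitions

log = [["a", "b", "c"], ["a", "c", "b"], ["a", "b", "c", "d"]]

# Coarse abstraction: a state is the *set* of activities seen so far.
ts = build_transition_system(log, state_of=lambda h: frozenset(h))
for (src, act), dsts in sorted(ts.items(), key=str):
    print(sorted(src), "--" + act + "->", [sorted(d) for d in dsts])

# Fine abstraction: the full history; this reproduces the log exactly.
ts_exact = build_transition_system(log, state_of=tuple)
```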

15.
The importance of the suffix tree is confirmed by the continual new discoveries the research community has made about it over the years. Its structure is simple, yet it can solve many complex problems in linear time and is widely used in pattern matching on strings and trees. For the XML standard, many indexing techniques and query schemes based on relational and object databases have been proposed. We attempt to provide a query mechanism for path navigation based on suffix trees: a suffix tree is used to construct an XML path dictionary that accelerates the evaluation of path queries. We propose an online construction of the suffix tree of a trie, discuss the suffix tree construction algorithm for the XML path dictionary, describe the complete indexing scheme and query mechanism, and explore the various query operations it supports, including regular path expressions (RPE). The XML path dictionary is used to speed up the evaluation of path queries.

16.
17.
We study the problem of efficiently extracting the K entities in a temporal database that are most similar to a given search query. This problem is well studied in relational databases, where each entity is represented as a single record and there exist a variety of methods to define the similarity between a record and the search query. However, in temporal databases each entity is represented as a sequence of historical records, and how to properly define the similarity of an entity remains an open problem. The main challenge is that, when a user issues a search query for an entity, he or she is prone to mix up information about the same entity from different time points. As a result, the record-granularity methods used in relational databases no longer work. Instead, we regard each entity as a set of "virtual records," where the attribute values of a "virtual record" can come from different records of the same entity. In this paper, we propose a novel evaluation model in which the similarity between each "virtual record" and the query can be effectively quantified, and the maximum similarity over an entity's "virtual records" is taken as the similarity of the entity. Because the number of an entity's "virtual records" is exponentially large, calculating its similarity is challenging. We therefore propose a Dominating Tree Algorithm (DTA), based on a bounding-pruning-refining strategy, to efficiently extract the K entities with the greatest similarities. We conduct extensive experiments on both real and synthetic datasets. The encouraging results show that our model for defining the similarity between each entity and the search query is effective, and that the proposed DTA achieves at least two orders of magnitude improvement in performance compared with the naive approach.
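Under the simplifying assumption that a virtual record's similarity is a sum of per-attribute similarities, the best "virtual record" decomposes attribute by attribute, so the exponential set never has to be enumerated. The sketch below shows that evaluation idea only; it is not the paper's DTA, and all names are hypothetical:

```python
def entity_similarity(history, query, sim):
    """If virtual-record similarity is additive over attributes, the
    best virtual record picks, per queried attribute, the historical
    value closest to the query value."""
    return sum(
        max((sim(rec[attr], qval) for rec in history if attr in rec),
            default=0.0)
        for attr, qval in query.items()
    )

def top_k(entities, query, sim, k):
    """Naive evaluation over all entities; the paper's DTA adds a
    bounding-pruning-refining strategy on top of a model like this."""
    scored = [(entity_similarity(hist, query, sim), name)
              for name, hist in entities.items()]
    return sorted(scored, reverse=True)[:k]

# Each entity is a sequence of historical records.
entities = {
    "alice": [{"city": "Oslo", "title": "engineer"},
              {"city": "Bergen", "title": "manager"}],
    "bob":   [{"city": "Oslo", "title": "manager"}],
}
exact = lambda a, b: 1.0 if a == b else 0.0
print(top_k(entities, {"city": "Bergen", "title": "manager"}, exact, 2))
# [(2.0, 'alice'), (1.0, 'bob')]
```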

18.
Video scene retrieval with interactive genetic algorithm
This paper proposes a video scene retrieval algorithm based on emotion. First, abrupt/gradual shot boundaries are detected in a video clip representing a specific story. Then, five video features, "average color histogram," "average brightness," "average edge histogram," "average shot duration," and "gradual change rate," are extracted from each of the videos, and a mapping is established through an interactive genetic algorithm between these features and the emotional space the user has in mind. After the proposed algorithm selects the videos containing the corresponding emotion from the initial population of videos, their feature vectors are regarded as chromosomes and a genetic crossover is applied to them. Next, the new post-crossover chromosomes and the feature vectors of the database videos are compared using a similarity function, and the most similar videos are obtained as the solutions of the next generation. By iterating this process, a new population of videos matching what the user has in mind is retrieved. To show the validity of the proposed method, six example categories, "action," "excitement," "suspense," "quietness," "relaxation," and "happiness," are used as emotions in the experiments. The method achieves 70% retrieval effectiveness on average over 300 commercial videos.

19.
After decades of study, the usability of database systems has received increasing attention in recent years. In particular, explaining missing objects in a query result, known as answering "why-not" questions, has become a focus of concern. This paper studies the problem of answering why-not questions on KNN queries. In real life, many users would like to use KNN queries to investigate their surroundings, and they are often disappointed when the result does not include the objects they expected. In this paper, we use a query refinement approach to resolve the problem. Given the original KNN query and a set of missing objects as input, our algorithm offers the user a refined KNN query whose result includes the missing objects. The experimental results demonstrate the efficiency of our proposed optimizations and algorithms.
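The simplest instance of such refinement, sketched below, enlarges k just enough that every missing object enters the result; the paper's refinement model may also adjust other query parameters, so this is illustrative only, with hypothetical names throughout:

```python
import math

def refine_k(points, q, k, missing_ids):
    """Why-not refinement on a KNN query by adjusting k alone: return
    the smallest k' >= k whose k'-NN result contains all missing
    objects, together with that result."""
    dist = lambda pid: math.dist(points[pid], q)
    ranked = sorted(points, key=dist)            # ids by distance to q
    worst = max(ranked.index(m) + 1 for m in missing_ids)
    k_refined = max(k, worst)
    return k_refined, ranked[:k_refined]

points = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (5, 0)}
k, result = refine_k(points, q=(0, 0), k=2, missing_ids={"c"})
print(k, result)   # 3 ['a', 'b', 'c']
```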

20.
This paper deals with the problem of derivational redundancy in scientific explanation, i.e. the problem that there can be extremely many different explanatory derivations for a natural phenomenon while students and experts mostly come up with one and the same derivation for a phenomenon (modulo the order of applying laws). Given this agreement among humans, we need to have a story of how to select from the space of possible derivations of a phenomenon the derivation that humans come up with. In this paper we argue that the problem of derivational redundancy can be solved by a new notion of "shortest derivation", by which we mean the derivation that can be constructed by the fewest (and therefore largest) partial derivations of previously derived phenomena that function as "exemplars". We show how the exemplar-based framework known as "Data-Oriented Parsing" or "DOP" can be employed to select the shortest derivation in scientific explanation. DOP's shortest derivation of a phenomenon maximizes what is called the "derivational similarity" between a phenomenon and a corpus of exemplars. A preliminary investigation with exemplars from classical and fluid mechanics shows that the shortest derivation closely corresponds to the derivations that humans construct. Our approach also proposes a concrete solution to Kuhn's problem of how we know on which exemplar a phenomenon can be modeled. We argue that humans model a phenomenon on the exemplar that is derivationally most similar to the phenomenon, i.e. the exemplar from which the largest subtree(s) can be used to derive the phenomenon.
